
Integration: Add OpenTelemetry distributed tracing for end-to-end task pipeline observability #797

@kelos-bot

🤖 Kelos Strategist Agent @gjkim42

Area: Integration Opportunities

Summary

Add OpenTelemetry (OTEL) distributed tracing to the Kelos controller and spawner, enabling operators to visualize and debug task pipeline execution as end-to-end traces in standard backends (Jaeger, Grafana Tempo, Datadog). While Kelos already has solid Prometheus metrics (internal/controller/metrics.go — task counts, duration histograms, cost/token counters), it has zero distributed tracing support. Metrics answer "how many?" and "how long on average?" but cannot answer "why did this specific pipeline fail at step 3?" or "which spawner discovery cycle created this task?" Tracing fills this gap.

Problem

1. Multi-step pipelines are opaque

Task pipelines using dependsOn chains (e.g., examples/07-task-pipeline/) create implicit causal relationships:

TaskSpawner discovery → Task "scaffold" → Task "write-tests" → Task "open-pr"

Today, debugging a failure in this chain requires:

  • Manually correlating timestamps across kelos get tasks
  • Reading individual pod logs (kelos logs)
  • Guessing which spawner discovery cycle triggered the pipeline

Even then, there is no way to see the full execution timeline in one view.

2. Spawner-to-task causality is lost

When a TaskSpawner discovers a work item and creates a Task, there is no trace linking the discovery event to the resulting Task. The spawner (cmd/kelos-spawner/main.go) creates Task resources via the Kubernetes API, but no context is propagated. If 9 spawners are running simultaneously (as in the self-development setup), correlating which spawner created which task requires label-based filtering, not causal tracing.

3. Controller reconciliation loops are invisible

The TaskReconciler performs multiple operations per reconciliation:

  • Dependency resolution (checkDependencies)
  • Branch lock acquisition (BranchLocker)
  • Prompt template resolution (resolvePromptTemplate)
  • Job creation (JobBuilder.BuildJob)
  • Output capture from pod logs

When a reconciliation takes unexpectedly long or fails, there is no breakdown of where time was spent. Prometheus histograms (kelos_task_duration_seconds) show end-to-end duration but not internal step timing.

4. No cross-task trace context

Task pipelines have no shared trace ID. Each task is an independent unit. When write-tests depends on scaffold, there is no way to:

  • View the full pipeline as a single trace
  • See the wait time between dependency completion and downstream task start
  • Correlate failures across dependent tasks

Proposed Design

Trace Hierarchy

Trace: "spawner/{spawnerName}/discovery/{workItemID}"
├── Span: "spawner.discover" (poll cycle)
│   ├── Span: "spawner.create_task" (per work item)
│   │   └── Link: → task/{taskName}
│   └── Span: "spawner.create_task"
│       └── Link: → task/{taskName}
│
├── Trace: "task/{taskName}" (linked from spawner)
│   ├── Span: "task.reconcile"
│   │   ├── Span: "task.check_dependencies"
│   │   ├── Span: "task.acquire_branch_lock"
│   │   ├── Span: "task.resolve_prompt"
│   │   ├── Span: "task.build_job"
│   │   └── Span: "task.create_job"
│   ├── Span: "task.running" (long span covering agent execution)
│   ├── Span: "task.capture_outputs"
│   └── Span: "task.complete"
│
└── Trace: "task/{dependentTaskName}" (linked via dependsOn)
    ├── Span: "task.waiting" (blocked on dependency)
    └── Span: "task.reconcile" ...

Trace Context Propagation

  1. Spawner → Task: Store trace context in Task annotations (kelos.dev/traceparent, kelos.dev/tracestate). The spawner creates a trace for each discovery cycle and child spans for each Task creation.

  2. Task → Dependent Task: When resolvePromptTemplate checks dependencies, create a span link from the dependent task's trace to each dependency's trace. This preserves the causal relationship without forcing a single trace (pipelines can be long-running).

  3. Controller → Job/Pod: Inject TRACEPARENT env var into agent pods via JobBuilder.BuildJob(). Agents that support OTEL can optionally continue the trace, enabling end-to-end visibility into what the agent did.
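Step 1 can be illustrated with the W3C traceparent wire format, `00-{32-hex trace-id}-{16-hex span-id}-{2-hex flags}`. A dependency-free Go sketch with hypothetical helper names; a real implementation would use OTEL's propagation.TraceContext propagator with a map carrier rather than hand-rolling the format:

```go
package main

import (
	"fmt"
	"regexp"
)

// AnnotationTraceParent is the Task annotation key proposed above.
const AnnotationTraceParent = "kelos.dev/traceparent"

// W3C traceparent: version "00", lowercase-hex trace-id, span-id, flags.
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// InjectTraceParent writes the spawner's current span context into the
// annotations map that will be set on the created Task.
func InjectTraceParent(annotations map[string]string, traceID, spanID string, sampled bool) {
	flags := "00"
	if sampled {
		flags = "01"
	}
	annotations[AnnotationTraceParent] = fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// ExtractTraceParent parses the annotation back out in the controller.
// ok is false when the annotation is absent or malformed, letting the
// reconciler fall back to starting a fresh root span.
func ExtractTraceParent(annotations map[string]string) (traceID, spanID string, sampled, ok bool) {
	m := traceparentRe.FindStringSubmatch(annotations[AnnotationTraceParent])
	if m == nil {
		return "", "", false, false
	}
	return m[1], m[2], m[3] == "01", true
}
```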

Span Attributes

Attribute                   Source                         Example
kelos.task.name             Task metadata                  scaffold
kelos.task.type             spec.type                      claude-code
kelos.task.model            spec.model                     opus
kelos.task.phase            status.phase                   Succeeded
kelos.task.spawner          Label kelos.dev/taskspawner    kelos-workers
kelos.task.cost_usd         status.results["total_cost"]   0.42
kelos.task.tokens.input     status.results["input_tokens"] 15000
kelos.task.tokens.output    status.results["output_tokens"] 3200
kelos.task.branch           spec.branch                    feature/auth
kelos.spawner.name          TaskSpawner metadata           kelos-workers
kelos.spawner.source        Source type                    githubIssues
kelos.spawner.work_item.id  Work item identifier           issue-42

Implementation Touchpoints

  • go.mod: add go.opentelemetry.io/otel, go.opentelemetry.io/otel/sdk, and go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
  • cmd/kelos-controller/main.go: initialize an OTEL TracerProvider with the OTLP exporter; respect the OTEL_EXPORTER_OTLP_ENDPOINT env var
  • cmd/kelos-spawner/main.go: initialize a TracerProvider; create spans for discovery cycles and task creation; store traceparent in Task annotations
  • internal/controller/task_controller.go: read traceparent from annotations; create spans for reconciliation steps; inject TRACEPARENT into the job env
  • internal/controller/taskspawner_controller.go: create spans for spawner reconciliation
  • internal/controller/job_builder.go: add the TRACEPARENT env var to the container spec when trace context is present
  • Helm chart values.yaml: add tracing.enabled, tracing.endpoint, and tracing.samplingRate configuration
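The job_builder.go change amounts to a conditional env append. A dependency-free sketch, with a local EnvVar struct standing in for corev1.EnvVar and a hypothetical helper name:

```go
package main

// EnvVar is a stand-in for corev1.EnvVar to keep the sketch dependency-free.
type EnvVar struct {
	Name  string
	Value string
}

// withTraceEnv sketches the JobBuilder change: TRACEPARENT is appended only
// when a trace context was read from the Task's annotations, so Jobs built
// for untraced Tasks are identical to today's output.
func withTraceEnv(env []EnvVar, traceparent string) []EnvVar {
	if traceparent == "" {
		return env
	}
	return append(env, EnvVar{Name: "TRACEPARENT", Value: traceparent})
}
```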

Configuration

# Helm values.yaml
tracing:
  enabled: false  # opt-in, no overhead when disabled
  endpoint: ""    # OTLP endpoint, e.g., "otel-collector:4317"
  samplingRate: 1.0  # 0.0-1.0, trace sampling ratio

When tracing.enabled is false (default), no OTEL SDK is initialized and no spans are created — zero performance impact for users who don't need tracing.

Why This Matters

Production debugging

The self-development deployment runs 9 spawners creating tasks that interact through GitHub (one spawner's PR triggers another spawner's review). Without tracing, debugging cross-spawner interactions requires manual timestamp correlation across potentially dozens of concurrent tasks.

Pipeline failure diagnosis

Example 07 shows a 3-step pipeline. With tracing, a failed pipeline shows as a single trace with the failing span highlighted, giving immediate visibility into whether the failure occurred in dependency resolution, branch locking, agent execution, or output capture.

Cost attribution per workflow

Existing cost metrics (kelos_task_cost_usd_total) are aggregated. Traces would show cost per individual pipeline execution, enabling per-workflow cost analysis.

Complements existing metrics

This proposal works alongside, not replacing, the existing Prometheus metrics in internal/controller/metrics.go. Metrics provide aggregate views (dashboards, alerting); traces provide per-execution debugging. Together they provide full observability:

Question                           Metrics                                      Tracing
How many tasks failed this hour?   kelos_task_completed_total{phase="Failed"}
Why did task X fail?                                                            ✅ Trace shows exact step
What's the p95 task duration?      kelos_task_duration_seconds
Where did task X spend its time?                                                ✅ Span breakdown
Which spawner created task X?      ⚠️ Label query                               ✅ Causal link

Backward Compatibility

  • Fully opt-in via Helm values (tracing.enabled: false default)
  • No new CRDs or API changes
  • Zero overhead when disabled (no OTEL SDK initialization)
  • Trace annotations on Tasks are informational and don't affect behavior

Alternatives Considered

  1. Structured logging with correlation IDs: Simpler but less powerful — no timing visualization, no causal links, requires custom log parsing. Tracing provides this and more via standard tooling.

  2. Kubernetes Events only: Already used for lifecycle events, but events are ephemeral (default 1h TTL), lack timing precision, and don't support parent-child relationships. Good for alerting, insufficient for debugging.

  3. Custom trace format: Would require custom UI/tooling. OTEL is the industry standard with broad backend support (Jaeger, Tempo, Datadog, Honeycomb, etc.).

/kind feature
