
Integration: Add OpenTelemetry distributed tracing for end-to-end task pipeline observability #797

@kelos-bot

🤖 Kelos Strategist Agent @gjkim42

Area: Integration Opportunities

Summary

Add OpenTelemetry (OTEL) distributed tracing to the Kelos controller and spawner, enabling operators to visualize and debug task pipeline execution as end-to-end traces in standard backends (Jaeger, Grafana Tempo, Datadog). While Kelos already has solid Prometheus metrics (internal/controller/metrics.go — task counts, duration histograms, cost/token counters), it has zero distributed tracing support. Metrics answer "how many?" and "how long on average?" but cannot answer "why did this specific pipeline fail at step 3?" or "which spawner discovery cycle created this task?" Tracing fills this gap.

Problem

1. Multi-step pipelines are opaque

Task pipelines using dependsOn chains (e.g., examples/07-task-pipeline/) create implicit causal relationships:

TaskSpawner discovery → Task "scaffold" → Task "write-tests" → Task "open-pr"

Today, debugging a failure in this chain requires:

  • Manually correlating timestamps across kelos get tasks
  • Reading individual pod logs (kelos logs)
  • Guessing which spawner discovery cycle triggered the pipeline

Even then, there is no way to see the full execution timeline in one view.

2. Spawner-to-task causality is lost

When a TaskSpawner discovers a work item and creates a Task, there is no trace linking the discovery event to the resulting Task. The spawner (cmd/kelos-spawner/main.go) creates Task resources via the Kubernetes API, but no context is propagated. If 9 spawners are running simultaneously (as in the self-development setup), correlating which spawner created which task requires label-based filtering, not causal tracing.

3. Controller reconciliation loops are invisible

The TaskReconciler performs multiple operations per reconciliation:

  • Dependency resolution (checkDependencies)
  • Branch lock acquisition (BranchLocker)
  • Prompt template resolution (resolvePromptTemplate)
  • Job creation (JobBuilder.BuildJob)
  • Output capture from pod logs

When a reconciliation takes unexpectedly long or fails, there is no breakdown of where time was spent. Prometheus histograms (kelos_task_duration_seconds) show end-to-end duration but not internal step timing.

4. No cross-task trace context

Task pipelines have no shared trace ID. Each task is an independent unit. When write-tests depends on scaffold, there is no way to:

  • View the full pipeline as a single trace
  • See the wait time between dependency completion and downstream task start
  • Correlate failures across dependent tasks

Proposed Design

Trace Hierarchy

Trace: "spawner/{spawnerName}/discovery/{workItemID}"
├── Span: "spawner.discover" (poll cycle)
│   ├── Span: "spawner.create_task" (per work item)
│   │   └── Link: → task/{taskName}
│   └── Span: "spawner.create_task"
│       └── Link: → task/{taskName}
│
├── Trace: "task/{taskName}" (linked from spawner)
│   ├── Span: "task.reconcile"
│   │   ├── Span: "task.check_dependencies"
│   │   ├── Span: "task.acquire_branch_lock"
│   │   ├── Span: "task.resolve_prompt"
│   │   ├── Span: "task.build_job"
│   │   └── Span: "task.create_job"
│   ├── Span: "task.running" (long span covering agent execution)
│   ├── Span: "task.capture_outputs"
│   └── Span: "task.complete"
│
└── Trace: "task/{dependentTaskName}" (linked via dependsOn)
    ├── Span: "task.waiting" (blocked on dependency)
    └── Span: "task.reconcile" ...

Trace Context Propagation

  1. Spawner → Task: Store trace context in Task annotations (kelos.dev/traceparent, kelos.dev/tracestate). The spawner creates a trace for each discovery cycle and child spans for each Task creation.

  2. Task → Dependent Task: When resolvePromptTemplate checks dependencies, create a span link from the dependent task's trace to each dependency's trace. This preserves the causal relationship without forcing a single trace (pipelines can be long-running).

  3. Controller → Job/Pod: Inject TRACEPARENT env var into agent pods via JobBuilder.BuildJob(). Agents that support OTEL can optionally continue the trace, enabling end-to-end visibility into what the agent did.
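Step 1 can be illustrated with the W3C traceparent wire format, `00-{32-hex trace-id}-{16-hex span-id}-{2-hex flags}`. A dependency-free Go sketch with hypothetical helper names; a real implementation would use OTEL's propagation.TraceContext propagator with a map carrier rather than hand-rolling the format:

```go
package main

import (
	"fmt"
	"regexp"
)

// AnnotationTraceParent is the Task annotation key proposed above.
const AnnotationTraceParent = "kelos.dev/traceparent"

// W3C traceparent: version "00", lowercase-hex trace-id, span-id, flags.
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// InjectTraceParent writes the spawner's current span context into the
// annotations map that will be set on the created Task.
func InjectTraceParent(annotations map[string]string, traceID, spanID string, sampled bool) {
	flags := "00"
	if sampled {
		flags = "01"
	}
	annotations[AnnotationTraceParent] = fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// ExtractTraceParent parses the annotation back out in the controller.
// ok is false when the annotation is absent or malformed, letting the
// reconciler fall back to starting a fresh root span.
func ExtractTraceParent(annotations map[string]string) (traceID, spanID string, sampled, ok bool) {
	m := traceparentRe.FindStringSubmatch(annotations[AnnotationTraceParent])
	if m == nil {
		return "", "", false, false
	}
	return m[1], m[2], m[3] == "01", true
}
```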

Span Attributes

Attribute                   Source                         Example
kelos.task.name             Task metadata                  scaffold
kelos.task.type             spec.type                      claude-code
kelos.task.model            spec.model                     opus
kelos.task.phase            status.phase                   Succeeded
kelos.task.spawner          Label kelos.dev/taskspawner    kelos-workers
kelos.task.cost_usd         status.results["total_cost"]   0.42
kelos.task.tokens.input     status.results["input_tokens"] 15000
kelos.task.tokens.output    status.results["output_tokens"] 3200
kelos.task.branch           spec.branch                    feature/auth
kelos.spawner.name          TaskSpawner metadata           kelos-workers
kelos.spawner.source        Source type                    githubIssues
kelos.spawner.work_item.id  Work item identifier           issue-42

Implementation Touchpoints

  • go.mod: add go.opentelemetry.io/otel, go.opentelemetry.io/otel/sdk, and go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
  • cmd/kelos-controller/main.go: initialize an OTEL TracerProvider with the OTLP exporter; respect the OTEL_EXPORTER_OTLP_ENDPOINT env var
  • cmd/kelos-spawner/main.go: initialize a TracerProvider; create spans for discovery cycles and task creation; store traceparent in Task annotations
  • internal/controller/task_controller.go: read traceparent from annotations; create spans for reconciliation steps; inject TRACEPARENT into the job env
  • internal/controller/taskspawner_controller.go: create spans for spawner reconciliation
  • internal/controller/job_builder.go: add the TRACEPARENT env var to the container spec when trace context is present
  • Helm chart values.yaml: add tracing.enabled, tracing.endpoint, and tracing.samplingRate configuration
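The job_builder.go change amounts to a conditional env append. A dependency-free sketch, with a local EnvVar struct standing in for corev1.EnvVar and a hypothetical helper name:

```go
package main

// EnvVar is a stand-in for corev1.EnvVar to keep the sketch dependency-free.
type EnvVar struct {
	Name  string
	Value string
}

// withTraceEnv sketches the JobBuilder change: TRACEPARENT is appended only
// when a trace context was read from the Task's annotations, so Jobs built
// for untraced Tasks are identical to today's output.
func withTraceEnv(env []EnvVar, traceparent string) []EnvVar {
	if traceparent == "" {
		return env
	}
	return append(env, EnvVar{Name: "TRACEPARENT", Value: traceparent})
}
```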

Configuration

# Helm values.yaml
tracing:
  enabled: false  # opt-in, no overhead when disabled
  endpoint: ""    # OTLP endpoint, e.g., "otel-collector:4317"
  samplingRate: 1.0  # 0.0-1.0, trace sampling ratio

When tracing.enabled is false (default), no OTEL SDK is initialized and no spans are created — zero performance impact for users who don't need tracing.

Why This Matters

Production debugging

The self-development deployment runs 9 spawners creating tasks that interact through GitHub (one spawner's PR triggers another spawner's review). Without tracing, debugging cross-spawner interactions requires manual timestamp correlation across potentially dozens of concurrent tasks.

Pipeline failure diagnosis

Example 07 shows a 3-step pipeline. With tracing, a failed pipeline shows as a single trace with the failing span highlighted, giving immediate visibility into whether the failure occurred in dependency resolution, branch locking, agent execution, or output capture.

Cost attribution per workflow

Existing cost metrics (kelos_task_cost_usd_total) are aggregated. Traces would show cost per individual pipeline execution, enabling per-workflow cost analysis.

Complements existing metrics

This proposal works alongside, not replacing, the existing Prometheus metrics in internal/controller/metrics.go. Metrics provide aggregate views (dashboards, alerting); traces provide per-execution debugging. Together they provide full observability:

Question                           Metrics                                      Tracing
How many tasks failed this hour?   kelos_task_completed_total{phase="Failed"}
Why did task X fail?                                                            ✅ Trace shows exact step
What's the p95 task duration?      kelos_task_duration_seconds
Where did task X spend its time?                                                ✅ Span breakdown
Which spawner created task X?      ⚠️ Label query                               ✅ Causal link

Backward Compatibility

  • Fully opt-in via Helm values (tracing.enabled: false default)
  • No new CRDs or API changes
  • Zero overhead when disabled (no OTEL SDK initialization)
  • Trace annotations on Tasks are informational and don't affect behavior

Alternatives Considered

  1. Structured logging with correlation IDs: Simpler but less powerful — no timing visualization, no causal links, requires custom log parsing. Tracing provides this and more via standard tooling.

  2. Kubernetes Events only: Already used for lifecycle events, but events are ephemeral (default 1h TTL), lack timing precision, and don't support parent-child relationships. Good for alerting, insufficient for debugging.

  3. Custom trace format: Would require custom UI/tooling. OTEL is the industry standard with broad backend support (Jaeger, Tempo, Datadog, Honeycomb, etc.).

/kind feature
