Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,8 @@ deno.lock
reference/
codex/
ghostty/
observability/data/
tmp/verification/

# Worktree script SQLite artifacts
scripts/worktree-feature/*.db*
Expand Down
200 changes: 200 additions & 0 deletions docs/guides/agent-trace-otel-verification.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,200 @@
---
title: Agent Trace OTEL Log Verification
description: Reproducible local verification for canonical agent trace OTEL log export with Loki.
---

## Local stack

Start the local stack:

```bash
smithers observability --detach
```

Expected endpoints:

- Grafana: `http://localhost:3001`
- Prometheus: `http://localhost:9090`
- Tempo: `http://localhost:3200`
- Loki: `http://localhost:3100`
- OTEL collector HTTP: `http://localhost:4318`

Validate the stack:

```bash
docker compose -f observability/docker-compose.otel.yml ps
curl -sf http://localhost:3100/ready
curl -sf http://localhost:3001/api/datasources | jq 'map({name,type,url})'
docker logs observability-otel-collector-1 | tail -n 80
```

## Enable OTEL export

```bash
export SMITHERS_OTEL_ENABLED=1
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=smithers-dev
export SMITHERS_LOG_FORMAT=json
```

## Demo workflow

Use the built-in reproducible workflow at `workflows/agent-trace-otel-demo.tsx`.
It emits:

- one Pi-like high-fidelity attempt with text deltas, thinking deltas, tool lifecycle, usage, and a redaction fixture
- one SDK-style final-only attempt
- stable run annotations for Loki queries

Run the success case:

```bash
smithers run workflows/agent-trace-otel-demo.tsx \
--run-id agent-trace-otel-demo \
--annotations '{"custom.demo":true,"custom.ticket":"OBS-123"}'
```

Run the malformed JSON failure case:

```bash
smithers run workflows/agent-trace-otel-demo.tsx \
--run-id agent-trace-otel-demo-fail \
--input '{"failureMode":"malformed-json"}' \
--annotations '{"custom.demo":true,"custom.ticket":"OBS-ERR"}'
```

Optional durable-local verification:

```bash
jq 'select(.type == "AgentTraceEvent" or .type == "AgentTraceSummary")' \
.smithers/executions/agent-trace-otel-demo/logs/stream.ndjson

find .smithers/executions/agent-trace-otel-demo/logs/agent-trace -type f -maxdepth 1 -print
```

## Loki queries

Smithers OTEL attributes are exposed to Loki as sanitized structured metadata fields such as:

- `run_id`
- `workflow_path`
- `node_id`
- `node_attempt`
- `agent_family`
- `agent_capture_mode`
- `trace_completeness`
- `event_kind`

Use `{service_name="smithers-dev"}` as the stream selector, then filter on structured metadata in the LogQL pipeline. Use `| json` to inspect the structured log body.

All events for one run:

```logql
{service_name="smithers-dev"} | run_id="agent-trace-otel-demo"
```

One node attempt:

```logql
{service_name="smithers-dev"} | run_id="agent-trace-otel-demo" | node_id="pi-rich-trace" | node_attempt="1"
```

Thinking deltas only:

```logql
{service_name="smithers-dev"} | run_id="agent-trace-otel-demo" | event_kind="assistant.thinking.delta"
```

Tool execution only:

```logql
{service_name="smithers-dev"} | run_id="agent-trace-otel-demo" | event_kind=~"tool\.execution\..*"
```

Capture errors only:

```logql
{service_name="smithers-dev"} | event_kind="capture.error"
```

Inspect the structured JSON body for one run:

```logql
{service_name="smithers-dev"} | run_id="agent-trace-otel-demo" | json
```

Redaction proof query:

```logql
{service_name="smithers-dev"} | run_id="agent-trace-otel-demo" |= "REDACTED"
```

## API query examples

Equivalent direct Loki API checks:

```bash
curl -sG 'http://localhost:3100/loki/api/v1/query_range' \
--data-urlencode 'query={service_name="smithers-dev"} | run_id="agent-trace-otel-demo"' \
--data-urlencode 'limit=200' | jq '.data.result[] | {stream, values: (.values | length)}'

curl -sG 'http://localhost:3100/loki/api/v1/query_range' \
--data-urlencode 'query={service_name="smithers-dev"} | run_id="agent-trace-otel-demo" | event_kind="assistant.thinking.delta"' \
--data-urlencode 'limit=20' | jq '.data.result[]?.values[]?[1]'
```

## Tempo trace checks

Tempo search should show Smithers spans once a workflow has run:

```bash
curl -s http://localhost:3200/api/search | jq .
curl -s http://localhost:3200/api/search/tags | jq .
curl -s 'http://localhost:3200/api/search/tag/service.name/values' | jq .
curl -s 'http://localhost:3200/api/search/tag/runId/values' | jq .
```

Inspect one trace directly:

```bash
TRACE_ID=$(curl -s http://localhost:3200/api/search \
| jq -r '.traces[] | select(.rootTraceName=="engine:run-workflow") | .traceID' \
| head -n 1)

curl -s http://localhost:3200/api/traces/$TRACE_ID | jq .
```

Expected trace attributes include at least:

- `service.name = smithers-dev`
- `runId = <RUN_ID>`
- `workflowPath = <workflow path>`

## Verification checklist

- stack starts successfully in Docker
- Loki is present and queryable
- collector logs pipeline is active
- Pi traces show text deltas, thinking deltas, tool execution lifecycle, final message, usage, and run/node/attempt correlation
- second agent family is exported with truthful `final-only` completeness classification
- Gemini `stream-json` attempts preserve structured deltas truthfully
- Kimi `stream-json` attempts preserve structured tool lifecycle truthfully
- malformed or truncated structured streams emit `capture.error` and classify as `capture-failed`
- artifact write failures emit `capture.warning` and degrade to `partial-observed` without losing durable DB truth
- Tempo search shows Smithers spans and trace attributes including `runId`
- Prometheus is still scraping the collector successfully
- secrets are redacted from canonical events, OTEL log bodies, and persisted trace artifacts

## Automated coverage

The fixture-style and workflow-backed tests in `tests/agent-trace.test.tsx` now cover:

- Pi rich success and malformed JSON failure
- SDK truthful `final-only`
- Claude Code structured stream preservation
- Codex structured delta + usage preservation
- Gemini structured `stream-json` and coarse `cli-json` classification
- Kimi structured `stream-json` tool lifecycle preservation
- truncated structured stream failure classification
- artifact write failure classification
- OTEL body/attribute shaping for Loki query fields
3 changes: 2 additions & 1 deletion docs/guides/monitoring-logs.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,9 @@ That starts:

- OTLP collector on `http://localhost:4318`
- Prometheus on `http://localhost:9090`
- Grafana on `http://localhost:3000`
- Grafana on `http://localhost:3001`
- Tempo on `http://localhost:3200`
- Loki on `http://localhost:3100`

The collector exports metrics on `:8889`, Prometheus scrapes that endpoint, and Grafana comes pre-provisioned with a Smithers dashboard.

Expand Down
65 changes: 60 additions & 5 deletions observability/docker-compose.otel.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,8 @@ version: "3.9"

services:
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
image: otel/opentelemetry-collector-contrib:0.148.0
restart: unless-stopped
command: ["--config", "/etc/otelcol/config.yml"]
volumes:
- ./otel-collector-config.yml:/etc/otelcol/config.yml:ro
Expand All @@ -11,37 +12,91 @@ services:
- "4318:4318" # HTTP (Smithers sends here)
- "8889:8889" # Prometheus exporter scrape endpoint
depends_on:
- tempo
tempo:
condition: service_healthy
loki:
condition: service_healthy
healthcheck:
test: ["CMD", "/otelcol-contrib", "--version"]
interval: 10s
timeout: 5s
retries: 12
start_period: 10s

loki:
image: grafana/loki:3.3.2
restart: unless-stopped
command: ["-config.file=/etc/loki/local-config.yaml"]
volumes:
- ./loki/loki-config.yaml:/etc/loki/local-config.yaml:ro
- ./data/loki:/loki
ports:
- "3100:3100"
healthcheck:
test: ["CMD", "wget", "-q", "-O-", "http://127.0.0.1:3100/ready"]
interval: 10s
timeout: 5s
retries: 18
start_period: 10s

prometheus:
image: prom/prometheus:v3.2.0
restart: unless-stopped
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./data/prometheus:/prometheus
ports:
- "9090:9090"
depends_on:
- otel-collector
otel-collector:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "-O-", "http://127.0.0.1:9090/-/ready"]
interval: 10s
timeout: 5s
retries: 18
start_period: 10s

tempo:
image: grafana/tempo:2.7.0
restart: unless-stopped
command: ["-config.file=/etc/tempo/tempo.yml"]
volumes:
- ./tempo/tempo.yml:/etc/tempo/tempo.yml:ro
- ./data/tempo:/tmp/tempo
ports:
- "3200:3200" # Tempo query API
- "4317" # OTLP gRPC (internal)
healthcheck:
test: ["CMD", "wget", "-q", "-O-", "http://127.0.0.1:3200/api/search"]
interval: 10s
timeout: 5s
retries: 24
start_period: 20s

grafana:
image: grafana/grafana:11.5.0
restart: unless-stopped
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning:ro
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro
- ./data/grafana:/var/lib/grafana
ports:
- "3001:3000"
depends_on:
- prometheus
- tempo
prometheus:
condition: service_healthy
tempo:
condition: service_healthy
loki:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "-O-", "http://127.0.0.1:3000/api/health"]
interval: 10s
timeout: 5s
retries: 24
start_period: 20s
20 changes: 20 additions & 0 deletions observability/grafana/provisioning/datasources/datasources.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,13 +14,33 @@ datasources:
- name: traceID
datasourceUid: tempo

- name: Loki
type: loki
uid: loki
access: proxy
url: http://loki:3100
editable: false
jsonData:
maxLines: 1000

- name: Tempo
type: tempo
uid: tempo
access: proxy
url: http://tempo:3200
editable: false
jsonData:
tracesToLogs:
datasourceUid: loki
tags: ["service.name"]
mappedTags:
- key: service.name
value: service.name
mapTagNamesEnabled: false
spanStartTimeShift: "-1h"
spanEndTimeShift: "1h"
filterByTraceID: false
filterBySpanID: false
tracesToMetrics:
datasourceUid: prometheus
spanStartTimeShift: "-1h"
Expand Down
35 changes: 35 additions & 0 deletions observability/loki/loki-config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
auth_enabled: false

server:
http_listen_port: 3100

common:
instance_addr: 127.0.0.1
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory

schema_config:
configs:
- from: 2020-10-24
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h

limits_config:
allow_structured_metadata: true

ruler:
alertmanager_url: http://localhost:9093

analytics:
reporting_enabled: false
Loading
Loading