feat(observability): add canonical agent trace OTEL logs with Loki verification by SamuelLHuber · Pull Request #119 · codeplaneapp/smithers

SamuelLHuber · 2026-03-28T09:58:13Z

Problem

Smithers had runtime events, traces, and metrics, but it did not have a stable, queryable observability surface for the agent-visible execution transcript itself.

That left a gap for debugging and review:

no canonical model for agent trace events across families
no truthful completeness classification for rich vs coarse/final-only captures
no OTEL log export path for those canonical events
no stable Loki query surface for run/node/attempt correlation
weaker proof around failure handling, redaction, and durable local truth

What this PR adds

Canonical agent trace model

adds a canonical agent trace event model in src/agent-trace.ts
normalizes structured and coarse agent outputs into stable event kinds such as:
- assistant.text.delta
- assistant.thinking.delta
- tool.execution.start
- tool.execution.update
- tool.execution.end
- assistant.message.final
- usage
- capture.error
- capture.warning
preserves run / node / attempt / iteration correlation
preserves truthful capture metadata such as:
- agent.family
- agent.capture_mode
- trace.completeness

OTEL log export + durable local truth

exports canonical trace events as OTEL log records
shapes log attributes so they are queryable in Loki via stable fields like:
- run_id
- workflow_path
- node_id
- node_attempt
- agent_family
- agent_capture_mode
- trace_completeness
- event_kind
keeps Smithers local truth intact instead of relying on OTEL export as the only store
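
For reviewers who want to script queries against those fields, here is a minimal sketch of a LogQL query builder. `buildTraceQuery` and `escapeLogql` are hypothetical helpers, not part of this PR. The `service_name` stream selector and its `smithers-dev` default are assumptions about how the local stack labels log streams, and the sketch assumes the fields above can be filtered with LogQL label filter expressions.

```typescript
// Hypothetical helper: builds a LogQL query from the stable fields listed above.
// Assumption: streams are selectable by service_name, and the listed fields are
// available to label filter expressions (e.g. as structured metadata).
type TraceQuery = {
  runId: string;
  nodeId?: string;
  nodeAttempt?: number;
  eventKind?: string;
};

// Escape backslashes and double quotes for a LogQL string literal.
function escapeLogql(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

function buildTraceQuery(q: TraceQuery, serviceName = "smithers-dev"): string {
  const filters: string[] = [`run_id="${escapeLogql(q.runId)}"`];
  if (q.nodeId !== undefined) filters.push(`node_id="${escapeLogql(q.nodeId)}"`);
  if (q.nodeAttempt !== undefined) filters.push(`node_attempt="${q.nodeAttempt}"`);
  if (q.eventKind !== undefined) filters.push(`event_kind="${escapeLogql(q.eventKind)}"`);
  return `{service_name="${escapeLogql(serviceName)}"} | ${filters.join(" | ")}`;
}

// Example: all capture errors for one run.
const captureErrorQuery = buildTraceQuery({ runId: "r1", eventKind: "capture.error" });
```

Optional filters are added only when provided, so the same helper covers run-wide, node/attempt, and event-kind queries.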

Local observability stack support

adds/finishes Loki support in the local Docker observability stack
wires collector logs pipeline to Loki and traces to Tempo
provisions Grafana datasources for Loki / Tempo / Prometheus
documents reproducible local verification in:
- docs/guides/agent-trace-otel-verification.mdx

Coverage across agent families

Pi rich structured stream: high-fidelity deltas, tool lifecycle, final, usage, redaction
SDK-style final-only family: truthful final-only classification
structured fixture coverage for:
- Claude Code
- Codex
- Gemini
- Kimi

Failure / redaction coverage

malformed upstream JSON → capture.error / capture-failed
truncated structured stream → capture.error / capture-failed
artifact write failure → capture.warning / partial-observed
redaction is verified across canonical payloads and OTEL log bodies

Verification hardening

This branch also hardens the local verification path so the stack is easier to reset and verify repeatedly:

pins the OTEL collector image
adds service healthchecks for Loki / Tempo / Prometheus / Grafana / collector
adds deterministic local reset and health-wait scripts:
- scripts/obs-reset.sh
- scripts/obs-wait-healthy.sh
adds an automated evidence capture script:
- scripts/verify-observability.sh

Why this approach

The goal here is not just “send some logs to Loki.”

The goal is a truthful, queryable, reviewer-usable agent trace pipeline that:

captures what Smithers actually observed
preserves structured fidelity where available
classifies weaker capture modes honestly
survives export failures without losing local truth
is queryable by stable run/node/attempt correlation fields

Validation

Automated

bun test tests/agent-trace.test.tsx
bun test tests/observability.test.ts

Added / verified coverage for:

Pi rich success
Pi malformed JSON failure
SDK truthful final-only
Claude structured stream preservation
Codex structured delta/final/usage preservation
Gemini structured + coarse final-only classification
Kimi structured tool lifecycle preservation
truncated stream failure classification
artifact write failure classification
OTEL record shaping for Loki query fields

Live local verification

Verified locally with the built-in demo workflow:

workflows/agent-trace-otel-demo.tsx

Known-good hardened flow:

scripts/verify-observability.sh

This produces a timestamped evidence bundle under:

tmp/verification/<timestamp>

Observed live validation includes:

Grafana health + datasource provisioning
Prometheus scrape health
Loki run-wide query results
Loki node/attempt correlation
Loki thinking delta query
Loki tool lifecycle queries
Loki failure capture.error query
Loki SDK final-only query
Tempo Smithers trace search
Tempo trace detail showing service.name=smithers-dev and matching runId

Current verification status

Strongest live evidence

PiAgent: end-to-end live verified
OTEL log export to Loki: live verified
Tempo trace correlation: live verified
Grafana availability + datasource provisioning: live verified
durable local artifacts: verified
redaction: verified

Truthful but weaker than Pi

SDK-family final-only path: implemented and live/query verified as final-only
Claude / Codex / Gemini / Kimi: canonical shaping and fidelity covered by automated normalization tests

Scope

This PR intentionally bundles the implementation and the local verification path hardening, because the feature is only useful if maintainers can actually reproduce the observability evidence locally.

Included in scope:

canonical trace event model
OTEL log export
Loki stack integration
docs + demo workflow
automated tests
local observability verification hardening scripts / healthchecks

Not included:

live, reviewer-grade verification for every non-Pi family on par with Pi
production-grade remote backend deployment work
UI-heavy Grafana Explore automation beyond local availability / datasource verification

Notes

the local Docker stack is now much more reproducible via reset + health-gated startup
the compose file still retains the top-level version field intentionally in this branch
the new verification script is meant to be the canonical repro path for reviewers

🤖 Generated with pi / smithers

SamuelLHuber · 2026-03-28T10:00:35Z

Here's the spec. We removed docs/concepts/agent-trace-otel-logs-spec.mdx from the PR because we don't want it to live in the public docs set, but we're keeping it here for reviewer context.

Agent Trace OTEL Logs Specification

---
title: Agent Trace OTEL Logs Specification
description: Full-fidelity specification for capturing agent-visible execution traces and exporting them as OpenTelemetry logs to Loki or any OTLP-compatible backend.
---

This document specifies how Smithers must capture, normalize, persist, export, and verify agent execution traces as OpenTelemetry logs.

This is a design specification, not an implementation sketch. Every requirement in this document is normative unless explicitly marked as non-normative.

## Status

- Intended scope: new observability surface for agent trace logs
- Intended audience: maintainers implementing runtime, agent, observability, and verification changes
- Intended outcome: a system where every supported agent run produces a complete, queryable, correlated trace of what Smithers could observe

## Problem Statement

Smithers currently captures:

- durable workflow lifecycle events
- structured application logs
- traces and metrics for runtime behavior
- partial agent output in some cases

Smithers does not currently guarantee a full-fidelity record of agent-visible execution behavior across all agent integrations.

In particular:

- `PiAgent` exposes a rich event stream, but Smithers currently collapses it to final text plus usage
- several CLI agents emit machine-readable output that Smithers does not preserve as first-class trace events
- SDK-based agents return final results and rely on Smithers-side tool logging, but do not provide a canonical agent trace model
- there is no OTEL logs pipeline in the local collector configuration

The result is that operators cannot reliably answer questions such as:

- What did the agent stream before it failed?
- Which tools did the agent invoke, in what order, with which visible arguments and results?
- Did the agent emit visible thinking content, compaction events, retries, or queued follow-up behavior?
- Can we reconstruct exactly what Smithers observed for a given run, node, and attempt?
- Can we query this in Grafana Loki or another OTLP log backend with stable run-level correlation?

This specification addresses that gap.

## Goals

The system defined here MUST:

- capture the fullest agent-visible trace Smithers can obtain for each supported agent
- export that trace as OTEL logs to Loki or any OTLP-compatible log backend
- preserve run correlation through stable attributes such as `run.id`, `workflow.path`, `node.id`, `node.iteration`, and `node.attempt`
- preserve raw trace fidelity without forcing operators to infer behavior from summary logs
- remain explicit about what was directly observed versus what was derived by Smithers
- provide deterministic verification criteria for correctness and task completion

## Non-Goals

The system defined here MUST NOT claim to provide:

- provider-internal hidden chain-of-thought when the upstream agent or SDK does not expose it
- exact reconstruction of invisible model-side planning not surfaced through events, messages, or tool calls
- a replacement for the durable Smithers event log or database
- a guarantee that every backend will index arbitrary high-cardinality fields efficiently

## Core Principle

Smithers MUST export what it observed, not what it inferred.

Every exported trace record MUST be classifiable as one of:

- raw upstream agent event
- raw Smithers runtime event
- Smithers-derived normalization of one raw event
- Smithers-generated transport or export diagnostic

If a record is derived, the derivation MUST be explicit.

## Definitions

### Agent Trace

An agent trace is the ordered set of agent-visible execution records associated with one Smithers node attempt.

Agent trace records include, where available:

- streamed assistant text
- streamed visible thinking content
- message lifecycle events
- tool call lifecycle events
- tool result lifecycle events
- compaction and retry events
- session metadata
- final assistant message
- final tool results
- agent stderr diagnostics when those are observable to Smithers

### Full Trace

For a given agent integration, a full trace means all upstream-visible records Smithers can access without patching the upstream model provider.

Full trace does not mean hidden reasoning. It means all observable records available through:

- subprocess stdout or stderr
- structured CLI output modes
- RPC event streams
- SDK callback/event surfaces
- persisted session artifacts intentionally provided by the agent system

### Canonical Trace Event

A canonical trace event is the Smithers-normalized representation of one raw observed record.

Canonical trace events are the unit exported to OTEL logs and optionally persisted durably by Smithers.

### Attempt

An attempt is one execution of one node at one iteration with one attempt number. A canonical agent trace is scoped to exactly one attempt.

## Invariants

The implementation MUST satisfy all of the following invariants.

### Identity Invariants

Every canonical trace event MUST include:

- `runId`
- `nodeId` when the event is attempt-scoped
- `iteration` when the event is attempt-scoped
- `attempt` when the event is attempt-scoped
- `timestampMs`
- `source.agentFamily`
- `source.captureMode`
- `event.kind`
- `event.sequence`

### Ordering Invariants

Canonical trace events for a single attempt MUST be totally ordered by `event.sequence`.

If upstream events arrive out of wall-clock order, Smithers MUST preserve receive order and MUST NOT reorder them after capture.

`event.sequence` MUST be strictly increasing within one attempt, with no duplicate values.
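
Non-normative sketch: one way to satisfy the ordering invariants is to assign sequence numbers at receive time and never sort afterwards. `AttemptSequencer` and `isStrictlyIncreasing` are hypothetical names used only for illustration.

```typescript
type Sequenced<T> = { sequence: number; receivedAtMs: number; record: T };

// One sequencer per attempt; sequence numbers reflect receive order only.
class AttemptSequencer<T> {
  private next = 0;
  private readonly events: Sequenced<T>[] = [];

  // Assign the next sequence number; upstream timestamps are never used to reorder.
  push(record: T, receivedAtMs: number): Sequenced<T> {
    const entry: Sequenced<T> = { sequence: this.next++, receivedAtMs, record };
    this.events.push(entry);
    return entry;
  }

  snapshot(): readonly Sequenced<T>[] {
    return this.events;
  }
}

// Verification helper: true when every value is greater than the previous one.
function isStrictlyIncreasing(seqs: readonly number[]): boolean {
  let prev = Number.NEGATIVE_INFINITY;
  for (const s of seqs) {
    if (s <= prev) return false;
    prev = s;
  }
  return true;
}
```

An event that arrives first keeps the lower sequence number even if its wall-clock timestamp is later than that of the next event.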

### Fidelity Invariants

Smithers MUST preserve raw upstream payloads for canonical trace events unless a redaction rule requires modification.

If redaction occurs:

- the record MUST indicate redaction occurred
- the redaction reason MUST be attached
- the original raw value MUST NOT be exported

### Correlation Invariants

Every OTEL log record derived from a canonical trace event MUST be queryable by:

- run
- workflow path
- node
- iteration
- attempt
- agent family
- event kind

### Completeness Invariants

If Smithers receives a parseable upstream event, Smithers MUST either:

- convert it into a canonical trace event and export it
- or emit a diagnostic record explaining why it was dropped

Silent drops are not allowed.
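
Non-normative sketch of the "no silent drops" rule: every input line yields exactly one outcome, either an event or a diagnostic. `captureLine` and `CaptureOutcome` are hypothetical, and the top-level `type` field is an assumption about the upstream event shape.

```typescript
// Hypothetical capture-layer result: one outcome per input line, never nothing.
type CaptureOutcome =
  | { kind: "event"; rawType: string; raw: unknown }
  | { kind: "capture.warning"; reason: string; raw: unknown }
  | { kind: "capture.error"; reason: string; raw: string };

function captureLine(line: string, mappedTypes: ReadonlySet<string>): CaptureOutcome {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    // Unparseable input is reported, not skipped.
    return { kind: "capture.error", reason: "malformed-json", raw: line };
  }
  // Assumption: upstream events carry their type in a top-level "type" field.
  const rawType =
    typeof parsed === "object" && parsed !== null && "type" in parsed
      ? String((parsed as { type: unknown }).type)
      : "unknown";
  if (!mappedTypes.has(rawType)) {
    // Parseable but unmapped: emit a diagnostic that explains the drop.
    return {
      kind: "capture.warning",
      reason: `no canonical mapping for raw type "${rawType}"`,
      raw: parsed,
    };
  }
  return { kind: "event", rawType, raw: parsed };
}
```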

### Truthfulness Invariants

If an agent integration cannot expose a certain class of events, the system MUST record capability absence explicitly and MUST NOT pretend completeness.

Example:

- if an SDK-based integration does not expose thinking deltas, Smithers MUST mark that event class as unsupported for that agent family

## Scope of Observability

The system covers three layers.

### Layer 1: Canonical Runtime Record

Smithers SHOULD persist canonical trace events durably for replay and audit, alongside existing run events and attempt data.

### Layer 2: OTEL Log Export

Smithers MUST export canonical trace events as OTEL logs when OTEL log export is enabled.

### Layer 3: Summary Metrics and Diagnostics

Smithers MAY derive metrics from canonical trace events, but those metrics are secondary and MUST NOT be the sole evidence of capture correctness.

## Agent Capability Model

Each agent family MUST declare an explicit trace capability profile.

The capability profile MUST enumerate support for:

- session metadata
- assistant text deltas
- visible thinking deltas
- final assistant message
- tool execution start
- tool execution update
- tool execution end
- retry events
- compaction events
- raw stderr diagnostics
- persisted session artifact
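
Non-normative sketch: the enumeration above could be declared as a typed profile from which the unsupported set is derived mechanically. The capability names and the `sdkProfile` values are illustrative assumptions, not the declared profile of any shipped integration.

```typescript
const TRACE_CAPABILITIES = [
  "sessionMetadata",
  "assistantTextDeltas",
  "visibleThinkingDeltas",
  "finalAssistantMessage",
  "toolExecutionStart",
  "toolExecutionUpdate",
  "toolExecutionEnd",
  "retryEvents",
  "compactionEvents",
  "rawStderrDiagnostics",
  "persistedSessionArtifact",
] as const;

type TraceCapability = (typeof TRACE_CAPABILITIES)[number];
type TraceCapabilityProfile = Readonly<Record<TraceCapability, boolean>>;

// Derive the unsupported set from the profile instead of maintaining it by hand.
function unsupportedCapabilities(profile: TraceCapabilityProfile): TraceCapability[] {
  return TRACE_CAPABILITIES.filter((capability) => !profile[capability]);
}

// Illustrative profile for an SDK-style family (assumed values).
const sdkProfile: TraceCapabilityProfile = {
  sessionMetadata: false,
  assistantTextDeltas: false,
  visibleThinkingDeltas: false,
  finalAssistantMessage: true,
  toolExecutionStart: true,
  toolExecutionUpdate: false,
  toolExecutionEnd: true,
  retryEvents: false,
  compactionEvents: false,
  rawStderrDiagnostics: false,
  persistedSessionArtifact: false,
};
```

Because the profile is a `Record` over the full capability list, omitting a capability is a compile-time error, which keeps every declaration explicit.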

### PiAgent

`PiAgent` MUST be treated as a high-fidelity integration.

Available sources include:

- JSON event stream mode
- RPC mode event stream
- Pi session JSONL artifacts

Pi exposes event types such as:

- `agent_start`
- `agent_end`
- `turn_start`
- `turn_end`
- `message_start`
- `message_update`
- `message_end`
- `tool_execution_start`
- `tool_execution_update`
- `tool_execution_end`
- `auto_compaction_start`
- `auto_compaction_end`
- `auto_retry_start`
- `auto_retry_end`

Visible thinking content emitted by Pi MUST be captured as trace content.

Pi session artifacts, when enabled and available, SHOULD be recorded as canonical artifacts associated with the attempt.

### CodexAgent

`CodexAgent` MUST be treated as a structured CLI integration with medium fidelity.

Codex emits JSON output. Smithers MUST preserve all parseable structured events made available by that mode.

If Codex exposes usage, step, message, tool, or completion events, Smithers MUST map them to canonical trace events rather than extracting only final text.

If a given Codex event schema is unstable, Smithers MUST preserve the raw event payload and classify the normalization conservatively.

### ClaudeCodeAgent

`ClaudeCodeAgent` MUST be treated as a structured CLI integration with medium fidelity.

When `stream-json` is enabled, Smithers MUST preserve all parseable stream records and map them into canonical trace events where possible.

Partial assistant messages, tool call indicators, and usage events MUST NOT be discarded if they are parseable.

### GeminiAgent

`GeminiAgent` MUST be treated as a structured CLI integration with low to medium fidelity depending on output mode.

Smithers MUST preserve parseable structured output and MUST explicitly mark unsupported event classes when the CLI exposes only final or coarse-grained results.

### KimiAgent

`KimiAgent` MUST be treated as a structured CLI integration with low to medium fidelity depending on `outputFormat`.

If `stream-json` mode is used, Smithers MUST preserve event records. If only final text is available, Smithers MUST mark the trace as partial.

### OpenAIAgent and AnthropicAgent

`OpenAIAgent` and `AnthropicAgent` MUST be treated as SDK integrations.

They do not inherently expose a rich subprocess event stream in the current Smithers wrapper.

For these agents, Smithers MUST capture:

- prompt dispatch boundaries
- final assistant response
- token usage when surfaced
- Smithers-side tool execution start and end
- visible tool output recorded by Smithers
- node output emitted by Smithers if any

Smithers MUST mark thinking deltas and message lifecycle as unsupported unless the underlying SDK path is instrumented to provide them.

### AmpAgent and ForgeAgent

`AmpAgent` and `ForgeAgent` MUST be treated as text-first subprocess integrations unless a structured mode is added.

Smithers MUST capture:

- final response text
- stderr diagnostics
- Smithers-side tool execution and runtime events

Smithers MUST mark full trace fidelity as unsupported for these integrations.

## Capture Modes

Each attempt MUST declare one capture mode:

- `sdk-events`
- `rpc-events`
- `cli-json-stream`
- `cli-json`
- `cli-text`
- `artifact-import`

Capture mode is part of the canonical attempt metadata and MUST be exported with every trace record.

## Canonical Data Model

Smithers MUST introduce a canonical event model for agent traces.

The exact TypeScript shape is an implementation detail, but the semantic fields are mandatory.

### Attempt Metadata

Each attempt MUST expose:

- `traceVersion`
- `agentFamily`
- `agentId`
- `model`
- `captureMode`
- `traceCompleteness`
- `unsupportedEventKinds`
- `traceStartedAtMs`
- `traceFinishedAtMs`
- `rawArtifactRefs`

### `traceCompleteness`

`traceCompleteness` MUST be one of:

- `full-observed`
- `partial-observed`
- `final-only`
- `capture-failed`

Definitions:

- `full-observed`: Smithers captured every event class the integration claims to support
- `partial-observed`: Smithers captured some but not all supported classes
- `final-only`: only final response and coarse metadata were available
- `capture-failed`: Smithers expected trace events but could not capture them reliably
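
Non-normative sketch of one possible reading of these definitions. `classifyCompleteness` is hypothetical, and the choice of `assistant.message.final` and `usage` as the "coarse" kinds is an assumption. The sketch also treats a supported kind that never occurred in a run as missing, which a real implementation may want to refine.

```typescript
type TraceCompleteness = "full-observed" | "partial-observed" | "final-only" | "capture-failed";

// Assumption: these kinds count as "final response and coarse metadata".
const COARSE_KINDS: ReadonlySet<string> = new Set(["assistant.message.final", "usage"]);

function classifyCompleteness(input: {
  supportedKinds: ReadonlySet<string>;
  observedKinds: ReadonlySet<string>;
  captureFailed: boolean;
}): TraceCompleteness {
  // A terminally broken capture overrides every other classification.
  if (input.captureFailed) return "capture-failed";
  const supported = Array.from(input.supportedKinds);
  // The integration can only ever provide coarse records.
  if (supported.every((kind) => COARSE_KINDS.has(kind))) return "final-only";
  // Otherwise compare what was claimed with what was actually seen.
  return supported.every((kind) => input.observedKinds.has(kind))
    ? "full-observed"
    : "partial-observed";
}
```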

### Canonical Event Fields

Every canonical trace event MUST include:

- `traceVersion`
- `runId`
- `workflowPath`
- `workflowHash` when available
- `nodeId`
- `iteration`
- `attempt`
- `timestampMs`
- `event.sequence`
- `event.kind`
- `event.phase`
- `source.agentFamily`
- `source.captureMode`
- `source.rawType`
- `source.observed`
- `payload`
- `raw`
- `redaction`
- `annotations`
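
Non-normative sketch: since the exact TypeScript shape is left to the implementation, the interface below is only one plausible rendering of the mandatory fields. The nesting under `event` and `source`, the string type of `traceVersion`, and the example values are assumptions.

```typescript
type CaptureMode =
  | "sdk-events"
  | "rpc-events"
  | "cli-json-stream"
  | "cli-json"
  | "cli-text"
  | "artifact-import";

type EventPhase = "agent" | "turn" | "message" | "tool" | "session" | "capture" | "artifact";

interface CanonicalTraceEvent {
  traceVersion: string;
  runId: string;
  workflowPath?: string; // omitted when unavailable, never a placeholder
  workflowHash?: string; // only when available
  nodeId: string;
  iteration: number;
  attempt: number;
  timestampMs: number;
  event: { sequence: number; kind: string; phase: EventPhase };
  source: { agentFamily: string; captureMode: CaptureMode; rawType: string; observed: boolean };
  payload: Record<string, unknown>;
  raw: unknown; // raw upstream object or text after redaction, or null
  redaction: { applied: boolean; ruleIds: string[] };
  annotations: Record<string, string | number | boolean>;
}

// Illustrative event: a directly observed text delta from a Pi JSON stream.
const exampleEvent: CanonicalTraceEvent = {
  traceVersion: "1",
  runId: "run-123",
  workflowPath: "workflows/agent-trace-otel-demo.tsx",
  nodeId: "node-a",
  iteration: 0,
  attempt: 1,
  timestampMs: 1700000000000,
  event: { sequence: 0, kind: "assistant.text.delta", phase: "message" },
  source: { agentFamily: "pi", captureMode: "cli-json-stream", rawType: "message_update", observed: true },
  payload: { text: "Hel" },
  raw: { type: "message_update" },
  redaction: { applied: false, ruleIds: [] },
  annotations: {},
};
```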

### `event.kind`

`event.kind` MUST be chosen from a controlled vocabulary.

The initial vocabulary MUST include:

- `session.start`
- `session.end`
- `turn.start`
- `turn.end`
- `message.start`
- `message.update`
- `message.end`
- `assistant.text.delta`
- `assistant.thinking.delta`
- `assistant.message.final`
- `tool.execution.start`
- `tool.execution.update`
- `tool.execution.end`
- `tool.result`
- `retry.start`
- `retry.end`
- `compaction.start`
- `compaction.end`
- `stderr`
- `stdout`
- `usage`
- `capture.warning`
- `capture.error`
- `artifact.created`

No integration-specific naming is allowed in `event.kind`. Integration-specific names MUST remain in `source.rawType`.

### `event.phase`

`event.phase` MUST be one of:

- `agent`
- `turn`
- `message`
- `tool`
- `session`
- `capture`
- `artifact`

### `source.observed`

`source.observed` MUST be a boolean indicating whether the payload was directly observed from the upstream integration.

Derived normalization records MUST set `source.observed` to `false`.

### `payload`

`payload` MUST contain normalized fields intended for stable querying and display.

Examples:

- for `assistant.text.delta`: `{ text: string }`
- for `assistant.thinking.delta`: `{ text: string }`
- for `tool.execution.start`: `{ toolCallId: string, toolName: string, argsPreview: unknown }`
- for `tool.execution.end`: `{ toolCallId: string, toolName: string, isError: boolean, resultPreview: unknown }`

### `raw`

`raw` MUST contain the raw upstream object or raw text fragment exactly as captured, with any required redaction already applied.

If no raw form exists, `raw` MAY be `null`.

## Custom Annotations

The system MUST support user-defined annotations attached at run start.

Annotations MUST be:

- provided in run options and server APIs
- stored durably on the run
- merged into every canonical trace event at export time

Annotations MUST support scalar values only:

- string
- number
- boolean

Nested objects and arrays MUST be rejected or flattened before run start. The behavior MUST be explicit and deterministic.

The following annotation namespaces are reserved:

- `smithers.*`
- `run.*`
- `workflow.*`
- `node.*`
- `agent.*`
- `otel.*`

User annotations SHOULD use a `custom.*` prefix in canonical export.
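
Non-normative sketch of the "reject" option for annotation validation. `normalizeAnnotations` is a hypothetical helper. It throws on nested values and reserved namespaces, and it prefixes remaining keys with `custom.`.

```typescript
type AnnotationValue = string | number | boolean;

const RESERVED_PREFIXES = ["smithers.", "run.", "workflow.", "node.", "agent.", "otel."];

function normalizeAnnotations(input: Record<string, unknown>): Record<string, AnnotationValue> {
  const out: Record<string, AnnotationValue> = {};
  for (const [key, value] of Object.entries(input)) {
    if (RESERVED_PREFIXES.some((prefix) => key.startsWith(prefix))) {
      throw new Error(`annotation "${key}" uses a reserved namespace`);
    }
    // Objects, arrays, null, and undefined are rejected rather than flattened.
    if (typeof value !== "string" && typeof value !== "number" && typeof value !== "boolean") {
      throw new Error(`annotation "${key}" must be a string, number, or boolean`);
    }
    out[key.startsWith("custom.") ? key : `custom.${key}`] = value;
  }
  return out;
}
```

Rejecting at run start keeps the behavior deterministic: the same input either always fails or always produces the same exported keys.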

## Workflow Metadata Requirements

Every canonical trace event MUST include:

- `workflow.path` as an OTEL attribute when available
- `workflow.hash` as an OTEL attribute when available

If the workflow path is unavailable, Smithers MUST omit the `workflow.path` attribute rather than invent a placeholder path.

## Redaction Model

Redaction support is mandatory because agent traces can contain sensitive content.

The implementation MUST support:

- disabled redaction
- default redaction
- custom redaction rules

### Minimum Default Redaction

Default redaction MUST handle at least:

- API keys
- bearer tokens
- common secret env vars
- authorization headers
- cookie headers
- explicitly configured secret literals

### Redaction Semantics

Redaction MUST occur before:

- durable canonical trace persistence
- OTEL log export
- artifact snapshot export

If redaction modifies content, the trace event MUST record:

- `redaction.applied = true`
- `redaction.ruleIds = string[]`
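
Non-normative sketch of rule-based text redaction that records which rules fired. The rule ids and regular expressions are illustrative assumptions and are far from a complete default rule set.

```typescript
type RedactionRule = { id: string; pattern: RegExp; replacement: string };
type RedactionResult = { text: string; redaction: { applied: boolean; ruleIds: string[] } };

// Illustrative rules only; real defaults would also cover cookies, env vars, and more.
const EXAMPLE_RULES: readonly RedactionRule[] = [
  { id: "bearer-token", pattern: /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g, replacement: "Bearer [REDACTED]" },
  { id: "api-key", pattern: /\bsk-[A-Za-z0-9]{16,}\b/g, replacement: "[REDACTED_API_KEY]" },
];

function redactText(text: string, rules: readonly RedactionRule[]): RedactionResult {
  const ruleIds: string[] = [];
  let current = text;
  for (const rule of rules) {
    const next = current.replace(rule.pattern, rule.replacement);
    // Record a rule only when it actually changed the content.
    if (next !== current) ruleIds.push(rule.id);
    current = next;
  }
  return { text: current, redaction: { applied: ruleIds.length > 0, ruleIds } };
}
```

Running this before persistence, export, and artifact snapshotting means the original value never reaches any of the three sinks.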

## Export Model

Canonical trace events MUST be exportable as OTEL logs.

### OTEL Collector Requirements

The collector configuration MUST define a `logs` pipeline.

The logs pipeline MUST accept OTLP input and MUST support at least one of:

- OTLP logs exporter
- Loki exporter

The local development stack SHOULD include Loki for verification and human inspection.

### OTEL Record Shape

For each canonical trace event, Smithers MUST emit one OTEL log record.

The log body MUST contain a compact structured JSON representation of:

- canonical payload
- raw payload when configured
- redaction metadata

The log attributes MUST include:

- `service.name`
- `smithers.trace.version`
- `run.id`
- `workflow.path`
- `workflow.hash` when available
- `node.id` when available
- `node.iteration` when available
- `node.attempt` when available
- `agent.family`
- `agent.id` when available
- `agent.model` when available
- `agent.capture_mode`
- `trace.completeness`
- `event.kind`
- `event.phase`
- `event.sequence`
- `source.raw_type`
- `source.observed`

Custom annotations MUST be exported as OTEL attributes under `custom.*`.
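
Non-normative sketch of the attribute shaping described above, written as a pure function with no OTEL SDK dependency. `ExportableTraceEvent` and `toOtelAttributes` are hypothetical. The sketch follows the list above in emitting `service.name` alongside the other attributes, and it omits optional attributes rather than exporting placeholders.

```typescript
type AttrValue = string | number | boolean;

// Hypothetical flattened view of a canonical event plus its attempt metadata.
interface ExportableTraceEvent {
  traceVersion: string;
  runId: string;
  workflowPath?: string;
  workflowHash?: string;
  nodeId?: string;
  iteration?: number;
  attempt?: number;
  agentFamily: string;
  agentId?: string;
  model?: string;
  captureMode: string;
  traceCompleteness: string;
  kind: string;
  phase: string;
  sequence: number;
  rawType: string;
  observed: boolean;
  annotations: Record<string, AttrValue>;
}

function toOtelAttributes(e: ExportableTraceEvent, serviceName: string): Record<string, AttrValue> {
  const attrs: Record<string, AttrValue> = {
    "service.name": serviceName,
    "smithers.trace.version": e.traceVersion,
    "run.id": e.runId,
    "agent.family": e.agentFamily,
    "agent.capture_mode": e.captureMode,
    "trace.completeness": e.traceCompleteness,
    "event.kind": e.kind,
    "event.phase": e.phase,
    "event.sequence": e.sequence,
    "source.raw_type": e.rawType,
    "source.observed": e.observed,
  };
  // "When available" attributes are added only if a value exists.
  const optional: Array<[string, AttrValue | undefined]> = [
    ["workflow.path", e.workflowPath],
    ["workflow.hash", e.workflowHash],
    ["node.id", e.nodeId],
    ["node.iteration", e.iteration],
    ["node.attempt", e.attempt],
    ["agent.id", e.agentId],
    ["agent.model", e.model],
  ];
  for (const [key, value] of optional) {
    if (value !== undefined) attrs[key] = value;
  }
  // User annotations always land under custom.*.
  for (const [key, value] of Object.entries(e.annotations)) {
    attrs[key.startsWith("custom.") ? key : `custom.${key}`] = value;
  }
  return attrs;
}
```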

### Attribute Cardinality Rules

The following MUST be attributes:

- run identifiers
- workflow identifiers
- node identifiers
- attempt identifiers
- event kind
- agent family
- capture mode

The following MUST NOT be indexed as labels in Loki-specific configurations:

- full prompt text
- full response text
- thinking text
- tool args bodies
- tool result bodies
- arbitrary user free-text annotations

These large fields MUST remain in the log body.

### Severity Mapping

Severity SHOULD be assigned as follows:

- normal trace events: `INFO`
- stderr and non-terminal capture anomalies: `WARN`
- capture failures and export failures: `ERROR`

Severity MUST NOT be used to encode event kind.
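
Non-normative sketch of the severity mapping for canonical event kinds. `severityFor` is hypothetical. The numeric values are the lowest values of the INFO, WARN, and ERROR ranges in the OpenTelemetry log data model. Export failures are not canonical event kinds and are not modeled here.

```typescript
type Severity = { text: "INFO" | "WARN" | "ERROR"; number: 9 | 13 | 17 };

function severityFor(eventKind: string): Severity {
  // Capture failures are errors.
  if (eventKind === "capture.error") return { text: "ERROR", number: 17 };
  // stderr and non-terminal capture anomalies are warnings.
  if (eventKind === "capture.warning" || eventKind === "stderr") return { text: "WARN", number: 13 };
  // Everything else is a normal trace event.
  return { text: "INFO", number: 9 };
}
```

The event kind still travels in the `event.kind` attribute, so severity stays a coarse signal rather than a second encoding of the kind.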

## Persistence Model

Canonical trace events SHOULD be durably persisted by Smithers in addition to OTEL export.

If durable persistence is implemented, the persistence layer MUST support:

- ordered replay by attempt
- filtering by event kind
- pagination by sequence
- artifact references

OTEL export MUST NOT be the only storage location for canonical trace data.
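
Non-normative sketch of ordered replay with kind filtering and sequence-based pagination, using an in-memory array as a stand-in for the real store. `StoredTraceEvent`, `ReplayQuery`, and `replayPage` are hypothetical names.

```typescript
interface StoredTraceEvent {
  runId: string;
  nodeId: string;
  iteration: number;
  attempt: number;
  sequence: number;
  kind: string;
}

interface ReplayQuery {
  runId: string;
  nodeId: string;
  iteration: number;
  attempt: number;
  kind?: string; // optional filter by event kind
  afterSequence?: number; // cursor: return events with a greater sequence
  limit: number;
}

function replayPage(
  store: readonly StoredTraceEvent[],
  q: ReplayQuery,
): { events: StoredTraceEvent[]; nextAfterSequence: number | null } {
  const matching = store
    .filter((e) => e.runId === q.runId && e.nodeId === q.nodeId)
    .filter((e) => e.iteration === q.iteration && e.attempt === q.attempt)
    .filter((e) => q.kind === undefined || e.kind === q.kind)
    .filter((e) => q.afterSequence === undefined || e.sequence > q.afterSequence)
    .sort((a, b) => a.sequence - b.sequence);
  const events = matching.slice(0, q.limit);
  const last = events[events.length - 1];
  // A cursor is returned only when more events remain after this page.
  const nextAfterSequence = matching.length > q.limit && last !== undefined ? last.sequence : null;
  return { events, nextAfterSequence };
}
```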

## Artifact Model

Some agent integrations expose richer external artifacts than can be represented comfortably as log streams.

Examples:

- Pi session JSONL files
- raw CLI JSON event transcripts
- exported HTML or JSONL session artifacts

Smithers SHOULD support trace artifacts with metadata:

- `artifact.kind`
- `artifact.path`
- `artifact.contentType`
- `artifact.bytes`
- `artifact.createdAtMs`
- `artifact.redacted`

Artifact creation MUST also emit canonical `artifact.created` events.

## Failure Model

The implementation MUST classify failures explicitly.

### Capture Failure

Capture failure means Smithers could not reliably obtain agent trace input it expected from the selected capture mode.

Examples:

- malformed JSON stream
- unexpected subprocess termination before terminal event
- SDK callback channel failure

Capture failure MUST:

- mark attempt `traceCompleteness = capture-failed` when terminally broken
- emit a `capture.error` canonical event
- include diagnostic details

### Partial Capture

Partial capture means Smithers obtained some trace events but missed expected categories.

Examples:

- stdout stream cut off after several tool events
- session artifact missing though event stream completed

Partial capture MUST:

- mark attempt `traceCompleteness = partial-observed`
- record missing classes in `unsupportedEventKinds` or `missingExpectedEventKinds`

### Export Failure

Export failure means Smithers captured canonical trace events but could not deliver them to the OTEL backend.

Export failure MUST NOT erase canonical local truth.

If export fails:

- canonical local persistence MUST still succeed when enabled
- Smithers MUST emit operator diagnostics through existing logs
- the run MUST remain inspectable from durable local records

## Normalization Rules

Normalization MUST be conservative.

### One Raw Event to One Canonical Event

As a default rule, one raw upstream event SHOULD map to one canonical trace event.

If one raw event yields multiple canonical events, the implementation MUST document why and MUST include a stable parent link.

### Text Deltas

Assistant text deltas MUST remain deltas if the upstream protocol provided deltas.

Smithers MUST NOT collapse deltas into a single blob during export.

Final assembled messages MAY be emitted separately as `assistant.message.final`.
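
Non-normative sketch: the deltas are kept as separate events, and the assembled message is appended as an additional event marked as derived. `withAssembledFinal` is hypothetical, and the top-level `observed` flag is a simplification of `source.observed`.

```typescript
type TextDelta = { sequence: number; kind: "assistant.text.delta"; payload: { text: string } };
type AssembledFinal = {
  sequence: number;
  kind: "assistant.message.final";
  payload: { text: string };
  observed: false; // derived by Smithers, not received from upstream
};

function withAssembledFinal(deltas: readonly TextDelta[]): Array<TextDelta | AssembledFinal> {
  const ordered = deltas.slice().sort((a, b) => a.sequence - b.sequence);
  const last = ordered[ordered.length - 1];
  if (last === undefined) return [];
  const finalEvent: AssembledFinal = {
    sequence: last.sequence + 1,
    kind: "assistant.message.final",
    payload: { text: ordered.map((d) => d.payload.text).join("") },
    observed: false,
  };
  // Deltas are preserved as-is; the final message is appended, never substituted.
  return [...ordered, finalEvent];
}
```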

### Thinking Deltas

Visible thinking content MUST be captured as its own event class and MUST NOT be merged into assistant text.

### Tool Calls

Tool lifecycle MUST preserve:

- stable tool call identifier when upstream provides one
- tool name
- visible arguments or argument preview
- partial updates when available
- final result preview
- error flag
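
Non-normative sketch of pairing tool lifecycle events by `toolCallId`, which also surfaces calls that started but never ended. The payload shapes follow the `payload` examples in this document. `summarizeToolCalls` is hypothetical and ignores `tool.execution.update` events for brevity.

```typescript
type ToolEvent =
  | {
      kind: "tool.execution.start";
      sequence: number;
      payload: { toolCallId: string; toolName: string; argsPreview: unknown };
    }
  | {
      kind: "tool.execution.end";
      sequence: number;
      payload: { toolCallId: string; toolName: string; isError: boolean; resultPreview: unknown };
    };

type ToolCallSummary = {
  toolCallId: string;
  toolName: string;
  startSequence: number;
  endSequence: number | null; // null when no end event was observed
  isError: boolean | null;
};

function summarizeToolCalls(events: readonly ToolEvent[]): ToolCallSummary[] {
  const byId = new Map<string, ToolCallSummary>();
  for (const e of events.slice().sort((a, b) => a.sequence - b.sequence)) {
    if (e.kind === "tool.execution.start") {
      byId.set(e.payload.toolCallId, {
        toolCallId: e.payload.toolCallId,
        toolName: e.payload.toolName,
        startSequence: e.sequence,
        endSequence: null,
        isError: null,
      });
    } else {
      // An end event without a matching start is ignored in this sketch.
      const open = byId.get(e.payload.toolCallId);
      if (open !== undefined) {
        open.endSequence = e.sequence;
        open.isError = e.payload.isError;
      }
    }
  }
  return Array.from(byId.values());
}
```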

### Usage

Usage records MUST be separate canonical events or attached to terminal message events in a way that remains queryable.

If usage is attached, it MUST still be accessible without parsing free-form text.

## Required Runtime Integration Points

The implementation MUST integrate at these boundaries.

### Agent Boundary

Every agent integration MUST report raw trace observations into the canonical trace capture layer.

No agent integration is allowed to silently parse and discard upstream event records before the capture layer sees them.

### Event Bus Boundary

Canonical trace events SHOULD be emitted through or alongside the existing event bus so that:

- they share run correlation
- they can participate in durable persistence
- they can reuse existing event-driven verification infrastructure

### Attempt Finalization Boundary

When an attempt finishes, Smithers MUST finalize trace metadata:

- `traceFinishedAtMs`
- `traceCompleteness`
- `unsupportedEventKinds`
- `rawArtifactRefs`

## Required Configuration Surface

The implementation MUST define explicit configuration for:

- enabling OTEL log export
- selecting backend endpoint
- enabling or disabling canonical local trace persistence
- selecting redaction mode
- retaining or dropping raw payload bodies
- retaining or dropping raw artifacts
- maximum event body bytes
- maximum artifact bytes

The configuration MUST distinguish:

- runtime operator policy
- run-specific annotations

## Required Operator Queries

The design is incomplete unless the following operator queries are supported.

### Query Set A: Run Reconstruction

Operators MUST be able to answer:

- show all trace records for one run
- show all trace records for one run and node
- show only one attempt for one node
- show ordered assistant text deltas
- show visible thinking deltas when present
- show tool calls and results in order

### Query Set B: Failure Analysis

Operators MUST be able to answer:

- which runs had trace capture failures
- which agents only provide final-only traces
- which attempts terminated without a terminal agent event
- which traces were partially redacted

### Query Set C: Audit

Operators MUST be able to answer:

- what annotations were attached to a run
- which workflow file and workflow hash produced the trace
- which raw artifact file corresponds to this attempt

## Verification Specification

Task completion is not defined by code existing. It is defined by observable correctness.

The implementation is complete only if every verification class below passes.

## Verification Class 1: Schema Correctness

For each supported agent family, automated tests MUST verify that canonical trace events:

- conform to the declared schema
- contain required identity fields
- maintain monotonic `event.sequence`
- correctly classify `traceCompleteness`

Completion criterion:

- zero schema violations in test fixtures

## Verification Class 2: Ordering Correctness

Automated tests MUST verify that for one attempt:

- event sequences are strictly monotonic
- final events occur after preceding deltas
- no duplicate sequence numbers appear

Completion criterion:

- deterministic ordering across repeated test runs

## Verification Class 3: Fidelity Correctness

Fixture-based tests MUST compare raw upstream inputs with canonical trace outputs.

For each fixture:

- every parseable upstream event MUST result in a canonical event or an explicit diagnostic drop event
- visible thinking content MUST remain distinguishable from assistant text
- tool call identifiers and names MUST survive normalization

Completion criterion:

- full fixture coverage for each agent family and capture mode supported by Smithers

## Verification Class 4: Completeness Classification

Tests MUST verify the semantics of:

- `full-observed`
- `partial-observed`
- `final-only`
- `capture-failed`

Completion criterion:

- each classification is produced by at least one explicit test case

## Verification Class 5: OTEL Export Correctness

Integration tests MUST verify that canonical trace events become OTEL log records with:

- required attributes present
- correct body shape
- correct severity mapping
- correct custom annotation export

Completion criterion:

- logs are queryable in the target backend by `run.id`, `workflow.path`, `node.id`, `node.attempt`, and `event.kind`

## Verification Class 6: Loki Query Correctness

In a local stack with Loki enabled, end-to-end tests MUST verify that an operator can query:

- all records for a run
- all records for a node attempt
- only thinking deltas
- only tool execution records
- only capture errors

Completion criterion:

- documented query examples return expected results against test data

## Verification Class 7: Artifact Correctness

When artifact capture is enabled, tests MUST verify:

- artifact references are recorded
- artifacts exist on disk or in configured storage
- artifact metadata matches actual content
- artifact creation emits corresponding canonical events

Completion criterion:

- no dangling artifact references

## Verification Class 8: Redaction Correctness

Tests MUST verify that redaction:

- removes required secrets from canonical payloads, raw payloads, OTEL bodies, and artifacts
- leaves non-sensitive content intact
- records which rules were applied

Completion criterion:

- zero known secret literals leak in test fixtures

## Verification Class 9: Failure Resilience

Tests MUST verify behavior when:

- collector is unavailable
- backend rejects logs
- malformed upstream JSON is encountered
- subprocess exits before terminal event
- artifact write fails

Completion criterion:

- capture failures are classified
- local diagnostics exist
- durable local truth remains accessible when configured
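
One way to satisfy all three criteria is to persist locally before attempting export and to turn export failures into a classified result instead of an exception. The sketch below assumes hypothetical `LocalStore` and `Exporter` interfaces and uses a synchronous exporter as a stand-in for the real, asynchronous OTEL export call.

```typescript
// Hypothetical interfaces; the real store and exporter are not shown in this spec.
interface LocalStore {
  append(record: string): void;
}

// Synchronous stand-in for the OTEL export call; throws on failure.
type Exporter = (record: string) => void;

interface EmitResult {
  persistedLocally: boolean;
  exported: boolean;
  exportError?: string;
}

function emitTraceRecord(record: string, store: LocalStore, exporter: Exporter): EmitResult {
  // Local truth first: an export failure must never lose the record.
  store.append(record);
  try {
    exporter(record);
    return { persistedLocally: true, exported: true };
  } catch (err) {
    // Keep a diagnostic instead of propagating the failure to the run.
    const exportError = err instanceof Error ? err.message : String(err);
    return { persistedLocally: true, exported: false, exportError };
  }
}
```

A resilience test can inject an exporter that throws (collector unavailable, backend rejection) and assert that the record is still present in the local store.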

## Verification Class 10: Cross-Signal Correlation

Tests MUST verify that logs correlate with:

- run lifecycle events
- metrics
- spans

At minimum, operators MUST be able to join by:

- `run.id`
- `node.id`
- `attempt`

Completion criterion:

- one documented workflow run can be traced across event log, OTEL logs, and metrics without ambiguity

## Acceptance Criteria

The feature is not done until all of the following are true.

### A. Canonical Model Exists

Smithers has a canonical agent trace schema with explicit completeness states and per-agent capability declarations.

### B. Pi Is High Fidelity

`PiAgent` exports structured trace records for:

- session lifecycle
- turn lifecycle
- message lifecycle
- assistant text deltas
- visible thinking deltas
- tool execution lifecycle
- retry and compaction events

### C. Other Agents Are Truthfully Classified

Every agent in `src/agents/` has a declared fidelity class and unsupported event set.

### D. OTEL Logs Pipeline Exists

The collector and local observability stack support OTEL logs end to end.

### E. Queries Work

Operators can answer the required run reconstruction, failure analysis, and audit queries from the exported logs.

### F. Verification Is Automated

Automated tests exist for schema, ordering, fidelity, completeness, OTEL export, redaction, failure handling, and query correctness.

## Implementation Phasing

This section is normative for rollout order.

### Phase 1: Canonical Model

Implement:

- canonical trace schema
- completeness classification
- per-agent capability declarations

### Phase 2: Pi Fidelity

Implement:

- Pi raw event capture
- canonical normalization
- OTEL export
- artifact capture for session files if configured

### Phase 3: Structured CLI Agents

Implement:

- Codex
- Claude Code
- Gemini
- Kimi

Each integration MUST ship with fixture-based normalization tests before being considered complete.

### Phase 4: SDK and Text-Only Agents

Implement:

- explicit partial or final-only capture
- truthful capability declarations
- OTEL export for the observable subset

### Phase 5: Redaction and Hardening

Implement:

- default redaction
- export failure handling
- artifact verification
- documented local Loki queries

## Explicit Non-Ambiguities

The following choices are intentional.

- Smithers MUST prefer truthful partial fidelity over fake completeness.
- Smithers MUST preserve raw event boundaries rather than collapsing everything into summaries.
- Smithers MUST keep large content in log bodies, not indexing labels.
- Smithers MUST retain a local source of truth when OTEL export fails.
- Smithers MUST separate assistant text from visible thinking.
- Smithers MUST define task completion in terms of verification evidence, not implementation effort.

## Out of Scope for the First Implementation

The first implementation MAY defer:

- remote artifact storage
- cross-run session graph visualizations
- backend-specific dashboards beyond minimal verification queries
- universal reconstruction of provider-internal hidden reasoning

If deferred, these items MUST be documented explicitly and MUST NOT be implied to exist.

## Summary

The required system is not “send some logs to Loki.”

The required system is:

- a canonical agent trace model
- explicit capability declarations per integration
- conservative capture of all observable upstream events
- durable local truth
- OTEL log export with stable correlation fields
- redaction before persistence and export
- verification that proves fidelity, completeness, and queryability

Anything less produces observability that looks complete while remaining operationally unreliable.

SamuelLHuber · 2026-03-28T11:18:30Z

Added verification coverage beyond Pi.

Testing I ran:

bun test tests/agent-trace.test.tsx --test-name-pattern 'real Gemini|real Claude|artifact.created'
bun run src/cli/index.ts run workflows/agent-trace-otel-demo.tsx --run-id agent-trace-otel-demo-codex-20260328-121057 ...
real Claude CLI run: real-agent-trace-claude-final-20260328-115437
real Gemini CLI run: real-agent-trace-gemini-buffered-20260328-115756

The self-contained observability workflow/script now explicitly exercises pi-rich-trace, claude-structured-trace, gemini-structured-trace, codex-structured-trace, and sdk-final-only, so the verification bundle is no longer Pi-only.

The trace collector was overstating capture fidelity by treating synthesized terminal events as if they had been directly observed. It also kept inconsistent usage payloads and relied on fragile tool lifecycle identity reconstruction. This change separates directly observed events from backfilled canonical events so completeness reflects what Smithers actually captured for each attempt. It normalizes usage records to one schema, preserves later terminal usage data, and threads a stable toolCallId through Smithers tool events so start and finish records stay correlated. The result keeps coarse capture modes truthfully classified as final-only while marking real structured gaps as partial-observed, which makes the trace model easier to reason about and aligns it with the rest of the codebase's observability behavior.

…thfully

Smithers was still under-reporting fidelity for real Codex and Claude CLI runs even after the earlier trace semantics cleanup. Codex emitted structured dotted JSONL events that we were treating as coarse text in practice, and Claude exposed an observed terminal result that we were only backfilling into a synthetic final message. This change makes Codex self-describe its stream-json mode, normalizes real dotted Codex events into canonical turn and final-message records, and treats Claude result events as directly observed final assistant output. It also hardens the observability verification script so publication checks assert the actual trace summary semantics for Claude and Codex instead of only checking that logs exist. That keeps the published bookmark aligned with the behavior we verified manually and protects the trace fidelity guarantees the PR is trying to establish.

SamuelLHuber force-pushed the feat/agent-trace-otel-logs branch from 4632899 to 38dbe04 Compare March 28, 2026 10:00

SamuelLHuber force-pushed the feat/agent-trace-otel-logs branch from 38dbe04 to 7f9388b Compare March 28, 2026 11:15

feat(observability): complete agent trace OTEL verification

3ac8f25

SamuelLHuber force-pushed the feat/agent-trace-otel-logs branch from 5897da8 to 3ac8f25 Compare March 28, 2026 11:39

SamuelLHuber added 3 commits March 28, 2026 15:37

refactor(observability): reduce agent trace normalization verbosity

aa39ee6

SamuelLHuber force-pushed the feat/agent-trace-otel-logs branch from 24d48ce to aa39ee6 Compare March 28, 2026 18:38

SamuelLHuber wants to merge 4 commits into codeplaneapp:main from SamuelLHuber:feat/agent-trace-otel-logs
