feat(observability): add canonical agent trace OTEL logs with Loki verification #119
Draft
SamuelLHuber wants to merge 4 commits into codeplaneapp:main
4632899 to 38dbe04
Here's the spec. We removed
Agent Trace OTEL Logs Specification
---
title: Agent Trace OTEL Logs Specification
description: Full-fidelity specification for capturing agent-visible execution traces and exporting them as OpenTelemetry logs to Loki or any OTLP-compatible backend.
---
This document specifies how Smithers must capture, normalize, persist, export, and verify agent execution traces as OpenTelemetry logs.
This is a design specification, not an implementation sketch. Every requirement in this document is normative unless explicitly marked as non-normative.
## Status
- Intended scope: new observability surface for agent trace logs
- Intended audience: maintainers implementing runtime, agent, observability, and verification changes
- Intended outcome: a system where every supported agent run produces a complete, queryable, correlated trace of what Smithers could observe
## Problem Statement
Smithers currently captures:
- durable workflow lifecycle events
- structured application logs
- traces and metrics for runtime behavior
- partial agent output in some cases
Smithers does not currently guarantee a full-fidelity record of agent-visible execution behavior across all agent integrations.
In particular:
- `PiAgent` exposes a rich event stream, but Smithers currently collapses it to final text plus usage
- several CLI agents emit machine-readable output that Smithers does not preserve as first-class trace events
- SDK-based agents return final results and rely on Smithers-side tool logging, but do not provide a canonical agent trace model
- there is no OTEL logs pipeline in the local collector configuration
The result is that operators cannot reliably answer questions such as:
- What did the agent stream before it failed?
- Which tools did the agent invoke, in what order, with which visible arguments and results?
- Did the agent emit visible thinking content, compaction events, retries, or queued follow-up behavior?
- Can we reconstruct exactly what Smithers observed for a given run, node, and attempt?
- Can we query this in Grafana Loki or another OTLP log backend with stable run-level correlation?
This specification addresses that gap.
## Goals
The system defined here MUST:
- capture the fullest agent-visible trace Smithers can obtain for each supported agent
- export that trace as OTEL logs to Loki or any OTLP-compatible log backend
- preserve run correlation through stable attributes such as `run.id`, `workflow.path`, `node.id`, `node.attempt`, and `node.iteration`
- preserve raw trace fidelity without forcing operators to infer behavior from summary logs
- remain explicit about what was directly observed versus what was derived by Smithers
- provide deterministic verification criteria for correctness and task completion
## Non-Goals
The system defined here MUST NOT claim to provide:
- provider-internal hidden chain-of-thought when the upstream agent or SDK does not expose it
- exact reconstruction of invisible model-side planning not surfaced through events, messages, or tool calls
- a replacement for the durable Smithers event log or database
- a guarantee that every backend will index arbitrary high-cardinality fields efficiently
## Core Principle
Smithers MUST export what it observed, not what it inferred.
Every exported trace record MUST be classifiable as one of:
- raw upstream agent event
- raw Smithers runtime event
- Smithers-derived normalization of one raw event
- Smithers-generated transport or export diagnostic
If a record is derived, the derivation MUST be explicit.
## Definitions
### Agent Trace
An agent trace is the ordered set of agent-visible execution records associated with one Smithers node attempt.
Agent trace records include, where available:
- streamed assistant text
- streamed visible thinking content
- message lifecycle events
- tool call lifecycle events
- tool result lifecycle events
- compaction and retry events
- session metadata
- final assistant message
- final tool results
- agent stderr diagnostics when those are observable to Smithers
### Full Trace
For a given agent integration, a full trace means all upstream-visible records Smithers can access without patching the upstream model provider.
Full trace does not mean hidden reasoning. It means all observable records available through:
- subprocess stdout or stderr
- structured CLI output modes
- RPC event streams
- SDK callback/event surfaces
- persisted session artifacts intentionally provided by the agent system
### Canonical Trace Event
A canonical trace event is the Smithers-normalized representation of one raw observed record.
Canonical trace events are the unit exported to OTEL logs and optionally persisted durably by Smithers.
### Attempt
An attempt is one execution of one node at one iteration with one attempt number. A canonical agent trace is scoped to exactly one attempt.
## Invariants
The implementation MUST satisfy all of the following invariants.
### Identity Invariants
Every canonical trace event MUST include:
- `runId`
- `nodeId` when the event is attempt-scoped
- `iteration` when the event is attempt-scoped
- `attempt` when the event is attempt-scoped
- `timestampMs`
- `source.agentFamily`
- `source.captureMode`
- `event.kind`
- `event.sequence`
### Ordering Invariants
Canonical trace events for a single attempt MUST be totally ordered by `event.sequence`.
If upstream events arrive out of wall-clock order, Smithers MUST preserve receive order and MUST NOT reorder them after capture.
`event.sequence` MUST be strictly monotonic (strictly increasing, with no duplicates) within one attempt.
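A minimal sketch of receive-order sequencing, assuming a hypothetical `AttemptSequencer` class that is not part of the Smithers codebase. It assigns `event.sequence` at capture time and never consults upstream timestamps when ordering.

```typescript
// Hypothetical sketch: assigns event.sequence in receive order for one attempt.
interface SequencedRecord<T> {
  sequence: number; // strictly increasing within the attempt
  timestampMs: number; // recorded for display; never used for ordering
  raw: T;
}

class AttemptSequencer<T> {
  private nextSequence = 0;
  private readonly records: SequencedRecord<T>[] = [];

  // Call once per raw observation, in the order Smithers received it.
  capture(raw: T, timestampMs: number): SequencedRecord<T> {
    const record: SequencedRecord<T> = { sequence: this.nextSequence, timestampMs, raw };
    this.nextSequence += 1;
    this.records.push(record);
    return record;
  }

  // Replay order is capture order, even when timestamps disagree.
  replay(): readonly SequencedRecord<T>[] {
    return this.records;
  }
}

const sequencer = new AttemptSequencer<string>();
sequencer.capture("first", 2000);
sequencer.capture("second", 1000); // arrived later but carries an earlier wall-clock time
const replayedOrder = sequencer.replay().map((r) => r.raw).join(",");
const replayedSequences = sequencer.replay().map((r) => r.sequence).join(",");
```

Keeping the counter inside a per-attempt object is one way to guarantee that sequence numbers are never shared across attempts.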
### Fidelity Invariants
Smithers MUST preserve raw upstream payloads for canonical trace events unless a redaction rule requires modification.
If redaction occurs:
- the record MUST indicate redaction occurred
- the redaction reason MUST be attached
- the original raw value MUST NOT be exported
### Correlation Invariants
Every OTEL log record derived from a canonical trace event MUST be queryable by:
- run
- workflow path
- node
- iteration
- attempt
- agent family
- event kind
### Completeness Invariants
If Smithers receives a parseable upstream event, Smithers MUST either:
- convert it into a canonical trace event and export it
- or emit a diagnostic record explaining why it was dropped
Silent drops are not allowed.
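The convert-or-diagnose rule can be sketched as a function that always returns a record. The names `ingestParsedEvent` and `RAW_TYPE_TO_KIND` and the two mappings are illustrative assumptions, not taken from the implementation.

```typescript
// Hypothetical sketch: a parsed upstream event becomes either a canonical
// event or an explicit diagnostic record. Nothing is dropped silently.
type IngestResult =
  | { outcome: "canonical"; eventKind: string; rawType: string }
  | { outcome: "diagnostic"; eventKind: "capture.warning"; rawType: string; reason: string };

// Assumed mapping for illustration; real mappings are integration-specific.
const RAW_TYPE_TO_KIND = new Map<string, string>([
  ["message_update", "message.update"],
  ["tool_execution_start", "tool.execution.start"],
]);

function ingestParsedEvent(raw: { type?: unknown }): IngestResult {
  const rawType = typeof raw.type === "string" ? raw.type : "<missing type>";
  const eventKind = RAW_TYPE_TO_KIND.get(rawType);
  if (eventKind !== undefined) {
    return { outcome: "canonical", eventKind, rawType };
  }
  // Unknown events still leave a trail explaining why they were not normalized.
  return {
    outcome: "diagnostic",
    eventKind: "capture.warning",
    rawType,
    reason: `no canonical mapping for raw type "${rawType}"`,
  };
}

const mappedResult = ingestParsedEvent({ type: "message_update" });
const unmappedResult = ingestParsedEvent({ type: "mystery_event" });
```

Because the return type has no "nothing" branch, a caller cannot forget to handle the dropped case.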
### Truthfulness Invariants
If an agent integration cannot expose a certain class of events, the system MUST record capability absence explicitly and MUST NOT pretend completeness.
Example:
- if an SDK-based integration does not expose thinking deltas, Smithers MUST mark that event class as unsupported for that agent family
## Scope of Observability
The system covers three layers.
### Layer 1: Canonical Runtime Record
Smithers SHOULD persist canonical trace events durably for replay and audit, alongside existing run events and attempt data.
### Layer 2: OTEL Log Export
Smithers MUST export canonical trace events as OTEL logs when OTEL log export is enabled.
### Layer 3: Summary Metrics and Diagnostics
Smithers MAY derive metrics from canonical trace events, but those metrics are secondary and MUST NOT be the sole evidence of capture correctness.
## Agent Capability Model
Each agent family MUST declare an explicit trace capability profile.
The capability profile MUST enumerate support for:
- session metadata
- assistant text deltas
- visible thinking deltas
- final assistant message
- tool execution start
- tool execution update
- tool execution end
- retry events
- compaction events
- raw stderr diagnostics
- persisted session artifact
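One possible shape for a capability profile, shown as a sketch. The interface name, field names, and the example profile values are assumptions for illustration; the specification only fixes which event classes must be enumerated.

```typescript
type Support = "supported" | "unsupported";

// One flag per event class the specification asks profiles to enumerate.
interface TraceCapabilityProfile {
  sessionMetadata: Support;
  assistantTextDeltas: Support;
  visibleThinkingDeltas: Support;
  finalAssistantMessage: Support;
  toolExecutionStart: Support;
  toolExecutionUpdate: Support;
  toolExecutionEnd: Support;
  retryEvents: Support;
  compactionEvents: Support;
  rawStderrDiagnostics: Support;
  persistedSessionArtifact: Support;
}

// Hypothetical profile for an integration that only surfaces final text and stderr.
const finalTextOnlyProfile: TraceCapabilityProfile = {
  sessionMetadata: "unsupported",
  assistantTextDeltas: "unsupported",
  visibleThinkingDeltas: "unsupported",
  finalAssistantMessage: "supported",
  toolExecutionStart: "unsupported",
  toolExecutionUpdate: "unsupported",
  toolExecutionEnd: "unsupported",
  retryEvents: "unsupported",
  compactionEvents: "unsupported",
  rawStderrDiagnostics: "supported",
  persistedSessionArtifact: "unsupported",
};

// Lists the capability names a profile declares as unsupported.
function unsupportedCapabilities(profile: TraceCapabilityProfile): string[] {
  return (Object.keys(profile) as (keyof TraceCapabilityProfile)[]).filter(
    (key) => profile[key] === "unsupported",
  );
}

const declaredGaps = unsupportedCapabilities(finalTextOnlyProfile);
```

Making every field required means a new integration cannot compile without taking a position on each event class.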
### PiAgent
`PiAgent` MUST be treated as a high-fidelity integration.
Available sources include:
- JSON event stream mode
- RPC mode event stream
- Pi session JSONL artifacts
Pi exposes event types such as:
- `agent_start`
- `agent_end`
- `turn_start`
- `turn_end`
- `message_start`
- `message_update`
- `message_end`
- `tool_execution_start`
- `tool_execution_update`
- `tool_execution_end`
- `auto_compaction_start`
- `auto_compaction_end`
- `auto_retry_start`
- `auto_retry_end`
Visible thinking content emitted by Pi MUST be captured as trace content.
Pi session artifacts, when enabled and available, SHOULD be recorded as canonical artifacts associated with the attempt.
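As an illustration, the Pi event types listed above could be normalized with a lookup table. The pairing of each Pi type with a canonical kind (for example `agent_start` to `session.start`) is an assumption made for this sketch, and `normalizePiRawType` is a hypothetical helper.

```typescript
// Assumed Pi-to-canonical mapping for illustration; not confirmed by the Smithers source.
const PI_RAW_TYPE_TO_EVENT_KIND = new Map<string, string>([
  ["agent_start", "session.start"],
  ["agent_end", "session.end"],
  ["turn_start", "turn.start"],
  ["turn_end", "turn.end"],
  ["message_start", "message.start"],
  ["message_update", "message.update"],
  ["message_end", "message.end"],
  ["tool_execution_start", "tool.execution.start"],
  ["tool_execution_update", "tool.execution.update"],
  ["tool_execution_end", "tool.execution.end"],
  ["auto_compaction_start", "compaction.start"],
  ["auto_compaction_end", "compaction.end"],
  ["auto_retry_start", "retry.start"],
  ["auto_retry_end", "retry.end"],
]);

interface NormalizedType {
  eventKind: string; // canonical, integration-neutral name
  rawType: string; // original Pi name, kept for source.rawType
}

// Returns null for unknown types so the caller can emit a diagnostic instead.
function normalizePiRawType(rawType: string): NormalizedType | null {
  const eventKind = PI_RAW_TYPE_TO_EVENT_KIND.get(rawType);
  return eventKind === undefined ? null : { eventKind, rawType };
}

const retryStart = normalizePiRawType("auto_retry_start");
const unknownPiType = normalizePiRawType("not_a_pi_event");
```

The raw name travels with the canonical kind so that nothing integration-specific leaks into `event.kind`.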
### CodexAgent
`CodexAgent` MUST be treated as a structured CLI integration with medium fidelity.
Codex emits JSON output. Smithers MUST preserve all parseable structured events made available by that mode.
If Codex exposes usage, step, message, tool, or completion events, Smithers MUST map them to canonical trace events rather than extracting only final text.
If a given Codex event schema is unstable, Smithers MUST preserve the raw event payload and classify the normalization conservatively.
### ClaudeCodeAgent
`ClaudeCodeAgent` MUST be treated as a structured CLI integration with medium fidelity.
When `stream-json` is enabled, Smithers MUST preserve all parseable stream records and map them into canonical trace events where possible.
Partial assistant messages, tool call indicators, and usage events MUST NOT be discarded if they are parseable.
### GeminiAgent
`GeminiAgent` MUST be treated as a structured CLI integration with low to medium fidelity depending on output mode.
Smithers MUST preserve parseable structured output and MUST explicitly mark unsupported event classes when the CLI exposes only final or coarse-grained results.
### KimiAgent
`KimiAgent` MUST be treated as a structured CLI integration with low to medium fidelity depending on `outputFormat`.
If `stream-json` mode is used, Smithers MUST preserve event records. If only final text is available, Smithers MUST mark the trace as `final-only`.
### OpenAIAgent and AnthropicAgent
`OpenAIAgent` and `AnthropicAgent` MUST be treated as SDK integrations.
They do not inherently expose a rich subprocess event stream in the current Smithers wrapper.
For these agents, Smithers MUST capture:
- prompt dispatch boundaries
- final assistant response
- token usage when surfaced
- Smithers-side tool execution start and end
- visible tool output recorded by Smithers
- node output emitted by Smithers if any
Smithers MUST mark thinking deltas and message lifecycle as unsupported unless the underlying SDK path is instrumented to provide them.
### AmpAgent and ForgeAgent
`AmpAgent` and `ForgeAgent` MUST be treated as text-first subprocess integrations unless a structured mode is added.
Smithers MUST capture:
- final response text
- stderr diagnostics
- Smithers-side tool execution and runtime events
Smithers MUST mark full trace fidelity as unsupported for these integrations.
## Capture Modes
Each attempt MUST declare one capture mode:
- `sdk-events`
- `rpc-events`
- `cli-json-stream`
- `cli-json`
- `cli-text`
- `artifact-import`
Capture mode is part of the canonical attempt metadata and MUST be exported with every trace record.
## Canonical Data Model
Smithers MUST introduce a canonical event model for agent traces.
The exact TypeScript shape is an implementation detail, but the semantic fields are mandatory.
### Attempt Metadata
Each attempt MUST expose:
- `traceVersion`
- `agentFamily`
- `agentId`
- `model`
- `captureMode`
- `traceCompleteness`
- `unsupportedEventKinds`
- `traceStartedAtMs`
- `traceFinishedAtMs`
- `rawArtifactRefs`
### `traceCompleteness`
`traceCompleteness` MUST be one of:
- `full-observed`
- `partial-observed`
- `final-only`
- `capture-failed`
Definitions:
- `full-observed`: Smithers captured every event class the integration claims to support
- `partial-observed`: Smithers captured some but not all supported classes
- `final-only`: only final response and coarse metadata were available
- `capture-failed`: Smithers expected trace events but could not capture them reliably
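The four states can be derived mechanically from what an integration claims and what was seen. The sketch below is a simplification under stated assumptions: `classifyCompleteness` and `COARSE_KINDS` are hypothetical, and "captured" is taken to mean that at least one record of the class was observed.

```typescript
type TraceCompleteness = "full-observed" | "partial-observed" | "final-only" | "capture-failed";

interface AttemptObservation {
  captureFailed: boolean; // a terminal capture failure was detected
  supportedKinds: string[]; // event classes the integration claims to support
  observedKinds: string[]; // event classes actually captured in this attempt
}

// Assumption: these kinds count as "final response and coarse metadata".
const COARSE_KINDS = ["assistant.message.final", "usage"];

function classifyCompleteness(obs: AttemptObservation): TraceCompleteness {
  if (obs.captureFailed) return "capture-failed";
  // An integration that can only ever surface coarse records is final-only.
  const onlyCoarse = obs.supportedKinds.every((kind) => COARSE_KINDS.indexOf(kind) !== -1);
  if (onlyCoarse) return "final-only";
  const allSeen = obs.supportedKinds.every((kind) => obs.observedKinds.indexOf(kind) !== -1);
  return allSeen ? "full-observed" : "partial-observed";
}

const richKinds = ["assistant.text.delta", "assistant.message.final"];
const fullCase = classifyCompleteness({ captureFailed: false, supportedKinds: richKinds, observedKinds: richKinds });
const partialCase = classifyCompleteness({
  captureFailed: false,
  supportedKinds: richKinds,
  observedKinds: ["assistant.message.final"],
});
const finalOnlyCase = classifyCompleteness({
  captureFailed: false,
  supportedKinds: ["assistant.message.final", "usage"],
  observedKinds: ["assistant.message.final"],
});
const failedCase = classifyCompleteness({ captureFailed: true, supportedKinds: richKinds, observedKinds: [] });
```

Checking `captureFailed` first keeps a broken stream from being reported as merely partial.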
### Canonical Event Fields
Every canonical trace event MUST include:
- `traceVersion`
- `runId`
- `workflowPath`
- `workflowHash` when available
- `nodeId`
- `iteration`
- `attempt`
- `timestampMs`
- `event.sequence`
- `event.kind`
- `event.phase`
- `source.agentFamily`
- `source.captureMode`
- `source.rawType`
- `source.observed`
- `payload`
- `raw`
- `redaction`
- `annotations`
### `event.kind`
`event.kind` MUST be chosen from a controlled vocabulary.
The initial vocabulary MUST include:
- `session.start`
- `session.end`
- `turn.start`
- `turn.end`
- `message.start`
- `message.update`
- `message.end`
- `assistant.text.delta`
- `assistant.thinking.delta`
- `assistant.message.final`
- `tool.execution.start`
- `tool.execution.update`
- `tool.execution.end`
- `tool.result`
- `retry.start`
- `retry.end`
- `compaction.start`
- `compaction.end`
- `stderr`
- `stdout`
- `usage`
- `capture.warning`
- `capture.error`
- `artifact.created`
No integration-specific naming is allowed in `event.kind`. Integration-specific names MUST remain in `source.rawType`.
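A controlled vocabulary maps naturally onto a literal union with a runtime guard. This is a sketch; `EVENT_KINDS` and `isEventKind` are illustrative names, and the list simply restates the initial vocabulary above.

```typescript
// The initial controlled vocabulary, restated as a readonly tuple.
const EVENT_KINDS = [
  "session.start", "session.end",
  "turn.start", "turn.end",
  "message.start", "message.update", "message.end",
  "assistant.text.delta", "assistant.thinking.delta", "assistant.message.final",
  "tool.execution.start", "tool.execution.update", "tool.execution.end",
  "tool.result",
  "retry.start", "retry.end",
  "compaction.start", "compaction.end",
  "stderr", "stdout",
  "usage",
  "capture.warning", "capture.error",
  "artifact.created",
] as const;

type EventKind = (typeof EVENT_KINDS)[number];

// Runtime guard: rejects integration-specific names such as "tool_execution_start".
function isEventKind(value: string): value is EventKind {
  return (EVENT_KINDS as readonly string[]).indexOf(value) !== -1;
}

const acceptsCanonical = isEventKind("tool.result");
const rejectsRawName = isEventKind("tool_execution_start");
```

Deriving the type from the array keeps the compile-time and runtime views of the vocabulary from drifting apart.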
### `event.phase`
`event.phase` MUST be one of:
- `agent`
- `turn`
- `message`
- `tool`
- `session`
- `capture`
- `artifact`
### `source.observed`
`source.observed` MUST be a boolean indicating whether the payload was directly observed from the upstream integration.
Derived normalization records MUST set `source.observed` to `false`.
### `payload`
`payload` MUST contain normalized fields intended for stable querying and display.
Examples:
- for `assistant.text.delta`: `{ text: string }`
- for `assistant.thinking.delta`: `{ text: string }`
- for `tool.execution.start`: `{ toolCallId: string, toolName: string, argsPreview: unknown }`
- for `tool.execution.end`: `{ toolCallId: string, toolName: string, isError: boolean, resultPreview: unknown }`
### `raw`
`raw` MUST contain the raw upstream object or raw text fragment as captured after redaction.
If no raw form exists, `raw` MAY be `null`.
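Since the exact TypeScript shape is left to the implementation, the following is only one possible rendering of the mandatory fields. The nesting under `event` and `source`, the helper `toLogBody`, and every example value are assumptions made for illustration.

```typescript
type AnnotationScalar = string | number | boolean;

// Sketch of the mandatory semantic fields; the real shape is an implementation detail.
interface CanonicalTraceEvent {
  traceVersion: string;
  runId: string;
  workflowPath?: string; // omitted when unavailable
  workflowHash?: string; // present only when available
  nodeId: string;
  iteration: number;
  attempt: number;
  timestampMs: number;
  event: { sequence: number; kind: string; phase: string };
  source: { agentFamily: string; captureMode: string; rawType: string; observed: boolean };
  payload: Record<string, unknown>; // normalized fields for stable querying
  raw: unknown; // raw upstream object or text after redaction; null when none exists
  redaction: { applied: boolean; ruleIds: string[] };
  annotations: Record<string, AnnotationScalar>;
}

// Example values are invented for illustration.
const thinkingDelta: CanonicalTraceEvent = {
  traceVersion: "1",
  runId: "run-123",
  workflowPath: "workflows/agent-trace-otel-demo.tsx",
  nodeId: "plan",
  iteration: 0,
  attempt: 1,
  timestampMs: 1700000000000,
  event: { sequence: 4, kind: "assistant.thinking.delta", phase: "message" },
  source: { agentFamily: "pi", captureMode: "cli-json-stream", rawType: "message_update", observed: true },
  payload: { text: "Considering which file to read first." },
  raw: { type: "message_update" },
  redaction: { applied: false, ruleIds: [] },
  annotations: { "custom.team": "infra" },
};

// Serializes the parts that belong in a log body rather than in attributes.
function toLogBody(e: CanonicalTraceEvent): string {
  return JSON.stringify({ payload: e.payload, raw: e.raw, redaction: e.redaction });
}

const thinkingBody = toLogBody(thinkingDelta);
```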
## Custom Annotations
The system MUST support user-defined annotations attached at run start.
Annotations MUST be:
- provided in run options and server APIs
- stored durably on the run
- merged into every canonical trace event at export time
Annotation values MUST be scalars:
- string
- number
- boolean
Nested objects and arrays MUST be rejected or flattened before run start. The behavior MUST be explicit and deterministic.
The following annotation namespaces are reserved:
- `smithers.*`
- `run.*`
- `workflow.*`
- `node.*`
- `agent.*`
- `otel.*`
User annotations MUST use a `custom.*` prefix in canonical export.
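A sketch of deterministic annotation handling at run start. It picks the "reject" option for nested values (the specification also permits flattening), and `normalizeAnnotations` is a hypothetical helper rather than an existing Smithers API.

```typescript
type AnnotationValue = string | number | boolean;

// Namespaces the specification reserves for Smithers itself.
const RESERVED_PREFIXES = ["smithers.", "run.", "workflow.", "node.", "agent.", "otel."];

// Validates user annotations and places them under the custom. prefix.
// Policy assumed here: nested objects and arrays are rejected, not flattened.
function normalizeAnnotations(input: Record<string, unknown>): Record<string, AnnotationValue> {
  const out: Record<string, AnnotationValue> = {};
  for (const key of Object.keys(input).sort()) {
    const value = input[key];
    if (RESERVED_PREFIXES.some((prefix) => key.indexOf(prefix) === 0)) {
      throw new Error(`annotation key "${key}" uses a reserved namespace`);
    }
    if (typeof value !== "string" && typeof value !== "number" && typeof value !== "boolean") {
      throw new Error(`annotation "${key}" must be a string, number, or boolean`);
    }
    const exportedKey = key.indexOf("custom.") === 0 ? key : `custom.${key}`;
    out[exportedKey] = value;
  }
  return out;
}

const normalizedAnnotations = normalizeAnnotations({ team: "infra", "custom.ticket": 42 });
```

Rejecting at run start means a bad annotation fails fast instead of surfacing later as a malformed log attribute.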
## Workflow Metadata Requirements
Every canonical trace event MUST include:
- `workflow.path` as an OTEL attribute when available
- `workflow.hash` as an OTEL attribute when available
If `workflow.path` is unavailable, Smithers MUST omit the `workflow.path` attribute rather than invent a placeholder path.
## Redaction Model
Redaction support is mandatory because agent traces can contain sensitive content.
The implementation MUST support:
- disabled redaction
- default redaction
- custom redaction rules
### Minimum Default Redaction
Default redaction MUST handle at least:
- API keys
- bearer tokens
- common secret env vars
- authorization headers
- cookie headers
- explicitly configured secret literals
### Redaction Semantics
Redaction MUST occur before:
- durable canonical trace persistence
- OTEL log export
- artifact snapshot export
If redaction modifies content, the trace event MUST record:
- `redaction.applied = true`
- `redaction.ruleIds = string[]`
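A rule-based redactor that records which rules fired could look like the sketch below. The rule ids and regular expressions are illustrative assumptions and are far from a complete default rule set.

```typescript
interface RedactionRule {
  id: string;
  pattern: RegExp; // must carry the global flag so every match is replaced
}

interface RedactionResult {
  text: string;
  redaction: { applied: boolean; ruleIds: string[] };
}

// Illustrative patterns only; a real default set would need to cover far more.
const EXAMPLE_RULES: RedactionRule[] = [
  { id: "bearer-token", pattern: /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g },
  { id: "api-key", pattern: /\bsk-[A-Za-z0-9]{8,}\b/g },
];

// Applies each rule in order and records the ids of the rules that changed the text.
function redactText(text: string, rules: RedactionRule[]): RedactionResult {
  const ruleIds: string[] = [];
  let current = text;
  for (const rule of rules) {
    const next = current.replace(rule.pattern, `[REDACTED:${rule.id}]`);
    if (next !== current) ruleIds.push(rule.id);
    current = next;
  }
  return { text: current, redaction: { applied: ruleIds.length > 0, ruleIds } };
}

const redactedHeader = redactText("Authorization: Bearer abc.def-123", EXAMPLE_RULES);
const untouchedText = redactText("listing files in src/", EXAMPLE_RULES);
```

Running this once, before persistence, export, and artifact snapshots, keeps all three sinks consistent about what was removed.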
## Export Model
Canonical trace events MUST be exportable as OTEL logs.
### OTEL Collector Requirements
The collector configuration MUST define a `logs` pipeline.
The logs pipeline MUST accept OTLP input and MUST support at least one of:
- OTLP logs exporter
- Loki exporter
The local development stack SHOULD include Loki for verification and human inspection.
### OTEL Record Shape
For each canonical trace event, Smithers MUST emit one OTEL log record.
The log body MUST contain a compact structured JSON representation of:
- canonical payload
- raw payload when configured
- redaction metadata
The log attributes MUST include:
- `service.name`
- `smithers.trace.version`
- `run.id`
- `workflow.path` when available
- `workflow.hash` when available
- `node.id` when available
- `node.iteration` when available
- `node.attempt` when available
- `agent.family`
- `agent.id` when available
- `agent.model` when available
- `agent.capture_mode`
- `trace.completeness`
- `event.kind`
- `event.phase`
- `event.sequence`
- `source.raw_type`
- `source.observed`
Custom annotations MUST be exported as OTEL attributes under `custom.*`.
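The attribute and body split can be sketched without committing to a particular OTEL SDK. Everything here is an assumption for illustration: `ExportableTraceEvent` is a flattened stand-in for the canonical event, and `LogRecordSketch` is a plain object, not an OpenTelemetry type.

```typescript
type AttrValue = string | number | boolean;

// Flattened stand-in for a canonical trace event; field names are assumptions.
interface ExportableTraceEvent {
  traceVersion: string;
  runId: string;
  workflowPath?: string;
  workflowHash?: string;
  nodeId?: string;
  iteration?: number;
  attempt?: number;
  agentFamily: string;
  agentId?: string;
  agentModel?: string;
  captureMode: string;
  traceCompleteness: string;
  eventKind: string;
  eventPhase: string;
  eventSequence: number;
  rawType: string;
  observed: boolean;
  payload: Record<string, unknown>;
  raw: unknown;
  redaction: { applied: boolean; ruleIds: string[] };
  annotations: Record<string, AttrValue>; // keys already carry the custom. prefix
}

interface LogRecordSketch {
  body: string;
  attributes: Record<string, AttrValue>;
}

function toLogRecord(e: ExportableTraceEvent, serviceName: string, includeRaw: boolean): LogRecordSketch {
  const attributes: Record<string, AttrValue> = {
    "service.name": serviceName,
    "smithers.trace.version": e.traceVersion,
    "run.id": e.runId,
    "agent.family": e.agentFamily,
    "agent.capture_mode": e.captureMode,
    "trace.completeness": e.traceCompleteness,
    "event.kind": e.eventKind,
    "event.phase": e.eventPhase,
    "event.sequence": e.eventSequence,
    "source.raw_type": e.rawType,
    "source.observed": e.observed,
  };
  // "When available" attributes are omitted rather than filled with placeholders.
  const optional: [string, AttrValue | undefined][] = [
    ["workflow.path", e.workflowPath],
    ["workflow.hash", e.workflowHash],
    ["node.id", e.nodeId],
    ["node.iteration", e.iteration],
    ["node.attempt", e.attempt],
    ["agent.id", e.agentId],
    ["agent.model", e.agentModel],
  ];
  for (const [key, value] of optional) {
    if (value !== undefined) attributes[key] = value;
  }
  for (const key of Object.keys(e.annotations)) {
    const value = e.annotations[key];
    if (value !== undefined) attributes[key] = value;
  }
  // Large content stays in the body; raw is included only when configured.
  const body: Record<string, unknown> = { payload: e.payload, redaction: e.redaction };
  if (includeRaw) body["raw"] = e.raw;
  return { body: JSON.stringify(body), attributes };
}

const exampleRecord = toLogRecord(
  {
    traceVersion: "1", runId: "run-123", nodeId: "plan", iteration: 0, attempt: 1,
    agentFamily: "pi", captureMode: "cli-json-stream", traceCompleteness: "full-observed",
    eventKind: "assistant.text.delta", eventPhase: "message", eventSequence: 7,
    rawType: "message_update", observed: true,
    payload: { text: "Hello" }, raw: { type: "message_update" },
    redaction: { applied: false, ruleIds: [] },
    annotations: { "custom.team": "infra" },
  },
  "smithers-dev",
  false,
);
```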
### Attribute Cardinality Rules
The following MUST be attributes:
- run identifiers
- workflow identifiers
- node identifiers
- attempt identifiers
- event kind
- agent family
- capture mode
The following MUST NOT be indexed as labels in Loki-specific configurations:
- full prompt text
- full response text
- thinking text
- tool args bodies
- tool result bodies
- arbitrary user free-text annotations
These large fields MUST remain in the log body.
### Severity Mapping
Severity SHOULD be assigned as follows:
- normal trace events: `INFO`
- stderr and non-terminal capture anomalies: `WARN`
- capture failures and export failures: `ERROR`
Severity MUST NOT be used to encode event kind.
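The severity mapping is small enough to show in full. This sketch covers only what can be decided from `event.kind`; export failures, which have no event kind in the vocabulary, would need to be handled where the exporter reports them. The numeric values follow the OpenTelemetry log data model ranges (INFO 9, WARN 13, ERROR 17), and `severityForEventKind` is a hypothetical helper.

```typescript
interface SeveritySketch {
  text: "INFO" | "WARN" | "ERROR";
  number: 9 | 13 | 17; // base values of the OTEL INFO, WARN, and ERROR ranges
}

// Derives severity from event kind; the kind itself is still exported separately.
function severityForEventKind(eventKind: string): SeveritySketch {
  switch (eventKind) {
    case "capture.error":
      return { text: "ERROR", number: 17 };
    case "stderr":
    case "capture.warning":
      return { text: "WARN", number: 13 };
    default:
      return { text: "INFO", number: 9 };
  }
}

const deltaSeverity = severityForEventKind("assistant.text.delta");
const stderrSeverity = severityForEventKind("stderr");
const errorSeverity = severityForEventKind("capture.error");
```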
## Persistence Model
Canonical trace events SHOULD be durably persisted by Smithers in addition to OTEL export.
If durable persistence is implemented, the persistence layer MUST support:
- ordered replay by attempt
- filtering by event kind
- pagination by sequence
- artifact references
OTEL export MUST NOT be the only storage location for canonical trace data.
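The replay, filtering, and pagination requirements can be illustrated with an in-memory stand-in. `InMemoryTraceStore` and the `attemptKey` format are assumptions for this sketch; a real implementation would sit on Smithers' durable storage.

```typescript
interface StoredTraceEvent {
  attemptKey: string; // hypothetical composite key, e.g. "run/node/iteration/attempt"
  sequence: number;
  kind: string;
}

interface PageQuery {
  afterSequence?: number; // exclusive cursor; omit to start from the beginning
  limit: number;
  kind?: string; // optional event kind filter
}

class InMemoryTraceStore {
  private readonly byAttempt = new Map<string, StoredTraceEvent[]>();

  append(e: StoredTraceEvent): void {
    const list = this.byAttempt.get(e.attemptKey) ?? [];
    list.push(e);
    this.byAttempt.set(e.attemptKey, list);
  }

  // Returns one page of events for an attempt, ordered by sequence.
  page(attemptKey: string, query: PageQuery): StoredTraceEvent[] {
    const after = query.afterSequence ?? -1;
    return (this.byAttempt.get(attemptKey) ?? [])
      .slice()
      .sort((a, b) => a.sequence - b.sequence)
      .filter((e) => e.sequence > after)
      .filter((e) => query.kind === undefined || e.kind === query.kind)
      .slice(0, query.limit);
  }
}

const traceStore = new InMemoryTraceStore();
const demoKey = "run-123/plan/0/1";
traceStore.append({ attemptKey: demoKey, sequence: 0, kind: "session.start" });
traceStore.append({ attemptKey: demoKey, sequence: 1, kind: "assistant.text.delta" });
traceStore.append({ attemptKey: demoKey, sequence: 2, kind: "assistant.text.delta" });
traceStore.append({ attemptKey: demoKey, sequence: 3, kind: "session.end" });

const firstPage = traceStore.page(demoKey, { limit: 2 });
const secondPage = traceStore.page(demoKey, { afterSequence: 1, limit: 2 });
const deltasOnly = traceStore.page(demoKey, { limit: 10, kind: "assistant.text.delta" });
```

Using the last seen sequence as the cursor keeps pagination stable even while new events are still being appended.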
## Artifact Model
Some agent integrations expose richer external artifacts than can be represented comfortably as log streams.
Examples:
- Pi session JSONL files
- raw CLI JSON event transcripts
- exported HTML or JSONL session artifacts
Smithers SHOULD support trace artifacts with metadata:
- `artifact.kind`
- `artifact.path`
- `artifact.contentType`
- `artifact.bytes`
- `artifact.createdAtMs`
- `artifact.redacted`
Artifact creation MUST also emit canonical `artifact.created` events.
## Failure Model
The implementation MUST classify failures explicitly.
### Capture Failure
Capture failure means Smithers could not reliably obtain agent trace input it expected from the selected capture mode.
Examples:
- malformed JSON stream
- unexpected subprocess termination before terminal event
- SDK callback channel failure
Capture failure MUST:
- mark attempt `traceCompleteness = capture-failed` when terminally broken
- emit a `capture.error` canonical event
- include diagnostic details
### Partial Capture
Partial capture means Smithers obtained some trace events but missed expected categories.
Examples:
- stdout stream cut off after several tool events
- session artifact missing though event stream completed
Partial capture MUST:
- mark attempt `traceCompleteness = partial-observed`
- record missing classes in `unsupportedEventKinds` or `missingExpectedEventKinds`
### Export Failure
Export failure means Smithers captured canonical trace events but could not deliver them to the OTEL backend.
Export failure MUST NOT erase canonical local truth.
If export fails:
- canonical local persistence MUST still succeed when enabled
- Smithers MUST emit operator diagnostics through existing logs
- the run MUST remain inspectable from durable local records
## Normalization Rules
Normalization MUST be conservative.
### One Raw Event to One Canonical Event
As a default rule, one raw upstream event SHOULD map to one canonical trace event.
If one raw event yields multiple canonical events, the implementation MUST document why and MUST include a stable parent link.
### Text Deltas
Assistant text deltas MUST remain deltas if the upstream protocol provided deltas.
Smithers MUST NOT collapse deltas into a single blob during export.
Final assembled messages MAY be emitted separately as `assistant.message.final`.
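A sketch of emitting an assembled final message without collapsing the deltas. It assumes the upstream did not send its own final message, so the assembled record is marked as derived; `appendFinalMessage` and `TextTraceEvent` are hypothetical names.

```typescript
interface TextTraceEvent {
  sequence: number;
  kind: "assistant.text.delta" | "assistant.message.final";
  observed: boolean; // false for records Smithers derived itself
  text: string;
}

// Keeps every delta untouched and appends one derived final-message event.
function appendFinalMessage(deltas: TextTraceEvent[]): TextTraceEvent[] {
  const lastSequence = deltas.reduce((max, d) => Math.max(max, d.sequence), -1);
  const finalMessage: TextTraceEvent = {
    sequence: lastSequence + 1,
    kind: "assistant.message.final",
    observed: false, // assembled by Smithers, not received from upstream
    text: deltas.map((d) => d.text).join(""),
  };
  return deltas.concat([finalMessage]);
}

const exportedTextEvents = appendFinalMessage([
  { sequence: 0, kind: "assistant.text.delta", observed: true, text: "Hello, " },
  { sequence: 1, kind: "assistant.text.delta", observed: true, text: "world" },
]);
```

If the upstream does emit a final message, that record would instead be passed through as an observed event.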
### Thinking Deltas
Visible thinking content MUST be captured as its own event class and MUST NOT be merged into assistant text.
### Tool Calls
Tool lifecycle MUST preserve:
- stable tool call identifier when upstream provides one
- tool name
- visible arguments or argument preview
- partial updates when available
- final result preview
- error flag
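Correlating tool lifecycle records by a stable call identifier might look like the following. The types and `summarizeToolCalls` are illustrative assumptions; the point is that interleaved calls stay separable when every record carries the same `toolCallId`.

```typescript
type ToolLifecycleKind = "tool.execution.start" | "tool.execution.update" | "tool.execution.end";

interface ToolLifecycleEvent {
  sequence: number;
  kind: ToolLifecycleKind;
  toolCallId: string;
  toolName: string;
  isError?: boolean; // only meaningful on end events
}

interface ToolCallSummary {
  toolName: string;
  updates: number;
  ended: boolean;
  isError: boolean;
}

// Groups lifecycle events by toolCallId so interleaved calls do not get mixed up.
function summarizeToolCalls(events: ToolLifecycleEvent[]): Map<string, ToolCallSummary> {
  const calls = new Map<string, ToolCallSummary>();
  for (const e of events) {
    const summary = calls.get(e.toolCallId) ?? { toolName: e.toolName, updates: 0, ended: false, isError: false };
    if (e.kind === "tool.execution.update") summary.updates += 1;
    if (e.kind === "tool.execution.end") {
      summary.ended = true;
      summary.isError = e.isError === true;
    }
    calls.set(e.toolCallId, summary);
  }
  return calls;
}

const toolSummaries = summarizeToolCalls([
  { sequence: 0, kind: "tool.execution.start", toolCallId: "call-a", toolName: "read" },
  { sequence: 1, kind: "tool.execution.start", toolCallId: "call-b", toolName: "bash" },
  { sequence: 2, kind: "tool.execution.update", toolCallId: "call-b", toolName: "bash" },
  { sequence: 3, kind: "tool.execution.end", toolCallId: "call-a", toolName: "read", isError: false },
  { sequence: 4, kind: "tool.execution.end", toolCallId: "call-b", toolName: "bash", isError: true },
]);
```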
### Usage
Usage records MUST be separate canonical events or attached to terminal message events in a way that remains queryable.
If usage is attached, it MUST still be accessible without parsing free-form text.
## Required Runtime Integration Points
The implementation MUST integrate at these boundaries.
### Agent Boundary
Every agent integration MUST report raw trace observations into the canonical trace capture layer.
No agent integration is allowed to silently parse and discard upstream event records before the capture layer sees them.
### Event Bus Boundary
Canonical trace events SHOULD be emitted through or alongside the existing event bus so that:
- they share run correlation
- they can participate in durable persistence
- they can reuse existing event-driven verification infrastructure
### Attempt Finalization Boundary
When an attempt finishes, Smithers MUST finalize trace metadata:
- `traceFinishedAtMs`
- `traceCompleteness`
- `unsupportedEventKinds`
- `rawArtifactRefs`
## Required Configuration Surface
The implementation MUST define explicit configuration for:
- enabling OTEL log export
- selecting backend endpoint
- enabling or disabling canonical local trace persistence
- selecting redaction mode
- retaining or dropping raw payload bodies
- retaining or dropping raw artifacts
- maximum event body bytes
- maximum artifact bytes
The configuration MUST distinguish:
- runtime operator policy
- run-specific annotations
## Required Operator Queries
The design is incomplete unless the following operator queries are supported.
### Query Set A: Run Reconstruction
Operators MUST be able to answer:
- show all trace records for one run
- show all trace records for one run and node
- show only one attempt for one node
- show ordered assistant text deltas
- show visible thinking deltas when present
- show tool calls and results in order
### Query Set B: Failure Analysis
Operators MUST be able to answer:
- which runs had trace capture failures
- which agents provide only `final-only` traces
- which attempts terminated without a terminal agent event
- which traces were partially redacted
### Query Set C: Audit
Operators MUST be able to answer:
- what annotations were attached to a run
- which workflow file and workflow hash produced the trace
- which raw artifact file corresponds to this attempt
## Verification Specification
Task completion is not defined by code existing. It is defined by observable correctness.
The implementation is complete only if every verification class below passes.
## Verification Class 1: Schema Correctness
For each supported agent family, automated tests MUST verify that canonical trace events:
- conform to the declared schema
- contain required identity fields
- maintain monotonic `event.sequence`
- correctly classify `traceCompleteness`
Completion criterion:
- zero schema violations in test fixtures
## Verification Class 2: Ordering Correctness
Automated tests MUST verify that for one attempt:
- event sequences are strictly monotonic
- final events occur after preceding deltas
- no duplicate sequence numbers appear
Completion criterion:
- deterministic ordering across repeated test runs
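An ordering check of this kind reduces to one pass over the sequence numbers in replay order. `findSequenceViolations` is a hypothetical test helper, not an existing function.

```typescript
// Reports every place where sequence numbers fail to strictly increase.
function findSequenceViolations(sequences: number[]): string[] {
  const problems: string[] = [];
  let previous: number | undefined;
  for (const current of sequences) {
    if (previous !== undefined && current <= previous) {
      problems.push(
        current === previous
          ? `duplicate sequence ${current}`
          : `sequence decreased from ${previous} to ${current}`,
      );
    }
    previous = current;
  }
  return problems;
}

const cleanRun = findSequenceViolations([0, 1, 2, 3]);
const duplicateRun = findSequenceViolations([0, 1, 1, 3]);
const reorderedRun = findSequenceViolations([0, 2, 1]);
```

Returning messages rather than a boolean makes a failing fixture easier to diagnose.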
## Verification Class 3: Fidelity Correctness
Fixture-based tests MUST compare raw upstream inputs with canonical trace outputs.
For each fixture:
- every parseable upstream event MUST result in a canonical event or an explicit diagnostic drop event
- visible thinking content MUST remain distinguishable from assistant text
- tool call identifiers and names MUST survive normalization
Completion criterion:
- full fixture coverage for each agent family and capture mode supported by Smithers
## Verification Class 4: Completeness Classification
Tests MUST verify the semantics of:
- `full-observed`
- `partial-observed`
- `final-only`
- `capture-failed`
Completion criterion:
- each classification is produced by at least one explicit test case
## Verification Class 5: OTEL Export Correctness
Integration tests MUST verify that canonical trace events become OTEL log records with:
- required attributes present
- correct body shape
- correct severity mapping
- correct custom annotation export
Completion criterion:
- logs are queryable in the target backend by `run.id`, `workflow.path`, `node.id`, `node.attempt`, and `event.kind`
## Verification Class 6: Loki Query Correctness
In a local stack with Loki enabled, end-to-end tests MUST verify that an operator can query:
- all records for a run
- all records for a node attempt
- only thinking deltas
- only tool execution records
- only capture errors
Completion criterion:
- documented query examples return expected results against test data
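Documented queries could be generated by a small builder so tests and docs stay in sync. This sketch assumes Loki exposes the service as a `service_name` stream label and the remaining attributes as underscore-separated fields such as `run_id` and `event_kind` (the names used in this PR's description); actual label names depend on how the collector and Loki are configured. `buildTraceQuery` is a hypothetical helper.

```typescript
// Escapes a value for use inside a double-quoted LogQL string.
function quoteLogQl(value: string): string {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

// Builds a stream selector on service_name followed by one label filter per pair.
function buildTraceQuery(serviceName: string, filters: [string, string][]): string {
  const selector = `{service_name=${quoteLogQl(serviceName)}}`;
  const stages = filters.map(([label, value]) => `| ${label}=${quoteLogQl(value)}`);
  return [selector].concat(stages).join(" ");
}

// All thinking deltas for one run (the run id is an invented example value).
const thinkingQuery = buildTraceQuery("smithers-dev", [
  ["run_id", "run-123"],
  ["event_kind", "assistant.thinking.delta"],
]);

// Only capture errors, across all runs.
const captureErrorQuery = buildTraceQuery("smithers-dev", [["event_kind", "capture.error"]]);
```

Keeping high-cardinality fields such as the run id in pipeline filters rather than in the stream selector matches the rule that they are not indexed as labels.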
## Verification Class 7: Artifact Correctness
When artifact capture is enabled, tests MUST verify:
- artifact references are recorded
- artifacts exist on disk or in configured storage
- artifact metadata matches actual content
- artifact creation emits corresponding canonical events
Completion criterion:
- no dangling artifact references
## Verification Class 8: Redaction Correctness
Tests MUST verify that redaction:
- removes required secrets from canonical payloads, raw payloads, OTEL bodies, and artifacts
- leaves non-sensitive content intact
- records which rules were applied
Completion criterion:
- zero known secret literals leak in test fixtures
## Verification Class 9: Failure Resilience
Tests MUST verify behavior when:
- collector is unavailable
- backend rejects logs
- malformed upstream JSON is encountered
- subprocess exits before terminal event
- artifact write fails
Completion criterion:
- capture failures are classified
- local diagnostics exist
- durable local truth remains accessible when configured
## Verification Class 10: Cross-Signal Correlation
Tests MUST verify that logs correlate with:
- run lifecycle events
- metrics
- spans
At minimum, operators MUST be able to join by:
- `run.id`
- `node.id`
- `node.attempt`
Completion criterion:
- one documented workflow run can be traced across event log, OTEL logs, and metrics without ambiguity
## Acceptance Criteria
The feature is not done until all of the following are true.
### A. Canonical Model Exists
Smithers has a canonical agent trace schema with explicit completeness states and per-agent capability declarations.
### B. Pi Is High Fidelity
`PiAgent` exports structured trace records for:
- session lifecycle
- turn lifecycle
- message lifecycle
- assistant text deltas
- visible thinking deltas
- tool execution lifecycle
- retry and compaction events
### C. Other Agents Are Truthfully Classified
Every agent in `src/agents/` has a declared fidelity class and unsupported event set.
### D. OTEL Logs Pipeline Exists
The collector and local observability stack support OTEL logs end to end.
### E. Queries Work
Operators can answer the required run reconstruction, failure analysis, and audit queries from the exported logs.
### F. Verification Is Automated
Automated tests exist for schema, ordering, fidelity, completeness, OTEL export, redaction, failure handling, and query correctness.
## Implementation Phasing
This section is normative for rollout order.
### Phase 1: Canonical Model
Implement:
- canonical trace schema
- completeness classification
- per-agent capability declarations
### Phase 2: Pi Fidelity
Implement:
- Pi raw event capture
- canonical normalization
- OTEL export
- artifact capture for session files if configured
### Phase 3: Structured CLI Agents
Implement:
- Codex
- Claude Code
- Gemini
- Kimi
Each integration MUST ship with fixture-based normalization tests before being considered complete.
### Phase 4: SDK and Text-Only Agents
Implement:
- explicit partial or final-only capture
- truthful capability declarations
- OTEL export for the observable subset
### Phase 5: Redaction and Hardening
Implement:
- default redaction
- export failure handling
- artifact verification
- documented local Loki queries
## Explicit Non-Ambiguities
The following choices are intentional.
- Smithers MUST prefer truthful partial fidelity over fake completeness.
- Smithers MUST preserve raw event boundaries rather than collapsing everything into summaries.
- Smithers MUST keep large content in log bodies, not indexing labels.
- Smithers MUST retain a local source of truth when OTEL export fails.
- Smithers MUST separate assistant text from visible thinking.
- Smithers MUST define task completion in terms of verification evidence, not implementation effort.
## Out of Scope for the First Implementation
The first implementation MAY defer:
- remote artifact storage
- cross-run session graph visualizations
- backend-specific dashboards beyond minimal verification queries
- universal reconstruction of provider-internal hidden reasoning
If deferred, these items MUST be documented explicitly and MUST NOT be implied to exist.
## Summary
The required system is not “send some logs to Loki.”
The required system is:
- a canonical agent trace model
- explicit capability declarations per integration
- conservative capture of all observable upstream events
- durable local truth
- OTEL log export with stable correlation fields
- redaction before persistence and export
- verification that proves fidelity, completeness, and queryability
Anything less produces observability that looks complete while remaining operationally unreliable.
38dbe04 to 7f9388b
Added verification coverage beyond Pi. Testing I ran:
The self-contained observability workflow/script now explicitly exercises
5897da8 to 3ac8f25
The trace collector was overstating capture fidelity by treating synthesized terminal events as if they had been directly observed. It also kept inconsistent usage payloads and relied on fragile tool lifecycle identity reconstruction. This change separates directly observed events from backfilled canonical events so completeness reflects what Smithers actually captured for each attempt. It normalizes usage records to one schema, preserves later terminal usage data, and threads a stable toolCallId through Smithers tool events so start and finish records stay correlated. The result keeps coarse capture modes truthfully classified as final-only while marking real structured gaps as partial-observed, which makes the trace model easier to reason about and aligns it with the rest of the codebase's observability behavior.
…thfully
Smithers was still under-reporting fidelity for real Codex and Claude CLI runs even after the earlier trace semantics cleanup. Codex emitted structured dotted JSONL events that we were treating as coarse text in practice, and Claude exposed an observed terminal result that we were only backfilling into a synthetic final message. This change makes Codex self-describe its stream-json mode, normalizes real dotted Codex events into canonical turn and final-message records, and treats Claude result events as directly observed final assistant output. It also hardens the observability verification script so publication checks assert the actual trace summary semantics for Claude and Codex instead of only checking that logs exist. That keeps the published bookmark aligned with the behavior we verified manually and protects the trace fidelity guarantees the PR is trying to establish.
24d48ce to aa39ee6
Problem
Smithers had runtime events, traces, and metrics, but it did not have a stable, queryable observability surface for the agent-visible execution transcript itself.
That left a gap for debugging and review:
What this PR adds
Canonical agent trace model
- `src/agent-trace.ts`
- `assistant.text.delta`
- `assistant.thinking.delta`
- `tool.execution.start`
- `tool.execution.update`
- `tool.execution.end`
- `assistant.message.final`
- `usage`
- `capture.error`
- `capture.warning`
- `agent.family`
- `agent.capture_mode`
- `trace.completeness`
OTEL log export + durable local truth
- `run_id`
- `workflow_path`
- `node_id`
- `node_attempt`
- `agent_family`
- `agent_capture_mode`
- `trace_completeness`
- `event_kind`
Local observability stack support
- `docs/guides/agent-trace-otel-verification.mdx`
Coverage across agent families
Failure / redaction coverage
- `capture.error` / `capture-failed`
- `capture.error` / `capture-failed`
- `capture.warning` / `partial-observed`
Verification hardening
This branch also hardens the local verification path so the stack is easier to reset and verify repeatedly:
- `scripts/obs-reset.sh`
- `scripts/obs-wait-healthy.sh`
- `scripts/verify-observability.sh`
Why this approach
The goal here is not just “send some logs to Loki.”
The goal is a truthful, queryable, reviewer-usable agent trace pipeline that:
Validation
Automated
- `bun test tests/agent-trace.test.tsx`
- `bun test tests/observability.test.ts`
Added / verified coverage for:
- `final-only`
Live local verification
Verified locally with the built-in demo workflow:
- `workflows/agent-trace-otel-demo.tsx`
Known-good hardened flow:
This produces a timestamped evidence bundle under:
Observed live validation includes:
- `capture.error` query
- `service.name=smithers-dev` and matching `runId`
Current verification status
Strongest live evidence
Truthful but weaker than Pi
Scope
This PR intentionally bundles the implementation and the local verification path hardening, because the feature is only useful if maintainers can actually reproduce the observability evidence locally.
Included in scope:
Not included:
Notes
- `version` field intentionally in this branch
🤖 Generated with pi / smithers