feat(telemetry): opt-in OpenTelemetry tracing for agents by Mgczacki · Pull Request #2069 · dimensionalOS/dimos

Mgczacki · 2026-05-13T06:15:11Z

Closes #2053.

Why

When iterating on agent changes I kept reaching for trace visibility — to see
how tool calls nested, where time was going, when the LLM was getting confused
mid-turn. I built a quick Langfuse-specific integration locally for my own dev
loop. It worked well enough that contributing it back seemed worth doing —
hierarchical-trace tooling is becoming a standard best practice for agent
development, and it's the kind of thing nobody wants to re-roll per project.

Rather than ship the Langfuse-specific version, I generalized it to an
OpenTelemetry pipeline. One integration covers all four backends named in
#2053 — Langfuse, Arize Phoenix, LangSmith, Opik — plus any other
OTLP-compatible backend (Jaeger, Tempo, Honeycomb, ...) for free. Vendor
selection is by env var, not code change. The agent itself never imports a
vendor SDK.

An example of how this looks in practice can be seen here: https://us.cloud.langfuse.com/project/cmp23t80n09ooad08jnw1lksy/traces/d1b1d6c95ec9107fdcc4df5fa094edef?observation=trace-d1b1d6c95ec9107fdcc4df5fa094edef&timestamp=2026-05-13T05:33:39.063Z

| Backend | OTLP endpoint | Session/thread grouping |
|---|---|---|
| Langfuse | `https://cloud.langfuse.com/api/public/otel` (EU) or `https://us.cloud.langfuse.com/api/public/otel` (US) | ✓ via `session.id` (OpenInference) |
| Arize Phoenix / AX | local Phoenix collector, or arize.com | ✓ via `session.id` (OpenInference; Arize owns this convention) |
| LangSmith | LangSmith OTEL endpoint | ✓ via `langsmith.trace.session_id` (LangSmith namespace) |
| Opik | comet.com OTEL endpoint | ✓ via `thread_id` (added in comet-ml/opik#3441) |

What's in this PR

dimos/telemetry/ — new self-contained module

__init__.py — public API: span, enable, configure_tracing, session_attributes, DimosInstrumentor.
_manager.py — single module-global TracerManager (tracer + _export_enabled).
_api.py — the span(name, **attrs) context manager. No-op when off.
instrumentor.py — DimosInstrumentor(BaseInstrumentor) with a top-level try/except fallback to a stub class when dimos[otel] isn't installed.
test_telemetry.py — 6 pytest tests covering the no-op default, dotted-key attribute pass-through, session attribute shape, enable() short-circuits, lazy DimosInstrumentor resolution, and the strict-opt-in import contract (subprocess test).

Agent integration

McpClient._process_message wraps the per-turn state_graph.stream(...) loop in dimos.telemetry.span("agent.turn", ...). Each McpClient instance generates a UUID at construction; every turn's root span is tagged via the session_attributes() helper (sets session.id, langsmith.trace.session_id, and thread_id) so all turns from one agent instance appear under a single session/thread in the backend UI.
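
The session tagging described above could look roughly like the sketch below. The three attribute keys come from this PR; the function signature and the `AgentClient` class (a stand-in for `McpClient`) are assumptions made for illustration.

```python
import uuid
from typing import Dict


def session_attributes(session_id: str) -> Dict[str, str]:
    """Return the same session ID under every backend's grouping key."""
    return {
        "session.id": session_id,                  # OpenInference: Langfuse, Phoenix
        "langsmith.trace.session_id": session_id,  # LangSmith namespace
        "thread_id": session_id,                   # Opik
    }


class AgentClient:
    """Hypothetical stand-in for McpClient: one session ID per instance."""

    def __init__(self) -> None:
        # Generated once at construction, reused for every turn.
        self._session_id = str(uuid.uuid4())

    def turn_attributes(self) -> Dict[str, str]:
        # Attributes to stamp on each turn's root span.
        return session_attributes(self._session_id)
```

Because the ID is created in the constructor, every turn from one instance carries the same value, and two instances never share a session.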

Extras

pyproject.toml adds [project.optional-dependencies] otel = [...] with the OTEL API/SDK, OTLP HTTP exporter, opentelemetry-instrumentation, and openinference-instrumentation-langchain for LangChain auto-spans.

Strict opt-in — what "core path unchanged" means here

The base install ships no opentelemetry packages.
import dimos.telemetry triggers zero opentelemetry imports while OTEL_EXPORTER_OTLP_ENDPOINT is unset, even when dimos[otel] is installed. Pure stdlib at import time. Locked in by a subprocess test.
The span() helper short-circuits on a single boolean check when tracing is off.
Cost for users who don't enable tracing is one os.environ.get() call at package import.
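
One way to check an import contract like this is to import the module in a fresh interpreter and inspect `sys.modules`. The helper below is a generic sketch under that assumption, not the test shipped in `test_telemetry.py`; the name `leaked_modules` is hypothetical, and it clears `OTEL_EXPORTER_OTLP_ENDPOINT` in the child on the assumption that the contract applies to the default, tracing-off case.

```python
import os
import subprocess
import sys
from typing import List


def leaked_modules(module: str, forbidden_prefix: str) -> List[str]:
    """Import `module` in a fresh interpreter; list loaded modules under `forbidden_prefix`."""
    child_code = (
        "import importlib, sys\n"
        f"importlib.import_module({module!r})\n"
        f"prefix = {forbidden_prefix!r}\n"
        "hits = sorted(m for m in sys.modules if m == prefix or m.startswith(prefix + '.'))\n"
        "print('\\n'.join(hits))\n"
    )
    # Run with the opt-in variable removed so the child sees the default configuration.
    env = {k: v for k, v in os.environ.items() if k != "OTEL_EXPORTER_OTLP_ENDPOINT"}
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return [line for line in result.stdout.splitlines() if line]
```

A pytest test would then assert that `leaked_modules("dimos.telemetry", "opentelemetry")` returns an empty list. A subprocess is needed because other tests in the same session may already have imported opentelemetry.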

Three ways to turn it on

1. Env-driven (recommended). Install the extra and set OTEL_EXPORTER_OTLP_ENDPOINT. enable() runs automatically on first import of dimos.telemetry. Auth via OTEL_EXPORTER_OTLP_HEADERS; service name via OTEL_SERVICE_NAME.

2. Caller-owned provider. dimos.telemetry.configure_tracing(my_provider) when the host app already runs OTEL.

3. Standard OTEL idiom. DimosInstrumentor().instrument(tracer_provider=...).

Concrete usage

Langfuse:

OTEL_EXPORTER_OTLP_ENDPOINT='https://us.cloud.langfuse.com/api/public/otel' \
OTEL_EXPORTER_OTLP_HEADERS='Authorization=Basic <base64(public_key:secret_key)>' \
OTEL_SERVICE_NAME='dimos' \
uv run dimos ...
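
The `<base64(public_key:secret_key)>` placeholder above is standard HTTP Basic auth encoding. A small sketch for producing the header value follows; the function name is hypothetical and the keys shown in the usage line are dummies.

```python
import base64


def langfuse_otlp_headers(public_key: str, secret_key: str) -> str:
    """Build the OTEL_EXPORTER_OTLP_HEADERS value for Basic auth."""
    # Basic auth is base64("user:password"); here the key pair plays that role.
    raw = f"{public_key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization=Basic {token}"


if __name__ == "__main__":
    # Dummy keys for illustration only.
    print(langfuse_otlp_headers("pk-lf-example", "sk-lf-example"))
```

The printed string can be pasted as the value of `OTEL_EXPORTER_OTLP_HEADERS` in the command above.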

Phoenix (local):

phoenix serve &
OTEL_EXPORTER_OTLP_ENDPOINT='http://localhost:6006' uv run dimos ...

LangSmith:

OTEL_EXPORTER_OTLP_ENDPOINT='https://api.smith.langchain.com/otel' \
OTEL_EXPORTER_OTLP_HEADERS='x-api-key=<your_key>' \
uv run dimos ...

When the env var is unset (the default), behavior is unchanged from main.

Out of scope (intentionally) — happy to follow up

Vendor-native pretty rendering. Langfuse's own langfuse.langchain.CallbackHandler renders messages/tool-calls more richly than generic OpenInference attributes. Adding it would conflict with the OpenInference auto-instrumentor (duplicate spans), so it belongs in a follow-up [langfuse] extra that swaps the LangChain instrumentation path.
Prompt versioning, datasets, evals. These use vendor-native SDKs rather than OTLP. Separate scope.
Tracing other DimOS modules. Only the agent is wrapped for now; dimos.telemetry.span() is the shared helper any module can adopt.

Repo checks

ruff format --check ✓
ruff check ✓
mypy dimos/telemetry/ ✓
pytest dimos/telemetry/ → 6 passed in 0.16s

Design choices reviewers may ask about

Why three session/thread attribute keys? Verified against each backend's current docs. There's no single convention yet: session.id (OpenInference) covers Langfuse + Phoenix; langsmith.trace.session_id covers LangSmith; thread_id covers Opik. Setting all three covers every backend named in the issue without runtime detection. Documented in session_attributes().
Why does auto-enable run at import time when the env var is set? Setting OTEL_EXPORTER_OTLP_ENDPOINT is the user's opt-in signal — eager setup ensures the first agent turn is already traced. When unset, the cost is a single os.environ.get() call.
uv.lock noise. Five new packages (the OTEL stack — openinference-instrumentation-langchain, openinference-instrumentation, openinference-semantic-conventions, opentelemetry-exporter-otlp-proto-http, opentelemetry-instrumentation) plus a one-line transitive patch bump (reportlab 4.5.0 → 4.5.1) that uv lock picked up when regenerating.

Add `dimos.telemetry`, a self-contained module that wraps the agent's per-turn execution in an OTEL span. The base install is unaffected: this package imports no opentelemetry packages at module load, and `dimos.telemetry.span` is a silent no-op until tracing is wired up.

Wiring options:

* Env-driven (recommended). Install the extra and set `OTEL_EXPORTER_OTLP_ENDPOINT`. `enable()` runs automatically on first import and configures the OTLP HTTP exporter plus, when available, `openinference-instrumentation-langchain` for LangChain auto-spans.
* Caller-owned provider: `configure_tracing(my_provider)`.
* Standard OTEL idiom: `DimosInstrumentor().instrument(...)`. The class is resolved lazily via module-level `__getattr__` so the heavy `opentelemetry.instrumentation` import only runs on attribute access.

Vendor-agnostic via OTLP: Langfuse, Arize Phoenix, LangSmith, and Opik all accept the same pipeline; selection is by env var, not code.

Each McpClient instance now generates a UUID at construction and stamps it on every `agent.turn` span via `session_attributes()`, which sets both the OpenInference `session.id` (Langfuse, Phoenix) and `langsmith.trace.session_id` (LangSmith). Backends group all per-turn traces from one instance into a single session in their UI. Opik has no OTEL→Threads mapping yet (comet-ml/opik#3441); use its native SDK there.
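
The lazy resolution via module-level `__getattr__` mentioned in the commit message can be sketched as follows. In a real package the `__getattr__` function would simply be defined at the top level of `__init__.py`; this sketch builds a throwaway module object so it runs on its own, and uses the stdlib `colorsys` module as a stand-in for the heavy import. The helper name `make_lazy_module` is hypothetical.

```python
import importlib
import types
from typing import Any, Dict, Tuple


def make_lazy_module(name: str, lazy_attrs: Dict[str, Tuple[str, str]]) -> types.ModuleType:
    """Create a module whose listed attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr: str) -> Any:
        # Called only when normal attribute lookup on the module fails.
        if attr in lazy_attrs:
            target_module, target_name = lazy_attrs[attr]
            value = getattr(importlib.import_module(target_module), target_name)
            setattr(module, attr, value)  # cache so later lookups skip this hook
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # A __getattr__ entry in the module namespace acts as the lazy-lookup hook.
    module.__getattr__ = __getattr__
    return module


if __name__ == "__main__":
    demo = make_lazy_module("demo_pkg", {"rgb_to_hsv": ("colorsys", "rgb_to_hsv")})
    # colorsys is imported here, on first attribute access, not at module creation.
    print(demo.rgb_to_hsv(1.0, 0.0, 0.0))
```

The point of the pattern is that importing the package stays cheap, and only callers who actually touch the lazily exposed name pay for the heavy import.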

Mgczacki force-pushed the agent_observability branch from 0c9395d to 2419dfc on May 13, 2026 06:19

Mgczacki wants to merge 1 commit into dimensionalOS:main from Mgczacki:agent_observability