feat(telemetry): opt-in OpenTelemetry tracing for agents #2069
Draft
Mgczacki wants to merge 1 commit into
Add `dimos.telemetry`, a self-contained module that wraps the agent's
per-turn execution in an OTEL span. The base install is unaffected:
this package imports no opentelemetry packages at module load, and
`dimos.telemetry.span` is a silent no-op until tracing is wired up.
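A minimal sketch of the no-op behaviour described above, assuming the helper is a context manager gated on module-level state. The `_tracer` global is a hypothetical stand-in (the PR keeps its state in a `TracerManager`), and this is not the PR's actual implementation.

```python
from contextlib import contextmanager
from typing import Any, Iterator

# Hypothetical module-level state; None means tracing has not been wired up.
_tracer: Any = None


@contextmanager
def span(name: str, **attrs: Any) -> Iterator[Any]:
    """Yield a live span when tracing is on, otherwise yield None and do nothing."""
    if _tracer is None:
        # Tracing is off: no opentelemetry import happens on this path.
        yield None
        return
    # Tracing is on: delegate to the configured tracer (an OTEL Tracer in practice).
    with _tracer.start_as_current_span(name, attributes=attrs) as active:
        yield active


# With nothing configured, the body still runs and the span object is None.
with span("agent.turn", turn=1) as s:
    result = "ran body"
```

Callers can wrap code unconditionally; when tracing is off the only cost is the `None` check.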
Wiring options:
* Env-driven (recommended). Install the extra and set
`OTEL_EXPORTER_OTLP_ENDPOINT`. `enable()` runs automatically on first
import and configures the OTLP HTTP exporter plus, when available,
`openinference-instrumentation-langchain` for LangChain auto-spans.
* Caller-owned provider: `configure_tracing(my_provider)`.
* Standard OTEL idiom: `DimosInstrumentor().instrument(...)`. The class
is resolved lazily via module-level `__getattr__` so the heavy
`opentelemetry.instrumentation` import only runs on attribute access.
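The lazy resolution in the last bullet relies on module-level `__getattr__` (PEP 562). A rough, self-contained sketch of the pattern follows; the `demo_telemetry` module, `HeavyInstrumentor` class, and `load_log` list are all hypothetical and only illustrate that the expensive work is deferred until first attribute access.

```python
import types

# Hypothetical stand-in for a package's __init__.py.
demo = types.ModuleType("demo_telemetry")
load_log: list[str] = []


def _lazy_getattr(name: str):
    """Module-level __getattr__: called only when normal attribute lookup fails."""
    if name == "HeavyInstrumentor":
        load_log.append(name)  # stands in for the deferred heavy import

        class HeavyInstrumentor:  # placeholder for the real instrumentor class
            pass

        # Cache on the module so later accesses bypass __getattr__.
        setattr(demo, name, HeavyInstrumentor)
        return HeavyInstrumentor
    raise AttributeError(f"module {demo.__name__!r} has no attribute {name!r}")


demo.__getattr__ = _lazy_getattr

loaded_before_access = list(load_log)  # nothing has been loaded yet
first = demo.HeavyInstrumentor         # first access triggers the lazy path
second = demo.HeavyInstrumentor        # second access hits the cached attribute
```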
Vendor-agnostic via OTLP: Langfuse, Arize Phoenix, LangSmith, and Opik
all accept the same pipeline; selection is by env var, not code.
Each McpClient instance now generates a UUID at construction and stamps
it on every `agent.turn` span via `session_attributes()`, which sets
both the OpenInference `session.id` (Langfuse, Phoenix) and
`langsmith.trace.session_id` (LangSmith). Backends group all per-turn
traces from one instance into a single session in their UI. Opik has no
OTEL→Threads mapping yet (comet-ml/opik#3441); use its native SDK there.
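A sketch of the session stamping described above, under the assumption that `session_attributes()` takes the session ID and returns a flat attribute mapping. The real signature is not shown in this PR text, and `Client` is a hypothetical stand-in for `McpClient`.

```python
import uuid


def session_attributes(session_id: str) -> dict[str, str]:
    """Return the per-backend session keys named in the commit message."""
    return {
        "session.id": session_id,                  # OpenInference (Langfuse, Phoenix)
        "langsmith.trace.session_id": session_id,  # LangSmith
        # The PR description additionally lists `thread_id`; omitted in this sketch.
    }


class Client:  # hypothetical stand-in for McpClient
    def __init__(self) -> None:
        # One UUID per instance, reused for every turn.
        self._session_id = str(uuid.uuid4())

    def turn_attributes(self) -> dict[str, str]:
        return session_attributes(self._session_id)


client = Client()
turn_one = client.turn_attributes()
turn_two = client.turn_attributes()
other = Client().turn_attributes()
```

Both turns from `client` carry the same ID, while a second instance gets its own, which is what lets a backend group turns by session.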
Closes #2053.
Why
When iterating on agent changes I kept reaching for trace visibility — to see
how tool calls nested, where time was going, when the LLM was getting confused
mid-turn. I built a quick Langfuse-specific integration locally for my own dev
loop. It worked well enough that contributing it back seemed worth doing —
hierarchical-trace tooling is becoming a standard best practice for agent
development, and it's the kind of thing nobody wants to re-roll per project.
Rather than ship the Langfuse-specific version, I generalized it to an
OpenTelemetry pipeline. One integration covers all four backends named in
#2053 — Langfuse, Arize Phoenix, LangSmith, Opik — plus any other
OTLP-compatible backend (Jaeger, Tempo, Honeycomb, ...) for free. Vendor
selection is by env var, not code change. The agent itself never imports a
vendor SDK.
An example of how this looks in practice can be seen here: https://us.cloud.langfuse.com/project/cmp23t80n09ooad08jnw1lksy/traces/d1b1d6c95ec9107fdcc4df5fa094edef?observation=trace-d1b1d6c95ec9107fdcc4df5fa094edef&timestamp=2026-05-13T05:33:39.063Z
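Langfuse OTLP endpoint: `https://{us,cloud}.langfuse.com/api/public/otel`
Session attribute per backend:
* Langfuse: `session.id` (OpenInference)
* Arize Phoenix: `session.id` (OpenInference; Arize owns this convention)
* LangSmith: `langsmith.trace.session_id` (LangSmith namespace)
* Opik: `thread_id` (added in comet-ml/opik#3441)

What's in this PR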
* `dimos/telemetry/`: new self-contained module
  * `__init__.py`: public API (`span`, `enable`, `configure_tracing`, `session_attributes`, `DimosInstrumentor`).
  * `_manager.py`: single module-global `TracerManager` (`tracer` + `_export_enabled`).
  * `_api.py`: the `span(name, **attrs)` context manager. No-op when off.
  * `instrumentor.py`: `DimosInstrumentor(BaseInstrumentor)` with a top-level `try/except` fallback to a stub class when `dimos[otel]` isn't installed.
  * `test_telemetry.py`: 6 pytest tests covering the no-op default, dotted-key attribute pass-through, session attribute shape, `enable()` short-circuits, lazy `DimosInstrumentor` resolution, and the strict-opt-in import contract (subprocess test).

Agent integration
* `McpClient._process_message` wraps the per-turn `state_graph.stream(...)` loop in `dimos.telemetry.span("agent.turn", ...)`. Each `McpClient` instance generates a UUID at construction; every turn's root span is tagged via the `session_attributes()` helper (sets `session.id`, `langsmith.trace.session_id`, and `thread_id`) so all turns from one agent instance appear under a single session/thread in the backend UI.

Extras
* `pyproject.toml` adds `[project.optional-dependencies] otel = [...]` with the OTEL API/SDK, OTLP HTTP exporter, `opentelemetry-instrumentation`, and `openinference-instrumentation-langchain` for LangChain auto-spans.

Strict opt-in: what "core path unchanged" means here
* `import dimos.telemetry` triggers zero opentelemetry imports, even when `dimos[otel]` is installed. Pure stdlib at import time. Locked in by a subprocess test.
* The `span()` helper short-circuits on a single boolean check when tracing is off.
* When `OTEL_EXPORTER_OTLP_ENDPOINT` is unset, the only cost is a single `os.environ.get()` call at package import.

Three ways to turn it on
1. Env-driven (recommended). Install the extra and set `OTEL_EXPORTER_OTLP_ENDPOINT`. `enable()` runs automatically on first import of `dimos.telemetry`. Auth via `OTEL_EXPORTER_OTLP_HEADERS`; service name via `OTEL_SERVICE_NAME`.
2. Caller-owned provider. `dimos.telemetry.configure_tracing(my_provider)` when the host app already runs OTEL.
3. Standard OTEL idiom. `DimosInstrumentor().instrument(tracer_provider=...)`.

Concrete usage
Langfuse:
Phoenix (local):
LangSmith:
When the env var is unset (the default), behavior is unchanged from main.
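For the env-driven path, a hedged sketch of what the Langfuse variables might look like when set from Python before the first import. The endpoint shape comes from the table above; the Basic-auth header built from a public/secret key pair, the example keys, the `dimos-agent` service name, and the `langfuse_env` helper itself are assumptions for illustration, not part of this PR.

```python
import base64
import os


def langfuse_env(public_key: str, secret_key: str, region: str = "us") -> dict[str, str]:
    """Build the three OTEL variables for a Langfuse OTLP endpoint (hypothetical helper)."""
    if region not in ("us", "cloud"):
        raise ValueError(f"unexpected region: {region!r}")
    # Assumed auth scheme: HTTP Basic over "public_key:secret_key".
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"https://{region}.langfuse.com/api/public/otel",
        # Some exporter versions may need the space URL-encoded ("Basic%20...").
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
        "OTEL_SERVICE_NAME": "dimos-agent",  # illustrative service name
    }


# Placeholder keys; real ones come from the backend's project settings.
os.environ.update(langfuse_env("pk-lf-example", "sk-lf-example"))

# After this point, `import dimos.telemetry` would pick the endpoint up
# and run enable() automatically, per the description above.
```

Phoenix and LangSmith would presumably differ only in the endpoint and header values, since backend selection is by env var rather than code.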
Out of scope (intentionally) — happy to follow up
* `langfuse.langchain.CallbackHandler` renders messages/tool-calls more richly than generic OpenInference attributes. Adding it would conflict with the OpenInference auto-instrumentor (duplicate spans), so it belongs in a follow-up `[langfuse]` extra that swaps the LangChain instrumentation path.
* `dimos.telemetry.span()` is the shared helper any module can adopt.

Repo checks
* `ruff format --check` ✓
* `ruff check` ✓
* `mypy dimos/telemetry/` ✓
* `pytest dimos/telemetry/` → 6 passed in 0.16s

Design choices reviewers may ask about
* `session.id` (OpenInference) covers Langfuse + Phoenix; `langsmith.trace.session_id` covers LangSmith; `thread_id` covers Opik. Setting all three covers every backend named in the issue without runtime detection. Documented in `session_attributes()`.
* `OTEL_EXPORTER_OTLP_ENDPOINT` is the user's opt-in signal; eager setup ensures the first agent turn is already traced. When unset, the cost is a single `os.environ.get()` call.
* `uv.lock` noise. Five new packages (the OTEL stack: `openinference-instrumentation-langchain`, `openinference-instrumentation`, `openinference-semantic-conventions`, `opentelemetry-exporter-otlp-proto-http`, `opentelemetry-instrumentation`) plus a one-line transitive patch bump (`reportlab` 4.5.0 → 4.5.1) that `uv lock` picked up when regenerating.
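On the auto-enable choice above, a rough sketch of an import-time gate that costs one environment lookup when the endpoint is unset. The `auto_enable` function and its injected `enable` callable are hypothetical; the PR's own `enable()` may be structured differently.

```python
import os
from typing import Callable, Mapping

ENDPOINT_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"


def auto_enable(enable: Callable[[], None], environ: Mapping[str, str] = os.environ) -> bool:
    """Run `enable` only when the endpoint variable is set and non-empty."""
    if not environ.get(ENDPOINT_VAR):
        # Unset (the default): a single lookup, nothing else happens.
        return False
    # Opt-in signal present: this is where heavy imports and exporter setup would go.
    enable()
    return True


calls: list[str] = []
off = auto_enable(lambda: calls.append("enabled"), environ={})
on = auto_enable(
    lambda: calls.append("enabled"),
    environ={ENDPOINT_VAR: "http://localhost:4318"},  # illustrative endpoint value
)
```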