feat: add OpenTelemetry tracing for tool calls #21

Open
priyank766 wants to merge 5 commits into kubeflow:main from priyank766:feat/otel-tracing

Conversation

@priyank766

Description

Adds optional OpenTelemetry tracing for tool calls in Kubeflow MCP Server.

  • Adds core/telemetry.py with setup_tracing() and get_tracer() plus safe no-op fallback when OTel deps are unavailable.
  • Instruments core.server._audit_wrap to create one span per tool invocation, set tool/persona/duration/success attributes, attach correlation_id, and record
    exceptions.
  • Wires tracing config through CLI/env/config:
    • --otel-endpoint
    • KUBEFLOW_MCP_OTEL_ENDPOINT
    • observability.otel_endpoint
  • Adds unit tests for no-op path, provider setup/reuse, endpoint validation, and span behavior.
  • Updates README with an Observability section.

Important compatibility note: correlation_id semantics are preserved and exposed as a span attribute; it is not remapped to OTel trace ID.

Type of Change

  • feat: New feature

Checklist

  • Tests pass locally (make test-python)
  • Linting passes (make verify)
  • Documentation updated (if applicable)
  • Commit messages follow conventional format

Related Issues

Fixes #18

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: priyank <priyank8445@gmail.com>
@priyank766 priyank766 force-pushed the feat/otel-tracing branch from ec68142 to bec06a0 Compare May 15, 2026 16:51
@abhijeet-dhumal
Member

Hey @priyank766, this looks great 🚀
Thanks for working on this!

Comment thread pyproject.toml Outdated
"sphinx-design>=0.5",
]
otel = [
"opentelemetry-exporter-otlp>=1.25.0",
Member

Suggested change
- "opentelemetry-exporter-otlp>=1.25.0",
+ "opentelemetry-exporter-otlp-proto-http>=1.25.0",

This avoids installing the unnecessary gRPC subpackage.

Author

Thanks for the suggestion.
I will change the package as well.

Comment thread kubeflow_mcp/core/server.py Outdated
exc_info=True,
)
raise
with tracer.start_as_current_span("tool_call") as span:
Member

I tried it just now, and the span name "tool_call" makes it hard to filter by tool in tracing UIs. Maybe we can name the span after the tool instead: f"tool:{tool_name}", or even just tool_name. The tool.name attribute is still worth keeping for structured querying, but the span name gives context at a glance.
Wdyt?

Author

Thanks for reviewing, @abhijeet-dhumal.
I will push both changes, and you can review them whenever you get time.
Yes, I think this naming scheme is much better. I will change it. 👍🏻

Signed-off-by: priyank <priyank8445@gmail.com>
Copilot AI review requested due to automatic review settings May 18, 2026 15:53
Copilot AI left a comment

Pull request overview

Adds optional OpenTelemetry tracing for Kubeflow MCP tool calls, wiring tracing through configuration, CLI, runtime instrumentation, tests, and docs.

Changes:

  • Adds telemetry setup/no-op helpers and an otel optional dependency extra.
  • Instruments _audit_wrap spans with tool/persona/correlation/success/duration attributes.
  • Wires --otel-endpoint, config/env loading, startup logging, tests, and README documentation.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| kubeflow_mcp/core/telemetry.py | Adds OpenTelemetry setup, validation, provider reuse, and no-op fallback helpers. |
| kubeflow_mcp/core/server.py | Adds span creation and attributes around audited tool calls. |
| kubeflow_mcp/cli.py | Adds --otel-endpoint and invokes tracing setup during server startup. |
| kubeflow_mcp/core/config.py | Adds observability.otel_endpoint config/env support. |
| kubeflow_mcp/core/logging.py | Includes tracing_enabled in structured startup logs. |
| tests/unit/core/test_telemetry.py | Adds telemetry helper and span attribute tests. |
| kubeflow_mcp/cli_test.py | Adds CLI tracing setup wiring tests. |
| README.md | Documents optional tracing setup and span attributes. |
| pyproject.toml | Adds the otel optional dependency extra. |
| uv.lock | Locks OpenTelemetry optional dependencies and related resolution changes. |

Comments suppressed due to low confidence (1)

kubeflow_mcp/core/server.py:126

  • The circuit-open early-return tracing path is not covered by the new span behavior tests. Add coverage for a breaker with can_execute() == False so the tool.success=false and duration attributes remain verified for this failure mode.
            if not breaker.can_execute():
                duration_ms = int((time.monotonic() - start) * 1000)
                span.set_attribute("tool.success", False)
                span.set_attribute("tool.duration_ms", duration_ms)
                logger.warning("circuit_open", extra={"tool": tool_name})
                return {
                    "error": f"Circuit breaker open for '{tool_name}' — K8s API may be degraded. Retries automatically after recovery timeout.",
                    "error_code": ErrorCode.CIRCUIT_OPEN,
                }

Comment thread kubeflow_mcp/core/server.py Outdated
Comment on lines +102 to +105
with tracer.start_as_current_span(f"tool:{tool_name}") as span:
span.set_attribute("tool.name", tool_name)
span.set_attribute("kubeflow.persona", persona)
span.set_attribute("correlation_id", cid)
Comment thread kubeflow_mcp/core/server.py Outdated
breaker.record_failure()
span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
span.record_exception(exc)
Member

@abhijeet-dhumal abhijeet-dhumal May 23, 2026

@priyank766 can we set an error status on the span when an exception is raised here?
span.set_status(Status(StatusCode.ERROR, str(exc)))

Member

This makes failures easy to spot in the trace list, wdyt?

Comment on lines +242 to +248
observability_file = file_config.get("observability", {})
observability = ObservabilityConfig(
otel_endpoint=os.getenv(
"KUBEFLOW_MCP_OTEL_ENDPOINT",
observability_file.get("otel_endpoint"),
)
)
Comment on lines +107 to +115
if _rate_limiter is not None and not _rate_limiter.acquire():
duration_ms = int((time.monotonic() - start) * 1000)
span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
logger.warning("rate_limited", extra={"tool": tool_name})
return {
"error": "Rate limit exceeded. Retry after a brief pause.",
"error_code": ErrorCode.RATE_LIMITED,
}
Comment thread README.md
Comment on lines +173 to +180
- Install optional dependencies: `pip install ".[otel]"`
- Enable tracing with CLI flag or env var:

```bash
kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces
# or
export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces
kubeflow-mcp serve
```
Comment thread kubeflow_mcp/core/server.py Outdated

cid = with_correlation_id()
masked = mask_sensitive_data(kwargs) if kwargs else {}
tracer = get_tracer("kubeflow_mcp.tools")
Member

@priyank766 can you move this outside the wrapper closure?
get_tracer is cheap, but calling it on every invocation seems semantically wrong, wdyt?

Comment thread kubeflow_mcp/core/server.py Outdated
exc_info=True,
)
raise
with tracer.start_as_current_span(f"tool:{tool_name}") as span:
Member

@priyank766 can we use SpanKind.CLIENT for tool spans here?
Tool calls invoke an external service, i.e. the K8s API. SpanKind.CLIENT is semantically correct and enables better dependency graph rendering in Jaeger.

return None


class _NoopTracer:
Member

_NoopTracer should accept **kwargs here.
Otherwise, future callers passing kind=SpanKind.CLIENT will get a TypeError in no-op mode.

…hoist tracer, noop kwargs

- Use SpanKind.CLIENT for tool spans (K8s API = external service)

- Replace manual record_exception() with set_status(Status(StatusCode.ERROR)) to avoid duplicate events and improve Jaeger trace visibility

- Move get_tracer() to _audit_wrap scope (once per tool, not per call)

- Add **kwargs and set_status() to _NoopTracer/_NoopSpan for API compatibility

Signed-off-by: priyank <priyank8445@gmail.com>
@priyank766
Author

Thanks for the review, @abhijeet-dhumal. All suggestions have been addressed. Please take another look when you get a chance.

@priyank766 priyank766 force-pushed the feat/otel-tracing branch from 86d30a5 to 6428ffe Compare May 23, 2026 11:19
Comment thread README.md Outdated
```bash
kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces
# or
export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces
```
Member

Suggested change
- export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces
+ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/traces

Comment thread kubeflow_mcp/core/config.py Outdated
observability_file = file_config.get("observability", {})
observability = ObservabilityConfig(
otel_endpoint=os.getenv(
"KUBEFLOW_MCP_OTEL_ENDPOINT",
Member

Suggested change
- "KUBEFLOW_MCP_OTEL_ENDPOINT",
+ "OTEL_EXPORTER_OTLP_ENDPOINT",

Comment thread kubeflow_mcp/cli.py Outdated
"--otel-endpoint",
default=None,
help="OpenTelemetry OTLP HTTP endpoint for tracing. "
"Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.",
Member

Suggested change
- "Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.",
+ "Falls back to OTEL_EXPORTER_OTLP_ENDPOINT env var, config file.",

with tracer.start_as_current_span(
f"tool:{tool_name}", **span_kwargs
) as span:
span.set_attribute("tool.name", tool_name)
Member

@abhijeet-dhumal abhijeet-dhumal May 25, 2026

Minor nit: it would be great to also surface user.id and mcp.session_id on these spans for per-session and per-user filtering in Jaeger. That needs identity propagation middleware (ContextVars populated from the MCP request context) as a prerequisite.

Member

Happy to follow up with a separate PR for that once this lands. Marking this as a non-blocking suggestion.

Author

Makes sense, having user.id and session_id on the spans would definitely help with debugging. Happy to take a crack at it once this lands, just let me know.
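
The identity propagation discussed in this thread could look roughly like the sketch below. It is not part of this PR: the variable names, `identity_attributes`, and `handle_request` are hypothetical, and only the attribute keys `user.id` and `mcp.session_id` come from the review comment.

```python
from contextvars import ContextVar
from typing import Dict, Optional

# Hypothetical context variables; middleware would set them per MCP request.
current_user_id: ContextVar[Optional[str]] = ContextVar("current_user_id", default=None)
current_session_id: ContextVar[Optional[str]] = ContextVar("current_session_id", default=None)


def identity_attributes() -> Dict[str, str]:
    """Return only the identity attributes that are set in this context."""
    candidates = {
        "user.id": current_user_id.get(),
        "mcp.session_id": current_session_id.get(),
    }
    return {key: value for key, value in candidates.items() if value is not None}


def handle_request(user_id: str, session_id: str) -> Dict[str, str]:
    """Stand-in for middleware: bind identity, run the tool, then restore."""
    user_token = current_user_id.set(user_id)
    session_token = current_session_id.set(session_id)
    try:
        # The tool wrapper would call span.set_attribute(key, value)
        # for each pair returned here.
        return identity_attributes()
    finally:
        current_user_id.reset(user_token)
        current_session_id.reset(session_token)
```

Resetting the tokens in `finally` keeps one request's identity from leaking into the next when the same thread or task is reused.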

Replace custom KUBEFLOW_MCP_OTEL_ENDPOINT with the official OpenTelemetry env var across README, CLI help text, and config loader.

Signed-off-by: priyank <priyank8445@gmail.com>
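
A minimal sketch of the fallback order described in the CLI help text (flag, then env var, then config file), assuming the config file is already parsed into a dict. `resolve_otel_endpoint` is a hypothetical helper written for illustration; the PR's loader reads the env var inline with `os.getenv`.

```python
import os
from typing import Any, Mapping, Optional

ENV_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"


def resolve_otel_endpoint(
    cli_value: Optional[str],
    file_config: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick the endpoint: CLI flag, then env var, then config file."""
    if cli_value is not None:
        return cli_value
    env = os.environ if env is None else env
    observability_file = file_config.get("observability", {})
    # Mirrors os.getenv(name, default): the file value is only the default.
    return env.get(ENV_VAR, observability_file.get("otel_endpoint"))
```

Passing `env` explicitly keeps the function easy to test without mutating the process environment. Returning `None` when nothing is configured leaves tracing disabled.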
@abhijeet-dhumal
Member

abhijeet-dhumal commented Jun 2, 2026

Hey @priyank766, I tried this out locally against an OTel collector stack, and it works cleanly end-to-end.
Spans show up in Jaeger with the right service name, and the tool: prefix naming is much better for filtering.
Thank you for being consistent and addressing all the previous feedback so promptly. 🚀 🙌

Leaving a few nits; otherwise LGTM.

provider.add_span_processor(processor)
_otel_trace.set_tracer_provider(provider)

_tracing_initialized = True
Member

Missing atexit.register(provider.shutdown) here. Without it, in-flight spans in the BatchSpanProcessor queue get silently dropped on server shutdown, and the background thread can deadlock if the collector is unreachable. One line:

import atexit
atexit.register(provider.shutdown)

with tracer.start_as_current_span(
f"tool:{tool_name}", **span_kwargs
) as span:
span.set_attribute("tool.name", tool_name)
Member

Can we also add tool.args_preview here? Without it you can't reconstruct from Jaeger alone which parameters caused a failure; you'd have to cross-reference the audit log. Something like:

import json
span.set_attribute("tool.args_preview", json.dumps(mask_sensitive_data(kwargs), default=str)[:300])
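
A self-contained sketch of how the preview string could be built, assuming a simple key-based masker. The `mask_sensitive_data` below is a hypothetical stand-in for the project's helper, and `SENSITIVE_KEYS` is an illustrative list, not the project's actual one.

```python
import json
from typing import Any, Dict

SENSITIVE_KEYS = {"token", "password", "secret"}  # illustrative only


def mask_sensitive_data(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in: replace values of sensitive-looking keys."""
    return {
        key: "***" if key.lower() in SENSITIVE_KEYS else value
        for key, value in kwargs.items()
    }


def args_preview(kwargs: Dict[str, Any], limit: int = 300) -> str:
    """Serialize masked kwargs and cap the length for use as a span attribute."""
    masked = mask_sensitive_data(kwargs) if kwargs else {}
    # default=str keeps non-JSON values (paths, datetimes) from raising.
    return json.dumps(masked, default=str)[:limit]


print(args_preview({"namespace": "default", "token": "abc"}))
# {"namespace": "default", "token": "***"}
```

Masking happens before truncation, so a secret can never appear in the cut-off part. The truncated string is not guaranteed to be valid JSON, which is fine for a human-readable preview but worth knowing before parsing it.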

span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
logger.warning("circuit_open", extra={"tool": tool_name})
return {
Member

This path sets tool.success=False on the span but there's no test covering it.
Worth adding a test with _FakeBreaker(can_execute=False) asserting the attributes are set before the early return.

Comment thread README.md Outdated
- Enable tracing with CLI flag or env var:

```bash
kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces
```
Member

Heads up: setup_tracing() auto-appends /v1/traces to whatever is passed, so the example in the README produces http://localhost:4318/v1/traces/v1/traces — double path. Either change the README example to the base URL (http://localhost:4318) or stop auto-appending in code and accept the full path as-is. The latter is less surprising.
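
One way to avoid the double path while still accepting base URLs is to append the suffix only when it is missing. This is a sketch under that assumption; `normalize_otlp_endpoint` is a hypothetical helper, not the PR's `setup_tracing()` logic.

```python
from urllib.parse import urlsplit, urlunsplit

TRACES_PATH = "/v1/traces"


def normalize_otlp_endpoint(endpoint: str) -> str:
    """Return an OTLP/HTTP traces URL, appending /v1/traces only if absent."""
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"invalid OTLP HTTP endpoint: {endpoint!r}")
    path = parts.path.rstrip("/")
    if not path.endswith(TRACES_PATH):
        path += TRACES_PATH
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(normalize_otlp_endpoint("http://localhost:4318"))
# http://localhost:4318/v1/traces
print(normalize_otlp_endpoint("http://localhost:4318/v1/traces"))
# http://localhost:4318/v1/traces
```

With this approach both the base URL and the full path from the README example resolve to the same endpoint, so either documentation style stays correct.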

Comment thread README.md

Each tool invocation emits a span with attributes:
`tool.name`, `tool.success`, `tool.duration_ms`, `kubeflow.persona`, and `correlation_id`.

Member

Worth a one-liner noting that kubeflow-mcp agent --otel-endpoint ... emits a separate kubeflow-mcp-agent service in Jaeger. Without this, users running the agent will wonder why they only see server-side spans and think tracing is broken.

…zation, circuit breaker test

- Register atexit.shutdown on TracerProvider to flush in-flight spans

- Add tool.args_preview span attribute with masked kwargs (truncated to 300 chars)

- Auto-append /v1/traces for base URLs to match OTel SDK convention

- Add test for circuit breaker open path (tool.success=False)

- Document kubeflow-mcp-agent as separate Jaeger service

Signed-off-by: priyank <priyank8445@gmail.com>
@priyank766
Author

I've addressed all the nits in the latest commit. Let me know if everything looks good!
Thanks for testing it out and for the kind words! 🚀
@abhijeet-dhumal

@abhijeet-dhumal
Member

Hey @priyank766 👋
Apologies for raising this late in the review cycle. I stumbled across something that's worth addressing before this lands.
There's a dedicated MCP-specific semantic convention that shipped in OTel v1.39: https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/. Would it be good to address it now?
It defines exactly how MCP server tool call spans should be instrumented, and I'm afraid our current attribute names don't align.
WDYT? @andreyvelich @astefanutti @kramaranya

Development

Successfully merging this pull request may close these issues.

feat: add OpenTelemetry tracing to tool calls

3 participants