feat: add OpenTelemetry tracing for tool calls#21
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: priyank <priyank8445@gmail.com>
ec68142 to
bec06a0
Compare
|
Hey @priyank766 , this looks great 🚀 |
| "sphinx-design>=0.5", | ||
| ] | ||
| otel = [ | ||
| "opentelemetry-exporter-otlp>=1.25.0", |
There was a problem hiding this comment.
| "opentelemetry-exporter-otlp>=1.25.0", | |
| "opentelemetry-exporter-otlp-proto-http>=1.25.0", |
this avoids installation of unnecessary GRPC subpackage
There was a problem hiding this comment.
Thanks for the Suggestion
I will change the package as well
| exc_info=True, | ||
| ) | ||
| raise | ||
| with tracer.start_as_current_span("tool_call") as span: |
There was a problem hiding this comment.
I tried it just now and it seems span name "tool_call" makes it hard to filter by tool in tracing UIs, Maybe we can consider naming it after the tool: f"tool:{tool_name}" or even just tool_name. The attribute tool.name is still good to keep for structured querying, but the name gives context at a glance.
wdyt?
There was a problem hiding this comment.
Thanks for Reviewing @abhijeet-dhumal
I will push both changes and then you can review it later whenever you get time
Yes I think this naming scheme is much better I will change it .. 👍🏻
Signed-off-by: priyank <priyank8445@gmail.com>
There was a problem hiding this comment.
Pull request overview
Adds optional OpenTelemetry tracing for Kubeflow MCP tool calls, wiring tracing through configuration, CLI, runtime instrumentation, tests, and docs.
Changes:
- Adds telemetry setup/no-op helpers and an
oteloptional dependency extra. - Instruments
_audit_wrapspans with tool/persona/correlation/success/duration attributes. - Wires
--otel-endpoint, config/env loading, startup logging, tests, and README documentation.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
kubeflow_mcp/core/telemetry.py |
Adds OpenTelemetry setup, validation, provider reuse, and no-op fallback helpers. |
kubeflow_mcp/core/server.py |
Adds span creation and attributes around audited tool calls. |
kubeflow_mcp/cli.py |
Adds --otel-endpoint and invokes tracing setup during server startup. |
kubeflow_mcp/core/config.py |
Adds observability.otel_endpoint config/env support. |
kubeflow_mcp/core/logging.py |
Includes tracing_enabled in structured startup logs. |
tests/unit/core/test_telemetry.py |
Adds telemetry helper and span attribute tests. |
kubeflow_mcp/cli_test.py |
Adds CLI tracing setup wiring tests. |
README.md |
Documents optional tracing setup and span attributes. |
pyproject.toml |
Adds the otel optional dependency extra. |
uv.lock |
Locks OpenTelemetry optional dependencies and related resolution changes. |
Comments suppressed due to low confidence (1)
kubeflow_mcp/core/server.py:126
- The circuit-open early-return tracing path is not covered by the new span behavior tests. Add coverage for a breaker with
can_execute() == Falseso thetool.success=falseand duration attributes remain verified for this failure mode.
if not breaker.can_execute():
duration_ms = int((time.monotonic() - start) * 1000)
span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
logger.warning("circuit_open", extra={"tool": tool_name})
return {
"error": f"Circuit breaker open for '{tool_name}' — K8s API may be degraded. Retries automatically after recovery timeout.",
"error_code": ErrorCode.CIRCUIT_OPEN,
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| with tracer.start_as_current_span(f"tool:{tool_name}") as span: | ||
| span.set_attribute("tool.name", tool_name) | ||
| span.set_attribute("kubeflow.persona", persona) | ||
| span.set_attribute("correlation_id", cid) |
| breaker.record_failure() | ||
| span.set_attribute("tool.success", False) | ||
| span.set_attribute("tool.duration_ms", duration_ms) | ||
| span.record_exception(exc) |
There was a problem hiding this comment.
@priyank766 can we add span.set_status(StatusCode.ERROR) on exception here ?
span.set_status(Status(StatusCode.ERROR, str(exc)))
There was a problem hiding this comment.
this will make sure to spot failures easily in the trace list, wdyt?
| observability_file = file_config.get("observability", {}) | ||
| observability = ObservabilityConfig( | ||
| otel_endpoint=os.getenv( | ||
| "KUBEFLOW_MCP_OTEL_ENDPOINT", | ||
| observability_file.get("otel_endpoint"), | ||
| ) | ||
| ) |
| if _rate_limiter is not None and not _rate_limiter.acquire(): | ||
| duration_ms = int((time.monotonic() - start) * 1000) | ||
| span.set_attribute("tool.success", False) | ||
| span.set_attribute("tool.duration_ms", duration_ms) | ||
| logger.warning("rate_limited", extra={"tool": tool_name}) | ||
| return { | ||
| "error": "Rate limit exceeded. Retry after a brief pause.", | ||
| "error_code": ErrorCode.RATE_LIMITED, | ||
| } |
| - Install optional dependencies: `pip install ".[otel]"` | ||
| - Enable tracing with CLI flag or env var: | ||
|
|
||
| ```bash | ||
| kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces | ||
| # or | ||
| export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces | ||
| kubeflow-mcp serve |
|
|
||
| cid = with_correlation_id() | ||
| masked = mask_sensitive_data(kwargs) if kwargs else {} | ||
| tracer = get_tracer("kubeflow_mcp.tools") |
There was a problem hiding this comment.
@priyank766 can you move this outside wrapper closure?
get_tracer is cheap but calling it per-invocation seems semantically wrong, wdyt?
| exc_info=True, | ||
| ) | ||
| raise | ||
| with tracer.start_as_current_span(f"tool:{tool_name}") as span: |
There was a problem hiding this comment.
@priyank766 thinking can we use SpanKind.CLIENT for tool spans here?
Tool calls invoke an external service i.e. K8s API. SpanKind.CLIENT is semantically correct and enables better Jaeger dependency graph rendering..
| return None | ||
|
|
||
|
|
||
| class _NoopTracer: |
There was a problem hiding this comment.
_NoopTracer should accept **kwargs here
Future callers passing kind=SpanKind.CLIENT will otherwise get a TypeError in no-op mode.
…hoist tracer, noop kwargs - Use SpanKind.CLIENT for tool spans (K8s API = external service) - Replace manual record_exception() with set_status(Status(StatusCode.ERROR)) to avoid duplicate events and improve Jaeger trace visibility - Move get_tracer() to _audit_wrap scope (once per tool, not per call) - Add **kwargs and set_status() to _NoopTracer/_NoopSpan for API compatibility Signed-off-by: priyank <priyank8445@gmail.com>
|
Thanks for the review @abhijeet-dhumal All suggestions have been addressed. Please take another look when you get a chance. |
86d30a5 to
6428ffe
Compare
| ```bash | ||
| kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces | ||
| # or | ||
| export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces |
There was a problem hiding this comment.
| export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces | |
| export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/traces |
| observability_file = file_config.get("observability", {}) | ||
| observability = ObservabilityConfig( | ||
| otel_endpoint=os.getenv( | ||
| "KUBEFLOW_MCP_OTEL_ENDPOINT", |
There was a problem hiding this comment.
| "KUBEFLOW_MCP_OTEL_ENDPOINT", | |
| "OTEL_EXPORTER_OTLP_ENDPOINT", |
| "--otel-endpoint", | ||
| default=None, | ||
| help="OpenTelemetry OTLP HTTP endpoint for tracing. " | ||
| "Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.", |
There was a problem hiding this comment.
| "Falls back to KUBEFLOW_MCP_OTEL_ENDPOINT env var, config file.", | |
| "Falls back to OTEL_EXPORTER_OTLP_ENDPOINT env var, config file.", |
| with tracer.start_as_current_span( | ||
| f"tool:{tool_name}", **span_kwargs | ||
| ) as span: | ||
| span.set_attribute("tool.name", tool_name) |
There was a problem hiding this comment.
Minor nit: would be great to also surface user.id and mcp.session_id on these spans for per-session/per-user filtering in Jaeger. That will need identity propagation middleware (ContextVars populated from the MCP request context) as a prerequisite..
There was a problem hiding this comment.
happy to follow up with a separate PR for that once this lands. Marking as a non-blocking suggestion.
There was a problem hiding this comment.
Makes sense, having user.id and session_id on the spans would definitely help with debugging. Happy to take a crack at it once this lands, just let me know.
Replace custom KUBEFLOW_MCP_OTEL_ENDPOINT with the official OpenTelemetry env var across README, CLI help text, and config loader. Signed-off-by: priyank <priyank8445@gmail.com>
|
Hey @priyank766 , I tried this out locally against OTel collector stack, works cleanly end-to-end. putting few nits otherwise lgtm.. |
| provider.add_span_processor(processor) | ||
| _otel_trace.set_tracer_provider(provider) | ||
|
|
||
| _tracing_initialized = True |
There was a problem hiding this comment.
Missing atexit.register(provider.shutdown) here. Without it, in-flight spans in the BatchSpanProcessor queue get silently dropped on server shutdown, and the background thread can deadlock if the collector is unreachable. One line:
import atexit
atexit.register(provider.shutdown)
| with tracer.start_as_current_span( | ||
| f"tool:{tool_name}", **span_kwargs | ||
| ) as span: | ||
| span.set_attribute("tool.name", tool_name) |
There was a problem hiding this comment.
Can we also add tool.args_preview here? Without it you can't reconstruct what parameters caused a failure from Jaeger alone.. you'd have to cross-reference the audit log. Something like:
import json
span.set_attribute("tool.args_preview", json.dumps(mask_sensitive_data(kwargs), default=str)[:300])
| span.set_attribute("tool.success", False) | ||
| span.set_attribute("tool.duration_ms", duration_ms) | ||
| logger.warning("circuit_open", extra={"tool": tool_name}) | ||
| return { |
There was a problem hiding this comment.
This path sets tool.success=False on the span but there's no test covering it.
Worth adding a test with _FakeBreaker(can_execute=False) asserting the attributes are set before the early return.
| - Enable tracing with CLI flag or env var: | ||
|
|
||
| ```bash | ||
| kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces |
There was a problem hiding this comment.
Heads up: setup_tracing() auto-appends /v1/traces to whatever is passed, so the example in the README produces http://localhost:4318/v1/traces/v1/traces — double path. Either change the README example to the base URL (http://localhost:4318) or stop auto-appending in code and accept the full path as-is. The latter is less surprising.
|
|
||
| Each tool invocation emits a span with attributes: | ||
| `tool.name`, `tool.success`, `tool.duration_ms`, `kubeflow.persona`, and `correlation_id`. | ||
|
|
There was a problem hiding this comment.
Worth a one-liner noting that kubeflow-mcp agent --otel-endpoint ... emits a separate kubeflow-mcp-agent service in Jaeger. Without this, users running the agent will wonder why they only see server-side spans and think tracing is broken.
…zation, circuit breaker test - Register atexit.shutdown on TracerProvider to flush in-flight spans - Add tool.args_preview span attribute with masked kwargs (truncated to 300 chars) - Auto-append /v1/traces for base URLs to match OTel SDK convention - Add test for circuit breaker open path (tool.success=False) - Document kubeflow-mcp-agent as separate Jaeger service Signed-off-by: priyank <priyank8445@gmail.com>
|
I've addressed all the nits in the latest commit. Let me know if everything looks good!! |
|
Hey @priyank766 👋 |
Description
Adds optional OpenTelemetry tracing for tool calls in Kubeflow MCP Server.
core/telemetry.pywithsetup_tracing()andget_tracer()plus safe no-op fallback when OTel deps are unavailable.core.server._audit_wrapto create one span per tool invocation, set tool/persona/duration/success attributes, attachcorrelation_id, and recordexceptions.
--otel-endpointKUBEFLOW_MCP_OTEL_ENDPOINTobservability.otel_endpointImportant compatibility note:
correlation_idsemantics are preserved and exposed as a span attribute; it is not remapped to OTel trace ID.Type of Change
Checklist
make test-python)make verify)Related Issues
Fixes #18