feat(observability): add OTEL root and client spans for MCP flows#3872
feat(observability): add OTEL root and client spans for MCP flows#3872crivetimihai merged 15 commits intomainfrom
Conversation
e46fcc4 to
fb014ce
Compare
|
Thanks @vishu-bh. Comprehensive OTEL instrumentation across MCP transports — this will be very valuable for production debugging. A few notes:
|
fb014ce to
f7bb46a
Compare
0e2606c to
0bea4be
Compare
lucarlig
left a comment
There was a problem hiding this comment.
- OTEL tracing now disables MCP session pooling for traced requests, because the new outbound MCP paths only use the pool when
not tracing_active. Since these calls run under an active OTEL span, traced SSE/streamable HTTP traffic falls back to per-call session creation instead of pooled reuse. That is a real latency regression and may also change behavior for upstream servers that depend on session-local state. - The new request-root middleware exports raw query strings into OTEL span attributes (
url.query). That bypasses the repo’s normal sanitization flow and creates a new telemetry leakage path. - The same middleware records raw exception text with
error.message = str(exc)instead of going through the existing sanitized error helpers, which can leak sensitive error content into OTEL exports. - Plugin-server spans may get the wrong service name, because the plugin runtime sets
OTEL_SERVICE_NAMEafter importingmcpgateway.observability, but that module initializes the tracer at import time. - This PR also removes the admin tab scroll-reset behavior and deletes its regression tests, which looks unrelated to the observability scope and likely reintroduces a UI regression.
Smaller but still worth fixing:
.secrets.baselinewas fully reserialized in this PR, which adds a lot of unrelated review noise and conflict risk.- There’s no docs/update note for the new W3C trace-header propagation, new root spans, or the tracing-vs-pooling tradeoff.
0bea4be to
6b2a326
Compare
✅ All PR Review Comments AddressedSuccessfully fixed all 7 issues from the review: 🔒 Security Fixes (CRITICAL)
⚡ Performance Fixes (HIGH PRIORITY)
🔧 Technical Fixes
✅ Review Items
📚 Documentation
Files Changed
Ready for re-review. |
6b2a326 to
ab428d6
Compare
a73d9a2 to
c01265e
Compare
crivetimihai
left a comment
There was a problem hiding this comment.
Review Summary
Rebased onto main, resolved conflicts, and addressed several issues found during review. The OTEL integration design is sound — span hierarchy follows semantic conventions, trace context propagation is correctly scoped, and the pooled/non-pooled split is well-reasoned.
Fixes applied (2 commits on top of the original work)
Commit: fix: sanitize OTEL exception messages and remove dead code
- Security:
OpenTelemetryRequestMiddlewareerror handler was writing rawstr(exc)to OTEL spans, bypassing the sanitization pipeline. Now routes throughset_span_error()with_sanitize_span_exception_message(). - Dead code: Removed two standalone
otel_context_active()calls intool_service.pythat discarded their return values (leftover from an earlier design iteration). - Doc fix: Overview claimed "Session Pool Compatibility: Trace context flows through pooled MCP sessions" — corrected to describe the actual trade-off (pooled sessions intentionally skip trace injection).
Commit: fix: add missing /mcp/sse and /mcp/message to OTEL path filter, sanitize middleware span attributes
- Correctness:
_should_trace_request_path()was missing/mcp/sseand/mcp/message— valid public MCP transport endpoints served by the streamable HTTP mount at/mcp. Real MCP traffic on those paths silently bypassed request-root tracing. - Security: All span attributes in
OpenTelemetryRequestMiddlewarenow route throughset_span_attribute()instead of rawspan.set_attribute(), applying the sanitization/redaction pipeline to caller-controlled values (user_agent.original,correlation_id). - Tests: Added parametrized test cases for
/mcp/sse,/mcp/message,/mcp/sse/,/mcp/message/, and/mcp/unknown(negative case).
Known gap (non-blocking)
No regression test exists proving that pooled sessions skip traceparent injection. The code is correct by design (the pooled branch never calls inject_trace_context_headers), but a dedicated test would guard against accidental regression. Worth a follow-up.
What looks good
- Span hierarchy (
mcp.client.call>initialize>request>response) follows OTEL conventions OpenTelemetryRequestMiddlewarecorrectly uses raw ASGI middleware (notBaseHTTPMiddleware) to wrap the full request path including mounted apps- Plugin framework spans (
plugin.hook.invoke,plugin.execute) properly nest and record chain-stopping behavior - W3C trace context propagation only on non-pooled sessions prevents context pollution
session_id/sessionidredaction and&URL terminator fix intrace_redaction.pyare solid- Test coverage: 96% on
observability.py, 90% onruntime.py
|
Added The test ( |
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
…gement_service.py Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
…ssions - Add conditional trace header injection based on session pooling status - Pooled sessions now skip trace header injection to prevent session fixation - Non-pooled sessions continue to inject trace headers for proper tracing - Update observability documentation with security considerations This addresses critical security concerns where trace headers injected into pooled sessions could cause trace context pollution across multiple requests, leading to session fixation-style attacks in distributed tracing systems. Signed-off-by: Vishu <vishu@example.com> Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
- Sanitize exception messages in OpenTelemetryRequestMiddleware error handler using set_span_error() instead of raw str(exc) to prevent leaking sensitive data to OTEL backends - Remove two dead otel_context_active() calls in tool_service.py that discarded their return values (leftover from earlier refactoring) - Remove unused otel_context_active import from tool_service.py - Fix doc inaccuracy: pooled sessions skip trace injection, not propagate it - Update .secrets.baseline for rebased line numbers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ize middleware span attributes - Add /mcp/sse and /mcp/message to _should_trace_request_path() — these are valid public MCP transport endpoints served by the streamable HTTP mount at /mcp, and were silently excluded from request-root tracing - Route all span attributes in OpenTelemetryRequestMiddleware through set_span_attribute() instead of raw span.set_attribute() to apply the sanitization/redaction pipeline consistently (user_agent.original and correlation_id are caller-controlled strings) - Add parametrized tests for /mcp/sse, /mcp/message, /mcp/unknown paths Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Assert that traceparent, tracestate, and X-Correlation-ID are NOT injected into pooled MCP session headers — pooled transports pin headers at creation time, so per-request trace injection would corrupt distributed traces across unrelated requests. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
e464e74 to
f81ee3c
Compare
The test patched `a2a_service.get_agent` directly, but a2a_service is None when A2A is disabled (default in tests). Patch the service object itself and set get_agent on the mock, matching the pattern used by the adjacent list_agents test. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
) * feat: add OTEL root and client spans for MCP flows Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: extend OTEL MCP client trace lifecycle Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: trace plugin hook execution Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: failing ci checks Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: failing linters in pre-commit Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: increasing code coverage Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: remove duplicate docstring from conflict resolution in team_management_service.py Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: rebased and resolved conflicts for OTEL changes Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: address OTEL security and performance issues Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat(observability): prevent trace context pollution in pooled MCP sessions - Add conditional trace header injection based on session pooling status - Pooled sessions now skip trace header injection to prevent session fixation - Non-pooled sessions continue to inject trace headers for proper tracing - Update observability documentation with security considerations This addresses critical security concerns where trace headers injected into pooled sessions could cause trace context pollution across multiple requests, leading to session fixation-style attacks in distributed tracing systems. Signed-off-by: Vishu <vishu@example.com> Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: sanitize OTEL exception messages and remove dead code - Sanitize exception messages in OpenTelemetryRequestMiddleware error handler using set_span_error() instead of raw str(exc) to prevent leaking sensitive data to OTEL backends - Remove two dead otel_context_active() calls in tool_service.py that discarded their return values (leftover from earlier refactoring) - Remove unused otel_context_active import from tool_service.py - Fix doc inaccuracy: pooled sessions skip trace injection, not propagate it - Update .secrets.baseline for rebased line numbers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: add missing /mcp/sse and /mcp/message to OTEL path filter, sanitize middleware span attributes - Add /mcp/sse and /mcp/message to _should_trace_request_path() — these are valid public MCP transport endpoints served by the streamable HTTP mount at /mcp, and were silently excluded from request-root tracing - Route all span attributes in OpenTelemetryRequestMiddleware through set_span_attribute() instead of raw span.set_attribute() to apply the sanitization/redaction pipeline consistently (user_agent.original and correlation_id are caller-controlled strings) - Add parametrized tests for /mcp/sse, /mcp/message, /mcp/unknown paths Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add pooled-session regression test for trace header exclusion Assert that traceparent, tracestate, and X-Correlation-ID are NOT injected into pooled MCP session headers — pooled transports pin headers at creation time, so per-request trace injection would corrupt distributed traces across unrelated requests. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: restore .secrets.baseline from main Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: patch a2a_service object instead of attribute on None The test patched `a2a_service.get_agent` directly, but a2a_service is None when A2A is disabled (default in tests). Patch the service object itself and set get_agent on the mock, matching the pattern used by the adjacent list_agents test. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> Signed-off-by: Vishu <vishu@example.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
) * feat: add OTEL root and client spans for MCP flows Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: extend OTEL MCP client trace lifecycle Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: trace plugin hook execution Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: failing ci checks Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: failing linters in pre-commit Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: increasing code coverage Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: remove duplicate docstring from conflict resolution in team_management_service.py Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: rebased and resolved conflicts for OTEL changes Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: address OTEL security and performance issues Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat(observability): prevent trace context pollution in pooled MCP sessions - Add conditional trace header injection based on session pooling status - Pooled sessions now skip trace header injection to prevent session fixation - Non-pooled sessions continue to inject trace headers for proper tracing - Update observability documentation with security considerations This addresses critical security concerns where trace headers injected into pooled sessions could cause trace context pollution across multiple requests, leading to session fixation-style attacks in distributed tracing systems. Signed-off-by: Vishu <vishu@example.com> Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: sanitize OTEL exception messages and remove dead code - Sanitize exception messages in OpenTelemetryRequestMiddleware error handler using set_span_error() instead of raw str(exc) to prevent leaking sensitive data to OTEL backends - Remove two dead otel_context_active() calls in tool_service.py that discarded their return values (leftover from earlier refactoring) - Remove unused otel_context_active import from tool_service.py - Fix doc inaccuracy: pooled sessions skip trace injection, not propagate it - Update .secrets.baseline for rebased line numbers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: add missing /mcp/sse and /mcp/message to OTEL path filter, sanitize middleware span attributes - Add /mcp/sse and /mcp/message to _should_trace_request_path() — these are valid public MCP transport endpoints served by the streamable HTTP mount at /mcp, and were silently excluded from request-root tracing - Route all span attributes in OpenTelemetryRequestMiddleware through set_span_attribute() instead of raw span.set_attribute() to apply the sanitization/redaction pipeline consistently (user_agent.original and correlation_id are caller-controlled strings) - Add parametrized tests for /mcp/sse, /mcp/message, /mcp/unknown paths Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add pooled-session regression test for trace header exclusion Assert that traceparent, tracestate, and X-Correlation-ID are NOT injected into pooled MCP session headers — pooled transports pin headers at creation time, so per-request trace injection would corrupt distributed traces across unrelated requests. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: restore .secrets.baseline from main Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: patch a2a_service object instead of attribute on None The test patched `a2a_service.get_agent` directly, but a2a_service is None when A2A is disabled (default in tests). Patch the service object itself and set get_agent on the mock, matching the pattern used by the adjacent list_agents test. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> Signed-off-by: Vishu <vishu@example.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
🔗 Related Issue
Closes #3858
Refs #3736
📝 Summary
This PR builds out the gateway-side OpenTelemetry trace tree for MCP traffic and plugin execution.
Implemented so far:
/rpc,/mcp, server-scoped MCP routes, and internal MCP transport hopstraceparent/tracestateinjectionmcp.client.callmcp.client.initializemcp.client.requestmcp.client.responseplugin.hook.invokefor the hook chainplugin.executefor each plugin executionCurrent gateway-side trace shape for an MCP tool call is now roughly:
POST /rpctool.invokeplugin.hook.invoke/plugin.executefor pre-invoke hooks when configuredmcp.client.callmcp.client.initializemcp.client.requestmcp.client.responseplugin.hook.invoke/plugin.executefor post-invoke hooks when configuredUpstream server spans for non-Python services such as
fast-time-serverare not part of this PR. Those services can join the same trace by extractingtraceparent/tracestateand exporting OTEL spans from their own runtime.🏷️ Type of Change
🧪 Verification
make lintmake testmake coverageFocused verification run:
python -m py_compile mcpgateway/observability.py mcpgateway/main.py mcpgateway/plugins/framework/external/mcp/server/runtime.py mcpgateway/services/tool_service.py tests/unit/mcpgateway/test_observability.py tests/unit/mcpgateway/plugins/framework/external/mcp/server/test_runtime_coverage.py tests/unit/mcpgateway/services/test_tool_service.py tests/unit/mcpgateway/plugins/framework/test_observability.pypytest tests/unit/mcpgateway/test_observability.py tests/unit/mcpgateway/plugins/framework/external/mcp/server/test_runtime_coverage.py -qpytest tests/unit/mcpgateway/services/test_tool_service.py -k "streamablehttp_creates_client_lifecycle_spans or invoke_tool_mcp_streamablehttp or prepare_rust_mcp_tool_execution_injects_w3c_trace_context_into_plan_headers" -qpytest tests/unit/mcpgateway/plugins/framework/test_observability.py -q✅ Checklist
make black isort pre-commit)📓 Notes