feat(observability): add OTEL root and client spans for MCP flows #3872

Merged
crivetimihai merged 15 commits into main from otel-phase1-request-root-spans
Apr 3, 2026

@vishu-bh
Collaborator

@vishu-bh vishu-bh commented Mar 26, 2026

🔗 Related Issue

Closes #3858
Refs #3736


📝 Summary

This PR builds out the gateway-side OpenTelemetry trace tree for MCP traffic and plugin execution.

Implemented so far:

  • added OTEL request-root spans for gateway transport paths such as /rpc, /mcp, server-scoped MCP routes, and internal MCP transport hops
  • added shared W3C trace-context helpers and outbound traceparent / tracestate injection
  • added MCP client lifecycle spans on the Python gateway path:
    • mcp.client.call
    • mcp.client.initialize
    • mcp.client.request
    • mcp.client.response
  • added explicit gateway-side response tracing so upstream success is visible in the trace even before the upstream service is instrumented
  • added Python FastMCP upstream runtime support so Python MCP servers can join the distributed trace when OTEL is enabled in that server process
  • added plugin framework tracing at the shared hook dispatch layer:
    • plugin.hook.invoke for the hook chain
    • plugin.execute for each plugin execution
  • recorded plugin stop-chain behavior in trace attributes, including which plugin stopped processing
  • preserved Rust compatibility by continuing to inject W3C trace headers into Rust direct-execution plans
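
As an illustration of the outbound injection described above, here is a minimal standard-library sketch of building a W3C `traceparent` value and adding it to a header dictionary. The helper names (`make_traceparent`, `inject_trace_headers`) and the `vendor=abc` tracestate are assumptions made for this example; they are not the helpers added in this PR, which may rely on the OpenTelemetry propagator API instead.

```python
import secrets


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a W3C traceparent value: version-traceid-parentid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def inject_trace_headers(headers: dict, trace_id: str, span_id: str, tracestate: str = "") -> dict:
    """Return a copy of `headers` with trace context added (hypothetical helper)."""
    out = dict(headers)
    out["traceparent"] = make_traceparent(trace_id, span_id)
    if tracestate:
        # tracestate is optional and carries vendor-specific key=value pairs
        out["tracestate"] = tracestate
    return out


trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 lowercase hex characters
span_id = secrets.token_hex(8)    # 8 random bytes -> 16 lowercase hex characters
outbound = inject_trace_headers({"content-type": "application/json"}, trace_id, span_id, "vendor=abc")
```

Copying the header dictionary rather than mutating it keeps the caller's headers reusable across requests, which matters once pooled sessions enter the picture.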

Current gateway-side trace shape for an MCP tool call is now roughly:

  • POST /rpc
  • tool.invoke
  • plugin.hook.invoke / plugin.execute for pre-invoke hooks when configured
  • mcp.client.call
  • mcp.client.initialize
  • mcp.client.request
  • mcp.client.response
  • plugin.hook.invoke / plugin.execute for post-invoke hooks when configured
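
To make the plugin part of this shape concrete, the following sketch records a `plugin.hook.invoke` span wrapping one `plugin.execute` span per plugin, and notes which plugin stopped the chain. It uses an in-memory list instead of an OTEL exporter; the attribute names, the `tool_pre_invoke` hook name, and the plugin names are all assumptions for illustration, not the ones used in the gateway.

```python
from contextlib import contextmanager

SPANS = []   # finished spans; an in-memory stand-in for an OTEL exporter
_STACK = []  # currently open spans, innermost last


@contextmanager
def span(name):
    """Open a span whose parent is whatever span is currently innermost."""
    record = {"name": name, "parent": _STACK[-1]["name"] if _STACK else None, "attrs": {}}
    _STACK.append(record)
    try:
        yield record
    finally:
        _STACK.pop()
        SPANS.append(record)


def run_hook_chain(hook, plugins, payload):
    """Run (name, callable) plugins in order; a callable returns False to stop the chain."""
    with span("plugin.hook.invoke") as chain:
        chain["attrs"]["plugin.hook"] = hook              # hypothetical attribute name
        chain["attrs"]["plugin.chain.stopped"] = False    # hypothetical attribute name
        for name, fn in plugins:
            with span("plugin.execute") as child:
                child["attrs"]["plugin.name"] = name
                keep_going = fn(payload)
            if not keep_going:
                chain["attrs"]["plugin.chain.stopped"] = True
                chain["attrs"]["plugin.chain.stopped_by"] = name
                break
    return chain


chain = run_hook_chain(
    "tool_pre_invoke",
    [("audit", lambda p: True), ("deny_list", lambda p: False), ("never_runs", lambda p: True)],
    {},
)
```

In this run the third plugin never executes, so the trace contains two `plugin.execute` children and the parent span names `deny_list` as the plugin that stopped processing.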

Upstream server spans for non-Python services such as fast-time-server are not part of this PR. Those services can join the same trace by extracting traceparent / tracestate and exporting OTEL spans from their own runtime.
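
For reference, extraction on the upstream side could look roughly like the sketch below. It is written in Python with only the standard library to show the header handling; a real service would normally use its own runtime's OTEL SDK propagator. The function names are hypothetical, header keys are assumed to be lowercase, and only version `00` of the `traceparent` format is handled.

```python
import re
import secrets

# version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex), lowercase only
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract_trace_context(headers):
    """Return (trace_id, parent_span_id, sampled), or None if the header is absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", "").strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the W3C Trace Context spec
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def start_server_span(headers):
    """Join the caller's trace when possible, otherwise start a new root trace."""
    ctx = extract_trace_context(headers)
    if ctx is None:
        return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8), "parent_id": None}
    trace_id, parent_id, _sampled = ctx
    return {"trace_id": trace_id, "span_id": secrets.token_hex(8), "parent_id": parent_id}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
server_span = start_server_span(incoming)
```

The server span keeps the caller's trace id and uses the caller's span id as its parent, which is what lets the upstream spans appear under `mcp.client.request` in the same trace.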


🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🧪 Verification

| Check | Command | Status |
|---|---|---|
| Lint suite | make lint | Not run |
| Unit tests | make test | Focused tests run |
| Coverage ≥ 80% | make coverage | Not run |

Focused verification run:

  • python -m py_compile mcpgateway/observability.py mcpgateway/main.py mcpgateway/plugins/framework/external/mcp/server/runtime.py mcpgateway/services/tool_service.py tests/unit/mcpgateway/test_observability.py tests/unit/mcpgateway/plugins/framework/external/mcp/server/test_runtime_coverage.py tests/unit/mcpgateway/services/test_tool_service.py tests/unit/mcpgateway/plugins/framework/test_observability.py
  • pytest tests/unit/mcpgateway/test_observability.py tests/unit/mcpgateway/plugins/framework/external/mcp/server/test_runtime_coverage.py -q
  • pytest tests/unit/mcpgateway/services/test_tool_service.py -k "streamablehttp_creates_client_lifecycle_spans or invoke_tool_mcp_streamablehttp or prepare_rust_mcp_tool_execution_injects_w3c_trace_context_into_plan_headers" -q
  • pytest tests/unit/mcpgateway/plugins/framework/test_observability.py -q

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated (if applicable)
  • No secrets or credentials committed

📓 Notes

  • This PR focuses on the gateway and Python-side trace tree. Upstream MCP server instrumentation can be a separate follow-up.

@vishu-bh vishu-bh force-pushed the otel-phase1-request-root-spans branch 2 times, most recently from e46fcc4 to fb014ce on March 27, 2026 11:20
@vishu-bh vishu-bh marked this pull request as ready for review March 27, 2026 11:20
@crivetimihai crivetimihai changed the title feat: add OTEL root and client spans for MCP flows feat(observability): add OTEL root and client spans for MCP flows Mar 29, 2026
@crivetimihai crivetimihai added enhancement New feature or request SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release observability Observability, logging, monitoring labels Mar 29, 2026
@crivetimihai crivetimihai added this to the Release 1.1.0 milestone Mar 29, 2026
@crivetimihai
Member

Thanks @vishu-bh. Comprehensive OTEL instrumentation across MCP transports — this will be very valuable for production debugging. A few notes:

  1. Please check PR #3900 (feat: integrate Langfuse LLM observability via OTEL), which implements much of this same observability surface. Coordinate to avoid duplication and merge conflicts.
  2. Given the breadth of changes (14 files), please ensure all spans follow the naming conventions established in the existing observability middleware.
  3. DCO Signed-off-by is required on all commits.

@vishu-bh vishu-bh force-pushed the otel-phase1-request-root-spans branch from fb014ce to f7bb46a on March 29, 2026 22:23
@jonpspri jonpspri added wxo wxo integration release-fix Critical bugfix required for the release labels Mar 31, 2026
@vishu-bh vishu-bh force-pushed the otel-phase1-request-root-spans branch 2 times, most recently from 0e2606c to 0bea4be on April 1, 2026 13:13
@vishu-bh vishu-bh requested a review from lucarlig April 1, 2026 13:18
Collaborator

@lucarlig lucarlig left a comment

  • OTEL tracing now disables MCP session pooling for traced requests, because the new outbound MCP paths only use the pool when not tracing_active. Since these calls run under an active OTEL span, traced SSE/streamable HTTP traffic falls back to per-call session creation instead of pooled reuse. That is a real latency regression and may also change behavior for upstream servers that depend on session-local state.
  • The new request-root middleware exports raw query strings into OTEL span attributes (url.query). That bypasses the repo’s normal sanitization flow and creates a new telemetry leakage path.
  • The same middleware records raw exception text with error.message = str(exc) instead of going through the existing sanitized error helpers, which can leak sensitive error content into OTEL exports.
  • Plugin-server spans may get the wrong service name, because the plugin runtime sets OTEL_SERVICE_NAME after importing mcpgateway.observability, but that module initializes the tracer at import time.
  • This PR also removes the admin tab scroll-reset behavior and deletes its regression tests, which looks unrelated to the observability scope and likely reintroduces a UI regression.

Smaller but still worth fixing:

  • .secrets.baseline was fully reserialized in this PR, which adds a lot of unrelated review noise and conflict risk.
  • There’s no docs/update note for the new W3C trace-header propagation, new root spans, or the tracing-vs-pooling tradeoff.

@vishu-bh vishu-bh force-pushed the otel-phase1-request-root-spans branch from 0bea4be to 6b2a326 on April 1, 2026 14:59
@vishu-bh
Collaborator Author

vishu-bh commented Apr 1, 2026

✅ All PR Review Comments Addressed

Successfully fixed all 7 issues from the review:

🔒 Security Fixes (CRITICAL)

  • Query string sanitization - Added sanitize_trace_text() to prevent credential leakage in OTEL spans
  • Exception message sanitization - Applied sanitization to prevent sensitive error details in traces
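
A rough sketch of the kind of query-string redaction this refers to is shown below. It is not the repository's sanitize_trace_text() implementation; the function name `redact_query`, the key list, and the `REDACTED` placeholder are assumptions chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical list of parameter names treated as sensitive
SENSITIVE_KEYS = {"token", "access_token", "api_key", "password", "secret", "session_id", "sessionid"}


def redact_query(query: str) -> str:
    """Replace the values of sensitive query parameters before they reach a span attribute."""
    pairs = parse_qsl(query, keep_blank_values=True)
    cleaned = [(key, "REDACTED" if key.lower() in SENSITIVE_KEYS else value) for key, value in pairs]
    return urlencode(cleaned)


safe_query = redact_query("tool=clock&access_token=abc123&session_id=xyz")
```

Parsing the query into pairs first, rather than applying a regex to the raw string, avoids mistakes around the `&` separator and percent-encoded values.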

⚡ Performance Fixes (HIGH PRIORITY)

  • MCP session pooling with tracing - Removed the "and not tracing_active" condition and added trace context injection before pooling
  • Session pool now works with OTEL tracing (10-20x latency improvement restored)
  • W3C trace context properly propagates through pooled sessions

🔧 Technical Fixes

  • Plugin server service name - Moved OTEL_SERVICE_NAME setup before observability import
  • Plugin spans now show correct service identification
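
The ordering problem behind this fix can be reproduced without OTEL at all. In the sketch below, `TracerModule` is a hypothetical stand-in for a module that reads OTEL_SERVICE_NAME once when it is imported; the service names are invented for the example.

```python
import os


class TracerModule:
    """Stand-in for a module that captures OTEL_SERVICE_NAME at import time."""

    def __init__(self):
        # Read once; later changes to the environment are not seen
        self.service_name = os.environ.get("OTEL_SERVICE_NAME", "default-gateway")


os.environ.pop("OTEL_SERVICE_NAME", None)

# Wrong order: "import" first, set the variable afterwards
late = TracerModule()
os.environ["OTEL_SERVICE_NAME"] = "plugin-server"

# Right order: the variable is already set when the "import" happens
early = TracerModule()
```

Here `late.service_name` stays at the default while `early.service_name` picks up `plugin-server`, which mirrors why the environment variable has to be set before the observability module is imported.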

✅ Review Items

  • Admin tab scroll-reset - No changes found in this PR (false alarm)
  • .secrets.baseline - Only formatting/timestamp changes (acceptable)

📚 Documentation

  • Created comprehensive OpenTelemetry integration guide at docs/docs/architecture/observability-otel.md
  • Covers W3C trace propagation, session pooling design, security considerations, configuration, and troubleshooting

Files Changed

  • mcpgateway/middleware/observability_middleware.py (security fixes)
  • mcpgateway/services/tool_service.py (session pooling fix)
  • mcpgateway/plugins/framework/external/mcp/server/runtime.py (service name fix)
  • docs/docs/architecture/.pages (navigation update)
  • docs/docs/architecture/observability-otel.md (new documentation)

Ready for re-review.

@vishu-bh vishu-bh requested a review from lucarlig April 1, 2026 15:03
@vishu-bh vishu-bh force-pushed the otel-phase1-request-root-spans branch from 6b2a326 to ab428d6 on April 2, 2026 08:53
@vishu-bh vishu-bh added the ready Validated, ready-to-work-on items label Apr 3, 2026
@crivetimihai crivetimihai force-pushed the otel-phase1-request-root-spans branch from a73d9a2 to c01265e on April 3, 2026 12:39
crivetimihai previously approved these changes Apr 3, 2026
Member

@crivetimihai crivetimihai left a comment

Review Summary

Rebased onto main, resolved conflicts, and addressed several issues found during review. The OTEL integration design is sound — span hierarchy follows semantic conventions, trace context propagation is correctly scoped, and the pooled/non-pooled split is well-reasoned.

Fixes applied (2 commits on top of the original work)

Commit: fix: sanitize OTEL exception messages and remove dead code

  • Security: OpenTelemetryRequestMiddleware error handler was writing raw str(exc) to OTEL spans, bypassing the sanitization pipeline. Now routes through set_span_error() with _sanitize_span_exception_message().
  • Dead code: Removed two standalone otel_context_active() calls in tool_service.py that discarded their return values (leftover from an earlier design iteration).
  • Doc fix: Overview claimed "Session Pool Compatibility: Trace context flows through pooled MCP sessions" — corrected to describe the actual trade-off (pooled sessions intentionally skip trace injection).

Commit: fix: add missing /mcp/sse and /mcp/message to OTEL path filter, sanitize middleware span attributes

  • Correctness: _should_trace_request_path() was missing /mcp/sse and /mcp/message — valid public MCP transport endpoints served by the streamable HTTP mount at /mcp. Real MCP traffic on those paths silently bypassed request-root tracing.
  • Security: All span attributes in OpenTelemetryRequestMiddleware now route through set_span_attribute() instead of raw span.set_attribute(), applying the sanitization/redaction pipeline to caller-controlled values (user_agent.original, correlation_id).
  • Tests: Added parametrized test cases for /mcp/sse, /mcp/message, /mcp/sse/, /mcp/message/, and /mcp/unknown (negative case).
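
For readers who have not opened the diff, a path filter with this behavior could be sketched as follows. This is an assumption about the shape of the check, not the actual body of _should_trace_request_path(); in particular it uses exact matching after trailing-slash normalization and leaves out the server-scoped MCP routes, whose URL pattern is not shown in this thread.

```python
# Hypothetical set of traced transport paths, based on the endpoints named in this review
TRACED_PATHS = {"/rpc", "/mcp", "/mcp/sse", "/mcp/message"}


def should_trace(path: str) -> bool:
    """Return True for known MCP transport paths, tolerating one or more trailing slashes."""
    normalized = path.rstrip("/") or "/"
    return normalized in TRACED_PATHS


cases = {p: should_trace(p) for p in ["/mcp/sse", "/mcp/message", "/mcp/sse/", "/mcp/message/", "/mcp/unknown"]}
```

Exact matching is what makes `/mcp/unknown` a negative case; a plain prefix check on `/mcp` would have traced it.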

Known gap (non-blocking)

No regression test exists proving that pooled sessions skip traceparent injection. The code is correct by design (the pooled branch never calls inject_trace_context_headers), but a dedicated test would guard against accidental regression. Worth a follow-up.

What looks good

  • Span hierarchy (mcp.client.call > initialize > request > response) follows OTEL conventions
  • OpenTelemetryRequestMiddleware correctly uses raw ASGI middleware (not BaseHTTPMiddleware) to wrap the full request path including mounted apps
  • Plugin framework spans (plugin.hook.invoke, plugin.execute) properly nest and record chain-stopping behavior
  • W3C trace context propagation only on non-pooled sessions prevents context pollution
  • The session_id/sessionid redaction and the "&" URL-terminator fix in trace_redaction.py are solid
  • Test coverage: 96% on observability.py, 90% on runtime.py
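
To illustrate the raw-ASGI point, here is a minimal sketch of a request-root middleware written directly against the ASGI callable interface, with no framework dependency. The class name, the span dictionary, and the recorder list are hypothetical; the real OpenTelemetryRequestMiddleware creates OTEL spans and applies sanitization.

```python
import asyncio


class RequestRootSpanMiddleware:
    """Raw ASGI middleware: wraps the whole downstream app, including mounted sub-apps."""

    def __init__(self, app, recorder):
        self.app = app
        self.recorder = recorder  # list used here in place of a span exporter

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        span = {"name": f"{scope['method']} {scope['path']}", "status": None}

        async def send_wrapper(message):
            # The status code is only known when the response starts
            if message["type"] == "http.response.start":
                span["status"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.recorder.append(span)


async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def run_once():
    recorded, sent = [], []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    app = RequestRootSpanMiddleware(demo_app, recorded)
    await app({"type": "http", "method": "POST", "path": "/rpc"}, receive, send)
    return recorded, sent


recorded, sent = asyncio.run(run_once())
```

Because the middleware only forwards `scope`, `receive`, and `send`, streaming responses pass through message by message and the span closes when the downstream app returns.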

@crivetimihai
Member

Added test: add pooled-session regression test for trace header exclusion — this closes the last open suggestion from the Codex review.

The test (test_invoke_tool_mcp_pooled_path_does_not_inject_trace_headers) exercises the actual pooled code path with mcp_session_pool_enabled = True and a mock pool, then asserts that traceparent, tracestate, and X-Correlation-ID are not present in the headers passed to pool.session(). This guards against accidental regression if someone adds inject_trace_context_headers() to the pooled branch in the future.
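
The idea behind that assertion can be sketched in a few lines. The example below is a simplified, synchronous stand-in: `build_upstream_headers`, `call_via_pool`, and the `get_time` tool name are hypothetical, and the real test drives the async tool service rather than these helpers.

```python
from unittest.mock import MagicMock

TRACE_HEADERS = {"traceparent", "tracestate", "x-correlation-id"}


def build_upstream_headers(base, trace_headers, pooled):
    """Add per-request trace headers only when the session is not pooled."""
    headers = dict(base)
    if not pooled:
        headers.update(trace_headers)
    return headers


def call_via_pool(pool, base, trace_headers):
    """Pooled branch: headers are fixed when the session is created, so no trace headers."""
    headers = build_upstream_headers(base, trace_headers, pooled=True)
    with pool.session(headers=headers) as session:
        return session.call_tool("get_time")


pool = MagicMock()  # MagicMock supports the context-manager protocol out of the box
trace = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "vendor=abc",
    "X-Correlation-ID": "req-1",
}
call_via_pool(pool, {"accept": "application/json"}, trace)

passed = pool.session.call_args.kwargs["headers"]
leaked = {key for key in passed if key.lower() in TRACE_HEADERS}
```

An empty `leaked` set is the property the regression test protects; comparing keys case-insensitively avoids missing a header that differs only in capitalization.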

vishu-bh and others added 14 commits April 3, 2026 14:52
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
…gement_service.py

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
…ssions

- Add conditional trace header injection based on session pooling status
- Pooled sessions now skip trace header injection to prevent session fixation
- Non-pooled sessions continue to inject trace headers for proper tracing
- Update observability documentation with security considerations

This addresses critical security concerns where trace headers injected into
pooled sessions could cause trace context pollution across multiple requests,
leading to session fixation-style attacks in distributed tracing systems.

Signed-off-by: Vishu <vishu@example.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
- Sanitize exception messages in OpenTelemetryRequestMiddleware error
  handler using set_span_error() instead of raw str(exc) to prevent
  leaking sensitive data to OTEL backends
- Remove two dead otel_context_active() calls in tool_service.py that
  discarded their return values (leftover from earlier refactoring)
- Remove unused otel_context_active import from tool_service.py
- Fix doc inaccuracy: pooled sessions skip trace injection, not
  propagate it
- Update .secrets.baseline for rebased line numbers

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ize middleware span attributes

- Add /mcp/sse and /mcp/message to _should_trace_request_path() — these
  are valid public MCP transport endpoints served by the streamable HTTP
  mount at /mcp, and were silently excluded from request-root tracing
- Route all span attributes in OpenTelemetryRequestMiddleware through
  set_span_attribute() instead of raw span.set_attribute() to apply
  the sanitization/redaction pipeline consistently (user_agent.original
  and correlation_id are caller-controlled strings)
- Add parametrized tests for /mcp/sse, /mcp/message, /mcp/unknown paths

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Assert that traceparent, tracestate, and X-Correlation-ID are NOT
injected into pooled MCP session headers — pooled transports pin
headers at creation time, so per-request trace injection would
corrupt distributed traces across unrelated requests.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai force-pushed the otel-phase1-request-root-spans branch 2 times, most recently from e464e74 to f81ee3c on April 3, 2026 13:54
The test patched `a2a_service.get_agent` directly, but a2a_service is
None when A2A is disabled (default in tests). Patch the service object
itself and set get_agent on the mock, matching the pattern used by the
adjacent list_agents test.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
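
The failure mode described in this commit message is easy to reproduce in isolation. In the sketch below, `main` is a hypothetical stand-in for the module that holds `a2a_service`, and `lookup_agent` is an invented caller; only the patching pattern is the point.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock, patch

# Stand-in for a module where a2a_service is None because A2A is disabled
main = SimpleNamespace(a2a_service=None)


def lookup_agent(agent_id):
    if main.a2a_service is None:
        return None
    return main.a2a_service.get_agent(agent_id)


# Patching an attribute on None fails, because None has no get_agent to replace
try:
    with patch.object(main.a2a_service, "get_agent"):
        pass
    patch_on_none_failed = False
except AttributeError:
    patch_on_none_failed = True

# Patching the service object itself works, and get_agent is configured on the mock
fake_service = MagicMock()
fake_service.get_agent.return_value = {"id": "agent-1"}
with patch.object(main, "a2a_service", fake_service):
    result = lookup_agent("agent-1")
```

After the `with` block exits, `main.a2a_service` is restored to `None`, so the test leaves no state behind for neighbouring tests.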
@crivetimihai crivetimihai merged commit ab59307 into main Apr 3, 2026
27 checks passed
@crivetimihai crivetimihai deleted the otel-phase1-request-root-spans branch April 3, 2026 14:49
jonpspri pushed a commit that referenced this pull request Apr 10, 2026
)

* feat: add OTEL root and client spans for MCP flows

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* feat: extend OTEL MCP client trace lifecycle

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* feat: trace plugin hook execution

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* fix: failing ci checks

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* fix: failing linters in pre-commit

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* fix: increasing code coverage

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* fix: remove duplicate docstring from conflict resolution in team_management_service.py

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* feat: rebased and resolved conflicts for OTEL changes

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* fix: address OTEL security and performance issues

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* feat(observability): prevent trace context pollution in pooled MCP sessions

- Add conditional trace header injection based on session pooling status
- Pooled sessions now skip trace header injection to prevent session fixation
- Non-pooled sessions continue to inject trace headers for proper tracing
- Update observability documentation with security considerations

This addresses critical security concerns where trace headers injected into
pooled sessions could cause trace context pollution across multiple requests,
leading to session fixation-style attacks in distributed tracing systems.

Signed-off-by: Vishu <vishu@example.com>
Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

* fix: sanitize OTEL exception messages and remove dead code

- Sanitize exception messages in OpenTelemetryRequestMiddleware error
  handler using set_span_error() instead of raw str(exc) to prevent
  leaking sensitive data to OTEL backends
- Remove two dead otel_context_active() calls in tool_service.py that
  discarded their return values (leftover from earlier refactoring)
- Remove unused otel_context_active import from tool_service.py
- Fix doc inaccuracy: pooled sessions skip trace injection, not
  propagate it
- Update .secrets.baseline for rebased line numbers

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: add missing /mcp/sse and /mcp/message to OTEL path filter, sanitize middleware span attributes

- Add /mcp/sse and /mcp/message to _should_trace_request_path() — these
  are valid public MCP transport endpoints served by the streamable HTTP
  mount at /mcp, and were silently excluded from request-root tracing
- Route all span attributes in OpenTelemetryRequestMiddleware through
  set_span_attribute() instead of raw span.set_attribute() to apply
  the sanitization/redaction pipeline consistently (user_agent.original
  and correlation_id are caller-controlled strings)
- Add parametrized tests for /mcp/sse, /mcp/message, /mcp/unknown paths

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: add pooled-session regression test for trace header exclusion

Assert that traceparent, tracestate, and X-Correlation-ID are NOT
injected into pooled MCP session headers — pooled transports pin
headers at creation time, so per-request trace injection would
corrupt distributed traces across unrelated requests.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: restore .secrets.baseline from main

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: patch a2a_service object instead of attribute on None

The test patched `a2a_service.get_agent` directly, but a2a_service is
None when A2A is disabled (default in tests). Patch the service object
itself and set get_agent on the mock, matching the pattern used by the
adjacent list_agents test.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>
Signed-off-by: Vishu <vishu@example.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
claudia-gray pushed a commit that referenced this pull request Apr 13, 2026

Labels

enhancement New feature or request MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe observability Observability, logging, monitoring ready Validated, ready-to-work-on items release-fix Critical bugfix required for the release wxo wxo integration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TASK][OBSERVABILITY]: Add end-to-end OTEL trace trees and W3C propagation for gateway to MCP servers

4 participants