feat(observability): add OTEL root and client spans for MCP flows by vishu-bh · Pull Request #3872 · IBM/mcp-context-forge

vishu-bh · 2026-03-26T12:02:35Z

🔗 Related Issue

Closes #3858
Refs #3736

📝 Summary

This PR builds out the gateway-side OpenTelemetry trace tree for MCP traffic and plugin execution.

Implemented so far:

added OTEL request-root spans for gateway transport paths such as /rpc, /mcp, server-scoped MCP routes, and internal MCP transport hops
added shared W3C trace-context helpers and outbound traceparent / tracestate injection
added MCP client lifecycle spans on the Python gateway path:
- mcp.client.call
- mcp.client.initialize
- mcp.client.request
- mcp.client.response
added explicit gateway-side response tracing so upstream success is visible in the trace even before the upstream service is instrumented
added Python FastMCP upstream runtime support so Python MCP servers can join the distributed trace when OTEL is enabled in that server process
added plugin framework tracing at the shared hook dispatch layer:
- plugin.hook.invoke for the hook chain
- plugin.execute for each plugin execution
recorded plugin stop-chain behavior in trace attributes, including which plugin stopped processing
preserved Rust compatibility by continuing to inject W3C trace headers into Rust direct-execution plans

Current gateway-side trace shape for an MCP tool call is now roughly:

POST /rpc
tool.invoke
plugin.hook.invoke / plugin.execute for pre-invoke hooks when configured
mcp.client.call
mcp.client.initialize
mcp.client.request
mcp.client.response
plugin.hook.invoke / plugin.execute for post-invoke hooks when configured

Upstream server spans for non-Python services such as fast-time-server are not part of this PR. Those services can join the same trace by extracting traceparent / tracestate and exporting OTEL spans from their own runtime.

🏷️ Type of Change

🧪 Verification

Check	Command	Status
Lint suite	`make lint`	Not run
Unit tests	`make test`	Focused tests run
Coverage ≥ 80%	`make coverage`	Not run

Focused verification run:

python -m py_compile mcpgateway/observability.py mcpgateway/main.py mcpgateway/plugins/framework/external/mcp/server/runtime.py mcpgateway/services/tool_service.py tests/unit/mcpgateway/test_observability.py tests/unit/mcpgateway/plugins/framework/external/mcp/server/test_runtime_coverage.py tests/unit/mcpgateway/services/test_tool_service.py tests/unit/mcpgateway/plugins/framework/test_observability.py
pytest tests/unit/mcpgateway/test_observability.py tests/unit/mcpgateway/plugins/framework/external/mcp/server/test_runtime_coverage.py -q
pytest tests/unit/mcpgateway/services/test_tool_service.py -k "streamablehttp_creates_client_lifecycle_spans or invoke_tool_mcp_streamablehttp or prepare_rust_mcp_tool_execution_injects_w3c_trace_context_into_plan_headers" -q
pytest tests/unit/mcpgateway/plugins/framework/test_observability.py -q

✅ Checklist

Code formatted (make black isort pre-commit)
Tests added/updated for changes
Documentation updated (if applicable)
No secrets or credentials committed

📓 Notes

This PR focuses on the gateway and Python-side trace tree. Upstream MCP server instrumentation can be a separate follow-up.

crivetimihai · 2026-03-29T12:37:21Z

Thanks @vishu-bh. Comprehensive OTEL instrumentation across MCP transports — this will be very valuable for production debugging. A few notes:

Please check PR feat: integrate Langfuse LLM observability via OTEL #3900 (feat: integrate Langfuse LLM observability via OTEL) which implements much of this same observability surface. Coordinate to avoid duplication and merge conflicts.
Given the breadth of changes (14 files), please ensure all spans follow the naming conventions established in the existing observability middleware.
DCO Signed-off-by is required on all commits.

lucarlig

OTEL tracing now disables MCP session pooling for traced requests, because the new outbound MCP paths only use the pool when not tracing_active. Since these calls run under an active OTEL span, traced SSE/streamable HTTP traffic falls back to per-call session creation instead of pooled reuse. That is a real latency regression and may also change behavior for upstream servers that depend on session-local state.
The new request-root middleware exports raw query strings into OTEL span attributes (url.query). That bypasses the repo’s normal sanitization flow and creates a new telemetry leakage path.
The same middleware records raw exception text with error.message = str(exc) instead of going through the existing sanitized error helpers, which can leak sensitive error content into OTEL exports.
Plugin-server spans may get the wrong service name, because the plugin runtime sets OTEL_SERVICE_NAME after importing mcpgateway.observability, but that module initializes the tracer at import time.
This PR also removes the admin tab scroll-reset behavior and deletes its regression tests, which looks unrelated to the observability scope and likely reintroduces a UI regression.

Smaller but still worth fixing:

.secrets.baseline was fully reserialized in this PR, which adds a lot of unrelated review noise and conflict risk.
There’s no docs/update note for the new W3C trace-header propagation, new root spans, or the tracing-vs-pooling tradeoff.

vishu-bh · 2026-04-01T15:02:23Z

✅ All PR Review Comments Addressed

Successfully fixed all 7 issues from the review:

🔒 Security Fixes (CRITICAL)

Query string sanitization - Added sanitize_trace_text() to prevent credential leakage in OTEL spans
Exception message sanitization - Applied sanitization to prevent sensitive error details in traces

⚡ Performance Fixes (HIGH PRIORITY)

MCP session pooling with tracing - Removed and not tracing_active condition and added trace context injection before pooling
Session pool now works with OTEL tracing (10-20x latency improvement restored)
W3C trace context properly propagates through pooled sessions

🔧 Technical Fixes

Plugin server service name - Moved OTEL_SERVICE_NAME setup before observability import
Plugin spans now show correct service identification

✅ Review Items

Admin tab scroll-reset - No changes found in this PR (false alarm)
.secrets.baseline - Only formatting/timestamp changes (acceptable)

📚 Documentation

Created comprehensive OpenTelemetry integration guide at docs/docs/architecture/observability-otel.md
Covers W3C trace propagation, session pooling design, security considerations, configuration, and troubleshooting

Files Changed

mcpgateway/middleware/observability_middleware.py (security fixes)
mcpgateway/services/tool_service.py (session pooling fix)
mcpgateway/plugins/framework/external/mcp/server/runtime.py (service name fix)
docs/docs/architecture/.pages (navigation update)
docs/docs/architecture/observability-otel.md (new documentation)

Ready for re-review.

crivetimihai

Review Summary

Rebased onto main, resolved conflicts, and addressed several issues found during review. The OTEL integration design is sound — span hierarchy follows semantic conventions, trace context propagation is correctly scoped, and the pooled/non-pooled split is well-reasoned.

Fixes applied (2 commits on top of the original work)

Commit: fix: sanitize OTEL exception messages and remove dead code

Security: OpenTelemetryRequestMiddleware error handler was writing raw str(exc) to OTEL spans, bypassing the sanitization pipeline. Now routes through set_span_error() with _sanitize_span_exception_message().
Dead code: Removed two standalone otel_context_active() calls in tool_service.py that discarded their return values (leftover from an earlier design iteration).
Doc fix: Overview claimed "Session Pool Compatibility: Trace context flows through pooled MCP sessions" — corrected to describe the actual trade-off (pooled sessions intentionally skip trace injection).

Commit: fix: add missing /mcp/sse and /mcp/message to OTEL path filter, sanitize middleware span attributes

Correctness: _should_trace_request_path() was missing /mcp/sse and /mcp/message — valid public MCP transport endpoints served by the streamable HTTP mount at /mcp. Real MCP traffic on those paths silently bypassed request-root tracing.
Security: All span attributes in OpenTelemetryRequestMiddleware now route through set_span_attribute() instead of raw span.set_attribute(), applying the sanitization/redaction pipeline to caller-controlled values (user_agent.original, correlation_id).
Tests: Added parametrized test cases for /mcp/sse, /mcp/message, /mcp/sse/, /mcp/message/, and /mcp/unknown (negative case).

Known gap (non-blocking)

No regression test exists proving that pooled sessions skip traceparent injection. The code is correct by design (the pooled branch never calls inject_trace_context_headers), but a dedicated test would guard against accidental regression. Worth a follow-up.

What looks good

Span hierarchy (mcp.client.call > initialize > request > response) follows OTEL conventions
OpenTelemetryRequestMiddleware correctly uses raw ASGI middleware (not BaseHTTPMiddleware) to wrap the full request path including mounted apps
Plugin framework spans (plugin.hook.invoke, plugin.execute) properly nest and record chain-stopping behavior
W3C trace context propagation only on non-pooled sessions prevents context pollution
session_id/sessionid redaction and & URL terminator fix in trace_redaction.py are solid
Test coverage: 96% on observability.py, 90% on runtime.py

crivetimihai · 2026-04-03T13:02:37Z

Added test: add pooled-session regression test for trace header exclusion — this closes the last open suggestion from the Codex review.

The test (test_invoke_tool_mcp_pooled_path_does_not_inject_trace_headers) exercises the actual pooled code path with mcp_session_pool_enabled = True and a mock pool, then asserts that traceparent, tracestate, and X-Correlation-ID are not present in the headers passed to pool.session(). This guards against accidental regression if someone adds inject_trace_context_headers() to the pooled branch in the future.

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

…gement_service.py Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

…ssions - Add conditional trace header injection based on session pooling status - Pooled sessions now skip trace header injection to prevent session fixation - Non-pooled sessions continue to inject trace headers for proper tracing - Update observability documentation with security considerations This addresses critical security concerns where trace headers injected into pooled sessions could cause trace context pollution across multiple requests, leading to session fixation-style attacks in distributed tracing systems. Signed-off-by: Vishu <vishu@example.com> Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

- Sanitize exception messages in OpenTelemetryRequestMiddleware error handler using set_span_error() instead of raw str(exc) to prevent leaking sensitive data to OTEL backends - Remove two dead otel_context_active() calls in tool_service.py that discarded their return values (leftover from earlier refactoring) - Remove unused otel_context_active import from tool_service.py - Fix doc inaccuracy: pooled sessions skip trace injection, not propagate it - Update .secrets.baseline for rebased line numbers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

…ize middleware span attributes - Add /mcp/sse and /mcp/message to _should_trace_request_path() — these are valid public MCP transport endpoints served by the streamable HTTP mount at /mcp, and were silently excluded from request-root tracing - Route all span attributes in OpenTelemetryRequestMiddleware through set_span_attribute() instead of raw span.set_attribute() to apply the sanitization/redaction pipeline consistently (user_agent.original and correlation_id are caller-controlled strings) - Add parametrized tests for /mcp/sse, /mcp/message, /mcp/unknown paths Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Assert that traceparent, tracestate, and X-Correlation-ID are NOT injected into pooled MCP session headers — pooled transports pin headers at creation time, so per-request trace injection would corrupt distributed traces across unrelated requests. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

The test patched `a2a_service.get_agent` directly, but a2a_service is None when A2A is disabled (default in tests). Patch the service object itself and set get_agent on the mock, matching the pattern used by the adjacent list_agents test. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

) * feat: add OTEL root and client spans for MCP flows Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: extend OTEL MCP client trace lifecycle Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: trace plugin hook execution Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: failing ci checks Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: failing linters in pre-commit Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: increasing code coverage Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: remove duplicate docstring from conflict resolution in team_management_service.py Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat: rebased and resolved conflicts for OTEL changes Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: address OTEL security and performance issues Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * feat(observability): prevent trace context pollution in pooled MCP sessions - Add conditional trace header injection based on session pooling status - Pooled sessions now skip trace header injection to prevent session fixation - Non-pooled sessions continue to inject trace headers for proper tracing - Update observability documentation with security considerations This addresses critical security concerns where trace headers injected into pooled sessions could cause trace context pollution across multiple requests, leading to session fixation-style attacks in distributed tracing systems. Signed-off-by: Vishu <vishu@example.com> Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> * fix: sanitize OTEL exception messages and remove dead code - Sanitize exception messages in OpenTelemetryRequestMiddleware error handler using set_span_error() instead of raw str(exc) to prevent leaking sensitive data to OTEL backends - Remove two dead otel_context_active() calls in tool_service.py that discarded their return values (leftover from earlier refactoring) - Remove unused otel_context_active import from tool_service.py - Fix doc inaccuracy: pooled sessions skip trace injection, not propagate it - Update .secrets.baseline for rebased line numbers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: add missing /mcp/sse and /mcp/message to OTEL path filter, sanitize middleware span attributes - Add /mcp/sse and /mcp/message to _should_trace_request_path() — these are valid public MCP transport endpoints served by the streamable HTTP mount at /mcp, and were silently excluded from request-root tracing - Route all span attributes in OpenTelemetryRequestMiddleware through set_span_attribute() instead of raw span.set_attribute() to apply the sanitization/redaction pipeline consistently (user_agent.original and correlation_id are caller-controlled strings) - Add parametrized tests for /mcp/sse, /mcp/message, /mcp/unknown paths Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add pooled-session regression test for trace header exclusion Assert that traceparent, tracestate, and X-Correlation-ID are NOT injected into pooled MCP session headers — pooled transports pin headers at creation time, so per-request trace injection would corrupt distributed traces across unrelated requests. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: restore .secrets.baseline from main Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: patch a2a_service object instead of attribute on None The test patched `a2a_service.get_agent` directly, but a2a_service is None when A2A is disabled (default in tests). Patch the service object itself and set get_agent on the mock, matching the pattern used by the adjacent list_agents test. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com> Signed-off-by: Vishu <vishu@example.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>

vishu-bh force-pushed the otel-phase1-request-root-spans branch 2 times, most recently from e46fcc4 to fb014ce Compare March 27, 2026 11:20

vishu-bh marked this pull request as ready for review March 27, 2026 11:20

vishu-bh requested review from araujof, crivetimihai, jonpspri, kevalmahajan, madhav165 and terylt as code owners March 27, 2026 11:20

crivetimihai changed the title ~~feat: add OTEL root and client spans for MCP flows~~ feat(observability): add OTEL root and client spans for MCP flows Mar 29, 2026

crivetimihai added enhancement New feature or request SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release observability Observability, logging, monitoring labels Mar 29, 2026

crivetimihai added this to the Release 1.1.0 milestone Mar 29, 2026

vishu-bh force-pushed the otel-phase1-request-root-spans branch from fb014ce to f7bb46a Compare March 29, 2026 22:23

vishu-bh requested review from Lang-Akshay and brian-hussey March 30, 2026 08:24

jonpspri added wxo wxo integration release-fix Critical bugfix required for the release labels Mar 31, 2026

jonpspri modified the milestones: Release 1.1.0, Release 1.0.0-RC3 Mar 31, 2026

vishu-bh force-pushed the otel-phase1-request-root-spans branch 2 times, most recently from 0e2606c to 0bea4be Compare April 1, 2026 13:13

vishu-bh requested a review from lucarlig April 1, 2026 13:18

lucarlig requested changes Apr 1, 2026

View reviewed changes

vishu-bh force-pushed the otel-phase1-request-root-spans branch from 0bea4be to 6b2a326 Compare April 1, 2026 14:59

vishu-bh requested a review from lucarlig April 1, 2026 15:03

vishu-bh force-pushed the otel-phase1-request-root-spans branch from 6b2a326 to ab428d6 Compare April 2, 2026 08:53

vishu-bh assigned crivetimihai Apr 2, 2026

vishu-bh added the ready Validated, ready-to-work-on items label Apr 3, 2026

crivetimihai dismissed lucarlig’s stale review via c01265e April 3, 2026 12:39

crivetimihai force-pushed the otel-phase1-request-root-spans branch from a73d9a2 to c01265e Compare April 3, 2026 12:39

crivetimihai previously approved these changes Apr 3, 2026

View reviewed changes

crivetimihai dismissed their stale review via 6d16846 April 3, 2026 13:02

vishu-bh and others added 14 commits April 3, 2026 14:52

feat: add OTEL root and client spans for MCP flows

6826884

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

feat: extend OTEL MCP client trace lifecycle

b554ee9

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

feat: trace plugin hook execution

2443f36

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

fix: failing ci checks

21ed4c5

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

fix: failing linters in pre-commit

44da303

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

fix: increasing code coverage

003a97f

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

fix: remove duplicate docstring from conflict resolution in team_mana…

c3068d9

…gement_service.py Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

feat: rebased and resolved conflicts for OTEL changes

1d1f43f

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

fix: address OTEL security and performance issues

d53a303

Signed-off-by: Vishu Bhatnagar <vishu.bhatnagar@ibm.com>

fix: restore .secrets.baseline from main

e464e74

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

crivetimihai force-pushed the otel-phase1-request-root-spans branch 2 times, most recently from e464e74 to f81ee3c Compare April 3, 2026 13:54

crivetimihai merged commit ab59307 into main Apr 3, 2026
27 checks passed

crivetimihai deleted the otel-phase1-request-root-spans branch April 3, 2026 14:49

araujof mentioned this pull request Apr 8, 2026

[TRACKING]: Port new capabilities from CF's in-tree plugin framework contextforge-org/contextforge-plugins-framework#10

Open

jonpspri mentioned this pull request Apr 15, 2026

[FEATURE]: Add OTEL tracing to supported plugins #4220

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): add OTEL root and client spans for MCP flows#3872

feat(observability): add OTEL root and client spans for MCP flows#3872
crivetimihai merged 15 commits intomainfrom
otel-phase1-request-root-spans

vishu-bh commented Mar 26, 2026 •

edited

Loading

Uh oh!

crivetimihai commented Mar 29, 2026

Uh oh!

lucarlig left a comment

Uh oh!

vishu-bh commented Apr 1, 2026

Uh oh!

crivetimihai left a comment

Uh oh!

crivetimihai commented Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

vishu-bh commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Related Issue

📝 Summary

🏷️ Type of Change

🧪 Verification

✅ Checklist

📓 Notes

Uh oh!

crivetimihai commented Mar 29, 2026

Uh oh!

lucarlig left a comment

Choose a reason for hiding this comment

Uh oh!

vishu-bh commented Apr 1, 2026

✅ All PR Review Comments Addressed

🔒 Security Fixes (CRITICAL)

⚡ Performance Fixes (HIGH PRIORITY)

🔧 Technical Fixes

✅ Review Items

📚 Documentation

Files Changed

Uh oh!

crivetimihai left a comment

Choose a reason for hiding this comment

Review Summary

Fixes applied (2 commits on top of the original work)

Known gap (non-blocking)

What looks good

Uh oh!

crivetimihai commented Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

vishu-bh commented Mar 26, 2026 •

edited

Loading