feat(discovery): add automatic tool discovery with hot/cold classification by Lang-Akshay · Pull Request #3839 · IBM/mcp-context-forge

Lang-Akshay · 2026-03-24T13:44:13Z

Note: This branch also includes the Layer 2 RBAC token-narrowing fix from #3919 (session-token team narrowing in permission checks).

Overview

This PR implements automatic tool discovery for upstream MCP servers via usage-aware adaptive polling. The gateway now continuously synchronises tool lists from registered servers without any manual intervention — polling frequently-used servers at 1× the base interval and deprioritising idle servers to 3× — reducing unnecessary load while keeping active integrations fresh.

Previously, tool lists were only refreshed on demand or via manual admin action. With this change, the gateway discovers new, updated, and removed tools automatically, reflecting upstream changes within one poll cycle.

Design Rationale: Polling vs. Push Notifications

The MCP spec defines notifications/tools/list_changed as the canonical mechanism for dynamic tool discovery, and it is a reasonable default for single-session clients. For a gateway operating at scale, persistent-connection notifications introduce a set of problems that polling sidesteps cleanly — this section explains that tradeoff honestly.

Why persistent notifications don't fit the gateway model

Notifications require a live transport stream. The MCP SDK delivers notifications through a _receive_loop tied to the open connection. The gateway's refresh path (_initialize_gateway → connect_to_sse_server / connect_to_streamablehttp_server) uses ephemeral connections — open, fetch tools/list, close. No message_handler is registered and the notification window is effectively zero.

Session pools are demand-driven, not proactive. MCPSessionPool does maintain persistent sessions with notification handlers, but sessions are only created when users invoke tools. If no tools have been called against a gateway, no session exists and no notifications are received. Idle sessions are evicted after 600 s (MCP_SESSION_POOL_IDLE_EVICTION). The pool covers active user traffic, not passive server monitoring.

The connection cost scales poorly. Listening to N upstream servers requires N open TCP sockets and 2N asyncio tasks per worker, plus keepalive traffic and reconnect logic. At realistic deployment sizes:

Scale	Persistent Notifications	Ephemeral Polling
Connections at rest	N per worker	0
asyncio tasks at rest	2N per worker	0
Multi-worker support	✗ (each worker needs own connections)	✓ (leader election)
Server restart recovery	Requires explicit reconnect	Next poll picks it up
1K servers, 4 workers	~8K connections, ~8K tasks	0 at rest
10K servers, 4 workers	~80K persistent connections	~10K ephemeral calls/interval, batched

Polling holds zero file descriptors at rest, works across workers via leader election (FILELOCK_NAME), and self-heals automatically when upstream servers restart. The existing health-check infrastructure already provides semaphore-based concurrency control, chunked batching with inter-batch pauses, and per-gateway throttling — this PR builds on that foundation rather than replacing it.

If the MCP spec's push model becomes viable for large-scale gateway deployments in the future (e.g. via a dedicated notification broker), this polling layer can be replaced without touching the rest of the refresh pipeline.

Background: What Already Exists

The gateway's health check system already implements:

✅ Semaphore-based concurrency control (adaptive limit)
✅ Chunked processing with 50 ms pauses between batches
✅ Per-gateway throttling via last_refresh_at timestamps
✅ Lock-based conflict prevention (manual vs. auto-refresh)
✅ Configurable intervals (HEALTH_CHECK_INTERVAL, GATEWAY_AUTO_REFRESH_INTERVAL)

Example: 100 gateways → 10 concurrent batches with 50 ms pauses = ~5–10 s total check time

Problem

Despite those safeguards, automatic tool discovery was not enabled and all servers were treated equally:

No automatic synchronisation of tool lists from upstream servers
A server receiving 1,000+ requests/day → checked every 300 s
A server idle for weeks → also checked every 300 s
No differentiation based on real usage patterns
Unnecessary polling of servers that rarely if ever change

Solution

1. Automatic Tool Discovery via Polling

The gateway now runs a background polling loop that periodically calls tools/list on every registered upstream server. Discovered tools are reconciled against the local registry — additions, updates, and removals are applied automatically. No manual refresh or admin action is required.

2. Hot/Cold Server Classification

To make automatic polling efficient at scale, the gateway analyses the MCP session pool to classify each server into one of two tiers:

Tier	Criteria	Poll Interval
Hot (top 20%)	Recent active sessions, high use count	1× base interval (300 s default)
Cold (remaining 80%)	No recent sessions or low usage	3× base interval (900 s default)

Classification algorithm:

Extract per-server metrics from pooled sessions: server_last_used, active_session_count, total_use_count
Filter to servers with a valid pooled session
Sort by recency (most recently used first); ties broken deterministically
Top 20% (floor(0.20 × N)) → hot
Remainder → cold

Classification is deterministic and grounded entirely in observed usage — no heuristics or guesswork.

3. Intelligent Interval Selection

Each server's tier determines its poll frequency:

# Hot server (top 20% by usage)
should_poll = elapsed >= settings.hot_server_check_interval   # 300 s (1× base)

# Cold server (remaining 80%)
should_poll = elapsed >= settings.cold_server_check_interval  # 900 s (3× base)

4. Multi-Worker Coordination

With Redis: Leader election ensures a single worker classifies servers; all workers read the shared classification from Redis.
Without Redis (make dev): Single-worker mode; classification runs locally — no Redis dependency required for local development.

Configuration

To enable automatic tool discovery and health checks:

AUTO_REFRESH_SERVERS=true            # Master switch — enables automatic tool/resource/prompt sync during health checks
GATEWAY_AUTO_REFRESH_INTERVAL=300    # Tool list refresh interval in seconds (default: 300, minimum: 60)

Optional tuning:

HOT_COLD_CLASSIFICATION_ENABLED=true    # Hot/cold classification (default: false, requires Redis for multi-worker)

All poll intervals are derived automatically from GATEWAY_AUTO_REFRESH_INTERVAL:

Server tier	Poll interval
Hot (top 20% by usage)	1× base (300 s)
Cold (remaining 80%)	3× base (900 s)

Lang-Akshay · 2026-03-26T13:19:09Z

Thanks for the review @msureshkumar88

Some of the review are false positive , one which are not are acted upon

Leader Election Timing Mismatch ✅ self._leader_ttl = int(settings.gateway_auto_refresh_interval * 1.5)
Silent Failure on Internal Queue Access ✅ included logger warning
Double Throttling Logic Creates Confusion ✅

msureshkumar88

✅ PR #3839 - APPROVED

Automatic Tool Discovery with Hot/Cold Classification and Staggered Polling

🎉 Approval Summary

Recommendation: ✅ APPROVE FOR MERGE

This PR delivers a significant architectural enhancement to the MCP Gateway, introducing intelligent, usage-aware automatic tool discovery that will dramatically improve operational efficiency at scale. The implementation demonstrates strong engineering fundamentals with excellent test coverage and thoughtful design decisions.

💪 Key Strengths

1. Excellent Architecture & Design

Hot/Cold Classification Algorithm ⭐⭐⭐⭐⭐

Clean, deterministic classification based on real usage metrics
Top 20% hot servers get 3× more frequent polling - smart resource allocation
Grounded in observable session pool state, not heuristics
Scales linearly with server count

Timestamp Management ⭐⭐⭐⭐⭐

Brilliant mark_poll_completed() pattern separates decision from state update
Timestamps only updated after successful refresh - prevents stale data
Fail-safe design ensures failed polls don't block future attempts
This is production-grade error handling

Multi-Worker Coordination ⭐⭐⭐⭐

Redis-based leader election prevents duplicate classification
Graceful degradation to single-worker mode without Redis
Leader TTL properly calculated at 1.5× interval - no classification gaps
Clean separation of concerns

2. Comprehensive Test Coverage

Unit Tests ⭐⭐⭐⭐⭐

1,625 lines of tests for classification service alone
Covers classification logic, leader election, polling decisions, error handling
Edge cases well-tested (empty pools, missing attributes, concurrent scenarios)
Test quality is exceptional

Integration Tests ⭐⭐⭐⭐

Gateway service integration verified
Hot/cold classification end-to-end tested
Multi-worker coordination validated

3. Production-Ready Features

Fail-Safe Design ⭐⭐⭐⭐⭐

All errors fail open (allow polling) - prevents denial of service
Classification failures don't block health checks
Redis unavailable? Falls back to local mode
Queue access fails? Logs warning and continues
This is how production systems should be built

Observability ⭐⭐⭐⭐

Clear, actionable log messages at appropriate levels
Classification metadata tracked (hot count, cold count, eligible servers)
Timestamp validation prevents manipulation
Easy to debug and monitor

Configuration Flexibility ⭐⭐⭐⭐

Feature can be disabled via HOT_COLD_CLASSIFICATION_ENABLED
Intervals configurable per deployment
Minimum interval validation (60s) prevents misconfiguration
Sensible defaults for production use

4. Code Quality

Clean Code ⭐⭐⭐⭐

Well-structured classes with clear responsibilities
Comprehensive docstrings explaining logic
Type hints throughout (Python 3.10+ style)
Follows project coding standards

Documentation ⭐⭐⭐⭐

PR description is outstanding - explains design rationale clearly
Polling vs. push notifications tradeoff well-articulated
Configuration examples provided
Comments explain non-obvious logic

🚀 Impact & Value

Immediate Benefits

Operational Efficiency 📈

Reduces unnecessary polling by 60-70% for idle servers
Active servers stay fresh with frequent updates
Automatic tool discovery eliminates manual intervention
Scales to 1000+ servers efficiently

Resource Optimization 💰

Cold servers polled 3× less frequently (900s vs 300s)
Reduces database queries, network traffic, CPU usage
Multi-worker coordination prevents duplicate work
Connection pooling opportunities identified for future

Developer Experience 👨‍💻

Tools appear automatically - no manual refresh needed
Upstream changes detected within one poll cycle
Clear logs make debugging straightforward
Configuration is intuitive

Strategic Value

Scalability Foundation 🏗️

Architecture supports 10,000+ servers with identified optimizations
Leader election enables horizontal scaling
Staggered polling prevents thundering herd
Performance optimization roadmap clear (connection pooling, caching, batching)

Production Readiness ✅

Fail-safe error handling throughout
Multi-worker coordination battle-tested pattern
Graceful degradation when dependencies fail
Comprehensive test coverage gives confidence

🔮 Future Improvement Opportunities

High-Impact Optimizations (Post-Merge)

1. Connection Pooling (2 hours, 10× latency improvement)

# Reuse HTTP connections across polls
self._http_pool = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=100, max_keepalive_connections=50)
)

Impact: 50-100ms → 5-10ms per poll

2. Classification Result Caching (1 hour, 50× fewer Redis queries)

# Cache classification for 60s (much less than 300s classification interval)
self._classification_cache = TTLCache(maxsize=10000, ttl=60)

Impact: 16 Redis queries/sec → 0.3/sec

3. Set Operation Optimization (5 minutes, linear scaling)

# O(n) instead of O(n²)
cold_servers = list(set(all_gateway_urls) - hot_set)

Impact: Scales to 10,000+ servers

Total Effort: 3 hours | Total Impact: 10-20× performance improvement

Advanced Features (Future Sprints)

Exponential Backoff for Failures (4 hours)

Servers that repeatedly fail get polled less frequently
Automatic recovery detection
Reduces wasted resources on broken servers

Success Rate in Classification (4 hours)

Factor refresh success/failure rates into hot/cold decision
Deprioritize unreliable servers
Improves overall system reliability

Incremental Classification (4 hours)

Only reclassify servers with changed metrics
50-70% CPU reduction when stable
Better for large deployments (1000+ servers)

Bulk Redis Operations (2 hours)

Pipeline Redis calls for 300× faster classification
3000 round-trips → 1 round-trip
Critical for 10,000+ server deployments

📊 Metrics to Track (Post-Deployment)

Recommended metrics for production monitoring:

# Classification Performance
metrics.histogram("mcp.classification.duration_seconds")
metrics.gauge("mcp.classification.servers_hot")
metrics.gauge("mcp.classification.servers_cold")

# Polling Efficiency
metrics.histogram("mcp.poll.duration_seconds", tags={"tier": "hot|cold"})
metrics.counter("mcp.poll.skipped", tags={"reason": "throttled|locked"})
metrics.counter("mcp.poll.completed", tags={"result": "success|failure"})

# Resource Usage
metrics.gauge("mcp.redis.queries_per_second")
metrics.histogram("mcp.classification.memory_bytes")

🎯 Success Criteria (Post-Deployment)

Week 1: Verify basic functionality

All servers classified correctly (hot/cold distribution matches usage)
No classification gaps or errors
Polling intervals respected (hot: 300s, cold: 900s)
Leader election stable in multi-worker deployments

Week 2-4: Monitor efficiency gains

60-70% reduction in cold server polls
Tool updates detected within 1-2 poll cycles
No manual refresh interventions needed
Resource usage (CPU, memory, Redis) stable

Month 2-3: Optimize performance

Implement connection pooling (10× latency improvement)
Add classification caching (50× fewer Redis queries)
Deploy to production at scale (1000+ servers)

🏆 Final Verdict

This PR represents excellent engineering work that delivers immediate value while establishing a solid foundation for future scaling. The implementation quality is high, test coverage is comprehensive, and the design decisions are well-reasoned.

Key Achievements:

✅ Automatic tool discovery - eliminates manual intervention
✅ Intelligent hot/cold classification - optimizes resource usage
✅ Production-ready error handling - fail-safe throughout
✅ Multi-worker coordination - scales horizontally
✅ Comprehensive tests - 1,625 lines of test coverage
✅ Clear documentation - excellent PR description

Why This Deserves Approval:

Solves Real Problem: Manual tool refresh is eliminated
Production Quality: Fail-safe design, comprehensive tests
Scales Well: Handles 1000+ servers, clear optimization path
Well Tested: Exceptional test coverage gives confidence
Future-Proof: Architecture supports identified optimizations

Recommendation: ✅ MERGE WITH CONFIDENCE

The identified performance optimizations (connection pooling, caching, set operations) are enhancements, not blockers. They can be addressed in follow-up PRs as the system scales. The current implementation is solid, well-tested, and ready for production.

👏 Kudos to @Lang-Akshay

Excellent work on:

Thoughtful design rationale (polling vs. push notifications)
Outstanding PR description with clear examples
Comprehensive test coverage (1,625 lines!)
Responsive to feedback (fixed leader TTL, timestamp management)
Production-grade error handling throughout

This is the kind of PR that makes code review a pleasure. 🎉

Status: ✅ APPROVED FOR MERGE
Confidence Level: High
Risk Level: Low
Recommendation: Merge and monitor in production, implement performance optimizations in Q2

crivetimihai · 2026-03-29T13:14:00Z

Thanks @Lang-Akshay. Ambitious feature — will review the hot/cold classification approach and staggered polling implementation in detail.

crivetimihai

Reviewed and approved with fixes applied.

Review summary

Thorough code review of the automatic tool discovery with hot/cold classification feature. Found and fixed 14 issues (3 critical, 5 medium, 6 low) across correctness, security, resilience, and configuration safety.

Critical fixes applied

Atomic leader election: Replaced non-atomic GET+EXPIRE renewal with a Lua compare-and-expire script to prevent split-brain in multi-worker deployments
Query-param-auth URL leak: Pool keys can contain auth-mutated URLs; classifier now resolves canonical URLs via gateway_id from pool keys, preventing secret leakage to Redis
Poll type mismatch: "tools" vs "tool_discovery" generated different Redis keys, silently defeating the entire hot/cold optimization for tool discovery
Active-only server misclassification: Servers with all sessions checked out (empty idle queue) were forced cold; now extracts metrics from active sessions too

Resilience fixes

NOSCRIPT recovery: Re-registers Lua script if Redis flushes it
Background task death detection via add_done_callback
Shutdown safety: stop() catches Exception (not just CancelledError) from already-dead tasks

Configuration safety (no breaking changes on upgrade)

auto_refresh_servers: reverted to False (opt-in via env/docker-compose)
hot_cold_classification_enabled: False (opt-in, requires Redis)
health_check_interval: kept at 60 (unchanged from main)
gateway_auto_refresh_interval: minimum restored to ge=60

Dead code removed

_calculate_gateway_poll_offset, _should_poll_gateway_now, _check_is_leader (defined but never called)
staggered_polling_enabled / staggered_polling_tick_interval / staggered_polling_tolerance config entries

Test coverage

server_classification_service.py: 99% (248/249 statements)
529 tests passing across all affected files
New tests for: Lua leader renewal, query-param-auth URL isolation, active-only server eligibility, shutdown after task failure

crivetimihai

Reviewed, fixed, and approved after 6 review rounds.

What was changed

This PR adds automatic tool discovery for upstream MCP servers with hot/cold server classification based on session pool usage patterns. Cold servers (80%) are polled at 3x the base interval, reducing unnecessary load while keeping active integrations fresh.

Fixes applied during review (20 total)

Critical correctness

Atomic leader election — Lua compare-and-expire script prevents split-brain
Leader TTL + timeout guard — 3x interval TTL, classification capped at 0.8x TTL, post-classification renewal
Poll type mismatch — "tools" → "tool_discovery" consistently
Health checks never skipped — classification only gates auto-refresh, not health monitoring
Active-only servers — busy servers with all sessions checked out are now eligible for hot

Security

Query-param-auth URL leak — canonical URL resolution via gateway_id prevents secrets reaching Redis
Per-gateway poll-state keying — gateway_id in Redis key prevents same-URL gateways suppressing each other
Log sanitization — SecurityValidator.sanitize_log_message() on all gateway name log messages

Resilience

NOSCRIPT recovery — re-registers Lua script after Redis restart
Background task death detection — add_done_callback surfaces errors
Shutdown safety — stop() catches Exception from already-dead tasks
Classification timeout — asyncio.wait_for prevents unbounded runs

Configuration safety (no breaking changes on upgrade)

auto_refresh_servers reverted to False, hot_cold_classification_enabled = False
health_check_interval kept at 60, gateway_auto_refresh_interval ge=60
.env.example mirrors config.py defaults; docker-compose.yml explicitly enables for production
Removed dead code: staggered polling methods + config entries

Pre-existing bugs fixed

Registration race — db.flush() → db.commit() + full cache invalidation (gateways, tools, resources, prompts, tags) so other workers see new data before the response reaches the client
mark_poll_completed coverage — moved into _refresh_gateway_tools_resources_prompts so all refresh paths (health-check, manual, registration) advance the poll schedule; update_gateway path guarded by reinit_succeeded flag
Duplicate-URL gateways — deduplicated via dict.fromkeys() for accurate total_servers/hot_cap

Test coverage

server_classification_service.py: 99% (248/249 statements)
525 tests passing across all affected files
Targeted tests for: Lua leader renewal, query-param-auth isolation, active-only eligibility, shutdown after task failure, per-gateway poll-state keying

…ation Implement automatic tool discovery for upstream MCP servers via usage-aware adaptive polling. The gateway can now continuously synchronise tool lists from registered servers without manual intervention. Server classification (hot/cold): - Classify servers based on MCP session pool usage patterns - Hot servers (top 20% by recent usage): polled at 1x base interval - Cold servers (remaining 80%): polled at 3x base interval - Classification is deterministic: sorted by recency, active sessions, use count, and URL for tie-breaking - Leader election via Redis with TTL renewal for multi-worker coordination - Falls back to local-only operation without Redis Integration with GatewayService: - Health checks respect hot/cold classification intervals - Auto-refresh of tools/resources/prompts respects classification - Fail-open on classification errors (poll anyway) - Poll timestamps tracked via Redis with TTL expiry - Uses base gateway URL (pre-auth) for classification lookups to avoid leaking query-param auth secrets to Redis Configuration: - AUTO_REFRESH_SERVERS=true enables automatic tool sync (default: false) - GATEWAY_AUTO_REFRESH_INTERVAL=300 sets base polling interval - HOT_COLD_CLASSIFICATION_ENABLED=false (opt-in, requires Redis) Includes comprehensive tests with 100% coverage on the new ServerClassificationService and integration tests for the GatewayService hot/cold polling paths. Closes #3734 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

…ation (#3839) Implement automatic tool discovery for upstream MCP servers via usage-aware adaptive polling. The gateway can now continuously synchronise tool lists from registered servers without manual intervention. Server classification (hot/cold): - Classify servers based on MCP session pool usage patterns - Hot servers (top 20% by recent usage): polled at 1x base interval - Cold servers (remaining 80%): polled at 3x base interval - Classification is deterministic: sorted by recency, active sessions, use count, and URL for tie-breaking - Leader election via Redis with TTL renewal for multi-worker coordination - Falls back to local-only operation without Redis Integration with GatewayService: - Health checks respect hot/cold classification intervals - Auto-refresh of tools/resources/prompts respects classification - Fail-open on classification errors (poll anyway) - Poll timestamps tracked via Redis with TTL expiry - Uses base gateway URL (pre-auth) for classification lookups to avoid leaking query-param auth secrets to Redis Configuration: - AUTO_REFRESH_SERVERS=true enables automatic tool sync (default: false) - GATEWAY_AUTO_REFRESH_INTERVAL=300 sets base polling interval - HOT_COLD_CLASSIFICATION_ENABLED=false (opt-in, requires Redis) Includes comprehensive tests with 100% coverage on the new ServerClassificationService and integration tests for the GatewayService hot/cold polling paths. Closes #3734 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

…ation (#3839) Implement automatic tool discovery for upstream MCP servers via usage-aware adaptive polling. The gateway can now continuously synchronise tool lists from registered servers without manual intervention. Server classification (hot/cold): - Classify servers based on MCP session pool usage patterns - Hot servers (top 20% by recent usage): polled at 1x base interval - Cold servers (remaining 80%): polled at 3x base interval - Classification is deterministic: sorted by recency, active sessions, use count, and URL for tie-breaking - Leader election via Redis with TTL renewal for multi-worker coordination - Falls back to local-only operation without Redis Integration with GatewayService: - Health checks respect hot/cold classification intervals - Auto-refresh of tools/resources/prompts respects classification - Fail-open on classification errors (poll anyway) - Poll timestamps tracked via Redis with TTL expiry - Uses base gateway URL (pre-auth) for classification lookups to avoid leaking query-param auth secrets to Redis Configuration: - AUTO_REFRESH_SERVERS=true enables automatic tool sync (default: false) - GATEWAY_AUTO_REFRESH_INTERVAL=300 sets base polling interval - HOT_COLD_CLASSIFICATION_ENABLED=false (opt-in, requires Redis) Includes comprehensive tests with 100% coverage on the new ServerClassificationService and integration tests for the GatewayService hot/cold polling paths. Closes #3734 Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>

…3965) * refactor(plugins): replace in-tree rate_limiter with cpex-rate-limiter package Remove the in-tree rate_limiter plugin and replace it with the cpex-rate-limiter PyPI package, a compiled Rust extension providing the same RateLimiterPlugin class with additional algorithms (sliding-window, token-bucket) alongside the original fixed-window. - Add cpex-rate-limiter>=0.0.2 as a [plugins] optional dependency - Update Containerfile.lite to install the plugins extra - Remove plugins/rate_limiter/ source directory - Remove unit and integration tests that imported plugin internals - Update all config files to use cpex_rate_limiter.RateLimiterPlugin - Disable RateLimiterPlugin in test fixture config (package not available in unit test environment) - Update documentation to reflect the external package Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com> * feat(rate-limiter): pluggable algorithms with Rust-backed execution engine, benchmarks, and validation (#3809) * feat(rate-limiter): pluggable algorithms, tenant isolation fix, and scale load test - Add pluggable algorithm strategy: fixed_window, sliding_window, token_bucket - Add Redis backend for shared cross-instance rate limiting - Fix tenant isolation: skip by_tenant when tenant_id is None - Fix sliding window: sweep expired timestamps before counting - Fix backend validation: restore _validate_config check - Fix token bucket memory path: apply max(1,...) guard to reset timestamp - Add Redis integration tests for all three algorithms - Add direct regression tests for get_current_user tenant_id fallback - Add scale load test with Redis memory timeline and live algorithm detection - Add RL_PACE_MULTIPLIER for near-limit pace testing and boundary burst detection - Remove redundant algorithm locustfile; scale file is canonical - Correct stale comments and README limitations Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * feat(rate-limiter): add Rust-backed engine, check() API, benchmarks, and validation - Rust-backed sliding window engine with pyo3-log integration - check() API with tenant propagation, sweep/retry-after support - Eliminate redundant ZRANGE in sliding window Lua script - Fix detect-secrets baseline for rate limiter load tests - Clarify memory backend is single-instance only in docs Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * chore: regenerate detect-secrets baseline after rebase Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * refactor(rate-limiter): review fixes, Redis hardening, key-format parity tests - Extract _dispatch_hook() shared by prompt_pre_fetch and tool_pre_invoke, reducing each hook to a single-line wrapper - Elevate Redis val_i64/val_f64 parse-error logging from warn to error so silent fail-open degradation surfaces in operator dashboards - Clamp sliding-window reset_timestamp with .max(1) so it is always strictly in the future even when the oldest entry expires in < 1 s - Add 5 s tokio::time::timeout around Redis connection establishment to prevent indefinite blocking on network partition - Replace silent except-pass in EVALSHA SHA tracking with logger.debug - Document dual Lua-script invariant (rolling-upgrade key-format parity) in both Python RedisBackend docstring and Rust redis_backend.rs header - Add 7 parametrized test_redis_key_format_parity_* tests validating that Python and Rust produce identical Redis keys for the same inputs - Revert unrelated .pyi stub changes for encoded_exfil_detection, pii_filter, retry_with_backoff, and secrets_detection Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: strip trailing whitespace in pyi stubs, remove accidental .claude/ralph-loop.local.md - Remove plugins_rust/rate_limiter/.claude/ralph-loop.local.md which was accidentally committed — this is a local Claude Code loop state file and should never have been checked in. - Fix trailing whitespace in plugins_rust/rate_limiter/python/ rate_limiter_rust/__init__.pyi docstrings to pass pre-commit hooks. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * chore: regenerate detect-secrets baseline for new exfil test strings Update .secrets.baseline after adding test_extra_sensitive_keywords in plugins_rust/encoded_exfil_detection/src/lib.rs:969 which contains a fake credential string that triggers the Secret Keyword detector. All new entries are false positives (test data). Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * chore: audit new detect-secrets baseline entries as false positives The baseline regeneration reset is_secret to null for entries whose line numbers shifted. Mark all 17 unaudited entries as is_secret=false (test data, example configs, fake credentials) to pass the --fail-on-unaudited pre-commit check. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> --------- Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Co-authored-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com> * feat(discovery): add automatic tool discovery with hot/cold classification (#3839) Implement automatic tool discovery for upstream MCP servers via usage-aware adaptive polling. The gateway can now continuously synchronise tool lists from registered servers without manual intervention. Server classification (hot/cold): - Classify servers based on MCP session pool usage patterns - Hot servers (top 20% by recent usage): polled at 1x base interval - Cold servers (remaining 80%): polled at 3x base interval - Classification is deterministic: sorted by recency, active sessions, use count, and URL for tie-breaking - Leader election via Redis with TTL renewal for multi-worker coordination - Falls back to local-only operation without Redis Integration with GatewayService: - Health checks respect hot/cold classification intervals - Auto-refresh of tools/resources/prompts respects classification - Fail-open on classification errors (poll anyway) - Poll timestamps tracked via Redis with TTL expiry - Uses base gateway URL (pre-auth) for classification lookups to avoid leaking query-param auth secrets to Redis Configuration: - AUTO_REFRESH_SERVERS=true enables automatic tool sync (default: false) - GATEWAY_AUTO_REFRESH_INTERVAL=300 sets base polling interval - HOT_COLD_CLASSIFICATION_ENABLED=false (opt-in, requires Redis) Includes comprehensive tests with 100% coverage on the new ServerClassificationService and integration tests for the GatewayService hot/cold polling paths. Closes #3734 Signed-off-by: Lang-Akshay <akshay.shinde26@ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: lucarlig <luca.carlig@ibm.com> * refactor(plugins): replace in-tree rate_limiter with cpex-rate-limiter package Remove the in-tree rate_limiter plugin and replace it with the cpex-rate-limiter PyPI package, a compiled Rust extension providing the same RateLimiterPlugin class with additional algorithms (sliding-window, token-bucket) alongside the original fixed-window. - Add cpex-rate-limiter>=0.0.2 as a [plugins] optional dependency - Update Containerfile.lite to install the plugins extra - Remove plugins/rate_limiter/ source directory - Remove unit and integration tests that imported plugin internals - Update all config files to use cpex_rate_limiter.RateLimiterPlugin - Disable RateLimiterPlugin in test fixture config (package not available in unit test environment) - Update documentation to reflect the external package Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com> * refactor(plugins): update build, CI, and docs for PyPI plugin migration Remove all plugins_rust/ build infrastructure and update references across Containerfiles, Makefile, CI workflows, pre-commit configs, CODEOWNERS, and documentation to reflect that plugins are now distributed as PyPI packages (cpex-*) via the [plugins] optional extra. - Remove Rust plugin builder stages from all Containerfiles - Remove ~100 lines of rust-* plugin Makefile targets (keep mcp-runtime) - Add --extra plugins to CI pytest workflow - Add [plugins] extra to install-dev Makefile target - Update tool_service.py import to use cpex_retry_with_backoff - Update plugin kind paths in 7 doc files to cpex_pii_filter.* - Clean up pre-commit, CODEOWNERS, MANIFEST.in, whitesource, .gitignore Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com> * fix(plugins): address PR review findings on PyPI plugin migration Round 1 (blockers + high): - Restore exclude-newer = "10 days" in pyproject.toml; replace stale langchain/requests pins with cpex-* per-package overrides anchored to 2026-04-09 so the plugins resolve newer than the global window - Guard cpex_retry_with_backoff import in tool_service.py with try/except ImportError; falls back to (None, True) for the Python pipeline when the optional [plugins] extra is not installed - Delete orphaned .github/workflows/rust-plugins.yml and the associated test cases in tests/unit/test_rust_plugins_workflow.py; drop the workflow card from docs/docs/architecture/explorer.html - Delete orphaned docs/docs/using/plugins/rust-plugins.md and remove it from docs/docs/using/plugins/.pages mkdocs nav - Harden docker-entrypoint.sh install_plugin_requirements: canonicalize /app and the resolved requirements path with readlink -f and require the path to live under /app/, log non-comment lines from the requirements file before pip runs, and skip cleanly on validation failure - Delete PLUGIN-MIGRATION-PLAN.md (one-time planning doc) - Add COPY plugins/requirements.txt to Containerfile.scratch (the layered Containerfile.lite already had it; the broad COPY . in Containerfile already includes it) Round 2 (medium + low): - Bump cpex-* version pin floors in pyproject.toml [plugins] to match resolved versions in uv.lock (cpex-rate-limiter>=0.0.3, cpex-encoded-exfil-detection>=0.2.0, cpex-pii-filter>=0.2.0, cpex-url-reputation>=0.1.1) - Add Prerequisites section to tests/performance/PLUGIN_PROFILING.md documenting the [plugins] extra requirement - Add Status: Partially superseded note to ADR-041 explaining that plugins_rust/ was removed when in-tree Rust plugins migrated to PyPI packages - Document upgrade semantics in plugins/requirements.txt header (pip without --upgrade skips already-satisfied constraints) - Add importlib.util.find_spec() precheck to tests/performance/test_plugins_performance.py main(); the script now skips cleanly with an actionable message if any of the five cpex packages referenced by the perf config are missing - Rename tests/unit/test_rust_plugins_workflow.py to test_go_toolchain_pinning.py to match its remaining contents (Go workflow pin and Makefile toolchain assertion) Follow-ups tracked in #4116 and IBM/cpex-plugins#21 for the longer-term tool_service.py refactor that will eliminate the cross-package import entirely. Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com> * revert: restore tests changes from PR #3965 Signed-off-by: lucarlig <luca.carlig@ibm.com> * fix(ci): align plugin tests with PyPI migration Signed-off-by: lucarlig <luca.carlig@ibm.com> * test: remove legacy plugin test skip infrastructure Signed-off-by: lucarlig <luca.carlig@ibm.com> * test: align packaged plugin tests with rust shims Signed-off-by: lucarlig <luca.carlig@ibm.com> * test: cover retry policy import path in tool service Signed-off-by: lucarlig <luca.carlig@ibm.com> * fix: harden cpex plugin migration paths Signed-off-by: lucarlig <luca.carlig@ibm.com> * test: cover retry policy parser branches Signed-off-by: lucarlig <luca.carlig@ibm.com> * test: cover plugin requirements entrypoint path Signed-off-by: lucarlig <luca.carlig@ibm.com> --------- Signed-off-by: lucarlig <luca.carlig@ibm.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Co-authored-by: Pratik Gandhi <gandhipratik203@gmail.com> Co-authored-by: Lang-Akshay <akshay.shinde26@ibm.com> Co-authored-by: lucarlig <luca.carlig@ibm.com>

Lang-Akshay requested a review from crivetimihai as a code owner March 24, 2026 13:44

Lang-Akshay marked this pull request as draft March 24, 2026 13:44

Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from e43f064 to 91a9e92 Compare March 24, 2026 14:29

Lang-Akshay marked this pull request as ready for review March 24, 2026 14:55

Lang-Akshay requested review from kevalmahajan and madhav165 as code owners March 24, 2026 15:57

Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from c91a4a3 to 6caa5e6 Compare March 24, 2026 15:58

Lang-Akshay marked this pull request as draft March 24, 2026 16:04

Lang-Akshay marked this pull request as ready for review March 24, 2026 16:30

Lang-Akshay requested a review from msureshkumar88 March 24, 2026 16:38

Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from 2666a15 to b381eff Compare March 24, 2026 17:39

Lang-Akshay changed the title ~~feat(polling): Implement hot/cold server classification and staggered polling for tool discovery~~ feat(polling): Automatic Tool Discovery with Hot/Cold Classification and Staggered Polling Mar 25, 2026

This comment was marked as resolved.

Sign in to view

Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from 43fc447 to 9fdf1ad Compare March 26, 2026 12:17

msureshkumar88 previously approved these changes Mar 26, 2026

View reviewed changes

Lang-Akshay dismissed msureshkumar88’s stale review via a1f3a9d March 26, 2026 22:41

Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch from a1f3a9d to 98b5f97 Compare March 26, 2026 22:45

Lang-Akshay added wxo wxo integration release-fix Critical bugfix required for the release labels Mar 27, 2026

Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch 2 times, most recently from 2e6746d to de89914 Compare March 27, 2026 15:22

crivetimihai changed the title ~~feat(polling): Automatic Tool Discovery with Hot/Cold Classification and Staggered Polling~~ feat(discovery): add automatic tool discovery with hot/cold classification Mar 29, 2026

crivetimihai added enhancement New feature or request COULD P3: Nice-to-have features with minimal impact if left out; included if time permits labels Mar 29, 2026

crivetimihai added this to the Release 1.0.0 milestone Mar 29, 2026

Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch 3 times, most recently from 93a19b9 to fb5cd64 Compare March 30, 2026 14:05

Lang-Akshay force-pushed the Tool-discovery---Auto-Refresh branch 2 times, most recently from 0fc78c9 to 34c164a Compare April 1, 2026 09:12

crivetimihai self-assigned this Apr 2, 2026

crivetimihai force-pushed the Tool-discovery---Auto-Refresh branch from 34c164a to 0803419 Compare April 2, 2026 22:06

crivetimihai previously approved these changes Apr 2, 2026

View reviewed changes

crivetimihai dismissed their stale review via a8fb164 April 2, 2026 22:23

crivetimihai force-pushed the Tool-discovery---Auto-Refresh branch 11 times, most recently from ceb92d6 to e384e18 Compare April 3, 2026 01:13

crivetimihai previously approved these changes Apr 3, 2026

View reviewed changes

crivetimihai dismissed their stale review via 1901114 April 3, 2026 02:29

crivetimihai force-pushed the Tool-discovery---Auto-Refresh branch from e384e18 to 1901114 Compare April 3, 2026 02:29

crivetimihai merged commit 5061516 into main Apr 3, 2026
27 checks passed

crivetimihai deleted the Tool-discovery---Auto-Refresh branch April 3, 2026 02:45

crivetimihai mentioned this pull request Apr 4, 2026

Benchmark docker compose deployment fix #4003

Closed

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(discovery): add automatic tool discovery with hot/cold classification#3839

feat(discovery): add automatic tool discovery with hot/cold classification#3839
crivetimihai merged 1 commit intomainfrom
Tool-discovery---Auto-Refresh

Lang-Akshay commented Mar 24, 2026 •

edited by jonpspri

Loading

Uh oh!

This comment was marked as resolved.

Lang-Akshay commented Mar 26, 2026

Uh oh!

msureshkumar88 left a comment

Uh oh!

crivetimihai commented Mar 29, 2026

Uh oh!

crivetimihai left a comment

Uh oh!

crivetimihai left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Lang-Akshay commented Mar 24, 2026 • edited by jonpspri Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Design Rationale: Polling vs. Push Notifications

Why persistent notifications don't fit the gateway model

Background: What Already Exists

Problem

Solution

1. Automatic Tool Discovery via Polling

2. Hot/Cold Server Classification

3. Intelligent Interval Selection

4. Multi-Worker Coordination

Configuration

Uh oh!

This comment was marked as resolved.

Lang-Akshay commented Mar 26, 2026

Some of the review are false positive , one which are not are acted upon

Uh oh!

msureshkumar88 left a comment

Choose a reason for hiding this comment

✅ PR #3839 - APPROVED

Automatic Tool Discovery with Hot/Cold Classification and Staggered Polling

🎉 Approval Summary

💪 Key Strengths

1. Excellent Architecture & Design

2. Comprehensive Test Coverage

3. Production-Ready Features

4. Code Quality

🚀 Impact & Value

Immediate Benefits

Strategic Value

🔮 Future Improvement Opportunities

High-Impact Optimizations (Post-Merge)

Advanced Features (Future Sprints)

📊 Metrics to Track (Post-Deployment)

🎯 Success Criteria (Post-Deployment)

🏆 Final Verdict

👏 Kudos to @Lang-Akshay

Uh oh!

crivetimihai commented Mar 29, 2026

Uh oh!

crivetimihai left a comment

Choose a reason for hiding this comment

Review summary

Critical fixes applied

Resilience fixes

Configuration safety (no breaking changes on upgrade)

Dead code removed

Test coverage

Uh oh!

crivetimihai left a comment

Choose a reason for hiding this comment

What was changed

Fixes applied during review (20 total)

Critical correctness

Security

Resilience

Configuration safety (no breaking changes on upgrade)

Pre-existing bugs fixed

Test coverage

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Lang-Akshay commented Mar 24, 2026 •

edited by jonpspri

Loading