
Commit e1b3d1f

feat(mcp): implement enterprise resilience pipeline (circuit breaker, schema validation)
1 parent df2dbb6 commit e1b3d1f

File tree

2 files changed: +79 −12 lines

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# MCP Resilience Pipeline

**Objective:** Upgrade the core `executeTool` engine from a naive proxy to an Enterprise-Grade Resilience Pipeline, ensuring our AI workflows never suffer cascading failures from downstream server instability or LLM hallucinations.

---

## 1. The "Thundering Herd" Problem (Circuit Breaker)

**Before:**

When a downstream provider (e.g., a database or API) experienced latency or went down, our workflow engine would continuously retry. If 1,000 agents hit a struggling server simultaneously, they would overwhelm it (a DDoS-like "thundering herd"), crash our workflow executor, and severely degrade user experience across the platform.

**After (`CircuitBreakerMiddleware`):**

We implemented an intelligent state machine with a **HALF-OPEN concurrency semaphore**.

- **The Trip:** If a server fails 3 times, we cut the circuit (`OPEN` state). All subsequent requests instantly *fast-fail* locally (0ms latency), protecting the downstream server from being hammered.
- **The Recovery:** After a cooldown, we allow exactly **one** probe request through (`HALF-OPEN`). If it succeeds, the circuit closes. If it fails, it trips again.

#### Circuit Lifecycle

```mermaid
sequenceDiagram
    participant Agent
    participant Pipeline
    participant TargetServer

    Agent->>Pipeline: executeTool (Server Down)
    Pipeline--xTargetServer: ❌ Fails (Attempts 1-3)
    Note over Pipeline: 🔴 Tripped to OPEN
    Agent->>Pipeline: executeTool
    Pipeline-->>Agent: 🛑 Fast-Fail (0ms latency) - Target Protected
    Note over Pipeline: ⏳ Cooldown... 🟡 HALF-OPEN
    Agent->>Pipeline: executeTool (Probe)
    Pipeline->>TargetServer: Exactly 1 request allowed
    TargetServer-->>Pipeline: ✅ Success
    Note over Pipeline: 🟢 Reset to CLOSED
    Agent->>Pipeline: executeTool
    Pipeline->>TargetServer: Resume normal traffic
```

---

## 2. LLM Hallucinated Arguments (Schema Validator)

**Before:**

If an LLM hallucinated arguments that didn't match a tool's JSON schema, the downstream server or our proxy would throw a fatal exception. The workflow would crash, requiring user intervention, and wasting the compute/tokens already spent.

**After (`SchemaValidatorMiddleware`):**

We implemented high-performance **Zod schema caching**.

- We intercept the tool call *before* it leaves our system.
- If the arguments fail schema validation, we do *not* crash. Instead, we return a gracefully formatted, native MCP error: `{ isError: true, content: "Schema validation failed: [Zod Error Details]" }`.
- **The Magic:** The LLM receives this error, realizes its mistake, and natively **self-corrects** on the next turn, achieving autonomous self-healing without dropping the user's workflow.
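
The caching strategy can be sketched as follows. This is a dependency-free illustration, not the repo's actual middleware: in the real implementation each tool's JSON Schema would be compiled to a Zod validator, whereas here a plain predicate stands in for the compiled validator, and all names (`SchemaValidator`, `compile`, `validate`) are assumptions.

```typescript
// Sketch: lazily compile a validator per tool, cache it, and convert
// validation failures into a graceful MCP-style error instead of throwing.
type Validator = (args: unknown) => string | null; // null = valid, string = error detail
type McpResult = { isError: true; content: string } | { isError: false };

class SchemaValidator {
  private cache = new Map<string, Validator>();

  constructor(private compile: (toolName: string) => Validator) {}

  validate(toolName: string, args: unknown): McpResult {
    // Compile once on first use; every later call reuses the cached validator.
    let validator = this.cache.get(toolName);
    if (!validator) {
      validator = this.compile(toolName);
      this.cache.set(toolName, validator);
    }
    const error = validator(args);
    return error
      ? { isError: true, content: `Schema validation failed: ${error}` }
      : { isError: false };
  }
}
```

Because the error is returned as tool content rather than thrown, it flows back to the LLM as an ordinary tool result it can read and react to.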

---

## 3. The "Black Box" Problem (Telemetry)

**Before:**

If a tool execution took 10 seconds or failed, we had no granular visibility into *why*. Was it a network timeout? A validation error? A 500 from the target?

**After (`TelemetryMiddleware`):**

Every single tool execution now generates rich metadata:

- `latency_ms`
- Exact `failure_reason` (e.g., `TIMEOUT`, `VALIDATION_ERROR`, `API_500`)
- `serverId` and `workspaceId`
This allows us to build real-time monitoring dashboards to detect struggling third-party integrations before our users even report them.
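
A minimal sketch of the wrapping pattern, using the field names listed above; `withTelemetry`, `emit`, and the error-classification heuristic are illustrative assumptions, not the repo's actual API.

```typescript
// Sketch: time every tool call and emit one telemetry event, success or failure.
type FailureReason = "TIMEOUT" | "VALIDATION_ERROR" | "API_500" | "UNKNOWN";

interface ToolTelemetry {
  toolName: string;
  serverId: string;
  workspaceId: string;
  latency_ms: number;
  failure_reason?: FailureReason; // only present on failure
}

function classify(err: unknown): FailureReason {
  const msg = err instanceof Error ? err.message : String(err);
  if (msg.includes("timeout")) return "TIMEOUT";
  if (msg.includes("validation")) return "VALIDATION_ERROR";
  if (msg.includes("500")) return "API_500";
  return "UNKNOWN";
}

async function withTelemetry<T>(
  meta: { toolName: string; serverId: string; workspaceId: string },
  emit: (event: ToolTelemetry) => void, // metrics sink (dashboard, logs, etc.)
  call: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    emit({ ...meta, latency_ms: Date.now() - start });
    return result;
  } catch (err) {
    emit({ ...meta, latency_ms: Date.now() - start, failure_reason: classify(err) });
    throw err; // telemetry observes; it never swallows the error
  }
}
```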

---

## Architectural Impact: The Composable Pipeline

Perhaps the most significant engineering achievement is the **architecture shift**: we moved from a brittle, monolithic proxy to a modern **Chain of Responsibility** pipeline.

```typescript
// The new elegant implementation in McpService
this.pipeline = new ResiliencePipeline()
  .use(this.telemetry)
  .use(this.schemaValidator)
  .use(this.circuitBreaker);
```
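
The chain-of-responsibility core behind that fluent `.use()` API could look roughly like this. It is a sketch under assumptions: the real `ResiliencePipeline` internals, and the `ToolCall`/`Handler`/`run` names, are hypothetical.

```typescript
// Sketch: fold registered middleware into one nested handler chain,
// so the first .use() registration becomes the outermost wrapper.
type ToolCall = { toolName: string; args: unknown };
type Handler = (call: ToolCall) => Promise<unknown>;
type Middleware = (call: ToolCall, next: Handler) => Promise<unknown>;

class ResiliencePipeline {
  private middlewares: Middleware[] = [];

  use(mw: Middleware): this {
    this.middlewares.push(mw);
    return this; // enables fluent .use().use() chaining
  }

  // Compose right-to-left around the terminal handler (the real tool call).
  run(call: ToolCall, terminal: Handler): Promise<unknown> {
    const chain = this.middlewares.reduceRight<Handler>(
      (next, mw) => (c) => mw(c, next),
      terminal,
    );
    return chain(call);
  }
}
```

With this shape, telemetry registered first observes the full latency of everything inside it, including validation and circuit-breaker fast-fails.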

apps/sim/lib/mcp/resilience/README.md

Lines changed: 0 additions & 12 deletions
@@ -1,15 +1,3 @@
# Goal Description
Implement a "Day 2" reliability layer for the simstudioai/sim workflow engine by building a composable Resilience Interceptor/Middleware Pipeline for the MCP `executeTool` logic. This pipeline ensures enterprise-grade stability by introducing a Circuit Breaker State Machine, Zod-based Schema Enforcement for LLM outputs, and detailed Telemetry for latency and failure analysis, while addressing high-concurrency Node/TS environments.

## User Review Required
- Please confirm if `apps/sim/lib/mcp/service.ts` is the correct core injection point for wrapping `executeTool`.
- Note on file path: `apps/sim/lib/workflow/executor.ts` was not found. Instead, `apps/sim/executor/execution/executor.ts` and `apps/sim/tools/workflow/executor.ts` were analyzed. Ensure intercepting `McpService`'s `executeTool` serves your architectural needs.
- Please confirm the schema enforcement approach: we will compile and cache JSON Schemas to Zod validators upon MCP server discovery or lazily, instead of parsing dynamically per request.

## Proposed Changes

We will split the implementation into discrete PRs / commits to maintain structure.
### Part 1: Telemetry Hooks
Implement the foundation for tracking.
*(Change Rationale: Transitioning to a middleware pattern instead of a monolithic proxy, allowing telemetry to be composed easily.)*
