| 1 | +# Goal Description |
| 2 | +Implement a "Day 2" reliability layer for the simstudioai/sim workflow engine by building a composable Resilience Interceptor/Middleware Pipeline for the MCP `executeTool` logic. This pipeline ensures enterprise-grade stability by introducing a Circuit Breaker State Machine, Zod-based Schema Enforcement for LLM outputs, and detailed Telemetry for latency and failure analysis, while addressing high-concurrency Node/TS environments. |
| 3 | + |
| 4 | +## User Review Required |
| 5 | +- Please confirm whether `apps/sim/lib/mcp/service.ts` is the correct core injection point for wrapping `executeTool`. |
| 6 | +- Note on file path: `apps/sim/lib/workflow/executor.ts` was not found. Instead, `apps/sim/executor/execution/executor.ts` and `apps/sim/tools/workflow/executor.ts` were analyzed. Please confirm that intercepting `McpService`'s `executeTool` meets your architectural needs. |
| 7 | +- Please confirm the schema enforcement approach: we will compile and cache JSON Schemas to Zod validators upon MCP server discovery or lazily, instead of parsing dynamically per request. |
| 8 | + |
| 9 | +## Proposed Changes |
| 10 | + |
| 11 | +We will split the implementation into discrete PRs / Commits to maintain structure. |
| 12 | + |
| 13 | +### Part 1: Telemetry Hooks |
| 14 | +Implement the foundation for tracking. |
| 15 | +*(Change Rationale: Transitioning to a middleware pattern instead of a monolithic proxy, allowing telemetry to be composed easily).* |
| 16 | +#### [NEW] `apps/sim/lib/mcp/resilience/telemetry.ts` |
| 17 | +- Implement telemetry middleware hook to capture `latency_ms` and `failure_reason` (e.g., `TIMEOUT`, `VALIDATION_ERROR`, `API_500`). |
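The telemetry hook described above could look roughly like the following. This is a minimal sketch, not the planned implementation: `withTelemetry`, `TelemetryEvent`, `classifyFailure`, and the error-shape heuristics are all hypothetical names and assumptions introduced for illustration, and the clock is injectable only to keep the example testable.

```typescript
// Hypothetical types; none of these names come from the sim codebase.
type FailureReason = "TIMEOUT" | "VALIDATION_ERROR" | "API_500" | "UNKNOWN";

interface TelemetryEvent {
  toolId: string;
  latency_ms: number;
  failure_reason?: FailureReason;
}

// Map a thrown value onto a coarse failure category (assumed heuristics).
function classifyFailure(err: unknown): FailureReason {
  const e = err as { name?: string; status?: number } | null;
  if (e?.name === "TimeoutError") return "TIMEOUT";
  if (e?.name === "ZodError") return "VALIDATION_ERROR";
  if (typeof e?.status === "number" && e.status >= 500) return "API_500";
  return "UNKNOWN";
}

async function withTelemetry<T>(
  toolId: string,
  call: () => Promise<T>,
  emit: (event: TelemetryEvent) => void,
  now: () => number = Date.now,
): Promise<T> {
  const start = now();
  try {
    const result = await call();
    emit({ toolId, latency_ms: now() - start });
    return result;
  } catch (err) {
    emit({ toolId, latency_ms: now() - start, failure_reason: classifyFailure(err) });
    throw err; // telemetry observes failures but never swallows them
  }
}
```

Rethrowing after emitting keeps the hook transparent, so later stages such as the circuit breaker still see the original failure.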
| 18 | + |
| 19 | +### Part 2: Circuit Breaker State Machine |
| 20 | +Implement the state management logic. |
| 21 | +*(Change Rationale: Added a HALF-OPEN concurrency lock (semaphore) to prevent the "thundering herd" issue on the downstream server. Documented that this operates on local, per-instance state using an LRU cache to prevent memory leaks).* |
| 22 | +#### [NEW] `apps/sim/lib/mcp/resilience/circuit-breaker.ts` |
| 23 | +- Implement the `CircuitBreaker` middleware with states: `CLOSED`, `OPEN`, and `HALF-OPEN`. |
| 24 | +- Handle failure thresholds, reset timeouts, and logic for failing fast. |
| 25 | +- **Concurrency Lock:** During `HALF-OPEN`, strictly gate the transition so only **one** probe request is allowed through. All other concurrent requests will fail-fast until the probe resolves. |
| 26 | +- **Memory & State:** Back the CircuitBreaker registry with an LRU cache or connection-scoped references, binding each breaker's lifecycle explicitly to that of its MCP connection to prevent memory leaks. Breaker state is local to each instance; it is not shared across instances. |
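A minimal sketch of the state machine and the single-probe gate, under stated assumptions: the class shape, `BreakerConfig` fields, `CircuitOpenError`, and the "consecutive failures" threshold policy are hypothetical choices made for illustration. The registry and LRU eviction are omitted.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF-OPEN";

interface BreakerConfig {
  failureThreshold: number; // consecutive failures before opening (assumed policy)
  resetTimeoutMs: number;   // cooldown before a probe is allowed
  now?: () => number;       // injectable clock, used only for testing
}

class CircuitOpenError extends Error {
  constructor() {
    super("Circuit is open; failing fast");
    this.name = "CircuitOpenError";
  }
}

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly config: BreakerConfig;
  private readonly now: () => number;

  constructor(config: BreakerConfig) {
    this.config = config;
    this.now = config.now ?? Date.now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async execute<T>(call: () => Promise<T>): Promise<T> {
    // Everything before the first `await` runs synchronously, so this
    // check-and-set cannot interleave with another caller on the same event loop.
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt < this.config.resetTimeoutMs) {
        throw new CircuitOpenError();
      }
      this.state = "HALF-OPEN";
    }
    let isProbe = false;
    if (this.state === "HALF-OPEN") {
      if (this.probeInFlight) throw new CircuitOpenError();
      this.probeInFlight = true;
      isProbe = true;
    }
    try {
      const result = await call();
      this.failures = 0;
      if (isProbe) this.state = "CLOSED"; // a successful probe closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (isProbe || this.failures >= this.config.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    } finally {
      if (isProbe) this.probeInFlight = false;
    }
  }
}
```

Because the gate is a plain boolean flipped before any `await`, no separate mutex is needed within a single Node process. That guarantee does not extend across processes, which matches the per-instance scope noted above.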
| 27 | + |
| 28 | +### Part 3: Schema Validation |
| 29 | +Implement the Zod validation logic for LLM arguments. |
| 30 | +*(Change Rationale: Added schema compilation caching to avoid a severe per-request CPU bottleneck, and now return `isError: true` on validation failures to natively trigger LLM self-correction).* |
| 31 | +#### [NEW] `apps/sim/lib/mcp/resilience/schema-validator.ts` |
| 32 | +- Logic to enforce schemas using `Zod` as a middleware. |
| 33 | +- **Schema Caching:** Compile JSON Schemas to Zod schemas and cache them in a registry keyed by `toolId`, either during the initial discovery phase or lazily on first use. Invalidate cached validators when MCP lifecycle events arrive (e.g., mid-session tool list updates). |
| 34 | +- **LLM Self-Correction:** Instead of throwing exceptions that crash the workflow engine when Zod validation fails, intercept validation errors and return a gracefully formatted MCP execution result: `{ isError: true, content: [{ type: "text", text: "Schema validation failed: [Zod Error Details]" }] }`. |
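The caching and self-correction behavior above can be sketched as follows. To stay dependency-free, the example uses a hypothetical `Validator` interface shaped like the result of Zod's `safeParse`; a real implementation would plug in the compiled Zod schema instead. `ValidatorCache`, `validateThenExecute`, and the `compile` callback are assumed names, and the JSON Schema to Zod conversion itself is out of scope here.

```typescript
// Mirrors the success/failure shape returned by Zod's `schema.safeParse(input)`.
type ParseResult =
  | { success: true; data: unknown }
  | { success: false; error: { message: string } };

interface Validator {
  safeParse(input: unknown): ParseResult;
}

interface McpToolResult {
  isError?: boolean;
  content: Array<{ type: "text"; text: string }>;
}

class ValidatorCache {
  private readonly cache = new Map<string, Validator>();
  private readonly compile: (toolId: string) => Validator;

  constructor(compile: (toolId: string) => Validator) {
    this.compile = compile;
  }

  // Compile lazily on first use, then reuse the cached validator.
  get(toolId: string): Validator {
    let validator = this.cache.get(toolId);
    if (validator === undefined) {
      validator = this.compile(toolId);
      this.cache.set(toolId, validator);
    }
    return validator;
  }

  // Call when the tool list changes; omit `toolId` to drop everything.
  invalidate(toolId?: string): void {
    if (toolId === undefined) this.cache.clear();
    else this.cache.delete(toolId);
  }
}

async function validateThenExecute(
  toolId: string,
  args: unknown,
  validators: ValidatorCache,
  execute: (args: unknown) => Promise<McpToolResult>,
): Promise<McpToolResult> {
  const parsed = validators.get(toolId).safeParse(args);
  if (parsed.success === false) {
    // Return an MCP-style error result instead of throwing, so the LLM can retry.
    const text = `Schema validation failed: ${parsed.error.message}`;
    return { isError: true, content: [{ type: "text", text }] };
  }
  return execute(parsed.data);
}
```

Returning early on a failed parse means the downstream tool is never invoked with malformed arguments, which is the property the verification plan asserts.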
| 35 | + |
| 36 | +### Part 4: Resilience Pipeline Integration |
| 37 | +Wrap tool execution in a pipeline instead of a monolithic proxy. |
| 38 | +*(Change Rationale: Switched from a God Object Proxy to a Middleware Pipeline to support granular, per-tool enablement).* |
| 39 | +#### [NEW] `apps/sim/lib/mcp/resilience/pipeline.ts` |
| 40 | +- Implement a chain of responsibility (interceptor/middleware pipeline) for `executeTool`. |
| 41 | +- Provide an API like `executeTool.use(telemetry).use(validate(cachedSchema)).use(circuitBreaker(config))` rather than a fixed sequence hardcoded inside a rigid class. |
| 42 | +- This composable architecture allows enabling or disabling specific middlewares dynamically per tool (e.g., untrusted vs. internal tools). |
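One way to realize the chain of responsibility is a small `use()`-based composer. This is a sketch under assumptions: the `Middleware` signature, `ToolRequest`, `ToolResult`, and the `ResiliencePipeline` internals are hypothetical and may differ from what the PR ultimately implements.

```typescript
interface ToolRequest {
  toolId: string;
  args: unknown;
}

interface ToolResult {
  isError?: boolean;
  content: Array<{ type: "text"; text: string }>;
}

type Next = (req: ToolRequest) => Promise<ToolResult>;
type Middleware = (req: ToolRequest, next: Next) => Promise<ToolResult>;

class ResiliencePipeline {
  private readonly middlewares: Middleware[] = [];
  private readonly handler: Next;

  // `handler` is the innermost call, e.g. the raw MCP tool execution.
  constructor(handler: Next) {
    this.handler = handler;
  }

  use(middleware: Middleware): this {
    this.middlewares.push(middleware);
    return this;
  }

  execute(req: ToolRequest): Promise<ToolResult> {
    // Fold right-to-left so the first registered middleware is the outermost wrapper.
    const chain = this.middlewares.reduceRight<Next>(
      (next, middleware) => (r) => middleware(r, next),
      this.handler,
    );
    return chain(req);
  }
}
```

With this shape, a middleware short-circuits by returning a result without calling `next`, and per-tool enablement reduces to building a different pipeline (or registering a different middleware list) for each tool.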
| 43 | + |
| 44 | +#### [MODIFY] `apps/sim/lib/mcp/service.ts` |
| 45 | +- Update `mcpService.executeTool` to run requests through the configurable `ResiliencePipeline`, rather than hardcoded proxy logic. |
| 46 | + |
| 47 | +## Verification Plan |
| 48 | +### Automated Tests |
| 49 | +- Create a mock MCP server execution test suite. |
| 50 | +- Write tests in `apps/sim/lib/mcp/resilience/pipeline.test.ts` to assert: |
| 51 | + - Circuit Breaker trips to `OPEN` on simulated `API_500` failures and transitions to `HALF-OPEN` after the cooldown. |
| 52 | + - **New Test:** Verify HALF-OPEN strictly allows exactly **one** simulated concurrent probe request through. |
| 53 | + - **New Test:** Schema validation returns the standard `isError: true` result for invalid LLM arguments without triggering tool execution. |
| 54 | + - Telemetry correctly logs latency. |
| 55 | + |
| 56 | +### Manual Verification |
| 57 | +- Run the tests with visual output that shows the circuit breaker "tripping" and "recovering". |