RagavRida/agent-reliability

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

agent-reliability

End-to-end reliability and security testing for multi-step AI agent workflows.

If each step has an 85% success rate, a 10-step workflow succeeds only 19.7% of the time. Existing tools (DeepEval, Langfuse, LangSmith) test individual LLM calls. This tests the entire chain — where one failure cascades through everything.

As of v0.2, it also ships SecurityReviewer: a workflow-level security check suite covering prompt injection, canary leaks, unsafe step code, missing controls, and dependency advisories. See §10.

npm install agent-reliability


The Problem

You built an AI agent with 5 steps:

Classify Intent → Fetch Context → Generate Reply → Safety Check → Send
     85%              90%              85%              95%          99%

Each step looks great individually. But end-to-end?

0.85 × 0.90 × 0.85 × 0.95 × 0.99 = 61.2%

Your agent fails 39% of the time. And it's worse than that — failures correlate. When the LLM misclassifies intent in Step 1, Steps 2-5 all get garbage input. The real E2E rate is even lower.

Nobody measures this. DeepEval tests the LLM call. Langfuse traces the call. Neither tests the chain.
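The arithmetic above is worth keeping as a two-line utility: under independence the end-to-end rate is the product of the step rates, and the per-step rate needed to hit a target E2E over n steps is the n-th root of the target. A minimal sketch (plain math, not part of the package's API):

```typescript
// Predicted E2E success rate if step failures were independent:
// the product of the individual step success rates.
function predictedE2E(stepRates: number[]): number {
  return stepRates.reduce((acc, r) => acc * r, 1);
}

// Per-step reliability needed for a target E2E rate over n steps:
// solve r^n = target  =>  r = target^(1/n).
function requiredStepRate(steps: number, targetE2E: number): number {
  return Math.pow(targetE2E, 1 / steps);
}

console.log(predictedE2E([0.85, 0.9, 0.85, 0.95, 0.99]).toFixed(3)); // "0.612"
console.log(requiredStepRate(5, 0.9).toFixed(4));                    // "0.9791"
```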


Quick Start

import { WorkflowRunner, ReliabilityAnalyzer } from "agent-reliability";

// 1. Define your agent workflow
const runner = new WorkflowRunner("my-agent");

runner
  .addStep({
    id: "classify",
    name: "Classify Intent",
    execute: async (input) => {
      // Your real classification logic
      const intent = await classifyIntent(input.message);
      return { output: { intent, message: input.message } };
    },
  })
  .addStep({
    id: "generate",
    name: "Generate Reply",
    execute: async (input) => {
      const reply = await callLLM(input.intent, input.message);
      return { output: reply, tokens_used: 150, cost_usd: 0.002 };
    },
    retry: { max_attempts: 3, backoff_ms: 1000 },
  })
  .addStep({
    id: "safety",
    name: "Safety Check",
    execute: async (input) => {
      if (input.includes("harmful")) throw new Error("Blocked by safety filter");
      return { output: input };
    },
  });

// 2. Run 100 times
const runs = await runner.benchmark(100);

// 3. Analyze
const analyzer = new ReliabilityAnalyzer();
const report = analyzer.analyze("my-agent", runs);

console.log(`E2E Success Rate: ${(report.e2e_success_rate * 100).toFixed(1)}%`);
console.log(`Failure Hotspots:`, report.failure_hotspots);
console.log(`Cascade Map:`, report.cascade_map);

Core Concepts

| Concept | What It Means |
| --- | --- |
| E2E Success Rate | Fraction of runs where ALL steps completed successfully. The metric that matters. |
| Cascade Failure | When Step 2 fails, Steps 3-5 get skipped. One failure kills the whole chain. |
| Correlation Gap | Difference between the predicted rate (product of step rates) and the actual rate. A gap > 10% means failures are correlated. |
| Chaos Testing | Inject real-world failures (timeouts, rate limits, malformed output) to find breaking points. |
| Reliability Score | A single 0-1 number grading workflow reliability. Empirically calibrated weights. |
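The correlation gap is straightforward to compute by hand from a batch of runs. A sketch, assuming each run records a per-step success flag (the `Run` shape here is illustrative, not the package's types):

```typescript
type Run = { steps: Record<string, boolean> };

// Actual E2E rate: fraction of runs where every step succeeded.
function actualE2E(runs: Run[]): number {
  return runs.filter((r) => Object.values(r.steps).every(Boolean)).length / runs.length;
}

// Predicted rate under independence: product of per-step success rates.
function predictedE2E(runs: Run[]): number {
  const ids = Object.keys(runs[0].steps);
  return ids.reduce(
    (acc, id) => (acc * runs.filter((r) => r.steps[id]).length) / runs.length,
    1,
  );
}

// A gap above ~0.10 suggests step failures are correlated, not independent.
const correlationGap = (runs: Run[]) => Math.abs(predictedE2E(runs) - actualE2E(runs));
```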

Full Usage Guide

1. Define Workflow Steps

Every step needs an id, name, and execute function:

import { WorkflowStep } from "agent-reliability";

const step: WorkflowStep = {
  id: "fetch_context",
  name: "Fetch Context from DB",
  execute: async (input, context) => {
    const docs = await db.query(input.query);
    return {
      output: docs,
      latency_ms: 45,
      tokens_used: 0,
      cost_usd: 0,
    };
  },

  // Optional: validate output before passing to next step
  validate: (output) => ({
    valid: output.length > 0,
    errors: output.length === 0 ? ["No documents found"] : [],
    warnings: [],
  }),

  // Optional: retry on failure
  retry: {
    max_attempts: 3,
    backoff_ms: 1000,
    retry_on: ["timeout", "429"],  // only retry these errors
  },

  // Optional: fallback if all retries fail
  fallback: async (error, input, context) => {
    return { output: [{ content: "Default context", score: 0.5 }] };
  },

  // Optional: timeout
  timeout_ms: 5000,

  // Optional: dependencies (for DAG workflows, not just linear chains)
  depends_on: ["classify"],
};

2. Run and Analyze

import { WorkflowRunner, ReliabilityAnalyzer } from "agent-reliability";

const runner = new WorkflowRunner("customer-support");
runner.addStep(classifyStep);
runner.addStep(fetchStep);
runner.addStep(generateStep);
runner.addStep(safetyStep);
runner.addStep(sendStep);

// Single run
const result = await runner.run({ message: "I need help with billing" });
console.log(result.success);        // true or false
console.log(result.failure_point);   // "classify" or null
console.log(result.cascade_skipped); // ["fetch", "generate", "safety", "send"]

// Benchmark: 100 runs with different inputs
const runs = await runner.benchmark(100, (i) => ({
  message: testMessages[i],
}));

// Analyze
const analyzer = new ReliabilityAnalyzer();
const report = analyzer.analyze("customer-support", runs);

// The metrics nobody else computes:
console.log(report.e2e_success_rate);    // 0.62 (62% end-to-end)
console.log(report.step_success_rates);  // { classify: 0.85, fetch: 0.92, ... }
console.log(report.predicted_e2e_rate);  // 0.71 (predicted if independent)
console.log(report.failure_hotspots);    // [{ step_id: "classify", failure_rate: 0.15, ... }]
console.log(report.cascade_map);         // { classify: ["fetch", "generate", "safety", "send"] }

// The correlation gap: predicted vs actual
const gap = analyzer.correlationGap(report);
console.log(gap.interpretation);
// "Strong correlation — failures cascade. Fix the root cause step."

// What step reliability do you need for 90% E2E across 5 steps?
const required = analyzer.requiredStepReliability(5, 0.90);
console.log(required); // 0.9791 — each step needs 97.9%!

3. Chaos Testing

Inject real-world failures to find how your agent breaks:

import { ChaosInjector } from "agent-reliability";

const chaos = new ChaosInjector({
  failure_rate: 0.15,  // 15% of steps will fail
  seed: 42,            // reproducible results
  failure_types: {
    timeout: 0.20,          // network timeout
    rate_limit: 0.20,       // 429 Too Many Requests
    malformed_output: 0.25, // truncated JSON, missing fields
    latency_spike: 0.15,    // 3-5 second delays
    context_overflow: 0.05, // output too long for next step
    null_return: 0.05,      // step returns null
    random_error: 0.10,     // ECONNREFUSED, SSL errors, etc.
  },
});

// Wrap individual steps
const hardenedStep = chaos.wrap(myStep);

// Or wrap all steps at once
const runner = new WorkflowRunner("chaos-test");
const steps = [classifyStep, fetchStep, generateStep, safetyStep];
for (const step of chaos.wrapAll(steps)) {
  runner.addStep(step);
}

// Run and see what breaks
const runs = await runner.benchmark(100);
const report = analyzer.analyze("chaos-test", runs);
// Now you know: under 15% failure injection, your E2E drops to X%
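Reproducibility via seed comes from pairing a deterministic PRNG with weighted sampling over the failure-type table. A sketch of how such an injector can pick failure types — mulberry32 is one common tiny seeded PRNG; this is not the package's internal implementation:

```typescript
// Tiny deterministic PRNG (mulberry32): same seed => same sequence in [0, 1).
function mulberry32(seed: number): () => number {
  return () => {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Weighted choice over a { failureType: weight } table.
function pickFailure(weights: Record<string, number>, rand: () => number): string {
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  let roll = rand() * total;
  for (const [type, w] of Object.entries(weights)) {
    if ((roll -= w) <= 0) return type;
  }
  return Object.keys(weights)[0]; // floating-point edge case
}

const rand = mulberry32(42); // same seed, same failure schedule every run
pickFailure({ timeout: 0.2, rate_limit: 0.2, malformed_output: 0.25 }, rand);
```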

4. Real-Time Monitoring

Stream events as they happen — don't wait for the benchmark to finish:

import { RealtimeRunner } from "agent-reliability";

const rt = new RealtimeRunner("production-monitor", steps, {
  alert_e2e_threshold: 0.5,   // alert when E2E drops below 50%
  alert_cascade_size: 3,       // alert when 3+ steps get skipped
});

// Subscribe to events
rt.on("step:failed", (event) => {
  console.error(`FAILED: ${event.data.step_name}: ${event.data.error}`);
  sendSlackAlert(event);
});

rt.on("alert:cascade", (event) => {
  console.error(`CASCADE: ${event.data.failure_point} killed ${event.data.count} steps`);
  pageOnCall(event);
});

rt.on("alert:e2e_drop", (event) => {
  console.error(`E2E DROPPED to ${(event.data.e2e_rate * 100).toFixed(0)}%`);
});

rt.on("report:updated", (event) => {
  // Live report updates after every run
  updateDashboard(event.data.report);
});

// Run single
await rt.runOnce(userInput);

// Or run concurrent (5 workers)
const results = await rt.runConcurrent(100, 5);

// Live report always available
const report = rt.getReport();

5. Scale Testing

Rate-limited testing with rolling time windows:

import { ScaleRunner } from "agent-reliability";

const scale = new ScaleRunner("load-test", steps, {
  rate_limit_rps: 10,          // 10 runs per second
  windows: [300, 900, 3600],   // 5min, 15min, 1hr rolling stats
  adaptive_threshold: 0.5,     // speed up testing when failures spike
  adaptive_multiplier: 3,      // 3x faster during failure spikes
});

// Subscribe to events
scale.on("alert:e2e_drop", (e) => console.error("E2E dropped!"));

// Run 1000 tests at 10 RPS
const { report, windowed_reports } = await scale.runBatch(1000, 3);

// windowed_reports shows reliability over different time windows:
// [
//   { window: "300s",  report: { e2e_success_rate: 0.72 } },  // last 5 min
//   { window: "900s",  report: { e2e_success_rate: 0.68 } },  // last 15 min
//   { window: "3600s", report: { e2e_success_rate: 0.65 } },  // last hour
// ]

// Stop mid-run
setTimeout(() => scale.stop(), 30000); // stop after 30s
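Under the hood, a rolling window is just a filter over timestamped runs. A sketch of the underlying computation (the `TimedRun` shape is illustrative, not the package's types):

```typescript
type TimedRun = { at: number; success: boolean }; // at = epoch milliseconds

// Success rate over the last `windowSec` seconds, relative to `now`.
// Returns null when no runs fall inside the window.
function windowedRate(runs: TimedRun[], windowSec: number, now: number): number | null {
  const recent = runs.filter((r) => now - r.at <= windowSec * 1000);
  if (recent.length === 0) return null;
  return recent.filter((r) => r.success).length / recent.length;
}
```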

6. Scoring and Grading

Get a single number and actionable grade:

import { ReliabilityScorer } from "agent-reliability";

const scorer = new ReliabilityScorer();
const result = scorer.score(report);

console.log(result.score);    // 0.77 (sum of the breakdown components below)
console.log(result.grade);    // "reliable"
console.log(result.breakdown);
// {
//   e2e_component: 0.432,     // 60% weight × 0.72 E2E rate
//   step_min_component: 0.12, // 15% weight × 0.80 weakest step
//   cascade_component: 0.08,  // 10% weight × 0.80 cascade score
//   latency_component: 0.09,  // 10% weight × 0.90 latency score
//   cost_component: 0.048,    // 5% weight × 0.96 cost score
// }
console.log(result.recommendations);
// [
//   "Step 'classify' is only 80% reliable. Add retry/fallback.",
//   "3 cascade patterns detected. Fix root-cause steps.",
// ]

// Grade scale:
//   0.0-0.3: unreliable — not production-ready
//   0.3-0.6: fragile — works sometimes, needs hardening
//   0.6-0.8: reliable — production candidate
//   0.8-1.0: robust — production-grade
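The grade thresholds map directly onto a small function, useful for a custom CI gate. A sketch using the documented cut-offs (treating each lower bound as inclusive is an assumption; the package may handle boundaries differently):

```typescript
type Grade = "unreliable" | "fragile" | "reliable" | "robust";

// Map a 0-1 reliability score onto the documented grade scale.
function gradeOf(score: number): Grade {
  if (score < 0.3) return "unreliable";
  if (score < 0.6) return "fragile";
  if (score < 0.8) return "reliable";
  return "robust";
}

console.log(gradeOf(0.72)); // "reliable"
```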

7. Streaming Agents

Test agents that use OpenAI's stream: true:

import { streamingLLMStep, openaiStreamAdapter } from "agent-reliability";
import OpenAI from "openai";

const openai = new OpenAI();

const step = streamingLLMStep({
  id: "generate",
  name: "Generate Reply",
  stream_fn: async function* (input) {
    const stream = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input.message }],
      stream: true,
    });
    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content ?? "";
      yield { content, done: chunk.choices[0]?.finish_reason === "stop" };
    }
  },
  parse: (text) => JSON.parse(text),
  on_chunk: (chunk, accumulated) => {
    // Real-time UI update
    process.stdout.write(chunk.content);
  },
});

8. Pre-Built Steps

Common agent patterns as drop-in steps:

import { llmStep, toolStep, ragStep, routerStep, guardrailStep } from "agent-reliability";

// LLM call with automatic retry
const classify = llmStep({
  id: "classify",
  name: "Classify Intent",
  model: "gpt-4",
  client: openai,
  system_prompt: "Classify the user's intent as: billing, support, sales, other",
  format_prompt: (input) => input.message,
  parse_response: (text) => ({ intent: text.trim().toLowerCase() }),
});

// Tool call with timeout
const search = toolStep({
  id: "search",
  name: "Search Knowledge Base",
  tool_fn: async (args) => await kb.search(args.query),
  extract_args: (input) => ({ query: input.message }),
  timeout_ms: 3000,
});

// RAG retrieval with quality validation
const retrieve = ragStep({
  id: "retrieve",
  name: "Retrieve Context",
  retrieve_fn: async (query) => await vectorDB.search(query, { top_k: 5 }),
  extract_query: (input) => input.message,
  min_docs: 2,
  min_score: 0.7,
});

// Conditional routing
const router = routerStep({
  id: "route",
  name: "Route by Intent",
  route_fn: (input) => input.intent,
  branches: {
    billing: billingStep,
    support: supportStep,
    sales: salesStep,
  },
  default_branch: "support",
});

// Safety guardrail
const safety = guardrailStep({
  id: "safety",
  name: "Content Safety",
  checks: [
    { name: "no_pii", check: (input) => !input.match(/\b\d{3}-\d{2}-\d{4}\b/), error_msg: "Contains SSN" },
    { name: "no_harmful", check: (input) => !input.includes("hack"), error_msg: "Contains harmful content" },
    { name: "min_length", check: (input) => input.length >= 10, error_msg: "Response too short" },
  ],
});

9. Reports and Audit Trails

import { ReliabilityReporter } from "agent-reliability";

const reporter = new ReliabilityReporter();

// JSON export (for programmatic consumption)
const json = reporter.toJSON(report);
fs.writeFileSync("reliability-report.json", json);

// HTML dashboard (open in browser)
const html = reporter.toHTML(report, runs);
fs.writeFileSync("reliability-report.html", html);

// JSONL audit trail (for compliance — EU AI Act, SOC2)
const auditLog = runner.getAuditLog();
const jsonl = reporter.auditToJSONL(auditLog);
fs.writeFileSync("audit-trail.jsonl", jsonl);
// Each line:
// {"timestamp":"2026-04-12T...","run_id":"abc","step_id":"classify",
//  "actor":"step:Classify Intent","trigger":"execute","outcome":"success",
//  "input_hash":"<sha256-hex>","output_hash":"<sha256-hex>"}

Security notes for reports:

  • Step error messages, step IDs, and workflow IDs are included verbatim in JSON reports and the HTML dashboard. Do not throw errors whose .message contains secrets (API keys, PII, prompts with user data) — catch and sanitize at the step boundary first.
  • HTML reports escape interpolated values, but the report is still generated from data your steps produce. Treat the HTML output as "trusted only as much as your step inputs are trusted."
  • input_hash / output_hash use SHA-256 over JSON.stringify(data). Key ordering in objects is not canonicalized, so two logically-equal objects with different key order will produce different hashes. Use the hashes for change detection, not cryptographic proof of equivalence.
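The key-ordering caveat is easy to demonstrate: two deep-equal objects hash differently once their keys are inserted in a different order. A sketch using Node's crypto with the same SHA-256-over-JSON.stringify scheme:

```typescript
import { createHash } from "crypto";

// SHA-256 over JSON.stringify — the hashing scheme the audit trail describes.
const hashOf = (data: unknown): string =>
  createHash("sha256").update(JSON.stringify(data)).digest("hex");

const a = { user: "alice", intent: "billing" };
const b = { intent: "billing", user: "alice" }; // same data, different key order

console.log(hashOf(a) === hashOf(b)); // false — JSON.stringify preserves insertion order
```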

10. Security Review

SecurityReviewer runs security checks against the same workflow you're testing for reliability. It slots alongside WorkflowRunner and ChaosInjector and emits a SecurityReport with findings graded critical | high | medium | low | info.

Scope. This library tests the workflow layer — the chain of steps you define. Model-layer threats (data poisoning, membership inference, model inversion, backdoors) require training access and are out of scope. Infrastructure threats (auth, cloud misconfig, K8s) belong to your deployment config, not here.

import { WorkflowRunner, SecurityReviewer } from "agent-reliability";

const runner = new WorkflowRunner("support-agent");
// ...addStep(...) as usual

const reviewer = new SecurityReviewer(runner, {
  baselineInput: { message: "hello" },
  canaries: [process.env.TEST_API_KEY!, "user@example.com"],
  toolAllowList: {
    retrieve: ["kb.search"],
    generate: ["openai.chat"],
  },
});

const report = await reviewer.run(); // runs all checks
console.log(report.summary); // { critical: 0, high: 1, medium: 3, low: 2, info: 1 }
for (const f of report.findings) {
  console.log(`[${f.severity}] ${f.check} · ${f.step_id ?? "-"} · ${f.title}`);
}

Run a subset of checks:

await reviewer.run({ checks: ["injection", "canary", "static-code"] });

Checks available:

  • injection — Replays a corpus of 19 prompt-injection and jailbreak payloads against each step. Flags outputs that echo the payload verbatim or contain known marker strings (PWNED, INJECTED, LEAK, etc.). Import and extend the corpus via INJECTION_PAYLOADS. Customize how payloads are merged into your input shape with injectInput: (baseline, payload) => ....

  • canary — You declare secret strings that should never appear in any step output (API keys, PII in fixtures, prompt-template internals). The reviewer runs the workflow once with your baselineInput and asserts no canary leaks into any StepRunResult.output. Passing no canaries skips the check with an info-level note.

  • tools — Scans step source for tool-like calls (fetch, exec, axios, filesystem writes) and flags steps without a toolAllowList entry. Honest limitation: a static scan can't see tools invoked through helper modules. For rigorous enforcement, wrap each tool with a runtime call-logger — planned for v0.3.

  • static-code — Regex scan of step.execute.toString() for eval(), new Function(), child_process, exec/spawn, dynamic require(). These are rare in LLM agent code and almost always a bug.

  • missing-controls — Inspects step config: no timeout_ms? no validate? no retry? no fallback? Emits a finding per missing control. Good for CI gates — fail the build if any step lacks a timeout.

  • deps — Wraps npm audit --json in the working directory and folds advisories into the report at matching severity. Requires a package-lock.json and npm on PATH.
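The technique behind static-code — a regex scan over `Function.prototype.toString()` — is easy to replicate for project-specific patterns. A sketch (this pattern list is illustrative, not the package's exact rule set):

```typescript
// Patterns that are rare in LLM agent step code and almost always a bug.
const DANGEROUS: RegExp[] = [
  /\beval\s*\(/,
  /new\s+Function\s*\(/,
  /child_process/,
  /\bexecSync?\s*\(/,
];

// Scan a step's execute function source for dangerous calls.
function scanStep(execute: (...args: any[]) => any): string[] {
  const src = execute.toString();
  return DANGEROUS.filter((re) => re.test(src)).map((re) => re.source);
}

const risky = () => eval("1 + 1");
const safe = (x: number) => x * 2;
console.log(scanStep(risky).length); // 1 — eval detected
console.log(scanStep(safe).length);  // 0
```

As the tools check's limitation notes, this sees only source text: calls routed through helper modules are invisible to it.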

Extending the injection corpus. The corpus ships as INJECTION_PAYLOADS in agent-reliability/payloads. For your own agent, add project-specific payloads (known customer-reported prompt attacks, regulator requirements) and pass them via injectInput. A growing internal corpus is the strongest defense over time — ship it as a test fixture in your repo and re-run on every prompt-template change.

CI example:

// tests/security.ci.ts
import { WorkflowRunner, SecurityReviewer } from "agent-reliability";
import { buildWorkflow } from "../src/agent";

it("has no high-severity security findings", async () => {
  const runner = buildWorkflow();
  const report = await new SecurityReviewer(runner, {
    baselineInput: { message: "hello" },
    canaries: [process.env.FIXTURE_API_KEY!],
  }).run();

  const blocking = report.findings.filter(
    (f) => f.severity === "critical" || f.severity === "high",
  );
  expect(blocking).toEqual([]);
});

Real-World Examples

Example 1: Customer Support Agent

const runner = new WorkflowRunner("support-agent");
runner
  .addStep(llmStep({ id: "classify", name: "Classify", model: "gpt-4", client: openai, system_prompt: "Classify intent", format_prompt: (i) => i.message }))
  .addStep(ragStep({ id: "retrieve", name: "Retrieve", retrieve_fn: kb.search, extract_query: (i) => i }))
  .addStep(llmStep({ id: "generate", name: "Generate", model: "gpt-4", client: openai, system_prompt: "Answer using context", format_prompt: (i) => JSON.stringify(i) }))
  .addStep(guardrailStep({ id: "safety", name: "Safety", checks: [{ name: "no_pii", check: (i) => !i.match(/\d{3}-\d{2}-\d{4}/), error_msg: "PII detected" }] }));

// Benchmark
const runs = await runner.benchmark(200);
const report = analyzer.analyze("support-agent", runs);
const score = scorer.score(report);
console.log(`Grade: ${score.grade} (${score.score.toFixed(2)})`);

Example 2: Code Generation Agent with Chaos

const chaos = new ChaosInjector({ failure_rate: 0.20, seed: 42 });
const runner = new WorkflowRunner("codegen");
runner
  .addStep(chaos.wrap(planStep))
  .addStep(chaos.wrap(generateStep))
  .addStep(chaos.wrap(testStep))
  .addStep(chaos.wrap(reviewStep));

const runs = await runner.benchmark(500);
const report = analyzer.analyze("codegen", runs);
console.log(`Under 20% chaos: E2E = ${(report.e2e_success_rate * 100).toFixed(1)}%`);

Example 3: Production Monitoring

const rt = new RealtimeRunner("prod-agent", steps);
rt.on("alert:e2e_drop", async (e) => {
  await slack.send(`Agent reliability dropped to ${(e.data.e2e_rate * 100).toFixed(0)}%`);
});
rt.on("alert:cascade", async (e) => {
  await pagerduty.trigger(`Cascade: ${e.data.failure_point}: ${e.data.count} steps skipped`);
});

// Continuous monitoring
while (true) {
  await rt.runOnce(getNextRequest());
}

API Reference

| Class | Purpose |
| --- | --- |
| WorkflowRunner | Execute workflows with retry, fallback, timeout, validation |
| ReliabilityAnalyzer | Compute E2E rates, cascade maps, correlation gaps |
| ReliabilityScorer | Single 0-1 score with grade and recommendations |
| ChaosInjector | Inject failures: timeout, 429, malformed, latency, null |
| RealtimeRunner | Live event streaming, concurrent execution, alerts |
| ScaleRunner | Rate-limited batch runs with rolling time windows |
| ReliabilityReporter | JSON, HTML dashboard, JSONL audit trail |

Step Builder Pattern

| Builder | Purpose |
| --- | --- |
| llmStep() | LLM call with retry and response parsing |
| toolStep() | External tool/API call with timeout |
| ragStep() | Vector DB retrieval with quality validation |
| routerStep() | Conditional branching |
| guardrailStep() | Safety checks (PII, harmful content, format) |
| streamingLLMStep() | Streaming LLM with chunk buffering |

Architecture

                    ┌─────────────────┐
                    │  Your Agent     │
                    │  (any framework)│
                    └────────┬────────┘
                             │ define steps
                    ┌────────▼────────┐
                    │ WorkflowRunner  │ ←── ChaosInjector (optional)
                    │ retry/fallback/ │
                    │ timeout/validate│
                    └────────┬────────┘
                             │ run N times
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐
        │ Run 1    │  │ Run 2    │  │ Run N    │
        │ step→step│  │ step→step│  │ step→step│
        └────┬─────┘  └────┬─────┘  └────┬─────┘
             │              │              │
             └──────────────┼──────────────┘
                    ┌───────▼───────┐
                    │   Analyzer    │
                    │ E2E rate      │
                    │ cascade map   │
                    │ correlation   │
                    └───────┬───────┘
                    ┌───────▼───────┐
                    │    Scorer     │
                    │ 0-1 score     │
                    │ grade + recs  │
                    └───────┬───────┘
                    ┌───────▼───────┐
                    │   Reporter    │
                    │ JSON/HTML/    │
                    │ JSONL audit   │
                    └───────────────┘

Comparison

| Feature | DeepEval | Langfuse | LangSmith | agent-reliability |
| --- | --- | --- | --- | --- |
| Test individual LLM calls | Yes | Yes | Yes | Yes |
| Test multi-step chains | No | No | No | Yes |
| E2E success rate | No | No | No | Yes |
| Cascade failure detection | No | No | No | Yes |
| Predicted vs actual gap | No | No | No | Yes |
| Chaos/fault injection | No | No | No | Yes |
| Real-time event streaming | No | Yes | Yes | Yes |
| Concurrent benchmark | No | No | No | Yes |
| Rolling time windows | No | No | No | Yes |
| Tamper-evident audit trail | No | No | No | Yes |
| Single reliability score | No | No | No | Yes |
| Streaming agent support | No | Yes | Yes | Yes |
| Pre-built step patterns | No | No | No | Yes |
| Open source | Yes | Yes | No | Yes |

Use It Directly in Any AI IDE

One command. The IDE does the rest.

Step 1: Install

npm install agent-reliability

Step 2: Tell Your IDE

Just paste this prompt into any AI IDE — Cursor, Claude Code, Windsurf, Copilot, Cody, Aider — it works everywhere:

I installed agent-reliability (npm package). Read my agent code and:

1. Wrap each step as a WorkflowStep
2. Run benchmark(100) with ChaosInjector at 15% failure rate
3. Analyze with ReliabilityAnalyzer
4. Score with ReliabilityScorer
5. Show me: E2E rate, failure hotspots, cascade map, grade
6. Generate the HTML report

Use these imports:
import { WorkflowRunner, ReliabilityAnalyzer, ReliabilityScorer, ChaosInjector } from "agent-reliability"

That's it. The IDE reads your code, wraps your steps, runs the benchmark, and shows the report.


Cursor

# In Cursor chat (Cmd+L):
@codebase Use agent-reliability to test the agent in src/agent.ts.
Run 100 times with 15% chaos injection. Show cascade failures.

Add to .cursor/rules for automatic use:

When the user asks to test agent reliability, use the agent-reliability npm package.
Import WorkflowRunner, ReliabilityAnalyzer, ChaosInjector, ReliabilityScorer.
Wrap each agent step, benchmark 100 runs, report E2E rate and failure hotspots.

Claude Code

# In terminal:
claude "Use agent-reliability to test my agent in src/agent.ts.
Benchmark 100 runs, inject 15% chaos, generate HTML report."

Windsurf

# In Windsurf chat:
Use agent-reliability to benchmark my agent workflow.
Show the reliability score and recommendations.

GitHub Copilot

# In Copilot chat:
/test Use agent-reliability to create a reliability test for my agent.
Include chaos injection and cascade failure detection.

Aider

aider "Add a reliability test using agent-reliability package.
Test src/agent.ts with 100 runs and chaos injection."

What the IDE Will Generate

The IDE reads your agent code and produces something like this:

import { WorkflowRunner, ReliabilityAnalyzer, ChaosInjector, ReliabilityScorer, ReliabilityReporter } from "agent-reliability";
import { myAgent } from "./src/agent";
import * as fs from "fs";

async function testReliability() {
  const chaos = new ChaosInjector({ failure_rate: 0.15, seed: 42 });
  const runner = new WorkflowRunner("my-agent");

  // IDE auto-wraps your agent's steps
  runner.addStep(chaos.wrap({ id: "parse",    name: "Parse Input",    execute: myAgent.parse }));
  runner.addStep(chaos.wrap({ id: "retrieve", name: "Retrieve Docs",  execute: myAgent.retrieve }));
  runner.addStep(chaos.wrap({ id: "generate", name: "Generate Reply", execute: myAgent.generate }));
  runner.addStep(chaos.wrap({ id: "validate", name: "Safety Check",   execute: myAgent.validate }));

  // Benchmark
  const runs = await runner.benchmark(100);

  // Analyze
  const analyzer = new ReliabilityAnalyzer();
  const report = analyzer.analyze("my-agent", runs);

  // Score
  const scorer = new ReliabilityScorer();
  const result = scorer.score(report);

  // Print results
  console.log(analyzer.formatReport(report));
  console.log(`\nScore: ${result.score.toFixed(2)} (${result.grade.toUpperCase()})`);
  for (const rec of result.recommendations) console.log(`  → ${rec}`);

  // Save HTML report
  const reporter = new ReliabilityReporter();
  fs.writeFileSync("reliability.html", reporter.toHTML(report, runs));
  console.log("\nReport saved: reliability.html");
}

testReliability();

As a Jest/Vitest Test

// test/reliability.test.ts — the IDE can generate this for you
import { WorkflowRunner, ReliabilityAnalyzer, ReliabilityScorer, ChaosInjector } from "agent-reliability";

test("agent is production-ready", async () => {
  const chaos = new ChaosInjector({ failure_rate: 0.15, seed: 42 });
  const runner = new WorkflowRunner("my-agent");
  // ... add steps ...

  const runs = await runner.benchmark(100);
  const report = new ReliabilityAnalyzer().analyze("my-agent", runs);
  const score = new ReliabilityScorer().score(report);

  expect(score.grade).not.toBe("unreliable");
  expect(report.e2e_success_rate).toBeGreaterThan(0.5);
});

Run: npx jest test/reliability.test.ts


CI/CD — Block Unreliable Agents from Deploying

# .github/workflows/reliability.yml
name: Agent Reliability Gate
on: [push, pull_request]
jobs:
  reliability:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm install
      - run: npx jest test/reliability.test.ts --forceExit

If the agent's reliability score is "unreliable", the CI build fails. No unreliable agents reach production.


License

MIT
