This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Kubernetes-native agent evaluation system that executes test datasets via A2A protocol, evaluates responses using a generic metrics framework (with RAGAS as the default adapter), and publishes metrics via OTLP. Part of the Agentic Layer platform for automated agent testing and quality assurance.
```bash
# Install dependencies
uv sync

# Run all quality checks (tests, mypy, bandit, ruff)
uv run poe check

# Run unit tests only
uv run poe test

# Run end-to-end tests (requires Tilt environment running)
uv run poe test_e2e

# Code formatting and linting
uv run poe format   # Format with Ruff
uv run poe lint     # Lint and auto-fix with Ruff
uv run poe ruff     # Both format and lint

# Type checking and security
uv run poe mypy     # Static type checking
uv run poe bandit   # Security vulnerability scanning
```
```bash
# Start full Kubernetes environment (operators, agents, observability)
tilt up

# Stop environment
tilt down

# Required environment variables for local testing
export OPENAI_API_BASE="http://localhost:11001"  # AI Gateway endpoint
export GOOGLE_API_KEY="your-api-key"             # Required for Gemini models
```
```bash
# Phase 1: Download and convert dataset to Experiment JSON
uv run python3 testbench/setup.py "http://localhost:11020/dataset.csv"

# Phase 2: Execute queries through agent via A2A protocol
uv run python3 testbench/run.py "http://localhost:11010" "my-workflow"

# Phase 3: Evaluate responses using metrics (with model override)
uv run python3 testbench/evaluate.py --model gemini-2.5-flash-lite

# Phase 4: Publish metrics to OTLP endpoint
OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318" uv run python3 testbench/publish.py "workflow-name" "exec-001" 1

# Optional: Generate HTML visualization report
uv run python3 testbench/visualize.py "weather-assistant-test" "exec-001" 1
```

Generate a comprehensive HTML dashboard from evaluation results for local viewing and sharing.
```bash
# Basic usage (after running evaluate.py)
uv run python3 testbench/visualize.py weather-assistant-test exec-001 1

# Custom input/output paths
uv run python3 testbench/visualize.py weather-assistant-test exec-001 1 \
  --input data/experiments/evaluated_experiment.json \
  --output reports/exec-001.html
```

Required Arguments:
- `workflow_name` - Name of the test workflow (e.g., 'weather-assistant-test')
- `execution_id` - Testkube execution ID for this workflow run
- `execution_number` - Testkube execution number for this workflow run
Optional Arguments:
- `--input` - Path to evaluated experiment JSON (default: `data/experiments/evaluated_experiment.json`)
- `--output` - Path for output HTML file (default: `data/results/evaluation_report.html`)
Features:
- Summary Cards: Total samples, metrics count, scenario overview
- Workflow Metadata: Displays workflow name, execution ID, and execution number
- Overall Scores Chart: Horizontal bar chart showing mean score per metric
- Metric Distributions: Histograms showing score distributions with statistics
- Detailed Results Table: All steps with evaluations, searchable and sortable
- Multi-Turn Support: Chat-bubble visualization for conversational datasets
- Self-Contained HTML: Single file with embedded Chart.js, works offline
- Responsive Design: Works on desktop and tablet, print-friendly
```bash
# Run complete evaluation workflow in Kubernetes
kubectl testkube run testworkflow ragas-evaluation-workflow \
  --config datasetUrl="http://data-server.data-server:8000/dataset.csv" \
  --config agentUrl="http://weather-agent.sample-agents:8000" \
  --config model="gemini-2.5-flash-lite" \
  -n testkube

# Watch workflow execution
kubectl testkube watch testworkflow ragas-evaluation-workflow -n testkube

# Get workflow logs
kubectl testkube logs testworkflow ragas-evaluation-workflow -n testkube
```
```bash
# Build Docker image locally
make build

# Run container locally
make run
```

Core Concept: Sequential pipeline where each phase reads its input from the previous phase's output via the shared `/app/data` volume. Uses typed Pydantic models (`Experiment` → `ExecutedExperiment` → `EvaluatedExperiment`) with JSON serialisation.
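As a minimal sketch of that hand-off (assuming the models are standard Pydantic v2 classes; the real definitions live in `schema/models.py`):

```python
from pathlib import Path

from pydantic import BaseModel


class Experiment(BaseModel):
    """Illustrative stand-in for the real model in schema/models.py."""

    llm_as_a_judge_model: str
    default_threshold: float
    scenarios: list[dict] = []


def read_experiment(path: Path) -> Experiment:
    # Each phase begins by deserialising the previous phase's output file.
    return Experiment.model_validate_json(path.read_text())


def write_experiment(experiment: Experiment, path: Path) -> None:
    # ...and ends by serialising its own typed result for the next phase.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(experiment.model_dump_json(indent=2))
```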
Phase 1: Setup (`testbench/setup.py`)
- Input: Dataset URL (CSV, JSON, or Parquet)
- Output: `data/datasets/experiment.json` (`Experiment` model)
- Purpose: Downloads the external dataset and maps rows to `Step` objects with `input`, `reference`, and `custom_values` fields, as sketched below
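A hedged sketch of that mapping, assuming pandas is used for loading (the real logic, including quoted-array parsing for CSV, lives in `testbench/setup.py`):

```python
import pandas as pd
from pydantic import BaseModel


class Step(BaseModel):
    """Illustrative stand-in; the real Step lives in schema/models.py."""

    input: str
    reference: str | None = None
    custom_values: dict = {}


def rows_to_steps(df: pd.DataFrame) -> list[Step]:
    steps = []
    for _, row in df.iterrows():
        steps.append(
            Step(
                input=row["user_input"],
                reference=row.get("reference"),
                # Remaining columns (e.g. retrieved_contexts) travel along
                # for metrics that need extra inputs.
                custom_values={
                    col: row[col]
                    for col in df.columns
                    if col not in ("user_input", "reference")
                },
            )
        )
    return steps


steps = rows_to_steps(pd.read_csv("dataset.csv"))
```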
Phase 2: Run (`testbench/run.py`)
- Input: `data/datasets/experiment.json` + Agent URL + workflow name
- Output: `data/experiments/executed_experiment.json` (`ExecutedExperiment` model)
- Purpose: Sends queries to the agent via A2A protocol using `A2AStepClient`, records agent responses as `Turn` objects
- Pattern: Uses `ExperimentRuntime` with hooks (`before_scenario`, `on_step`, `after_scenario`)
- Tracing: Creates OpenTelemetry spans per scenario with trace IDs, as sketched below
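The per-scenario tracing uses the standard OpenTelemetry API; a minimal sketch (span naming is illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer("testbench.run")


def run_scenario(scenario_name: str) -> str:
    # One span per scenario; the hex trace ID is kept on the
    # ExecutedScenario so results can be correlated in Tempo/Grafana.
    with tracer.start_as_current_span(scenario_name) as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        # ... send each step to the agent inside this span ...
        return trace_id
```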
Phase 3: Evaluate (`testbench/evaluate.py`)
- Input: `data/experiments/executed_experiment.json` + optional `--model` override
- Output: `data/experiments/evaluated_experiment.json` (`EvaluatedExperiment` model)
- Purpose: Calculates metrics using LLM-as-a-judge via the generic metrics framework
- Pattern: Uses `ExperimentRuntime` with `MetricEvaluator` hooks
- Metrics: Configured via `Metric` objects on each step; uses `GenericMetricsRegistry` with the RAGAS adapter
Phase 4: Publish (`testbench/publish.py`)
- Input: `data/experiments/evaluated_experiment.json` + workflow name + execution ID + execution number
- Output: Metrics published to the OTLP endpoint (configured via the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable)
- Purpose: Sends per-step evaluation scores to the observability backend (LGTM/Grafana) via OpenTelemetry
Optional: Visualize (`testbench/visualize.py`)
- Input: `data/experiments/evaluated_experiment.json`
- Output: `data/results/evaluation_report.html` (self-contained HTML dashboard)
- Purpose: Generates a comprehensive HTML report with charts, tables, and statistics
- Note: Runs independently of Phase 4 and can be used for local development without an OTLP backend
```
External Dataset (CSV/JSON/Parquet)
    ↓ [setup.py]
data/datasets/experiment.json (Experiment)
    ↓ [run.py + A2AStepClient]
data/experiments/executed_experiment.json (ExecutedExperiment)
    ↓ [evaluate.py + GenericMetricsRegistry]
data/experiments/evaluated_experiment.json (EvaluatedExperiment)
    ├─→ [publish.py + OTLP]
    │     Observability Backend (Grafana)
    └─→ [visualize.py]
          data/results/evaluation_report.html (Local Visualization)
```
```
Step(input, reference, custom_values, metrics)
 └→ ExecutedStep(+id, turns)
     └→ EvaluatedStep(+evaluations: list[Evaluation])

Scenario(name, steps)
 └→ ExecutedScenario(+id, trace_id)
     └→ EvaluatedScenario(+evaluations)

Experiment(llm_as_a_judge_model, default_threshold, scenarios)
 └→ ExecutedExperiment(+id)
     └→ EvaluatedExperiment
```
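The chain maps naturally onto Pydantic inheritance; a hedged sketch with field names taken from the diagram (actual definitions live in `schema/models.py`):

```python
from pydantic import BaseModel


class Metric(BaseModel):
    metric_name: str
    threshold: float | None = None


class Step(BaseModel):
    input: str
    reference: str | None = None
    custom_values: dict = {}
    metrics: list[Metric] = []


class ExecutedStep(Step):
    # Execution adds an ID plus the recorded agent turns.
    id: str
    turns: list[dict] = []


class EvaluatedStep(ExecutedStep):
    # Evaluation adds one entry per configured metric.
    evaluations: list[dict] = []
```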
Orchestration Pattern: Each phase is a reusable `TestWorkflowTemplate` CRD that executes the same Docker image with different script arguments.
Shared State: All phases mount the same `emptyDir` volume at `/app/data`, enabling stateless containers with persistent data flow between steps.
Template Files:
- `deploy/base/templates/setup-template.yaml` - Phase 1
- `deploy/base/templates/run-template.yaml` - Phase 2
- `deploy/base/templates/evaluate-template.yaml` - Phase 3
- `deploy/base/templates/publish-template.yaml` - Phase 4
- `deploy/local/ragas-evaluation-workflow.yaml` - Combines all templates into a complete workflow
Key Workflow Parameters:
- `datasetUrl` - HTTP URL to the test dataset
- `agentUrl` - A2A endpoint of the agent to evaluate
- `model` - LLM model for evaluation (e.g., `gemini-2.5-flash-lite`)
- `otlpEndpoint` - OpenTelemetry collector URL (default: `http://lgtm.monitoring:4318`)
- `image` - Docker image to use (default: `ghcr.io/agentic-layer/testbench/testworkflows:latest`)
- Architecture: `FrameworkAdapter` base class with pluggable implementations, as sketched below
- Registry: `GenericMetricsRegistry` lazily loads framework adapters
- Default Adapter: RAGAS (`metrics/ragas/adapter.py`) — translates `ExecutedStep` into RAGAS `EvaluationDataset` samples
- Protocol: `MetricCallable` protocol and `MetricResult` dataclass in `metrics/protocol.py`
- LLM Access: Routes through the AI Gateway (LiteLLM) configured via the `OPENAI_API_BASE` environment variable
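A condensed sketch of that plug-in shape (names follow the docs; the field names on `MetricResult` and the loader helper are illustrative stand-ins):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class MetricResult:
    # Illustrative fields: a score plus a threshold verdict.
    score: float
    passed: bool


class MetricCallable(Protocol):
    def __call__(self, step: object) -> MetricResult: ...


class FrameworkAdapter:
    """Base class; concrete adapters (e.g. RAGAS) resolve metric names."""

    def get_metric(self, metric_name: str) -> MetricCallable:
        raise NotImplementedError


def _load_ragas_adapter() -> FrameworkAdapter:
    # Hypothetical stand-in for the lazy import of metrics/ragas/adapter.py;
    # the real registry would return a RagasFrameworkAdapter here.
    return FrameworkAdapter()


class GenericMetricsRegistry:
    def __init__(self) -> None:
        self._adapters: dict[str, FrameworkAdapter] = {}

    def resolve(self, metric_name: str) -> MetricCallable:
        if "ragas" not in self._adapters:
            # Lazy: the framework is imported only on first use.
            self._adapters["ragas"] = _load_ragas_adapter()
        return self._adapters["ragas"].get_metric(metric_name)
```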
- Purpose: Platform-agnostic JSON-RPC protocol for agent communication
- Client Library: `a2a-sdk` Python package, wrapped by `schema/a2a_client.py` (`A2AStepClient`)
- Usage in Testbench: `run.py` uses `A2AStepClient.send_step()` to query agents, as sketched below
- Response Handling: Agent responses are recorded as `Turn` objects with `content` and `type`
- Context Management: The A2A `context_id` field maintains conversation state across multiple turns
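A hedged usage sketch; `send_step()` is documented above, but the constructor arguments and return shape here are assumptions, not the wrapper's actual signature:

```python
import asyncio

from schema.a2a_client import A2AStepClient  # repo-local wrapper
from schema.models import Step


async def query_agent(agent_url: str, step: Step) -> None:
    # The A2A context_id keeps multi-turn conversations on one thread;
    # assumed here to be managed inside the wrapper.
    client = A2AStepClient(agent_url)  # assumed constructor
    turns = await client.send_step(step)  # assumed to return recorded turns
    for turn in turns:
        print(turn)


asyncio.run(query_agent("http://localhost:11010", Step(input="What's the weather?")))
```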
- Purpose: Generic runtime that iterates `Experiment.scenarios[].steps[]` and calls user-provided hooks
- Hooks: `before_run`, `before_scenario`, `on_step`, `after_scenario` (see the sketch after this list)
- Location: `schema/runtime.py`
- Used by: Both `run.py` (`A2AExecutor`) and `evaluate.py` (`MetricEvaluator`)
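The hook contract reduces to a loop like this (a sketch; the real runtime in `schema/runtime.py` also threads IDs and results through):

```python
class ExperimentRuntime:
    """Illustrative skeleton of the hook-based iterator."""

    def __init__(self, experiment, hooks) -> None:
        self.experiment = experiment
        self.hooks = hooks  # object exposing the four hook methods

    def run(self) -> None:
        self.hooks.before_run(self.experiment)
        for scenario in self.experiment.scenarios:
            self.hooks.before_scenario(scenario)
            for step in scenario.steps:
                self.hooks.on_step(step)
            self.hooks.after_scenario(scenario)
```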
- Purpose: Standard protocol for publishing observability data
- Transport: HTTP/protobuf to the OTLP collector endpoint (port 4318)
- Metrics Published: Per-step evaluation gauge with metric name, score, and step/scenario attributes, as sketched below
- Labeling: Each metric is labeled with `workflow_name` for filtering in Grafana
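A minimal publishing sketch with the OpenTelemetry Python SDK over HTTP/protobuf (metric and attribute names are illustrative; `create_gauge` needs a recent SDK release):

```python
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# The endpoint would normally come from OTEL_EXPORTER_OTLP_ENDPOINT.
exporter = OTLPMetricExporter(endpoint="http://localhost:4318/v1/metrics")
provider = MeterProvider(metric_readers=[PeriodicExportingMetricReader(exporter)])
meter = provider.get_meter("testbench.publish")

# One gauge, labelled per step/scenario so Grafana can filter and group.
gauge = meter.create_gauge("evaluation.score")
gauge.set(
    0.87,
    attributes={
        "workflow_name": "weather-assistant-test",
        "metric_name": "context_recall",
        "scenario": "scenario-1",
        "step": "step-1",
    },
)
provider.force_flush()
```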
- Purpose: Local Kubernetes development environment
- What Gets Deployed:
  - Core operators: `agent-runtime` (v0.16.0), `ai-gateway-litellm` (v0.3.2), `agent-gateway-krakend` (v0.4.1)
  - Test infrastructure: `testkube` (v2.4.2), sample `weather-agent`, `data-server`
  - Observability: LGTM stack (Grafana, Loki, Tempo, Mimir)
  - TestWorkflow templates and the evaluation workflow
- Port Forwards: `11001` (AI Gateway), `11010` (Weather Agent), `11000` (Grafana), `11020` (Data Server)
All scripts follow the same pattern: parse arguments → read input file(s) → process → write output file, as sketched below.
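As a sketch, the shared skeleton looks like this (argument names and defaults are illustrative):

```python
import argparse
import json
from pathlib import Path


def main() -> None:
    # 1. Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="data/experiments/executed_experiment.json")
    parser.add_argument("--output", default="data/experiments/evaluated_experiment.json")
    args = parser.parse_args()

    # 2. Read input file(s) from the shared data volume
    data = json.loads(Path(args.input).read_text())

    # 3. Process (phase-specific logic goes here)
    result = data

    # 4. Write the output file for the next phase
    out = Path(args.output)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
```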
- `setup.py`: Dataset download and conversion logic
  - Supports CSV (with quoted array parsing), JSON, and Parquet formats
  - Maps DataFrame rows to `Step` objects with `Reference` and `custom_values`
  - Output: `data/datasets/experiment.json`
- `run.py`: Agent query execution via `A2AExecutor`
  - Uses `A2AStepClient` for async A2A protocol requests
  - Creates OpenTelemetry spans per scenario
  - Records agent responses as `Turn` objects in `ExecutedStep`
- `evaluate.py`: Metric evaluation via `MetricEvaluator`
  - Uses `GenericMetricsRegistry` to resolve metric callables
  - Supports the `--model` CLI arg to override `experiment.llm_as_a_judge_model`
  - Produces `Evaluation` objects (metric + result with score and pass/fail)
- `publish.py`: OTLP metric publishing
  - Reads `EvaluatedExperiment`, iterates scenarios → steps → evaluations
  - Creates gauge metrics per evaluation with step/scenario attributes
  - Uses the workflow name as a metric label
- `visualize.py`: HTML visualization generation
  - Reads `EvaluatedExperiment` and generates a self-contained HTML dashboard
  - Creates summary cards, bar charts, metric distributions, and a results table
  - Uses Chart.js via CDN for interactive visualizations
- `models.py`: Full Pydantic model hierarchy (`Step` → `ExecutedStep` → `EvaluatedStep`, etc.)
- `runtime.py`: `ExperimentRuntime` — generic hook-based experiment iterator
- `a2a_client.py`: `A2AStepClient` — thin wrapper around the A2A SDK
- `protocol.py`: `MetricCallable` protocol, `MetricResult` dataclass
- `adapter.py`: Abstract `FrameworkAdapter` base class
- `registry.py`: `GenericMetricsRegistry` (lazy-loads adapters)
- `ragas/adapter.py`: RAGAS-specific `RagasFrameworkAdapter`
Unit Tests (`tests/`):
- One test file per script: `test_setup.py`, `test_run.py`, `test_evaluate.py`, `test_evaluate_experiment.py`, `test_publish.py`, `test_visualize.py`
- Additional: `test_a2a_client.py`, `test_metrics.py`
- Uses pytest with async support (`pytest-asyncio`)
- Mocks external dependencies: A2A client, metrics registry, OTLP
- Uses the `tmp_path` fixture for file I/O testing (see the sketch after this list)
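A representative test shape under those conventions (the function under test here is a stand-in, not actual repo code):

```python
import json
from pathlib import Path
from unittest.mock import AsyncMock

import pytest


async def execute(client, output: Path) -> None:
    """Stand-in for the unit under test (the real logic lives in testbench/run.py)."""
    turns = await client.send_step({"input": "What's the weather?"})
    output.write_text(json.dumps({"turns": turns}))


@pytest.mark.asyncio
async def test_execute_writes_output(tmp_path: Path) -> None:
    # External dependencies are mocked: no real A2A calls or LLM requests.
    client = AsyncMock()
    client.send_step.return_value = [{"content": "Sunny.", "type": "text"}]

    # File I/O runs against the pytest tmp_path fixture.
    output = tmp_path / "executed_experiment.json"
    await execute(client, output)

    assert json.loads(output.read_text())["turns"][0]["content"] == "Sunny."
```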
E2E Test (`tests_e2e/test_e2e.py`):
- Runs the complete 4-phase pipeline in sequence
- Configurable via environment variables: `E2E_DATASET_URL`, `E2E_AGENT_URL`, `E2E_MODEL`, etc.
- Validates that output files exist after each phase
- Requires a running Tilt environment for dependencies
Testkube Templates (`deploy/base/templates/`):
- Each template is a `TestWorkflowTemplate` CRD
- Defines container spec, volume mounts, and command arguments
- Parameterized with `config.*` variables (e.g., `{{ config.datasetUrl }}`)
Local Development (`deploy/local/`):
- `ragas-evaluation-workflow.yaml` - Complete workflow definition
- `weather-agent.yaml` - Sample Agent CRD for testing
- `lgtm.yaml` - Grafana LGTM observability stack
- `data-server/` - ConfigMap with test datasets + Service for HTTP access
- Never delete failing tests - Either update tests to match correct implementation or fix code to pass tests
- Unit tests must mock external dependencies - No real HTTP calls, A2A clients, or LLM requests
- E2E test validates file existence - Doesn't validate content correctness (use unit tests for that)
- Line Length: 120 characters max (Ruff)
- Type Hints: Required for all function signatures (mypy enforced)
- Import Sorting: Enabled via Ruff (I001 rule)
- Security Scanning: Bandit checks for vulnerabilities
- Naming Conventions: PEP 8 compliant (Ruff N rule)
- Run automatically before commits via `.pre-commit-config.yaml`
- Enforces: Ruff formatting/linting, mypy, bandit
- Manual run: `pre-commit run --all-files`
Metrics are resolved through the `GenericMetricsRegistry`:
- Define `Metric(metric_name="...", threshold=0.8)` on each `Step` in the experiment JSON, as shown below
- The registry delegates to the appropriate `FrameworkAdapter` (currently RAGAS)
- Add test cases in `tests/test_evaluate_experiment.py` with mocked metric callables
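For example, a step configured for one RAGAS metric might be built like this (shown via the Python models for brevity; field names follow the docs, exact signatures live in `schema/models.py`):

```python
from schema.models import Metric, Step

step = Step(
    input="What's the weather in Berlin?",
    reference="It is sunny in Berlin.",  # ground truth, needed by context_recall
    custom_values={"retrieved_contexts": ["Berlin weather report: sunny."]},
    metrics=[Metric(metric_name="context_recall", threshold=0.8)],
)
```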
If changing intermediate file formats or locations:
- Update corresponding script I/O logic
- Update all dependent scripts (downstream phases)
- Update TestWorkflowTemplate volume mount paths if needed
- Update unit test mocks
- Update E2E test file path validations
Issue: setup.py fails to download dataset
- Check: Dataset URL is accessible from the local machine
- Check: File format is CSV, JSON, or Parquet
- Check: Dataset contains the required fields: `user_input`, `retrieved_contexts`, `reference`
Issue: run.py fails to query agent
- Check: Agent URL is correct and the agent is running (verify with `curl`)
- Check: Agent exposes an A2A protocol endpoint
- Check: Network connectivity between testbench and agent
Issue: evaluate.py fails with LLM errors
- Check: `OPENAI_API_BASE` points to the AI Gateway (e.g., `http://localhost:11001`)
- Check: `GOOGLE_API_KEY` environment variable is set
- Check: AI Gateway has access to the specified model (check AI Gateway logs)
Issue: publish.py fails to send metrics
- Check: OTLP endpoint is reachable
- Check: OTLP collector is running and accepting HTTP on port 4318
- Check: Workflow name is valid (no special characters)
Issue: Workflow stuck in "Queued" state
- Check: Testkube controller is running: `kubectl get pods -n testkube`
- Check: Sufficient cluster resources for workflow pods
Issue: Workflow fails at specific step
- Check step logs: `kubectl testkube logs testworkflow ragas-evaluation-workflow -n testkube`
- Check volume mounts: Verify the previous step wrote its output file correctly
- Check parameter values: Ensure URLs and names are correct in the workflow config
Issue: Template not found errors
- Check templates exist: `kubectl get testworkflowtemplates -n testkube`
- Reinstall templates: `kubectl apply -f deploy/base/templates/ -n testkube`
Issue: Tilt fails to start operators
- Check Kubernetes cluster: `kubectl cluster-info`
- Check tilt-extensions version: Must be v0.6.0 or later in the Tiltfile
- Check `.env` file: Must contain `GOOGLE_API_KEY`
Issue: Port forward conflicts
- Check ports available: 11000, 11001, 11010, 11020
- Kill conflicting processes: `lsof -ti:11001 | xargs kill`
Issue: Agent not responding on port 11010
- Check agent status: `kubectl get pods -n sample-agents`
- Check agent logs: `kubectl logs -n sample-agents deployment/weather-agent`
- agent-runtime-operator (v0.16.0): Provides `Agent`, `ToolServer`, `AgenticWorkforce` CRDs
- ai-gateway-litellm-operator (v0.3.2): Provides the `AiGateway` CRD for LLM access during evaluation
- agent-gateway-krakend-operator (v0.4.1): Provides the `AgentGateway` CRD for routing (optional, only if using a gateway)
- tilt-extensions (v0.6.0): Custom Tilt helpers for local operator installation
When operators update CRD schemas:
- Verify YAML manifests in `deploy/local/` are still valid
- Update TestWorkflowTemplate CRDs if volume paths or parameters changed
- Update the Tiltfile with new operator versions
- Test the E2E pipeline with new operator versions
Testbench can evaluate any agent that:
- Exposes an A2A protocol endpoint
- Is deployed via the `Agent` CRD or an accessible HTTP endpoint
- Returns text responses to text prompts

Examples: agent-samples/weather-agent, showcase agents (showcase-cross-selling, showcase-news)
- LLM-based metrics consume tokens and incur costs
- Evaluation speed depends on AI Gateway throughput and model latency
- Some metrics (e.g., `context_recall`) require `reference` ground truth
- RAGAS-specific metrics may require `retrieved_contexts` in `custom_values`
- Agents must implement the A2A JSON-RPC specification
- Only supports text-based question-answering (no multi-modal, no streaming in evaluation)
- Response timeout is configured in the `a2a-sdk` client (default: 30s)
- TestWorkflows create pods that need a persistent volume for shared data
- Each phase runs sequentially (no parallel execution of phases)
- Workflow pods are cleaned up after completion (data persists in the volume temporarily)
- Datasets may contain sensitive information - ensure OTLP endpoints are secured
- Evaluation results include full prompts and responses - consider data retention policies
- AI Gateway logs may contain dataset content - review log retention settings