This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Kubernetes-native agent evaluation system that executes test datasets via A2A protocol, evaluates responses using a generic metrics framework (with RAGAS as the default adapter), and publishes metrics via OTLP. Part of the Agentic Layer platform for automated agent testing and quality assurance.
```bash
# Install dependencies
uv sync

# Run all quality checks (tests, mypy, bandit, ruff)
uv run poe check

# Run unit tests only
uv run poe test

# Run end-to-end tests (requires Tilt environment running)
uv run poe test_e2e

# Code formatting and linting
uv run poe format   # Format with Ruff
uv run poe lint     # Lint and auto-fix with Ruff
uv run poe ruff     # Both format and lint

# Type checking and security
uv run poe mypy     # Static type checking
uv run poe bandit   # Security vulnerability scanning
```
```bash
# Start full Kubernetes environment (operators, agents, observability)
tilt up

# Stop environment
tilt down

# Required environment variables for local testing
export OPENAI_API_BASE="http://localhost:11001"  # AI Gateway endpoint
export GOOGLE_API_KEY="your-api-key"             # Required for Gemini models
```
```bash
# Phase 1: Download and convert dataset to Experiment JSON
uv run python3 testbench/setup.py "http://localhost:11020/dataset.csv"

# Phase 2: Execute queries through agent via A2A protocol
uv run python3 testbench/run.py "http://localhost:11010" "my-workflow"

# Phase 3: Evaluate responses using metrics (with model override)
uv run python3 testbench/evaluate.py --model gemini-2.5-flash-lite

# Phase 4: Publish metrics to OTLP endpoint
OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318" uv run python3 testbench/publish.py "workflow-name" "exec-001" 1

# Optional: Generate HTML visualization report
uv run python3 testbench/visualize.py "weather-assistant-test" "exec-001" 1
```

Generate a comprehensive HTML dashboard from evaluation results for local viewing and sharing.
```bash
# Basic usage (after running evaluate.py)
uv run python3 testbench/visualize.py weather-assistant-test exec-001 1

# Custom input/output paths
uv run python3 testbench/visualize.py weather-assistant-test exec-001 1 \
  --input data/experiments/evaluated_experiment.json \
  --output reports/exec-001.html
```

Required Arguments:
- `workflow_name` - Name of the test workflow (e.g., 'weather-assistant-test')
- `execution_id` - Testkube execution ID for this workflow run
- `execution_number` - Testkube execution number for this workflow run
Optional Arguments:
- `--input` - Path to evaluated experiment JSON (default: `data/experiments/evaluated_experiment.json`)
- `--output` - Path for output HTML file (default: `data/results/evaluation_report.html`)
Features:
- Summary Cards: Total samples, metrics count, scenario overview
- Workflow Metadata: Displays workflow name, execution ID, and execution number
- Overall Scores Chart: Horizontal bar chart showing mean score per metric
- Metric Distributions: Histograms showing score distributions with statistics
- Detailed Results Table: All steps with evaluations, searchable and sortable
- Multi-Turn Support: Chat-bubble visualization for conversational datasets
- Self-Contained HTML: Single file with embedded Chart.js, works offline
- Responsive Design: Works on desktop and tablet, print-friendly
```bash
# Run complete evaluation workflow in Kubernetes
kubectl testkube run testworkflow ragas-evaluation-workflow \
  --config datasetUrl="http://data-server.data-server:8000/dataset.csv" \
  --config agentUrl="http://weather-agent.sample-agents:8000" \
  --config model="gemini-2.5-flash-lite" \
  -n testkube

# Watch workflow execution
kubectl testkube watch testworkflow ragas-evaluation-workflow -n testkube

# Get workflow logs
kubectl testkube logs testworkflow ragas-evaluation-workflow -n testkube
```
```bash
# Build Docker image locally
make build

# Run container locally
make run
```

Core Concept: Sequential pipeline where each phase reads its input from the previous phase's output via the shared `/app/data` volume. Uses typed Pydantic models (`Experiment` → `ExecutedExperiment` → `EvaluatedExperiment`) with JSON serialisation.
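As a minimal sketch of that hand-off (assuming the models are standard Pydantic v2 classes; the real definitions live in `schema/models.py`):

```python
from pathlib import Path

from pydantic import BaseModel


class Experiment(BaseModel):
    """Illustrative stand-in for the real model in schema/models.py."""

    llm_as_a_judge_model: str
    default_threshold: float
    scenarios: list[dict] = []


def read_experiment(path: Path) -> Experiment:
    # Each phase begins by deserialising the previous phase's output file.
    return Experiment.model_validate_json(path.read_text())


def write_experiment(experiment: Experiment, path: Path) -> None:
    # ...and ends by serialising its own typed result for the next phase.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(experiment.model_dump_json(indent=2))
```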
Phase 1: Setup (`testbench/setup.py`)
- Input: Dataset URL (CSV, JSON, or Parquet)
- Output: `data/datasets/experiment.json` (`Experiment` model)
- Purpose: Downloads the external dataset and maps rows to `Step` objects with `input`, `reference`, and `custom_values` fields, as sketched below
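A hedged sketch of that mapping, assuming pandas is used for loading (the real logic, including quoted-array parsing for CSV, lives in `testbench/setup.py`):

```python
import pandas as pd
from pydantic import BaseModel


class Step(BaseModel):
    """Illustrative stand-in; the real Step lives in schema/models.py."""

    input: str
    reference: str | None = None
    custom_values: dict = {}


def rows_to_steps(df: pd.DataFrame) -> list[Step]:
    steps = []
    for _, row in df.iterrows():
        steps.append(
            Step(
                input=row["user_input"],
                reference=row.get("reference"),
                # Remaining columns (e.g. retrieved_contexts) travel along
                # for metrics that need extra inputs.
                custom_values={
                    col: row[col]
                    for col in df.columns
                    if col not in ("user_input", "reference")
                },
            )
        )
    return steps


steps = rows_to_steps(pd.read_csv("dataset.csv"))
```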
Phase 2: Run (`testbench/run.py`)
- Input: `data/datasets/experiment.json` + Agent URL + workflow name
- Output: `data/experiments/executed_experiment.json` (`ExecutedExperiment` model)
- Purpose: Sends queries to the agent via A2A protocol using `A2AStepClient`, records agent responses as `Turn` objects
- Pattern: Uses `ExperimentRuntime` with hooks (`before_scenario`, `on_step`, `after_scenario`)
- Tracing: Creates OpenTelemetry spans per scenario with trace IDs, as sketched below
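The per-scenario tracing uses the standard OpenTelemetry API; a minimal sketch (span naming is illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer("testbench.run")


def run_scenario(scenario_name: str) -> str:
    # One span per scenario; the hex trace ID is kept on the
    # ExecutedScenario so results can be correlated in Tempo/Grafana.
    with tracer.start_as_current_span(scenario_name) as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        # ... send each step to the agent inside this span ...
        return trace_id
```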
Phase 3: Evaluate (`testbench/evaluate.py`)
- Input: `data/experiments/executed_experiment.json` + optional `--model` override
- Output: `data/experiments/evaluated_experiment.json` (`EvaluatedExperiment` model)
- Purpose: Calculates metrics using LLM-as-a-judge via the generic metrics framework
- Pattern: Uses `ExperimentRuntime` with `MetricEvaluator` hooks
- Metrics: Configured via `Metric` objects on each step; uses `GenericMetricsRegistry` with the RAGAS adapter
Phase 4: Publish (`testbench/publish.py`)
- Input: `data/experiments/evaluated_experiment.json` + workflow name + execution ID + execution number
- Output: Metrics published to the OTLP endpoint (configured via the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable)
- Purpose: Sends per-step evaluation scores to the observability backend (LGTM/Grafana) via OpenTelemetry
Optional: Visualize (`testbench/visualize.py`)
- Input: `data/experiments/evaluated_experiment.json`
- Output: `data/results/evaluation_report.html` (self-contained HTML dashboard)
- Purpose: Generates a comprehensive HTML report with charts, tables, and statistics
- Note: Runs independently of Phase 4 and can be used for local development without an OTLP backend
```
External Dataset (CSV/JSON/Parquet)
    ↓ [setup.py]
data/datasets/experiment.json (Experiment)
    ↓ [run.py + A2AStepClient]
data/experiments/executed_experiment.json (ExecutedExperiment)
    ↓ [evaluate.py + GenericMetricsRegistry]
data/experiments/evaluated_experiment.json (EvaluatedExperiment)
    ├─→ [publish.py + OTLP]
    │     Observability Backend (Grafana)
    └─→ [visualize.py]
          data/results/evaluation_report.html (Local Visualization)
```
```
Step(input, reference, custom_values, metrics)
 └→ ExecutedStep(+id, turns)
     └→ EvaluatedStep(+evaluations: list[Evaluation])

Scenario(name, steps)
 └→ ExecutedScenario(+id, trace_id)
     └→ EvaluatedScenario(+evaluations)

Experiment(llm_as_a_judge_model, default_threshold, scenarios)
 └→ ExecutedExperiment(+id)
     └→ EvaluatedExperiment
```
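The chain maps naturally onto Pydantic inheritance; a hedged sketch with field names taken from the diagram (actual definitions live in `schema/models.py`):

```python
from pydantic import BaseModel


class Metric(BaseModel):
    metric_name: str
    threshold: float | None = None


class Step(BaseModel):
    input: str
    reference: str | None = None
    custom_values: dict = {}
    metrics: list[Metric] = []


class ExecutedStep(Step):
    # Execution adds an ID plus the recorded agent turns.
    id: str
    turns: list[dict] = []


class EvaluatedStep(ExecutedStep):
    # Evaluation adds one entry per configured metric.
    evaluations: list[dict] = []
```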
Orchestration Pattern: Each phase is a reusable `TestWorkflowTemplate` CRD that executes the same Docker image with different script arguments.
Shared State: All phases mount the same `emptyDir` volume at `/app/data`, enabling stateless containers with persistent data flow between steps.
Template Files:
- `deploy/base/templates/setup-template.yaml` - Phase 1
- `deploy/base/templates/run-template.yaml` - Phase 2
- `deploy/base/templates/evaluate-template.yaml` - Phase 3
- `deploy/base/templates/publish-template.yaml` - Phase 4
- `deploy/local/ragas-evaluation-workflow.yaml` - Combines all templates into a complete workflow
Key Workflow Parameters:
- `datasetUrl` - HTTP URL to the test dataset
- `agentUrl` - A2A endpoint of the agent to evaluate
- `model` - LLM model for evaluation (e.g., `gemini-2.5-flash-lite`)
- `otlpEndpoint` - OpenTelemetry collector URL (default: `http://lgtm.monitoring:4318`)
- `image` - Docker image to use (default: `ghcr.io/agentic-layer/testbench/testworkflows:latest`)
- Architecture: `FrameworkAdapter` base class with pluggable implementations, as sketched below
- Registry: `GenericMetricsRegistry` lazily loads framework adapters
- Default Adapter: RAGAS (`metrics/ragas/adapter.py`) — translates `ExecutedStep` into RAGAS `EvaluationDataset` samples
- Protocol: `MetricCallable` protocol and `MetricResult` dataclass in `metrics/protocol.py`
- LLM Access: Routes through the AI Gateway (LiteLLM) configured via the `OPENAI_API_BASE` environment variable
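A condensed sketch of that plug-in shape (names follow the docs; the field names on `MetricResult` and the loader helper are illustrative stand-ins):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class MetricResult:
    # Illustrative fields: a score plus a threshold verdict.
    score: float
    passed: bool


class MetricCallable(Protocol):
    def __call__(self, step: object) -> MetricResult: ...


class FrameworkAdapter:
    """Base class; concrete adapters (e.g. RAGAS) resolve metric names."""

    def get_metric(self, metric_name: str) -> MetricCallable:
        raise NotImplementedError


def _load_ragas_adapter() -> FrameworkAdapter:
    # Hypothetical stand-in for the lazy import of metrics/ragas/adapter.py;
    # the real registry would return a RagasFrameworkAdapter here.
    return FrameworkAdapter()


class GenericMetricsRegistry:
    def __init__(self) -> None:
        self._adapters: dict[str, FrameworkAdapter] = {}

    def resolve(self, metric_name: str) -> MetricCallable:
        if "ragas" not in self._adapters:
            # Lazy: the framework is imported only on first use.
            self._adapters["ragas"] = _load_ragas_adapter()
        return self._adapters["ragas"].get_metric(metric_name)
```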
- Purpose: Platform-agnostic JSON-RPC protocol for agent communication
- Client Library: `a2a-sdk` Python package, wrapped by `schema/a2a_client.py` (`A2AStepClient`)
- Usage in Testbench: `run.py` uses `A2AStepClient.send_step()` to query agents, as sketched below
- Response Handling: Agent responses are recorded as `Turn` objects with `content` and `type`
- Context Management: The A2A `context_id` field maintains conversation state across multiple turns
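A hedged usage sketch; `send_step()` is documented above, but the constructor arguments and return shape here are assumptions, not the wrapper's actual signature:

```python
import asyncio

from schema.a2a_client import A2AStepClient  # repo-local wrapper
from schema.models import Step


async def query_agent(agent_url: str, step: Step) -> None:
    # The A2A context_id keeps multi-turn conversations on one thread;
    # assumed here to be managed inside the wrapper.
    client = A2AStepClient(agent_url)  # assumed constructor
    turns = await client.send_step(step)  # assumed to return recorded turns
    for turn in turns:
        print(turn)


asyncio.run(query_agent("http://localhost:11010", Step(input="What's the weather?")))
```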
- Purpose: Generic runtime that iterates `Experiment.scenarios[].steps[]` and calls user-provided hooks
- Hooks: `before_run`, `before_scenario`, `on_step`, `after_scenario` (see the sketch after this list)
- Location: `schema/runtime.py`
- Used by: Both `run.py` (`A2AExecutor`) and `evaluate.py` (`MetricEvaluator`)
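The hook contract reduces to a loop like this (a sketch; the real runtime in `schema/runtime.py` also threads IDs and results through):

```python
class ExperimentRuntime:
    """Illustrative skeleton of the hook-based iterator."""

    def __init__(self, experiment, hooks) -> None:
        self.experiment = experiment
        self.hooks = hooks  # object exposing the four hook methods

    def run(self) -> None:
        self.hooks.before_run(self.experiment)
        for scenario in self.experiment.scenarios:
            self.hooks.before_scenario(scenario)
            for step in scenario.steps:
                self.hooks.on_step(step)
            self.hooks.after_scenario(scenario)
```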
- Purpose: Standard protocol for publishing observability data
- Transport: HTTP/protobuf to the OTLP collector endpoint (port 4318)
- Metrics Published: Per-step evaluation gauge with metric name, score, and step/scenario attributes, as sketched below
- Labeling: Each metric is labeled with `workflow_name` for filtering in Grafana
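A minimal publishing sketch with the OpenTelemetry Python SDK over HTTP/protobuf (metric and attribute names are illustrative; `create_gauge` needs a recent SDK release):

```python
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# The endpoint would normally come from OTEL_EXPORTER_OTLP_ENDPOINT.
exporter = OTLPMetricExporter(endpoint="http://localhost:4318/v1/metrics")
provider = MeterProvider(metric_readers=[PeriodicExportingMetricReader(exporter)])
meter = provider.get_meter("testbench.publish")

# One gauge, labelled per step/scenario so Grafana can filter and group.
gauge = meter.create_gauge("evaluation.score")
gauge.set(
    0.87,
    attributes={
        "workflow_name": "weather-assistant-test",
        "metric_name": "context_recall",
        "scenario": "scenario-1",
        "step": "step-1",
    },
)
provider.force_flush()
```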
- Purpose: Local Kubernetes development environment
- What Gets Deployed:
  - Core operators: `agent-runtime` (v0.16.0), `ai-gateway-litellm` (v0.3.2), `agent-gateway-krakend` (v0.4.1)
  - Test infrastructure: `testkube` (v2.4.2), sample `weather-agent`, `data-server`
  - Observability: LGTM stack (Grafana, Loki, Tempo, Mimir)
  - TestWorkflow templates and the evaluation workflow
- Port Forwards: `11001` (AI Gateway), `11010` (Weather Agent), `11000` (Grafana), `11020` (Data Server)
All scripts follow the same pattern: parse arguments → read input file(s) → process → write output file, as sketched below.
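As a sketch, the shared skeleton looks like this (argument names and defaults are illustrative):

```python
import argparse
import json
from pathlib import Path


def main() -> None:
    # 1. Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="data/experiments/executed_experiment.json")
    parser.add_argument("--output", default="data/experiments/evaluated_experiment.json")
    args = parser.parse_args()

    # 2. Read input file(s) from the shared data volume
    data = json.loads(Path(args.input).read_text())

    # 3. Process (phase-specific logic goes here)
    result = data

    # 4. Write the output file for the next phase
    out = Path(args.output)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
```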
- `setup.py`: Dataset download and conversion logic
  - Supports CSV (with quoted array parsing), JSON, and Parquet formats
  - Maps DataFrame rows to `Step` objects with `Reference` and `custom_values`
  - Output: `data/datasets/experiment.json`
- `run.py`: Agent query execution via `A2AExecutor`
  - Uses `A2AStepClient` for async A2A protocol requests
  - Creates OpenTelemetry spans per scenario
  - Records agent responses as `Turn` objects in `ExecutedStep`
- `evaluate.py`: Metric evaluation via `MetricEvaluator`
  - Uses `GenericMetricsRegistry` to resolve metric callables
  - Supports the `--model` CLI arg to override `experiment.llm_as_a_judge_model`
  - Produces `Evaluation` objects (metric + result with score and pass/fail)
- `publish.py`: OTLP metric publishing
  - Reads `EvaluatedExperiment`, iterates scenarios → steps → evaluations
  - Creates gauge metrics per evaluation with step/scenario attributes
  - Uses the workflow name as a metric label
- `visualize.py`: HTML visualization generation
  - Reads `EvaluatedExperiment` and generates a self-contained HTML dashboard
  - Creates summary cards, bar charts, metric distributions, and a results table
  - Uses Chart.js via CDN for interactive visualizations
- `models.py`: Full Pydantic model hierarchy (`Step` → `ExecutedStep` → `EvaluatedStep`, etc.)
- `runtime.py`: `ExperimentRuntime` — generic hook-based experiment iterator
- `a2a_client.py`: `A2AStepClient` — thin wrapper around the A2A SDK
- `protocol.py`: `MetricCallable` protocol, `MetricResult` dataclass
- `adapter.py`: Abstract `FrameworkAdapter` base class
- `registry.py`: `GenericMetricsRegistry` (lazy-loads adapters)
- `ragas/adapter.py`: RAGAS-specific `RagasFrameworkAdapter`
Unit Tests (`tests/`):
- One test file per script: `test_setup.py`, `test_run.py`, `test_evaluate.py`, `test_evaluate_experiment.py`, `test_publish.py`, `test_visualize.py`
- Additional: `test_a2a_client.py`, `test_metrics.py`
- Uses pytest with async support (`pytest-asyncio`)
- Mocks external dependencies: A2A client, metrics registry, OTLP
- Uses the `tmp_path` fixture for file I/O testing (see the sketch after this list)
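A representative test shape under those conventions (the function under test here is a stand-in, not actual repo code):

```python
import json
from pathlib import Path
from unittest.mock import AsyncMock

import pytest


async def execute(client, output: Path) -> None:
    """Stand-in for the unit under test (the real logic lives in testbench/run.py)."""
    turns = await client.send_step({"input": "What's the weather?"})
    output.write_text(json.dumps({"turns": turns}))


@pytest.mark.asyncio
async def test_execute_writes_output(tmp_path: Path) -> None:
    # External dependencies are mocked: no real A2A calls or LLM requests.
    client = AsyncMock()
    client.send_step.return_value = [{"content": "Sunny.", "type": "text"}]

    # File I/O runs against the pytest tmp_path fixture.
    output = tmp_path / "executed_experiment.json"
    await execute(client, output)

    assert json.loads(output.read_text())["turns"][0]["content"] == "Sunny."
```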
E2E Test (`tests_e2e/test_e2e.py`):
- Runs the complete 4-phase pipeline in sequence
- Configurable via environment variables: `E2E_DATASET_URL`, `E2E_AGENT_URL`, `E2E_MODEL`, etc.
- Validates that output files exist after each phase
- Requires a running Tilt environment for dependencies
Testkube Templates (`deploy/base/templates/`):
- Each template is a `TestWorkflowTemplate` CRD
- Defines container spec, volume mounts, and command arguments
- Parameterized with `config.*` variables (e.g., `{{ config.datasetUrl }}`)
Local Development (`deploy/local/`):
- `ragas-evaluation-workflow.yaml` - Complete workflow definition
- `weather-agent.yaml` - Sample Agent CRD for testing
- `lgtm.yaml` - Grafana LGTM observability stack
- `data-server/` - ConfigMap with test datasets + Service for HTTP access
- Never delete failing tests - Either update tests to match correct implementation or fix code to pass tests
- Unit tests must mock external dependencies - No real HTTP calls, A2A clients, or LLM requests
- E2E test validates file existence - Doesn't validate content correctness (use unit tests for that)
- Line Length: 120 characters max (Ruff)
- Type Hints: Required for all function signatures (mypy enforced)
- Import Sorting: Enabled via Ruff (I001 rule)
- Security Scanning: Bandit checks for vulnerabilities
- Naming Conventions: PEP 8 compliant (Ruff N rule)
- Run automatically before commits via `.pre-commit-config.yaml`
- Enforces: Ruff formatting/linting, mypy, bandit
- Manual run: `pre-commit run --all-files`
Metrics are resolved through the `GenericMetricsRegistry`:
- Define `Metric(metric_name="...", threshold=0.8)` on each `Step` in the experiment JSON, as shown below
- The registry delegates to the appropriate `FrameworkAdapter` (currently RAGAS)
- Add test cases in `tests/test_evaluate_experiment.py` with mocked metric callables
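For example, a step configured for one RAGAS metric might be built like this (shown via the Python models for brevity; field names follow the docs, exact signatures live in `schema/models.py`):

```python
from schema.models import Metric, Step

step = Step(
    input="What's the weather in Berlin?",
    reference="It is sunny in Berlin.",  # ground truth, needed by context_recall
    custom_values={"retrieved_contexts": ["Berlin weather report: sunny."]},
    metrics=[Metric(metric_name="context_recall", threshold=0.8)],
)
```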
If changing intermediate file formats or locations:
- Update corresponding script I/O logic
- Update all dependent scripts (downstream phases)
- Update TestWorkflowTemplate volume mount paths if needed
- Update unit test mocks
- Update E2E test file path validations
Issue: setup.py fails to download dataset
- Check: Dataset URL is accessible from the local machine
- Check: File format is CSV, JSON, or Parquet
- Check: Dataset contains the required fields: `user_input`, `retrieved_contexts`, `reference`
Issue: run.py fails to query agent
- Check: Agent URL is correct and the agent is running (verify with `curl`)
- Check: Agent exposes an A2A protocol endpoint
- Check: Network connectivity between testbench and agent
Issue: evaluate.py fails with LLM errors
- Check: `OPENAI_API_BASE` points to the AI Gateway (e.g., `http://localhost:11001`)
- Check: `GOOGLE_API_KEY` environment variable is set
- Check: AI Gateway has access to the specified model (check AI Gateway logs)
Issue: publish.py fails to send metrics
- Check: OTLP endpoint is reachable
- Check: OTLP collector is running and accepting HTTP on port 4318
- Check: Workflow name is valid (no special characters)
Issue: Workflow stuck in "Queued" state
- Check: Testkube controller is running: `kubectl get pods -n testkube`
- Check: Sufficient cluster resources for workflow pods
Issue: Workflow fails at specific step
- Check step logs: `kubectl testkube logs testworkflow ragas-evaluation-workflow -n testkube`
- Check volume mounts: Verify the previous step wrote its output file correctly
- Check parameter values: Ensure URLs and names are correct in the workflow config
Issue: Template not found errors
- Check templates exist: `kubectl get testworkflowtemplates -n testkube`
- Reinstall templates: `kubectl apply -f deploy/base/templates/ -n testkube`
Issue: Tilt fails to start operators
- Check Kubernetes cluster: `kubectl cluster-info`
- Check tilt-extensions version: Must be v0.6.0 or later in the Tiltfile
- Check `.env` file: Must contain `GOOGLE_API_KEY`
Issue: Port forward conflicts
- Check ports available: 11000, 11001, 11010, 11020
- Kill conflicting processes: `lsof -ti:11001 | xargs kill`
Issue: Agent not responding on port 11010
- Check agent status: `kubectl get pods -n sample-agents`
- Check agent logs: `kubectl logs -n sample-agents deployment/weather-agent`
- agent-runtime-operator (v0.16.0): Provides `Agent`, `ToolServer`, `AgenticWorkforce` CRDs
- ai-gateway-litellm-operator (v0.3.2): Provides the `AiGateway` CRD for LLM access during evaluation
- agent-gateway-krakend-operator (v0.4.1): Provides the `AgentGateway` CRD for routing (optional, only if using a gateway)
- tilt-extensions (v0.6.0): Custom Tilt helpers for local operator installation
When operators update CRD schemas:
- Verify YAML manifests in `deploy/local/` are still valid
- Update TestWorkflowTemplate CRDs if volume paths or parameters changed
- Update the Tiltfile with new operator versions
- Test the E2E pipeline with new operator versions
Testbench can evaluate any agent that:
- Exposes an A2A protocol endpoint
- Is deployed via the `Agent` CRD or an accessible HTTP endpoint
- Returns text responses to text prompts

Examples: agent-samples/weather-agent, showcase agents (showcase-cross-selling, showcase-news)
- LLM-based metrics consume tokens and incur costs
- Evaluation speed depends on AI Gateway throughput and model latency
- Some metrics (e.g., `context_recall`) require `reference` ground truth
- RAGAS-specific metrics may require `retrieved_contexts` in `custom_values`
- Agents must implement the A2A JSON-RPC specification
- Only supports text-based question-answering (no multi-modal, no streaming in evaluation)
- Response timeout is configured in the `a2a-sdk` client (default: 30s)
- TestWorkflows create pods that need a persistent volume for shared data
- Each phase runs sequentially (no parallel execution of phases)
- Workflow pods are cleaned up after completion (data persists in the volume temporarily)
- Datasets may contain sensitive information - ensure OTLP endpoints are secured
- Evaluation results include full prompts and responses - consider data retention policies
- AI Gateway logs may contain dataset content - review log retention settings