Commit 737b23f

amitpaz1 and claude committed
feat: implement 30 auto-claude improvement ideas
Security Hardening:
- SSRF prevention in webhook URL validation (sec-001)
- Redact sensitive headers in webhook error messages (sec-002)
- Redis broker TLS/auth warnings for insecure connections (sec-003)
- Block dangerous module imports in dynamic agent loading (sec-004)
- Add pip-audit dependency scanning to CI pipeline (sec-005)

Performance:
- Batch INSERT eval_results with executemany() (perf-002)
- Replace Redis KEYS scan with SCAN in coordinator (perf-003)
- Pipeline Redis enqueue + TTL operations (perf-004)
- Cache grader instances by (name, config) within runs (perf-005)

UI/UX:
- Add Examples sections to CLI command help text (uiux-001)
- Show cumulative pass/fail counts in progress output (uiux-002)
- Standardize table output with fixed-width columns (uiux-003)
- Collapsible `<details>` blocks in GitHub PR comments (uiux-004)
- NO_COLOR env var support per no-color.org (uiux-005)

Code Quality:
- Split 935-line cli.py into 12 command submodules (cq-001)
- Centralize error handling with _fail() helper (cq-002)
- Split agentlens importer into client/mapper/repository (cq-003)
- Add stricter ruff rulesets and mypy config (cq-004)
- Add .pre-commit-config.yaml with ruff hooks (cq-005)

Code Improvements:
- Add --exclude-tag filtering for run command (ci-001)
- Persist runtime config metadata in EvalRun.config (ci-002)
- Add --retries/--retry-backoff-ms for transient failures (ci-003)
- DB-level LIMIT/OFFSET pagination in store queries (ci-004)
- Add Notifier protocol with webhook/github adapters (ci-005)

Documentation:
- Add distributed execution section to README (doc-001)
- Add adapters section with framework examples (doc-002)
- Expand AgentLens importer docs with modes/options (doc-003)
- Create docs/troubleshooting.md guide (doc-004)
- Update README badges and test count references (doc-005)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e165870 · commit 737b23f

52 files changed: 2,291 additions & 1,375 deletions
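Several of the performance items in the commit message follow standard patterns worth a quick illustration. For perf-002, batching the eval-result inserts with `executemany()` looks roughly like this — the `eval_results` columns here are invented for the sketch, not the project's actual schema:

```python
import sqlite3

# Illustrative schema; the real eval_results table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eval_results (run_id TEXT, case_name TEXT, passed INTEGER)"
)

results = [
    ("run1", "test-1", 1),
    ("run1", "test-2", 0),
    ("run1", "test-3", 1),
]

# One executemany() call instead of a Python-level loop of execute():
# the statement is prepared once and each row is bound to it.
conn.executemany(
    "INSERT INTO eval_results (run_id, case_name, passed) VALUES (?, ?, ?)",
    results,
)
conn.commit()

passed = conn.execute(
    "SELECT COUNT(*) FROM eval_results WHERE passed = 1"
).fetchone()[0]
```

The win is mostly per-statement overhead: for a run with hundreds of cases, one round trip through the driver replaces hundreds.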


.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -18,4 +18,6 @@ jobs:
       with:
         python-version: ${{ matrix.python-version }}
     - run: pip install -e ".[dev]"
+    - name: Security audit
+      run: pip install pip-audit && pip-audit --strict --desc
     - run: pytest
```

.pre-commit-config.yaml

Lines changed: 13 additions & 0 deletions (new file)

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.8.0
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
```

README.md

Lines changed: 107 additions & 4 deletions

````diff
@@ -1,7 +1,7 @@
 # AgentEval 🧪
 
 [![PyPI](https://img.shields.io/pypi/v/agentevalkit)](https://pypi.org/project/agentevalkit/)
-[![Tests](https://img.shields.io/badge/tests-127%20passing-brightgreen)]()
+[![Tests](https://img.shields.io/badge/tests-passing-brightgreen)]()
 [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)]()
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 
@@ -134,13 +134,25 @@ Run the same suite multiple times and compare groups: `agenteval compare RUN_A1,
 
 ### 🔗 AgentLens Integration
 
-Import real agent sessions from [AgentLens](https://github.com/amitpaz/agentlens) as test suites:
+Import real agent sessions from [AgentLens](https://github.com/agentkitai/agentlens) as test suites:
 
 ```bash
+# From AgentLens SQLite database
 agenteval import --from agentlens --db sessions.db --output suite.yaml --grader contains
-# Imported 42 cases → suite.yaml
+
+# From AgentLens server API
+agenteval import-agentlens --url http://localhost:3000 --output suite.yaml --grader contains
+
+# With filtering and interactive review
+agenteval import --from agentlens --db sessions.db --output suite.yaml --filter-tag production --auto-assertions --interactive
 ```
 
+**Import modes:**
+- **SQLite mode** (`import --from agentlens --db path`) — reads directly from an AgentLens database file
+- **Server mode** (`import-agentlens --url URL`) — fetches sessions via the AgentLens HTTP API
+
+Sessions are converted to eval cases with input/output mapping and optional tool-call assertions. Use `--auto-assertions` to automatically generate expected fields from session data, and `--interactive` to review each case before saving.
+
 Turn production traffic into regression tests — no manual test writing needed.
 
 ### 💰 Cost & Latency Tracking
````
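The session-to-case mapping described in the AgentLens import docs above can be sketched roughly as follows. This is an illustration, not the project's actual importer: the session field names (`id`, `input`, `final_output`, `tool_calls`) and the `session_to_case` helper are assumptions about what such a mapper might look like.

```python
# Hypothetical sketch of mapping an AgentLens session to an eval case.
# Field names are illustrative, not AgentLens's actual schema.
def session_to_case(session: dict, auto_assertions: bool = False) -> dict:
    case = {
        "name": session.get("id", "imported-case"),
        "input": session["input"],
        "grader": "contains",
    }
    if auto_assertions:
        # Derive an expected substring from the recorded output, and
        # assert that the same tools get called when the case is replayed.
        expected = {"output_contains": [session["final_output"][:40]]}
        if session.get("tool_calls"):
            expected["tools_called"] = [t["name"] for t in session["tool_calls"]]
        case["expected"] = expected
    return case

session = {
    "id": "sess-42",
    "input": "What is the refund policy?",
    "final_output": "Refunds are accepted within 30 days.",
    "tool_calls": [{"name": "search_kb"}],
}
case = session_to_case(session, auto_assertions=True)
```

The `--interactive` flag would then present each generated `case` for review before it is written to the suite file.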
````diff
@@ -311,11 +323,102 @@ grader_config:
 
 ---
 
+## Adapters
+
+Adapters let you test agents built with popular frameworks without writing a custom callable.
+
+```bash
+pip install agentevalkit[langchain]   # LangChain
+pip install agentevalkit[crewai]      # CrewAI
+pip install agentevalkit[autogen]     # AutoGen
+```
+
+| Adapter | Framework Method | Install Extra |
+|---------|-----------------|---------------|
+| `langchain` | `agent.invoke(input)` | `[langchain]` |
+| `crewai` | `crew.kickoff(inputs={"input": ...})` | `[crewai]` |
+| `autogen` | `agent.run(input)` or `agent.initiate_chat(message=...)` | `[autogen]` |
+
+Usage with YAML suite defaults:
+
+```yaml
+# suite.yaml
+name: my-tests
+agent: my_module:my_chain
+defaults:
+  adapter: langchain
+```
+
+Or via CLI:
+
+```bash
+agenteval run --suite suite.yaml --adapter langchain
+```
+
+Each adapter extracts output, tool calls, and token usage from the framework's response format into a standard `AgentResult`.
+
+---
+
+## Distributed Execution
+
+Scale eval suites across multiple workers using Redis as a broker.
+
+### Setup
+
+```bash
+pip install agentevalkit[distributed]
+```
+
+### Start Workers
+
+```bash
+# Terminal 1: Start a worker
+agenteval worker --broker redis://localhost:6379 --agent my_module:my_agent
+
+# Terminal 2: Start another worker
+agenteval worker --broker redis://localhost:6379 --agent my_module:my_agent
+```
+
+### Run with Workers
+
+```bash
+agenteval run --suite suite.yaml --workers redis://localhost:6379 --worker-timeout 60
+```
+
+### How It Works
+
+1. The coordinator pushes eval cases to a Redis queue
+2. Workers pop cases, execute the agent, and push results back
+3. The coordinator collects results and builds the final `EvalRun`
+4. If no workers are detected, execution falls back to local mode automatically
+
+### Configuration
+
+- `--workers URL` — Redis broker URL (supports `redis://` and `rediss://` for TLS)
+- `--worker-timeout N` — Seconds to wait for worker results (default: 30)
+- Workers register heartbeats and are automatically detected by the coordinator
+
+> **Security:** Use `rediss://` URLs with authentication for production deployments. See [docs/troubleshooting.md](docs/troubleshooting.md) for Redis security guidance.
+
+---
+
+## Troubleshooting
+
+See [docs/troubleshooting.md](docs/troubleshooting.md) for solutions to common issues including:
+
+- Agent callable import errors (`module:function` format)
+- Missing dependency extras (`[distributed]`, `[langchain]`, etc.)
+- OpenAI API key setup for `llm-judge` grader
+- Compare command syntax
+- Redis connection issues for distributed execution
+
+---
+
 ## Contributing
 
 Contributions welcome! This project uses:
 
-- **pytest** for testing (127 tests passing)
+- **pytest** for testing
 - **ruff** for linting
 - **src layout** (`src/agenteval/`)
````
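The adapter pattern described in the README changes above — call the framework's entry point, then normalize the response — can be sketched like this. The `StubChain` class stands in for a real LangChain runnable, and the result-dict fields (`output`, `tool_calls`, `tokens`) are assumptions about the `AgentResult` shape, not the library's actual API.

```python
# Illustrative stand-in for a LangChain runnable; a real chain's
# response shape is richer than this.
class StubChain:
    def invoke(self, input):
        return {"output": f"echo: {input}", "usage": {"total_tokens": 7}}

def langchain_adapter(agent, input: str) -> dict:
    """Normalize a framework response into a flat result dict (sketch)."""
    response = agent.invoke(input)
    return {
        "output": response.get("output", ""),
        "tool_calls": response.get("intermediate_steps", []),
        "tokens": response.get("usage", {}).get("total_tokens"),
    }

result = langchain_adapter(StubChain(), "hi")
```

Because graders only see the normalized dict, the same suite can be pointed at a LangChain, CrewAI, or AutoGen agent by swapping the adapter.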

docs/troubleshooting.md

Lines changed: 230 additions & 0 deletions (new file)

# Troubleshooting

Common issues and solutions for AgentEval.

---

## Agent Import Errors

### `ValueError: agent_ref must use 'module:attr' format`

The `--agent` flag expects `module:function` format:

```bash
# Wrong
agenteval run --suite suite.yaml --agent my_agent

# Correct
agenteval run --suite suite.yaml --agent my_module:run_agent
```

### `ModuleNotFoundError: No module named 'my_module'`

Ensure the module is importable from your current directory:

```bash
# Your agent file must be in the current directory or on PYTHONPATH
ls my_module.py  # Should exist

# Or install your package
pip install -e .
```

### `AttributeError: module 'my_module' has no attribute 'run_agent'`

Check that the function name after `:` matches an exported function in the module.

---
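The `module:attr` resolution behind the errors above amounts to a small `importlib` lookup. This is a sketch of the general technique, not AgentEval's actual loader; the error messages mirror the ones documented above.

```python
import importlib

def load_agent(agent_ref: str):
    """Resolve a 'module:attr' reference to a callable (illustrative)."""
    if ":" not in agent_ref:
        raise ValueError("agent_ref must use 'module:attr' format")
    module_name, attr = agent_ref.split(":", 1)
    # Raises ModuleNotFoundError if the module isn't importable,
    # and AttributeError if the attribute doesn't exist on it.
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# Stdlib example: resolves to math.sqrt
fn = load_agent("math:sqrt")
```

Running from the suite's directory (or `pip install -e .`) is what makes `importlib.import_module` able to find your module in the first place.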
## Missing Dependencies

### `ImportError: Redis is required for distributed execution`

Install the distributed extra:

```bash
pip install agentevalkit[distributed]
```

### `ImportError: scipy is required for statistical comparison`

Install the stats extra for Welch's t-test:

```bash
pip install agentevalkit[stats]
# or: pip install scipy
```

AgentEval falls back to a pure-Python implementation if scipy is unavailable.

### `ImportError` for adapter frameworks

Install the appropriate extra:

```bash
pip install agentevalkit[langchain]  # LangChain adapter
pip install agentevalkit[crewai]     # CrewAI adapter
pip install agentevalkit[autogen]    # AutoGen adapter
```

---

## LLM Judge Grader

### `Error: OPENAI_API_KEY not set`

The `llm-judge` grader requires an OpenAI API key (or compatible API):

```bash
export OPENAI_API_KEY=sk-...
```

You can also configure a custom API base in the grader config:

```yaml
grader: llm-judge
grader_config:
  model: gpt-4o-mini
  api_base: https://your-api.com/v1
```

---

## Compare Command

### `Error: Could not parse compare arguments`

The compare command accepts two formats:

```bash
# Two single runs
agenteval compare RUN_ID_A RUN_ID_B

# Two groups (comma-separated, with 'vs')
agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2
```

Run IDs are the short hex IDs shown by `agenteval list`.

### `Error: Run not found`

Check available runs with:

```bash
agenteval list --limit 20
```

---

## YAML Suite Errors

### `Error: Suite file not found`

Ensure the path is correct:

```bash
agenteval run --suite ./suites/my_suite.yaml
```

### `Error: Invalid suite format`

Check your YAML syntax. Common issues:

- Missing `name` field
- Missing `cases` list
- Incorrect indentation
- Using tabs instead of spaces

Minimal valid suite:

```yaml
name: my-tests
agent: my_module:my_fn
cases:
  - name: test-1
    input: "Hello"
    expected:
      output_contains: ["hello"]
    grader: contains
```

---

## Database Issues

### `sqlite3.OperationalError: unable to open database file`

Check that the directory exists and is writable:

```bash
# Default location
ls -la agenteval.db

# Custom location
agenteval run --suite suite.yaml --db /path/to/results.db
```

### Corrupted database

Delete and re-run evaluations:

```bash
rm agenteval.db
agenteval run --suite suite.yaml
```

---
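The suite-format checks listed above can be expressed as a few structural assertions on the parsed YAML. This `validate_suite` helper is illustrative only — the real loader's rules and messages may differ — and it takes an already-parsed dict so the sketch stays self-contained.

```python
# Hypothetical sketch of the "Invalid suite format" checks described above,
# operating on a suite that has already been parsed from YAML into a dict.
def validate_suite(suite: dict) -> list:
    errors = []
    if not suite.get("name"):
        errors.append("Missing 'name' field")
    cases = suite.get("cases")
    if not isinstance(cases, list) or not cases:
        errors.append("Missing 'cases' list")
    else:
        for i, case in enumerate(cases):
            if "input" not in case:
                errors.append(f"cases[{i}]: missing 'input'")
    return errors

suite = {
    "name": "my-tests",
    "agent": "my_module:my_fn",
    "cases": [{"name": "test-1", "input": "Hello"}],
}
errors = validate_suite(suite)
```

An empty error list means the suite has at least the minimal shape shown in the example above; tabs-vs-spaces problems would surface earlier, as YAML parse errors.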
## Redis / Distributed Execution

### Workers not detected

Ensure workers are running and connected to the same Redis instance:

```bash
# Check Redis connectivity
redis-cli -u redis://localhost:6379 ping
# Should return: PONG

# Start a worker
agenteval worker --broker redis://localhost:6379 --agent my_module:my_fn
```

### Redis authentication errors

Use an authenticated URL:

```bash
agenteval run --suite suite.yaml --workers redis://:password@host:6379
```

### Security best practices

For production, use TLS-encrypted connections:

```bash
# Use rediss:// scheme for TLS
agenteval worker --broker rediss://:password@host:6380

# With custom CA certificate
export REDIS_CA_CERT=/path/to/ca.pem
```

---

## CI Integration

### Exit codes

- `0` — All cases passed
- `1` — One or more cases failed (or regressions detected with `--fail-on-regression`)

### GitHub PR comments not posting

Check your token permissions:

```bash
export GITHUB_TOKEN=ghp_...  # Needs 'pull_requests: write' permission
agenteval github-comment --run-id RUN_ID --repo owner/repo --pr 123
```

See [docs/github-actions.md](github-actions.md) for full CI setup.