Goal: Build an evaluation framework that systematically tests and scores your AI agents, using the same local model as both the agent under test and the judge, so you can iterate on prompts with confidence before shipping.
When building AI agents, "it looks about right" is not good enough. Evaluation-led development treats agent outputs like code: you write tests first, measure quality, and only ship when scores meet a threshold.
In the Zava Creative Writer (Part 7), the Editor agent already acts as a lightweight evaluator; it decides ACCEPT or REVISE. This lab formalises that pattern into a repeatable evaluation framework you can apply to any agent or pipeline.
| Problem | Solution |
|---|---|
| Prompt changes silently break quality | Golden dataset catches regressions |
| "Works on one example" bias | Multiple test cases reveal edge cases |
| Subjective quality assessment | Rule-based + LLM-as-judge scoring provides numbers |
| No way to compare prompt variants | Side-by-side eval runs with aggregate scores |
A golden dataset is a curated set of test cases with known expected outputs. Each test case includes:
- Input: The prompt or question to send to the agent
- Expected output: What a correct or high-quality response should contain (keywords, structure, facts)
- Category: Grouping for reporting (e.g. "factual accuracy", "tone", "completeness")
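A minimal sketch of how such a dataset might look in Python (the first case mirrors the example later in this lab; the second is hypothetical, and the field names are illustrative, not necessarily the sample's exact schema):

```python
# A hypothetical golden dataset: each case pairs an input with expected
# keywords and a reporting category. Field names are illustrative.
GOLDEN_DATASET = [
    {
        "input": "What tools do I need to build a deck?",
        "expected": ["saw", "drill", "screws", "level"],
        "category": "product-recommendation",
    },
    {
        "input": "What safety gear should I wear when sanding?",  # hypothetical case
        "expected": ["goggles", "mask", "gloves"],
        "category": "safety",
    },
]
```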
Fast, deterministic checks that do not require an LLM:
| Check | What It Tests |
|---|---|
| Length bounds | Response is not too short (lazy) or too long (rambling) |
| Required keywords | Response mentions expected terms or entities |
| Format validation | JSON is parsable, Markdown headers are present |
| Forbidden content | No hallucinated brand names, no competitor mentions |
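A rough sketch of how these checks could be implemented as pure functions over the response text (thresholds and check names follow the table above; the sample's actual implementation may differ):

```python
from typing import Sequence

def rule_checks(response: str, expected_keywords: Sequence[str],
                forbidden: Sequence[str] = ()) -> dict:
    """Deterministic checks: each returns a score between 0 and 1."""
    words = response.split()
    text = response.lower()
    return {
        # Length bounds: penalise lazy or rambling answers.
        "length": 1.0 if 50 <= len(words) <= 500 else 0.0,
        # Required keywords: fraction of expected terms mentioned.
        "keywords": sum(k.lower() in text for k in expected_keywords) / len(expected_keywords),
        # Forbidden content: any hit zeroes the check.
        "forbidden": 0.0 if any(f.lower() in text for f in forbidden) else 1.0,
    }
```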
Use the same local model to grade its own outputs (or outputs from a different prompt variant). The judge receives:
- The original question
- The agent's response
- Grading criteria
And returns a structured score. This mirrors the Editor pattern from Part 7 but applied systematically across a test suite.
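As a sketch, the judge call can go through Foundry Local's OpenAI-compatible endpoint. The port, model alias, and prompt wording below are placeholder assumptions, not the sample's exact code:

```python
from openai import OpenAI

# Assumption: Foundry Local exposes an OpenAI-compatible endpoint on localhost.
# The port is a placeholder; Foundry Local assigns its endpoint when the service starts.
client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")

def judge(question: str, response: str, criteria: str,
          model: str = "phi-3.5-mini") -> str:
    """Send the question, the agent's response, and the grading criteria to the judge."""
    prompt = (
        f"Grading criteria:\n{criteria}\n\n"
        f"Question: {question}\n\nResponse to grade:\n{response}\n\n"
        "Reply with 'Score: <1-5>' and one sentence of reasoning."
    )
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep grading as deterministic as possible
    )
    return result.choices[0].message.content
```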
| Requirement | Details |
|---|---|
| Foundry Local CLI | Installed with a model downloaded |
| Language runtime | Python 3.9+, Node.js 18+, or .NET 9+ SDK, depending on which sample you run |
| Completed | Part 5: Single Agents and Part 6: Multi-Agent Workflows |
The workshop includes a complete evaluation sample that tests a Foundry Local agent against a golden dataset of Zava DIY-related questions.
🐍 Python
Setup:
cd python
python -m venv venv
# Windows (PowerShell):
venv\Scripts\Activate.ps1
# macOS:
source venv/bin/activate
pip install -r requirements.txt
Run:
python foundry-local-eval.py
What happens:
- Connects to Foundry Local and loads the model
- Defines a golden dataset of 5 test cases (Zava product questions)
- Runs two prompt variants against every test case
- Scores each response with rule-based checks (length, keywords, format)
- Scores each response with LLM-as-judge (the same model grades quality 1-5)
- Prints a scorecard comparing both prompt variants
📦 JavaScript
Setup:
cd javascript
npm install
Run:
node foundry-local-eval.mjs
Same evaluation pipeline: golden dataset, dual prompt runs, rule-based + LLM scoring, scorecard.
💜 C#
Setup:
cd csharp
dotnet restore
Run:
dotnet run eval
Same evaluation pipeline: golden dataset, dual prompt runs, rule-based + LLM scoring, scorecard.
Examine the test cases defined in the evaluation sample. Each test case has:
{
"input": "What tools do I need to build a deck?",
"expected": ["saw", "drill", "screws", "level"],
"category": "product-recommendation"
}
Questions to consider:
- Why are the expected values keywords rather than full sentences?
- How many test cases do you need for a reliable evaluation?
- What categories would you add for your own application?
The evaluation framework uses two scoring layers:
✓ Length: 50-500 words → 1 point
✓ Keywords: 3/4 found → 0.75 points
✗ Forbidden: mentions "Home Depot" → 0 points
─────────────────────────────
Rule score: 0.58 / 1.0
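In this example the three checks appear to be averaged: (1 + 0.75 + 0) / 3 ≈ 0.58. A sketch of that aggregation (equal weighting is an assumption; the sample may weight checks differently):

```python
def rule_score(checks: dict) -> float:
    """Average the individual 0-1 check scores into a single rule score."""
    return sum(checks.values()) / len(checks)

# e.g. rule_score({"length": 1.0, "keywords": 0.75, "forbidden": 0.0}) -> ~0.58
```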
The same local model acts as a judge with a structured rubric:
Rate this response on a scale of 1-5:
- 1: Completely wrong or irrelevant
- 2: Partially correct but missing key information
- 3: Adequate but could be improved
- 4: Good response with minor issues
- 5: Excellent, comprehensive, well-structured
Score: 4
Reasoning: The response correctly identifies all necessary tools
and provides practical advice, but could include safety equipment.
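A small sketch of turning that free-text verdict into something you can aggregate (the Score/Reasoning labels are taken from the example above; the sample may use structured JSON instead):

```python
import re

def parse_verdict(verdict: str) -> dict:
    """Extract the 1-5 score and the reasoning from the judge's reply."""
    score = re.search(r"Score:\s*([1-5])", verdict)
    reasoning = re.search(r"Reasoning:\s*(.+)", verdict, re.DOTALL)
    return {
        "score": int(score.group(1)) if score else None,
        "reasoning": reasoning.group(1).strip() if reasoning else verdict.strip(),
    }
```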
Questions to consider:
- When would you trust rule-based checks over LLM-as-judge?
- Can a model reliably judge its own output? What are the limitations?
- How does this compare to the Editor agent pattern from Part 7?
The sample runs two prompt variants against the same test cases:
| Variant | System Prompt Style |
|---|---|
| Baseline | Generic: "You are a helpful assistant" |
| Specialised | Detailed: "You are a Zava DIY expert who recommends specific products and provides step-by-step guidance" |
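A sketch of the comparison loop, reusing the helper sketches above; `run_agent` and `RUBRIC` are illustrative stand-ins for the sample's own agent call and rubric text:

```python
VARIANTS = {
    "baseline": "You are a helpful assistant.",
    "specialised": ("You are a Zava DIY expert who recommends specific "
                    "products and provides step-by-step guidance."),
}
RUBRIC = "Rate 1-5: 1 = wrong or irrelevant, 3 = adequate, 5 = excellent and comprehensive."

def run_agent(system_prompt: str, question: str, model: str = "phi-3.5-mini") -> str:
    """Query the agent under test with the variant's system prompt."""
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": question}],
    )
    return result.choices[0].message.content

for name, system_prompt in VARIANTS.items():
    for case in GOLDEN_DATASET:
        response = run_agent(system_prompt, case["input"])
        checks = rule_checks(response, case["expected"])
        verdict = parse_verdict(judge(case["input"], response, RUBRIC))
        print(f"{name:12} {case['category']:25} "
              f"rule={rule_score(checks):.2f} llm={verdict['score']}")
```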
After running, you will see a scorecard like:
╔══════════════════════════════════════════════════════════════╗
║ EVALUATION SCORECARD ║
╠══════════════════════════════════════════════════════════════╣
║ Prompt Variant │ Rule Score │ LLM Score │ Combined ║
╠═══════════════════╪════════════╪═══════════╪════════════════╣
║ Baseline │ 0.62 │ 3.2 / 5 │ 0.62 ║
║ Specialised │ 0.81 │ 4.1 / 5 │ 0.81 ║
╚══════════════════════════════════════════════════════════════╝
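One simple way to derive the Combined column is to normalise the 1-5 judge score to 0-1 and average it with the rule score. This is an illustrative formula; the sample's exact weighting may differ:

```python
def combined_score(rule: float, llm: float, llm_max: float = 5.0) -> float:
    """Blend the rule score (0-1) with the normalised judge score, equal weight."""
    return (rule + llm / llm_max) / 2
```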
Exercises:
- Run the evaluation and note the scores for each variant
- Modify the specialised prompt to be even more specific. Does the score improve?
- Add a third prompt variant and compare all three.
- Try changing the model alias (e.g. `phi-4-mini` vs `phi-3.5-mini`) and compare results.
Use the evaluation framework as a template for your own agents:
- Define your golden dataset: write 5 to 10 test cases with expected keywords.
- Write your system prompt: the agent instructions you want to test.
- Run the eval: get baseline scores.
- Iterate: tweak the prompt, re-run, and compare.
- Set a threshold: e.g. "do not ship below 0.75 combined score".
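For step 5, the gate can be as small as one assertion in a test or CI script (the 0.75 threshold is the example value above):

```python
def gate(combined_scores: list, threshold: float = 0.75) -> None:
    """Fail fast (e.g. in CI) if the average combined score drops below the threshold."""
    average = sum(combined_scores) / len(combined_scores)
    assert average >= threshold, f"Eval gate failed: {average:.2f} < {threshold}, do not ship"
```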
Look back at the Zava Creative Writer's Editor agent (zava-creative-writer-local/src/api/agents/editor/editor.py):
# The editor is an LLM-as-judge in production:
{"decision": "accept/revise", "editorFeedback": "...", "researchFeedback": "..."}This is the same concept as Part 8's LLM-as-judge, but embedded in the production pipeline rather than in an offline test suite. Both patterns:
- Use structured JSON output from the model.
- Apply quality criteria defined in the system prompt.
- Make a pass/fail decision.
The difference: The editor runs in production (on every request). The evaluation framework runs in development (before you ship).
| Concept | Takeaway |
|---|---|
| Golden datasets | Curate test cases early; they are your regression tests for AI |
| Rule-based checks | Fast, deterministic, and catch obvious failures (length, keywords, format) |
| LLM-as-judge | Nuanced quality scoring using the same local model |
| Prompt comparison | Run multiple variants against the same test suite to pick the best |
| On-device advantage | All evaluation runs locally: no API costs, no rate limits, no data leaving your machine |
| Eval-before-ship | Set quality thresholds and gate releases on evaluation scores |
- Scale up: Add more test cases and categories to your golden dataset.
- Automate: Integrate evaluation into your CI/CD pipeline.
- Advanced judges: Use a larger model as the judge whilst testing a smaller model's output.
- Track over time: Save evaluation results to compare across prompt and model versions.
Continue to Part 9: Voice Transcription with Whisper to explore speech-to-text on-device using the Foundry Local SDK.