GEDD - A Systematic Evidence Driven LLM As a Judge Framework

GEDD is a systematic, evidence-driven LLM-as-a-judge framework for AI agents.

It is an annotation-first workflow for turning domain-owner review of AI agent behavior into release gates engineering can run.

The web app gives product managers, domain experts, and ML engineers one shared path:

1. Define the agent and the work it is supposed to do.
2. Collect or load representative queries and responses.
3. Review the responses in a task-shaped workbench.
4. Name failures in the domain owner's vocabulary.
5. Convert the observed failures into an LLM-as-a-judge prompt.
6. Export a validated handoff for CI, MLflow, and model regression work.

The current first-run experience ships with two 50-query PM workbench demos: an AAA game localization session and an AWS cloud GDPR auditor session. They show how a domain owner can move from raw agent traces to open codes, root-cause patterns, saturation evidence, a judge prompt, and an ML engineer implementation queue.

The longer methodology essay is in METHODOLOGY.md. This README is the practical product and engineering guide.

What GEDD Produces

| Output | Who creates it | Who uses it | Why it matters |
| --- | --- | --- | --- |
| Golden queries | PM or domain expert | ML engineer, eval owner | Defines the user situations the agent must handle |
| Human labels | PM or domain expert | Judge builder, release owner | Separates acceptable, partial, and failing behavior |
| Failure codebook | PM or domain expert | ML engineer, prompt owner | Names the exact domain-specific failure modes to fix |
| Memos and severity | PM or domain expert | ML engineer, reviewer | Explains why the failure matters and how bad it is |
| Axial coding | PM or domain expert | Product and engineering leads | Groups repeated failures into root causes and consequences |
| Judge prompt | PM plus ML engineer | CI and model evaluation | Converts observed failures into automated review criteria |
| `session.json` handoff | App or CLI | ML engineer | Carries agent spec, prompt, queries, labels, and validation state |
| MLflow artifacts | ML engineer | Release pipeline | Tracks datasets, judges, evaluation runs, and regression gates |

GEDD is not a generic model leaderboard. It is a way to preserve expert judgment and make it executable.
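
The `session.json` handoff in the table above is described as carrying the agent spec, prompt, queries, labels, and validation state. The sketch below shows one way a consumer might fail fast when a section is absent. The key names in `EXPECTED_SECTIONS` and both function names are assumptions made for illustration; the actual schema written by the app may differ.

```python
import json
from pathlib import Path

# Hypothetical top-level keys; the real session.json schema may differ.
EXPECTED_SECTIONS = ("agent_spec", "prompt", "queries", "labels", "validation")


def missing_sections(session: dict) -> list[str]:
    """List expected sections that are absent or empty in a loaded session."""
    return [key for key in EXPECTED_SECTIONS if not session.get(key)]


def load_session(path: str) -> dict:
    """Read a session handoff file and raise if an expected section is missing."""
    session = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_sections(session)
    if missing:
        raise ValueError(f"session handoff is missing: {', '.join(missing)}")
    return session
```

Checking the handoff before it reaches CI keeps a partially annotated session from being mistaken for a release-ready one.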

Quick Start

Start the web app:

cd grounded-evals
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
grounded-evals serve --host 127.0.0.1 --port 8080

Open http://127.0.0.1:8080.

No Codex skill or plugin is required.

Local runs start in guest mode unless ADMIN_PASSWORD or Cognito environment variables are configured. If port 8080 is busy, use --port 8081.
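
If you are not sure whether port 8080 is free, a quick check can pick the first available port before you start the server. This is a minimal sketch that uses only the Python standard library. The helper name `first_free_port` and the candidate list are illustrative and are not part of grounded-evals.

```python
import socket
from typing import Iterable


def first_free_port(host: str = "127.0.0.1",
                    candidates: Iterable[int] = (8080, 8081, 8082)) -> int:
    """Return the first candidate port that can be bound on host."""
    candidates = tuple(candidates)
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is in use or not permitted; try the next one
            return port  # the socket is closed when the with block exits
    raise RuntimeError(f"none of the ports {candidates} are free on {host}")


if __name__ == "__main__":
    port = first_free_port()
    print(f"grounded-evals serve --host 127.0.0.1 --port {port}")
```

The check is advisory: another process could still claim the port between the check and the server start, so treat a bind error from `grounded-evals serve` as the final word.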

For the fastest product tour, use one of the seeded 50-query demos. They do not require model calls:

1. Open `Home` or `Demos`.
2. Click "Load 50-query localization demo" or "Load 50-query AWS Cloud GDPR demo".
3. Open `PM Workbench` to review the labeled traces, failure codes, memos, and saturation state.
4. Open `Judge` to inspect or revise the generated judge prompt.
5. Open `Report` to review release readiness and download the ML engineer handoff.

To reset after loading a demo, use the top-right refresh action. Confirm Start Fresh to clear the loaded project data while keeping the current login session.

Current Web App

grounded-evals serve runs a NiceGUI app with a short primary navigation:

| Page | Purpose | Main actions |
| --- | --- | --- |
| `Home` | Entry point | Load the 50-query localization or AWS Cloud GDPR demo, continue active work, or start a custom agent |
| `AI PM Coach` | Guided setup | Capture agent definition, system prompt, runtime choice, and golden-query plan |
| `PM Workbench` | Annotation surface | Review responses, assign verdicts, create failure codes, set severity, write memos, and monitor saturation |
| `Judge` | Release gate builder | Generate and edit an LLM-as-a-judge prompt from the observed failure modes |
| `Report` | Engineering handoff | Review quality signals, CI gates, artifact readiness, implementation queue, and export files |

The Demos page remains available for starter data. It is not the main workflow. Demos are seed sessions that help teams understand the annotation loop before they bring their own traces.

The 50-Query Localization Demo

The main demo is a synthetic but complete localization QA session for an AAA game agent called LocaleGate.

It includes:

| Asset | Contents |
| --- | --- |
| 50 golden queries | Runtime strings, storefront copy, subtitles, RTL input prompts, region rules, culturalization, paid-currency copy, live-event dates, and glossary consistency |
| Synthetic responses | Baseline agent answers with realistic localization failures |
| PM annotations | Correct, partial, and incorrect verdicts with severity and confidence |
| Open codes | Localization-specific failure labels rather than generic quality tags |
| Axial coding | Root causes, context, intervening conditions, action strategy, and consequence mapping |
| Saturation evidence | Final-window evidence that new annotations repeat existing codes |
| Judge prompt | A release-gate judge built from the localization failure modes |
| Report handoff | CI gates, artifact status, implementation queue, and commands for an ML engineer |
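
Saturation evidence, as described in the table above, means that the final window of annotations repeats codes that already exist. The sketch below illustrates that idea under stated assumptions: the default window of 10, the function names, and the input shape (one list of code labels per annotated response, in review order) are hypothetical and are not taken from the app's implementation.

```python
from typing import Sequence


def new_codes_in_final_window(
    codes_per_annotation: Sequence[Sequence[str]], window: int = 10
) -> set[str]:
    """Return codes that first appear inside the last `window` annotations."""
    if window <= 0:
        raise ValueError("window must be positive")
    split = max(len(codes_per_annotation) - window, 0)
    seen_before = {c for codes in codes_per_annotation[:split] for c in codes}
    final = {c for codes in codes_per_annotation[split:] for c in codes}
    return final - seen_before


def is_saturated(
    codes_per_annotation: Sequence[Sequence[str]], window: int = 10
) -> bool:
    """True when there is earlier history and the final window adds no new code."""
    if len(codes_per_annotation) <= window:
        return False  # not enough earlier annotations to compare against
    return not new_codes_in_final_window(codes_per_annotation, window)
```

A response with no failure is represented by an empty list, so correct answers in the final window count toward saturation rather than against it.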

Example failure codes in the demo include:

| Code | What it catches |
| --- | --- |
| Placeholder And Markup Corruption | The response approves a translation that drops variables, tags, markup, or runtime-safe formatting |
| Gameplay Meaning Reversal | The localized text reverses the gameplay instruction or player action |
| Rating Or Disclosure Softening | Marketing or regional copy weakens required rating, privacy, paid-currency, or platform disclosures |
| RTL Input Direction Drift | Right-to-left layout or controller input language changes the intended interaction |
| Locale Format Ambiguity | Dates, times, numbers, or currencies remain ambiguous for the target locale |
| Entitlement Copy Mistranslation | Storefront text changes what the buyer receives or what content is included |
| Culturalization Risk Dismissal | The response treats regional content risk as a translation-only issue |

Those labels are the point of the workflow. The judge is not asked to score generic helpfulness first. It is asked to enforce the domain owner's observed release blockers.
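
To make the idea concrete, here is a minimal sketch of turning a codebook into judge criteria. It is not the prompt the `Judge` page generates: the function name, the prompt wording, and the JSON output keys are assumptions chosen for illustration. Only the two code names and their descriptions come from the demo table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCode:
    name: str
    catches: str


# Two codes copied from the demo codebook; a real session would pass all of them.
DEMO_CODES = [
    FailureCode(
        "Placeholder And Markup Corruption",
        "The response approves a translation that drops variables, tags, "
        "markup, or runtime-safe formatting",
    ),
    FailureCode(
        "Gameplay Meaning Reversal",
        "The localized text reverses the gameplay instruction or player action",
    ),
]


def build_judge_prompt(agent_name: str, codes: list[FailureCode]) -> str:
    """Render observed failure codes as a numbered list of release blockers."""
    lines = [
        f"You are reviewing a response from {agent_name}.",
        "Check the response against each release blocker below.",
        "",
    ]
    for index, code in enumerate(codes, start=1):
        lines.append(f"{index}. {code.name}: {code.catches}")
    lines += [
        "",
        # Hypothetical output contract; adapt it to your evaluation harness.
        "Return JSON with keys: verdict, matched_codes, rationale.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_judge_prompt("LocaleGate", DEMO_CODES))
```

Because the criteria are generated from the codebook, a new failure code observed during annotation becomes a new numbered blocker without rewriting the rest of the prompt.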

The 50-Query AWS Cloud GDPR Demo

The second workbench demo is a synthetic AWS cloud GDPR audit session for an agent called CloudAuditGate.

It includes 50 golden queries covering S3 and CloudWatch retention, CloudTrail and centralized logging, Bedrock prompt reuse, Rekognition and high-risk review, DSAR and deletion handling across backups and data lakes, shared responsibility, cross-region transfers, and breach escalation from AWS security incidents. The output is the same PM-owned package as the localization demo: annotations, open codes, axial coding, saturation evidence, and an audit-ready judge prompt.

The AWS Cloud GDPR demo uses plain-language tags on purpose, for example Data Used For The Wrong Job, Collecting Or Keeping Too Much Data, EU Data Moved The Wrong Way, and Trying To Work Around GDPR. The point is to make the GEDD loop easy to follow: annotate the failure in human language first, then turn that observed pattern into the judge gate.

Bring Your Own Agent

Use the app when you have a real or proposed agent and need review evidence before you automate evaluation.

| Step | What to do | Output |
| --- | --- | --- |
| 1. Define | Describe the agent, user, task boundary, and system prompt in `AI PM Coach` | Agent spec and prompt |
| 2. Build queries | Generate or paste golden queries that cover normal, edge, ambiguous, adversarial, multi-turn, and recovery cases | Query set |
| 3. Get responses | Run the saved prompt against Bedrock, Anthropic, or a configured runtime, or paste existing traces | Response queue |
| 4. Annotate | Review each response in `PM Workbench` and capture verdict, code, severity, confidence, and memo | Human labels and codebook |
| 5. Pattern | Use open coding and axial coding to group repeated failures and root causes | Release-risk model |
| 6. Judge | Build the judge prompt from the observed codes and examples | LLM-as-a-judge prompt |
| 7. Handoff | Export the session and ML handoff from `Report` | Engineering package |

If you already have production traces, use the app as an annotation surface rather than generating new responses. See Paste In Traces.
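
Whether responses are generated or pasted in, step 4 captures the same fields for each one: verdict, code, severity, confidence, and memo. The record below is a sketch of that shape, not the app's data model. The class name, the 0 to 1 confidence range, the free-form severity string, and the rule that a non-correct verdict needs at least one code are all assumptions. The three verdict values follow the demo description.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

VERDICTS = {"correct", "partial", "incorrect"}


@dataclass
class Annotation:
    query_id: str
    verdict: str
    codes: list[str] = field(default_factory=list)
    severity: Optional[str] = None  # e.g. "P0"; the scale is an assumption
    confidence: float = 1.0         # assumed to be a value between 0 and 1
    memo: str = ""

    def __post_init__(self) -> None:
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.verdict != "correct" and not self.codes:
            raise ValueError("partial or incorrect verdicts need a failure code")


def code_counts(annotations: list[Annotation]) -> Counter:
    """Count how often each failure code was applied across annotations."""
    return Counter(code for a in annotations for code in a.codes)
```

Counting codes across annotations gives a first, rough view of which failures repeat, which is the input that open and axial coding in step 5 build on.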

ML Engineer Handoff

The Report page contains an ML Engineer Handoff section. It is designed to be actionable, not a narrative status update.

It gives engineering:

| Handoff field | Why it exists |
| --- | --- |
| Engineering status | Indicates whether the session is blocked by P0 failures, missing a judge, needs calibration, or is ready for a CI pilot |
| CI gates | Shows current and target values for P0 failures, regression pass rate, human coverage, and judge-human agreement |
| Artifact status | Confirms whether session handoff, golden dataset, codebook, judge prompt, and calibration evidence are ready |
| Implementation queue | Prioritizes failure codes by severity and count, with tagged examples and definitions of done |
| Runbook | Gives commands the ML engineer can run immediately |
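
The CI gates row compares current values against targets for four signals. A minimal sketch of that comparison follows. The `Gate` class, the `failing_gates` helper, and every number in `example_gates` are illustrative assumptions; real values and targets come from the `Report` page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    current: float
    target: float
    higher_is_better: bool = True

    def passed(self) -> bool:
        """Compare current to target in the direction that counts as good."""
        if self.higher_is_better:
            return self.current >= self.target
        return self.current <= self.target


def failing_gates(gates: list[Gate]) -> list[str]:
    """Return the names of gates whose current value misses the target."""
    return [gate.name for gate in gates if not gate.passed()]


# Illustrative values only; they are not taken from either demo.
example_gates = [
    Gate("P0 failures", current=2, target=0, higher_is_better=False),
    Gate("Regression pass rate", current=0.96, target=0.95),
    Gate("Human coverage", current=1.0, target=1.0),
    Gate("Judge-human agreement", current=0.81, target=0.85),
]

if __name__ == "__main__":
    for name in failing_gates(example_gates):
        print(f"gate failed: {name}")
```

In a CI job, a non-empty result from `failing_gates` would typically map to a non-zero exit code so the pipeline stops before release.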

License And Security

License: MIT-0. See LICENSE.

Security issue reporting: see CONTRIBUTING.md.
