Multilingual GenAI evaluation service that benchmarks LLM outputs across 5 task types and 3 languages, persists run history, and exposes a small dashboard showing pass rates and regression trends per model version.
Numbers below are from the committed `eval/baselines/fake-large.json` baseline, generated by running the full suite against the in-repo `FakeProvider`. The `FakeProvider` is scripted to produce some intentionally wrong outputs so the failure path of the scoring code is exercised on every CI run; that is why the cells below are not all 100%. Live-provider numbers (OpenAI / Anthropic) are BYOK; see Running against a live provider.
Overall pass rate: 66.7% (n=39 examples, 0 infrastructure errors).
| task | en | es | ja | py | en-es | en-ja | es-en |
|---|---|---|---|---|---|---|---|
| summarization | 33.3% | 33.3% | 0.0% | — | — | — | — |
| translation | — | — | — | — | 100.0% | 66.7% | 100.0% |
| qa | 66.7% | 66.7% | 100.0% | — | — | — | — |
| classification | 66.7% | 66.7% | 66.7% | — | — | — | — |
| code_repair | — | — | — | 100.0% | — | — | — |
The `summarization/ja` cell is 0% because the local ROUGE-L tokeniser splits on whitespace, and Japanese has no inter-word whitespace. This is a known limitation of whitespace-tokenised ROUGE-L applied to CJK and is captured in the baseline rather than papered over. A future task module can swap in a character-level metric for `ja`.
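To make the failure mode concrete, here is a minimal sketch (not the repo's `metrics` implementation; the helper names are mine) showing how whitespace tokenisation collapses a Japanese sentence into a single token, while character-level tokenisation recovers the overlap:

```python
# Minimal illustration only -- not the repo's ROUGE-L implementation.
def lcs_len(a, b):
    """Length of the longest common subsequence (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


def rouge_l_f1(candidate: str, reference: str, tokenize) -> float:
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)


reference = "東京は日本の首都です"   # "Tokyo is the capital of Japan"
candidate = "日本の首都は東京です"   # same content, different word order

# Whitespace tokenisation sees each sentence as ONE token, so any non-identical
# pair scores 0.0 -- the summarization/ja failure mode.
print(rouge_l_f1(candidate, reference, str.split))  # 0.0
# Character-level tokenisation recovers the shared content.
print(rouge_l_f1(candidate, reference, list))       # 0.7
```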
- Eval-as-CI-gate. A full eval matrix (5 tasks × 3 languages, 30 examples) runs on every push against `FakeProvider`. The job asserts that pass rates match the committed baseline within `1e-6`, so a behavioral regression in any scoring or task module fails the build before merge.
- Localization quality is its own concern. Beyond raw task metrics, every non-English output is also scored on (a) script correctness (Unicode-block detection, e.g. romaji output for `ja` fails), (b) honorific appropriateness for `ja` (LLM-as-judge against a committed rubric), and (c) calque artifacts for `es` (LLM-as-judge). The script check is cheap and deterministic; the two judges catch failure modes the script check never can.
- Regression-flag heuristic. Per (model, task, language) cell, the dashboard flags any run whose pass rate drops more than 5 percentage points below the rolling 7-run mean. Not a statistical test, just a coarse trip-wire; a sketch of this check appears after this list.
- Hermetic evals. The provider seam is a `Protocol`. `FakeProvider` returns deterministic responses keyed on a tag protocol embedded in the prompt, so CI never makes a network call. The live `OpenAIProvider` / `AnthropicProvider` sit behind the same interface and are tested with `respx` mocks; a sketch of the seam follows directly below.
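A rough sketch of that seam, hedged: the method name `complete`, the `[tag:...]` marker, and the canned response table below are illustrative assumptions, not the repo's exact API.

```python
# Illustrative only -- the real ChatProvider / FakeProvider may differ.
from typing import Protocol


class ChatProvider(Protocol):
    async def complete(self, prompt: str, model: str) -> str:
        """Return the model's text completion for one prompt."""
        ...


class FakeProvider:
    """Deterministic provider: the prompt embeds a tag, the tag selects a canned reply."""

    # Hypothetical tag -> reply table; some replies are intentionally wrong so the
    # scoring failure path is exercised in CI.
    _RESPONSES = {
        "qa/en/capital-france": "Paris",
        "summarization/ja/short-news": "TOKYO NEWS SUMMARY",  # wrong script on purpose
    }

    async def complete(self, prompt: str, model: str) -> str:
        # Look for a marker like "[tag:qa/en/capital-france]" in the prompt.
        _, _, rest = prompt.partition("[tag:")
        tag = rest.split("]", 1)[0]
        return self._RESPONSES.get(tag, "")
```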
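And the regression trip-wire from the third bullet, again as a hedged sketch: only the 5-point / 7-run rule comes from the text above; the function name and the list-of-floats history shape are assumptions.

```python
# Hedged sketch of the dashboard's trip-wire; names and data shapes are assumed.
def is_regression(history: list[float], current: float,
                  window: int = 7, threshold_pp: float = 5.0) -> bool:
    """Flag `current` if it is more than `threshold_pp` percentage points below
    the rolling mean of the last `window` pass rates (0-100) for one
    (model, task, language) cell."""
    recent = history[-window:]
    if not recent:
        return False          # no history yet, nothing to compare against
    rolling_mean = sum(recent) / len(recent)
    return (rolling_mean - current) > threshold_pp


# Seven runs around 66.7%, then a drop: flagged. A small wobble: not flagged.
print(is_regression([66.7] * 7, 55.0))   # True
print(is_regression([66.7] * 7, 63.0))   # False
```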
| Path | Role |
|---|---|
| `src/genai_eval/api.py` | FastAPI: `/v1/runs`, `/v1/trends`, `/healthz`, `/metrics` |
| `src/genai_eval/cli.py` | `genai-eval run` Click CLI |
| `src/genai_eval/orchestrator.py` | Task × language matrix runner, persistence |
| `src/genai_eval/models.py` | SQLAlchemy 2 async; `Run`, `RunItem`, `ModelVersion` |
| `src/genai_eval/providers/` | `ChatProvider` Protocol + Fake / OpenAI / Anthropic |
| `src/genai_eval/tasks/` | One module per task type: `load_suite`, `score` |
| `src/genai_eval/metrics/` | Pure-Python ROUGE-L, chrF, exact-match, token-F1 |
| `src/genai_eval/localization/` | Script check, honorific judge, calque judge |
| `eval/suites/` | Hand-written YAML examples per task × language |
| `eval/baselines/fake-large.json` | Committed FakeProvider baseline |
| `dashboard/` | Next.js 14 dashboard (App Router, Tailwind, recharts) |
```bash
poetry install
poetry run alembic upgrade head
poetry run genai-eval run --provider fake --model fake-large \
  --output eval/baselines/fake-large.json
poetry run uvicorn genai_eval.api:app --reload
# in another shell
cd dashboard && npm install && npm run dev
```

Visit http://localhost:3000. The dashboard reads from the API at http://localhost:8000 (override with `GENAI_EVAL_API_URL`).
```bash
export OPENAI_API_KEY=sk-...
poetry run genai-eval run \
  --provider openai --model gpt-4o-mini \
  --output eval/baselines/gpt-4o-mini.json
```

Anthropic:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
poetry run genai-eval run \
  --provider anthropic --model claude-3-5-haiku-latest \
  --output eval/baselines/claude-3-5-haiku.json
```

Live results: <TBD: see BYOK section>. Populate by running the commands above against your own keys.
┌──────────────────────────────────────────────────┐
│ genai-eval CLI │
│ poetry run genai-eval run │
└──────────────────────┬───────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ orchestrator.run_suite │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ collect_examples ◄── eval/suites/*.yaml │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ ChatProvider │ │ task.score │ │ localization │ │
│ │ (Fake/OpenAI/ │ │ rouge_l/chrf/ │ │ script/judges │ │
│ │ Anthropic) │ │ EM/F1/pytest │ │ │ │
│ └───────┬────────┘ └───────┬────────┘ └───────┬────────┘ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ summarise → cells (task × lang) → SQLAlchemy persist │ │
│ └──────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────┬─────────────────────────────────┘
│ Run, RunItem, ModelVersion
▼
┌─────────────────────────────────────────────┐
│ SQLite (aiosqlite, study-scale) │
└────────────────────┬────────────────────────┘
│
┌────────┴─────────┐
▼ ▼
FastAPI /v1/runs Next.js dashboard
/v1/trends pass-rate grid +
/healthz /metrics regression trends
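Condensing the diagram into code: a minimal, hedged sketch of what `orchestrator.run_suite` does, where the `Example` shape and the injected `score` / `persist` callables are assumptions standing in for the real task, localization, and SQLAlchemy layers.

```python
# Hedged sketch of the run_suite control flow shown above; signatures are assumed.
from collections import defaultdict
from dataclasses import dataclass
from typing import Awaitable, Callable, Iterable


@dataclass
class Example:
    task: str        # "summarization", "qa", "classification", ...
    language: str    # "en", "es", "ja", "py", "en-es", ...
    prompt: str
    reference: str


async def run_suite(
    provider,                                          # ChatProvider: Fake / OpenAI / Anthropic
    model: str,
    examples: Iterable[Example],                       # collected from eval/suites/*.yaml
    score: Callable[[Example, str], bool],             # task metric + localization checks
    persist: Callable[[str, dict], Awaitable[None]],   # writes Run / RunItem / ModelVersion
) -> dict[str, float]:
    # Accumulate pass/fail per (task, language) cell.
    cells: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for ex in examples:
        output = await provider.complete(ex.prompt, model=model)
        cells[(ex.task, ex.language)].append(score(ex, output))
    # Summarise each cell as a pass rate, then persist the run.
    summary = {f"{task}/{lang}": sum(v) / len(v) for (task, lang), v in cells.items()}
    await persist(model, summary)
    return summary
```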
- Not a fine-tuning pipeline.
- Not an inference cost optimizer.
- Not a human-in-the-loop scoring tool.
- Not a public leaderboard.
- Not a model-serving gateway.
The scope is deliberately narrow: produce reliable, reproducible eval signals with hermetic CI behavior and a regression-trend view that fits on one screen.
MIT — see LICENSE.