Multilingual GenAI evaluation service that benchmarks LLM outputs across 5 task types and 3 languages, persists run history, and exposes a small dashboard showing pass rates and regression trends per model version.
Numbers below are from the committed `eval/baselines/fake-large.json` baseline, generated by running the full suite against the in-repo `FakeProvider`. The `FakeProvider` is scripted to produce some intentionally wrong outputs so the failure path of the scoring code is exercised on every CI run; that is why the cells below are not all 100%. Live-provider numbers (OpenAI / Anthropic) are BYOK; see Running against a live provider.
Overall pass rate: 66.7% (n=39 examples, 0 infrastructure errors).
| task | en | es | ja | py | en-es | en-ja | es-en |
|---|---|---|---|---|---|---|---|
| summarization | 33.3% | 33.3% | 0.0% | — | — | — | — |
| translation | — | — | — | — | 100.0% | 66.7% | 100.0% |
| qa | 66.7% | 66.7% | 100.0% | — | — | — | — |
| classification | 66.7% | 66.7% | 66.7% | — | — | — | — |
| code_repair | — | — | — | 100.0% | — | — | — |
The `summarization/ja` cell is 0% because the local ROUGE-L tokeniser splits on whitespace, and Japanese has no inter-word whitespace. This is a known limitation of whitespace-tokenised ROUGE-L applied to CJK and is captured in the baseline rather than papered over. A future task module can swap in a character-level metric for `ja`.
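To make the failure mode concrete, here is a minimal sketch (not the repo's `metrics` implementation; the helper names are mine) showing how whitespace tokenisation collapses a Japanese sentence into a single token, while character-level tokenisation recovers the overlap:

```python
# Minimal illustration only -- not the repo's ROUGE-L implementation.
def lcs_len(a, b):
    """Length of the longest common subsequence (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


def rouge_l_f1(candidate: str, reference: str, tokenize) -> float:
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)


reference = "東京は日本の首都です"   # "Tokyo is the capital of Japan"
candidate = "日本の首都は東京です"   # same content, different word order

# Whitespace tokenisation sees each sentence as ONE token, so any non-identical
# pair scores 0.0 -- the summarization/ja failure mode.
print(rouge_l_f1(candidate, reference, str.split))  # 0.0
# Character-level tokenisation recovers the shared content.
print(rouge_l_f1(candidate, reference, list))       # 0.7
```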
- Eval-as-CI-gate. A full eval matrix (5 tasks × 3 languages, 30 examples) runs on every push against `FakeProvider`. The job asserts that pass rates match the committed baseline within `1e-6`, so a behavioral regression in any scoring or task module fails the build before merge.
- Localization quality is its own concern. Beyond raw task metrics, every non-English output is also scored on (a) script correctness (Unicode-block detection, e.g. romaji output for `ja` fails), (b) honorific appropriateness for `ja` (LLM-as-judge against a committed rubric), and (c) calque artifacts for `es` (LLM-as-judge). The script check is cheap and deterministic; the two judges catch failure modes the script check never can.
- Regression-flag heuristic. Per (model, task, language) cell, the dashboard flags any run whose pass rate drops more than 5 percentage points below the rolling 7-run mean. Not a statistical test, just a coarse trip-wire; a sketch of this check appears after this list.
- Hermetic evals. The provider seam is a `Protocol`. `FakeProvider` returns deterministic responses keyed on a tag protocol embedded in the prompt, so CI never makes a network call. The live `OpenAIProvider` / `AnthropicProvider` sit behind the same interface and are tested with `respx` mocks; a sketch of the seam follows directly below.
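A rough sketch of that seam, hedged: the method name `complete`, the `[tag:...]` marker, and the canned response table below are illustrative assumptions, not the repo's exact API.

```python
# Illustrative only -- the real ChatProvider / FakeProvider may differ.
from typing import Protocol


class ChatProvider(Protocol):
    async def complete(self, prompt: str, model: str) -> str:
        """Return the model's text completion for one prompt."""
        ...


class FakeProvider:
    """Deterministic provider: the prompt embeds a tag, the tag selects a canned reply."""

    # Hypothetical tag -> reply table; some replies are intentionally wrong so the
    # scoring failure path is exercised in CI.
    _RESPONSES = {
        "qa/en/capital-france": "Paris",
        "summarization/ja/short-news": "TOKYO NEWS SUMMARY",  # wrong script on purpose
    }

    async def complete(self, prompt: str, model: str) -> str:
        # Look for a marker like "[tag:qa/en/capital-france]" in the prompt.
        _, _, rest = prompt.partition("[tag:")
        tag = rest.split("]", 1)[0]
        return self._RESPONSES.get(tag, "")
```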
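And the regression trip-wire from the third bullet, again as a hedged sketch: only the 5-point / 7-run rule comes from the text above; the function name and the list-of-floats history shape are assumptions.

```python
# Hedged sketch of the dashboard's trip-wire; names and data shapes are assumed.
def is_regression(history: list[float], current: float,
                  window: int = 7, threshold_pp: float = 5.0) -> bool:
    """Flag `current` if it is more than `threshold_pp` percentage points below
    the rolling mean of the last `window` pass rates (0-100) for one
    (model, task, language) cell."""
    recent = history[-window:]
    if not recent:
        return False          # no history yet, nothing to compare against
    rolling_mean = sum(recent) / len(recent)
    return (rolling_mean - current) > threshold_pp


# Seven runs around 66.7%, then a drop: flagged. A small wobble: not flagged.
print(is_regression([66.7] * 7, 55.0))   # True
print(is_regression([66.7] * 7, 63.0))   # False
```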
| Path | Role |
|---|---|
| `src/genai_eval/api.py` | FastAPI: `/v1/runs`, `/v1/trends`, `/healthz`, `/metrics` |
| `src/genai_eval/cli.py` | `genai-eval run` Click CLI |
| `src/genai_eval/orchestrator.py` | Task × language matrix runner, persistence |
| `src/genai_eval/models.py` | SQLAlchemy 2 async; `Run`, `RunItem`, `ModelVersion` |
| `src/genai_eval/providers/` | `ChatProvider` Protocol + Fake / OpenAI / Anthropic |
| `src/genai_eval/tasks/` | One module per task type: `load_suite`, `score` |
| `src/genai_eval/metrics/` | Pure-Python ROUGE-L, chrF, exact-match, token-F1 |
| `src/genai_eval/localization/` | Script check, honorific judge, calque judge |
| `eval/suites/` | Hand-written YAML examples per task × language |
| `eval/baselines/fake-large.json` | Committed FakeProvider baseline |
| `dashboard/` | Next.js 14 dashboard (App Router, Tailwind, recharts) |
```bash
poetry install
poetry run alembic upgrade head
poetry run genai-eval run --provider fake --model fake-large \
  --output eval/baselines/fake-large.json
poetry run uvicorn genai_eval.api:app --reload
# in another shell
cd dashboard && npm install && npm run dev
```

Visit http://localhost:3000. The dashboard reads from the API at http://localhost:8000 (override with `GENAI_EVAL_API_URL`).
```bash
export OPENAI_API_KEY=sk-...
poetry run genai-eval run \
  --provider openai --model gpt-4o-mini \
  --output eval/baselines/gpt-4o-mini.json
```

Anthropic:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
poetry run genai-eval run \
  --provider anthropic --model claude-3-5-haiku-latest \
  --output eval/baselines/claude-3-5-haiku.json
```

Live results: <TBD: see BYOK section>. Populate by running the commands above against your own keys.
┌──────────────────────────────────────────────────┐
│ genai-eval CLI │
│ poetry run genai-eval run │
└──────────────────────┬───────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ orchestrator.run_suite │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ collect_examples ◄── eval/suites/*.yaml │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ ChatProvider │ │ task.score │ │ localization │ │
│ │ (Fake/OpenAI/ │ │ rouge_l/chrf/ │ │ script/judges │ │
│ │ Anthropic) │ │ EM/F1/pytest │ │ │ │
│ └───────┬────────┘ └───────┬────────┘ └───────┬────────┘ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ summarise → cells (task × lang) → SQLAlchemy persist │ │
│ └──────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────┬─────────────────────────────────┘
│ Run, RunItem, ModelVersion
▼
┌─────────────────────────────────────────────┐
│ SQLite (aiosqlite, study-scale) │
└────────────────────┬────────────────────────┘
│
┌────────┴─────────┐
▼ ▼
FastAPI /v1/runs Next.js dashboard
/v1/trends pass-rate grid +
/healthz /metrics regression trends
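Condensing the diagram into code: a minimal, hedged sketch of what `orchestrator.run_suite` does, where the `Example` shape and the injected `score` / `persist` callables are assumptions standing in for the real task, localization, and SQLAlchemy layers.

```python
# Hedged sketch of the run_suite control flow shown above; signatures are assumed.
from collections import defaultdict
from dataclasses import dataclass
from typing import Awaitable, Callable, Iterable


@dataclass
class Example:
    task: str        # "summarization", "qa", "classification", ...
    language: str    # "en", "es", "ja", "py", "en-es", ...
    prompt: str
    reference: str


async def run_suite(
    provider,                                          # ChatProvider: Fake / OpenAI / Anthropic
    model: str,
    examples: Iterable[Example],                       # collected from eval/suites/*.yaml
    score: Callable[[Example, str], bool],             # task metric + localization checks
    persist: Callable[[str, dict], Awaitable[None]],   # writes Run / RunItem / ModelVersion
) -> dict[str, float]:
    # Accumulate pass/fail per (task, language) cell.
    cells: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for ex in examples:
        output = await provider.complete(ex.prompt, model=model)
        cells[(ex.task, ex.language)].append(score(ex, output))
    # Summarise each cell as a pass rate, then persist the run.
    summary = {f"{task}/{lang}": sum(v) / len(v) for (task, lang), v in cells.items()}
    await persist(model, summary)
    return summary
```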
- Not a fine-tuning pipeline.
- Not an inference cost optimizer.
- Not a human-in-the-loop scoring tool.
- Not a public leaderboard.
- Not a model-serving gateway.
The scope is deliberately narrow: produce reliable, reproducible eval signals with hermetic CI behavior and a regression-trend view that fits on one screen.
MIT — see LICENSE.