genai-eval

Multilingual GenAI evaluation service that benchmarks LLM outputs across 5 task types and 3 languages, persists run history, and exposes a small dashboard showing pass rates and regression trends per model version.

Eval results (FakeProvider baseline)

Numbers below are from the committed eval/baselines/fake-large.json baseline, generated by running the full suite against the in-repo FakeProvider. The FakeProvider is scripted to produce some intentionally wrong outputs so the failure path of the scoring code is exercised on every CI run — that is why the cells below are not all 100%. Live-provider numbers (OpenAI / Anthropic) are BYOK; see Running against a live provider.

Overall pass rate: 66.7% (n=39: 13 populated cells × 3 examples each; 0 infrastructure errors).

task            en       es       ja       py       en-es    en-ja    es-en
summarization   33.3%    33.3%    0.0%     -        -        -        -
translation     -        -        -        -        100.0%   66.7%    100.0%
qa              66.7%    66.7%    100.0%   -        -        -        -
classification  66.7%    66.7%    66.7%    -        -        -        -
code_repair     -        -        -        100.0%   -        -        -

(A dash marks a column that does not apply to a task: translation is scored per direction pair, code_repair only in Python.)

The summarization/ja cell is 0% because the local ROUGE-L tokeniser splits on whitespace, and Japanese has no inter-word whitespace. This is a known limitation of whitespace-tokenised ROUGE-L applied to CJK and is captured in the baseline rather than papered over. A future task module can swap in a character-level metric for ja.
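
To make that concrete, here is a minimal LCS-based ROUGE-L sketch. It is illustrative only (the real implementation lives in src/genai_eval/metrics/ and its interface is not shown in this README), but it reproduces the whitespace-vs-character behaviour:

# Minimal ROUGE-L (LCS F1) sketch; illustrative, not the repo's code.

def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: list[str], reference: list[str]) -> float:
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)

ref = "猫がマットの上に座った"        # "the cat sat on the mat"
cand = "猫はマットの上に座っている"   # near-paraphrase of the reference

# Whitespace tokenisation: each sentence is one giant token, the tokens
# never match, so any non-identical Japanese pair scores 0.0.
print(rouge_l_f1(cand.split(), ref.split()))  # 0.0

# Character tokenisation recovers a meaningful overlap signal.
print(rouge_l_f1(list(cand), list(ref)))      # 0.75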

What this studies

  • Eval-as-CI-gate. The full eval matrix (5 tasks × 3 languages, 13 populated cells, 39 examples) runs on every push against FakeProvider. The job asserts that pass rates match the committed baseline within 1e-6. A behavioral regression in any scoring or task module fails the build before merge.
  • Localization quality is its own concern. Beyond raw task metrics, every non-English output is also scored on (a) script correctness (Unicode-block detection; e.g. a Japanese answer written in romaji fails), (b) honorific appropriateness for ja (LLM-as-judge against a committed rubric), (c) calque artifacts for es (LLM-as-judge). Script is cheap and deterministic; the two judges catch failure modes the script check never can.
  • Regression-flag heuristic. Per (model, task, language) cell, the dashboard flags any run whose pass rate dropped more than 5 percentage points below the rolling 7-run mean. Not a statistical test, just a coarse trip-wire (a sketch appears at the end of the Quickstart section).
  • Hermetic evals. The provider seam is a Protocol. FakeProvider returns deterministic responses keyed on a tag protocol embedded in the prompt, so CI never makes a network call. Live OpenAIProvider / AnthropicProvider exist behind the same interface and are tested with respx mocks (an example test is sketched under Running against a live provider). A sketch of the seam follows this list.
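
A minimal sketch of that seam, assuming a chat(prompt) -> str method and an <eval:...> tag format (both are illustrative assumptions; the README only names ChatProvider, FakeProvider, and the tag-keyed behaviour):

# Sketch of the hermetic provider seam; signatures and tag format are
# illustrative, not the repo's actual interface.
import re
from typing import Protocol

class ChatProvider(Protocol):
    def chat(self, prompt: str) -> str: ...

class FakeProvider:
    """Deterministic provider: the response is looked up from a tag the
    suite author embedded in the prompt, so CI never touches the network."""

    def __init__(self, responses: dict[str, str]) -> None:
        self._responses = responses

    def chat(self, prompt: str) -> str:
        m = re.search(r"<eval:([\w-]+)>", prompt)
        if m is None:
            raise ValueError("prompt carries no eval tag")
        return self._responses[m.group(1)]

fake: ChatProvider = FakeProvider({"qa-es-001": "Madrid"})
print(fake.chat("¿Cuál es la capital de España? <eval:qa-es-001>"))  # Madrid

Some tags can map to intentionally wrong answers, which is how the committed baseline keeps its failure cells populated.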

Modules

Path                            Role
src/genai_eval/api.py           FastAPI: /v1/runs, /v1/trends, /healthz, /metrics
src/genai_eval/cli.py           Click CLI: genai-eval run
src/genai_eval/orchestrator.py  Task × language matrix runner, persistence
src/genai_eval/models.py        SQLAlchemy 2 async; Run, RunItem, ModelVersion
src/genai_eval/providers/       ChatProvider Protocol + Fake / OpenAI / Anthropic
src/genai_eval/tasks/           One module per task type: load_suite, score
src/genai_eval/metrics/         Pure-Python ROUGE-L, chrF, exact-match, token-F1
src/genai_eval/localization/    Script check, honorific judge, calque judge
eval/suites/                    Hand-written YAML examples per task × language
eval/baselines/fake-large.json  Committed FakeProvider baseline
dashboard/                      Next.js 14 dashboard (App Router, Tailwind, recharts)
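
The script check in src/genai_eval/localization/ is the cheap, deterministic layer from the study list above. A minimal sketch of Unicode-block detection for ja (the function name and threshold are assumptions, not the module's real API):

# Flag a "Japanese" answer containing no Hiragana/Katakana/CJK characters,
# e.g. a response written in romaji. Sketch only.
JA_BLOCKS = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)

def looks_japanese(text: str, min_ratio: float = 0.5) -> bool:
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in JA_BLOCKS) for c in letters)
    return hits / len(letters) >= min_ratio

print(looks_japanese("猫がマットの上に座った"))   # True
print(looks_japanese("neko ga matto no ue ni"))  # False: romaji fails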

Quickstart

poetry install
poetry run alembic upgrade head
poetry run genai-eval run --provider fake --model fake-large \
  --output eval/baselines/fake-large.json
poetry run uvicorn genai_eval.api:app --reload
# in another shell
cd dashboard && npm install && npm run dev

Visit http://localhost:3000. The dashboard reads from the API at http://localhost:8000 (override with GENAI_EVAL_API_URL).
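
The regression flag on the trends view is the trip-wire described earlier: a run is flagged when its pass rate falls more than 5 percentage points below the rolling 7-run mean for its (model, task, language) cell. A minimal sketch of that check (in Python for illustration; the shipped logic is not shown in this README):

# Coarse trip-wire, not a statistical test: compare the latest pass rate
# against the mean of the previous (up to) 7 runs in the same cell.
WINDOW = 7
THRESHOLD_PP = 5.0

def is_regression(history: list[float], latest: float) -> bool:
    """history: earlier pass rates (percent) for one (model, task, language) cell."""
    window = history[-WINDOW:]
    if not window:
        return False  # the first run in a cell can never regress
    rolling_mean = sum(window) / len(window)
    return latest < rolling_mean - THRESHOLD_PP

history = [66.7, 66.7, 100.0, 66.7, 66.7, 66.7, 66.7]  # mean ≈ 71.5
print(is_regression(history, 66.7))  # False: within 5 points of the mean
print(is_regression(history, 60.0))  # True: dropped more than 5 points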

Running against a live provider

export OPENAI_API_KEY=sk-...
poetry run genai-eval run \
  --provider openai --model gpt-4o-mini \
  --output eval/baselines/gpt-4o-mini.json

Anthropic:

export ANTHROPIC_API_KEY=sk-ant-...
poetry run genai-eval run \
  --provider anthropic --model claude-3-5-haiku-latest \
  --output eval/baselines/claude-3-5-haiku.json
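
Both live providers sit behind the same ChatProvider seam and are unit-tested with respx mocks over httpx, so the test suite never leaves the machine. A sketch of such a test (the endpoint and payload shape follow OpenAI's public chat-completions API; the raw request here stands in for whatever OpenAIProvider does internally):

# respx intercepts httpx traffic: the mocked route answers instead of the
# real API, keeping the test deterministic and offline.
import httpx
import respx

@respx.mock
def test_chat_completion_content_is_parsed():
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200, json={"choices": [{"message": {"content": "Madrid"}}]}
        )
    )
    resp = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        json={"model": "gpt-4o-mini",
              "messages": [{"role": "user", "content": "Capital of Spain?"}]},
    )
    assert resp.json()["choices"][0]["message"]["content"] == "Madrid"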

Live results: <TBD: see BYOK section> — populate by running the commands above against your own keys.

Architecture (ASCII)

                   ┌──────────────────────────────────────────────────┐
                   │                   genai-eval CLI                 │
                   │            poetry run genai-eval run             │
                   └──────────────────────┬───────────────────────────┘
                                          │
                                          ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                        orchestrator.run_suite                        │
    │   ┌──────────────────────────────────────────────────────────────┐   │
    │   │  collect_examples  ◄── eval/suites/*.yaml                    │   │
    │   └──────────────────────────────────────────────────────────────┘   │
    │   ┌────────────────┐   ┌────────────────┐   ┌────────────────┐       │
    │   │ ChatProvider   │   │ task.score     │   │ localization   │       │
    │   │ (Fake/OpenAI/  │   │ rouge_l/chrf/  │   │ script/judges  │       │
    │   │  Anthropic)    │   │ EM/F1/pytest   │   │                │       │
    │   └───────┬────────┘   └───────┬────────┘   └───────┬────────┘       │
    │           ▼                    ▼                    ▼                │
    │   ┌──────────────────────────────────────────────────────────────┐   │
    │   │   summarise → cells (task × lang) → SQLAlchemy persist       │   │
    │   └──────────────────────────────────────────────────────────────┘   │
    └────────────────────────────────────┬─────────────────────────────────┘
                                         │ Run, RunItem, ModelVersion
                                         ▼
                 ┌─────────────────────────────────────────────┐
                 │  SQLite (aiosqlite, study-scale)            │
                 └────────────────────┬────────────────────────┘
                                      │
                             ┌────────┴─────────┐
                             ▼                  ▼
                   FastAPI /v1/runs       Next.js dashboard
                   /v1/trends             pass-rate grid +
                   /healthz /metrics      regression trends
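
A compressed sketch of the loop the diagram describes (names and example shape are assumptions; the real orchestrator.run_suite is async and persists Run / RunItem rows via SQLAlchemy):

# Illustrative matrix runner: provider call, per-task scoring, cell rollup.
from collections import defaultdict

def run_suite(provider, examples, scorers):
    """examples: dicts with task, lang, prompt, reference (assumed shape)."""
    cells = defaultdict(lambda: {"passed": 0, "total": 0})
    for ex in examples:
        output = provider.chat(ex["prompt"])
        passed = scorers[ex["task"]](output, ex["reference"])  # bool
        cell = cells[(ex["task"], ex["lang"])]
        cell["total"] += 1
        cell["passed"] += int(passed)
    # Per-cell pass rates in percent, i.e. the grid shown above.
    return {key: 100.0 * c["passed"] / c["total"] for key, c in cells.items()}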

What this is not

  • Not a fine-tuning pipeline.
  • Not an inference cost optimizer.
  • Not a human-in-the-loop scoring tool.
  • Not a public leaderboard.
  • Not a model-serving gateway.

The scope is deliberately narrow: produce reliable, reproducible eval signals with hermetic CI behavior and a regression-trend view that fits on one screen.

License

MIT — see LICENSE.
