RAGAS for Go. Faithfulness, context recall, answer relevance, hallucination — measure your RAG pipeline without leaving the Go stack.
⚠️ Pre-MVP. API is exploration-grade until v0.1.0. Star/watch to follow development. First metric (faithfulness) lands in the next commit batch.
Python owns the RAG-eval ecosystem: RAGAS (8k+ stars), DeepEval (3k+ stars), TruLens, Phoenix from Arize. All Python.
In Go — nothing comparable. Teams building RAG pipelines in Go (on langchaingo, cloudwego/eino, or hand-rolled OpenAI/Anthropic clients) currently have two options:
- Export to Python — run RAGAS in a separate process, plumb the data back. Slow, breaks the single-binary deployment story.
- Roll your own metrics — every team reinvents faithfulness, context-recall, hallucination detection.
Both are wasteful. goeval is the Go-native alternative.
- Streaming-first. Eval runs are pipelines: dataset → evaluator → metric. Channels are first-class. Eval a 10k-sample dataset without blocking your CI on memory.
- LLM-as-judge done right. Faithfulness, context relevance, answer correctness rely on a strong LLM. goeval supports a Judge abstraction so you can swap GPT-4 → Claude → local Llama transparently.
- Deterministic metrics in addition. Context recall (overlap-based), BLEU/ROUGE-style: no LLM dependency, fast and reproducible.
- No framework lock-in. Adapters for langchaingo, eino, raw OpenAI/Anthropic Go SDKs, but the core is dependency-light.
- CI-friendly. Exit code on regression, JSON/Markdown reports, GitHub Action template.
- Evaluator, Metric, Dataset, Judge interfaces
- Streaming pipeline (dataset → evaluator → channel of Result)
- Metric: faithfulness (LLM-judge against retrieved context)
- Metric: context relevance (LLM-judge)
- Metric: answer relevance (cosine on embeddings)
- Metric: context recall (deterministic overlap)
- Metric: hallucination detection (LLM-judge + groundedness check)
- Judge adapters: OpenAI, Anthropic, local Ollama
- Dataset adapter: JSON/JSONL ingestion
- Adapter: langchaingo
- Adapter: cloudwego/eino
- CLI: goeval run dataset.jsonl --metric faithfulness,context_recall
- GitHub Action template for CI regression gates
- CHANGELOG + CONTRIBUTING + CoC
- v0.1.0 tag + release
Estimated runway to v0.1.0: 6 weeks (one solo maintainer, evenings).
go get github.com/goncharovart/goeval@v0.1.0
- explodinggradients/ragas: Python progenitor of RAG-eval
- confident-ai/deepeval: 50+ metrics, LLM-as-judge
- truera/trulens: observability + eval
Pre-MVP solo development. Issues + PRs welcome; expect slow review (evenings only) until v0.1.0.
Build openly — every architectural decision goes into docs/design.md once it stabilises.
MIT — see LICENSE.