# [P] How we automated data quality checks for LLM training data — 100+ metrics across rules, LLM, VLM, and agents

We've been working on evaluating data quality for LLM training pipelines — SFT datasets, RAG outputs, OCR-parsed documents, etc. The problem we kept running into: simple heuristics catch formatting issues but miss semantic problems, and LLM-as-a-Judge alone isn't great at things that need external verification (e.g. factual accuracy).

So we built a layered approach:

1. **Rule layer** — fast, deterministic checks for the obvious stuff (encoding issues, repetition, special characters, formatting). ~50 rules, runs in milliseconds.
2. **LLM layer** — any OpenAI-compatible model as evaluator, customizable prompts per use case.
3. **VLM layer** — for OCR/document parsing: render the parsed output back to an image, then visually compare against the original. Catches layout issues that text-level diffing misses.
4. **Agent layer** — the newest addition. Instead of a single prompt, an autonomous agent uses tools (ArXiv search, claims extraction) to fact-check content step by step. Useful when evaluation requires external knowledge.

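To make the rule layer concrete, here is a minimal sketch of what two deterministic checks could look like. The function names, thresholds, and registry shape are illustrative, not Dingo's actual API:

```python
from collections import Counter

def rule_encoding(text: str) -> bool:
    # Fail if the text contains the Unicode replacement character,
    # a common symptom of a botched encoding round-trip.
    return "\ufffd" not in text

def rule_repetition(text: str, n: int = 5, max_ratio: float = 0.2) -> bool:
    # Fail if one word n-gram accounts for more than max_ratio of all
    # n-grams: a cheap, millisecond-scale detector for degenerate repetition.
    words = text.split()
    if len(words) < n + 10:  # too short to judge repetition reliably
        return True
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    _, top_count = Counter(ngrams).most_common(1)[0]
    return top_count / len(ngrams) <= max_ratio

def run_rules(text: str) -> dict:
    # Run every rule; a production version would also collect failure reasons.
    rules = {"encoding": rule_encoding, "repetition": rule_repetition}
    return {name: check(text) for name, check in rules.items()}
```

Because these checks are pure string operations, they can screen millions of rows before any model-based layer runs.
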
For RAG specifically, we have metrics for faithfulness, context precision/recall, and answer relevancy — similar to RAGAS but integrated into the same framework.

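As a rough illustration of how the retrieval-side metrics are computed: both frameworks score recall as "what fraction of the reference claims are supported by the retrieved contexts" and precision as "what fraction of the retrieved contexts are actually relevant". The sketch below substitutes naive substring matching for the LLM judge, purely to stay self-contained:

```python
def context_recall(contexts, reference_claims,
                   supports=lambda ctx, claim: claim.lower() in ctx.lower()):
    # Fraction of reference claims supported by at least one retrieved context.
    # `supports` would be an LLM call in a real framework.
    if not reference_claims:
        return 1.0
    hits = sum(any(supports(ctx, claim) for ctx in contexts)
               for claim in reference_claims)
    return hits / len(reference_claims)

def context_precision(contexts, question,
                      relevant=lambda ctx, q: any(w in ctx.lower()
                                                  for w in q.lower().split())):
    # Fraction of retrieved contexts judged relevant to the question.
    if not contexts:
        return 0.0
    return sum(relevant(ctx, question) for ctx in contexts) / len(contexts)
```

Swapping the `supports`/`relevant` callables for model-backed judges is what turns this from a lexical heuristic into the LLM-graded metric.
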
The Agent layer was the most interesting part to build. The problem: when you need to verify factual claims in an article, a single LLM prompt can't reliably do it — the model just guesses based on its training data. So we built an agent (LangChain ReAct) that autonomously: (1) extracts verifiable claims from text, (2) categorizes them (institutional, temporal, statistical, attribution, ...), and (3) verifies each claim using the right tool — ArXiv search for academic claims, web search for news/product claims. It generates a structured report with evidence for each claim. We've tested it on academic articles, news, product reviews, and tech blogs — the adaptive tool selection per claim type was key to getting reliable results.

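The categorize-then-route step can be sketched like this. These are stubs standing in for the real ArXiv and web-search tool calls, and all names here are made up for illustration — the actual agent is a LangChain ReAct loop, not a static dispatch table:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    category: str  # e.g. "academic", "news", "statistical"

# Stub tools: the real pipeline wraps ArXiv search and web search here.
def verify_with_arxiv(claim):
    return {"claim": claim, "tool": "arxiv_search", "verdict": "needs_review"}

def verify_with_web(claim):
    return {"claim": claim, "tool": "web_search", "verdict": "needs_review"}

# Adaptive tool selection: each claim category routes to a different verifier.
TOOL_BY_CATEGORY = {
    "academic": verify_with_arxiv,
    "news": verify_with_web,
}

def fact_check(claims):
    # One structured report entry per claim, tagged with the tool that
    # produced it, so the final report carries per-claim evidence.
    return [TOOL_BY_CATEGORY.get(c.category, verify_with_web)(c.text)
            for c in claims]
```

The point of the ReAct framing is that the agent picks the tool (and can retry or chain tools) instead of following a fixed mapping like the one above.
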
The whole thing is plugin-based — adding a new evaluator is just a decorated class. Data model uses Pydantic `extra="allow"` so it adapts to whatever schema your dataset has. Multi-field evaluation lets you run different checks on different columns of the same dataset (e.g. check prompt quality and answer quality separately).

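The plugin pattern is roughly the following. This is a sketch of the idea, not Dingo's real registration API; plain dicts stand in for the Pydantic `extra="allow"` record type to keep the example dependency-free:

```python
EVALUATORS = {}

def register(name):
    # Class decorator: adding a new evaluator is just a decorated class
    # that the framework can later look up by name.
    def wrap(cls):
        EVALUATORS[name] = cls
        return cls
    return wrap

@register("sft_completeness")
class SftCompletenessEvaluator:
    # Declared fields let the framework validate input up front and let
    # multi-field evaluation target specific columns of a record.
    required_fields = ("prompt", "response")

    def evaluate(self, record):
        # Records are open-schema (extra="allow" in the Pydantic version),
        # so an evaluator only touches the fields it declares.
        return all(record.get(f) for f in self.required_fields)
```

Declaring `required_fields` on the class is also what lets a UI display, per evaluator, which columns it expects.
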
We also recently put up a hosted version so teams can run evaluations without setting up Python environments locally — useful for non-engineering stakeholders who want to inspect data quality: https://dingo.openxlab.org.cn/

The project is open source (Apache-2.0): https://github.com/MigoXLab/dingo

`pip install dingo-python`

Curious how others are handling data quality at scale — whether it's RAG evaluation (retrieval + generation), or using agents to automate fact-checking and content verification. Has anyone else tried agent-based approaches for quality assurance? What worked and what didn't?