Commit 3bcc9fa

e06084 and actions-user authored
docs: README add SAAS apply (MigoXLab#345)
* docs: README add SAAS apply
* x
* 📚 Auto-update metrics documentation

---------

Co-authored-by: GitHub Action <action@github.com>
1 parent 330a5dc commit 3bcc9fa

5 files changed: 83 additions, 3 deletions

README.md

Lines changed: 21 additions & 0 deletions
@@ -59,6 +59,27 @@
 
 **Dingo is A Comprehensive AI Data, Model and Application Quality Evaluation Tool**, designed for ML practitioners, data engineers, and AI researchers. It helps you systematically assess and improve the quality of training data, fine-tuning datasets, and production AI systems.
 
+---
+
+## 🚀 Enterprise Dingo SaaS Version
+
+Need a **production-grade data quality platform**? Try [Dingo SaaS](https://github.com/MigoXLab/dingo-saas) Enterprise Edition!
+
+### ✨ Compared to the open-source version, SaaS provides:
+
+- 🌐 **Web UI** - Visual evaluation interface, no coding required
+- 🔐 **Access Control** - JWT + Google OAuth 2.0
+- 📊 **Visual Reports** - Interactive charts, trend analysis, export features
+- 🔌 **RESTful API** - Seamless integration with existing systems
+
+### 📝 How to Get Free SaaS Code
+
+👉 **[Apply for Dingo SaaS Repository Access](https://aicarrier.feishu.cn/share/base/form/shrcn9RqYttByQ5H1np6Yrnmhuf)**
+
+Review time: 1-5 business days | Suitable for enterprise data governance, team collaboration
+
+---
+
 ## Why Dingo?
 
 🎯 **Production-Grade Quality Checks** - From pre-training datasets to RAG systems, ensure your AI gets high-quality data

README_ja.md

Lines changed: 21 additions & 0 deletions
@@ -58,6 +58,27 @@
 
 **Dingo は包括的な AI データ、モデル、アプリケーション品質評価ツール**であり、機械学習エンジニア、データエンジニア、AI 研究者向けに設計されています。トレーニングデータ、ファインチューニングデータセット、本番 AI システムの品質を体系的に評価・改善するのを支援します。
 
+---
+
+## 🚀 エンタープライズ SaaS 版
+
+**本番グレードのデータ品質プラットフォーム**が必要ですか?[Dingo SaaS](https://github.com/MigoXLab/dingo-saas) エンタープライズ版をお試しください!
+
+### ✨ オープンソース版と比較して、SaaS 版が提供する機能:
+
+- 🌐 **Web UI** - ビジュアル評価インターフェース、コーディング不要
+- 🔐 **アクセス制御** - JWT + Google OAuth 2.0
+- 📊 **ビジュアルレポート** - インタラクティブなチャート、トレンド分析、エクスポート機能
+- 🔌 **RESTful API** - 既存システムとのシームレスな統合
+
+### 📝 無料 SaaS コードの入手方法
+
+👉 **[Dingo SaaS リポジトリアクセスを申請する](https://aicarrier.feishu.cn/share/base/form/shrcn9RqYttByQ5H1np6Yrnmhuf)**
+
+審査時間:1-5営業日 | エンタープライズデータガバナンス、チームコラボレーションに最適
+
+---
+
 ## なぜ Dingo を選ぶのか?
 
 🎯 **本番グレードの品質チェック** - 事前学習データセットから RAG システムまで、AI に高品質なデータを提供

README_zh-CN.md

Lines changed: 21 additions & 0 deletions
@@ -58,6 +58,27 @@
 
 **Dingo 是一款全面的 AI 数据、模型和应用质量评估工具**,专为机器学习工程师、数据工程师和 AI 研究人员设计。它帮助你系统化地评估和改进训练数据、微调数据集和生产AI系统的质量。
 
+---
+
+## 🚀 企业级 Dingo SaaS 版本
+
+需要 **生产级数据质量平台** 吗?试试 [Dingo SaaS](https://github.com/MigoXLab/dingo-saas) 企业版!
+
+### ✨ 相比开源版,SaaS 版提供:
+
+- 🌐 **Web UI** - 可视化评估界面,无需写代码
+- 🔐 **权限管理** - JWT + Google OAuth 2.0
+- 📊 **可视化报告** - 交互式图表、趋势分析、导出功能
+- 🔌 **RESTful API** - 与现有系统无缝集成
+
+### 📝 如何获得免费 SaaS 代码
+
+👉 **[点击申请 Dingo SaaS 代码仓库访问权限](https://aicarrier.feishu.cn/share/base/form/shrcn9RqYttByQ5H1np6Yrnmhuf)**
+
+审核时间:1-5 个工作日 | 适合企业数据治理、团队协作
+
+---
+
 ## 为什么选择 Dingo?
 
 🎯 **生产级质量检查** - 从预训练数据集到 RAG 系统,确保你的 AI 获得高质量数据

dingo/model/llm/llm_scout.py

Lines changed: 1 addition & 1 deletion
@@ -281,7 +281,7 @@ def _clean_response(response: str) -> str:
     start = response.find('{')
     end = response.rfind('}')
     if start != -1 and end != -1:
-        response = response[start:end+1]
+        response = response[start:end + 1]
 
     return response.strip()
 

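The hunk above only shows the tail of `_clean_response`, and the change itself is whitespace-only (`end+1` becomes `end + 1`). As a rough, standalone illustration of the slicing step it touches (keep everything from the first `{` to the last `}`), here is a minimal sketch. The function name `extract_json_span` and the sample string are hypothetical and do not come from the Dingo codebase; the real method may do further cleanup before this step.

```python
import json


def extract_json_span(response: str) -> str:
    """Keep the text from the first '{' to the last '}' (hypothetical helper)."""
    start = response.find('{')
    end = response.rfind('}')
    # Only slice when both braces are present; otherwise leave the text as-is.
    if start != -1 and end != -1:
        response = response[start:end + 1]
    return response.strip()


# Made-up model output with chatter around the JSON payload.
raw = 'Sure, here is the result:\n{"score": 1, "reason": "ok"}\nHope this helps!'
cleaned = extract_json_span(raw)
parsed = json.loads(cleaned)
```

In this sketch, the check does not guard against a `}` that appears before the first `{`; in that case the slice is empty and the function returns an empty string.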
docs/metrics.md

Lines changed: 19 additions & 2 deletions
@@ -51,14 +51,13 @@ This document provides comprehensive information about all quality metrics used
 |------|--------|-------------|--------------|-------------------|----------|
 | `LLMClassifyQR` | LLMClassifyQR | Identifies images as CAPTCHA, QR code, or normal images | Internal Implementation | N/A | N/A |
 | `VLMOCRUnderstanding` | VLMOCRUnderstanding | Evaluates a multimodal model's ability to recognize and understand text in images, using DeepSeek-OCR as ground truth | [DeepSeek-OCR: Contexts Optical Compression](https://github.com/deepseek-ai/DeepSeek-OCR) | [📊 See Results](Identifies missing, incorrect, and hallucinated text by comparing VLM output with the OCR ground truth) | N/A |
-| `VLMRenderJudge` | VLMRenderJudge | VLM-based OCR quality evaluation through visual render-compare | Internal Implementation | N/A | N/A |
 
 ### Rule-Based TEXT Quality Metrics
 
 | Type | Metric | Description | Paper Source | Evaluation Results | Examples |
 |------|--------|-------------|--------------|-------------------|----------|
 | `QUALITY_BAD_COMPLETENESS` | RuleLineEndWithEllipsis, RuleLineEndWithTerminal, RuleSentenceNumber, RuleWordNumber | Checks whether the ratio of lines ending with ellipsis is below threshold; Checks whether the ratio of lines ending w... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
-| `QUALITY_BAD_EFFECTIVENESS` | RuleDoi, RuleIsbn, RuleAbnormalChar, RuleAbnormalHtml, RuleAlphaWords, RuleAudioDataFormat, RuleCharNumber, RuleColonEnd, RuleContentNull, RuleContentShort, RuleContentShortMultiLan, RuleEnterAndSpace, RuleEnterMore, RuleEnterRatioMore, RuleHtmlEntity, RuleHtmlTag, RuleInvisibleChar, RuleImageDataFormat, RuleLatexSpecialChar, RuleLineJavascriptCount, RuleLoremIpsum, RuleMeanWordLength, RuleNlpDataFormat, RuleSftDataFormat, RuleSpaceMore, RuleSpecialCharacter, RuleStopWord, RuleSymbolWordRatio, RuleVedioDataFormat, RuleOnlyUrl | Check whether the string is in the correct format of the doi; Check whether the string is in the correct format of th... | Internal Implementation | N/A | N/A |
+| `QUALITY_BAD_EFFECTIVENESS` | RuleAbnormalChar, RuleAbnormalHtml, RuleAlphaWords, RuleAudioDataFormat, RuleCharNumber, RuleColonEnd, RuleContentNull, RuleContentShort, RuleContentShortMultiLan, RuleEnterAndSpace, RuleEnterMore, RuleEnterRatioMore, RuleHtmlEntity, RuleHtmlTag, RuleInvisibleChar, RuleImageDataFormat, RuleLatexSpecialChar, RuleLineJavascriptCount, RuleLoremIpsum, RuleMeanWordLength, RuleNlpDataFormat, RuleSftDataFormat, RuleSpaceMore, RuleSpecialCharacter, RuleStopWord, RuleSymbolWordRatio, RuleVedioDataFormat, RuleOnlyUrl, RuleDoi, RuleIsbn | Detects garbled text and anti-crawling characters by combining special character and invisible character detection; D... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
 | `QUALITY_BAD_FLUENCY` | RuleAbnormalNumber, RuleCharSplit, RuleNoPunc, RuleWordSplit, RuleWordStuck | Checks PDF content for abnormal book page or index numbers that disrupt text flow; Checks PDF content for abnormal ch... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
 | `QUALITY_BAD_RELEVANCE` | RuleHeadWordAr, RuleHeadWordCs, RuleHeadWordHu, RuleHeadWordKo, RuleHeadWordRu, RuleHeadWordSr, RuleHeadWordTh, RuleHeadWordVi, RulePatternSearch, RuleWatermark | Checks whether Arabic content contains irrelevant tail source information; Checks whether Czech content contains irre... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
 | `QUALITY_BAD_SECURITY` | RuleIDCard, RuleUnsafeWords, RulePIIDetection | Checks whether content contains ID card information; Checks whether content contains unsafe words; Detects Personal I... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
@@ -83,6 +82,12 @@ This document provides comprehensive information about all quality metrics used
 | `QUALITY_BAD_EFFECTIVENESS` | RuleAudioDuration | Check whether the audio duration meets the standard | Internal Implementation | N/A | N/A |
 | `QUALITY_BAD_EFFECTIVENESS` | RuleAudioSnrQuality | Check whether the audio signal-to-noise ratio meets the standard | Internal Implementation | N/A | N/A |
 
+### Job Hunting Strategy Metrics
+
+| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
+|------|--------|-------------|--------------|-------------------|----------|
+| `LLMScout` | LLMScout | Strategic job hunting analysis with industry report parsing and person-job matching | Internal Implementation | N/A | N/A |
+
 ### Meta Rater Evaluation Metrics
 
 | Type | Metric | Description | Paper Source | Evaluation Results | Examples |
@@ -107,6 +112,12 @@ This document provides comprehensive information about all quality metrics used
 | `LLMResumeOptimizer` | LLMResumeOptimizer | ATS-focused resume optimization with keyword injection and STAR polishing | Internal Implementation | N/A | N/A |
 | `LLMResumeQuality` | LLMResumeQuality | Comprehensive resume quality evaluation covering privacy, contact, format, structure, professionalism, date, and comp... | Internal Implementation | N/A | N/A |
 
+### Rule-Based Metadata Quality Metrics
+
+| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
+|------|--------|-------------|--------------|-------------------|----------|
+| `QUALITY_BAD_EFFECTIVENESS` | RuleMetadataSimilarity | Checks how closely metadata fields match the reference data by similarity; the default threshold is 0.6 | Internal Implementation | N/A | N/A |
+
 ### Rule-Based RESUME Quality Metrics
 
 | Type | Metric | Description | Paper Source | Evaluation Results | Examples |
@@ -131,3 +142,9 @@ This document provides comprehensive information about all quality metrics used
 |------|--------|-------------|--------------|-------------------|----------|
 | `LLMLongVideoQa` | LLMLongVideoQa | Generate video-related question-answer pairs based on the summarized information of the input long video. | [VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos](https://arxiv.org/abs/2506.108572) (Jiashuo Yu et al., 2025) | N/A | N/A |
 
+### Other Metrics
+
+| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
+|------|--------|-------------|--------------|-------------------|----------|
+| `AgentFactCheck` | AgentFactCheck | Agent-based hallucination detection with autonomous web search | Internal Implementation | N/A | N/A |
+