README.md (21 additions, 0 deletions)
@@ -59,6 +59,27 @@
**Dingo is A Comprehensive AI Data, Model and Application Quality Evaluation Tool**, designed for ML practitioners, data engineers, and AI researchers. It helps you systematically assess and improve the quality of training data, fine-tuning datasets, and production AI systems.
+---
+
+## 🚀 Enterprise Dingo SaaS Version
+
+Need a **production-grade data quality platform**? Try [Dingo SaaS](https://github.com/MigoXLab/dingo-saas) Enterprise Edition!
+
+### ✨ Compared to the open-source version, SaaS provides:
|`QUALITY_BAD_COMPLETENESS`| RuleLineEndWithEllipsis, RuleLineEndWithTerminal, RuleSentenceNumber, RuleWordNumber | Checks whether the ratio of lines ending with ellipsis is below threshold; Checks whether the ratio of lines ending w... |[RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) |[📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md)| N/A |
-|`QUALITY_BAD_EFFECTIVENESS`|RuleDoi, RuleIsbn, RuleAbnormalChar, RuleAbnormalHtml, RuleAlphaWords, RuleAudioDataFormat, RuleCharNumber, RuleColonEnd, RuleContentNull, RuleContentShort, RuleContentShortMultiLan, RuleEnterAndSpace, RuleEnterMore, RuleEnterRatioMore, RuleHtmlEntity, RuleHtmlTag, RuleInvisibleChar, RuleImageDataFormat, RuleLatexSpecialChar, RuleLineJavascriptCount, RuleLoremIpsum, RuleMeanWordLength, RuleNlpDataFormat, RuleSftDataFormat, RuleSpaceMore, RuleSpecialCharacter, RuleStopWord, RuleSymbolWordRatio, RuleVedioDataFormat, RuleOnlyUrl| Check whether the string is in the correct format of the doi; Check whether the string is in the correct format of th... | Internal Implementation|N/A| N/A |
+|`QUALITY_BAD_EFFECTIVENESS`| RuleAbnormalChar, RuleAbnormalHtml, RuleAlphaWords, RuleAudioDataFormat, RuleCharNumber, RuleColonEnd, RuleContentNull, RuleContentShort, RuleContentShortMultiLan, RuleEnterAndSpace, RuleEnterMore, RuleEnterRatioMore, RuleHtmlEntity, RuleHtmlTag, RuleInvisibleChar, RuleImageDataFormat, RuleLatexSpecialChar, RuleLineJavascriptCount, RuleLoremIpsum, RuleMeanWordLength, RuleNlpDataFormat, RuleSftDataFormat, RuleSpaceMore, RuleSpecialCharacter, RuleStopWord, RuleSymbolWordRatio, RuleVedioDataFormat, RuleOnlyUrl, RuleDoi, RuleIsbn | Detects garbled text and anti-crawling characters by combining special character and invisible character detection; D... |[RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023)|[📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md)| N/A |
|`QUALITY_BAD_FLUENCY`| RuleAbnormalNumber, RuleCharSplit, RuleNoPunc, RuleWordSplit, RuleWordStuck | Checks PDF content for abnormal book page or index numbers that disrupt text flow; Checks PDF content for abnormal ch... |[RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) |[📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md)| N/A |
|`QUALITY_BAD_RELEVANCE`| RuleHeadWordAr, RuleHeadWordCs, RuleHeadWordHu, RuleHeadWordKo, RuleHeadWordRu, RuleHeadWordSr, RuleHeadWordTh, RuleHeadWordVi, RulePatternSearch, RuleWatermark | Checks whether Arabic content contains irrelevant tail source information; Checks whether Czech content contains irre... |[RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) |[📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md)| N/A |
|`QUALITY_BAD_SECURITY`| RuleIDCard, RuleUnsafeWords, RulePIIDetection | Checks whether content contains ID card information; Checks whether content contains unsafe words; Detects Personal I... |[RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) |[📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md)| N/A |
@@ -83,6 +82,12 @@ This document provides comprehensive information about all quality metrics used
|`QUALITY_BAD_EFFECTIVENESS`| RuleAudioDuration | Check whether the audio duration meets the standard | Internal Implementation | N/A | N/A |
|`QUALITY_BAD_EFFECTIVENESS`| RuleAudioSnrQuality | Check whether the audio signal-to-noise ratio meets the standard | Internal Implementation | N/A | N/A |

+### Job Hunting Strategy Metrics
+
+| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
|`LLMLongVideoQa`| LLMLongVideoQa | Generate video-related question-answer pairs based on the summarized information of the input long video. |[VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos](https://arxiv.org/abs/2506.108572) (Jiashuo Yu et al., 2025) | N/A | N/A |

+### Other Metrics
+
+| Type | Metric | Description | Paper Source | Evaluation Results | Examples |