
Commit d4eec83

chore: evaluators library update (#130)
1 parent 08d39f7 commit d4eec83

File tree

1 file changed (+172, -57 lines)

evaluators/evaluator-library.mdx

Lines changed: 172 additions & 57 deletions
@@ -17,83 +17,200 @@ The Evaluator Library provides a comprehensive collection of pre-built quality c

Traceloop provides several pre-configured evaluators for common assessment tasks:

---

### Agent Evaluators

**Agent Efficiency** <sup>(beta)</sup>
- Evaluates how efficiently an agent completes a task by detecting redundant steps, unnecessary tool calls, loops, or poor reasoning
- Returns a 0-1 score
- *Implementation: Custom GPT-4o prompt*

**Agent Flow Quality** <sup>(beta)</sup>
- Checks whether the agent satisfies all user-defined behavioral or logical conditions, with strict full-condition matching
- Returns the score as the ratio of passed conditions
- *Implementation: Custom GPT-4o prompt*

**Agent Goal Accuracy**
- Determines whether the agent actually achieved the user's goal, with or without a reference expected answer
- Supports both reference-based and reference-free evaluation
- *Implementation: Ragas AgentGoalAccuracy metrics*

**Agent Goal Completeness**
- Extracts user intents across a conversation and evaluates how many were fulfilled end-to-end
- Automatically determines the fulfillment rate
- *Implementation: DeepEval ConversationCompletenessMetric*

**Agent Tool Error Detector** <sup>(beta)</sup>
- Detects incorrect tool usage (bad parameters, failed API calls, unexpected behavior) in agent trajectories
- Returns pass/fail
- *Implementation: Custom GPT-4o prompt*

---

### Answer Quality Evaluators

**Answer Completeness**
- Measures how thoroughly the answer uses the relevant context, on a rubric from "barely uses context" to "fully covers it"
- Normalized to a 0-1 score
- *Implementation: Ragas RubricsScore metric*

**Answer Correctness**
- Evaluates factual correctness by combining semantic similarity with a correctness model against the ground truth
- Returns a combined 0-1 score
- *Implementation: Ragas AnswerCorrectness + AnswerSimilarity*

**Answer Relevancy**
- Determines whether the answer meaningfully responds to the question
- Outputs pass/fail
- *Implementation: Ragas answer_relevancy metric*

**Faithfulness**
- Ensures all claims in the answer are grounded in the provided context and not hallucinated
- Binary pass/fail
- *Implementation: Ragas Faithfulness metric*

**Semantic Similarity**
- Computes embedding-based similarity between the generated text and a reference answer
- Returns a 0-1 score
- *Implementation: Ragas SemanticSimilarity metric*
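
The implementation note above refers to Ragas's SemanticSimilarity metric. As a rough illustration of the underlying idea only (not the Ragas API), a cosine similarity over embeddings could look like the following sketch, which assumes an OpenAI embedding model:

```python
# Illustrative sketch of embedding-based similarity; the embedding model
# name and helper are assumptions, not the Ragas implementation.
import math
from openai import OpenAI

client = OpenAI()

def semantic_similarity(generated: str, reference: str) -> float:
    """Return the cosine similarity between two texts (roughly 0-1 for embeddings)."""
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model choice
        input=[generated, reference],
    )
    a, b = resp.data[0].embedding, resp.data[1].embedding
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```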

---

### Conversation Evaluators

**Conversation Quality**
- Overall conversation score combining turn relevancy (40%), completeness (40%), and memory retention (20%) over multiple turns
- Returns the weighted combined score
- *Implementation: DeepEval TurnRelevancy + ConversationCompleteness + KnowledgeRetention*
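
The combination step described above is simple arithmetic; a minimal sketch, with the three component scores assumed to come from the underlying DeepEval metrics:

```python
def conversation_quality(relevancy: float, completeness: float, retention: float) -> float:
    # Weighted blend per the description above: 40% relevancy, 40% completeness, 20% retention.
    return 0.4 * relevancy + 0.4 * completeness + 0.2 * retention
```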

**Intent Change**
- Detects whether the conversation stayed on the original intent or drifted into unrelated topics
- A higher score means better adherence to the original topic
- *Implementation: Ragas TopicAdherenceScore (precision mode)*

**Topic Adherence**
- Measures how well conversation messages stay aligned with a specified set of allowed topics
- Returns a 0-1 score
- *Implementation: Ragas TopicAdherenceScore*

**Context Relevance**
- Rates whether the retrieved context actually contains the information needed to answer the question
- Score = relevant statements / total statements
- *Implementation: DeepEval ContextualRelevancyMetric*

**Instruction Adherence**
- Evaluates how closely the model followed system-level or user instructions
- Returns a 0-1 adherence score
- *Implementation: DeepEval PromptAlignmentMetric*

---

### Safety & Security Evaluators

**PII Detector**
- Detects names, addresses, emails, and other personal identifiers in text, and can optionally redact them
- Pass/fail based on a confidence threshold
- *Implementation: Microsoft Presidio Analyzer*
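
A minimal sketch of this kind of check with Presidio's AnalyzerEngine; the threshold value and helper name are illustrative:

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def contains_pii(text: str, threshold: float = 0.5) -> bool:
    """Return True (i.e. fail) if any PII entity is detected above the confidence threshold."""
    results = analyzer.analyze(text=text, language="en")
    return any(result.score >= threshold for result in results)
```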

**Secrets Detector**
- Identifies hardcoded secrets such as API keys, tokens, and passwords
- Binary pass/fail with optional redaction
- *Implementation: Yelp detect-secrets*
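
A rough sketch of pointing Yelp's detect-secrets at generated text. Its scanners work on files, so the text is written to a temporary file first; this wiring is an assumption, not Traceloop's exact implementation:

```python
import os
import tempfile

from detect_secrets import SecretsCollection
from detect_secrets.settings import default_settings

def contains_secrets(text: str) -> bool:
    """Return True if detect-secrets flags anything in the text."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        secrets = SecretsCollection()
        with default_settings():
            secrets.scan_file(path)
    finally:
        os.unlink(path)
    return bool(secrets.json())  # non-empty result means findings
```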

**Profanity Detector**
- Checks whether text contains offensive or profane language
- Binary pass/fail
- *Implementation: profanity-check library*
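
A minimal sketch using the profanity-check interface, where `predict` returns 0 or 1 per input string:

```python
from profanity_check import predict

def is_clean(text: str) -> bool:
    """Pass (return True) when no profanity is predicted."""
    return predict([text])[0] == 0
```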

**Prompt Injection Detector**
- Flags attempts to override system behavior or inject malicious instructions
- Binary pass/fail based on a threshold
- *Implementation: AWS SageMaker endpoint running a DeBERTa-v3 model*

**Toxicity Detector**
- Classifies toxic categories such as threat, insult, obscenity, and hate speech
- Binary pass/fail based on a threshold
- *Implementation: AWS SageMaker unitary/toxic-bert model*

**Sexism Detector**
- Detects sexist language or bias, specifically gender-based discrimination
- Binary pass/fail based on a threshold
- *Implementation: AWS SageMaker unitary/toxic-bert model*

---

### Format Validators

**JSON Validator**
- Validates that the output is valid JSON and optionally matches a schema
- Binary pass/fail
- *Implementation: Python json and jsonschema*
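
A minimal sketch of this check with the standard-library `json` module plus `jsonschema`; the schema argument is optional, mirroring the description:

```python
import json

import jsonschema

def validate_json(output: str, schema: dict | None = None) -> bool:
    """Pass if the output parses as JSON and, when a schema is given, conforms to it."""
    try:
        data = json.loads(output)
        if schema is not None:
            jsonschema.validate(instance=data, schema=schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False
```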

**SQL Validator**
- Checks whether generated text is syntactically valid PostgreSQL SQL
- Binary pass/fail
- *Implementation: pglast Postgres parser*
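
A minimal sketch with pglast, which parses text using the real PostgreSQL grammar and raises a parse error on invalid SQL; the broad exception handling keeps the sketch independent of the exact error class:

```python
from pglast import parse_sql

def is_valid_postgres(sql: str) -> bool:
    """Pass if the statement parses under the PostgreSQL grammar."""
    try:
        parse_sql(sql)
        return True
    except Exception:  # pglast raises a parser error for invalid SQL
        return False
```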

**Regex Validator**
- Validates whether text matches (or must not match) a regex with flexible flags
- Supports case sensitivity, multiline, and dotall flags
- *Implementation: Python re*
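
A minimal sketch of a flag-aware check with Python's `re` module; the parameter names are illustrative:

```python
import re

def regex_check(
    text: str,
    pattern: str,
    should_match: bool = True,
    case_sensitive: bool = True,
    multiline: bool = False,
    dotall: bool = False,
) -> bool:
    """Pass when the match result agrees with the expected behavior."""
    flags = 0
    if not case_sensitive:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    matched = re.search(pattern, text, flags) is not None
    return matched if should_match else not matched
```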

**Placeholder Regex**
- Similar to the regex validator, but dynamically injects a placeholder before matching
- Useful for dynamic pattern validation
- *Implementation: Python re*

---

### Text Metrics

**Word Count**
- Counts the number of words in the generated output
- Returns an integer count
- *Implementation: Python string split*

**Word Count Ratio**
- Compares the output word count to the input word count
- Useful for measuring expansion/compression
- *Implementation: Python string operations*

**Char Count**
- Counts the number of characters in the generated text
- Returns an integer count
- *Implementation: Python len()*

**Char Count Ratio**
- Output character count divided by input character count
- Returns a float ratio
- *Implementation: Python len()*
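
All four text metrics reduce to a few lines of plain Python; a combined sketch (the guard against empty input is an assumption to avoid division by zero):

```python
def word_count(text: str) -> int:
    return len(text.split())

def char_count(text: str) -> int:
    return len(text)

def word_count_ratio(output: str, input_text: str) -> float:
    # > 1 means the output expands on the input, < 1 means it compresses it
    return word_count(output) / max(word_count(input_text), 1)

def char_count_ratio(output: str, input_text: str) -> float:
    return char_count(output) / max(char_count(input_text), 1)
```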

**Perplexity**
- Computes perplexity from the provided logprobs to quantify the model's confidence in its output
- Lower values indicate more confident predictions
- *Implementation: Mathematical calculation, exp(-avg_log_prob)*
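
The formula in the implementation note translates directly into code; a minimal sketch, assuming a list of token log-probabilities is already available:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative average token log-probability; lower means more confident."""
    avg_log_prob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg_log_prob)
```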

---

### Specialized Evaluators

**LLM as a Judge**
- Fully flexible LLM-based evaluator that uses arbitrary prompts and variables and returns JSON directly from the model
- Configurable model, temperature, and other parameters
- *Implementation: Custom OpenAI API call*
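
A minimal sketch of the custom-OpenAI-call pattern this evaluator describes; the prompt, score scale, and JSON shape are placeholders, not the actual configuration:

```python
import json

from openai import OpenAI

client = OpenAI()

def llm_judge(output: str, criteria: str, model: str = "gpt-4o", temperature: float = 0.0) -> dict:
    """Ask the model to score an output against arbitrary criteria and return its JSON verdict."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": 'You are an evaluator. Reply with JSON: {"score": <0-1>, "reason": <string>}.'},
            {"role": "user", "content": f"Criteria: {criteria}\n\nOutput to evaluate:\n{output}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

In practice the prompt template, model, temperature, and expected JSON fields would come from the evaluator's configuration.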

**Tone Detection**
- Classifies the emotional tone of the text (joy, anger, sadness, fear, neutral, etc.)
- Returns the detected tone and a confidence score
- *Implementation: AWS SageMaker emotion-distilroberta model*

**Uncertainty**
- Generates a response with token-level logprobs and calculates uncertainty using max surprisal
- Returns the answer plus an uncertainty score
- *Implementation: GPT-4o-mini with logprobs enabled*
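
A rough sketch of the logprobs-based approach described above, assuming the OpenAI chat API with `logprobs=True`; the max-surprisal step follows the description, while the rest of the wiring is illustrative:

```python
from openai import OpenAI

client = OpenAI()

def answer_with_uncertainty(question: str) -> tuple[str, float]:
    """Return the model's answer and its max surprisal (-log p) over the generated tokens."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = response.choices[0]
    surprisals = [-token.logprob for token in choice.logprobs.content]
    return choice.message.content, max(surprisals)
```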

---

## Custom Evaluators

@@ -103,10 +220,6 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor
- Create custom metric evaluations
- Define your own evaluation logic and scoring

### Inputs
- **string**: Text-based input parameters
- Support for multiple input types
@@ -115,6 +228,8 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor
- **results**: String-based evaluation results
- **pass**: Boolean indicator for pass/fail status

---

## Usage

1. Browse the available evaluators in the library
@@ -123,4 +238,4 @@ In addition to the pre-built evaluators, you can create custom evaluators tailor
4. Use the "Use evaluator" button to integrate it into your workflow
5. Monitor outputs and pass/fail status for systematic quality assessment

The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications.
