The Evaluator Library provides a comprehensive collection of pre-built quality checks.

Traceloop provides several pre-configured evaluators for common assessment tasks:

---

### Agent Evaluators

**Agent Efficiency** <sup>(beta)</sup>
- Evaluates how efficiently an agent completes a task by detecting redundant steps, unnecessary tool calls, loops, or poor reasoning
- Returns a 0-1 score
- *Implementation: Custom GPT-4o prompt*

**Agent Flow Quality** <sup>(beta)</sup>
- Checks whether the agent satisfies all user-defined behavioral or logical conditions; strict full-condition matching
- Returns the score as the ratio of passed conditions
- *Implementation: Custom GPT-4o prompt*

**Agent Goal Accuracy**
- Determines whether the agent actually achieved the user's goal, with or without a reference expected answer
- Supports both reference-based and reference-free evaluation
- *Implementation: Ragas AgentGoalAccuracy metrics*

**Agent Goal Completeness**
- Extracts user intents across a conversation and evaluates how many were fulfilled end-to-end
- Automatically determines the fulfillment rate
- *Implementation: DeepEval ConversationCompletenessMetric*

**Agent Tool Error Detector** <sup>(beta)</sup>
- Detects incorrect tool usage (bad parameters, failed API calls, unexpected behavior) in agent trajectories
- Returns pass/fail
- *Implementation: Custom GPT-4o prompt*

---

### Answer Quality Evaluators

**Answer Completeness**
- Measures how thoroughly the answer uses the relevant context, using a rubric from "barely uses context" to "fully covers it"
- Normalized to a 0-1 score
- *Implementation: Ragas RubricsScore metric*

**Answer Correctness**
- Evaluates factual correctness by combining semantic similarity with a correctness model against the ground truth
- Returns a combined 0-1 score
- *Implementation: Ragas AnswerCorrectness + AnswerSimilarity*

**Answer Relevancy**
- Determines whether the answer meaningfully responds to the question
- Outputs pass/fail
- *Implementation: Ragas answer_relevancy metric*

**Faithfulness**
- Ensures all claims in the answer are grounded in the provided context and not hallucinated
- Binary pass/fail
- *Implementation: Ragas Faithfulness metric*

**Semantic Similarity**
- Computes embedding-based similarity between the generated text and a reference answer
- Returns a 0-1 score
- *Implementation: Ragas SemanticSimilarity metric*
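
Ragas handles the metrics above, but the idea behind Semantic Similarity is easy to illustrate. A minimal sketch using OpenAI embeddings and cosine similarity (the embedding model name is an assumption for illustration, not the Ragas internals):

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(text: str) -> list[float]:
    # Embed a single string; the model choice is illustrative only
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def semantic_similarity(generated: str, reference: str) -> float:
    a, b = embed(generated), embed(reference)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm  # roughly 0-1 for typical text embeddings

print(semantic_similarity("Paris is the capital of France.",
                          "France's capital city is Paris."))
```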

---

### Conversation Evaluators

**Conversation Quality**
- Overall conversation score combining relevancy (40%), completeness (40%), and memory retention (20%) over multiple turns
- Returns a weighted combined score
- *Implementation: DeepEval TurnRelevancy + ConversationCompleteness + KnowledgeRetention*
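
The weighting itself is plain arithmetic; a minimal sketch of the 40/40/20 combination, assuming the three sub-scores have already been computed:

```python
def conversation_quality(relevancy: float, completeness: float, retention: float) -> float:
    # Weighted combination described above: 40% relevancy, 40% completeness, 20% memory retention
    return 0.4 * relevancy + 0.4 * completeness + 0.2 * retention

print(conversation_quality(relevancy=0.9, completeness=0.7, retention=0.8))  # 0.8
```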

**Intent Change**
- Detects whether the conversation stayed on the original intent or drifted into unrelated topics
- Higher score = better adherence to the original topic
- *Implementation: Ragas TopicAdherenceScore (precision mode)*

**Topic Adherence**
- Measures how well conversation messages stay aligned with the specified allowed topics
- Returns a 0-1 score
- *Implementation: Ragas TopicAdherenceScore*

**Context Relevance**
- Rates whether the retrieved context actually contains the information needed to answer the question
- Score = relevant statements / total statements
- *Implementation: DeepEval ContextualRelevancyMetric*
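
For a rough idea of how a DeepEval-backed check such as Context Relevance is typically driven, here is a hedged sketch against the public DeepEval API (test-case values and threshold are illustrative, not Traceloop's wiring):

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital and largest city of France."],
)

# Uses an LLM under the hood; DeepEval defaults to OpenAI, so an API key is required
metric = ContextualRelevancyMetric(threshold=0.7)
metric.measure(test_case)  # scores relevant vs. total statements in the retrieved context
print(metric.score, metric.reason)
```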

**Instruction Adherence**
- Evaluates how closely the model followed system-level or user instructions
- Returns a 0-1 adherence score
- *Implementation: DeepEval PromptAlignmentMetric*

---

### Safety & Security Evaluators

**PII Detector**
- Detects names, addresses, emails, and other personal identifiers in text, and can optionally redact them
- Pass/fail based on a confidence threshold
- *Implementation: Microsoft Presidio Analyzer*
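
A minimal sketch of a Presidio-based pass/fail check along these lines (the 0.5 confidence threshold is an assumption; the hosted evaluator's configuration may differ):

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def contains_pii(text: str, threshold: float = 0.5) -> bool:
    # True if any detected entity (name, email, phone, ...) exceeds the confidence threshold
    findings = analyzer.analyze(text=text, language="en")
    return any(f.score >= threshold for f in findings)

print(contains_pii("Contact John Doe at john.doe@example.com"))  # True
```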

**Secrets Detector**
- Identifies hardcoded secrets such as API keys, tokens, and passwords
- Binary pass/fail with optional redaction
- *Implementation: Yelp detect-secrets*

**Profanity Detector**
- Checks whether text contains offensive or profane language
- Binary pass/fail
- *Implementation: profanity-check library*
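
A minimal sketch using the profanity-check library:

```python
from profanity_check import predict, predict_prob

texts = ["Have a great day!", "You are a worthless idiot."]

print(predict(texts))       # e.g. [0, 1]: binary flag per text
print(predict_prob(texts))  # probabilities you can compare against a threshold
```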

**Prompt Injection Detector**
- Flags attempts to override system behavior or inject malicious instructions
- Binary pass/fail based on threshold
- *Implementation: AWS SageMaker endpoint running a DeBERTa-v3 model*

**Toxicity Detector**
- Classifies toxic content categories such as threats, insults, obscenity, and hate speech
- Binary pass/fail based on threshold
- *Implementation: AWS SageMaker unitary/toxic-bert model*
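
The SageMaker-backed detectors (prompt injection, toxicity, sexism, tone) all follow the same invoke-an-endpoint pattern. A hedged boto3 sketch (the endpoint name and response shape are assumptions; Hugging Face text-classification containers typically return label/score pairs):

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def classify_toxicity(text: str, threshold: float = 0.5) -> bool:
    # Hypothetical endpoint name; replace with your deployed toxic-bert endpoint
    response = runtime.invoke_endpoint(
        EndpointName="toxic-bert-endpoint",
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    scores = json.loads(response["Body"].read())  # e.g. [{"label": "toxic", "score": 0.97}, ...]
    return any(item["score"] >= threshold for item in scores)
```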

**Sexism Detector**
- Detects sexist language and gender-based bias or discrimination
- Binary pass/fail based on threshold
- *Implementation: AWS SageMaker unitary/toxic-bert model*

---

### Format Validators

**JSON Validator**
- Validates that the output is valid JSON and optionally matches a schema
- Binary pass/fail
- *Implementation: Python json and jsonschema*
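
A minimal sketch of this kind of check with the standard json module and the jsonschema package (the example schema is illustrative):

```python
import json
from jsonschema import ValidationError, validate

SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name"],
}

def validate_json(output: str, schema: dict | None = None) -> bool:
    try:
        parsed = json.loads(output)                   # syntactic check
        if schema is not None:
            validate(instance=parsed, schema=schema)  # optional schema check
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(validate_json('{"name": "Ada", "age": 36}', SCHEMA))  # True
print(validate_json('{"age": "thirty-six"}', SCHEMA))       # False (missing required "name")
```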

**SQL Validator**
- Checks whether the generated text is syntactically valid PostgreSQL SQL
- Binary pass/fail
- *Implementation: pglast Postgres parser*
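
A minimal pglast-based syntax check might look like this (error handling is kept broad because the exact exception class depends on the pglast version):

```python
from pglast import parse_sql

def is_valid_postgres_sql(query: str) -> bool:
    try:
        parse_sql(query)   # builds a parse tree using the real Postgres grammar
        return True
    except Exception:      # pglast raises a parser error subclass on invalid SQL
        return False

print(is_valid_postgres_sql("SELECT id, name FROM users WHERE id = 1"))  # True
print(is_valid_postgres_sql("SELEC id FROM users"))                      # False
```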

**Regex Validator**
- Validates whether text matches (or must not match) a regex, with flexible flags
- Supports case sensitivity, multiline, and dotall flags
- *Implementation: Python re*
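
A minimal sketch of a flag-aware regex check along these lines (parameter names are illustrative):

```python
import re

def regex_check(text: str, pattern: str, *, must_match: bool = True,
                ignore_case: bool = False, multiline: bool = False,
                dotall: bool = False) -> bool:
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    matched = re.search(pattern, text, flags) is not None
    return matched if must_match else not matched

print(regex_check("Order #12345 confirmed", r"#\d{5}"))         # True
print(regex_check("no digits here", r"\d+", must_match=False))  # True
```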

**Placeholder Regex**
- Similar to the Regex Validator, but dynamically injects a placeholder before matching
- Useful for dynamic pattern validation
- *Implementation: Python re*

---

### Text Metrics

**Word Count**
- Counts the number of words in the generated output
- Returns an integer count
- *Implementation: Python string split*

**Word Count Ratio**
- Compares the output word count to the input word count
- Useful for measuring expansion or compression
- *Implementation: Python string operations*

**Char Count**
- Counts the number of characters in the generated text
- Returns an integer count
- *Implementation: Python len()*

**Char Count Ratio**
- Output character count divided by input character count
- Returns a float ratio
- *Implementation: Python len()*
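
All four text metrics reduce to a few lines of standard Python; a minimal sketch:

```python
def word_count(text: str) -> int:
    return len(text.split())

def char_count(text: str) -> int:
    return len(text)

def word_count_ratio(output: str, input_text: str) -> float:
    # max(..., 1) guards against division by zero on empty input
    return word_count(output) / max(word_count(input_text), 1)

def char_count_ratio(output: str, input_text: str) -> float:
    return char_count(output) / max(char_count(input_text), 1)

print(word_count_ratio("a much longer generated answer", "short prompt"))  # 2.5
```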

**Perplexity**
- Computes perplexity from the provided logprobs to quantify model confidence in its output
- Lower = more confident predictions
- *Implementation: Mathematical calculation exp(-avg_log_prob)*
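
Given the token logprobs, the computation is exactly the formula noted above; a minimal sketch:

```python
import math

def perplexity(logprobs: list[float]) -> float:
    # exp(-average log probability) over the generated tokens
    avg_log_prob = sum(logprobs) / len(logprobs)
    return math.exp(-avg_log_prob)

print(perplexity([-0.1, -0.3, -0.2]))  # ~1.22: low perplexity, confident output
```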

---

### Specialized Evaluators

**LLM as a Judge**
- Fully flexible LLM-based evaluator using arbitrary prompts and variables; returns JSON directly from the model
- Configurable model, temperature, etc.
- *Implementation: Custom OpenAI API call*
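
A hedged sketch of a generic LLM-as-a-judge call of this shape (the prompt template, variables, and returned fields are illustrative, not Traceloop's actual prompt):

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "You are an impartial evaluator. Given the question and the answer, "
    "return JSON with fields 'score' (0-1) and 'reason'.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def llm_judge(question: str, answer: str, model: str = "gpt-4o", temperature: float = 0.0) -> dict:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        response_format={"type": "json_object"},  # ask the model for JSON directly
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(question=question, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)

print(llm_judge("What is 2 + 2?", "4"))
```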

**Tone Detection**
- Classifies the emotional tone of text (joy, anger, sadness, fear, neutral, etc.)
- Returns the detected tone and a confidence score
- *Implementation: AWS SageMaker emotion-distilroberta model*

**Uncertainty**
- Generates a response with token-level logprobs and calculates uncertainty from the maximum surprisal
- Returns the answer plus an uncertainty score
- *Implementation: GPT-4o-mini with logprobs enabled*
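
A sketch of the max-surprisal idea using the OpenAI logprobs option (the aggregation shown follows the description above and is illustrative, not necessarily the exact formula used):

```python
from openai import OpenAI

client = OpenAI()

def answer_with_uncertainty(question: str) -> tuple[str, float]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        logprobs=True,  # return token-level log probabilities
    )
    choice = response.choices[0]
    # Surprisal of a token is -logprob; the most surprising token drives the uncertainty score
    max_surprisal = max(-t.logprob for t in choice.logprobs.content)
    return choice.message.content, max_surprisal

answer, uncertainty = answer_with_uncertainty("What year did Apollo 11 land on the Moon?")
print(answer, uncertainty)
```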

---

## Custom Evaluators

In addition to the pre-built evaluators, you can create custom evaluators tailored to your specific needs:

- Create custom metric evaluations
- Define your own evaluation logic and scoring

### Inputs
- **string**: Text-based input parameters
- Support for multiple input types

### Outputs
- **results**: String-based evaluation results
- **pass**: Boolean indicator for pass/fail status

---

## Usage

1. Browse the available evaluators in the library
4. Use the "Use evaluator" button to integrate into your workflow
5. Monitor outputs and pass/fail status for systematic quality assessment

The Evaluator Library streamlines the process of implementing comprehensive AI output assessment, ensuring consistent quality and safety standards across your applications.