Commit 88651e7

chore: Add agents evals (#127)
1 parent b75c102 commit 88651e7

2 files changed: +77 -9 lines changed

evaluators/evaluator-library.mdx

Lines changed: 46 additions & 2 deletions
@@ -59,15 +59,59 @@ Traceloop provides several pre-configured evaluators for common assessment tasks
 - Monitor for sensitive information leakage
 - Prevent accidental exposure of credentials
 
+### Formatting Evaluators
+
+**SQL Validation**
+- Validate SQL queries
+- Ensure syntactically correct SQL output
+
+**JSON Validation**
+- Validate JSON responses
+- Ensure properly formatted JSON structures
+
+**Regex Validation**
+- Validate regex patterns
+- Verify pattern matching requirements
+
+**Placeholder Regex**
+- Validate placeholder regex patterns
+- Check for expected placeholders in responses
+
+### Advanced Quality Evaluators
+
+**Semantic Similarity**
+- Validate semantic similarity between texts
+- Compare meaning and context alignment
+
+**Agent Goal Accuracy**
+- Validate agent goal accuracy
+- Measure how well agent achieves defined goals
+
+**Topic Adherence**
+- Validate topic adherence
+- Ensure responses stay within specified topics
+
+**Measure Perplexity**
+- Measure text perplexity from logprobs
+- Assess response predictability and coherence
+
 ## Custom Evaluators
 
-In addition to the pre-built evaluators, you can create custom evaluators with:
+In addition to the pre-built evaluators, you can create custom evaluators tailored to your specific needs:
+
+**Custom Metric**
+- Create custom metric evaluations
+- Define your own evaluation logic and scoring
+
+**Custom LLM Judge**
+- Create custom evaluations using LLM-as-a-judge
+- Leverage AI models to assess outputs against custom criteria
 
 ### Inputs
 - **string**: Text-based input parameters
 - Support for multiple input types
 
-### Outputs
+### Outputs
 - **results**: String-based evaluation results
 - **pass**: Boolean indicator for pass/fail status
 
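
The custom-evaluator schema documented in this diff (string inputs, a string `results` field, and a boolean `pass` flag) can be pictured as a small callable. The sketch below is illustrative only and does not use Traceloop's SDK; the `length_within_limit` metric and its word-count threshold are hypothetical stand-ins chosen just to show the input/output shape.

```python
# Illustrative only; not the Traceloop SDK. A tiny custom-metric evaluator
# shaped like the schema described above: a string input in, a string
# `results` plus a boolean pass/fail flag out. The word-count metric and its
# threshold are hypothetical, chosen only to show the input/output shape.
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    results: str   # string-based evaluation result, per the documented schema
    passed: bool   # pass/fail status (the schema's `pass` field; renamed
                   # here because `pass` is a Python keyword)


def length_within_limit(output: str, max_words: int = 200) -> EvaluationResult:
    """Hypothetical custom metric: flag responses that exceed a word budget."""
    word_count = len(output.split())
    return EvaluationResult(
        results=f"{word_count} words (limit {max_words})",
        passed=word_count <= max_words,
    )


if __name__ == "__main__":
    print(length_within_limit("An example model response to evaluate."))
```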
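
For the Semantic Similarity evaluator added above, a common way such a check is implemented is cosine similarity over text embeddings. This is a generic background sketch, not Traceloop's implementation; the `embed` callable is a hypothetical stand-in for whatever embedding model is used.

```python
# Background sketch for the "Semantic Similarity" evaluator: a common approach
# is cosine similarity over text embeddings. Not Traceloop's implementation;
# `embed` is a hypothetical stand-in for any embedding model.
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_similarity(
    expected: str, actual: str, embed: Callable[[str], Sequence[float]]
) -> float:
    """Score how closely two texts align in meaning (1.0 = same direction)."""
    return cosine_similarity(embed(expected), embed(actual))
```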

evaluators/made-by-traceloop.mdx

Lines changed: 31 additions & 7 deletions
@@ -34,10 +34,6 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 <Card title="Word Count Ratio" icon="hashtag">
 Measure the ratio of words to the input to compare input/output verbosity and expansion patterns.
 </Card>
-
-<Card title="Tone Detection" icon="smile">
-Classify emotional tone of responses (joy, anger, sadness, etc.).
-</Card>
 </CardGroup>
 
 ### Quality & Correctness
@@ -55,8 +51,8 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 Evaluate factual accuracy by comparing answers against ground truth.
 </Card>
 
-<Card title="Answer Completeness" icon="check-circle">
-Measure how completely responses use relevant context.
+<Card title="Answer Completeness" icon="circle-check">
+Measure how completely responses use relevant context to ensure all relevant information is addressed.
 </Card>
 
 <Card title="Topic Adherence" icon="hashtag">
@@ -67,6 +63,10 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 Validate semantic similarity between expected and actual responses to measure content alignment.
 </Card>
 
+<Card title="Instruction Adherence" icon="clipboard-check">
+Measure how well the LLM response follows given instructions to ensure compliance with specified requirements.
+</Card>
+
 <Card title="Prompt Perplexity" icon="brain">
 Measure how predictable/familiar a prompt is to a language model.
 </Card>
@@ -76,7 +76,11 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 </Card>
 
 <Card title="Uncertainty Detector" icon="gauge">
-Generate responses and measure model uncertainty from logprobs.
+Generate responses and measure model uncertainty from logprobs to identify when the model is less confident in its outputs.
+</Card>
+
+<Card title="Conversation Quality" icon="comments">
+Evaluate conversation quality based on tone, clarity, flow, responsiveness, and transparency.
 </Card>
 </CardGroup>
 
@@ -134,4 +138,24 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 <Card title="Agent Goal Accuracy" icon="bullseye">
 Validate agent goal accuracy to ensure AI systems achieve their intended objectives effectively.
 </Card>
+
+<Card title="Agent Tool Error Detector" icon="wrench">
+Detect errors or failures during tool execution to monitor agent tool performance.
+</Card>
+
+<Card title="Agent Flow Quality" icon="route">
+Validate agent trajectories against user-defined natural language tests to assess agent decision-making paths.
+</Card>
+
+<Card title="Agent Efficiency" icon="zap">
+Evaluate agent efficiency by checking for redundant calls and optimal paths to optimize agent performance.
+</Card>
+
+<Card title="Agent Goal Completeness" icon="circle-check">
+Measure whether the agent successfully accomplished all user goals to verify comprehensive goal achievement.
+</Card>
+
+<Card title="Intent Change" icon="repeat">
+Detect whether the user's primary intent or workflow changed significantly during a conversation.
+</Card>
 </CardGroup>
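
Several evaluators touched by this commit (Measure Perplexity, Prompt Perplexity, Uncertainty Detector) are described as working from token log-probabilities. As background, the standard perplexity computation from a sequence of logprobs is simply the exponential of the negative mean log-probability; the snippet below shows that definition and is not tied to Traceloop's implementation.

```python
# Background sketch: perplexity from token log-probabilities (natural log).
# This is the standard definition, not Traceloop's implementation.
import math
from typing import Sequence


def perplexity(logprobs: Sequence[float]) -> float:
    """exp of the negative mean token log-probability."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))


# Higher (less negative) logprobs -> lower perplexity -> more predictable text.
print(round(perplexity([-0.1, -0.3, -0.2]), 2))   # 1.22
print(round(perplexity([-2.5, -3.0, -1.8]), 2))   # 11.4
```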
