Commit 88651e7

chore: Add agents evals (#127)
1 parent b75c102 commit 88651e7

2 files changed: +77 -9 lines changed

evaluators/evaluator-library.mdx

Lines changed: 46 additions & 2 deletions
@@ -59,15 +59,59 @@ Traceloop provides several pre-configured evaluators for common assessment tasks
 - Monitor for sensitive information leakage
 - Prevent accidental exposure of credentials
 
+### Formatting Evaluators
+
+**SQL Validation**
+- Validate SQL queries
+- Ensure syntactically correct SQL output
+
+**JSON Validation**
+- Validate JSON responses
+- Ensure properly formatted JSON structures
+
+**Regex Validation**
+- Validate regex patterns
+- Verify pattern matching requirements
+
+**Placeholder Regex**
+- Validate placeholder regex patterns
+- Check for expected placeholders in responses
+
+### Advanced Quality Evaluators
+
+**Semantic Similarity**
+- Validate semantic similarity between texts
+- Compare meaning and context alignment
+
+**Agent Goal Accuracy**
+- Validate agent goal accuracy
+- Measure how well agent achieves defined goals
+
+**Topic Adherence**
+- Validate topic adherence
+- Ensure responses stay within specified topics
+
+**Measure Perplexity**
+- Measure text perplexity from logprobs
+- Assess response predictability and coherence
+
 ## Custom Evaluators
 
-In addition to the pre-built evaluators, you can create custom evaluators with:
+In addition to the pre-built evaluators, you can create custom evaluators tailored to your specific needs:
+
+**Custom Metric**
+- Create custom metric evaluations
+- Define your own evaluation logic and scoring
+
+**Custom LLM Judge**
+- Create custom evaluations using LLM-as-a-judge
+- Leverage AI models to assess outputs against custom criteria
 
 ### Inputs
 - **string**: Text-based input parameters
 - Support for multiple input types
 
-### Outputs
+### Outputs
 - **results**: String-based evaluation results
 - **pass**: Boolean indicator for pass/fail status
 
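
The custom-evaluator schema documented in this diff (string inputs, a string `results` field, and a boolean `pass` flag) can be pictured as a small callable. The sketch below is illustrative only and does not use Traceloop's SDK; the `length_within_limit` metric and its word-count threshold are hypothetical stand-ins chosen just to show the input/output shape.

```python
# Illustrative only; not the Traceloop SDK. A tiny custom-metric evaluator
# shaped like the schema described above: a string input in, a string
# `results` plus a boolean pass/fail flag out. The word-count metric and its
# threshold are hypothetical, chosen only to show the input/output shape.
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    results: str   # string-based evaluation result, per the documented schema
    passed: bool   # pass/fail status (the schema's `pass` field; renamed
                   # here because `pass` is a Python keyword)


def length_within_limit(output: str, max_words: int = 200) -> EvaluationResult:
    """Hypothetical custom metric: flag responses that exceed a word budget."""
    word_count = len(output.split())
    return EvaluationResult(
        results=f"{word_count} words (limit {max_words})",
        passed=word_count <= max_words,
    )


if __name__ == "__main__":
    print(length_within_limit("An example model response to evaluate."))
```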
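
For the Semantic Similarity evaluator added above, a common way such a check is implemented is cosine similarity over text embeddings. This is a generic background sketch, not Traceloop's implementation; the `embed` callable is a hypothetical stand-in for whatever embedding model is used.

```python
# Background sketch for the "Semantic Similarity" evaluator: a common approach
# is cosine similarity over text embeddings. Not Traceloop's implementation;
# `embed` is a hypothetical stand-in for any embedding model.
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_similarity(
    expected: str, actual: str, embed: Callable[[str], Sequence[float]]
) -> float:
    """Score how closely two texts align in meaning (1.0 = same direction)."""
    return cosine_similarity(embed(expected), embed(actual))
```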

evaluators/made-by-traceloop.mdx

Lines changed: 31 additions & 7 deletions
@@ -34,10 +34,6 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 <Card title="Word Count Ratio" icon="hashtag">
 Measure the ratio of words to the input to compare input/output verbosity and expansion patterns.
 </Card>
-
-<Card title="Tone Detection" icon="smile">
-Classify emotional tone of responses (joy, anger, sadness, etc.).
-</Card>
 </CardGroup>
 
 ### Quality & Correctness
@@ -55,8 +51,8 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 Evaluate factual accuracy by comparing answers against ground truth.
 </Card>
 
-<Card title="Answer Completeness" icon="check-circle">
-Measure how completely responses use relevant context.
+<Card title="Answer Completeness" icon="circle-check">
+Measure how completely responses use relevant context to ensure all relevant information is addressed.
 </Card>
 
 <Card title="Topic Adherence" icon="hashtag">
@@ -67,6 +63,10 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 Validate semantic similarity between expected and actual responses to measure content alignment.
 </Card>
 
+<Card title="Instruction Adherence" icon="clipboard-check">
+Measure how well the LLM response follows given instructions to ensure compliance with specified requirements.
+</Card>
+
 <Card title="Prompt Perplexity" icon="brain">
 Measure how predictable/familiar a prompt is to a language model.
 </Card>
@@ -76,7 +76,11 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 </Card>
 
 <Card title="Uncertainty Detector" icon="gauge">
-Generate responses and measure model uncertainty from logprobs.
+Generate responses and measure model uncertainty from logprobs to identify when the model is less confident in its outputs.
+</Card>
+
+<Card title="Conversation Quality" icon="comments">
+Evaluate conversation quality based on tone, clarity, flow, responsiveness, and transparency.
 </Card>
 </CardGroup>
 
@@ -134,4 +138,24 @@ Each evaluator comes with a predefined input and output schema. When using an ev
 <Card title="Agent Goal Accuracy" icon="bullseye">
 Validate agent goal accuracy to ensure AI systems achieve their intended objectives effectively.
 </Card>
+
+<Card title="Agent Tool Error Detector" icon="wrench">
+Detect errors or failures during tool execution to monitor agent tool performance.
+</Card>
+
+<Card title="Agent Flow Quality" icon="route">
+Validate agent trajectories against user-defined natural language tests to assess agent decision-making paths.
+</Card>
+
+<Card title="Agent Efficiency" icon="zap">
+Evaluate agent efficiency by checking for redundant calls and optimal paths to optimize agent performance.
+</Card>
+
+<Card title="Agent Goal Completeness" icon="circle-check">
+Measure whether the agent successfully accomplished all user goals to verify comprehensive goal achievement.
+</Card>
+
+<Card title="Intent Change" icon="repeat">
+Detect whether the user's primary intent or workflow changed significantly during a conversation.
+</Card>
 </CardGroup>
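
Several evaluators touched by this commit (Measure Perplexity, Prompt Perplexity, Uncertainty Detector) are described as working from token log-probabilities. As background, the standard perplexity computation from a sequence of logprobs is simply the exponential of the negative mean log-probability; the snippet below shows that definition and is not tied to Traceloop's implementation.

```python
# Background sketch: perplexity from token log-probabilities (natural log).
# This is the standard definition, not Traceloop's implementation.
import math
from typing import Sequence


def perplexity(logprobs: Sequence[float]) -> float:
    """exp of the negative mean token log-probability."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))


# Higher (less negative) logprobs -> lower perplexity -> more predictable text.
print(round(perplexity([-0.1, -0.3, -0.2]), 2))   # 1.22
print(round(perplexity([-2.5, -3.0, -1.8]), 2))   # 11.4
```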
