
Commit fdd76dc

Author: Dylan Huang
improve (#205)
1 parent 9196f83

File tree

1 file changed: +12 additions, -10 deletions


README.md

Lines changed: 12 additions & 10 deletions
@@ -4,7 +4,18 @@
 
 **The open-source toolkit for building your internal model leaderboard.**
 
-When you have multiple AI models to choose from—different versions, providers, or configurations—how do you know which one is best for your use case?
+When you have multiple AI models to choose from—different versions, providers,
+or configurations—how do you know which one is best for your use case?
+
+## 🚀 Features
+
+- **Custom Evaluations**: Write evaluations tailored to your specific business needs
+- **Auto-Evaluation**: Stack-rank models using LLMs as judges, with out-of-the-box evaluators that need only model traces
+- **RL Environments via MCP**: Build reinforcement learning environments using the Model Context Protocol (MCP) to simulate user interactions and advanced evaluation scenarios
+- **Consistent Testing**: Test across various models and configurations with a unified framework
+- **Resilient Runtime**: Automatic retries for unstable LLM APIs and concurrent execution for long-running evaluations
+- **Rich Visualizations**: Built-in pivot tables and visualizations for result analysis
+- **Data-Driven Decisions**: Make informed model deployment decisions based on comprehensive evaluation results
 
 ## Quick Examples
 

@@ -69,15 +80,6 @@ def test_math_reasoning(row: EvaluationRow) -> EvaluationRow:
     return row
 ```
 
-## 🚀 Features
-
-- **Custom Evaluations**: Write evaluations tailored to your specific business needs
-- **Auto-Evaluation**: Stack-rank models using LLMs as judges with just model traces
-- **Model Context Protocol (MCP) Integration**: Build reinforcement learning environments and trigger user simulations for complex scenarios
-- **Consistent Testing**: Test across various models and configurations with a unified framework
-- **Resilient Runtime**: Automatic retries for unstable LLM APIs and concurrent execution for long-running evaluations
-- **Rich Visualizations**: Built-in pivot tables and visualizations for result analysis
-- **Data-Driven Decisions**: Make informed model deployment decisions based on comprehensive evaluation results
 
 ## 📚 Resources
