@@ -4,7 +4,18 @@
 
 **The open-source toolkit for building your internal model leaderboard.**
 
-When you have multiple AI models to choose from—different versions, providers, or configurations—how do you know which one is best for your use case?
+When you have multiple AI models to choose from—different versions, providers,
+or configurations—how do you know which one is best for your use case?
+
+## 🚀 Features
+
+- **Custom Evaluations**: Write evaluations tailored to your specific business needs
+- **Auto-Evaluation**: Stack-rank models with LLM-as-judge scoring of model traces alone, using out-of-the-box evaluators
+- **RL Environments via MCP**: Build reinforcement learning environments with the Model Context Protocol (MCP) to simulate user interactions and advanced evaluation scenarios
+- **Consistent Testing**: Test across models and configurations with a unified framework
+- **Resilient Runtime**: Automatic retries for unstable LLM APIs and concurrent execution for long-running evaluations
+- **Rich Visualizations**: Built-in pivot tables and visualizations for result analysis
+- **Data-Driven Decisions**: Make informed model deployment decisions based on comprehensive evaluation results
 
 ## Quick Examples
 
@@ -69,15 +80,6 @@ def test_math_reasoning(row: EvaluationRow) -> EvaluationRow:
     return row
 ```
 
-## 🚀 Features
-
-- **Custom Evaluations**: Write evaluations tailored to your specific business needs
-- **Auto-Evaluation**: Stack-rank models using LLMs as judges with just model traces
-- **Model Context Protocol (MCP) Integration**: Build reinforcement learning environments and trigger user simulations for complex scenarios
-- **Consistent Testing**: Test across various models and configurations with a unified framework
-- **Resilient Runtime**: Automatic retries for unstable LLM APIs and concurrent execution for long-running evaluations
-- **Rich Visualizations**: Built-in pivot tables and visualizations for result analysis
-- **Data-Driven Decisions**: Make informed model deployment decisions based on comprehensive evaluation results
 
 ## 📚 Resources
 