When you have multiple AI models to choose from—different versions, providers, or configurations—how do you know which one is best for your use case?

## Quick Examples

### Basic Model Comparison

Compare models on a simple formatting task:

```python
from eval_protocol.models import EvaluationRow, Message
from eval_protocol.pytest import default_single_turn_rollout_processor, evaluation_test

@evaluation_test(
    input_messages=[
        [
            Message(role="user", content="Explain why evaluations matter for AI agents. Make it dramatic!"),
        ],
    ],
    completion_params=[
        {"model": "fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct"},
        {"model": "openai/gpt-4"},
        {"model": "anthropic/claude-3-sonnet"},
    ],
    rollout_processor=default_single_turn_rollout_processor,
    mode="pointwise",
)
def test_bold_format(row: EvaluationRow) -> EvaluationRow:
    # ... evaluation logic elided in this excerpt ...
    return row
```
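The decorator above fans the same prompt out to each model listed in `completion_params`, and the test body scores each resulting `EvaluationRow`. The scoring logic is elided in this excerpt, but a check like the one `test_bold_format` implies could be sketched in plain Python (the function name and regex below are illustrative, not eval-protocol API):

```python
import re

def score_bold_usage(response_text: str) -> float:
    """Illustrative scorer: 1.0 if the reply uses **bold** markdown, else 0.0."""
    return 1.0 if re.search(r"\*\*[^*]+\*\*", response_text) else 0.0

score_bold_usage("Evaluations **really** matter!")  # returns 1.0
score_bold_usage("No drama here.")                  # returns 0.0
```

A real evaluator would attach this score to the row rather than just returning it.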

### Using Datasets

Evaluate models on existing datasets:

```python
from eval_protocol.models import EvaluationRow
from eval_protocol.pytest import evaluation_test
from eval_protocol.adapters.huggingface import create_gsm8k_adapter

@evaluation_test(
    input_dataset=["development/gsm8k_sample.jsonl"],  # Local JSONL file
    dataset_adapter=create_gsm8k_adapter(),  # Adapter to convert data
    completion_params=[
        {"model": "openai/gpt-4"},
        {"model": "anthropic/claude-3-sonnet"},
    ],
    mode="pointwise",
)
def test_math_reasoning(row: EvaluationRow) -> EvaluationRow:
    # Your evaluation logic here
    return row
```
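`create_gsm8k_adapter()` handles the conversion for GSM8K. Conceptually, an adapter's job is to map each raw dataset record into the chat-message form the evaluation consumes; a hypothetical sketch of that mapping (the `question`/`answer` keys are GSM8K's column names, but the function itself is illustrative, not the adapter's real code):

```python
def gsm8k_record_to_messages(record: dict) -> list[dict]:
    """Illustrative: turn one GSM8K-style record into a chat message list."""
    return [{"role": "user", "content": record["question"]}]

messages = gsm8k_record_to_messages({"question": "What is 12 * 7?", "answer": "84"})
# messages == [{"role": "user", "content": "What is 12 * 7?"}]
```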

## 🚀 Features

- **Custom Evaluations**: Write evaluations tailored to your specific business needs
- **Auto-Evaluation**: Stack-rank models using LLMs as judges with just model traces
- **Model Context Protocol (MCP) Integration**: Build reinforcement learning environments and trigger user simulations for complex scenarios
- **Consistent Testing**: Test across various models and configurations with a unified framework
- **Resilient Runtime**: Automatic retries for unstable LLM APIs and concurrent execution for long-running evaluations
- **Rich Visualizations**: Built-in pivot tables and charts for result analysis
- **Data-Driven Decisions**: Make informed model deployment decisions based on comprehensive evaluation results

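The "Resilient Runtime" bullet refers to retrying unstable LLM APIs automatically. The pattern it names is ordinary exponential backoff with jitter; a minimal generic sketch (not eval-protocol's actual retry code):

```python
import random
import time

def call_with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))
```

A production runtime would narrow the caught exception types (rate limits, timeouts) and cap the total delay.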
## 📚 Resources

- **[Documentation](https://evalprotocol.io)** - Complete guides and API reference
- **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** - Community discussions
- **[GitHub](https://github.com/eval-protocol/python-sdk)** - Source code and examples

## Installation

**This library requires Python >= 3.10.**

### Basic Installation

Install with pip:

```bash
pip install eval-protocol
```

### Recommended Installation with uv

For better dependency management and faster installs, we recommend using [uv](https://docs.astral.sh/uv/):

```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install eval-protocol
uv add eval-protocol
```

### Optional Dependencies

Install with additional features:

```bash
# For Langfuse integration
pip install 'eval-protocol[langfuse]'

# For HuggingFace datasets
pip install 'eval-protocol[huggingface]'

# For all adapters
pip install 'eval-protocol[adapters]'

# For development
pip install 'eval-protocol[dev]'
```

## License

[MIT](LICENSE)