A decorator-first LLM evaluation library for testing AI agents and LLMs. Stack decorators to define evaluation criteria, run with pytest. Read the docs.
- 50+ built-in metrics -- stack `@fe.correctness`, `@fe.relevance`, `@fe.hallucination`, and more
- pytest native -- run evaluations with `pytest`, get familiar pass/fail output
- LLM-as-judge + deterministic -- semantic LLM metrics alongside ROUGE, exact match, JSON schema, regex
- Custom criteria -- `@fe.criteria("Is the response empathetic?")` for any evaluation you can describe in plain English
- Multi-modal -- evaluate vision, audio, and image generation models
- Conversation metrics -- context retention, topic drift, consistency for multi-turn agents
- RAG metrics -- faithfulness, contextual precision, contextual recall, answer correctness
- Tool trajectory -- verify agent tool calls, argument matching, call sequences
- Reusable metric stacks -- `@fe.stack()` to compose and reuse metric sets across tests
- Human-in-the-loop -- `@fe.human_review()` for manual review alongside automated metrics
- Data-driven testing -- `@fe.csv("test_data.csv")` to load test cases from CSV files
- Pluggable providers -- OpenAI (default), Anthropic, or bring your own `LLMClient`
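As a sketch of that last bullet: a bring-your-own provider is presumably any object exposing a completion call. The class below is a duck-typed illustration only -- the method name `complete` and its signature are assumptions, not the documented `LLMClient` contract.

```python
# Illustrative stand-in provider. The `complete` method and its signature are
# assumptions about the LLMClient interface, not the documented contract.
class CannedClient:
    """Returns a fixed response -- handy for fast, deterministic test runs."""

    def __init__(self, response: str = "ok"):
        self.response = response

    def complete(self, prompt: str, **kwargs) -> str:
        # A real client would call your model provider's API here.
        return self.response

client = CannedClient("Paris")
answer = client.complete("What is the capital of France?")
```

A canned client like this also doubles as a cheap test double when you want evaluation plumbing to run without spending tokens.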
```bash
pip install fasteval-core
```

Set your LLM provider key:

```bash
export OPENAI_API_KEY=sk-your-key-here
```

Write your first evaluation test:

```python
import fasteval as fe

@fe.correctness(threshold=0.8)
@fe.relevance(threshold=0.7)
def test_qa_agent():
    response = my_agent("What is the capital of France?")
    fe.score(response, expected_output="Paris", input="What is the capital of France?")
```

Run it:

```bash
pytest test_qa_agent.py -v
```

```bash
# pip
pip install fasteval-core

# uv
uv add fasteval-core
```

```bash
# Anthropic provider
pip install fasteval-core[anthropic]

# Vision-language evaluation (GPT-4V, Claude Vision)
pip install fasteval-core[vision]

# Audio/speech evaluation (Whisper, ASR)
pip install fasteval-core[audio]

# Image generation evaluation (DALL-E, Stable Diffusion)
pip install fasteval-core[image-gen]

# All multi-modal features
pip install fasteval-core[multimodal]
```

```python
import fasteval as fe

@fe.contains()
def test_keyword_present():
    fe.score("The answer is 42", expected_output="42")

@fe.rouge(threshold=0.6, rouge_type="rougeL")
def test_summary_quality():
    fe.score(actual_output=summary, expected_output=reference)
```

```python
@fe.criteria("Is the response empathetic and professional?")
def test_tone():
    response = agent("I'm frustrated with this product!")
    fe.score(response)
```
```python
@fe.criteria(
    "Does the response include a legal disclaimer?",
    threshold=0.9,
)
def test_compliance():
    response = agent("Can I break my lease?")
    fe.score(response)
```

```python
@fe.faithfulness(threshold=0.8)
@fe.contextual_precision(threshold=0.7)
def test_rag_pipeline():
    result = rag_pipeline("How does photosynthesis work?")
    fe.score(
        actual_output=result.answer,
        context=result.retrieved_docs,
        input="How does photosynthesis work?",
    )
```

```python
@fe.tool_call_accuracy(threshold=0.9)
def test_agent_tools():
    result = agent.run("Book a flight to Paris")
    fe.score(
        result.response,
        tool_calls=result.tool_calls,
        expected_tools=[
            {"name": "search_flights", "args": {"destination": "Paris"}},
            {"name": "book_flight"},
        ],
    )
```

```python
@fe.context_retention(threshold=0.8)
@fe.conversation([
    {"query": "My name is Alice and I'm a vegetarian"},
    {"query": "Suggest a restaurant for me"},
    {"query": "What dietary restriction should they accommodate?"},
])
async def test_memory(query, expected, history):
    response = await agent(query, history=history)
    fe.score(response, input=query, history=history)
```

```python
# Define a reusable metric stack
@fe.stack()
@fe.correctness(threshold=0.8, weight=2.0)
@fe.relevance(threshold=0.7, weight=1.0)
@fe.coherence(threshold=0.6, weight=1.0)
def quality_metrics():
    pass

# Apply to multiple tests
@quality_metrics
def test_chatbot():
    response = agent("Explain quantum computing")
    fe.score(response, expected_output=reference_answer, input="Explain quantum computing")

@quality_metrics
def test_summarizer():
    summary = summarize(long_article)
    fe.score(summary, expected_output=reference_summary)
```

| Plugin | Description | Install |
|---|---|---|
| `fasteval-langfuse` | Evaluate Langfuse production traces with fasteval metrics | `pip install fasteval-langfuse` |
| `fasteval-langgraph` | Test harness for LangGraph agents | `pip install fasteval-langgraph` |
| `fasteval-observe` | Runtime monitoring with async sampling | `pip install fasteval-observe` |
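One feature from the list above that has no example in this README is data-driven testing via `@fe.csv("test_data.csv")`. Presumably the decorator expands each CSV row into one test case; the library-free sketch below illustrates that row-per-case mechanic. Everything here -- the decorator name `csv_cases`, the column names, the kwargs convention -- is an illustrative assumption, not fasteval's implementation.

```python
# Hypothetical sketch of a row-per-case decorator, NOT fasteval's code.
import csv
import io

def csv_cases(source):
    """Run the decorated function once per CSV row, passing columns as kwargs."""
    def decorator(fn):
        def runner():
            return [fn(**row) for row in csv.DictReader(source)]
        return runner
    return decorator

# Inline data standing in for test_data.csv; column names are assumptions.
DATA = io.StringIO("input,expected_output\nCapital of France?,Paris\n2+2?,4\n")

@csv_cases(DATA)
def check_case(input, expected_output):
    # A real test body would call the agent and fe.score() here.
    return (input, expected_output)

cases = check_case()
```

The appeal of this pattern is that non-engineers can add evaluation cases by editing a CSV file, with no test code changes.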
```bash
# Install uv
brew install uv

# Create virtual environment and install all dependencies
uv sync --all-extras --group dev --group test

# Run the test suite
uv run tox

# Run tests with coverage
uv run pytest tests/ --cov=fasteval --cov-report=term -v

# Format code
uv run black .
uv run isort .

# Type checking
uv run mypy .
```

Full documentation is available at fasteval.io.
- Getting Started -- installation and quickstart guide
- Why FastEval -- motivation and design philosophy
- Core Concepts -- decorators, metrics, scoring, data sources
- Concepts -- LLM-as-judge, scoring thresholds, evaluation strategies
- LLM Metrics -- correctness, relevance, hallucination, and more
- Deterministic Metrics -- ROUGE, exact match, regex, JSON schema
- RAG Metrics -- faithfulness, contextual precision/recall
- Tool Trajectory -- tool call accuracy, sequence, argument matching
- Conversation Metrics -- context retention, consistency, topic drift
- Multi-Modal -- vision, audio, image generation evaluation
- Human Review -- human-in-the-loop evaluation
- Cookbooks -- RAG pipelines, CI/CD setup, prompt regression, production monitoring
- Plugins -- Langfuse, LangGraph, Observe
- Advanced -- custom metrics, providers, output collectors, traces
- API Reference -- decorators, evaluator, models, score
See CONTRIBUTING.md for development setup, coding standards, and how to submit pull requests.
Apache License 2.0 -- see LICENSE for details.