A tool for benchmarking retrieval and question answering systems. Built for Weaviate's Query Agent, but designed to evaluate any system you can plug in.
It supports two evaluation modes:
- Search — Ranked retrieval evaluation using IR metrics (Recall@K, nDCG@K, Coverage, alpha-nDCG)
- Ask — Question answering evaluation using LLM-as-judge (DSPy-based ensemble voting for semantic alignment) or exact match accuracy
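If the IR metrics above are unfamiliar, here is a minimal, illustrative sketch of binary-relevance Recall@K and nDCG@K. This is not the library's implementation; the function names and signatures are hypothetical and shown only to make the definitions concrete.

```python
import math

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top K results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """DCG of the top K results, normalized by the DCG of an ideal ranking."""
    # Each relevant hit at 1-based rank r contributes 1 / log2(r + 1).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal DCG: all relevant documents ranked first.
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0
```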
[9/25] 📊 Search Mode Benchmarking is live on the Weaviate Blog.
Clone the repo and install dependencies:
```bash
git clone https://github.com/weaviate/query-agent-benchmarking.git
cd query-agent-benchmarking
uv sync
```

Populate Weaviate with benchmark data:

```bash
uv run python3 scripts/populate-db.py
```

Run search eval:

```bash
uv run python3 scripts/run-search-benchmark.py
```

Run ask eval:

```bash
uv run python3 scripts/run-ask-benchmark.py
```

See `query_agent_benchmarking/benchmark-config.yml` to change the dataset, agent type (`hybrid-search`, `query-agent-search-mode`, etc.), number of samples, and concurrency parameters.
You can also install the package as a dependency and use it programmatically:
```bash
pip install query-agent-benchmarking
```

```python
import query_agent_benchmarking

# Search eval
query_agent_benchmarking.run_search_eval(
    search_dataset="beir/scifact/test",
    agent_name="query-agent-search-mode",
)

# Compare multiple search agents
query_agent_benchmarking.compare_search_agents(
    search_dataset="beir/scifact/test",
    agent_names=["hybrid-search", "query-agent-search-mode"],
)

# Ask eval
query_agent_benchmarking.run_ask_eval(
    ask_dataset="multihoprag",
    agent_name="query-agent-ask-mode",
)
```

Pass any object that implements the `SearchAgent` protocol directly to `run_search_eval`:
```python
from query_agent_benchmarking import run_search_eval, ObjectID

class MyRetriever:
    """Any class with a run() method returning list[ObjectID]."""

    def run(self, query: str, tenant=None) -> list[ObjectID]:
        results = my_search_function(query)
        return [ObjectID(object_id=doc_id) for doc_id in results]

    async def run_async(self, query: str, tenant=None) -> list[ObjectID]:
        return self.run(query, tenant)

    async def initialize_async(self) -> None: pass

    async def close_async(self) -> None: pass

metrics = run_search_eval(
    search_dataset="beir/scifact/test",
    search_agent=MyRetriever(),
)
```

The library handles dataset loading, query execution, metric computation (Recall@K, nDCG@K, etc.), and results aggregation. See Bring Your Own Retriever for the full protocol definition and more examples.
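As a rough orientation before reading those docs, the protocol boils down to the four methods implemented by `MyRetriever` above. The sketch below expresses that shape as a `typing.Protocol`; the name `SearchAgentLike` and the exact signatures are assumptions inferred from the example, not the library's actual definition.

```python
from typing import Protocol

from query_agent_benchmarking import ObjectID

class SearchAgentLike(Protocol):
    """Assumed shape of a search agent, inferred from the MyRetriever example."""

    def run(self, query: str, tenant: str | None = None) -> list[ObjectID]: ...

    async def run_async(self, query: str, tenant: str | None = None) -> list[ObjectID]: ...

    async def initialize_async(self) -> None: ...

    async def close_async(self) -> None: ...
```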
You can also run ask evals against your own queries and Weaviate collection:

```python
from query_agent_benchmarking import run_ask_eval, DocsCollection, InMemoryAskQuery

queries = [
    InMemoryAskQuery(
        question="What is HyDE?",
        ground_truth_answer="HyDE stands for Hypothetical Document Embeddings...",
    ),
]

run_ask_eval(
    docs_collection=DocsCollection(
        collection_name="MyDocs",
        content_key="content",
        id_key="doc_id",
    ),
    queries=queries,
)
```

- 1. Populate Database — Load benchmark datasets into Weaviate
- 2. Run Built-in Evals — Evaluate Weaviate agents on standard search and ask benchmarks
- 3. Run Custom Evals — Bring your own retriever, QA system, queries, or collections
- Experimental — Synthetic benchmark creation and hard negatives