6 changes: 6 additions & 0 deletions index.toml
@@ -370,3 +370,9 @@ title = "Tabular Data Processing with Prior Labs MCP"
notebook = "prior_labs_agent.ipynb"
new = true
topics = ["Agents", "MCP", "Data Processing"]

[[cookbook]]
title = "Benchmark Retrieval Strategies with KB Arena before wiring into Haystack"
notebook = "benchmark_retrieval_strategies_kb_arena.ipynb"
new = true
topics = ["Evaluation", "Advanced Retrieval", "RAG"]
256 changes: 256 additions & 0 deletions notebooks/benchmark_retrieval_strategies_kb_arena.ipynb
@@ -0,0 +1,256 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Benchmark Retrieval Strategies with KB Arena before wiring into Haystack\n",
"\n",
"[KB Arena](https://github.com/xmpuspus/kb-arena) is an open-source benchmark that runs nine architecturally distinct retrieval strategies head-to-head on your own documentation and reports IR metrics with statistical confidence intervals.\n",
"\n",
"Before you commit to a Haystack pipeline using one specific retrieval approach, KB Arena lets you ask: *for my corpus, which strategy actually wins?* This notebook walks through running the benchmark on a small example corpus, reading the results, and wiring the winning strategy into a Haystack `Pipeline`.\n",
"\n",
"KB Arena's nine strategies are naive vector, contextual vector, QnA pairs, knowledge graph (Neo4j), hybrid (RRF-fused), RAPTOR, PageIndex, BM25, and rerank-vector (cross-encoder). The full benchmark requires Neo4j and embedding-provider API keys; this notebook runs BM25 only, so it works end-to-end with no external services."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notebook by [*Xavier Puspus*](https://github.com/xmpuspus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"No API keys are required for the BM25-only path shown here. To unlock the full nine-strategy benchmark, see the KB Arena [README](https://github.com/xmpuspus/kb-arena) for how to set `KB_ARENA_ANTHROPIC_API_KEY` (or the key for another supported provider) and `KB_ARENA_NEO4J_URI`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install kb-arena haystack-ai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 — Prepare a small example corpus\n",
"\n",
"KB Arena ingests Markdown, HTML, PDF, DOCX, plaintext, CSV, and a few other formats from a directory. For this walkthrough we drop three short AWS-flavored Markdown files into a fresh directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"corpus_dir = Path(\"raw\")\n",
"corpus_dir.mkdir(exist_ok=True)\n",
"\n",
"(corpus_dir / \"lambda.md\").write_text(\"\"\"# AWS Lambda\n",
"\n",
"AWS Lambda runs code in response to events without provisioning servers. It scales\n",
"automatically and you pay only for the compute time consumed. Common triggers\n",
"include S3 uploads, DynamoDB streams, API Gateway requests, and scheduled\n",
"EventBridge rules. Memory is configurable from 128 MB to 10,240 MB; CPU scales\n",
"with memory. The maximum execution timeout is 15 minutes per invocation.\n",
"\"\"\")\n",
"\n",
"(corpus_dir / \"ec2.md\").write_text(\"\"\"# Amazon EC2\n",
"\n",
"Amazon EC2 provides resizable compute capacity in the cloud through virtual\n",
"machines called instances. Instance types are grouped into families optimized\n",
"for compute, memory, storage, or accelerated computing. Pricing models include\n",
"On-Demand, Reserved, Savings Plans, and Spot. EBS volumes provide persistent\n",
"block storage attached to instances; instance store provides ephemeral local\n",
"storage.\n",
"\"\"\")\n",
"\n",
"(corpus_dir / \"fargate.md\").write_text(\"\"\"# AWS Fargate\n",
"\n",
"AWS Fargate is a serverless compute engine for containers that works with both\n",
"Amazon ECS and Amazon EKS. You define container images, CPU and memory, and\n",
"Fargate provisions the underlying infrastructure for you. Tasks can run for as\n",
"long as needed and you pay per second of vCPU and memory consumed. Unlike\n",
"Lambda, Fargate has no 15-minute execution cap.\n",
"\"\"\")\n",
"\n",
"print(sorted(p.name for p in corpus_dir.iterdir()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 — Initialize the corpus and ingest documents\n",
"\n",
"KB Arena treats every corpus as an isolated workspace under `datasets/<corpus>/`. `init-corpus` creates the directory layout; `ingest` parses the source files into a JSONL `Document` representation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!kb-arena init-corpus aws-mini\n",
"!cp raw/*.md datasets/aws-mini/raw/\n",
"!kb-arena ingest datasets/aws-mini/raw/ --corpus aws-mini"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 — Generate benchmark questions\n",
"\n",
"For real corpora, KB Arena auto-generates questions across five difficulty tiers via `kb-arena generate-questions`. For this minimal walkthrough, we hand-author a tiny `questions.yaml` so the notebook runs without API keys or external services."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"questions_yaml = '''questions:\n",
"  - id: q1\n",
"    text: What is the maximum execution timeout for an AWS Lambda invocation?\n",
"    tier: 1\n",
"    expected_docs: [lambda]\n",
"  - id: q2\n",
"    text: Which pricing models does Amazon EC2 support?\n",
"    tier: 1\n",
"    expected_docs: [ec2]\n",
"  - id: q3\n",
"    text: How does Fargate differ from Lambda for long-running container workloads?\n",
"    tier: 3\n",
"    expected_docs: [fargate, lambda]\n",
"'''\n",
"\n",
"Path(\"datasets/aws-mini/questions\").mkdir(parents=True, exist_ok=True)\n",
"Path(\"datasets/aws-mini/questions/questions.yaml\").write_text(questions_yaml)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 — Run KB Arena's retriever-lab on BM25\n",
"\n",
"`retriever-lab` is the retrieval-only benchmark (no generation, no LLM judge). It computes IR metrics per question and aggregates them with paired-bootstrap 95% confidence intervals."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!kb-arena retriever-lab --corpus aws-mini --strategies bm25 --top-ks 3,5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5 — Inspect the IR metrics\n",
"\n",
"Results land in `results/run_<id>/retriever_lab.json`. The per-strategy summary carries Recall@k, Precision@k, MRR, NDCG (binary + graded), MAP, R-Precision, and bpref, plus bootstrap CIs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"\n",
"# Pick the most recently modified run directory; fail clearly if none exists.\n",
"run_dirs = sorted(Path(\"results\").glob(\"run_*\"), key=lambda p: p.stat().st_mtime)\n",
"if not run_dirs:\n",
"    raise FileNotFoundError(\"No results/run_* directory found; run Step 4 first.\")\n",
"summary = json.loads((run_dirs[-1] / \"retriever_lab.json\").read_text())\n",
"\n",
"for strategy, metrics in summary.get(\"per_strategy\", {}).items():\n",
"    print(f\"\\n{strategy}\")\n",
"    for key in (\"recall_at_k\", \"ndcg_at_k\", \"mrr\", \"map\"):\n",
"        if key in metrics:\n",
"            print(f\"  {key}: {metrics[key]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6 — Wire the winning strategy into a Haystack pipeline\n",
"\n",
"Once KB Arena tells you which strategy wins on your corpus (with statistical confidence, not just a higher mean score), use the equivalent Haystack component. BM25 in KB Arena maps to `InMemoryBM25Retriever` here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from haystack import Document, Pipeline\n",
"from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
"from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n",
"\n",
"# Index only the Markdown files, in a stable order.\n",
"docs = [Document(content=p.read_text(), meta={\"name\": p.stem}) for p in sorted(corpus_dir.glob(\"*.md\"))]\n",
"store = InMemoryDocumentStore()\n",
"store.write_documents(docs)\n",
"\n",
"pipe = Pipeline()\n",
"pipe.add_component(\"retriever\", InMemoryBM25Retriever(document_store=store, top_k=3))\n",
"\n",
"result = pipe.run({\"retriever\": {\"query\": \"What is the maximum execution timeout for an AWS Lambda invocation?\"}})\n",
"for doc in result[\"retriever\"][\"documents\"]:\n",
"    print(f\"{doc.meta['name']} score={doc.score:.3f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"- Replace the three Markdown files with your real corpus: place the files in `datasets/<your-corpus>/raw/`, then re-run ingestion and the benchmark.\n",
"- Run `kb-arena generate-questions --corpus <your-corpus> --count 50` to auto-generate questions across five difficulty tiers.\n",
"- Run `kb-arena label-chunks --corpus <your-corpus>` to get chunk-level ground truth via BM25 + Haiku judge.\n",
"- Add embedding and Neo4j configuration to compare BM25 against vector, knowledge graph, hybrid, RAPTOR, PageIndex, and rerank strategies.\n",
"- Use `kb-arena optimize` to search across strategies and top-k values with bootstrap CIs, Wilcoxon p-values, and a Pareto frontier across (NDCG, latency)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
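
Steps 4 and 5 of the notebook refer to per-question IR metrics and paired-bootstrap 95% confidence intervals. As a rough illustration of what those terms mean, here is a minimal pure-Python sketch. It is not KB Arena's implementation: the function names, the example rankings, and the comparison strategy `other` are all hypothetical, and a simple percentile bootstrap is assumed.

```python
import random
from statistics import mean


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k ranked ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question difference a - b.

    Questions are resampled with replacement; each resample keeps the two
    strategies' scores for a question together, which is what makes it paired.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), low, high


# Hypothetical rankings for the notebook's three questions (not real KB Arena output).
relevant = {"q1": {"lambda"}, "q2": {"ec2"}, "q3": {"fargate", "lambda"}}
bm25 = {
    "q1": ["lambda", "fargate", "ec2"],
    "q2": ["ec2", "lambda", "fargate"],
    "q3": ["fargate", "ec2", "lambda"],
}
other = {
    "q1": ["fargate", "lambda", "ec2"],
    "q2": ["ec2", "fargate", "lambda"],
    "q3": ["ec2", "fargate", "lambda"],
}

rr_bm25 = [reciprocal_rank(bm25[q], relevant[q]) for q in relevant]
rr_other = [reciprocal_rank(other[q], relevant[q]) for q in relevant]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(rr_bm25, rr_other)
print(f"MRR difference (bm25 - other): {mean_diff:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With only three questions the interval is very wide, which is why a larger question set is worth generating before trusting a winner. If the interval for the difference excludes zero, the data supports one strategy over the other on that corpus.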