deepset-ai · xmpuspus · May 24, 2026
@@ -370,3 +370,9 @@ title = "Tabular Data Processing with Prior Labs MCP"
 notebook = "prior_labs_agent.ipynb"
 new = true
 topics = ["Agents", "MCP", "Data Processing"]
+
+[[cookbook]]
+title = "Benchmark Retrieval Strategies with KB Arena before wiring into Haystack"
+notebook = "benchmark_retrieval_strategies_kb_arena.ipynb"
+new = true
+topics = ["Evaluation", "Advanced Retrieval", "RAG"]
@@ -0,0 +1,256 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Benchmark Retrieval Strategies with KB Arena before wiring into Haystack\n",
+    "\n",
+    "[KB Arena](https://github.com/xmpuspus/kb-arena) is an open-source benchmark that runs nine architecturally distinct retrieval strategies head-to-head on your own documentation and reports IR metrics with statistical confidence intervals.\n",
+    "\n",
+    "Before you commit to a Haystack pipeline using one specific retrieval approach, KB Arena lets you ask: *for my corpus, which strategy actually wins?* This notebook walks through running the benchmark on a small example corpus, reading the results, and wiring the winning strategy into a Haystack `Pipeline`.\n",
+    "\n",
+    "The nine strategies are naive vector, contextual vector, QnA pairs, knowledge graph (Neo4j), hybrid (RRF-fused), RAPTOR, PageIndex, BM25, and rerank-vector (cross-encoder). The full benchmark requires Neo4j and embedding-provider API keys; this notebook uses BM25 only, so it runs end-to-end with no external services."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notebook by [*Xavier Puspus*](https://github.com/xmpuspus)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "No API keys are required for the BM25-only path shown here. To unlock the full nine-strategy benchmark, see the KB Arena [README](https://github.com/xmpuspus/kb-arena) for setting `KB_ARENA_ANTHROPIC_API_KEY` (or any other supported provider) and `KB_ARENA_NEO4J_URI`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Install dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install kb-arena haystack-ai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1 — Prepare a small example corpus\n",
+    "\n",
+    "KB Arena ingests Markdown, HTML, PDF, DOCX, plaintext, CSV, and a few other formats from a directory. For this walkthrough we drop three short AWS-flavored Markdown files into a fresh directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "corpus_dir = Path(\"raw\")\n",
+    "corpus_dir.mkdir(exist_ok=True)\n",
+    "\n",
+    "(corpus_dir / \"lambda.md\").write_text(\"\"\"# AWS Lambda\n",
+    "\n",
+    "AWS Lambda runs code in response to events without provisioning servers. It scales\n",
+    "automatically and you pay only for the compute time consumed. Common triggers\n",
+    "include S3 uploads, DynamoDB streams, API Gateway requests, and scheduled\n",
+    "EventBridge rules. Memory is configurable from 128 MB to 10,240 MB; CPU scales\n",
+    "with memory. The maximum execution timeout is 15 minutes per invocation.\n",
+    "\"\"\")\n",
+    "\n",
+    "(corpus_dir / \"ec2.md\").write_text(\"\"\"# Amazon EC2\n",
+    "\n",
+    "Amazon EC2 provides resizable compute capacity in the cloud through virtual\n",
+    "machines called instances. Instance types are grouped into families optimized\n",
+    "for compute, memory, storage, or accelerated computing. Pricing models include\n",
+    "On-Demand, Reserved, Savings Plans, and Spot. EBS volumes provide persistent\n",
+    "block storage attached to instances; instance store provides ephemeral local\n",
+    "storage.\n",
+    "\"\"\")\n",
+    "\n",
+    "(corpus_dir / \"fargate.md\").write_text(\"\"\"# AWS Fargate\n",
+    "\n",
+    "AWS Fargate is a serverless compute engine for containers that works with both\n",
+    "Amazon ECS and Amazon EKS. You define container images, CPU and memory, and\n",
+    "Fargate provisions the underlying infrastructure for you. Tasks can run for as\n",
+    "long as needed and you pay per second of vCPU and memory consumed. Unlike\n",
+    "Lambda, Fargate has no 15-minute execution cap.\n",
+    "\"\"\")\n",
+    "\n",
+    "print(sorted(p.name for p in corpus_dir.iterdir()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2 — Initialize the corpus and ingest documents\n",
+    "\n",
+    "KB Arena treats every corpus as an isolated workspace under `datasets/<corpus>/`. `init-corpus` creates the directory layout; `ingest` parses the source files into a JSONL `Document` representation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!kb-arena init-corpus aws-mini\n",
+    "!cp raw/*.md datasets/aws-mini/raw/\n",
+    "!kb-arena ingest datasets/aws-mini/raw/ --corpus aws-mini"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3 — Generate benchmark questions\n",
+    "\n",
+    "For real corpora, KB Arena auto-generates questions across five difficulty tiers via `kb-arena generate-questions`. For this minimal walkthrough we hand-author a tiny `questions.yaml` so the notebook stays fully offline."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "questions_yaml = '''questions:\n",
+    "  - id: q1\n",
+    "    text: What is the maximum execution timeout for an AWS Lambda invocation?\n",
+    "    tier: 1\n",
+    "    expected_docs: [lambda]\n",
+    "  - id: q2\n",
+    "    text: Which pricing models does Amazon EC2 support?\n",
+    "    tier: 1\n",
+    "    expected_docs: [ec2]\n",
+    "  - id: q3\n",
+    "    text: How does Fargate differ from Lambda for long-running container workloads?\n",
+    "    tier: 3\n",
+    "    expected_docs: [fargate, lambda]\n",
+    "'''\n",
+    "\n",
+    "Path(\"datasets/aws-mini/questions\").mkdir(parents=True, exist_ok=True)\n",
+    "Path(\"datasets/aws-mini/questions/questions.yaml\").write_text(questions_yaml)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 4 — Run KB Arena's retriever-lab on BM25\n",
+    "\n",
+    "`retriever-lab` is the retrieval-only benchmark (no generation, no LLM judge). It computes IR metrics per question and aggregates them with paired-bootstrap 95% confidence intervals."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!kb-arena retriever-lab --corpus aws-mini --strategies bm25 --top-ks 3,5"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 5 — Inspect the IR metrics\n",
+    "\n",
+    "Results land in `results/run_<id>/retriever_lab.json`. The per-strategy summary carries Recall@k, Precision@k, MRR, NDCG (binary + graded), MAP, R-Precision, and bpref, plus bootstrap CIs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from pathlib import Path\n",
+    "\n",
+    "run_dir = max(Path(\"results\").glob(\"run_*\"), key=lambda p: p.stat().st_mtime)  # newest run by modification time, not by name order\n",
+    "summary = json.loads((run_dir / \"retriever_lab.json\").read_text())\n",
+    "\n",
+    "for strategy, metrics in summary.get(\"per_strategy\", {}).items():\n",
+    "    print(f\"\\n{strategy}\")\n",
+    "    for key in (\"recall_at_k\", \"ndcg_at_k\", \"mrr\", \"map\"):\n",
+    "        if key in metrics:\n",
+    "            print(f\"  {key}: {metrics[key]}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 6 — Wire the winning strategy into a Haystack pipeline\n",
+    "\n",
+    "Once KB Arena shows which strategy wins on your corpus, judged by confidence intervals rather than by a difference in means alone, use the equivalent Haystack component. KB Arena's BM25 strategy corresponds to Haystack's `InMemoryBM25Retriever`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from haystack import Document, Pipeline\n",
+    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
+    "from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n",
+    "\n",
+    "docs = [Document(content=p.read_text(), meta={\"name\": p.stem}) for p in sorted(corpus_dir.glob(\"*.md\"))]  # Markdown files only, in a stable order\n",
+    "store = InMemoryDocumentStore()\n",
+    "store.write_documents(docs)\n",
+    "\n",
+    "pipe = Pipeline()\n",
+    "pipe.add_component(\"retriever\", InMemoryBM25Retriever(document_store=store, top_k=3))\n",
+    "\n",
+    "result = pipe.run({\"retriever\": {\"query\": \"What is the maximum execution timeout for an AWS Lambda invocation?\"}})\n",
+    "for doc in result[\"retriever\"][\"documents\"]:\n",
+    "    print(f\"{doc.meta['name']}  score={doc.score:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Next steps\n",
+    "\n",
+    "- Swap in your real corpus: place its files in `datasets/<your-corpus>/raw/`, then re-run from Step 2.\n",
+    "- Run `kb-arena generate-questions --corpus <your-corpus> --count 50` to auto-generate questions across five difficulty tiers.\n",
+    "- Run `kb-arena label-chunks --corpus <your-corpus>` to get chunk-level ground truth via BM25 + Haiku judge.\n",
+    "- Add embedding and Neo4j configuration to compare BM25 against vector, knowledge graph, hybrid, RAPTOR, PageIndex, and rerank strategies.\n",
+    "- Use `kb-arena optimize` to search across strategies and top-k values with bootstrap CIs, Wilcoxon p-values, and a Pareto frontier across (NDCG, latency)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}