diff --git a/index.toml b/index.toml
index 7eeaf19..775ee91 100644
--- a/index.toml
+++ b/index.toml
@@ -370,3 +370,9 @@ title = "Tabular Data Processing with Prior Labs MCP"
 notebook = "prior_labs_agent.ipynb"
 new = true
 topics = ["Agents", "MCP", "Data Processing"]
+
+[[cookbook]]
+title = "Benchmark Retrieval Strategies with KB Arena before wiring into Haystack"
+notebook = "benchmark_retrieval_strategies_kb_arena.ipynb"
+new = true
+topics = ["Evaluation", "Advanced Retrieval", "RAG"]
diff --git a/notebooks/benchmark_retrieval_strategies_kb_arena.ipynb b/notebooks/benchmark_retrieval_strategies_kb_arena.ipynb
new file mode 100644
index 0000000..dcbb4c4
--- /dev/null
+++ b/notebooks/benchmark_retrieval_strategies_kb_arena.ipynb
@@ -0,0 +1,256 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Benchmark Retrieval Strategies with KB Arena before wiring into Haystack\n",
+    "\n",
+    "[KB Arena](https://github.com/xmpuspus/kb-arena) is an open-source benchmark that runs nine architecturally distinct retrieval strategies head-to-head on your own documentation and reports IR metrics with statistical confidence intervals.\n",
+    "\n",
+    "Before you commit to a Haystack pipeline using one specific retrieval approach, KB Arena lets you ask: *for my corpus, which strategy actually wins?* This notebook walks through running the benchmark on a small example corpus, reading the results, and wiring the winning strategy into a Haystack `Pipeline`.\n",
+    "\n",
+    "Strategies covered: naive vector, contextual vector, QnA pairs, knowledge graph (Neo4j), hybrid (RRF-fused), RAPTOR, PageIndex, BM25, rerank-vector (cross-encoder). The full benchmark requires Neo4j and embedding-provider API keys; this notebook uses BM25 only so it runs end-to-end with no external services."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notebook by [*Xavier Puspus*](https://github.com/xmpuspus)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "No API keys are required for the BM25-only path shown here. To unlock the full nine-strategy benchmark, see the KB Arena [README](https://github.com/xmpuspus/kb-arena) for setting `KB_ARENA_ANTHROPIC_API_KEY` (or any other supported provider) and `KB_ARENA_NEO4J_URI`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Install dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install kb-arena haystack-ai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1: Prepare a small example corpus\n",
+    "\n",
+    "KB Arena ingests Markdown, HTML, PDF, DOCX, plaintext, CSV, and a few other formats from a directory. For this walkthrough we drop three short AWS-flavored Markdown files into a fresh directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "corpus_dir = Path(\"raw\")\n",
+    "corpus_dir.mkdir(exist_ok=True)\n",
+    "\n",
+    "(corpus_dir / \"lambda.md\").write_text(\"\"\"# AWS Lambda\n",
+    "\n",
+    "AWS Lambda runs code in response to events without provisioning servers. It scales\n",
+    "automatically and you pay only for the compute time consumed. Common triggers\n",
+    "include S3 uploads, DynamoDB streams, API Gateway requests, and scheduled\n",
+    "EventBridge rules. Memory is configurable from 128 MB to 10,240 MB; CPU scales\n",
+    "with memory. The maximum execution timeout is 15 minutes per invocation.\n",
+    "\"\"\")\n",
+    "\n",
+    "(corpus_dir / \"ec2.md\").write_text(\"\"\"# Amazon EC2\n",
+    "\n",
+    "Amazon EC2 provides resizable compute capacity in the cloud through virtual\n",
+    "machines called instances. Instance types are grouped into families optimized\n",
+    "for compute, memory, storage, or accelerated computing. Pricing models include\n",
+    "On-Demand, Reserved, Savings Plans, and Spot. EBS volumes provide persistent\n",
+    "block storage attached to instances; instance store provides ephemeral local\n",
+    "storage.\n",
+    "\"\"\")\n",
+    "\n",
+    "(corpus_dir / \"fargate.md\").write_text(\"\"\"# AWS Fargate\n",
+    "\n",
+    "AWS Fargate is a serverless compute engine for containers that works with both\n",
+    "Amazon ECS and Amazon EKS. You define container images, CPU and memory, and\n",
+    "Fargate provisions the underlying infrastructure for you. Tasks can run for as\n",
+    "long as needed and you pay per second of vCPU and memory consumed. Unlike\n",
+    "Lambda, Fargate has no 15-minute execution cap.\n",
+    "\"\"\")\n",
+    "\n",
+    "print(sorted(p.name for p in corpus_dir.iterdir()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2: Initialize the corpus and ingest documents\n",
+    "\n",
+    "KB Arena treats every corpus as an isolated workspace under `datasets//`. `init-corpus` creates the directory layout; `ingest` parses the source files into a JSONL `Document` representation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!kb-arena init-corpus aws-mini\n",
+    "!cp raw/*.md datasets/aws-mini/raw/\n",
+    "!kb-arena ingest datasets/aws-mini/raw/ --corpus aws-mini"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3: Generate benchmark questions\n",
+    "\n",
+    "For real corpora, KB Arena auto-generates questions across five difficulty tiers via `kb-arena generate-questions`. For this minimal walkthrough we hand-author a tiny `questions.yaml` so the notebook stays fully offline."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "questions_yaml = '''questions:\n",
+    "  - id: q1\n",
+    "    text: What is the maximum execution timeout for an AWS Lambda invocation?\n",
+    "    tier: 1\n",
+    "    expected_docs: [lambda]\n",
+    "  - id: q2\n",
+    "    text: Which pricing models does Amazon EC2 support?\n",
+    "    tier: 1\n",
+    "    expected_docs: [ec2]\n",
+    "  - id: q3\n",
+    "    text: How does Fargate differ from Lambda for long-running container workloads?\n",
+    "    tier: 3\n",
+    "    expected_docs: [fargate, lambda]\n",
+    "'''\n",
+    "\n",
+    "Path(\"datasets/aws-mini/questions\").mkdir(parents=True, exist_ok=True)\n",
+    "Path(\"datasets/aws-mini/questions/questions.yaml\").write_text(questions_yaml)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 4: Run KB Arena's retriever-lab on BM25\n",
+    "\n",
+    "`retriever-lab` is the retrieval-only benchmark (no generation, no LLM judge). It computes IR metrics per question and aggregates with paired-bootstrap 95% confidence intervals."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!kb-arena retriever-lab --corpus aws-mini --strategies bm25 --top-ks 3,5"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 5: Inspect the IR metrics\n",
+    "\n",
+    "Results land in `results/run_/retriever_lab.json`. The per-strategy summary carries Recall@k, Precision@k, MRR, NDCG (binary + graded), MAP, R-Precision, and bpref, plus bootstrap CIs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from pathlib import Path\n",
+    "\n",
+    "run_dir = sorted(Path(\"results\").glob(\"run_*\"))[-1]\n",
+    "summary = json.loads((run_dir / \"retriever_lab.json\").read_text())\n",
+    "\n",
+    "for strategy, metrics in summary.get(\"per_strategy\", {}).items():\n",
+    "    print(f\"\\n{strategy}\")\n",
+    "    for key in (\"recall_at_k\", \"ndcg_at_k\", \"mrr\", \"map\"):\n",
+    "        if key in metrics:\n",
+    "            print(f\" {key}: {metrics[key]}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 6: Wire the winning strategy into a Haystack pipeline\n",
+    "\n",
+    "Once KB Arena tells you which strategy wins on your corpus (with statistical confidence, not just mean delta), reach for the equivalent Haystack component. BM25 in KB Arena maps to `InMemoryBM25Retriever` here."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from haystack import Document, Pipeline\n",
+    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
+    "from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n",
+    "\n",
+    "docs = [Document(content=p.read_text(), meta={\"name\": p.stem}) for p in corpus_dir.iterdir()]\n",
+    "store = InMemoryDocumentStore()\n",
+    "store.write_documents(docs)\n",
+    "\n",
+    "pipe = Pipeline()\n",
+    "pipe.add_component(\"retriever\", InMemoryBM25Retriever(document_store=store, top_k=3))\n",
+    "\n",
+    "result = pipe.run({\"retriever\": {\"query\": \"What is the maximum execution timeout for an AWS Lambda invocation?\"}})\n",
+    "for doc in result[\"retriever\"][\"documents\"]:\n",
+    "    print(f\"{doc.meta['name']} score={doc.score:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Next steps\n",
+    "\n",
+    "- Replace the three Markdown files with your real corpus, drop them in `datasets//raw/`, and re-run.\n",
+    "- Run `kb-arena generate-questions --corpus --count 50` to auto-generate questions across five difficulty tiers.\n",
+    "- Run `kb-arena label-chunks --corpus ` to get chunk-level ground truth via BM25 + Haiku judge.\n",
+    "- Add embedding and Neo4j configuration to compare BM25 against vector, knowledge graph, hybrid, RAPTOR, PageIndex, and rerank strategies.\n",
+    "- Use `kb-arena optimize` to search across strategies and top-k values with bootstrap CIs, Wilcoxon p-values, and a Pareto frontier across (NDCG, latency)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}