jmponcebe · Apr 18, 2026
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -25,7 +25,8 @@ All three development phases are finished. The system is fully operational end-t
 | Observability | Complete | `src/pharmagraphrag/observability.py` (Langfuse tracing) |
 | Docker Compose | Complete | `docker-compose.yml` + `docker/` |
 | CI/CD | Complete | `.github/workflows/ci.yml` + `deploy.yml` |
-| Tests | 221 passing | `tests/` |
+| Evaluation | Complete | `src/pharmagraphrag/evaluation/` (RAGAS metrics, agent eval, curated testset) |
+| Tests | 263 passing | `tests/` |
 | Cloud Deployment | Live | Streamlit Cloud + Cloud Run + Neo4j Aura |
 
 ### Data at a Glance
@@ -102,7 +103,8 @@ FDA FAERS (CSV) + DailyMed (API)
 - **UI**: Streamlit 1.54+ with streamlit-agraph, pyvis, plotly
 - **Containers**: Docker Compose (Neo4j + API + UI + optional Ollama)
 - **CI/CD**: GitHub Actions (ci.yml: lint+test on push; deploy.yml: CD on v* tags via Cloud Build)
-- **Testing**: pytest (221 tests passing)
+- **Evaluation**: RAGAS 0.4.3 (Faithfulness, Relevancy, Precision, Recall, Correctness) + custom agent tool accuracy
+- **Testing**: pytest (263 tests passing)
 - **CI/CD**: GitHub Actions (ci.yml: lint + test matrix 3.11/3.13; deploy.yml: v* tags → Cloud Build → Cloud Run)
 - **Cloud Build**: Google Cloud Build (cloudbuild.yaml) — downloads ChromaDB from GCS, builds Docker, deploys
 - **Object Storage**: Google Cloud Storage (gs://pharmagraphrag-data for ChromaDB snapshots)
@@ -170,10 +172,19 @@ PharmaGraphRAG/
 |   |   +-- __init__.py
 |   |   +-- main.py                # FastAPI app: POST /query, POST /agent/query, POST /agent/multi, GET /drug/{name}, GET /health
 |   |   +-- models.py              # Pydantic v2 request/response schemas (incl. AgentQueryRequest/Response)
+|   +-- evaluation/
+|   |   +-- __init__.py
+|   |   +-- metrics.py             # RAGAS metric wrappers (Faithfulness, Relevancy, Precision, Recall, Correctness)
+|   |   +-- dataset.py             # Curated testset loader, EvalSample/EvalDataset
+|   |   +-- runner.py              # Batch evaluation runner (calls API, computes RAGAS scores, exports CSV)
+|   |   +-- agent_eval.py          # Agent tool selection accuracy (precision/recall/F1)
 |   +-- ui/
 |       +-- __init__.py
 |       +-- app.py                 # Streamlit chat: clickable follow-ups, confidence tooltips, pipeline steps (classic), nested sub-agent reasoning (multi)
 |       +-- components.py          # Graph viz, sources panel, drug explorer
++-- data/
+|   +-- evaluation/
+|       +-- testset.json           # 25 curated evaluation questions (8 types, ground truth, expected tools)
 +-- tests/
 |   +-- __init__.py
 |   +-- test_download_faers.py     # 2 tests
@@ -186,9 +197,11 @@ PharmaGraphRAG/
 |   +-- test_ui.py                 # 14 tests (Streamlit components + session state)
 |   +-- test_agent.py              # 61 tests (9 tools, AgentResponse, StructuredResponse, multi-agent, endpoints)
 |   +-- test_observability.py      # 13 tests (Langfuse init, callbacks, decorator, graceful degradation)
+|   +-- test_evaluation.py         # 42 tests (dataset, metrics, runner, agent eval, all mocked)
 +-- scripts/
 |   +-- load_vectorstore.py        # One-off: populate ChromaDB
 |   +-- validate_search.py         # One-off: test semantic search queries
+|   +-- run_evaluation.py          # Batch eval: --mode classic|agent|multi|all, exports CSV reports
 |   +-- setup_demo.py              # Demo setup: load graph + embeddings (~3 min)
 |   +-- migrate_neo4j.py           # Migrate data between Neo4j instances
 +-- docker/
@@ -287,7 +300,7 @@ PharmaGraphRAG/
 - .gitignore: data/raw/, data/processed/, data/chroma/, .env, __pycache__, .pytest_cache
 - **Deploy rule**: NEVER create version tags or trigger deployments without explicit user confirmation. Commits and pushes to main are fine; tags (v*) require user approval.
 
-### Testing (208 tests)
+### Testing (263 tests)
 - pytest with fixtures for sample data and mocked services
 - Mock Neo4j driver for graph tests
 - Mock LLM API calls (never call real API in tests)
@@ -307,7 +320,17 @@ PharmaGraphRAG/
 | test_ui.py | 14 | Streamlit components, session state |
 | test_agent.py | 61 | 9 tools, AgentResponse, StructuredResponse, multi-agent supervisor, model selector, endpoints |
 | test_observability.py | 13 | Langfuse init, callback handler, config builder, decorator, trace generation, flush |
-| **Total** | **221** | |
+| test_evaluation.py | 42 | RAGAS metrics, dataset loading, runner, agent tool eval, call_agent parsing, CSV export |
+| **Total** | **263** | |
+
+### Evaluation (RAGAS)
+- **Framework**: RAGAS 0.4.3 with Gemini via OpenAI-compatible endpoint
+- **Curated testset**: 25 questions across 8 types (drug_info, interaction, adverse_event, outcome, category, comparison, multi_drug, label_search)
+- **Reference-free metrics**: Faithfulness, Answer Relevancy
+- **Reference-based metrics**: Context Precision, Context Recall, Answer Correctness
+- **Agent evaluation**: Custom tool selection accuracy (precision/recall/F1), goal achievement tracking
+- **Batch runner**: Calls API endpoints (classic/agent/multi), computes metrics, exports CSV
+- **Script**: `scripts/run_evaluation.py --mode all --api-url http://localhost:8000`
 
 ## Key Design Decisions
 

diff --git a/README.md b/README.md
@@ -3,7 +3,7 @@
 [![CI](https://github.com/jmponcebe/PharmaGraphRAG/actions/workflows/ci.yml/badge.svg)](https://github.com/jmponcebe/PharmaGraphRAG/actions/workflows/ci.yml)
 [![CD](https://github.com/jmponcebe/PharmaGraphRAG/actions/workflows/deploy.yml/badge.svg)](https://github.com/jmponcebe/PharmaGraphRAG/actions/workflows/deploy.yml)
 [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
-[![Tests](https://img.shields.io/badge/tests-221%20passing-brightgreen.svg)](#testing)
+[![Tests](https://img.shields.io/badge/tests-263%20passing-brightgreen.svg)](#testing)
 [![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://docs.astral.sh/ruff/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
 [![Live Demo](https://img.shields.io/badge/demo-pharmagraphrag.streamlit.app-FF4B4B.svg)](https://pharmagraphrag.streamlit.app)
@@ -66,7 +66,7 @@ A production-ready question-answering system that combines a **pharmaceutical kn
 - **Agent Mode**: LangGraph ReAct agent that autonomously decides which tools to call (9 tools: drug info, adverse events, interactions, labels, drug search, event search, outcomes, comparison, categories) based on the question. Includes conversation memory, structured output (confidence + follow-ups), multi-agent supervisor with 3 specialized experts, per-query model selector (Flash for agents, Pro for supervisor), response caching and graceful fallback to classic pipeline
 - **Transparent UI**: clickable follow-up suggestions, confidence level tooltips, pipeline steps expander (classic mode), nested sub-agent reasoning hierarchy (multi-agent mode)
 - **Real FDA data**: 816K adverse event reports, 4,998 drugs, 365K causal relationships, 88 drug labels
-- **221 tests** with CI/CD on GitHub Actions (Python 3.11 + 3.13 matrix)
+- **263 tests** with CI/CD on GitHub Actions (Python 3.11 + 3.13 matrix)
 - **Full stack**: data pipeline → knowledge graph → vector store → query engine → REST API → chat UI
 - **One-click Codespaces**: try it instantly from your browser
 
@@ -75,7 +75,7 @@ A production-ready question-answering system that combines a **pharmaceutical kn
 > *"What are the side effects of ibuprofen?"* · *"Does metformin interact with other drugs?"* · *"Compare the safety profiles of aspirin and clopidogrel"* · *"What drugs cause liver damage?"*
 
 <details>
-<summary><strong>Component Status</strong> — all modules complete, 221 tests passing</summary>
+<summary><strong>Component Status</strong> — all modules complete, 263 tests passing</summary>
 
 | Component | Status | Details |
 | --- | --- | --- |
@@ -89,7 +89,8 @@ A production-ready question-answering system that combines a **pharmaceutical kn
 | Docker Compose | ✅ Complete | Neo4j + API + UI + Ollama (optional profile) |
 | CI/CD | ✅ Complete | GitHub Actions: lint, test matrix (3.11/3.13), Docker build |
 | Agent Mode | ✅ Complete | LangGraph ReAct agent with 9 tools, conversation memory, structured output, multi-agent supervisor, nested reasoning |
-| Tests | ✅ 221 passing | Data (29) + vectors (35) + engine (37) + LLM (14) + API (18) + UI (14) + agent (61) + observability (13) |
+| Evaluation | ✅ Complete | RAGAS 0.4.3 (Faithfulness, Relevancy, Precision, Recall, Correctness) + agent tool accuracy (P/R/F1) |
+| Tests | ✅ 263 passing | Data (29) + vectors (35) + engine (37) + LLM (14) + API (18) + UI (14) + agent (61) + observability (13) + evaluation (42) |
 
 </details>
 
@@ -231,7 +232,7 @@ In Agent Mode, the LLM autonomously decides which tools to call:
 | UI | Streamlit + streamlit-agraph (graph visualization) |
 | Containers | Docker Compose (multi-stage, non-root, healthchecks) |
 | CI/CD | GitHub Actions (CI: lint + test matrix; CD: v* tags → Cloud Build → Cloud Run) |
-| Testing | pytest (221 tests, mocked services) |
+| Testing | pytest (263 tests, mocked services) |
 | Linting | ruff (check + format) |
 
 ## Data Sources
@@ -398,11 +399,28 @@ gcloud run deploy pharmagraphrag-api --image gcr.io/<project>/pharmagraphrag-api
 ### Testing
 
 ```bash
-uv run pytest               # Run all 221 tests
+uv run pytest               # Run all 263 tests
 uv run pytest -v             # Verbose output
 uv run pytest tests/test_engine.py  # Specific module
 ```
 
+### Evaluation (RAGAS)
+
+Automated quality evaluation using [RAGAS](https://docs.ragas.io/) metrics against a curated testset of 25 questions (8 types: drug info, interactions, adverse events, outcomes, categories, comparisons, multi-drug, label search).
+
+```bash
+# Evaluate classic pipeline against local API
+python scripts/run_evaluation.py --mode classic --api-url http://localhost:8000
+
+# Evaluate all modes (classic + agent + multi)
+python scripts/run_evaluation.py --mode all --api-url http://localhost:8000
+
+# Against production
+python scripts/run_evaluation.py --mode all --api-url https://pharmagraphrag-api-893694384146.us-central1.run.app
+```
+
+**Metrics**: Faithfulness, Answer Relevancy (reference-free) + Context Precision, Context Recall, Answer Correctness (reference-based) + Agent tool selection accuracy (precision/recall/F1).
+
 <details>
 <summary>📋 Linting, formatting & type checking</summary>
 
@@ -445,6 +463,11 @@ src/pharmagraphrag/
 ├── api/                    # REST API
 │   ├── main.py                 # FastAPI app (POST /query, POST /agent/query, POST /agent/multi, GET /drug, GET /health)
 │   └── models.py               # Pydantic v2 request/response schemas
+├── evaluation/             # RAGAS evaluation framework
+│   ├── metrics.py              # RAGAS metric wrappers (Faithfulness, Relevancy, Precision, Recall, Correctness)
+│   ├── dataset.py              # Curated testset loader, EvalSample/EvalDataset
+│   ├── runner.py               # Batch evaluation runner (calls API, computes RAGAS scores, exports CSV)
+│   └── agent_eval.py           # Agent tool selection accuracy (precision/recall/F1)
 └── ui/                     # Chat interface
     ├── app.py                  # Streamlit app (chat, sidebar, settings)
     └── components.py           # Graph visualization, sources panel, drug explorer
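
Note on the agent tool selection accuracy (precision/recall/F1) described for `agent_eval.py`: the diff does not show how the scores are computed, so the following is only a minimal sketch, assuming a set-based comparison between the tools a testset question expects and the tools the agent actually called. The function name `tool_selection_scores` and the tool names in the usage example are hypothetical and are not taken from the repository.

```python
def tool_selection_scores(expected: set[str], called: set[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 for agent tool selection (illustrative only).

    expected: tools the testset says should be used (assumed field).
    called:   tools the agent actually invoked (assumed field).
    """
    true_positives = len(expected & called)
    # Precision: share of called tools that were expected.
    precision = true_positives / len(called) if called else 0.0
    # Recall: share of expected tools that were actually called.
    recall = true_positives / len(expected) if expected else 0.0
    # F1: harmonic mean, defined as 0.0 when both precision and recall are 0.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical tool names, for illustration only.
scores = tool_selection_scores(
    expected={"drug_info", "adverse_events"},
    called={"drug_info", "interactions"},
)
print(scores)
```

Treating tool calls as sets ignores call order and repeated calls. Whether the actual implementation does the same is not stated in this diff.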