Merged
31 changes: 27 additions & 4 deletions .github/copilot-instructions.md
@@ -25,7 +25,8 @@ All three development phases are finished. The system is fully operational end-t
| Observability | Complete | `src/pharmagraphrag/observability.py` (Langfuse tracing) |
| Docker Compose | Complete | `docker-compose.yml` + `docker/` |
| CI/CD | Complete | `.github/workflows/ci.yml` + `deploy.yml` |
| Tests | 221 passing | `tests/` |
| Evaluation | Complete | `src/pharmagraphrag/evaluation/` (RAGAS metrics, agent eval, curated testset) |
| Tests | 263 passing | `tests/` |
| Cloud Deployment | Live | Streamlit Cloud + Cloud Run + Neo4j Aura |

### Data at a Glance
@@ -102,7 +103,8 @@ FDA FAERS (CSV) + DailyMed (API)
- **UI**: Streamlit 1.54+ with streamlit-agraph, pyvis, plotly
- **Containers**: Docker Compose (Neo4j + API + UI + optional Ollama)
- **CI/CD**: GitHub Actions (ci.yml: lint+test on push; deploy.yml: CD on v* tags via Cloud Build)
- **Testing**: pytest (221 tests passing)
- **Evaluation**: RAGAS 0.4.3 (Faithfulness, Relevancy, Precision, Recall, Correctness) + custom agent tool accuracy
- **Testing**: pytest (263 tests passing)
- **CI/CD**: GitHub Actions (ci.yml: lint + test matrix 3.11/3.13; deploy.yml: v* tags → Cloud Build → Cloud Run)
- **Cloud Build**: Google Cloud Build (cloudbuild.yaml) — downloads ChromaDB from GCS, builds Docker, deploys
- **Object Storage**: Google Cloud Storage (gs://pharmagraphrag-data for ChromaDB snapshots)
@@ -170,10 +172,19 @@ PharmaGraphRAG/
| | +-- __init__.py
| | +-- main.py # FastAPI app: POST /query, POST /agent/query, POST /agent/multi, GET /drug/{name}, GET /health
| | +-- models.py # Pydantic v2 request/response schemas (incl. AgentQueryRequest/Response)
| +-- evaluation/
| | +-- __init__.py
| | +-- metrics.py # RAGAS metric wrappers (Faithfulness, Relevancy, Precision, Recall, Correctness)
| | +-- dataset.py # Curated testset loader, EvalSample/EvalDataset
| | +-- runner.py # Batch evaluation runner (calls API, computes RAGAS scores, exports CSV)
| | +-- agent_eval.py # Agent tool selection accuracy (precision/recall/F1)
| +-- ui/
| +-- __init__.py
| +-- app.py # Streamlit chat: clickable follow-ups, confidence tooltips, pipeline steps (classic), nested sub-agent reasoning (multi)
| +-- components.py # Graph viz, sources panel, drug explorer
+-- data/
| +-- evaluation/
| +-- testset.json # 25 curated evaluation questions (8 types, ground truth, expected tools)
+-- tests/
| +-- __init__.py
| +-- test_download_faers.py # 2 tests
@@ -186,9 +197,11 @@ PharmaGraphRAG/
| +-- test_ui.py # 14 tests (Streamlit components + session state)
| +-- test_agent.py # 61 tests (9 tools, AgentResponse, StructuredResponse, multi-agent, endpoints)
| +-- test_observability.py # 13 tests (Langfuse init, callbacks, decorator, graceful degradation)
| +-- test_evaluation.py # 42 tests (dataset, metrics, runner, agent eval, all mocked)
+-- scripts/
| +-- load_vectorstore.py # One-off: populate ChromaDB
| +-- validate_search.py # One-off: test semantic search queries
| +-- run_evaluation.py # Batch eval: --mode classic|agent|multi|all, exports CSV reports
| +-- setup_demo.py # Demo setup: load graph + embeddings (~3 min)
| +-- migrate_neo4j.py # Migrate data between Neo4j instances
+-- docker/
@@ -287,7 +300,7 @@ PharmaGraphRAG/
- .gitignore: data/raw/, data/processed/, data/chroma/, .env, __pycache__, .pytest_cache
- **Deploy rule**: NEVER create version tags or trigger deployments without explicit user confirmation. Commits and pushes to main are fine; tags (v*) require user approval.

### Testing (208 tests)
### Testing (263 tests)
- pytest with fixtures for sample data and mocked services
- Mock Neo4j driver for graph tests
- Mock LLM API calls (never call real API in tests)
@@ -307,7 +320,17 @@ PharmaGraphRAG/
| test_ui.py | 14 | Streamlit components, session state |
| test_agent.py | 61 | 9 tools, AgentResponse, StructuredResponse, multi-agent supervisor, model selector, endpoints |
| test_observability.py | 13 | Langfuse init, callback handler, config builder, decorator, trace generation, flush |
| **Total** | **221** | |
| test_evaluation.py | 42 | RAGAS metrics, dataset loading, runner, agent tool eval, call_agent parsing, CSV export |
| **Total** | **263** | |

### Evaluation (RAGAS)
- **Framework**: RAGAS 0.4.3 with Gemini via OpenAI-compatible endpoint
- **Curated testset**: 25 questions across 8 types (drug_info, interaction, adverse_event, outcome, category, comparison, multi_drug, label_search)
- **Reference-free metrics**: Faithfulness, Answer Relevancy
- **Reference-based metrics**: Context Precision, Context Recall, Answer Correctness
- **Agent evaluation**: Custom tool selection accuracy (precision/recall/F1), goal achievement tracking
- **Batch runner**: Calls API endpoints (classic/agent/multi), computes metrics, exports CSV
- **Script**: `scripts/run_evaluation.py --mode all --api-url http://localhost:8000`
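
The agent tool selection accuracy listed above can be pictured as a set comparison between the tools a question is expected to trigger and the tools the agent actually called. The following is a minimal sketch under that assumption, not the code in `agent_eval.py`: the names `ToolSelectionScore` and `score_tool_selection`, the tool names in the example, and the handling of empty sets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSelectionScore:
    """Precision, recall and F1 for one evaluated question (hypothetical type)."""

    precision: float
    recall: float
    f1: float


def score_tool_selection(expected: set[str], called: set[str]) -> ToolSelectionScore:
    """Compare expected tool names with the tool names the agent called."""
    if not expected and not called:
        # Assumption: no tools expected and none called counts as a perfect score.
        return ToolSelectionScore(1.0, 1.0, 1.0)

    hits = len(expected & called)  # tools that were both expected and called
    precision = hits / len(called) if called else 0.0
    recall = hits / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return ToolSelectionScore(precision, recall, f1)


# Hypothetical tool names, for illustration only.
score = score_tool_selection(
    expected={"get_drug_info", "get_adverse_events"},
    called={"get_drug_info", "search_labels"},
)
print(score)
```

In this example one of the two called tools was expected and one of the two expected tools was called, so precision, recall and F1 all come out at 0.5.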

## Key Design Decisions

35 changes: 29 additions & 6 deletions README.md
@@ -3,7 +3,7 @@
[![CI](https://github.com/jmponcebe/PharmaGraphRAG/actions/workflows/ci.yml/badge.svg)](https://github.com/jmponcebe/PharmaGraphRAG/actions/workflows/ci.yml)
[![CD](https://github.com/jmponcebe/PharmaGraphRAG/actions/workflows/deploy.yml/badge.svg)](https://github.com/jmponcebe/PharmaGraphRAG/actions/workflows/deploy.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://img.shields.io/badge/tests-221%20passing-brightgreen.svg)](#testing)
[![Tests](https://img.shields.io/badge/tests-263%20passing-brightgreen.svg)](#testing)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://docs.astral.sh/ruff/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Live Demo](https://img.shields.io/badge/demo-pharmagraphrag.streamlit.app-FF4B4B.svg)](https://pharmagraphrag.streamlit.app)
@@ -66,7 +66,7 @@ A production-ready question-answering system that combines a **pharmaceutical kn
- **Agent Mode**: LangGraph ReAct agent that autonomously decides which tools to call (9 tools: drug info, adverse events, interactions, labels, drug search, event search, outcomes, comparison, categories) based on the question. Includes conversation memory, structured output (confidence + follow-ups), multi-agent supervisor with 3 specialized experts, per-query model selector (Flash for agents, Pro for supervisor), response caching and graceful fallback to classic pipeline
- **Transparent UI**: clickable follow-up suggestions, confidence level tooltips, pipeline steps expander (classic mode), nested sub-agent reasoning hierarchy (multi-agent mode)
- **Real FDA data**: 816K adverse event reports, 4,998 drugs, 365K causal relationships, 88 drug labels
- **221 tests** with CI/CD on GitHub Actions (Python 3.11 + 3.13 matrix)
- **263 tests** with CI/CD on GitHub Actions (Python 3.11 + 3.13 matrix)
- **Full stack**: data pipeline → knowledge graph → vector store → query engine → REST API → chat UI
- **One-click Codespaces**: try it instantly from your browser

@@ -75,7 +75,7 @@ A production-ready question-answering system that combines a **pharmaceutical kn
> *"What are the side effects of ibuprofen?"* · *"Does metformin interact with other drugs?"* · *"Compare the safety profiles of aspirin and clopidogrel"* · *"What drugs cause liver damage?"*

<details>
<summary><strong>Component Status</strong> — all modules complete, 221 tests passing</summary>
<summary><strong>Component Status</strong> — all modules complete, 263 tests passing</summary>

| Component | Status | Details |
| --- | --- | --- |
@@ -89,7 +89,8 @@ A production-ready question-answering system that combines a **pharmaceutical kn
| Docker Compose | ✅ Complete | Neo4j + API + UI + Ollama (optional profile) |
| CI/CD | ✅ Complete | GitHub Actions: lint, test matrix (3.11/3.13), Docker build |
| Agent Mode | ✅ Complete | LangGraph ReAct agent with 9 tools, conversation memory, structured output, multi-agent supervisor, nested reasoning |
| Tests | ✅ 221 passing | Data (29) + vectors (35) + engine (37) + LLM (14) + API (18) + UI (14) + agent (61) + observability (13) |
| Evaluation | ✅ Complete | RAGAS 0.4.3 (Faithfulness, Relevancy, Precision, Recall, Correctness) + agent tool accuracy (P/R/F1) |
| Tests | ✅ 263 passing | Data (29) + vectors (35) + engine (37) + LLM (14) + API (18) + UI (14) + agent (61) + observability (13) + evaluation (42) |

</details>

@@ -231,7 +232,7 @@ In Agent Mode, the LLM autonomously decides which tools to call:
| UI | Streamlit + streamlit-agraph (graph visualization) |
| Containers | Docker Compose (multi-stage, non-root, healthchecks) |
| CI/CD | GitHub Actions (CI: lint + test matrix; CD: v* tags → Cloud Build → Cloud Run) |
| Testing | pytest (221 tests, mocked services) |
| Testing | pytest (263 tests, mocked services) |
| Linting | ruff (check + format) |

## Data Sources
@@ -398,11 +399,28 @@ gcloud run deploy pharmagraphrag-api --image gcr.io/<project>/pharmagraphrag-api
### Testing

```bash
uv run pytest # Run all 221 tests
uv run pytest # Run all 263 tests
uv run pytest -v # Verbose output
uv run pytest tests/test_engine.py # Specific module
```

### Evaluation (RAGAS)

Automated quality evaluation using [RAGAS](https://docs.ragas.io/) metrics against a curated testset of 25 questions (8 types: drug info, interactions, adverse events, outcomes, categories, comparisons, multi-drug, label search).

```bash
# Evaluate classic pipeline against local API
python scripts/run_evaluation.py --mode classic --api-url http://localhost:8000

# Evaluate all modes (classic + agent + multi)
python scripts/run_evaluation.py --mode all --api-url http://localhost:8000

# Against production
python scripts/run_evaluation.py --mode all --api-url https://pharmagraphrag-api-893694384146.us-central1.run.app
```

**Metrics**: Faithfulness, Answer Relevancy (reference-free) + Context Precision, Context Recall, Answer Correctness (reference-based) + Agent tool selection accuracy (precision/recall/F1).
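
As a rough illustration of how `--mode all` could fan out over the three pipelines, the sketch below parses the two flags shown above and resolves each mode to a target URL. It is not the contents of `scripts/run_evaluation.py`. The routes are the ones the API exposes, but pairing each mode with a route is an assumption, and `MODE_ENDPOINTS`, `build_parser` and `resolve_targets` are hypothetical names.

```python
import argparse

# Assumption: each evaluation mode targets one POST route of the API.
MODE_ENDPOINTS = {
    "classic": "/query",
    "agent": "/agent/query",
    "multi": "/agent/multi",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the --mode and --api-url flags (sketch only)."""
    parser = argparse.ArgumentParser(description="Batch evaluation (sketch)")
    parser.add_argument("--mode", choices=[*MODE_ENDPOINTS, "all"], default="classic")
    parser.add_argument("--api-url", default="http://localhost:8000")
    return parser


def resolve_targets(mode: str, api_url: str) -> dict[str, str]:
    """Expand 'all' into every mode and join each route onto the base URL."""
    modes = list(MODE_ENDPOINTS) if mode == "all" else [mode]
    base = api_url.rstrip("/")  # tolerate a trailing slash in --api-url
    return {m: base + MODE_ENDPOINTS[m] for m in modes}


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--mode", "all", "--api-url", "http://localhost:8000/"])
targets = resolve_targets(args.mode, args.api_url)
for mode, url in targets.items():
    print(f"{mode}: POST {url}")
```

Keeping the mode-to-route table in one dictionary means the `--mode` choices and the set of evaluated endpoints cannot drift apart.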

<details>
<summary>📋 Linting, formatting & type checking</summary>

@@ -445,6 +463,11 @@ src/pharmagraphrag/
├── api/ # REST API
│ ├── main.py # FastAPI app (POST /query, POST /agent/query, POST /agent/multi, GET /drug, GET /health)
│ └── models.py # Pydantic v2 request/response schemas
├── evaluation/ # RAGAS evaluation framework
│ ├── metrics.py # RAGAS metric wrappers (Faithfulness, Relevancy, Precision, Recall, Correctness)
│ ├── dataset.py # Curated testset loader, EvalSample/EvalDataset
│ ├── runner.py # Batch evaluation runner (calls API, computes RAGAS scores, exports CSV)
│ └── agent_eval.py # Agent tool selection accuracy (precision/recall/F1)
└── ui/ # Chat interface
├── app.py # Streamlit app (chat, sidebar, settings)
└── components.py # Graph visualization, sources panel, drug explorer