feat(eval): data-grounded references + full RAGAS run (classic+agent+multi) #5
The original references in data/evaluation/testset.json were hand-written from textbook knowledge and didn't match what's actually loaded in FAERS 2024Q3+Q4 / DailyMed. This caused ContextRecall ~ 0 in RAGAS not because the system was wrong, but because RAGAS measured ref-vs-data mismatch.

This commit adds:
- scripts/regenerate_references.py: per-question generators that query Neo4j (counts, interactions, outcomes, categories) or ChromaDB (label text) to produce data-grounded references. Loads .env.aura before importing pharmagraphrag so pydantic-settings picks up Aura creds.
- scripts/test_aura_connection.py: smoke test against Aura.
- data/evaluation/testset_v2.json: 25 regenerated refs. Originals preserved under 'original_reference' for traceability.
- .gitignore: exclude .env.aura and .env.* (multi-env credentials)

Notable findings:
- Warfarin has no DrugCategory in the KG (BELONGS_TO missing). The KG also lacks any 'Anticoagulant' category. Refs for q13/q15 surface this honestly rather than hide the data gap.
- Some DailyMed sections (e.g. mechanism_of_action for OTC omeprazole) are not present as substantive chunks. fmt_label_search returns an honest message in that case.
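The per-question generator idea described above can be pictured as a small dispatch table. The following is only a sketch under stated assumptions: the PR description names a `REGENERATORS` mapping and the commit names `fmt_label_search`, but the stub bodies, the `fmt_top_ae_for_drug` helper, the `regenerate` function, and the `id` / `reference` field names are hypothetical and are not taken from the actual script.

```python
from typing import Callable


# Hypothetical stand-ins: the real generators query Neo4j or ChromaDB.
def fmt_top_ae_for_drug(drug: str) -> str:
    return f"[graph query result for {drug}]"


def fmt_label_search(drug: str, query: str) -> str:
    return f"[label search result for {drug}: {query}]"


# One zero-argument callable per question id (ids and arguments are illustrative).
REGENERATORS: dict[str, Callable[[], str]] = {
    "q01": lambda: fmt_top_ae_for_drug("WARFARIN"),
    "q07": lambda: fmt_label_search("OMEPRAZOLE", "mechanism of action"),
}


def regenerate(testset: list[dict]) -> list[dict]:
    """Return a copy of the testset with regenerated references.

    The previous reference is kept under 'original_reference' so the
    change stays traceable. Questions without a generator are copied as-is.
    """
    regenerated = []
    for item in testset:
        new_item = dict(item)
        generator = REGENERATORS.get(item["id"])
        if generator is not None:
            new_item["original_reference"] = item["reference"]
            new_item["reference"] = generator()
        regenerated.append(new_item)
    return regenerated
```

Keeping the mapping keyed by question id means a data gap for one question only affects that one entry, which is presumably why the script can emit an honest fallback message per question.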
Pull request overview
This PR updates the evaluation artifacts so RAGAS references are grounded in the actual Neo4j Aura + ChromaDB contents, reducing testset-vs-data drift and making evaluation metrics more meaningful.
Changes:
- Add scripts to (a) smoke-test Neo4j Aura connectivity and (b) regenerate the 25 testset references directly from the KG/vector store.
- Add data/evaluation/testset_v2.json with regenerated, data-grounded references (preserving prior references under original_reference).
- Record the full multi-mode evaluation run in data/evaluation/results/v2_full/SUMMARY.md and extend .gitignore for Aura env files.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/test_aura_connection.py | New Aura connectivity smoke test (counts + labels/reltypes). |
| scripts/regenerate_references.py | New reference-regeneration script sourcing from Neo4j queries + ChromaDB search. |
| data/evaluation/testset_v2.json | New data-grounded testset (25 questions) with preserved originals. |
| data/evaluation/results/v2_full/SUMMARY.md | Persisted summary of the full classic/agent/multi evaluation run. |
| .gitignore | Ignore .env.aura and .env.* variants. |
scripts/regenerate_references.py

    aura_env = dotenv_values(ROOT / ".env.aura")
    os.environ["NEO4J_URI"] = aura_env["NEO4J_URI"]
    os.environ["NEO4J_USER"] = aura_env.get("NEO4J_USER") or aura_env["NEO4J_USERNAME"]
    os.environ["NEO4J_PASSWORD"] = aura_env["NEO4J_PASSWORD"]

aura_env["NEO4J_URI"] / aura_env["NEO4J_PASSWORD"] / aura_env["NEO4J_USERNAME"] will raise KeyError when .env.aura is missing or incomplete, which makes the script fail with a stack trace before it can explain what’s wrong. Add explicit validation (similar to test_aura_connection.py) and exit with a clear message when required keys are absent.
Suggested change:

    - aura_env = dotenv_values(ROOT / ".env.aura")
    - os.environ["NEO4J_URI"] = aura_env["NEO4J_URI"]
    - os.environ["NEO4J_USER"] = aura_env.get("NEO4J_USER") or aura_env["NEO4J_USERNAME"]
    - os.environ["NEO4J_PASSWORD"] = aura_env["NEO4J_PASSWORD"]
    + aura_env_path = ROOT / ".env.aura"
    + aura_env = dotenv_values(aura_env_path)
    + neo4j_uri = aura_env.get("NEO4J_URI")
    + neo4j_user = aura_env.get("NEO4J_USER") or aura_env.get("NEO4J_USERNAME")
    + neo4j_password = aura_env.get("NEO4J_PASSWORD")
    + missing_keys: list[str] = []
    + if not neo4j_uri:
    +     missing_keys.append("NEO4J_URI")
    + if not neo4j_user:
    +     missing_keys.append("NEO4J_USER or NEO4J_USERNAME")
    + if not neo4j_password:
    +     missing_keys.append("NEO4J_PASSWORD")
    + if missing_keys:
    +     missing = ", ".join(missing_keys)
    +     raise SystemExit(
    +         f"Missing required Neo4j Aura settings in {aura_env_path}: {missing}. "
    +         "Create or update .env.aura with the required connection values."
    +     )
    + os.environ["NEO4J_URI"] = neo4j_uri
    + os.environ["NEO4J_USER"] = neo4j_user
    + os.environ["NEO4J_PASSWORD"] = neo4j_password

scripts/regenerate_references.py

    if preferred_sections:
        # Tie-break only among top-4 by semantic distance — avoid promoting
        # short/irrelevant chunks just because their section label matches.
        top = sorted(results, key=lambda r: r.get("distance", 1.0))[:4]
        ranked = sorted(
            top,
            key=lambda r: (
                0 if r["metadata"].get("section") in preferred_sections else 1,
                r.get("distance", 1.0),
            ),
        )
    else:
        ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
    # Pre-clean every candidate so we can filter by useful-content length
    cleaned = []
    for r in ranked:
        c = _clean_snippet(r["text"], max_chars=260)
        if len(c) >= 150:
            cleaned.append((r, c))
    chosen = cleaned[:2]

The preferred_sections tie-break only considers the top-4 results by semantic distance. If none of those happen to be in the preferred sections, the chosen reference can end up coming from an unrelated section (e.g., warnings/indications for a mechanism-of-action question). Consider selecting the best result within preferred_sections when available (and only falling back to non-preferred sections if no preferred-section chunks exist), while still applying a minimum-length/quality filter.
Suggested change:

    - if preferred_sections:
    -     # Tie-break only among top-4 by semantic distance — avoid promoting
    -     # short/irrelevant chunks just because their section label matches.
    -     top = sorted(results, key=lambda r: r.get("distance", 1.0))[:4]
    -     ranked = sorted(
    -         top,
    -         key=lambda r: (
    -             0 if r["metadata"].get("section") in preferred_sections else 1,
    -             r.get("distance", 1.0),
    -         ),
    -     )
    - else:
    -     ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
    - # Pre-clean every candidate so we can filter by useful-content length
    - cleaned = []
    - for r in ranked:
    -     c = _clean_snippet(r["text"], max_chars=260)
    -     if len(c) >= 150:
    -         cleaned.append((r, c))
    - chosen = cleaned[:2]
    + ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
    + # Pre-clean every candidate so we can filter by useful-content length
    + cleaned = []
    + for r in ranked:
    +     c = _clean_snippet(r["text"], max_chars=260)
    +     if len(c) >= 150:
    +         cleaned.append((r, c))
    + if preferred_sections:
    +     preferred_cleaned = [
    +         (r, c)
    +         for r, c in cleaned
    +         if r["metadata"].get("section") in preferred_sections
    +     ]
    +     fallback_cleaned = [
    +         (r, c)
    +         for r, c in cleaned
    +         if r["metadata"].get("section") not in preferred_sections
    +     ]
    +     chosen = (preferred_cleaned or fallback_cleaned)[:2]
    + else:
    +     chosen = cleaned[:2]

scripts/regenerate_references.py

    def fmt_top_ae_overall(n: int = 6) -> str:
        drv = queries._get_driver()
        with drv.session() as s:
            rows = s.run(

This script calls the private queries._get_driver() helper. Since underscore-prefixed functions aren’t part of the module’s supported API, this makes the script more fragile if the graph layer is refactored. Prefer adding small public query helpers in pharmagraphrag.graph.queries for these cases (overall top AEs, combined AE counts, per-drug AE count), or use the public query functions exclusively.
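For illustration, a public helper of the kind suggested above might look roughly like the sketch below. Everything specific in it is an assumption rather than the project's actual API: the function name `get_top_adverse_events`, the `AdverseEvent` label, the `CAUSES` relationship, and its `report_count` property are guesses at the graph schema, and the driver is passed in explicitly so the sketch does not depend on any module internals.

```python
from typing import Any

# Assumed schema (not confirmed by the PR):
#   (:Drug)-[:CAUSES {report_count}]->(:AdverseEvent {name})
_TOP_AE_QUERY = """
MATCH (:Drug)-[r:CAUSES]->(ae:AdverseEvent)
RETURN ae.name AS name, sum(r.report_count) AS reports
ORDER BY reports DESC
LIMIT $n
"""


def get_top_adverse_events(driver: Any, n: int = 6) -> list[tuple[str, int]]:
    """Return the n adverse events with the most reports across all drugs.

    `driver` is expected to behave like a neo4j.Driver: `session()` returns
    a context manager whose `run()` yields records indexable by key.
    """
    with driver.session() as session:
        result = session.run(_TOP_AE_QUERY, n=n)
        # Consume the result inside the session, before it is closed.
        return [(record["name"], record["reports"]) for record in result]
```

A caller such as `fmt_top_ae_overall` would then only format the returned tuples and would not need to touch the driver itself.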
Addresses Copilot review on PR #5: raise SystemExit with explicit list of missing keys instead of cryptic KeyError when .env.aura is incomplete.
Thanks for the review. Addressed the env-validation comment in 7e95076. On the other two:

preferred_sections tie-break (line 288): I considered always preferring chunks from preferred_sections. The current behavior (rank by distance, then prefer within top-4 + length filter) was chosen on purpose: when the preferred section has no substantively long chunk, picking a shorter or noisier preferred-section snippet over a high-similarity off-section one degrades reference quality more than it helps. The 25 generated references were manually inspected after iter 2 and the results are good (q07/q12/q17/q19/q24 all land on the right section). Switching the algorithm now would invalidate testset_v2.json and require re-running the full RAGAS eval (~€1.50). I am keeping the current behavior and flagging this as a known trade-off to revisit if more label-search questions are added.

queries._get_driver() use (line 195): this is a one-off regeneration script, not runtime code. Adding three public helpers in pharmagraphrag.graph.queries (overall top AEs, combined AE counts, per-drug AE count) for a script that runs maybe once a quarter is over-engineering. If we ever surface these queries in the API or evaluation runner, I will add the public helpers then.
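The trade-off discussed in this thread can be made concrete with a toy comparison. This is a simplified sketch, not the script's code: both functions omit the snippet cleaning and minimum-length filter, the function names are made up for illustration, and the distances and section names in `results` are invented values chosen so that the only preferred-section chunk ranks fifth by distance.

```python
def pick_top_k_tiebreak(results, preferred, k=4, take=2):
    """Current behavior: prefer matching sections only within the top-k by distance."""
    top = sorted(results, key=lambda r: r["distance"])[:k]
    ranked = sorted(
        top,
        key=lambda r: (
            0 if r["metadata"].get("section") in preferred else 1,
            r["distance"],
        ),
    )
    return ranked[:take]


def pick_preferred_first(results, preferred, take=2):
    """Reviewer's suggestion: any preferred-section chunk beats all others."""
    ranked = sorted(results, key=lambda r: r["distance"])
    pref = [r for r in ranked if r["metadata"].get("section") in preferred]
    other = [r for r in ranked if r["metadata"].get("section") not in preferred]
    return (pref or other)[:take]


# Invented example data: the only mechanism_of_action chunk is a distant match.
results = [
    {"distance": 0.20, "metadata": {"section": "warnings"}},
    {"distance": 0.25, "metadata": {"section": "indications_and_usage"}},
    {"distance": 0.30, "metadata": {"section": "adverse_reactions"}},
    {"distance": 0.35, "metadata": {"section": "warnings"}},
    {"distance": 0.60, "metadata": {"section": "mechanism_of_action"}},
]
preferred = {"mechanism_of_action"}

current = pick_top_k_tiebreak(results, preferred)
suggested = pick_preferred_first(results, preferred)
print([r["metadata"]["section"] for r in current])    # two closest off-section chunks
print([r["metadata"]["section"] for r in suggested])  # the single distant preferred chunk
```

With this input the top-4 strategy keeps the two most similar chunks even though neither is on-section, while the preferred-first strategy returns only the distant on-section chunk. Which outcome is better depends on how informative that distant chunk actually is, which is the judgment call both comments are arguing over.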
Context
The previous RAGAS evaluation (2026-04-21, n=3) showed ContextRecall 0.22 on classic mode. Diagnosis: the testset references were hand-written from textbook knowledge, not from the actual data loaded into Neo4j + ChromaDB (FAERS Q3+Q4 2024 + DailyMed). RAGAS was measuring testset-vs-data drift, not system quality.
Changes
- scripts/regenerate_references.py: regenerates each of the 25 testset references by querying Neo4j Aura + ChromaDB directly. Per-question lambdas in REGENERATORS mapped to graph queries or vector search. Includes snippet cleanup, length filtering, ranking by report_count, and honest fallbacks when data gaps exist.
- scripts/test_aura_connection.py: smoke test for Aura connectivity (verified 11,900 nodes / 381,359 rels).
- data/evaluation/testset_v2.json: 25 data-grounded references. Original references preserved under original_reference.
- .gitignore: ignore .env.aura and .env.* variants.
- data/evaluation/results/v2_full/SUMMARY.md: persistent record of the full eval run.

Results (n=25, all 3 modes)
Headline: ContextRecall classic 0.22 → 0.504 (+130%) confirms the references were the bottleneck.
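The headline percentage is a relative change over the earlier score, which a couple of lines can reproduce. The helper name `relative_change` is just for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change of `after` with respect to `before` (1.0 == +100%)."""
    return (after - before) / before


change = relative_change(0.22, 0.504)
print(f"{change:+.0%}")  # about +129%, which the headline rounds to +130%
```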
Goal accuracy 100% on agent and multi (25/25). See SUMMARY.md for details, including known issues (RAGAS judge timeouts, warfarin DrugCategory gap).

Cost
~€1.60 total Gemini Tier 1 (smoke + 3 full runs).
Out of scope