feat(eval): data-grounded references + full RAGAS run (classic+agent+multi) by jmponcebe · Pull Request #5 · jmponcebe/PharmaGraphRAG

jmponcebe · 2026-04-29T12:31:14Z

Context

The previous RAGAS evaluation (2026-04-21, n=3) showed ContextRecall 0.22 on classic mode. Diagnosis: the testset references were hand-written from textbook knowledge, not from the actual data loaded into Neo4j + ChromaDB (FAERS Q3+Q4 2024 + DailyMed). RAGAS was measuring testset-vs-data drift, not system quality.

Changes

scripts/regenerate_references.py: regenerates each of the 25 testset references by querying Neo4j Aura + ChromaDB directly. Per-question lambdas in REGENERATORS mapped to graph queries or vector search. Includes snippet cleanup, length filtering, ranking by report_count, and honest fallbacks when data gaps exist.
scripts/test_aura_connection.py: smoke test for Aura connectivity (verified 11,900 nodes / 381,359 rels).
data/evaluation/testset_v2.json: 25 data-grounded references. Original references preserved under original_reference.
.gitignore: ignore .env.aura and .env.* variants.
data/evaluation/results/v2_full/SUMMARY.md: persistent record of the full eval run.
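
The per-question regeneration described above can be pictured with a minimal sketch. Only the names REGENERATORS and original_reference come from this PR; the helper functions, the in-memory stand-in for the graph, the drug and event names, and the "id" / "reference" keys are all hypothetical, chosen for illustration rather than taken from the actual script.

```python
from typing import Callable

# Hypothetical stand-in for a Neo4j lookup; the real script queries Aura.
_FAKE_GRAPH = {
    "DRUG_A": [
        {"event": "EVENT_X", "report_count": 12},
        {"event": "EVENT_Y", "report_count": 40},
    ],
}


def top_adverse_events(drug: str) -> list[dict]:
    """Return adverse-event rows for a drug (empty list if the drug is absent)."""
    return _FAKE_GRAPH.get(drug, [])


def fmt_top_ae(drug: str, n: int = 5) -> str:
    """Format a reference string, ranking events by report_count."""
    rows = sorted(
        top_adverse_events(drug), key=lambda r: r["report_count"], reverse=True
    )[:n]
    if not rows:
        # Honest fallback: state the data gap instead of inventing an answer.
        return f"No adverse-event data for {drug} is present in the loaded graph."
    listed = ", ".join(f"{r['event']} ({r['report_count']} reports)" for r in rows)
    return f"Most reported adverse events for {drug}: {listed}."


# One zero-argument callable per question id (ids here are illustrative).
REGENERATORS: dict[str, Callable[[], str]] = {
    "q01": lambda: fmt_top_ae("DRUG_A"),
    "q02": lambda: fmt_top_ae("DRUG_B"),
}


def regenerate(testset: list[dict]) -> list[dict]:
    """Replace each reference, keeping the old one under original_reference."""
    out = []
    for item in testset:
        new_item = dict(item)
        generator = REGENERATORS.get(item["id"])
        if generator is not None:
            new_item["original_reference"] = item["reference"]
            new_item["reference"] = generator()
        out.append(new_item)
    return out
```

Keeping each generator as a zero-argument callable means a single question can be regenerated and inspected in isolation, which fits the manual review of references mentioned later in this thread.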

Results (n=25, all 3 modes)

Metric	Classic	Agent	Multi
AnswerCorrectness	0.544	0.610	0.551
AnswerRelevancy	0.695	0.735	0.708
ContextPrecision	0.760	0.207	0.165
ContextRecall	0.504	0.692	0.431
Faithfulness	0.910	0.680	0.792
latency_ms	5,688	11,225	19,788

Headline: ContextRecall classic 0.22 → 0.504 (+130%) confirms the references were the bottleneck.
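
As a quick check of the headline figure, the relative change can be computed from the two ContextRecall values reported above. The exact ratio works out to roughly +129%, which the headline rounds to +130%. The function name is illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


old_recall = 0.22   # classic mode, 2026-04-21 run (n=3)
new_recall = 0.504  # classic mode, this run (n=25)

change = relative_change(old_recall, new_recall)
print(f"ContextRecall classic: {change:+.0%}")
```

The two runs also differ in sample size (n=3 versus n=25), so the comparison is indicative rather than a like-for-like measurement.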

Goal accuracy 100% on agent and multi (25/25). See SUMMARY.md for details, including known issues (RAGAS judge timeouts, warfarin DrugCategory gap).

Cost

~€1.60 total Gemini Tier 1 (smoke + 3 full runs).

Out of scope

CSV results are gitignored per repo policy. SUMMARY.md is the persistent artifact.
No system-side changes; this PR only fixes the eval testset and documents the run.

The original references in data/evaluation/testset.json were hand-written from textbook knowledge and didn't match what's actually loaded in FAERS 2024Q3+Q4 / DailyMed. This caused ContextRecall ~ 0 in RAGAS not because the system was wrong, but because RAGAS measured ref-vs-data mismatch.

This commit adds:

- scripts/regenerate_references.py: per-question generators that query Neo4j (counts, interactions, outcomes, categories) or ChromaDB (label text) to produce data-grounded references. Loads .env.aura before importing pharmagraphrag so pydantic-settings picks up Aura creds.
- scripts/test_aura_connection.py: smoke test against Aura.
- data/evaluation/testset_v2.json: 25 regenerated refs. Originals preserved under 'original_reference' for traceability.
- .gitignore: exclude .env.aura and .env.* (multi-env credentials)

Notable findings:

- Warfarin has no DrugCategory in the KG (BELONGS_TO missing). The KG also lacks any 'Anticoagulant' category. Refs for q13/q15 surface this honestly rather than hide the data gap.
- Some DailyMed sections (e.g. mechanism_of_action for OTC omeprazole) are not present as substantive chunks. fmt_label_search returns an honest message in that case.

Copilot

Pull request overview

This PR updates the evaluation artifacts so RAGAS references are grounded in the actual Neo4j Aura + ChromaDB contents, reducing testset-vs-data drift and making evaluation metrics more meaningful.

Changes:

Add scripts to (a) smoke-test Neo4j Aura connectivity and (b) regenerate the 25 testset references directly from the KG/vector store.
Add data/evaluation/testset_v2.json with regenerated, data-grounded references (preserving prior references under original_reference).
Record the full multi-mode evaluation run in data/evaluation/results/v2_full/SUMMARY.md and extend .gitignore for Aura env files.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`scripts/test_aura_connection.py`	New Aura connectivity smoke test (counts + labels/reltypes).
`scripts/regenerate_references.py`	New reference-regeneration script sourcing from Neo4j queries + ChromaDB search.
`data/evaluation/testset_v2.json`	New data-grounded testset (25 questions) with preserved originals.
`data/evaluation/results/v2_full/SUMMARY.md`	Persisted summary of the full classic/agent/multi evaluation run.
`.gitignore`	Ignore `.env.aura` and `.env.*` variants.

Copilot · 2026-04-29T12:36:07Z

+aura_env = dotenv_values(ROOT / ".env.aura")
+os.environ["NEO4J_URI"] = aura_env["NEO4J_URI"]
+os.environ["NEO4J_USER"] = aura_env.get("NEO4J_USER") or aura_env["NEO4J_USERNAME"]
+os.environ["NEO4J_PASSWORD"] = aura_env["NEO4J_PASSWORD"]
+


aura_env["NEO4J_URI"] / aura_env["NEO4J_PASSWORD"] / aura_env["NEO4J_USERNAME"] will raise KeyError when .env.aura is missing or incomplete, which makes the script fail with a stack trace before it can explain what’s wrong. Add explicit validation (similar to test_aura_connection.py) and exit with a clear message when required keys are absent.

Suggested change

-aura_env = dotenv_values(ROOT / ".env.aura")
-os.environ["NEO4J_URI"] = aura_env["NEO4J_URI"]
-os.environ["NEO4J_USER"] = aura_env.get("NEO4J_USER") or aura_env["NEO4J_USERNAME"]
-os.environ["NEO4J_PASSWORD"] = aura_env["NEO4J_PASSWORD"]
+aura_env_path = ROOT / ".env.aura"
+aura_env = dotenv_values(aura_env_path)
+neo4j_uri = aura_env.get("NEO4J_URI")
+neo4j_user = aura_env.get("NEO4J_USER") or aura_env.get("NEO4J_USERNAME")
+neo4j_password = aura_env.get("NEO4J_PASSWORD")
+missing_keys: list[str] = []
+if not neo4j_uri:
+    missing_keys.append("NEO4J_URI")
+if not neo4j_user:
+    missing_keys.append("NEO4J_USER or NEO4J_USERNAME")
+if not neo4j_password:
+    missing_keys.append("NEO4J_PASSWORD")
+if missing_keys:
+    missing = ", ".join(missing_keys)
+    raise SystemExit(
+        f"Missing required Neo4j Aura settings in {aura_env_path}: {missing}. "
+        "Create or update .env.aura with the required connection values."
+    )
+os.environ["NEO4J_URI"] = neo4j_uri
+os.environ["NEO4J_USER"] = neo4j_user
+os.environ["NEO4J_PASSWORD"] = neo4j_password

Copilot · 2026-04-29T12:36:07Z

+    if preferred_sections:
+        # Tie-break only among top-4 by semantic distance — avoid promoting
+        # short/irrelevant chunks just because their section label matches.
+        top = sorted(results, key=lambda r: r.get("distance", 1.0))[:4]
+        ranked = sorted(
+            top,
+            key=lambda r: (
+                0 if r["metadata"].get("section") in preferred_sections else 1,
+                r.get("distance", 1.0),
+            ),
+        )
+    else:
+        ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
+    # Pre-clean every candidate so we can filter by useful-content length
+    cleaned = []
+    for r in ranked:
+        c = _clean_snippet(r["text"], max_chars=260)
+        if len(c) >= 150:
+            cleaned.append((r, c))
+    chosen = cleaned[:2]


The preferred_sections tie-break only considers the top-4 results by semantic distance. If none of those happen to be in the preferred sections, the chosen reference can end up coming from an unrelated section (e.g., warnings/indications for a mechanism-of-action question). Consider selecting the best result within preferred_sections when available (and only falling back to non-preferred sections if no preferred-section chunks exist), while still applying a minimum-length/quality filter.

Suggested change

-    if preferred_sections:
-        # Tie-break only among top-4 by semantic distance — avoid promoting
-        # short/irrelevant chunks just because their section label matches.
-        top = sorted(results, key=lambda r: r.get("distance", 1.0))[:4]
-        ranked = sorted(
-            top,
-            key=lambda r: (
-                0 if r["metadata"].get("section") in preferred_sections else 1,
-                r.get("distance", 1.0),
-            ),
-        )
-    else:
-        ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
-    # Pre-clean every candidate so we can filter by useful-content length
-    cleaned = []
-    for r in ranked:
-        c = _clean_snippet(r["text"], max_chars=260)
-        if len(c) >= 150:
-            cleaned.append((r, c))
-    chosen = cleaned[:2]
+    ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
+    # Pre-clean every candidate so we can filter by useful-content length
+    cleaned = []
+    for r in ranked:
+        c = _clean_snippet(r["text"], max_chars=260)
+        if len(c) >= 150:
+            cleaned.append((r, c))
+    if preferred_sections:
+        preferred_cleaned = [
+            (r, c)
+            for r, c in cleaned
+            if r["metadata"].get("section") in preferred_sections
+        ]
+        fallback_cleaned = [
+            (r, c)
+            for r, c in cleaned
+            if r["metadata"].get("section") not in preferred_sections
+        ]
+        chosen = (preferred_cleaned or fallback_cleaned)[:2]
+    else:
+        chosen = cleaned[:2]

Copilot · 2026-04-29T12:36:08Z

+def fmt_top_ae_overall(n: int = 6) -> str:
+    drv = queries._get_driver()
+    with drv.session() as s:
+        rows = s.run(


This script calls the private queries._get_driver() helper. Since underscore-prefixed functions aren’t part of the module’s supported API, this makes the script more fragile if the graph layer is refactored. Prefer adding small public query helpers in pharmagraphrag.graph.queries for these cases (overall top AEs, combined AE counts, per-drug AE count), or use the public query functions exclusively.
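
For illustration, one of the public helpers suggested here might look like the sketch below. The graph schema in the Cypher query (a CAUSES relationship carrying report_count, an AdverseEvent label with a name property), the function name, and the choice to pass the driver in as an argument are all assumptions; the PR does not show the actual schema or the existing pharmagraphrag.graph.queries API.

```python
from typing import Any

# Assumed schema, not confirmed by this PR:
# (:Drug)-[r:CAUSES {report_count}]->(:AdverseEvent {name})
TOP_AE_OVERALL_QUERY = """
MATCH (:Drug)-[r:CAUSES]->(ae:AdverseEvent)
RETURN ae.name AS event, sum(r.report_count) AS reports
ORDER BY reports DESC
LIMIT $n
"""


def get_top_adverse_events_overall(driver: Any, n: int = 6) -> list[dict[str, Any]]:
    """Return the n most-reported adverse events across all drugs.

    `driver` is expected to behave like a neo4j Driver: `session()` yields a
    context manager whose `run()` returns records exposing `.data()`.
    """
    with driver.session() as session:
        result = session.run(TOP_AE_OVERALL_QUERY, n=n)
        return [record.data() for record in result]
```

A formatter such as fmt_top_ae_overall could then call this helper instead of reaching for queries._get_driver() directly, and taking the driver as a parameter makes the helper testable with a stub.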

jmponcebe · 2026-04-29T16:44:43Z

Thanks for the review. Addressed the env-validation comment in 7e95076. On the other two:

preferred_sections tie-break (line 288): I considered always preferring chunks from preferred_sections. The current behavior (rank by distance, then prefer within top-4 + length filter) was chosen on purpose: when the preferred section has no substantively long chunk, picking a shorter or noisier preferred-section snippet over a high-similarity off-section one degrades reference quality more than it helps. The 25 generated references were manually inspected after iter 2 and the results are good (q07/q12/q17/q19/q24 all land on the right section). Switching the algorithm now would invalidate testset_v2.json and require re-running the full RAGAS eval (~€1.50). I am keeping the current behavior and flagging this as a known trade-off to revisit if more label-search questions are added.

queries._get_driver() use (line 195): this is a one-off regeneration script, not runtime code. Adding three public helpers in pharmagraphrag.graph.queries (overall top AEs, combined AE counts, per-drug AE count) for a script that runs maybe once a quarter is over-engineering. If we ever surface these queries in the API or evaluation runner, I will add the public helpers then.
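
The preferred_sections trade-off discussed in this reply can be seen on toy data. The sketch below is a simplified restatement of the two strategies from the review thread: it uses flat dictionaries instead of the script's nested metadata, skips _clean_snippet, and all section names, distances, and texts are made up for the example.

```python
def pick_top4_tiebreak(results, preferred, k=2, min_chars=150):
    """Current behavior: prefer matching sections only within the 4 closest hits."""
    top = sorted(results, key=lambda r: r["distance"])[:4]
    ranked = sorted(
        top,
        key=lambda r: (0 if r["section"] in preferred else 1, r["distance"]),
    )
    return [r for r in ranked if len(r["text"]) >= min_chars][:k]


def pick_preferred_first(results, preferred, k=2, min_chars=150):
    """Reviewer's proposal: any long-enough preferred-section hit wins."""
    ranked = sorted(results, key=lambda r: r["distance"])
    usable = [r for r in ranked if len(r["text"]) >= min_chars]
    in_preferred = [r for r in usable if r["section"] in preferred]
    return (in_preferred or usable)[:k]


# Toy candidates: the only preferred-section chunk is the least similar one.
results = [
    {"id": "a", "section": "warnings", "distance": 0.10, "text": "x" * 200},
    {"id": "b", "section": "indications", "distance": 0.15, "text": "x" * 200},
    {"id": "c", "section": "warnings", "distance": 0.20, "text": "x" * 200},
    {"id": "d", "section": "dosage", "distance": 0.25, "text": "x" * 200},
    {"id": "e", "section": "mechanism_of_action", "distance": 0.60, "text": "x" * 200},
]
preferred = {"mechanism_of_action"}

current = [r["id"] for r in pick_top4_tiebreak(results, preferred)]
proposed = [r["id"] for r in pick_preferred_first(results, preferred)]
print(current, proposed)
```

On this input the current behavior keeps the two closest off-section chunks, while the proposal returns only the distant on-section one. Which outcome is better depends on whether section match or semantic distance is the more reliable signal for a given question, which is the trade-off being flagged.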

jmponcebe added 2 commits April 29, 2026 10:51

docs(eval): add SUMMARY.md for testset_v2 full run (classic+agent+multi)

5591b2c

Copilot AI review requested due to automatic review settings April 29, 2026 12:31

Copilot started reviewing on behalf of jmponcebe April 29, 2026 12:31

Copilot AI reviewed Apr 29, 2026

fix(scripts): validate .env.aura keys with clear error message

7e95076

Addresses Copilot review on PR #5: raise SystemExit with explicit list of missing keys instead of cryptic KeyError when .env.aura is incomplete.

jmponcebe merged commit 852eb67 into main Apr 29, 2026
4 checks passed

jmponcebe deleted the feat/ragas-data-grounded-refs branch April 29, 2026 17:08

jmponcebe merged 3 commits into main from feat/ragas-data-grounded-refs

2 participants

-aura_env = dotenv_values(ROOT / ".env.aura")
-os.environ["NEO4J_URI"] = aura_env["NEO4J_URI"]
-os.environ["NEO4J_USER"] = aura_env.get("NEO4J_USER") or aura_env["NEO4J_USERNAME"]
-os.environ["NEO4J_PASSWORD"] = aura_env["NEO4J_PASSWORD"]
+aura_env_path = ROOT / ".env.aura"
+aura_env = dotenv_values(aura_env_path)
+neo4j_uri = aura_env.get("NEO4J_URI")
+neo4j_user = aura_env.get("NEO4J_USER") or aura_env.get("NEO4J_USERNAME")
+neo4j_password = aura_env.get("NEO4J_PASSWORD")
+missing_keys: list[str] = []
+if not neo4j_uri:
+    missing_keys.append("NEO4J_URI")
+if not neo4j_user:
+    missing_keys.append("NEO4J_USER or NEO4J_USERNAME")
+if not neo4j_password:
+    missing_keys.append("NEO4J_PASSWORD")
+if missing_keys:
+    missing = ", ".join(missing_keys)
+    raise SystemExit(
+        f"Missing required Neo4j Aura settings in {aura_env_path}: {missing}. "
+        "Create or update .env.aura with the required connection values."
+    )
+os.environ["NEO4J_URI"] = neo4j_uri
+os.environ["NEO4J_USER"] = neo4j_user
+os.environ["NEO4J_PASSWORD"] = neo4j_password
