feat(eval): data-grounded references + full RAGAS run (classic+agent+multi)#5

Merged
jmponcebe merged 3 commits into main from feat/ragas-data-grounded-refs on Apr 29, 2026

@jmponcebe
Owner

Context

The previous RAGAS evaluation (2026-04-21, n=3) showed ContextRecall 0.22 on classic mode. Diagnosis: the testset references were hand-written from textbook knowledge, not from the actual data loaded into Neo4j + ChromaDB (FAERS Q3+Q4 2024 + DailyMed). RAGAS was measuring testset-vs-data drift, not system quality.

Changes

  • scripts/regenerate_references.py: regenerates each of the 25 testset references by querying Neo4j Aura + ChromaDB directly. Per-question lambdas in REGENERATORS mapped to graph queries or vector search. Includes snippet cleanup, length filtering, ranking by report_count, and honest fallbacks when data gaps exist.
  • scripts/test_aura_connection.py: smoke test for Aura connectivity (verified 11,900 nodes / 381,359 rels).
  • data/evaluation/testset_v2.json: 25 data-grounded references. Original references preserved under original_reference.
  • .gitignore: ignore .env.aura and .env.* variants.
  • data/evaluation/results/v2_full/SUMMARY.md: persistent record of the full eval run.
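
The per-question dispatch described for scripts/regenerate_references.py can be pictured roughly as follows. This is a minimal sketch, not the script itself: the question ids, the two `fmt_*_stub` functions, and the `id` / `reference` field names are assumptions made for illustration. Only the `REGENERATORS` name and the `original_reference` key come from this PR.

```python
from typing import Callable

# Hypothetical stand-ins for the real Neo4j / ChromaDB formatters.
def fmt_graph_stub(drug: str) -> str:
    return f"[graph-derived reference for {drug}]"

def fmt_label_stub(drug: str, topic: str) -> str:
    return f"[label-derived reference for {drug}: {topic}]"

# One zero-argument callable per question id (ids are illustrative).
REGENERATORS: dict[str, Callable[[], str]] = {
    "q01": lambda: fmt_graph_stub("warfarin"),
    "q02": lambda: fmt_label_stub("omeprazole", "mechanism of action"),
}

def regenerate(testset: list[dict]) -> list[dict]:
    """Return a copy of the testset with regenerated references.

    The previous reference is kept under 'original_reference'; questions
    without a registered generator are passed through unchanged.
    """
    out = []
    for item in testset:
        new_item = dict(item)  # shallow copy so the input is not mutated
        gen = REGENERATORS.get(item["id"])
        if gen is not None:
            new_item["original_reference"] = item["reference"]
            new_item["reference"] = gen()
        out.append(new_item)
    return out
```

Keeping each generator as a zero-argument callable means the drug names and query parameters live next to the question id, which makes a single reference easy to re-run or audit in isolation.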

Results (n=25, all 3 modes)

| Metric | Classic | Agent | Multi |
|---|---:|---:|---:|
| AnswerCorrectness | 0.544 | 0.610 | 0.551 |
| AnswerRelevancy | 0.695 | 0.735 | 0.708 |
| ContextPrecision | 0.760 | 0.207 | 0.165 |
| ContextRecall | 0.504 | 0.692 | 0.431 |
| Faithfulness | 0.910 | 0.680 | 0.792 |
| latency_ms | 5,688 | 11,225 | 19,788 |

Headline: ContextRecall classic 0.22 → 0.504 (+130%) confirms the references were the bottleneck.
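
As a quick arithmetic check on the headline figure, the relative change can be computed directly from the two ContextRecall values quoted in this PR. It comes to about +129%, which the headline rounds to +130%. The helper name below is illustrative, and the two runs differ in sample size (n=3 vs n=25), so the comparison is indicative rather than exact.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before

# ContextRecall, classic mode: previous run (n=3) vs this run (n=25).
change = relative_change(0.22, 0.504)
print(f"{change:+.1%}")  # → +129.1%
```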

Goal accuracy 100% on agent and multi (25/25). See SUMMARY.md for details, including known issues (RAGAS judge timeouts, warfarin DrugCategory gap).

Cost

~€1.60 total Gemini Tier 1 (smoke + 3 full runs).

Out of scope

  • CSV results are gitignored per repo policy. SUMMARY.md is the persistent artifact.
  • No system-side changes; this PR only fixes the eval testset and documents the run.

The original references in data/evaluation/testset.json were hand-written
from textbook knowledge and didn't match what's actually loaded in FAERS
2024Q3+Q4 / DailyMed. This caused ContextRecall ~ 0 in RAGAS not because
the system was wrong, but because RAGAS measured ref-vs-data mismatch.

This commit adds:
- scripts/regenerate_references.py: per-question generators that query
  Neo4j (counts, interactions, outcomes, categories) or ChromaDB (label
  text) to produce data-grounded references. Loads .env.aura before
  importing pharmagraphrag so pydantic-settings picks up Aura creds.
- scripts/test_aura_connection.py: smoke test against Aura.
- data/evaluation/testset_v2.json: 25 regenerated refs. Originals
  preserved under 'original_reference' for traceability.
- .gitignore: exclude .env.aura and .env.* (multi-env credentials)

Notable findings:
- Warfarin has no DrugCategory in the KG (BELONGS_TO missing). The KG
  also lacks any 'Anticoagulant' category. Refs for q13/q15 surface this
  honestly rather than hide the data gap.
- Some DailyMed sections (e.g. mechanism_of_action for OTC omeprazole)
  are not present as substantive chunks. fmt_label_search returns an
  honest message in that case.
Copilot AI review requested due to automatic review settings April 29, 2026 12:31

Copilot AI left a comment

Pull request overview

This PR updates the evaluation artifacts so RAGAS references are grounded in the actual Neo4j Aura + ChromaDB contents, reducing testset-vs-data drift and making evaluation metrics more meaningful.

Changes:

  • Add scripts to (a) smoke-test Neo4j Aura connectivity and (b) regenerate the 25 testset references directly from the KG/vector store.
  • Add data/evaluation/testset_v2.json with regenerated, data-grounded references (preserving prior references under original_reference).
  • Record the full multi-mode evaluation run in data/evaluation/results/v2_full/SUMMARY.md and extend .gitignore for Aura env files.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| scripts/test_aura_connection.py | New Aura connectivity smoke test (counts + labels/reltypes). |
| scripts/regenerate_references.py | New reference-regeneration script sourcing from Neo4j queries + ChromaDB search. |
| data/evaluation/testset_v2.json | New data-grounded testset (25 questions) with preserved originals. |
| data/evaluation/results/v2_full/SUMMARY.md | Persisted summary of the full classic/agent/multi evaluation run. |
| .gitignore | Ignore .env.aura and .env.* variants. |

Comment thread scripts/regenerate_references.py Outdated
Comment on lines +31 to +35
aura_env = dotenv_values(ROOT / ".env.aura")
os.environ["NEO4J_URI"] = aura_env["NEO4J_URI"]
os.environ["NEO4J_USER"] = aura_env.get("NEO4J_USER") or aura_env["NEO4J_USERNAME"]
os.environ["NEO4J_PASSWORD"] = aura_env["NEO4J_PASSWORD"]

Copilot AI Apr 29, 2026

aura_env["NEO4J_URI"] / aura_env["NEO4J_PASSWORD"] / aura_env["NEO4J_USERNAME"] will raise KeyError when .env.aura is missing or incomplete, which makes the script fail with a stack trace before it can explain what’s wrong. Add explicit validation (similar to test_aura_connection.py) and exit with a clear message when required keys are absent.

Suggested change
- aura_env = dotenv_values(ROOT / ".env.aura")
- os.environ["NEO4J_URI"] = aura_env["NEO4J_URI"]
- os.environ["NEO4J_USER"] = aura_env.get("NEO4J_USER") or aura_env["NEO4J_USERNAME"]
- os.environ["NEO4J_PASSWORD"] = aura_env["NEO4J_PASSWORD"]
+ aura_env_path = ROOT / ".env.aura"
+ aura_env = dotenv_values(aura_env_path)
+ neo4j_uri = aura_env.get("NEO4J_URI")
+ neo4j_user = aura_env.get("NEO4J_USER") or aura_env.get("NEO4J_USERNAME")
+ neo4j_password = aura_env.get("NEO4J_PASSWORD")
+ missing_keys: list[str] = []
+ if not neo4j_uri:
+     missing_keys.append("NEO4J_URI")
+ if not neo4j_user:
+     missing_keys.append("NEO4J_USER or NEO4J_USERNAME")
+ if not neo4j_password:
+     missing_keys.append("NEO4J_PASSWORD")
+ if missing_keys:
+     missing = ", ".join(missing_keys)
+     raise SystemExit(
+         f"Missing required Neo4j Aura settings in {aura_env_path}: {missing}. "
+         "Create or update .env.aura with the required connection values."
+     )
+ os.environ["NEO4J_URI"] = neo4j_uri
+ os.environ["NEO4J_USER"] = neo4j_user
+ os.environ["NEO4J_PASSWORD"] = neo4j_password
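
The same validation can also be written as a small function over a plain mapping, which makes it testable without touching the filesystem. This is a sketch under assumptions: the function name `require_aura_settings` is hypothetical and not part of the PR, while the key names and the `NEO4J_USER` / `NEO4J_USERNAME` fallback follow the suggestion above.

```python
from collections.abc import Mapping
from typing import Optional

def require_aura_settings(env: Mapping[str, Optional[str]]) -> dict[str, str]:
    """Validate Aura connection settings and return them normalised.

    Raises SystemExit listing every missing key, so the caller gets one
    clear message instead of a KeyError for the first absent entry.
    """
    uri = env.get("NEO4J_URI")
    user = env.get("NEO4J_USER") or env.get("NEO4J_USERNAME")
    password = env.get("NEO4J_PASSWORD")

    required = {
        "NEO4J_URI": uri,
        "NEO4J_USER or NEO4J_USERNAME": user,
        "NEO4J_PASSWORD": password,
    }
    # Empty strings and None both count as missing.
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise SystemExit(
            f"Missing required Neo4j Aura settings: {', '.join(missing)}"
        )
    return {"NEO4J_URI": uri, "NEO4J_USER": user, "NEO4J_PASSWORD": password}
```

In the script this could be used as `os.environ.update(require_aura_settings(dotenv_values(ROOT / ".env.aura")))`, since `dotenv_values` returns a dict and yields an empty one when the file does not exist.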

Comment on lines +269 to +288
if preferred_sections:
    # Tie-break only among top-4 by semantic distance; avoid promoting
    # short/irrelevant chunks just because their section label matches.
    top = sorted(results, key=lambda r: r.get("distance", 1.0))[:4]
    ranked = sorted(
        top,
        key=lambda r: (
            0 if r["metadata"].get("section") in preferred_sections else 1,
            r.get("distance", 1.0),
        ),
    )
else:
    ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
# Pre-clean every candidate so we can filter by useful-content length
cleaned = []
for r in ranked:
    c = _clean_snippet(r["text"], max_chars=260)
    if len(c) >= 150:
        cleaned.append((r, c))
chosen = cleaned[:2]

Copilot AI Apr 29, 2026

The preferred_sections tie-break only considers the top-4 results by semantic distance. If none of those happen to be in the preferred sections, the chosen reference can end up coming from an unrelated section (e.g., warnings/indications for a mechanism-of-action question). Consider selecting the best result within preferred_sections when available (and only falling back to non-preferred sections if no preferred-section chunks exist), while still applying a minimum-length/quality filter.

Suggested change
- if preferred_sections:
-     # Tie-break only among top-4 by semantic distance; avoid promoting
-     # short/irrelevant chunks just because their section label matches.
-     top = sorted(results, key=lambda r: r.get("distance", 1.0))[:4]
-     ranked = sorted(
-         top,
-         key=lambda r: (
-             0 if r["metadata"].get("section") in preferred_sections else 1,
-             r.get("distance", 1.0),
-         ),
-     )
- else:
-     ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
- # Pre-clean every candidate so we can filter by useful-content length
- cleaned = []
- for r in ranked:
-     c = _clean_snippet(r["text"], max_chars=260)
-     if len(c) >= 150:
-         cleaned.append((r, c))
- chosen = cleaned[:2]
+ ranked = sorted(results, key=lambda r: r.get("distance", 1.0))
+ # Pre-clean every candidate so we can filter by useful-content length
+ cleaned = []
+ for r in ranked:
+     c = _clean_snippet(r["text"], max_chars=260)
+     if len(c) >= 150:
+         cleaned.append((r, c))
+ if preferred_sections:
+     preferred_cleaned = [
+         (r, c)
+         for r, c in cleaned
+         if r["metadata"].get("section") in preferred_sections
+     ]
+     fallback_cleaned = [
+         (r, c)
+         for r, c in cleaned
+         if r["metadata"].get("section") not in preferred_sections
+     ]
+     chosen = (preferred_cleaned or fallback_cleaned)[:2]
+ else:
+     chosen = cleaned[:2]
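
To make the difference between the two selection strategies concrete, the sketch below runs both on toy search hits. It is illustrative only: the hit data is invented, the function names are hypothetical, and the snippet cleaning and minimum-length filter from the real code are left out so that only the ranking logic is compared.

```python
def by_distance(results):
    """Sort hits by ascending semantic distance (missing distance counts as 1.0)."""
    return sorted(results, key=lambda r: r.get("distance", 1.0))

def pick_top4_tiebreak(results, preferred, k=2):
    """Original behaviour: section preference only reorders the 4 nearest hits."""
    top = by_distance(results)[:4]
    ranked = sorted(
        top,
        key=lambda r: (
            0 if r["metadata"].get("section") in preferred else 1,
            r.get("distance", 1.0),
        ),
    )
    return ranked[:k]

def pick_preferred_first(results, preferred, k=2):
    """Suggested behaviour: preferred-section hits win whenever any exist."""
    ranked = by_distance(results)
    pref = [r for r in ranked if r["metadata"].get("section") in preferred]
    other = [r for r in ranked if r["metadata"].get("section") not in preferred]
    return (pref or other)[:k]

# Toy hits: the only preferred-section chunk is the fifth-nearest result.
HITS = [
    {"distance": 0.10, "metadata": {"section": "warnings"}},
    {"distance": 0.12, "metadata": {"section": "indications"}},
    {"distance": 0.15, "metadata": {"section": "warnings"}},
    {"distance": 0.18, "metadata": {"section": "indications"}},
    {"distance": 0.30, "metadata": {"section": "mechanism_of_action"}},
]
PREFERRED = {"mechanism_of_action"}

print([r["metadata"]["section"] for r in pick_top4_tiebreak(HITS, PREFERRED)])
# → ['warnings', 'indications']
print([r["metadata"]["section"] for r in pick_preferred_first(HITS, PREFERRED)])
# → ['mechanism_of_action']
```

With these inputs the top-4 variant never sees the preferred chunk, while the preferred-first variant returns it alone even though it is the least similar hit. That is the trade-off under discussion: section match versus semantic closeness.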

Comment on lines +192 to +195
def fmt_top_ae_overall(n: int = 6) -> str:
    drv = queries._get_driver()
    with drv.session() as s:
        rows = s.run(

Copilot AI Apr 29, 2026

This script calls the private queries._get_driver() helper. Since underscore-prefixed functions aren’t part of the module’s supported API, this makes the script more fragile if the graph layer is refactored. Prefer adding small public query helpers in pharmagraphrag.graph.queries for these cases (overall top AEs, combined AE counts, per-drug AE count), or use the public query functions exclusively.
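
For illustration, a public helper of the kind proposed here might look like the sketch below. Everything in it is an assumption: the function name, the Cypher text, and the node labels, relationship type, and property names are hypothetical and do not reflect the project's actual schema. The driver is passed in explicitly so the function does not depend on a private accessor.

```python
from typing import Any

# Hypothetical Cypher: labels, relationship type and property names are
# assumptions for illustration, not the project's real graph schema.
TOP_AE_QUERY = """
MATCH (:Drug)-[r:CAUSES]->(ae:AdverseEvent)
RETURN ae.name AS name, sum(r.report_count) AS reports
ORDER BY reports DESC
LIMIT $n
"""

def get_top_adverse_events(driver: Any, n: int = 6) -> list[tuple[str, int]]:
    """Return up to `n` (adverse event, report count) pairs, highest first.

    `driver` is expected to behave like a neo4j driver: `driver.session()`
    yields a context manager whose `run(query, **params)` returns an
    iterable of records indexable by column name.
    """
    with driver.session() as session:
        rows = session.run(TOP_AE_QUERY, n=n)
        # Materialise inside the session so results are read before it closes.
        return [(row["name"], row["reports"]) for row in rows]
```

Taking the driver as a parameter also lets the helper be exercised with a stub in unit tests, without a live Aura instance.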

Addresses Copilot review on PR #5: raise SystemExit with explicit list of
missing keys instead of cryptic KeyError when .env.aura is incomplete.
@jmponcebe
Owner Author

Thanks for the review. Addressed the env-validation comment in 7e95076. On the other two:

preferred_sections tie-break (line 288): I considered always preferring chunks from preferred_sections. The current behavior (rank by distance, then prefer within the top 4, then apply the length filter) was chosen on purpose: when the preferred section has no substantively long chunk, picking a shorter or noisier preferred-section snippet over a high-similarity off-section one degrades reference quality more than it helps. The 25 generated references were manually inspected after the second iteration and the results are good (q07/q12/q17/q19/q24 all land on the right section). Switching the algorithm now would invalidate testset_v2.json and require re-running the full RAGAS eval (~€1.50). I am keeping the current behavior and flagging this as a known trade-off to revisit if more label-search questions are added.

queries._get_driver() use (line 195): this is a one-off regeneration script, not runtime code. Adding three public helpers in pharmagraphrag.graph.queries (overall top AEs, combined AE counts, per-drug AE count) for a script that runs maybe once a quarter is over-engineering. If we ever surface these queries in the API or evaluation runner, I will add the public helpers then.

@jmponcebe jmponcebe merged commit 852eb67 into main Apr 29, 2026
4 checks passed
@jmponcebe jmponcebe deleted the feat/ragas-data-grounded-refs branch April 29, 2026 17:08