feat: RAGAS evaluation framework + Copilot review fixes by jmponcebe · Pull Request #2 · jmponcebe/PharmaGraphRAG

jmponcebe · 2026-04-18T20:02:37Z

Changes

RAGAS Evaluation Framework

Add evaluation module: metrics.py, dataset.py, runner.py, agent_eval.py
Curated testset: 25 questions across 8 types with ground truth and expected tools
RAGAS 0.4.3 metrics: Faithfulness, Answer Relevancy, Context Precision, Context Recall, Answer Correctness
Agent evaluation: tool selection accuracy (precision/recall/F1), goal achievement
Batch runner: classic/agent/multi modes with CSV export
Evaluation script: scripts/run_evaluation.py
40 new tests (261 total, all passing)

Copilot Code Review Fixes (observability)

Fix shallow copy mutation in build_callback_config
Cache disabled state to prevent warning spam
Remove unused as_type parameter from observe_fn
Cache traced function in closure for performance
Wrap classic pipeline span in try/finally
Fix thread_id consistency in agent/multi Langfuse session_id
Remove leftover test_api.py debug script

- Add evaluation module (metrics, dataset, runner, agent_eval) - RAGAS 0.4.3: Faithfulness, Relevancy, Precision, Recall, Correctness - Curated testset: 25 questions across 8 types with ground truth - Agent evaluation: tool selection accuracy (precision/recall/F1) - Batch runner: classic/agent/multi modes, CSV export - 40 new tests (261 total), all passing - Update docs and README with evaluation section

- Fix shallow copy mutation in build_callback_config (copy metadata dict) - Cache disabled state to prevent warning spam on every request - Remove unused as_type parameter from observe_fn - Cache traced function in closure to avoid re-wrapping on every call - Wrap classic pipeline span in try/finally for proper cleanup - Use contextlib.suppress instead of try-except-pass - Fix thread_id consistency in agent/multi Langfuse session_id - Remove leftover test_api.py debug script from root

Copilot

Pull request overview

Adds an evaluation subsystem to PharmaGraphRAG (curated testset + RAGAS metrics + batch runner) while also applying several observability reliability/perf fixes (Langfuse init/metadata handling, classic pipeline span lifecycle, consistent agent session IDs).

Changes:

Introduce pharmagraphrag.evaluation (dataset loader, RAGAS metric wrappers, runner, agent tool-selection evaluation) + curated data/evaluation/testset.json and a CLI runner script.
Add new dependencies for evaluation (ragas, datasets, plus transitive deps) and update docs/metadata to reflect evaluation and new test counts.
Observability fixes: prevent metadata mutation, reduce Langfuse warning spam, improve span cleanup, and align agent/multi session_id defaults.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`uv.lock`	Locks new evaluation/runtime dependencies (ragas/datasets + transitive packages).
`pyproject.toml`	Adds `ragas` and `datasets` to project dependencies.
`src/pharmagraphrag/evaluation/__init__.py`	Declares the new evaluation package.
`src/pharmagraphrag/evaluation/dataset.py`	Loads curated JSON testset + converts samples to RAGAS-compatible records.
`src/pharmagraphrag/evaluation/metrics.py`	Wraps RAGAS metrics; configures evaluator LLM/embeddings via Gemini OpenAI-compatible endpoint.
`src/pharmagraphrag/evaluation/runner.py`	Calls API endpoints per mode, scores samples, exports CSV, computes summaries.
`src/pharmagraphrag/evaluation/agent_eval.py`	Computes tool-selection precision/recall/F1 and goal achievement for agent outputs.
`scripts/run_evaluation.py`	CLI entrypoint to run evaluation across classic/agent/multi and export reports.
`data/evaluation/testset.json`	Adds curated evaluation dataset (25 questions + references + expected tools).
`tests/test_evaluation.py`	Adds unit tests for evaluation modules (dataset/metrics/runner/agent eval).
`src/pharmagraphrag/observability.py`	Langfuse init caching + metadata copy fix + observe wrapper caching change.
`tests/test_observability.py`	Resets the new `_langfuse_disabled` global in test fixture.
`src/pharmagraphrag/engine/query_engine.py`	Wraps classic pipeline Langfuse span in `try/finally` for reliable cleanup.
`src/pharmagraphrag/agent/graph.py`	Ensures consistent `session_id`/`thread_id` defaulting for Langfuse + memory.
`src/pharmagraphrag/agent/multi.py`	Aligns Langfuse `session_id` defaulting with agent mode.
`README.md`	Documents evaluation feature + updates test counts and repo structure section.
`.github/copilot-instructions.md`	Updates project documentation to include evaluation module/test counts.
`test_api.py`	Removes leftover debug script.

Copilot · 2026-04-18T20:06:13Z

+    """
+    ragas_samples = []
+    for sample in dataset.samples:
+        if not sample.answer:


to_ragas_dataset() docstring says it only includes samples where both answer and contexts are populated, but the implementation only checks sample.answer and will include entries with empty retrieved_contexts. This can skew evaluation (and some RAGAS metrics assume at least one retrieved context). Consider also skipping samples with no contexts (or enforcing a non-empty list).

Suggested change

if not sample.answer:

if not sample.answer or not sample.contexts:

Copilot · 2026-04-18T20:06:14Z

+        if DEFAULT_TESTSET_PATH.exists():
+            dataset = load_testset()
+            assert len(dataset) >= 20
+            assert all(s.question for s in dataset.samples)
+            assert all(s.reference for s in dataset.samples)


This test is intended to verify the curated default testset exists, but it silently passes when DEFAULT_TESTSET_PATH is missing due to the if DEFAULT_TESTSET_PATH.exists(): guard. Prefer asserting the file exists (or explicitly pytest.skip(...) when absent) so CI fails if the committed dataset is accidentally removed from the repo/package.

Suggested change

if DEFAULT_TESTSET_PATH.exists():

dataset = load_testset()

assert len(dataset) >= 20

assert all(s.question for s in dataset.samples)

assert all(s.reference for s in dataset.samples)

assert DEFAULT_TESTSET_PATH.exists(), (

f"Expected curated default testset at {DEFAULT_TESTSET_PATH}"

)

dataset = load_testset()

assert len(dataset) >= 20

assert all(s.question for s in dataset.samples)

assert all(s.reference for s in dataset.samples)

Copilot · 2026-04-18T20:06:14Z

+    if not expected and not actual:
+        return AgentEvalResult(
+            sample_id=sample.id,
+            question=sample.question,
+            expected_tools=sample.expected_tools,
+            actual_tools=sample.tool_calls,
+            tool_precision=1.0,
+            tool_recall=1.0,
+            tool_f1=1.0,
+            goal_achieved=bool(sample.answer),
+        )


In evaluate_tool_selection(), the early-return branch for not expected and not actual sets goal_achieved=bool(sample.answer), but the main branch treats answers containing an error marker ("ERROR") as not achieved. This makes goal accuracy inconsistent for samples that produce an error but have no tools expected/called. Consider aligning the goal check across both branches (e.g., also excluding error answers).

Copilot · 2026-04-18T20:06:14Z

+def _call_agent(
+    question: str, config: RunConfig, endpoint: str = "/agent/query"
+) -> PipelineResponse:
+    """Call the agent or multi-agent pipeline."""
+    url = f"{config.api_url}{endpoint}"
+    body: dict[str, Any] = {"question": question}
+    if config.model:
+        body["model"] = config.model
+
+    start = time.perf_counter()
+    try:
+        resp = httpx.post(url, json=body, timeout=config.timeout)
+        latency = (time.perf_counter() - start) * 1000
+        resp.raise_for_status()
+        data = resp.json()
+
+        # Extract contexts from tool results and graph/vector data
+        contexts = []
+        for tr in data.get("tool_results", []):
+            content = tr.get("content", "")
+            if content:
+                contexts.append(content[:2000])
+
+        # Also use vector data snippets
+        for vd in data.get("vector_data", []):
+            snippet = vd.get("text", vd.get("snippet", ""))
+            if snippet:
+                contexts.append(snippet)
+
+        tools = [tc.get("tool", "") for tc in data.get("tool_calls", [])]
+


_call_agent() contains non-trivial response parsing (tool results → contexts, vector_data snippets, tool_calls extraction) but the current tests only mock _call_agent at the routing layer and never validate this parsing logic. Adding a focused unit test that mocks httpx.post() for an agent/multi response would help prevent regressions (e.g., schema/key changes in tool_calls, tool_results, or vector_data).

- dataset.py: skip samples with empty contexts in to_ragas_dataset() - agent_eval.py: consistent goal_achieved check (exclude ERROR answers) - test_evaluation.py: assert DEFAULT_TESTSET_PATH exists (fail if missing) - test_evaluation.py: add TestCallAgent with _call_agent parsing tests (42 tests) Total: 263 tests passing

jmponcebe added 2 commits April 18, 2026 21:23

Copilot AI review requested due to automatic review settings April 18, 2026 20:02

Copilot started reviewing on behalf of jmponcebe April 18, 2026 20:03 View session

Copilot AI reviewed Apr 18, 2026

View reviewed changes

jmponcebe added 2 commits April 18, 2026 22:11

docs: update test counts to 263 (evaluation review fixes)

4d89db2

jmponcebe merged commit 87878bc into main Apr 18, 2026
3 checks passed

jmponcebe deleted the feat/ragas-evaluation branch April 18, 2026 20:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: RAGAS evaluation framework + Copilot review fixes#2

feat: RAGAS evaluation framework + Copilot review fixes#2
jmponcebe merged 4 commits into
mainfrom
feat/ragas-evaluation

jmponcebe commented Apr 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 18, 2026

Uh oh!

Copilot AI Apr 18, 2026

Uh oh!

Copilot AI Apr 18, 2026

Uh oh!

Copilot AI Apr 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	if not sample.answer:
	if not sample.answer or not sample.contexts:

Conversation

jmponcebe commented Apr 18, 2026

Changes

RAGAS Evaluation Framework

Copilot Code Review Fixes (observability)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants