feat: RAGAS evaluation framework + Copilot review fixes#2
Conversation
- Add evaluation module (metrics, dataset, runner, agent_eval) - RAGAS 0.4.3: Faithfulness, Relevancy, Precision, Recall, Correctness - Curated testset: 25 questions across 8 types with ground truth - Agent evaluation: tool selection accuracy (precision/recall/F1) - Batch runner: classic/agent/multi modes, CSV export - 40 new tests (261 total), all passing - Update docs and README with evaluation section
- Fix shallow copy mutation in build_callback_config (copy metadata dict) - Cache disabled state to prevent warning spam on every request - Remove unused as_type parameter from observe_fn - Cache traced function in closure to avoid re-wrapping on every call - Wrap classic pipeline span in try/finally for proper cleanup - Use contextlib.suppress instead of try-except-pass - Fix thread_id consistency in agent/multi Langfuse session_id - Remove leftover test_api.py debug script from root
There was a problem hiding this comment.
Pull request overview
Adds an evaluation subsystem to PharmaGraphRAG (curated testset + RAGAS metrics + batch runner) while also applying several observability reliability/perf fixes (Langfuse init/metadata handling, classic pipeline span lifecycle, consistent agent session IDs).
Changes:
- Introduce
pharmagraphrag.evaluation(dataset loader, RAGAS metric wrappers, runner, agent tool-selection evaluation) + curateddata/evaluation/testset.jsonand a CLI runner script. - Add new dependencies for evaluation (
ragas,datasets, plus transitive deps) and update docs/metadata to reflect evaluation and new test counts. - Observability fixes: prevent metadata mutation, reduce Langfuse warning spam, improve span cleanup, and align agent/multi
session_iddefaults.
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
uv.lock |
Locks new evaluation/runtime dependencies (ragas/datasets + transitive packages). |
pyproject.toml |
Adds ragas and datasets to project dependencies. |
src/pharmagraphrag/evaluation/__init__.py |
Declares the new evaluation package. |
src/pharmagraphrag/evaluation/dataset.py |
Loads curated JSON testset + converts samples to RAGAS-compatible records. |
src/pharmagraphrag/evaluation/metrics.py |
Wraps RAGAS metrics; configures evaluator LLM/embeddings via Gemini OpenAI-compatible endpoint. |
src/pharmagraphrag/evaluation/runner.py |
Calls API endpoints per mode, scores samples, exports CSV, computes summaries. |
src/pharmagraphrag/evaluation/agent_eval.py |
Computes tool-selection precision/recall/F1 and goal achievement for agent outputs. |
scripts/run_evaluation.py |
CLI entrypoint to run evaluation across classic/agent/multi and export reports. |
data/evaluation/testset.json |
Adds curated evaluation dataset (25 questions + references + expected tools). |
tests/test_evaluation.py |
Adds unit tests for evaluation modules (dataset/metrics/runner/agent eval). |
src/pharmagraphrag/observability.py |
Langfuse init caching + metadata copy fix + observe wrapper caching change. |
tests/test_observability.py |
Resets the new _langfuse_disabled global in test fixture. |
src/pharmagraphrag/engine/query_engine.py |
Wraps classic pipeline Langfuse span in try/finally for reliable cleanup. |
src/pharmagraphrag/agent/graph.py |
Ensures consistent session_id/thread_id defaulting for Langfuse + memory. |
src/pharmagraphrag/agent/multi.py |
Aligns Langfuse session_id defaulting with agent mode. |
README.md |
Documents evaluation feature + updates test counts and repo structure section. |
.github/copilot-instructions.md |
Updates project documentation to include evaluation module/test counts. |
test_api.py |
Removes leftover debug script. |
| """ | ||
| ragas_samples = [] | ||
| for sample in dataset.samples: | ||
| if not sample.answer: |
There was a problem hiding this comment.
to_ragas_dataset() docstring says it only includes samples where both answer and contexts are populated, but the implementation only checks sample.answer and will include entries with empty retrieved_contexts. This can skew evaluation (and some RAGAS metrics assume at least one retrieved context). Consider also skipping samples with no contexts (or enforcing a non-empty list).
| if not sample.answer: | |
| if not sample.answer or not sample.contexts: |
| if DEFAULT_TESTSET_PATH.exists(): | ||
| dataset = load_testset() | ||
| assert len(dataset) >= 20 | ||
| assert all(s.question for s in dataset.samples) | ||
| assert all(s.reference for s in dataset.samples) |
There was a problem hiding this comment.
This test is intended to verify the curated default testset exists, but it silently passes when DEFAULT_TESTSET_PATH is missing due to the if DEFAULT_TESTSET_PATH.exists(): guard. Prefer asserting the file exists (or explicitly pytest.skip(...) when absent) so CI fails if the committed dataset is accidentally removed from the repo/package.
| if DEFAULT_TESTSET_PATH.exists(): | |
| dataset = load_testset() | |
| assert len(dataset) >= 20 | |
| assert all(s.question for s in dataset.samples) | |
| assert all(s.reference for s in dataset.samples) | |
| assert DEFAULT_TESTSET_PATH.exists(), ( | |
| f"Expected curated default testset at {DEFAULT_TESTSET_PATH}" | |
| ) | |
| dataset = load_testset() | |
| assert len(dataset) >= 20 | |
| assert all(s.question for s in dataset.samples) | |
| assert all(s.reference for s in dataset.samples) |
| if not expected and not actual: | ||
| return AgentEvalResult( | ||
| sample_id=sample.id, | ||
| question=sample.question, | ||
| expected_tools=sample.expected_tools, | ||
| actual_tools=sample.tool_calls, | ||
| tool_precision=1.0, | ||
| tool_recall=1.0, | ||
| tool_f1=1.0, | ||
| goal_achieved=bool(sample.answer), | ||
| ) |
There was a problem hiding this comment.
In evaluate_tool_selection(), the early-return branch for not expected and not actual sets goal_achieved=bool(sample.answer), but the main branch treats answers containing an error marker ("ERROR") as not achieved. This makes goal accuracy inconsistent for samples that produce an error but have no tools expected/called. Consider aligning the goal check across both branches (e.g., also excluding error answers).
| def _call_agent( | ||
| question: str, config: RunConfig, endpoint: str = "/agent/query" | ||
| ) -> PipelineResponse: | ||
| """Call the agent or multi-agent pipeline.""" | ||
| url = f"{config.api_url}{endpoint}" | ||
| body: dict[str, Any] = {"question": question} | ||
| if config.model: | ||
| body["model"] = config.model | ||
|
|
||
| start = time.perf_counter() | ||
| try: | ||
| resp = httpx.post(url, json=body, timeout=config.timeout) | ||
| latency = (time.perf_counter() - start) * 1000 | ||
| resp.raise_for_status() | ||
| data = resp.json() | ||
|
|
||
| # Extract contexts from tool results and graph/vector data | ||
| contexts = [] | ||
| for tr in data.get("tool_results", []): | ||
| content = tr.get("content", "") | ||
| if content: | ||
| contexts.append(content[:2000]) | ||
|
|
||
| # Also use vector data snippets | ||
| for vd in data.get("vector_data", []): | ||
| snippet = vd.get("text", vd.get("snippet", "")) | ||
| if snippet: | ||
| contexts.append(snippet) | ||
|
|
||
| tools = [tc.get("tool", "") for tc in data.get("tool_calls", [])] | ||
|
|
There was a problem hiding this comment.
_call_agent() contains non-trivial response parsing (tool results → contexts, vector_data snippets, tool_calls extraction) but the current tests only mock _call_agent at the routing layer and never validate this parsing logic. Adding a focused unit test that mocks httpx.post() for an agent/multi response would help prevent regressions (e.g., schema/key changes in tool_calls, tool_results, or vector_data).
- dataset.py: skip samples with empty contexts in to_ragas_dataset() - agent_eval.py: consistent goal_achieved check (exclude ERROR answers) - test_evaluation.py: assert DEFAULT_TESTSET_PATH exists (fail if missing) - test_evaluation.py: add TestCallAgent with _call_agent parsing tests (42 tests) Total: 263 tests passing
Changes
RAGAS Evaluation Framework
Copilot Code Review Fixes (observability)