Skip to content

feat: RAGAS evaluation framework + Copilot review fixes#2

Merged
jmponcebe merged 4 commits into
mainfrom
feat/ragas-evaluation
Apr 18, 2026
Merged

feat: RAGAS evaluation framework + Copilot review fixes#2
jmponcebe merged 4 commits into
mainfrom
feat/ragas-evaluation

Conversation

@jmponcebe
Copy link
Copy Markdown
Owner

Changes

RAGAS Evaluation Framework

  • Add evaluation module: metrics.py, dataset.py, runner.py, agent_eval.py
  • Curated testset: 25 questions across 8 types with ground truth and expected tools
  • RAGAS 0.4.3 metrics: Faithfulness, Answer Relevancy, Context Precision, Context Recall, Answer Correctness
  • Agent evaluation: tool selection accuracy (precision/recall/F1), goal achievement
  • Batch runner: classic/agent/multi modes with CSV export
  • Evaluation script: scripts/run_evaluation.py
  • 40 new tests (261 total, all passing)

Copilot Code Review Fixes (observability)

  • Fix shallow copy mutation in build_callback_config
  • Cache disabled state to prevent warning spam
  • Remove unused as_type parameter from observe_fn
  • Cache traced function in closure for performance
  • Wrap classic pipeline span in try/finally
  • Fix thread_id consistency in agent/multi Langfuse session_id
  • Remove leftover test_api.py debug script

- Add evaluation module (metrics, dataset, runner, agent_eval)
- RAGAS 0.4.3: Faithfulness, Relevancy, Precision, Recall, Correctness
- Curated testset: 25 questions across 8 types with ground truth
- Agent evaluation: tool selection accuracy (precision/recall/F1)
- Batch runner: classic/agent/multi modes, CSV export
- 40 new tests (261 total), all passing
- Update docs and README with evaluation section
- Fix shallow copy mutation in build_callback_config (copy metadata dict)
- Cache disabled state to prevent warning spam on every request
- Remove unused as_type parameter from observe_fn
- Cache traced function in closure to avoid re-wrapping on every call
- Wrap classic pipeline span in try/finally for proper cleanup
- Use contextlib.suppress instead of try-except-pass
- Fix thread_id consistency in agent/multi Langfuse session_id
- Remove leftover test_api.py debug script from root
Copilot AI review requested due to automatic review settings April 18, 2026 20:02
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an evaluation subsystem to PharmaGraphRAG (curated testset + RAGAS metrics + batch runner) while also applying several observability reliability/perf fixes (Langfuse init/metadata handling, classic pipeline span lifecycle, consistent agent session IDs).

Changes:

  • Introduce pharmagraphrag.evaluation (dataset loader, RAGAS metric wrappers, runner, agent tool-selection evaluation) + curated data/evaluation/testset.json and a CLI runner script.
  • Add new dependencies for evaluation (ragas, datasets, plus transitive deps) and update docs/metadata to reflect evaluation and new test counts.
  • Observability fixes: prevent metadata mutation, reduce Langfuse warning spam, improve span cleanup, and align agent/multi session_id defaults.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
uv.lock Locks new evaluation/runtime dependencies (ragas/datasets + transitive packages).
pyproject.toml Adds ragas and datasets to project dependencies.
src/pharmagraphrag/evaluation/__init__.py Declares the new evaluation package.
src/pharmagraphrag/evaluation/dataset.py Loads curated JSON testset + converts samples to RAGAS-compatible records.
src/pharmagraphrag/evaluation/metrics.py Wraps RAGAS metrics; configures evaluator LLM/embeddings via Gemini OpenAI-compatible endpoint.
src/pharmagraphrag/evaluation/runner.py Calls API endpoints per mode, scores samples, exports CSV, computes summaries.
src/pharmagraphrag/evaluation/agent_eval.py Computes tool-selection precision/recall/F1 and goal achievement for agent outputs.
scripts/run_evaluation.py CLI entrypoint to run evaluation across classic/agent/multi and export reports.
data/evaluation/testset.json Adds curated evaluation dataset (25 questions + references + expected tools).
tests/test_evaluation.py Adds unit tests for evaluation modules (dataset/metrics/runner/agent eval).
src/pharmagraphrag/observability.py Langfuse init caching + metadata copy fix + observe wrapper caching change.
tests/test_observability.py Resets the new _langfuse_disabled global in test fixture.
src/pharmagraphrag/engine/query_engine.py Wraps classic pipeline Langfuse span in try/finally for reliable cleanup.
src/pharmagraphrag/agent/graph.py Ensures consistent session_id/thread_id defaulting for Langfuse + memory.
src/pharmagraphrag/agent/multi.py Aligns Langfuse session_id defaulting with agent mode.
README.md Documents evaluation feature + updates test counts and repo structure section.
.github/copilot-instructions.md Updates project documentation to include evaluation module/test counts.
test_api.py Removes leftover debug script.

"""
ragas_samples = []
for sample in dataset.samples:
if not sample.answer:
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_ragas_dataset() docstring says it only includes samples where both answer and contexts are populated, but the implementation only checks sample.answer and will include entries with empty retrieved_contexts. This can skew evaluation (and some RAGAS metrics assume at least one retrieved context). Consider also skipping samples with no contexts (or enforcing a non-empty list).

Suggested change
if not sample.answer:
if not sample.answer or not sample.contexts:

Copilot uses AI. Check for mistakes.
Comment thread tests/test_evaluation.py Outdated
Comment on lines +539 to +543
if DEFAULT_TESTSET_PATH.exists():
dataset = load_testset()
assert len(dataset) >= 20
assert all(s.question for s in dataset.samples)
assert all(s.reference for s in dataset.samples)
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is intended to verify the curated default testset exists, but it silently passes when DEFAULT_TESTSET_PATH is missing due to the if DEFAULT_TESTSET_PATH.exists(): guard. Prefer asserting the file exists (or explicitly pytest.skip(...) when absent) so CI fails if the committed dataset is accidentally removed from the repo/package.

Suggested change
if DEFAULT_TESTSET_PATH.exists():
dataset = load_testset()
assert len(dataset) >= 20
assert all(s.question for s in dataset.samples)
assert all(s.reference for s in dataset.samples)
assert DEFAULT_TESTSET_PATH.exists(), (
f"Expected curated default testset at {DEFAULT_TESTSET_PATH}"
)
dataset = load_testset()
assert len(dataset) >= 20
assert all(s.question for s in dataset.samples)
assert all(s.reference for s in dataset.samples)

Copilot uses AI. Check for mistakes.
Comment on lines +58 to +68
if not expected and not actual:
return AgentEvalResult(
sample_id=sample.id,
question=sample.question,
expected_tools=sample.expected_tools,
actual_tools=sample.tool_calls,
tool_precision=1.0,
tool_recall=1.0,
tool_f1=1.0,
goal_achieved=bool(sample.answer),
)
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In evaluate_tool_selection(), the early-return branch for not expected and not actual sets goal_achieved=bool(sample.answer), but the main branch treats answers containing an error marker ("ERROR") as not achieved. This makes goal accuracy inconsistent for samples that produce an error but have no tools expected/called. Consider aligning the goal check across both branches (e.g., also excluding error answers).

Copilot uses AI. Check for mistakes.
Comment on lines +77 to +107
def _call_agent(
question: str, config: RunConfig, endpoint: str = "/agent/query"
) -> PipelineResponse:
"""Call the agent or multi-agent pipeline."""
url = f"{config.api_url}{endpoint}"
body: dict[str, Any] = {"question": question}
if config.model:
body["model"] = config.model

start = time.perf_counter()
try:
resp = httpx.post(url, json=body, timeout=config.timeout)
latency = (time.perf_counter() - start) * 1000
resp.raise_for_status()
data = resp.json()

# Extract contexts from tool results and graph/vector data
contexts = []
for tr in data.get("tool_results", []):
content = tr.get("content", "")
if content:
contexts.append(content[:2000])

# Also use vector data snippets
for vd in data.get("vector_data", []):
snippet = vd.get("text", vd.get("snippet", ""))
if snippet:
contexts.append(snippet)

tools = [tc.get("tool", "") for tc in data.get("tool_calls", [])]

Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_call_agent() contains non-trivial response parsing (tool results → contexts, vector_data snippets, tool_calls extraction) but the current tests only mock _call_agent at the routing layer and never validate this parsing logic. Adding a focused unit test that mocks httpx.post() for an agent/multi response would help prevent regressions (e.g., schema/key changes in tool_calls, tool_results, or vector_data).

Copilot generated this review using guidance from repository custom instructions.
- dataset.py: skip samples with empty contexts in to_ragas_dataset()
- agent_eval.py: consistent goal_achieved check (exclude ERROR answers)
- test_evaluation.py: assert DEFAULT_TESTSET_PATH exists (fail if missing)
- test_evaluation.py: add TestCallAgent with _call_agent parsing tests (42 tests)

Total: 263 tests passing
@jmponcebe jmponcebe merged commit 87878bc into main Apr 18, 2026
3 checks passed
@jmponcebe jmponcebe deleted the feat/ragas-evaluation branch April 18, 2026 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants