fix(eval): expose full graph/vector context for RAGAS and update mocks for async API #3

Merged
jmponcebe merged 3 commits into main from fix/eval-context-and-ragas-async on Apr 21, 2026

@jmponcebe
Owner

Summary

Fix RAGAS evaluation so it scores the real context the LLM saw, not placeholder snippets, and migrate the metrics module to the RAGAS 0.4.x async API.

Background

/query was returning sources with snippets like "Knowledge graph data for {drug}" (a fixed string) and vector texts truncated to 200 chars. The runner fed those snippets to RAGAS, so Faithfulness scored around 0.12 on classic mode even when the LLM answer was fully grounded in the real graph/vector context.

Changes

  • API: QueryResponse now includes graph_context and vector_context with the full text the LLM received. sources[] is kept for UI compatibility.
  • Runner: _call_classic prefers the new full context fields; falls back to sources[].snippet for backward compatibility.
  • Metrics: migrated to RAGAS 0.4.x using SingleTurnSample + single_turn_ascore wrapped in asyncio.run. LLM via LangchainLLMWrapper(ChatOpenAI(...)) against Gemini's OpenAI-compatible endpoint. Embeddings via GoogleGenerativeAIEmbeddings (the OpenAI-compat endpoint returns 501 for embeddings). Explicit AnswerSimilarity(embeddings=...) for AnswerCorrectness. max_tokens raised to 8192 so Faithfulness and AnswerCorrectness don't truncate.
  • Tests: mocks updated to AsyncMock on single_turn_ascore; added one case that verifies _call_classic prefers full contexts over snippets.
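
The runner-side preference described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `_call_classic`: the helper name `extract_contexts` and the exact response shape (a dict with optional `graph_context`, `vector_context`, and `sources[].snippet` entries) are assumptions made for the example.

```python
from typing import Any


def extract_contexts(response: dict[str, Any]) -> list[str]:
    """Return the contexts to hand to RAGAS for one /query response.

    Hypothetical helper: prefers the full context fields and falls back
    to source snippets when talking to an older API.
    """
    # Full context fields: keep only the non-empty ones.
    full = [
        text
        for text in (response.get("graph_context"), response.get("vector_context"))
        if text
    ]
    if full:
        return full

    # Backward-compatible fallback: older responses only carry snippets.
    return [
        src["snippet"]
        for src in response.get("sources", [])
        if src.get("snippet")
    ]
```

With this shape, a response that carries both the full fields and snippets is scored on the full fields only, which is the behavior the new test case is meant to pin down.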

Validation

Ran --limit 3 against local API on this branch:

| Metric | Classic before | Classic after |
| --- | --- | --- |
| Faithfulness | 0.12 | 0.94 |
| Context Precision | 0.22 | 0.83 |

All 43 tests in test_evaluation.py and 18 tests in test_api.py pass.

…s for async API

- api/main.py + api/models.py: add graph_context and vector_context fields to
  QueryResponse so evaluation runners can score the actual text the LLM saw
  instead of placeholder snippets like 'Knowledge graph data for {drug}'
- evaluation/runner.py: prefer full graph_context/vector_context with
  graceful fallback to sources[].snippet for backward compatibility
- evaluation/metrics.py: migrate to RAGAS 0.4.x async API using
  single_turn_ascore + SingleTurnSample, LangchainLLMWrapper for the
  Gemini OpenAI-compatible endpoint, and GoogleGenerativeAIEmbeddings
  for embeddings (OpenAI-compat returns 501 for embeddings)
- metrics.py: max_tokens raised to 8192 to avoid truncation in
  Faithfulness and AnswerCorrectness multi-statement prompts
- metrics.py: explicit AnswerSimilarity(embeddings=...) wiring for
  AnswerCorrectness
- tests/test_evaluation.py: update mocks to AsyncMock on
  single_turn_ascore, add test for full context priority in _call_classic
Copilot AI review requested due to automatic review settings April 21, 2026 09:40

Copilot AI left a comment

Pull request overview

This PR fixes RAGAS evaluation so metrics are computed against the actual graph/vector context provided to the LLM (instead of placeholder/truncated snippets), and updates the evaluation metrics wrapper to the RAGAS 0.4.x async scoring API.

Changes:

  • Expose full graph_context / vector_context in POST /query responses and prefer them in the evaluation runner (with backward-compatible fallback to sources[].snippet).
  • Migrate metric scoring to RAGAS 0.4.x using SingleTurnSample + single_turn_ascore via asyncio.run.
  • Update evaluation tests to mock async metric scoring and add coverage for the “prefer full contexts” behavior.
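
The async mocking pattern mentioned above can be illustrated with the standard library alone. The sketch below assumes a metric object exposing an async `single_turn_ascore(sample)` method, as named in this PR. The `score_sync` wrapper and the dict used as a stand-in sample are hypothetical. The sketch does not import RAGAS or reproduce the project's metrics module.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


def score_sync(metric, sample) -> float:
    """Hypothetical sync wrapper: run one async metric call to completion."""
    return asyncio.run(metric.single_turn_ascore(sample))


# Stand-in for a metric object: only the async scoring method is mocked.
fake_metric = MagicMock()
fake_metric.single_turn_ascore = AsyncMock(return_value=0.5)

# Plain dict standing in for a real sample object (assumption for the sketch).
sample = {"user_input": "q", "response": "a", "retrieved_contexts": ["ctx"]}
score = score_sync(fake_metric, sample)

# AsyncMock records awaits, so a test can assert on the call itself.
fake_metric.single_turn_ascore.assert_awaited_once_with(sample)
```

Note that `asyncio.run` raises `RuntimeError` when called from a thread that already has a running event loop, so a wrapper like this only suits synchronous entry points such as a CLI runner.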

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/test_evaluation.py | Updates mocks to AsyncMock for async metric API; adds test ensuring runner prefers full contexts over snippets. |
| src/pharmagraphrag/evaluation/runner.py | Prefers graph_context/vector_context in _call_classic, falling back to snippets for older API responses. |
| src/pharmagraphrag/evaluation/metrics.py | Migrates scoring to RAGAS 0.4.x async API; switches evaluator LLM wrapper to LangChain; updates embeddings to use Gemini native embeddings. |
| src/pharmagraphrag/api/models.py | Extends QueryResponse schema with full graph_context/vector_context fields. |
| src/pharmagraphrag/api/main.py | Populates graph_context/vector_context in /query response from retrieved context. |

Comment on lines +50 to +64
  from langchain_openai import ChatOpenAI
  from ragas.llms import LangchainLLMWrapper

  api_key = os.environ.get("GEMINI_API_KEY", "")
  if not api_key:
      raise ValueError("GEMINI_API_KEY env var is required for RAGAS evaluation")

- client = OpenAI(
+ chat = ChatOpenAI(
      model=model,
      api_key=api_key,
      base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
      max_tokens=8192,
      temperature=0.0,
  )
- return llm_factory(model, provider="openai", client=client)
+ return LangchainLLMWrapper(chat)

Copilot AI Apr 21, 2026

_get_evaluator_llm() now imports langchain_openai.ChatOpenAI, but langchain-openai is not listed in pyproject.toml dependencies. A fresh install will fail at runtime when running evaluation. Add langchain-openai (and ensure compatible openai dependency) to the project dependencies or switch to an evaluator LLM wrapper that’s already included in the dependency set.
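
Besides declaring the dependency, one way to surface this kind of gap early is a small preflight check before evaluation starts. The sketch below is an illustration only: `missing_modules` and `require_modules` are hypothetical helpers, not part of the project, and they only handle top-level module names.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


def require_modules(names: list[str]) -> None:
    """Raise a readable error up front if any module is missing."""
    missing = missing_modules(names)
    if missing:
        raise RuntimeError(
            "Missing evaluation dependencies: " + ", ".join(missing)
        )
```

A runner could call `require_modules(["langchain_openai", "ragas"])` at startup, so a fresh install fails with one clear message instead of an `ImportError` from inside `_get_evaluator_llm()`.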

Comment thread src/pharmagraphrag/api/main.py Outdated
Comment on lines +144 to +148
has_graph_context=result.context.has_graph,
has_vector_context=result.context.has_vector,
sources=sources,
graph_context=result.context.graph_context or "",
vector_context=result.context.vector_context or "",

Copilot AI Apr 21, 2026

QueryResponse now always includes graph_context/vector_context strings. These can be very large (especially vector_context), increasing response payload size/latency for every /query call even when the UI only needs sources. Consider gating these fields behind an opt-in request flag (e.g. include_full_context: bool = False) or only populating them when running in evaluation/debug mode.
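
A rough sketch of the opt-in gating suggested here follows. It uses plain dataclasses instead of the project's Pydantic models to stay dependency-free. The `question` and `answer` fields and the `build_response` helper are hypothetical names chosen for the example. Only `include_full_context`, `graph_context`, and `vector_context` come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryRequest:
    question: str  # hypothetical field name
    include_full_context: bool = False  # opt-in, so default responses stay small


@dataclass
class QueryResponse:
    answer: str  # hypothetical field name
    graph_context: Optional[str] = None
    vector_context: Optional[str] = None


def build_response(
    req: QueryRequest, answer: str, graph_context: str, vector_context: str
) -> QueryResponse:
    """Attach the full contexts only when the caller asked for them."""
    if not req.include_full_context:
        return QueryResponse(answer=answer)
    return QueryResponse(
        answer=answer,
        graph_context=graph_context,
        vector_context=vector_context,
    )
```

Leaving the fields as `None` by default lets API consumers distinguish "not requested" from "requested but empty", which an empty-string default cannot express.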

Comment thread src/pharmagraphrag/api/models.py Outdated
Comment on lines +61 to +66
    graph_context: str = Field(
        "", description="Full graph context passed to the LLM (for evaluation/debugging)."
    )
    vector_context: str = Field(
        "", description="Full vector context passed to the LLM (for evaluation/debugging)."
    )

Copilot AI Apr 21, 2026

Adding full graph_context/vector_context to the response schema is useful for evaluation, but since these fields may contain large texts, it would help API consumers if the schema clearly indicates when they are populated (or if they’re optional/omitted by default). Consider making them optional (None) unless explicitly requested, or adding an explicit include_full_context request parameter to avoid unintentionally large responses.

Addresses PR review feedback:

- Add include_full_context flag (default False) to QueryRequest so large graph/vector context is opt-in
- Make graph_context/vector_context optional (None by default) in QueryResponse
- Update /query endpoint to respect the flag
- Update evaluation runner to set include_full_context=True when calling classic
- Add langchain-openai dependency (used by RAGAS metrics via OpenAI-compatible endpoint)

Uses dorny/paths-filter to detect changes in docker/, pyproject.toml, uv.lock, requirements.txt or ci.yml.

PRs that only touch Python code will skip the ~6-10 min Docker build. To force a build, include '[docker]' in the commit message.
@jmponcebe jmponcebe merged commit 8abd9f4 into main Apr 21, 2026
4 checks passed
@jmponcebe jmponcebe deleted the fix/eval-context-and-ragas-async branch April 21, 2026 10:22