fix(eval): expose full graph/vector context for RAGAS and update mocks for async API #3
…s for async API
- api/main.py + api/models.py: add graph_context and vector_context fields to
QueryResponse so evaluation runners can score the actual text the LLM saw
instead of placeholder snippets like 'Knowledge graph data for {drug}'
- evaluation/runner.py: prefer full graph_context/vector_context with
graceful fallback to sources[].snippet for backward compatibility
- evaluation/metrics.py: migrate to RAGAS 0.4.x async API using
single_turn_ascore + SingleTurnSample, LangchainLLMWrapper for the
Gemini OpenAI-compatible endpoint, and GoogleGenerativeAIEmbeddings
for embeddings (OpenAI-compat returns 501 for embeddings)
- metrics.py: max_tokens raised to 8192 to avoid truncation in
Faithfulness and AnswerCorrectness multi-statement prompts
- metrics.py: explicit AnswerSimilarity(embeddings=...) wiring for
AnswerCorrectness
- tests/test_evaluation.py: update mocks to AsyncMock on
single_turn_ascore, add test for full context priority in _call_classic
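For illustration, the runner's preference order described above can be sketched as follows. This is a minimal sketch, not the code in evaluation/runner.py: the helper name `extract_contexts` and the sample payloads are hypothetical, and only the field names `graph_context`, `vector_context` and `sources[].snippet` come from this PR.

```python
from typing import Any


def extract_contexts(response: dict[str, Any]) -> list[str]:
    """Return the context strings to score, preferring the full context fields."""
    # Prefer the full text the LLM saw, when the API response provides it.
    full = [response.get(key) for key in ("graph_context", "vector_context")]
    full = [text for text in full if text]
    if full:
        return full
    # Older responses carry no full context: fall back to per-source snippets.
    return [
        src["snippet"]
        for src in response.get("sources", [])
        if src.get("snippet")
    ]


# Hypothetical payloads, for illustration only.
new_style = {
    "graph_context": "Drug A INTERACTS_WITH Drug B",
    "vector_context": "Label text for drug A ...",
    "sources": [{"snippet": "Knowledge graph data for drug A"}],
}
old_style = {"sources": [{"snippet": "Knowledge graph data for drug A"}]}

print(extract_contexts(new_style))
print(extract_contexts(old_style))
```

With the new-style payload the placeholder snippet is ignored; with the old-style payload it is the only context available, so it is still used.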
Pull request overview
This PR fixes RAGAS evaluation so metrics are computed against the actual graph/vector context provided to the LLM (instead of placeholder/truncated snippets), and updates the evaluation metrics wrapper to the RAGAS 0.4.x async scoring API.
Changes:
- Expose full `graph_context`/`vector_context` in `POST /query` responses and prefer them in the evaluation runner (with backward-compatible fallback to `sources[].snippet`).
- Migrate metric scoring to RAGAS 0.4.x using `SingleTurnSample` + `single_turn_ascore` via `asyncio.run`.
- Update evaluation tests to mock async metric scoring and add coverage for the “prefer full contexts” behavior.
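The pattern of driving an async scorer from synchronous code via `asyncio.run` can be sketched as below. This is a sketch under stated assumptions, not the PR's metrics module: `Sample` and `OverlapMetric` are hypothetical stand-ins so the example runs without RAGAS installed, the field names are assumed to mirror those of `SingleTurnSample`, and the word-overlap score is a toy, not a RAGAS metric.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for a single-turn evaluation sample."""
    user_input: str
    response: str
    retrieved_contexts: list[str] = field(default_factory=list)


class OverlapMetric:
    """Toy metric exposing an async scoring method; not a RAGAS metric."""

    async def single_turn_ascore(self, sample: Sample) -> float:
        await asyncio.sleep(0)  # stands in for an LLM or embeddings call
        context_words = set(" ".join(sample.retrieved_contexts).lower().split())
        answer_words = sample.response.lower().split()
        if not answer_words:
            return 0.0
        # Fraction of answer words that also appear in the retrieved contexts.
        hits = sum(word in context_words for word in answer_words)
        return hits / len(answer_words)


def score_sync(metric: OverlapMetric, sample: Sample) -> float:
    """Run the async scorer to completion from synchronous code."""
    return asyncio.run(metric.single_turn_ascore(sample))


sample = Sample(
    user_input="What does aspirin do?",
    response="aspirin thins blood",
    retrieved_contexts=["Aspirin thins blood and reduces fever"],
)
print(score_sync(OverlapMetric(), sample))
```

Note that `asyncio.run` raises `RuntimeError` when called from a thread that already has a running event loop, so a wrapper like this only suits synchronous callers.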
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_evaluation.py | Updates mocks to AsyncMock for async metric API; adds test ensuring runner prefers full contexts over snippets. |
| src/pharmagraphrag/evaluation/runner.py | Prefers graph_context/vector_context in _call_classic, falling back to snippets for older API responses. |
| src/pharmagraphrag/evaluation/metrics.py | Migrates scoring to RAGAS 0.4.x async API; switches evaluator LLM wrapper to LangChain; updates embeddings to use Gemini native embeddings. |
| src/pharmagraphrag/api/models.py | Extends QueryResponse schema with full graph_context/vector_context fields. |
| src/pharmagraphrag/api/main.py | Populates graph_context/vector_context in /query response from retrieved context. |

      from langchain_openai import ChatOpenAI
      from ragas.llms import LangchainLLMWrapper

      api_key = os.environ.get("GEMINI_API_KEY", "")
      if not api_key:
          raise ValueError("GEMINI_API_KEY env var is required for RAGAS evaluation")

    - client = OpenAI(
    + chat = ChatOpenAI(
          model=model,
          api_key=api_key,
          base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
          max_tokens=8192,
          temperature=0.0,
      )
    - return llm_factory(model, provider="openai", client=client)
    + return LangchainLLMWrapper(chat)

_get_evaluator_llm() now imports langchain_openai.ChatOpenAI, but langchain-openai is not listed in pyproject.toml dependencies. A fresh install will fail at runtime when running evaluation. Add langchain-openai (and ensure compatible openai dependency) to the project dependencies or switch to an evaluator LLM wrapper that’s already included in the dependency set.

    has_graph_context=result.context.has_graph,
    has_vector_context=result.context.has_vector,
    sources=sources,
    graph_context=result.context.graph_context or "",
    vector_context=result.context.vector_context or "",

QueryResponse now always includes graph_context/vector_context strings. These can be very large (especially vector_context), increasing response payload size/latency for every /query call even when the UI only needs sources. Consider gating these fields behind an opt-in request flag (e.g. include_full_context: bool = False) or only populating them when running in evaluation/debug mode.

    graph_context: str = Field(
        "", description="Full graph context passed to the LLM (for evaluation/debugging)."
    )
    vector_context: str = Field(
        "", description="Full vector context passed to the LLM (for evaluation/debugging)."
    )

Adding full graph_context/vector_context to the response schema is useful for evaluation, but since these fields may contain large texts, it would help API consumers if the schema clearly indicates when they are populated (or if they’re optional/omitted by default). Consider making them optional (None) unless explicitly requested, or adding an explicit include_full_context request parameter to avoid unintentionally large responses.
Addresses PR review feedback:
- Add include_full_context flag (default False) to QueryRequest so large graph/vector context is opt-in
- Make graph_context/vector_context optional (None by default) in QueryResponse
- Update /query endpoint to respect the flag
- Update evaluation runner to set include_full_context=True when calling classic
- Add langchain-openai dependency (used by RAGAS metrics via OpenAI-compatible endpoint)
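The opt-in behavior listed above could look roughly like this. It is a sketch only: plain dataclasses stand in for the project's actual request/response models, the `question` and `answer` fields and the `build_response` helper are hypothetical, and only `include_full_context`, `graph_context` and `vector_context` are names taken from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryRequest:
    question: str  # hypothetical field name
    include_full_context: bool = False  # opt-in, defaults to off


@dataclass
class QueryResponse:
    answer: str  # hypothetical field name
    graph_context: Optional[str] = None
    vector_context: Optional[str] = None


def build_response(
    request: QueryRequest,
    answer: str,
    graph_text: Optional[str],
    vector_text: Optional[str],
) -> QueryResponse:
    """Attach the full contexts only when the caller asked for them."""
    if not request.include_full_context:
        # Default path: keep the payload small, leave both fields as None.
        return QueryResponse(answer=answer)
    return QueryResponse(
        answer=answer,
        graph_context=graph_text,
        vector_context=vector_text,
    )


ui_call = build_response(QueryRequest("q"), "a", "graph text", "vector text")
eval_call = build_response(
    QueryRequest("q", include_full_context=True), "a", "graph text", "vector text"
)
print(ui_call.graph_context, eval_call.graph_context)
```

Leaving the fields as `None` by default lets a consumer distinguish "not requested" from "requested but empty".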
Uses dorny/paths-filter to detect changes in docker/, pyproject.toml, uv.lock, requirements.txt or ci.yml. PRs that only touch Python code will skip the ~6-10 min Docker build. To force a build, include '[docker]' in the commit message.
Summary
Fix RAGAS evaluation so it scores the real context the LLM saw, not placeholder snippets, and migrate the metrics module to the RAGAS 0.4.x async API.
Background
`/query` was returning sources with snippets like `"Knowledge graph data for {drug}"` (a fixed string) and vector texts truncated to 200 chars. The runner fed those snippets to RAGAS, so Faithfulness scored around 0.12 on classic mode even when the LLM answer was fully grounded in the real graph/vector context.
Changes
- `QueryResponse` now includes `graph_context` and `vector_context` with the full text the LLM received. `sources[]` is kept for UI compatibility.
- `_call_classic` prefers the new full context fields; falls back to `sources[].snippet` for backward compatibility.
- `SingleTurnSample` + `single_turn_ascore` wrapped in `asyncio.run`. LLM via `LangchainLLMWrapper(ChatOpenAI(...))` against Gemini's OpenAI-compatible endpoint. Embeddings via `GoogleGenerativeAIEmbeddings` (the OpenAI-compat endpoint returns 501 for embeddings). Explicit `AnswerSimilarity(embeddings=...)` for `AnswerCorrectness`.
- `max_tokens` raised to 8192 so Faithfulness and AnswerCorrectness don't truncate.
- `AsyncMock` on `single_turn_ascore`; added one case that verifies `_call_classic` prefers full contexts over snippets.
Validation
Ran `--limit 3` against local API on this branch:
All 43 tests in `test_evaluation.py` and 18 tests in `test_api.py` pass.
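The mock change described under Changes can be illustrated with a small test in the same spirit. This is a sketch, not a test from `test_evaluation.py`: `score_sample` is a hypothetical synchronous wrapper, and the metric is a bare mock rather than a real RAGAS metric.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


def score_sample(metric, sample) -> float:
    """Hypothetical sync wrapper around an async metric method."""
    return float(asyncio.run(metric.single_turn_ascore(sample)))


def test_score_sample_awaits_async_metric() -> None:
    metric = MagicMock()
    # Calling an AsyncMock returns a coroutine, which asyncio.run can drive;
    # a plain MagicMock return value would not be awaitable.
    metric.single_turn_ascore = AsyncMock(return_value=0.87)
    sample = object()

    assert score_sample(metric, sample) == 0.87
    metric.single_turn_ascore.assert_awaited_once_with(sample)


test_score_sample_awaits_async_metric()
```

`assert_awaited_once_with` checks that the coroutine was actually awaited with the expected sample, which a call-count assertion alone would not show.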