Random + oracle retrieval baselines (TREC bounds)#32
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds two baseline retrieval adapters (random lower bound + oracle upper bound) with CLI wiring and unit tests to support benchmarking/evaluation.
Changes:
- Introduce
RandomRetrievalAdapterandOracleRetrievalAdapterbaseline adapters. - Wire both adapters into
sme/cli.pyadapter loader aliases. - Add test suites covering ingest/query behavior and graph snapshot behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
sme/adapters/random_retrieval.py |
Implements random-K retrieval baseline adapter. |
sme/adapters/oracle_retrieval.py |
Implements oracle retrieval baseline adapter driven by gold questions. |
sme/cli.py |
Adds CLI adapter-loading support for random* and oracle* names. |
tests/test_random_retrieval.py |
Adds unit tests for random retrieval adapter behaviors. |
tests/test_oracle_retrieval.py |
Adds unit tests for oracle retrieval adapter behaviors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def test_different_seeds_diverge() -> None: | ||
| corpus = _make_corpus(50) | ||
| a = RandomRetrievalAdapter(seed=1, n_results=10) | ||
| b = RandomRetrievalAdapter(seed=2, n_results=10) | ||
| a.ingest_corpus(corpus) | ||
| b.ingest_corpus(corpus) | ||
| names_a = [e.name for e in a.query("q").retrieved_entities] | ||
| names_b = [e.name for e in b.query("q").retrieved_entities] | ||
| assert names_a != names_b | ||
|
|
||
|
|
| if name in ("random", "random-retrieval", "random_retrieval"): | ||
| from sme.adapters.random_retrieval import RandomRetrievalAdapter | ||
|
|
||
| for k in ( | ||
| "include_node_tables", | ||
| "include_edge_tables", | ||
| "auto_discover", | ||
| "kg_path", | ||
| "collection_name", | ||
| "default_query_mode", | ||
| "db_path", | ||
| "buffer_pool_size", | ||
| "api_url", | ||
| "api_key", | ||
| "kind", | ||
| "read_only", | ||
| ): | ||
| kwargs.pop(k, None) | ||
| return RandomRetrievalAdapter(**kwargs) | ||
|
|
||
| if name in ("oracle", "oracle-retrieval", "oracle_retrieval"): | ||
| from sme.adapters.oracle_retrieval import OracleRetrievalAdapter | ||
|
|
||
| for k in ( | ||
| "include_node_tables", | ||
| "include_edge_tables", | ||
| "auto_discover", | ||
| "kg_path", | ||
| "collection_name", | ||
| "default_query_mode", | ||
| "db_path", | ||
| "buffer_pool_size", | ||
| "api_url", | ||
| "api_key", | ||
| "kind", | ||
| "read_only", | ||
| ): | ||
| kwargs.pop(k, None) | ||
| return OracleRetrievalAdapter(**kwargs) |
| for i, item in enumerate(selected): | ||
| source = item.get("source_file", item.get("id", f"random_{i}")) | ||
| text = item.get("text", item.get("content", "")) | ||
| context_parts.append(f"[{i+1}] {source}\n{text}") | ||
| entities.append( | ||
| Entity( | ||
| id=f"random:{i}", | ||
| name=str(source), | ||
| entity_type="random_selection", | ||
| ) | ||
| ) |
|
Standard TREC bounds (random as floor, oracle as ceiling) is the right framing — every reading downstream becomes interpretable as "where in the [random, oracle] interval did this land," and the seeded random adapter is the reproducibility piece that makes it usable as a regression bound. One real concern: the The duplicated kwargs-stripping block Copilot flagged is real but low-impact; that should fall out of the registry refactor in #30 anyway once you rebase onto it. Also: please add both |
6792fd5 to
d9d2beb
Compare
Two reference adapters that establish the standard lower/upper bounds for retrieval evaluation per TREC methodology: - RandomRetrievalAdapter: returns K uniformly random items from the ingested corpus, seeded for reproducibility. Any system that can't beat this isn't doing retrieval. - OracleRetrievalAdapter: returns the gold expected_sources verbatim in context_string. Substring-scorer ceiling — no system can do better than perfect retrieval. Both wired into sme/cli.py _load_adapter() under the names random/random-retrieval/random_retrieval and oracle/oracle-retrieval/ oracle_retrieval. Adapter-specific kwargs from other adapters are silently dropped for CLI parity. Per the issue brief, baselines are intentionally NOT added to the adapter harness manifest contract — they're reference bounds, not adapters under test. Closes M0nkeyFl0wer#23 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d9d2beb to
23753d9
Compare
|
Rebased onto
Diff is one clean commit on 🫏 |
|
The wave of fixes that landed on #31/#33/#34/#35/#36/#37 this morning was lovely to walk through — small focused commits, named constants, deterministic outputs, tests added, spec doc updated where the methodology actually changed. Not pushing on this PR to match that cadence; you've earned the right to sequence as makes sense. One coordination question so this doesn't quietly decay: the random-ID concern (the synthetic The TREC-bounds framing is the right shape regardless, so this isn't gated by the ID question — happy to land it either way once we know which. |
Summary
Closes #23.
Adds two standard-methodology baseline adapters that bound system-under-test performance:
RandomRetrievalAdapter— uniform random K items from corpus, seeded for reproducibility. TREC-standard lower bound.OracleRetrievalAdapter— returns goldexpected_sourcesverbatim. TREC-standard upper bound (ceiling)._load_adapter()asrandomandoracleadapter namesThese enable the TREC-standard normalized metric:
(system − random) / (oracle − random), which is a prerequisite for meaningful cross-system comparisons.Design decisions:
expected_sourcesfrom questions YAML — matches the current substring-based scoringTest plan
test_random_retrieval.py— seed determinism, corpus-size edge cases, entity typetest_oracle_retrieval.py— gold matching, no-gold fallback, entity construction🫏 Generated with Claude Code