
Random + oracle retrieval baselines (TREC bounds) #32

Open
jphein wants to merge 1 commit into M0nkeyFl0wer:main from techempower-org:feat/random-oracle-baselines


@jphein
Contributor

@jphein jphein commented May 25, 2026

Summary

Closes #23.

Adds two standard-methodology baseline adapters that bound system-under-test performance:

  • RandomRetrievalAdapter — uniform random K items from corpus, seeded for reproducibility. TREC-standard lower bound.
  • OracleRetrievalAdapter — returns gold expected_sources verbatim. TREC-standard upper bound (ceiling).
  • Both wired into the CLI _load_adapter() as random and oracle adapter names

These enable the TREC-standard normalized metric: (system − random) / (oracle − random), which is a prerequisite for meaningful cross-system comparisons.
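
A minimal sketch of that normalization, assuming each score is a single scalar on the same metric. The helper name `normalized_score` and the example numbers are illustrative and do not come from this PR:

```python
def normalized_score(system: float, random_floor: float, oracle_ceiling: float) -> float:
    """Map a raw score onto the [random, oracle] interval.

    0.0 means "no better than random"; 1.0 means "matches the oracle".
    Hypothetical helper: name and signature are not part of the PR.
    """
    span = oracle_ceiling - random_floor
    if span <= 0:
        # Degenerate bounds: the interval carries no information.
        raise ValueError("oracle score must exceed random score")
    return (system - random_floor) / span


# Made-up scores, purely for illustration.
score = normalized_score(system=0.55, random_floor=0.10, oracle_ceiling=1.00)
print(round(score, 3))
```

A system that scores below the random floor gets a negative value, which is a useful signal in its own right rather than something to clamp away.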

Design decisions:

  • Random baseline uses uniform-random (not stratified by node-type) per the issue discussion — purity over structural awareness
  • Oracle reads expected_sources from questions YAML — matches the current substring-based scoring
  • BM25 calibrated-lower deferred to follow-up PR per issue scope
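
The first two decisions can be sketched roughly as follows. This is not the adapter code from the PR: the function names, the per-call RNG, and the empty-gold behavior are assumptions made for illustration; only the `expected_sources` key and the substring-scoring idea come from the description above.

```python
import random
from typing import Dict, List, Sequence


def random_baseline(corpus: Sequence[Dict[str, str]], k: int, seed: int) -> List[Dict[str, str]]:
    """Pick K items uniformly at random, with no stratification by node type."""
    rng = random.Random(seed)  # private RNG: global random state is untouched
    k = min(k, len(corpus))    # assumption: a corpus smaller than K returns everything
    return rng.sample(list(corpus), k)


def oracle_context(question: Dict[str, object]) -> str:
    """Join the gold expected_sources verbatim into one context string."""
    sources = question.get("expected_sources") or []
    return "\n".join(str(s) for s in sources)


def substring_recall(context: str, expected: Sequence[str]) -> float:
    """Fraction of expected sources that appear as substrings of the context."""
    if not expected:
        return 0.0  # assumption: no gold sources means nothing to score
    hits = sum(1 for source in expected if source in context)
    return hits / len(expected)


corpus = [{"id": f"doc_{i}"} for i in range(50)]
first = random_baseline(corpus, k=10, seed=1)
second = random_baseline(corpus, k=10, seed=1)
print(first == second)  # same seed, same picks

question = {"expected_sources": ["notes/a.md", "notes/b.md"]}
print(substring_recall(oracle_context(question), question["expected_sources"]))
```

Because the oracle echoes the gold sources, a substring scorer gives it full marks by construction, which is what makes it a ceiling for that scorer specifically.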

Test plan

  • test_random_retrieval.py — seed determinism, corpus-size edge cases, entity type
  • test_oracle_retrieval.py — gold matching, no-gold fallback, entity construction
  • Existing tests pass unchanged

🫏 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 25, 2026 01:01

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds two baseline retrieval adapters (random lower bound + oracle upper bound) with CLI wiring and unit tests to support benchmarking/evaluation.

Changes:

  • Introduce RandomRetrievalAdapter and OracleRetrievalAdapter baseline adapters.
  • Wire both adapters into sme/cli.py adapter loader aliases.
  • Add test suites covering ingest/query behavior and graph snapshot behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

File                               Description
sme/adapters/random_retrieval.py   Implements random-K retrieval baseline adapter.
sme/adapters/oracle_retrieval.py   Implements oracle retrieval baseline adapter driven by gold questions.
sme/cli.py                         Adds CLI adapter-loading support for random* and oracle* names.
tests/test_random_retrieval.py     Adds unit tests for random retrieval adapter behaviors.
tests/test_oracle_retrieval.py     Adds unit tests for oracle retrieval adapter behaviors.


Comment on lines +63 to +73
def test_different_seeds_diverge() -> None:
    corpus = _make_corpus(50)
    a = RandomRetrievalAdapter(seed=1, n_results=10)
    b = RandomRetrievalAdapter(seed=2, n_results=10)
    a.ingest_corpus(corpus)
    b.ingest_corpus(corpus)
    names_a = [e.name for e in a.query("q").retrieved_entities]
    names_b = [e.name for e in b.query("q").retrieved_entities]
    assert names_a != names_b


Comment thread sme/cli.py Outdated
Comment on lines +163 to +201
if name in ("random", "random-retrieval", "random_retrieval"):
    from sme.adapters.random_retrieval import RandomRetrievalAdapter

    for k in (
        "include_node_tables",
        "include_edge_tables",
        "auto_discover",
        "kg_path",
        "collection_name",
        "default_query_mode",
        "db_path",
        "buffer_pool_size",
        "api_url",
        "api_key",
        "kind",
        "read_only",
    ):
        kwargs.pop(k, None)
    return RandomRetrievalAdapter(**kwargs)

if name in ("oracle", "oracle-retrieval", "oracle_retrieval"):
    from sme.adapters.oracle_retrieval import OracleRetrievalAdapter

    for k in (
        "include_node_tables",
        "include_edge_tables",
        "auto_discover",
        "kg_path",
        "collection_name",
        "default_query_mode",
        "db_path",
        "buffer_pool_size",
        "api_url",
        "api_key",
        "kind",
        "read_only",
    ):
        kwargs.pop(k, None)
    return OracleRetrievalAdapter(**kwargs)
Comment on lines +38 to +48
for i, item in enumerate(selected):
    source = item.get("source_file", item.get("id", f"random_{i}"))
    text = item.get("text", item.get("content", ""))
    context_parts.append(f"[{i+1}] {source}\n{text}")
    entities.append(
        Entity(
            id=f"random:{i}",
            name=str(source),
            entity_type="random_selection",
        )
    )
@M0nkeyFl0wer
Owner

Standard TREC bounds (random as floor, oracle as ceiling) is the right framing — every reading downstream becomes interpretable as "where in the [random, oracle] interval did this land," and the seeded random adapter is the reproducibility piece that makes it usable as a regression bound.

One real concern: the random:{i} synthetic entity IDs don't track to corpus items. That means random-baseline retrievals can't be diff'd against system-under-test retrievals at the entity-ID layer — any per-ID downstream analysis (Cat 4 collision audits, Cat 5 component overlap, etc.) silently breaks across adapter classes. Suggest using the corpus item's intrinsic ID instead, so the random baseline produces the same shape of entity_ids the other adapters return.
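
To make the concern concrete, here is a small sketch. The helper names are hypothetical and the corpus items are made up; only the `id` and `source_file` keys and the `random:{i}` format come from the code under review.

```python
from typing import Dict, List, Set


def positional_ids(selected: List[Dict[str, str]]) -> Set[str]:
    """IDs as in the reviewed snippet: they encode rank only, not the corpus item."""
    return {f"random:{i}" for i, _ in enumerate(selected)}


def intrinsic_ids(selected: List[Dict[str, str]]) -> Set[str]:
    """IDs taken from the corpus item itself, with a positional last resort."""
    return {
        str(item.get("id") or item.get("source_file") or f"item_{i}")
        for i, item in enumerate(selected)
    }


random_picks = [{"id": "doc_7"}, {"id": "doc_2"}]
system_picks = [{"id": "doc_2"}, {"id": "doc_9"}]
system_ids = intrinsic_ids(system_picks)

# Positional IDs never intersect the system's IDs, even though doc_2 is shared.
hidden_overlap = positional_ids(random_picks) & system_ids
# Intrinsic IDs expose the shared document, so per-ID diffs work.
visible_overlap = intrinsic_ids(random_picks) & system_ids
print(hidden_overlap, visible_overlap)
```

Whether a prefix such as `random:` is kept on top of the intrinsic ID is a separate choice; any consumer diffing across adapters would then need to strip or normalize it.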

The duplicated kwargs-stripping block Copilot flagged is real but low-impact; that should fall out of the registry refactor in #30 anyway once you rebase onto it.

Also: please add both RandomRetrievalAdapter and OracleRetrievalAdapter to the contract testkit parametrization (tests/test_adapter_contract.py) when you rebase onto #30 — keeps the "27 conformance tests" invariant true.

@jphein jphein force-pushed the feat/random-oracle-baselines branch 2 times, most recently from 6792fd5 to d9d2beb, May 27, 2026 14:19
Two reference adapters that establish the standard lower/upper bounds
for retrieval evaluation per TREC methodology:

- RandomRetrievalAdapter: returns K uniformly random items from the
  ingested corpus, seeded for reproducibility. Any system that can't
  beat this isn't doing retrieval.
- OracleRetrievalAdapter: returns the gold expected_sources verbatim
  in context_string. Substring-scorer ceiling — no system can do
  better than perfect retrieval.

Both wired into sme/cli.py _load_adapter() under the names
random/random-retrieval/random_retrieval and oracle/oracle-retrieval/
oracle_retrieval. Adapter-specific kwargs from other adapters are
silently dropped for CLI parity.

Per the issue brief, baselines are intentionally NOT added to the
adapter harness manifest contract — they're reference bounds, not
adapters under test.

Closes M0nkeyFl0wer#23

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jphein jphein force-pushed the feat/random-oracle-baselines branch from d9d2beb to 23753d9, May 27, 2026 14:22
@jphein
Contributor Author

jphein commented May 27, 2026

Rebased onto upstream/main (2337e7d) at 23753d9 — landed all three asks:

  1. Corpus-tracked IDs: RandomRetrievalAdapter now emits Entity.id = f"random:{item.get('id') or item.get('source_file') or f'item_{i}'}" so that per-ID downstream analysis (Cat 4 collision audits, Cat 5 component overlap) holds across adapter classes. OracleRetrievalAdapter emits f"oracle:{source}" directly from the gold expected_sources.

  2. Contract testkit parametrization: _random_retrieval_factory and _oracle_retrieval_factory added to tests/test_adapter_contract.py::ADAPTER_FACTORIES. The full conformance suite is now parametrized over {mock, flat_baseline, full_context, random_retrieval, oracle_retrieval} and passes locally (60 tests).

  3. Duplicated kwargs-stripping: gone. Both adapters are now registered via _AdapterSpec entries in _ADAPTER_REGISTRY, following the allowlist pattern from #20 ("_load_adapter: invert drop-list pattern to allowlist registry (root cause of the PR #7 cat5 regression)"), so the registry's accepts=frozenset(...) filter replaces the hand-written drop list. CLI aliases preserved: random/random-retrieval/random_retrieval and oracle/oracle-retrieval/oracle_retrieval.
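
For readers who have not seen the allowlist pattern, a rough sketch of the idea. This is not the code in sme/cli.py: `AdapterSpec`, its fields, `RandomStub`, and `load_adapter` are stand-ins assumed for illustration; only the alias names and the `accepts=frozenset(...)` idea come from this thread.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet


@dataclass(frozen=True)
class AdapterSpec:
    """Hypothetical stand-in for _AdapterSpec; the real fields may differ."""
    factory: Callable[..., Any]
    accepts: FrozenSet[str]


class RandomStub:
    """Placeholder adapter so the sketch runs without the sme package."""

    def __init__(self, seed: int = 0, n_results: int = 5) -> None:
        self.seed = seed
        self.n_results = n_results


_SPEC = AdapterSpec(factory=RandomStub, accepts=frozenset({"seed", "n_results"}))
REGISTRY: Dict[str, AdapterSpec] = {
    alias: _SPEC for alias in ("random", "random-retrieval", "random_retrieval")
}


def load_adapter(name: str, **kwargs: Any) -> Any:
    spec = REGISTRY[name]
    # Allowlist: keep only what the adapter declares, drop everything else.
    allowed = {k: v for k, v in kwargs.items() if k in spec.accepts}
    return spec.factory(**allowed)


adapter = load_adapter("random-retrieval", seed=42, db_path="/tmp/x", api_key="unused")
print(adapter.seed, adapter.n_results)
```

The drop list had to name every foreign kwarg that might arrive; the allowlist only has to name what each adapter accepts, so adding a new kwarg elsewhere cannot break an existing adapter.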

Diff is one clean commit on upstream/main: 6 files, 357 insertions. CI green across 3.10/3.11/3.12.

🫏

@M0nkeyFl0wer
Owner

The wave of fixes that landed on #31/#33/#34/#35/#36/#37 this morning was lovely to walk through — small focused commits, named constants, deterministic outputs, tests added, spec doc updated where the methodology actually changed. Not pushing on this PR to match that cadence; you've earned the right to sequence as makes sense.

One coordination question so this doesn't quietly decay: the random-ID concern (the synthetic random:{i} IDs don't share an entity-ID space with the corpus, so any per-ID downstream analysis silently breaks across adapter classes) — are you parking that behind the #30 rebase and handling it in the same touch, or is it cleaner to address it directly here? Either is fine; I just want to make sure neither of us assumes the other is holding it.

The TREC-bounds framing is the right shape regardless, so this isn't gated by the ID question — happy to land it either way once we know which.


Development

Successfully merging this pull request may close these issues.

Random + oracle baselines as standard bounds (TREC methodology)

3 participants