Random + oracle retrieval baselines (TREC bounds) by jphein · Pull Request #32 · M0nkeyFl0wer/multipass-structural-memory-eval

jphein · 2026-05-25T01:01:32Z

Summary

Closes #23.

Adds two standard-methodology baseline adapters that bound system-under-test performance:

RandomRetrievalAdapter — uniform random K items from corpus, seeded for reproducibility. TREC-standard lower bound.
OracleRetrievalAdapter — returns gold expected_sources verbatim. TREC-standard upper bound (ceiling).
Both are wired into the CLI's _load_adapter() under the adapter names random and oracle.

These enable the TREC-standard normalized metric: (system − random) / (oracle − random), which is a prerequisite for meaningful cross-system comparisons.
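As a minimal sketch of the normalization above (the function name and the sample scores are illustrative assumptions, not part of this PR), the rescaling onto the [random, oracle] interval could look like this:

```python
def normalized_score(system: float, random_floor: float, oracle_ceiling: float) -> float:
    """Rescale a raw score so that random maps to 0.0 and oracle maps to 1.0."""
    span = oracle_ceiling - random_floor
    if span <= 0:
        # Without a positive gap between the bounds the ratio is meaningless.
        raise ValueError("oracle ceiling must exceed the random floor")
    return (system - random_floor) / span


# Hypothetical raw scores, chosen only to illustrate the arithmetic.
score = normalized_score(system=0.62, random_floor=0.20, oracle_ceiling=0.90)
print(round(score, 3))
```

A value near 0 means the system is indistinguishable from random selection, a value near 1 means it approaches the substring-scorer ceiling, and a negative value means it scored below the random floor.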

Design decisions:

Random baseline uses uniform-random (not stratified by node-type) per the issue discussion — purity over structural awareness
Oracle reads expected_sources from questions YAML — matches the current substring-based scoring
BM25 calibrated-lower deferred to follow-up PR per issue scope
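The uniform-random choice can be sketched as follows. This is not the adapter's actual code: `sample_uniform` is a hypothetical helper, and it assumes a corpus held as a list of dicts and a private seeded RNG, which is one common way to get reproducible draws.

```python
import random


def sample_uniform(corpus: list[dict], k: int, seed: int) -> list[dict]:
    """Pick up to k items uniformly at random, ignoring node type entirely."""
    rng = random.Random(seed)   # private RNG, so global random state is untouched
    k = min(k, len(corpus))     # a corpus smaller than k yields every item
    return rng.sample(corpus, k)


corpus = [{"id": f"doc_{i}"} for i in range(50)]
first = sample_uniform(corpus, k=10, seed=1)
second = sample_uniform(corpus, k=10, seed=1)
print(first == second)  # same seed, same draw
```

Because no stratification is applied, every item has the same chance of selection regardless of its type, which is what keeps the floor free of structural knowledge.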

Test plan

test_random_retrieval.py — seed determinism, corpus-size edge cases, entity type
test_oracle_retrieval.py — gold matching, no-gold fallback, entity construction
Existing tests pass unchanged

🫏 Generated with Claude Code

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds two baseline retrieval adapters (random lower bound + oracle upper bound) with CLI wiring and unit tests to support benchmarking/evaluation.

Changes:

Introduce RandomRetrievalAdapter and OracleRetrievalAdapter baseline adapters.
Wire both adapters into sme/cli.py adapter loader aliases.
Add test suites covering ingest/query behavior and graph snapshot behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

File	Description
`sme/adapters/random_retrieval.py`	Implements random-K retrieval baseline adapter.
`sme/adapters/oracle_retrieval.py`	Implements oracle retrieval baseline adapter driven by gold questions.
`sme/cli.py`	Adds CLI adapter-loading support for `random` and `oracle` names.
`tests/test_random_retrieval.py`	Adds unit tests for random retrieval adapter behaviors.
`tests/test_oracle_retrieval.py`	Adds unit tests for oracle retrieval adapter behaviors.

+def test_different_seeds_diverge() -> None:
+    corpus = _make_corpus(50)
+    a = RandomRetrievalAdapter(seed=1, n_results=10)
+    b = RandomRetrievalAdapter(seed=2, n_results=10)
+    a.ingest_corpus(corpus)
+    b.ingest_corpus(corpus)
+    names_a = [e.name for e in a.query("q").retrieved_entities]
+    names_b = [e.name for e in b.query("q").retrieved_entities]
+    assert names_a != names_b
+
+


+    if name in ("random", "random-retrieval", "random_retrieval"):
+        from sme.adapters.random_retrieval import RandomRetrievalAdapter
+
+        for k in (
+            "include_node_tables",
+            "include_edge_tables",
+            "auto_discover",
+            "kg_path",
+            "collection_name",
+            "default_query_mode",
+            "db_path",
+            "buffer_pool_size",
+            "api_url",
+            "api_key",
+            "kind",
+            "read_only",
+        ):
+            kwargs.pop(k, None)
+        return RandomRetrievalAdapter(**kwargs)
+
+    if name in ("oracle", "oracle-retrieval", "oracle_retrieval"):
+        from sme.adapters.oracle_retrieval import OracleRetrievalAdapter
+
+        for k in (
+            "include_node_tables",
+            "include_edge_tables",
+            "auto_discover",
+            "kg_path",
+            "collection_name",
+            "default_query_mode",
+            "db_path",
+            "buffer_pool_size",
+            "api_url",
+            "api_key",
+            "kind",
+            "read_only",
+        ):
+            kwargs.pop(k, None)
+        return OracleRetrievalAdapter(**kwargs)


+        for i, item in enumerate(selected):
+            source = item.get("source_file", item.get("id", f"random_{i}"))
+            text = item.get("text", item.get("content", ""))
+            context_parts.append(f"[{i+1}] {source}\n{text}")
+            entities.append(
+                Entity(
+                    id=f"random:{i}",
+                    name=str(source),
+                    entity_type="random_selection",
+                )
+            )


M0nkeyFl0wer · 2026-05-25T23:33:46Z

Standard TREC bounds (random as floor, oracle as ceiling) is the right framing — every reading downstream becomes interpretable as "where in the [random, oracle] interval did this land," and the seeded random adapter is the reproducibility piece that makes it usable as a regression bound.

One real concern: the random:{i} synthetic entity IDs don't track to corpus items. That means random-baseline retrievals can't be diff'd against system-under-test retrievals at the entity-ID layer — any per-ID downstream analysis (Cat 4 collision audits, Cat 5 component overlap, etc.) silently breaks across adapter classes. Suggest using the corpus item's intrinsic ID instead, so the random baseline produces the same shape of entity_ids the other adapters return.
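To make the concern concrete, here is a small sketch with a made-up three-item corpus and a made-up set of system-under-test IDs. It compares positional IDs of the form random:{i} with the items' intrinsic IDs when intersecting against another adapter's results:

```python
# Hypothetical corpus and retrieval results, for illustration only.
corpus = [{"id": "doc_a"}, {"id": "doc_b"}, {"id": "doc_c"}]
selected = [corpus[2], corpus[0]]      # what a random draw might return
system_ids = {"doc_a", "doc_b"}        # IDs returned by some other adapter

# Positional IDs encode only the rank in this draw, not the item itself.
positional_ids = {f"random:{i}" for i, _ in enumerate(selected)}
# Intrinsic IDs live in the same ID space the other adapters use.
intrinsic_ids = {item["id"] for item in selected}

overlap_positional = positional_ids & system_ids
overlap_intrinsic = intrinsic_ids & system_ids
print(overlap_positional)  # set()
print(overlap_intrinsic)   # {'doc_a'}
```

With positional IDs the intersection is always empty, so an overlap analysis would report zero agreement even when both adapters returned the same item.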

The duplicated kwargs-stripping block Copilot flagged is real but low-impact; that should fall out of the registry refactor in #30 anyway once you rebase onto it.

Also: please add both RandomRetrievalAdapter and OracleRetrievalAdapter to the contract testkit parametrization (tests/test_adapter_contract.py) when you rebase onto #30 — keeps the "27 conformance tests" invariant true.

Two reference adapters that establish the standard lower/upper bounds for retrieval evaluation per TREC methodology:

- RandomRetrievalAdapter: returns K uniformly random items from the ingested corpus, seeded for reproducibility. Any system that can't beat this isn't doing retrieval.
- OracleRetrievalAdapter: returns the gold expected_sources verbatim in context_string. Substring-scorer ceiling: no system can do better than perfect retrieval.

Both wired into sme/cli.py _load_adapter() under the names random/random-retrieval/random_retrieval and oracle/oracle-retrieval/oracle_retrieval. Adapter-specific kwargs from other adapters are silently dropped for CLI parity.

Per the issue brief, baselines are intentionally NOT added to the adapter harness manifest contract; they're reference bounds, not adapters under test.

Closes M0nkeyFl0wer#23

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jphein · 2026-05-27T14:37:48Z

Rebased onto upstream/main (2337e7d) at 23753d9 — landed all three asks:

Corpus-tracked IDs — RandomRetrievalAdapter now emits Entity.id = f"random:{item.get('id') or item.get('source_file') or f'item_{i}'}" so per-ID downstream analysis (Cat 4 collision audits, Cat 5 component overlap) holds across adapter classes. OracleRetrievalAdapter emits f"oracle:{source}" from the gold expected_sources directly.
Contract testkit parametrization — _random_retrieval_factory and _oracle_retrieval_factory added to tests/test_adapter_contract.py::ADAPTER_FACTORIES. The full conformance suite is now parametrized over {mock, flat_baseline, full_context, random_retrieval, oracle_retrieval} and passes locally (60 tests).
Duplicated kwargs-stripping is gone. Both adapters are now registered via _AdapterSpec entries in _ADAPTER_REGISTRY, following the allowlist pattern from #20 ("_load_adapter: invert drop-list pattern to allowlist registry (root cause of the PR #7 cat5 regression)"), so the registry's accepts=frozenset(...) filter replaces the hand-written drop list. CLI aliases preserved: random/random-retrieval/random_retrieval and oracle/oracle-retrieval/oracle_retrieval.
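For readers unfamiliar with the allowlist approach, the idea can be sketched roughly as below. Only the accepts=frozenset(...) idea comes from the comment above; `AdapterSpec`, `REGISTRY`, `load_adapter`, the `factory` field, and `StubAdapter` are hypothetical stand-ins, and the real _AdapterSpec in sme/cli.py may be shaped differently.

```python
from dataclasses import dataclass
from typing import Any, Callable


class StubAdapter:
    """Hypothetical stand-in for a real adapter class."""

    def __init__(self, seed: int = 0, n_results: int = 5) -> None:
        self.seed = seed
        self.n_results = n_results


@dataclass(frozen=True)
class AdapterSpec:
    factory: Callable[..., Any]   # assumed field: how to build the adapter
    accepts: frozenset[str]       # allowlist of kwargs the adapter understands


REGISTRY = {
    "random": AdapterSpec(factory=StubAdapter, accepts=frozenset({"seed", "n_results"})),
}


def load_adapter(name: str, **kwargs: Any) -> Any:
    spec = REGISTRY[name]
    # Keep only allowlisted kwargs; anything else is dropped without a per-adapter list.
    allowed = {k: v for k, v in kwargs.items() if k in spec.accepts}
    return spec.factory(**allowed)


adapter = load_adapter("random", seed=7, db_path="/tmp/x", api_key="unused")
print(adapter.seed, adapter.n_results)  # 7 5
```

The design point is that a new CLI option added for some other adapter no longer requires touching every drop list; an adapter only ever sees the keys it declared.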

Diff is one clean commit on upstream/main: 6 files, 357 insertions. CI green across 3.10/3.11/3.12.

🫏

M0nkeyFl0wer · 2026-05-27T15:49:45Z

The wave of fixes that landed on #31/#33/#34/#35/#36/#37 this morning was lovely to walk through — small focused commits, named constants, deterministic outputs, tests added, spec doc updated where the methodology actually changed. Not pushing on this PR to match that cadence; you've earned the right to sequence as makes sense.

One coordination question so this doesn't quietly decay: the random-ID concern (the synthetic random:{i} IDs don't share an entity-ID space with the corpus, so any per-ID downstream analysis silently breaks across adapter classes) — are you parking that behind the #30 rebase and handling it in the same touch, or is it cleaner to address it directly here? Either is fine; I just want to make sure neither of us assumes the other is holding it.

The TREC-bounds framing is the right shape regardless, so this isn't gated by the ID question — happy to land it either way once we know which.

Copilot AI review requested due to automatic review settings May 25, 2026 01:01

Copilot AI reviewed May 25, 2026

M0nkeyFl0wer mentioned this pull request May 25, 2026

test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) #30

Merged

3 tasks

M0nkeyFl0wer mentioned this pull request May 26, 2026

Tier-1 standards integration: ship the five audited items #41

Open

jphein force-pushed the feat/random-oracle-baselines branch 2 times, most recently from 6792fd5 to d9d2beb Compare May 27, 2026 14:19

jphein force-pushed the feat/random-oracle-baselines branch from d9d2beb to 23753d9 Compare May 27, 2026 14:22

jphein wants to merge 1 commit into M0nkeyFl0wer:main from techempower-org:feat/random-oracle-baselines