test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) by jphein · Pull Request #30 · M0nkeyFl0wer/multipass-structural-memory-eval

jphein · 2026-05-24T21:49:52Z

Summary

Addresses four open code-quality issues from this repo:

Adapter contract testkit — extract shared SMEAdapter conformance suite #8 — Adapter contract testkit: 27 parametrized conformance tests verifying all adapters satisfy the SMEAdapter ABC (query() returns QueryResult, get_graph_snapshot() is internally consistent, etc.)
Test coverage gaps for measurement-critical modules (multi_hop, ontology_coherence, topology, adapters) #18 — Test coverage for multi_hop, ontology_coherence, _graph_mapping: 101 new tests covering hop-bucket grouping, A/B/C deltas, Cat 8a–8e sub-tests, and project_graph() edge cases
get_harness_manifest() referenced from cmd_cat9 but not in SMEAdapter ABC; only 2/8 adapters implement it #19 — Document get_harness_manifest() as ABC contract: docstring + spec update + 3 regression tests catching the silent-AttributeError class
_load_adapter: invert drop-list pattern to allowlist registry (root cause of the PR #7 cat5 regression) #20 — Invert _load_adapter() drop-list to allowlist registry: frozen _AdapterSpec dataclass with accepts: frozenset[str] and rename: dict[str, str]; unknown kwargs silently dropped, preventing the PR feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines #7 drop-list-drift regression class. 15 new tests.

146 new tests total, all passing.

Test plan

python -m pytest tests/test_adapter_contract.py tests/test_adapter_harness_manifest_contract.py tests/test_cli_adapter_forwarding.py tests/test_graph_mapping.py tests/test_multi_hop.py tests/test_ontology_coherence.py -v
Verify no regressions in existing tests: python -m pytest tests/ -x -q
Spot-check that sme-eval --adapter flat --db /tmp/test still resolves through the new registry

🫏 Generated with Claude Code

…yFl0wer#19) The ABC default returning [] was added in 3754e8a as part of Cat 9 scaffolding, but the upstream issue tracks two follow-throughs that were still missing: - Module docstring + class docstring in `base.py` claimed "Two optional" methods; promote to "Three optional" so readers see `get_harness_manifest` as part of the contract. - Spec § Adapter Interface (sme_spec_v8.md) listed only `get_flat_retrieval` and `get_ontology_source` as optional. Add `get_harness_manifest` alongside them with its default-list rationale. Adds tests/test_adapter_harness_manifest_contract.py: every shipped adapter must resolve `get_harness_manifest` (either inherited or overridden) and the ABC default must return []. Class-level checks avoid instantiating adapters with heavy constructors. Closes the silent-AttributeError class entirely for any future adapter that forgets to override.

…Fl0wer#8) Parametric test module verifying adapters conform to the SMEAdapter ABC: query returns QueryResult, graph snapshot is internally consistent, ingest accepts corpus shape. Covers FlatBaselineAdapter and a MockAdapter; other adapters opt in by registering a factory.

…er#20) Replace 8-branch if-chain with _AdapterSpec registry. Each adapter declares its accepted kwargs; unknown kwargs are silently dropped. Makes the PR #7 class of drop-list-drift regression structurally impossible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…keyFl0wer#18) Three new test files for measurement-critical modules that had zero direct tests. Covers hop-bucket grouping, Cat 8 sub-tests, and the shared graph projection helper. Prevents silent regressions in published category readings.

Remove HindsightAdapter, Mem0Adapter, OmegaAdapter, RlmAdapter imports since those adapter files don't exist in upstream repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

M0nkeyFl0wer · 2026-05-25T23:33:27Z

Tested locally in a worktree: ruff clean, 338 passed / 15 skipped, the 146-new-test claim verifies exactly.

The _AdapterSpec dataclass with accepts: frozenset[str] is the structural fix #20 was asking for — drop-list drift is now impossible by construction, which is the right shape (vs. the maintenance-burden shape of "be more careful with future drop lists"). The 27-test contract testkit is the abstraction the "same shape three times" pattern wanted, and the get_harness_manifest() ABC + regression tests close out the silent-AttributeError class from #19.

One forward-looking note (not blocking): when new adapters land in subsequent PRs (#32 adds two — random + oracle), the contract testkit needs them parametrized in or the "27 conformance tests" claim quietly stops being true. Worth a one-line note in CONTRIBUTING about that. I'll add it.

Merging this first. Everything else in the stack will benefit from rebasing onto post-#30 main.

M0nkeyFl0wer · 2026-05-26T01:48:32Z

One coordination note as I work through the stack — five PRs in this wave (#31, #37, #38, #39, #40) touch `sme/categories/ingestion_integrity.py` or `sme/categories/gap_detection.py`. If they're all branched from pre-#30 main, serial merge will hit conflicts from the third PR onward. Would you be up for rebasing the stack onto post-#30 main in one pass once this lands, so the rest land cleanly? Happy to do it from this end if easier — your call.

Also flagging: when #32 lands (RandomRetrievalAdapter + OracleRetrievalAdapter), the contract testkit parametrize list needs both adapters added or the "27 conformance tests" invariant silently degrades. Mentioned it on #32 too — flagging here since this is the foundation that enforces the invariant.

M0nkeyFl0wer · 2026-05-27T15:49:56Z

Following up on the coordination point I raised on Monday — now that the #31/#33/#34/#35/#36/#37 fixes are all green and ready, the file-overlap question is back on the front burner. Five of the seven open PRs touch sme/categories/ingestion_integrity.py or sme/categories/gap_detection.py, so serial merge will hit conflicts from the second PR onward.

I'm not precious about the order; mostly I want to confirm a sequence so we don't end up doing rebase coordination cold-cache next week. Tentative proposal, push back if you'd prefer different:

Paired bootstrap CIs and Benjamini-Hochberg FDR correction #36 (paired bootstrap + BH-FDR) — new sme/stats/ module, no file overlap with anything else, foundation for downstream readings. Land first.
Cost-per-correct headline metric with versioned pricing (HELM methodology) #34 (cost-per-correct) — new sme/eval/pricing.py, no overlap. Land second.
Cat 7.b: query latency distribution (YCSB percentiles) #33 (Cat 7.b latency) — new sme/categories/latency.py, no overlap. Land third.
LongMemEval judge: K-replicate variance with majority voting #35 (K-replicate judge) — touches sme/eval/longmemeval_judge.py only, no overlap with this group. Land fourth.
Cat 4: surface per-ID edge degree in collision report #31 (Cat 4 collision degrees) — first ingestion_integrity.py touch in this wave. Land fifth.
Cat 4c: edge count alongside component count, sparse annotation #37 (Cat 4c edge counts) — second ingestion_integrity.py touch. Rebases onto Cat 4: surface per-ID edge degree in collision report #31.
feat(cat5): break out isolates by entity_type #38 → feat(cat5): representative cycles alongside Betti-1 #39 → feat(cat5): component-size distribution and non-trivial summaries #40 (Cat 5 gap_detection improvements) — all touch gap_detection.py, each rebases onto the previous.

That keeps the rebase work bounded to the actual file overlaps (#37 onto #31, and the #38/#39/#40 stack). Open to a different order if you have a preference — happy to be wrong about which to land first.

Separately, #32 (random+oracle) I'm leaving as its own conversation — see the comment I just left there.

jphein and others added 5 commits May 24, 2026 14:45

fix: trim fork-only adapters from harness manifest test for upstream

07c4ad9

Remove HindsightAdapter, Mem0Adapter, OmegaAdapter, RlmAdapter imports since those adapter files don't exist in upstream repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 24, 2026 21:49

This was referenced May 24, 2026

fix: allowlist adapter registry + test coverage (#18, #19, #20) #29

Closed

feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines #7

Closed

Copilot AI reviewed May 24, 2026

View reviewed changes

M0nkeyFl0wer mentioned this pull request May 25, 2026

Random + oracle retrieval baselines (TREC bounds) #32

Open

3 tasks

M0nkeyFl0wer merged commit e5ca6e3 into M0nkeyFl0wer:main May 26, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20)#30

test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20)#30
M0nkeyFl0wer merged 5 commits into
M0nkeyFl0wer:mainfrom
techempower-org:upstream/code-quality-issues

jphein commented May 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

M0nkeyFl0wer commented May 25, 2026

Uh oh!

M0nkeyFl0wer commented May 26, 2026

Uh oh!

Uh oh!

M0nkeyFl0wer commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jphein commented May 24, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

M0nkeyFl0wer commented May 25, 2026

Uh oh!

M0nkeyFl0wer commented May 26, 2026

Uh oh!

Uh oh!

M0nkeyFl0wer commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants