feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines by jphein · Pull Request #7 · M0nkeyFl0wer/multipass-structural-memory-eval

jphein · 2026-04-30T17:06:53Z

Summary

RlmAdapter treats RLM (a fork of alexzhang13/rlm) as the read-side orchestrator rather than a standalone retrieval pipeline. The LLM decides when and how to query the palace-daemon — SME evaluates what comes back.

Stacked on #30 — please merge that PR first (allowlist adapter registry + test coverage). Once #30 lands, this PR's diff collapses to just the RlmAdapter-specific commits.

Changes

sme/adapters/rlm_adapter.py — Full adapter: query via RLM tool-calling loop, get_graph_snapshot() via palace-daemon /graph, ingest_corpus() no-op (palace is pre-populated), source_file preservation + invocation_mode for Cat 9a Step 2 measurement
tests/test_rlm_adapter.py — 16 tests with pytest.importorskip("rlm") guard (no CI dependency on the rlm package)
pyproject.toml — [rlm] optional extra pointing at jphein/rlm fork for baseline reproducibility
sme/cli.py — rlm entry in _AdapterSpec registry + cmd_cat5 fix to use _load_adapter_from_args() (resolves @M0nkeyFl0wer's May 3 review blocker)
sme/categories/gap_detection.py — Fix variable shadowing (considered → seeded_applicable)
tests/test_gap_detection.py — 37 tests for gap detection scoring
tests/test_cli_adapter_forwarding.py — rlm registry forwarding test

Review blockers resolved

pytest tests/test_rlm_adapter.py fails on fresh clones → importorskip + [rlm] extra
cat5 --adapter mempalace-daemon --api-url ... silently drops API args → _load_adapter_from_args()
Per-call vs per-result counter conflation → separate _tool_calls counter

Copilot

Pull request overview

Adds an RlmAdapter that uses RLM as the read-side orchestrator (with mempalace_search exposed as a tool), plus CLI plumbing/tests/docs and new jp-realm-v0.1 artifacts to support baseline evaluation runs.

Changes:

Introduce sme/adapters/rlm_adapter.py with tool-call capture into context_string / retrieved_entities.
Extend CLI adapter loading + add flags (--api-key, --kind, --mock/--no-mock, --familiar-timeout) and add/expand adapter test coverage (unit + gated live smoke).
Add jp-realm-v0.1 corpus YAML plus baseline JSON run artifacts and supporting docs/README updates.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`sme/adapters/rlm_adapter.py`	New RLM-orchestrated adapter with `mempalace_search` tool registration and capture for scoring.
`sme/cli.py`	Wire new adapters and CLI flags through adapter construction paths (retrieve + cat4/cat5/check/cat9).
`tests/test_rlm_adapter.py`	Unit tests for RlmAdapter tool-call aggregation/reset and error handling.
`sme/adapters/mempalace_daemon.py`	HTTP adapter for palace-daemon (/search + /graph with MCP fallback).
`tests/test_mempalace_daemon_integration.py`	Gated live smoke tests against a running palace-daemon.
`sme/adapters/familiar.py`	Familiar HTTP adapter (eval + graph) with warning/error translation and optional helpers.
`tests/test_familiar_adapter.py`	Unit tests for FamiliarAdapter request/response mapping and error contracts.
`tests/test_familiar_live.py`	Gated live smoke tests against a running familiar instance.
`sme/adapters/_graph_mapping.py`	Shared mapping of daemon `/graph` payload to `Entity`/`Edge`.
`tests/conftest.py`	Adds `fake_urlopen_factory` for deterministic HTTP mocking in adapter tests.
`tests/fixtures/tiny_questions.yaml`	Small fixture question set for smoke-style runs.
`sme/corpora/jp_realm_v0_1/questions.yaml`	New jp-realm-v0.1 diagnostic corpus (30 questions).
`docs/superpowers/specs/2026-04-26-familiar-adapter-design.md`	Design spec for FamiliarAdapter + CLI/test approach.
`docs/superpowers/specs/2026-04-25-mempalace-daemon-adapter-design.md`	Design spec for MemPalaceDaemonAdapter + CLI/test approach.
`docs/ideas.md`	Records lessons learned + forward-compat notes + RlmAdapter scaffold notes.
`README.md`	Documents fork roadmap and shipped adapters + invocation examples.
`baselines/jp_realm_v0_1_familiar_20260426.json`	Baseline run artifact for jp-realm-v0.1 (familiar).
`baselines/jp_realm_v0_1_familiar_20260426_v2.json`	Baseline run artifact variant (familiar).
`baselines/jp_realm_v0_1_familiar_20260426_v3.json`	Baseline run artifact variant (familiar).
`baselines/jp_realm_v0_1_familiar_20260426_v4_punct_fix.json`	Baseline run artifact variant (familiar).
`baselines/jp_realm_v0_1_familiar_v0.3_candidate_window.json`	Baseline run artifact variant (familiar v0.3).
`baselines/jp_realm_v0_1_familiar_v0.3_reflect.json`	Baseline run artifact variant (familiar v0.3).
`baselines/jp_realm_v0_1_familiar_v0.3_task1_bm25.json`	Baseline run artifact variant (familiar v0.3).
`baselines/jp_realm_v0_1_familiar_v0.3.9_reflect_cumulative.json`	Baseline run artifact variant (familiar v0.3.9).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jphein · 2026-04-30T21:02:26Z

+        params = {"q": query, "limit": str(limit), "kind": self.kind}
+        url = f"{self.api_url}/search?" + "&".join(f"{k}={_urlrequest.quote(v)}" for k, v in params.items())
+        req = _urlrequest.Request(url)


Verified false positive on Python 3.10+: urllib.request silently re-exports quote from urllib.parse (hasattr(urllib.request, 'quote') returns True; quote.__module__ is 'urllib.parse'). The code works in production — the baselines/jp_realm_v0_1_rlm_*.json artifacts in this PR are evidence. That said, relying on an undocumented re-export is fragile across Python versions, so a cosmetic refactor to urllib.parse.quote is reasonable. Left for a follow-up PR to keep this one focused on substantive fixes.

jphein · 2026-04-30T21:02:30Z

+        try:
+            with _urlrequest.urlopen(req, timeout=self.timeout_s) as resp:
+                payload = json.loads(resp.read().decode("utf-8"))
+        except (_urlerror.URLError, _urlerror.HTTPError, OSError) as e:


Deferred — could be a feature decision rather than a bug. Returning [{error: ..., results: []}] to the LM (so it sees the failure in its tool-result payload and can route around it) is the current path; appending an error entry to _capture would put the failure into context_string for SME scoring, which has tradeoffs around interpretability for the LM on the next turn. Will revisit once there's signal on whether SME wants failed-tool-call invocations counted in scoring. Note: binary invocation rate (the headline 9a quantity) is unaffected — _tool_call_count is incremented at the top of _mempalace_search before the try/except in bcb4799, so failed calls are still counted toward 9a even though they don't add drawers to _capture.

+            # Strings, not dicts — cli's `'; '.join(path)` expects strings.
+            retrieval_path=[
+                f"rlm_completion ({elapsed_ms}ms, {len(self._capture)} tool calls)",
+            ],


+        # Drop kwargs RLM doesn't understand.
+        for k in (
+            "include_node_tables",
+            "include_edge_tables",
+            "auto_discover",
+            "kg_path",
+            "collection_name",
+            "default_query_mode",
+            "db_path",
+            "buffer_pool_size",
+            "kind",
+            "read_only",
+        ):
+            kwargs.pop(k, None)
+        return RlmAdapter(**kwargs)


Addresses Copilot review on PR #7: - Track _tool_call_count separately from _capture (which grows once per drawer returned). Previously, retrieval_path reported `len(_capture)` as the tool-call count — a single _mempalace_search call returning 5 drawers showed up as "5 tool calls", skewing any Cat 9a-shaped invocation-rate read of the per-question histogram. - retrieval_path now reports both quantities explicitly: `rlm_completion (Xms, N tool calls, M drawers)`. Binary invocation-rate readings (zero vs non-zero) are unaffected — those were always correct because zero captures means no successful call. Per-question call-count histograms from the baselines/jp_realm_v0_1_rlm_*.json files predate this fix and reflect drawer counts; only the binary read on those JSONs is reliable. - _load_adapter() no longer drops `kind` when constructing RlmAdapter. RlmAdapter.__init__ explicitly accepts `kind` to forward into mempalace_search /search calls; the CLI's --kind flag was previously a silent no-op for the rlm adapter. Deferred (commented in PR review): appending an error entry to _capture when the tool call raises (currently the failure is visible via the returned [{error: ..., results: []}] but doesn't flow into context_string or per-query telemetry). Could be a feature decision rather than a bug — error-as-data has its own tradeoffs around context-string interpretability for the LM.

M0nkeyFl0wer

Caught a couple of things on local checkout that block merge — both small fixes, all worth landing as a single follow-up commit.

Failures on a fresh clone

1. pytest tests/test_rlm_adapter.py — 5 tests fail with ModuleNotFoundError: No module named 'rlm'.

The tests use unittest.mock.patch.object(rlm, ...) but rlm isn't declared in pyproject.toml. On your fork it presumably resolves via a venv that has it installed; on a fresh clone of this repo it doesn't.

Fix: add rlm to [project.optional-dependencies] (probably an rlm extra, alongside the existing optional groups), and either guard the test module with pytest.importorskip("rlm") at the top or document the install command in the module docstring. The pattern in tests/test_familiar_live.py (gated live smoke) is a reasonable template.

2. ruff check sme/adapters/rlm_adapter.py — 3 E701 errors.

sme/adapters/rlm_adapter.py:257:34: E701 Multiple statements on one line (colon)
sme/adapters/rlm_adapter.py:258:29: E701 Multiple statements on one line (colon)
sme/adapters/rlm_adapter.py:259:29: E701 Multiple statements on one line (colon)

The if r.get("drawer_id"): tags.append(...) pattern at 257-259 needs to be split into proper if-blocks (or rewritten with a list comp). 5 minutes.

Verified what I could

The bcb4799 / 927493e tool-call counter fix I commented on at issue #3 reads cleanly — _capture per drawer vs _tool_call_count per invocation are now properly separated, and the retrieval_path emit format (N tool calls, M drawers) is the right surface.
The 162 LOC adapter itself is well-scoped; design lines up with the issue #3 conversation.
The 2 RLM baseline JSONs (Qwen-7B, Llama-70B) carry the fixed-format readings — those are good.

Heads-up on a structural thing not blocking this PR

Across PR #5, #6, and #7 we've now landed about 1,085 lines of adapter unit tests that all share the same shape — HTTP-mock fixtures plus deserialization assertions. None of them actually run an SME category against the adapter; they verify wrapper plumbing, not adapter conformance to the SMEAdapter ABC. I'm filing a separate issue to introduce a shared adapter-contract testkit so the next adapter PR can register itself in pytest.parametrize instead of writing 200+ lines of bespoke HTTP mocks.

Not a blocker for this PR — keep your existing tests as the implementation-level reference. The testkit will sit alongside, not replace.

Once the two fixes above land, this is good to merge.

Thanks again for the stack — the RLM adapter is the meaningful new measurement primitive in here (orchestrator-mediated retrieval, the actual Cat 9 surface), and the bcb4799 fix is the kind of self-correction discipline I want to keep encouraging.

Triggered by jphein's PR #7 shipping with 5 ModuleNotFoundError test failures and 3 ruff E701 errors that no automated check caught — the contributor's fork has the deps, the public repo doesn't, and there was no CI to enforce the difference. Workflow runs on push/PR to main, installs the package with [dev] extras, runs ruff check (project config is line-length=100, target=py310) and pytest. Matrix matches pyproject.toml's declared support: 3.10, 3.11, 3.12. Note: this workflow is committed but has not yet executed against GitHub's runners. First execution will be the runtime validation. If anything fails, the fix lands in a follow-up commit before more PRs come through.

…ruff E701 Addresses Ben's CHANGES_REQUESTED on M0nkeyFl0wer/multipass-structural- memory-eval#7. Two blocking items, fixed in one commit per his ask. 1. tests/test_rlm_adapter.py failed with ModuleNotFoundError: No module named 'rlm' on a fresh clone. Now: - pyproject.toml gains an opt-in `rlm` extra pointing at alexzhang13/rlm (distribution name `rlms`, importable as `rlm`, requires Python >=3.11). Not in `all` because the Python floor is higher than SME's >=3.10 baseline. - The test module guards with `pytest.importorskip("rlm")` so a fresh clone without the extra installed skips the whole module cleanly instead of erroring during collection. 2. sme/adapters/rlm_adapter.py:257-259 had three E701 ruff failures (`if cond: stmt` one-liners). Split into proper if-blocks. Behavior unchanged — same tag construction. A third Copilot comment on PR #7 (`_load_adapter()` drops `kind` kwarg, making `--kind` a no-op for rlm) was already addressed in 7d459c4: `kind` is intentionally NOT in the rlm drop list, with a comment explaining the contract. No code change needed for that one. Verified locally: - ruff check sme/adapters/rlm_adapter.py — clean - pytest tests/test_rlm_adapter.py — 5/5 pass - pytest -q — 96 passed, 3 skipped (the one failure is a live-gated daemon integration test, unrelated to this PR) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jphein · 2026-05-03T18:55:47Z

Pushed fixup as f06f5e2. Single commit per your ask, addresses both blockers and the open Copilot comment:

Blocker 1 — tests/test_rlm_adapter.py ModuleNotFoundError on fresh clones.

Two halves of the fix:

pyproject.toml gains an opt-in rlm extra:
```
rlm = [
    "rlms @ git+https://github.com/alexzhang13/rlm.git",
]
```
Distribution name on the upstream rlm repo is rlms (importable as rlm); not on PyPI under either name, hence the git URL. requires-python = ">=3.11" over there is one minor higher than SME's >=3.10 floor, so I deliberately left it out of all — anyone wanting it has to opt in with pip install -e ".[rlm]". Keeps the lightweight-constitution promise.
tests/test_rlm_adapter.py now guards with pytest.importorskip("rlm") at module top, plus a docstring note pointing at the install incantation. Followed your test_familiar_live.py template — module-level skip pattern. Subsequent imports carry # noqa: E402 since importorskip is technically a non-import statement above them.

Blocker 2 — 3 × E701 ruff errors at sme/adapters/rlm_adapter.py:257-259.

Split the three if r.get("..."): tags.append(...) one-liners into proper if-blocks. Behavior unchanged — same tag construction.

Copilot comment 4 (_load_adapter() drops kind kwarg) — already fixed.

This one was on an older cli.py. Commit 7d459c4 (the original RlmAdapter commit) explicitly removed kind from rlm's drop list with a comment explaining the contract:

# Drop kwargs RLM doesn't understand. (RlmAdapter accepts `kind`
# to forward into mempalace_search /search calls, so it is NOT
# in this drop list.)

So --kind does flow through to RlmAdapter for cmd_retrieve. _load_adapter_from_args (used by cat4/cat5/check/cat9) also threads it. cmd_analyze / cmd_cat8 / cmd_cat5 don't surface --kind at the argparse layer, but those are graph-snapshot commands hitting /graph rather than /search, where the daemon's kind filter doesn't apply.

Verified locally:

$ ruff check sme/adapters/rlm_adapter.py
All checks passed!

$ pytest tests/test_rlm_adapter.py -q
.....                                                                    [100%]
5 passed in 0.06s

$ pytest -q
96 passed, 3 skipped in 162.10s

(One additional test fails when PALACE_DAEMON_URL is set — test_kind_content_excludes_stop_hook_checkpoints got 2 vs 2 instead of strictly fewer. That's a daemon-side behavior surface, not an SME PR issue, and gates on PALACE_DAEMON_URL so it'll skip cleanly on your fresh-clone path. Happy to file it on rboarescu/palace-daemon separately if it persists.)

On the testkit heads-up (issue #8): agreed it's the right next move. Codetopo's structural_duplication findings on the per-adapter test files are concrete refactor targets; happy to follow up there once this PR lands.

The baselines/jp_realm_v0_1_rlm_*.json artifacts on this branch were produced against the jphein/rlm fork (currently a 3-commit divergence from alexzhang13/rlm — README header + examples/mempalace_demo.py + a merge). Code surface in rlm/ is identical today, but pinning to the fork is the correct call: - Reproducibility: fresh-clone reproductions of the baselines run the same code path that produced the evidence on PR #7. - Provenance: rlm_adapter.py's docstring already says "fork of alexzhang13/rlm", so the install path matches the doc. - Future divergence: any SME-relevant patches we land on the fork flow through to a fresh clone without a pyproject churn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jphein · 2026-05-03T18:58:14Z

Small correction on top of the previous comment, pushed as ea08eb0:

The [rlm] extra now points at jphein/rlm, not upstream alexzhang13/rlm:

rlm = [
    "rlms @ git+https://github.com/jphein/rlm.git",
]

Reasoning: the baselines/jp_realm_v0_1_rlm_*.json artifacts in this PR were produced against the fork (/home/jp/Projects/rlm ↔ jphein/rlm). Currently a 3-commit divergence from upstream (README header + examples/mempalace_demo.py showing the SME-composition pattern + a merge) — the rlm/ package code itself is identical. But pinning to the fork is the right call:

Reproducibility: fresh-clone reproductions of the readings on this PR run the same code path that produced them.
Provenance: rlm_adapter.py's docstring already says "fork of alexzhang13/rlm" — the install path now matches.
Future divergence: SME-relevant patches landing on the fork flow through automatically without a pyproject churn here.

Local verification re-run:

$ ruff check sme/adapters/rlm_adapter.py
All checks passed!
$ pytest tests/test_rlm_adapter.py -q
5 passed in 0.05s

Apologies for the two-step — should have called this on the first push.

M0nkeyFl0wer · 2026-05-03T22:45:20Z

Reviewed thoroughly — adapter code, CLI wiring, baseline JSONs, and the cross-validation implications. Everything is solid.

One code blocker — cat5 doesn't forward API-mode args (--api-url, --api-key, --kind, --mock-inference, --familiar-timeout) even though check already does via _load_adapter_from_args(). I reproduced this on the PR branch: cat5 --adapter mempalace-daemon --api-url http://example.test:8085 fails with ValueError: MemPalaceDaemonAdapter needs api_url. The fix is straightforward — cmd_cat5() needs to use the same _load_adapter_from_args() path that cmd_check() uses. Not a merge blocker, but worth a quick follow-up PR once this lands; I'll file a tracking issue.

README status table — the invocation count for Qwen was listed as 25/30 zero-call, 2/30 used tool which only sums to 27. docs/ideas.md correctly notes the remaining 3 errored. I'll correct the table in main after this PR merges.

The bigger picture — your mempalace-daemon and familiar adapters complete the four-way comparison the cross-validation plan calls for. The familiar adapter framing (comparing familiar's full pipeline against the bare daemon to measure what the v0.2 reranking/temporal-decay layer adds) is exactly how we want to position it. We've also updated the Next Steps in main to link to docs/cross_validation_2026.md and docs/industry_standards_integration.md — the standards audit doc is where the decisions about what SME rolls its own vs. adopts live, so contributors can see why we chose B-Cubed over V-measure (Amigó et al. 2009 proves B-Cubed satisfies all four formal cluster-eval constraints) and what we surveyed and rejected before settling on our provenance approach for Cat 8b.

One non-blocking question from the review: the transport-failure path in API-mode adapters produces an empty-graph report with exit 0. I'm not treating this as a PR-specific bug — it's a design question for the evaluator-honesty framing. Does an empty result from a transport failure look the same as an empty result from a genuine zero-retrieval? If they look identical, we lose the ability to distinguish "the system ran correctly and found nothing" from "the system didn't run at all." Happy to discuss in an issue rather than blocking this PR.

An invitation: with the cross-validation harness now in main and all four adapters (flat, mempalace-daemon, familiar, ladybugdb) wired up, there's a natural next step: run the RLM adapter through the LongMemEval question set. That would give us a Category 1 / 2c reading for RLM against a standardized corpus and start building the comparative picture across all six adapters. Let me know if that interests you — happy to coordinate on corpus preparation.

Codetopo signal — the structural duplication hits in test_familiar_adapter.py are real, but we've already scoped the fix in issue #8 (shared adapter contract testkit). The fact that we're running codetopo in CI means your test code is being reviewed constructively, not just functionally — I want to keep that signal live.

Thanks for the thorough work — this is exactly the kind of contribution that moves the comparative picture forward. Hope you have a great week.

Addresses Copilot review on PR #7: - Track _tool_call_count separately from _capture (which grows once per drawer returned). Previously, retrieval_path reported `len(_capture)` as the tool-call count — a single _mempalace_search call returning 5 drawers showed up as "5 tool calls", skewing any Cat 9a-shaped invocation-rate read of the per-question histogram. - retrieval_path now reports both quantities explicitly: `rlm_completion (Xms, N tool calls, M drawers)`. Binary invocation-rate readings (zero vs non-zero) are unaffected — those were always correct because zero captures means no successful call. Per-question call-count histograms from the baselines/jp_realm_v0_1_rlm_*.json files predate this fix and reflect drawer counts; only the binary read on those JSONs is reliable. - _load_adapter() no longer drops `kind` when constructing RlmAdapter. RlmAdapter.__init__ explicitly accepts `kind` to forward into mempalace_search /search calls; the CLI's --kind flag was previously a silent no-op for the rlm adapter. Deferred (commented in PR review): appending an error entry to _capture when the tool call raises (currently the failure is visible via the returned [{error: ..., results: []}] but doesn't flow into context_string or per-query telemetry). Could be a feature decision rather than a bug — error-as-data has its own tradeoffs around context-string interpretability for the LM.

…ruff E701 Addresses Ben's CHANGES_REQUESTED on M0nkeyFl0wer/multipass-structural- memory-eval#7. Two blocking items, fixed in one commit per his ask. 1. tests/test_rlm_adapter.py failed with ModuleNotFoundError: No module named 'rlm' on a fresh clone. Now: - pyproject.toml gains an opt-in `rlm` extra pointing at alexzhang13/rlm (distribution name `rlms`, importable as `rlm`, requires Python >=3.11). Not in `all` because the Python floor is higher than SME's >=3.10 baseline. - The test module guards with `pytest.importorskip("rlm")` so a fresh clone without the extra installed skips the whole module cleanly instead of erroring during collection. 2. sme/adapters/rlm_adapter.py:257-259 had three E701 ruff failures (`if cond: stmt` one-liners). Split into proper if-blocks. Behavior unchanged — same tag construction. A third Copilot comment on PR #7 (`_load_adapter()` drops `kind` kwarg, making `--kind` a no-op for rlm) was already addressed in 7d459c4: `kind` is intentionally NOT in the rlm drop list, with a comment explaining the contract. No code change needed for that one. Verified locally: - ruff check sme/adapters/rlm_adapter.py — clean - pytest tests/test_rlm_adapter.py — 5/5 pass - pytest -q — 96 passed, 3 skipped (the one failure is a live-gated daemon integration test, unrelated to this PR) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The baselines/jp_realm_v0_1_rlm_*.json artifacts on this branch were produced against the jphein/rlm fork (currently a 3-commit divergence from alexzhang13/rlm — README header + examples/mempalace_demo.py + a merge). Code surface in rlm/ is identical today, but pinning to the fork is the correct call: - Reproducibility: fresh-clone reproductions of the baselines run the same code path that produced the evidence on PR #7. - Provenance: rlm_adapter.py's docstring already says "fork of alexzhang13/rlm", so the install path matches the doc. - Future divergence: any SME-relevant patches we land on the fork flow through to a fresh clone without a pyproject churn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cat5 accepted --api-url/--api-key/--kind/--mock/--familiar-timeout at the argparse layer but quietly dropped them before constructing the adapter, so `sme-eval cat5 --adapter mempalace-daemon --api-url ...` raised ValueError("MemPalaceDaemonAdapter needs api_url") even with the URL on the command line. Refactor cmd_cat5 to use the same _load_adapter_from_args() helper that cmd_cat4/cmd_check/cmd_cat9 already use. One-line replacement plus 14 lines of bespoke kwargs-building deleted. Behaviour change: the shared helper drops auto_discover=False from kwargs (via `if val:`) where the old cat5 code passed it explicitly. Functionally equivalent because every adapter's auto_discover default is False anyway, so the resulting constructor call is identical. Regression coverage in tests/test_cli_adapter_forwarding.py — three focused tests against the helper and cmd_cat5 directly, covering both the mempalace-daemon and familiar adapter paths. M0nkeyFl0wer's exact CLI repro now reaches gap detection. Closes upstream PR #7 review comment from 2026-05-03. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

M0nkeyFl0wer · 2026-05-19T00:05:45Z

Hey — wow, you have contributed so much to this project. I loved watching the notifications roll in yesterday. The substrate-floor parity hitting R@5=0.9660 byte-identical to upstream, the FT-300 reproduction landing inside published noise, the AGE write-through spike turning up +9pp R@5 over vector-only, the Step 2-forced finding that invocation-discipline only closes the base-model gap when retrieval is wide enough, the four-regime chunking-sensitivity taxonomy you and @nakata-app converged on — this is the most measurement-dense two weeks the project has had. Substrate-floor calibration against a published number is exactly the cross-validation #9 was asking for; the 9a sub-test proposal in #3 is the right shape and supported by data I didn't expect this fork to produce this fast. Substantive replies on #9 and #3 are coming in their own threads.

To get CI green — the May-17 re-run turned up 17 ruff errors, 16 of which are auto-fixable with ruff check --fix . (F401 unused imports, F541 f-strings without placeholders). The remaining one is the import pytest at tests/test_cli_adapter_forwarding.py:21 — a one-line delete. Once that lands and CI is green, I'll do the final merge pass.

Ruff version drift is on me, not you. `pyproject.toml` pins `ruff>=0.5` unbounded; CI pulled `0.15.13` which is stricter than what was running when this PR opened. I'll land a small upper-pin in a follow-up so this can't bite long-lived PRs again.

One forward-looking note, not a complaint about this PR — the work since the May-3 review is substantial enough that it would have been easier for me to review well as 3-4 smaller PRs (postgres+AGE adapter as one; LongMemEval substrate-floor + FT-300 as another; AGE write-through spike + git-probes corpus as a third; AGENTS.md/Gemini config separately). The structural improvements that land in this PR are genuinely valuable — that's exactly why I'd rather review each thread on its own merits next time. For #7 I'll merge as-is once CI is green; for the next batch of work, can we aim for one structural thread per PR? My review cadence will be tighter on smaller PRs.

jphein

@M0nkeyFl0wer — both merge-blockers from your 2026-05-01 review are resolved.

Fix 1 — rlm dependency + test guard: 6fb6925 adds [rlm] to [project.optional-dependencies] pointing at rlms @ git+https://github.com/jphein/rlm.git, and pytest.importorskip("rlm") at the top of test_rlm_adapter.py. Module docstring documents pip install -e ".[rlm]". Fresh-clone verified: without the extra, 9 tests skip cleanly; with it, 9/9 pass.

Fix 2 — ruff E701: Same commit (6fb6925) splits the three one-liner if r.get(...): tags.append(...) patterns into proper if-blocks. ruff check sme/adapters/rlm_adapter.py passes clean as of today (2026-05-23).

Follow-up commit 348c714 pins the extra to jphein/rlm specifically (not upstream alexzhang13/rlm) since the baseline artifacts were produced against this fork.

Scope growth caveat. Since your review this branch has grown from the original adapter scope to 40 commits / 68 files. The additional work is the full Cat 9a investigation chain: Step 1 retrieval-breadth probe (8-cell table), Step 2-forced invocation decomposition, mempalace-git-probes-v2 200-question corpus, LME substrate-floor parity experiments, AGE write-through spike, and associated baselines/docs. All of it feeds back into M0nkeyFl0wer/multipass-structural-memory-eval#3 and M0nkeyFl0wer/multipass-structural-memory-eval#9.

The security fixes that were in this branch (hardcoded DSN, Cypher escaping) have already landed in your main via the techempower-org/multipass-structural-memory-eval#16 upstream sync, so those commits are duplicates in the diff.

Happy to do whichever you prefer:

Rebase + squash the experimental chain into thematic commits and force-push to reduce the diff
Split into the original RlmAdapter PR (your reviewed scope) + a separate PR for the Cat 9a experimental corpus
Merge as-is if the 40-commit history is acceptable given it's all one investigation thread

Your call — the fixes you blocked on are ready either way. 🫏

…yFl0wer#19) The ABC default returning [] was added in 3754e8a as part of Cat 9 scaffolding, but the upstream issue tracks two follow-throughs that were still missing: - Module docstring + class docstring in `base.py` claimed "Two optional" methods; promote to "Three optional" so readers see `get_harness_manifest` as part of the contract. - Spec § Adapter Interface (sme_spec_v8.md) listed only `get_flat_retrieval` and `get_ontology_source` as optional. Add `get_harness_manifest` alongside them with its default-list rationale. Adds tests/test_adapter_harness_manifest_contract.py: every shipped adapter must resolve `get_harness_manifest` (either inherited or overridden) and the ABC default must return []. Class-level checks avoid instantiating adapters with heavy constructors. Closes the silent-AttributeError class entirely for any future adapter that forgets to override.

…Fl0wer#8) Parametric test module verifying adapters conform to the SMEAdapter ABC: query returns QueryResult, graph snapshot is internally consistent, ingest accepts corpus shape. Covers FlatBaselineAdapter and a MockAdapter; other adapters opt in by registering a factory.

…er#20) Replace 8-branch if-chain with _AdapterSpec registry. Each adapter declares its accepted kwargs; unknown kwargs are silently dropped. Makes the PR #7 class of drop-list-drift regression structurally impossible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…keyFl0wer#18) Three new test files for measurement-critical modules that had zero direct tests. Covers hop-bucket grouping, Cat 8 sub-tests, and the shared graph projection helper. Prevents silent regressions in published category readings.

Remove HindsightAdapter, Mem0Adapter, OmegaAdapter, RlmAdapter imports since those adapter files don't exist in upstream repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Builds the scaffold for benchmarking RLM (jphein/rlm fork of alexzhang13/rlm) against familiar's deterministic pipeline on the jp-realm-v0.1 corpus. How it satisfies the SME contract: - palace-daemon /search is exposed to RLM via custom_tools as `mempalace_search`. Every call's results flow through a per-query capture buffer. - QueryResult.context_string is the concatenation of every drawer text RLM pulled during the run, in call order. The Cat 1 / retrieve substring scorer sees exactly what RLM saw. - retrieved_entities mirror captures so Cat 7 token counts and Cat 8 hop counts have something real (Cat 8 returns empty — RLM doesn't maintain a graph view). - ingest_corpus is a no-op (RLM consumes a palace it didn't author; ingestion happens upstream via mempalace mine). CLI: --adapter rlm available. Same kwargs-drop pattern as the other adapters. Backend defaults to portkey @openai/gpt-5-nano when PORTKEY_API_KEY is set; falls through to direct OpenAI/ Anthropic per RLM's own backend resolution. Tests (5, all pass) cover the SME contract without burning API credits — a live benchmark belongs in baselines/, not unit tests. This is research scaffolding. The actual A/B vs familiar (which the v0.3 brainstorm flagged as the lever for deciding v0.4 productionization) needs an API key + a separate run that produces a baseline JSON. Documented in docs/ideas.md.

…ard, cat5 forwarding, invocation_mode Squashes the following fixes onto the initial RlmAdapter commit: - Separate per-call tool counter from per-result drawer accumulator - Add [rlm] optional extra to pyproject.toml (pytest.importorskip guard) - Point [rlm] extra at jphein/rlm fork for baseline reproducibility - Fix cmd_cat5 to use _load_adapter_from_args() — resolves M0nkeyFl0wer review blocker (--api-url/--api-key/--kind silently dropped) - Add source_file preservation + invocation_mode for Cat 9a Step 2 - Fix gap_detection variable shadowing (considered → seeded_applicable) - Add rlm registry forwarding test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jphein · 2026-05-24T22:00:23Z

Rebased onto #30 (allowlist registry) — both review blockers from the May 3 review are resolved:

Fresh-clone ModuleNotFoundError — tests use pytest.importorskip("rlm") + [rlm] optional extra in pyproject.toml
cat5 --adapter mempalace-daemon --api-url silently drops args — cmd_cat5 now uses _load_adapter_from_args() like every other subcommand. The allowlist registry in test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) #30 structurally prevents this class of bug going forward.

Also squashed: tool-call counter fix, source_file preservation + invocation_mode for Cat 9a Step 2, gap_detection variable shadowing fix.

Please merge #30 first — once it lands, this PR's diff drops to just the RlmAdapter-specific commits. 🫏

…#19, #20) Adds 146 tests across 6 new test files. The _AdapterSpec dataclass with accepts: frozenset[str] inverts the drop-list pattern to an allowlist registry, making the PR #7 class of drop-list-drift regression structurally impossible. Closes #8, #18, #19, #20.

Extracts the substantive enhancements from upstream PR #7 that aren't already on fork main via the PR #24 bundle. Five focused changes: 1. invocation_mode="forced": prepend a system-prompt directive requiring at least one mempalace_search call before FINAL. Tests whether the invocation-rate ceiling (gemma4: 60% zero-call) is the dominant Cat 9a lever on substring-shaped corpora. 2. invocation_mode="grounded": prepend a directive requiring the answer to quote at least one retrieved source filename. Tests whether the substring-scorer-vs-LLM-synthesis gap is the lever on file-shaped corpora (n=200 git-derived probes). 3. source_file round-trip through trimmed dict and context_string — the substring scorer matches against filenames from expected_sources; dropping source_file meant correct retrieval silently scored 0. 4. ingest_corpus returns the full SMEAdapter contract dict (errors/warnings empty lists) so harness code doesn't KeyError. 5. RLM kwargs restructured to a dict so custom_system_prompt is only set when needed — keeps default behavior unchanged. Four new tests cover source_file preservation and the three invocation_mode branches (forced / grounded / default). The fork-only adapters (Hindsight, Mem0, Omega) and the LongMemEval script that PR #7 deletes are preserved here — this branch carries only the RLM-specific deltas, not the upstream-targeted cleanups. Closes M0nkeyFl0wer#7 (for the fork's purposes — upstream PR remains pending Ben's re-review). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… 2 (#29) * feat(rlm): invocation_mode + source_file preservation for Cat 9a Step 2 Extracts the substantive enhancements from upstream PR #7 that aren't already on fork main via the PR #24 bundle. Five focused changes: 1. invocation_mode="forced": prepend a system-prompt directive requiring at least one mempalace_search call before FINAL. Tests whether the invocation-rate ceiling (gemma4: 60% zero-call) is the dominant Cat 9a lever on substring-shaped corpora. 2. invocation_mode="grounded": prepend a directive requiring the answer to quote at least one retrieved source filename. Tests whether the substring-scorer-vs-LLM-synthesis gap is the lever on file-shaped corpora (n=200 git-derived probes). 3. source_file round-trip through trimmed dict and context_string — the substring scorer matches against filenames from expected_sources; dropping source_file meant correct retrieval silently scored 0. 4. ingest_corpus returns the full SMEAdapter contract dict (errors/warnings empty lists) so harness code doesn't KeyError. 5. RLM kwargs restructured to a dict so custom_system_prompt is only set when needed — keeps default behavior unchanged. Four new tests cover source_file preservation and the three invocation_mode branches (forced / grounded / default). The fork-only adapters (Hindsight, Mem0, Omega) and the LongMemEval script that PR #7 deletes are preserved here — this branch carries only the RLM-specific deltas, not the upstream-targeted cleanups. Closes M0nkeyFl0wer#7 (for the fork's purposes — upstream PR remains pending Ben's re-review). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rlm-adapter): lazy-load RLM_SYSTEM_PROMPT inside invocation_mode branch The adapter eagerly imported `rlm.utils.prompts.RLM_SYSTEM_PROMPT` at __init__ time, which broke 6 tests that mock only `rlm.RLM` — they got ModuleNotFoundError: No module named 'rlm.utils' before reaching the StubRLM. Moving the import inside the `invocation_mode in ('forced', 'grounded')` branch keeps the default path (most tests, most prod uses) free of the rlm.utils submodule dependency, and patches sys.modules in the two mode-specific tests so they can still verify REPL/FINAL scaffolding is preserved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 30, 2026 17:06

Copilot started reviewing on behalf of jphein April 30, 2026 17:07 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

jphein mentioned this pull request Apr 30, 2026

Cat 9a (invocation rate) — proposed measurement protocol from two RLM-orchestrator runs #3

Open

M0nkeyFl0wer mentioned this pull request May 1, 2026

feat(adapters): MemPalaceDaemonAdapter — read-side adapter for production palaces #5

Merged

3 tasks

M0nkeyFl0wer requested changes May 1, 2026

View reviewed changes

M0nkeyFl0wer mentioned this pull request May 1, 2026

Adapter contract testkit — extract shared SMEAdapter conformance suite #8

Closed

jphein force-pushed the feat/rlm-adapter branch from 9b03c29 to 7d081c3 Compare May 15, 2026 06:30

This was referenced May 16, 2026

Test coverage gaps for measurement-critical modules (multi_hop, ontology_coherence, topology, adapters) #18

Open

_load_adapter: invert drop-list pattern to allowlist registry (root cause of the PR #7 cat5 regression) #20

Open

jphein commented May 23, 2026

View reviewed changes

This was referenced May 23, 2026

fix: Cat9b None-vs-0.0, flat_rarity_mode flag, spec caveats, starter corpus, v0.2.0 #28

Open

fix: allowlist adapter registry + test coverage (#18, #19, #20) #29

Closed

jphein and others added 4 commits May 24, 2026 14:45

fix: trim fork-only adapters from harness manifest test for upstream

07c4ad9

Remove HindsightAdapter, Mem0Adapter, OmegaAdapter, RlmAdapter imports since those adapter files don't exist in upstream repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jphein mentioned this pull request May 24, 2026

test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) #30

Merged

3 tasks

jphein and others added 2 commits May 24, 2026 14:55

jphein force-pushed the feat/rlm-adapter branch from 6cda051 to e585469 Compare May 24, 2026 21:59

jphein closed this May 25, 2026

jphein deleted the feat/rlm-adapter branch May 25, 2026 23:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines#7

feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines#7
jphein wants to merge 7 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/rlm-adapter

jphein commented Apr 30, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

jphein Apr 30, 2026

Uh oh!

jphein Apr 30, 2026

Uh oh!

M0nkeyFl0wer left a comment

Uh oh!

jphein commented May 3, 2026

Uh oh!

jphein commented May 3, 2026

Uh oh!

M0nkeyFl0wer commented May 3, 2026

Uh oh!

M0nkeyFl0wer commented May 19, 2026

Uh oh!

jphein left a comment

Uh oh!

jphein commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jphein commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Review blockers resolved

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

jphein Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jphein Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

M0nkeyFl0wer left a comment

Choose a reason for hiding this comment

Failures on a fresh clone

Verified what I could

Heads-up on a structural thing not blocking this PR

Uh oh!

jphein commented May 3, 2026

Uh oh!

jphein commented May 3, 2026

Uh oh!

M0nkeyFl0wer commented May 3, 2026

Uh oh!

M0nkeyFl0wer commented May 19, 2026

Uh oh!

jphein left a comment

Choose a reason for hiding this comment

Uh oh!

jphein commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jphein commented Apr 30, 2026 •

edited

Loading