Skip to content

feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines#7

Closed
jphein wants to merge 7 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/rlm-adapter
Closed

feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines#7
jphein wants to merge 7 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/rlm-adapter

Conversation

@jphein
Copy link
Copy Markdown
Contributor

@jphein jphein commented Apr 30, 2026

Summary

RlmAdapter treats RLM (a fork of alexzhang13/rlm) as the read-side orchestrator rather than a standalone retrieval pipeline. The LLM decides when and how to query the palace-daemon — SME evaluates what comes back.

Stacked on #30 — please merge that PR first (allowlist adapter registry + test coverage). Once #30 lands, this PR's diff collapses to just the RlmAdapter-specific commits.

Changes

  1. sme/adapters/rlm_adapter.py — Full adapter: query via RLM tool-calling loop, get_graph_snapshot() via palace-daemon /graph, ingest_corpus() no-op (palace is pre-populated), source_file preservation + invocation_mode for Cat 9a Step 2 measurement
  2. tests/test_rlm_adapter.py — 16 tests with pytest.importorskip("rlm") guard (no CI dependency on the rlm package)
  3. pyproject.toml[rlm] optional extra pointing at jphein/rlm fork for baseline reproducibility
  4. sme/cli.pyrlm entry in _AdapterSpec registry + cmd_cat5 fix to use _load_adapter_from_args() (resolves @M0nkeyFl0wer's May 3 review blocker)
  5. sme/categories/gap_detection.py — Fix variable shadowing (consideredseeded_applicable)
  6. tests/test_gap_detection.py — 37 tests for gap detection scoring
  7. tests/test_cli_adapter_forwarding.py — rlm registry forwarding test

Review blockers resolved

  • pytest tests/test_rlm_adapter.py fails on fresh clones → importorskip + [rlm] extra
  • cat5 --adapter mempalace-daemon --api-url ... silently drops API args → _load_adapter_from_args()
  • Per-call vs per-result counter conflation → separate _tool_calls counter

Copilot AI review requested due to automatic review settings April 30, 2026 17:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an RlmAdapter that uses RLM as the read-side orchestrator (with mempalace_search exposed as a tool), plus CLI plumbing/tests/docs and new jp-realm-v0.1 artifacts to support baseline evaluation runs.

Changes:

  • Introduce sme/adapters/rlm_adapter.py with tool-call capture into context_string / retrieved_entities.
  • Extend CLI adapter loading + add flags (--api-key, --kind, --mock/--no-mock, --familiar-timeout) and add/expand adapter test coverage (unit + gated live smoke).
  • Add jp-realm-v0.1 corpus YAML plus baseline JSON run artifacts and supporting docs/README updates.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sme/adapters/rlm_adapter.py New RLM-orchestrated adapter with mempalace_search tool registration and capture for scoring.
sme/cli.py Wire new adapters and CLI flags through adapter construction paths (retrieve + cat4/cat5/check/cat9).
tests/test_rlm_adapter.py Unit tests for RlmAdapter tool-call aggregation/reset and error handling.
sme/adapters/mempalace_daemon.py HTTP adapter for palace-daemon (/search + /graph with MCP fallback).
tests/test_mempalace_daemon_integration.py Gated live smoke tests against a running palace-daemon.
sme/adapters/familiar.py Familiar HTTP adapter (eval + graph) with warning/error translation and optional helpers.
tests/test_familiar_adapter.py Unit tests for FamiliarAdapter request/response mapping and error contracts.
tests/test_familiar_live.py Gated live smoke tests against a running familiar instance.
sme/adapters/_graph_mapping.py Shared mapping of daemon /graph payload to Entity/Edge.
tests/conftest.py Adds fake_urlopen_factory for deterministic HTTP mocking in adapter tests.
tests/fixtures/tiny_questions.yaml Small fixture question set for smoke-style runs.
sme/corpora/jp_realm_v0_1/questions.yaml New jp-realm-v0.1 diagnostic corpus (30 questions).
docs/superpowers/specs/2026-04-26-familiar-adapter-design.md Design spec for FamiliarAdapter + CLI/test approach.
docs/superpowers/specs/2026-04-25-mempalace-daemon-adapter-design.md Design spec for MemPalaceDaemonAdapter + CLI/test approach.
docs/ideas.md Records lessons learned + forward-compat notes + RlmAdapter scaffold notes.
README.md Documents fork roadmap and shipped adapters + invocation examples.
baselines/jp_realm_v0_1_familiar_20260426.json Baseline run artifact for jp-realm-v0.1 (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v2.json Baseline run artifact variant (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v3.json Baseline run artifact variant (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v4_punct_fix.json Baseline run artifact variant (familiar).
baselines/jp_realm_v0_1_familiar_v0.3_candidate_window.json Baseline run artifact variant (familiar v0.3).
baselines/jp_realm_v0_1_familiar_v0.3_reflect.json Baseline run artifact variant (familiar v0.3).
baselines/jp_realm_v0_1_familiar_v0.3_task1_bm25.json Baseline run artifact variant (familiar v0.3).
baselines/jp_realm_v0_1_familiar_v0.3.9_reflect_cumulative.json Baseline run artifact variant (familiar v0.3.9).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +185 to +187
params = {"q": query, "limit": str(limit), "kind": self.kind}
url = f"{self.api_url}/search?" + "&".join(f"{k}={_urlrequest.quote(v)}" for k, v in params.items())
req = _urlrequest.Request(url)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified false positive on Python 3.10+: urllib.request silently re-exports quote from urllib.parse (hasattr(urllib.request, 'quote') returns True; quote.__module__ is 'urllib.parse'). The code works in production — the baselines/jp_realm_v0_1_rlm_*.json artifacts in this PR are evidence. That said, relying on an undocumented re-export is fragile across Python versions, so a cosmetic refactor to urllib.parse.quote is reasonable. Left for a follow-up PR to keep this one focused on substantive fixes.

try:
with _urlrequest.urlopen(req, timeout=self.timeout_s) as resp:
payload = json.loads(resp.read().decode("utf-8"))
except (_urlerror.URLError, _urlerror.HTTPError, OSError) as e:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred — could be a feature decision rather than a bug. Returning [{error: ..., results: []}] to the LM (so it sees the failure in its tool-result payload and can route around it) is the current path; appending an error entry to _capture would put the failure into context_string for SME scoring, which has tradeoffs around interpretability for the LM on the next turn. Will revisit once there's signal on whether SME wants failed-tool-call invocations counted in scoring. Note: binary invocation rate (the headline 9a quantity) is unaffected — _tool_call_count is incremented at the top of _mempalace_search before the try/except in bcb4799, so failed calls are still counted toward 9a even though they don't add drawers to _capture.

Comment on lines +286 to +289
# Strings, not dicts — cli's `'; '.join(path)` expects strings.
retrieval_path=[
f"rlm_completion ({elapsed_ms}ms, {len(self._capture)} tool calls)",
],
Comment thread sme/cli.py Outdated
Comment on lines +55 to +69
# Drop kwargs RLM doesn't understand.
for k in (
"include_node_tables",
"include_edge_tables",
"auto_discover",
"kg_path",
"collection_name",
"default_query_mode",
"db_path",
"buffer_pool_size",
"kind",
"read_only",
):
kwargs.pop(k, None)
return RlmAdapter(**kwargs)
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval Apr 30, 2026
Addresses Copilot review on PR #7:

- Track _tool_call_count separately from _capture (which grows once
  per drawer returned). Previously, retrieval_path reported
  `len(_capture)` as the tool-call count — a single _mempalace_search
  call returning 5 drawers showed up as "5 tool calls", skewing any
  Cat 9a-shaped invocation-rate read of the per-question histogram.
- retrieval_path now reports both quantities explicitly:
  `rlm_completion (Xms, N tool calls, M drawers)`. Binary
  invocation-rate readings (zero vs non-zero) are unaffected — those
  were always correct because zero captures means no successful
  call. Per-question call-count histograms from the
  baselines/jp_realm_v0_1_rlm_*.json files predate this fix and
  reflect drawer counts; only the binary read on those JSONs is
  reliable.
- _load_adapter() no longer drops `kind` when constructing
  RlmAdapter. RlmAdapter.__init__ explicitly accepts `kind` to
  forward into mempalace_search /search calls; the CLI's --kind flag
  was previously a silent no-op for the rlm adapter.

Deferred (commented in PR review): appending an error entry to
_capture when the tool call raises (currently the failure is
visible via the returned [{error: ..., results: []}] but doesn't
flow into context_string or per-query telemetry). Could be a
feature decision rather than a bug — error-as-data has its own
tradeoffs around context-string interpretability for the LM.
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval Apr 30, 2026
Addresses Copilot review on PR #7:

- Track _tool_call_count separately from _capture (which grows once
  per drawer returned). Previously, retrieval_path reported
  `len(_capture)` as the tool-call count — a single _mempalace_search
  call returning 5 drawers showed up as "5 tool calls", skewing any
  Cat 9a-shaped invocation-rate read of the per-question histogram.
- retrieval_path now reports both quantities explicitly:
  `rlm_completion (Xms, N tool calls, M drawers)`. Binary
  invocation-rate readings (zero vs non-zero) are unaffected — those
  were always correct because zero captures means no successful
  call. Per-question call-count histograms from the
  baselines/jp_realm_v0_1_rlm_*.json files predate this fix and
  reflect drawer counts; only the binary read on those JSONs is
  reliable.
- _load_adapter() no longer drops `kind` when constructing
  RlmAdapter. RlmAdapter.__init__ explicitly accepts `kind` to
  forward into mempalace_search /search calls; the CLI's --kind flag
  was previously a silent no-op for the rlm adapter.

Deferred (commented in PR review): appending an error entry to
_capture when the tool call raises (currently the failure is
visible via the returned [{error: ..., results: []}] but doesn't
flow into context_string or per-query telemetry). Could be a
feature decision rather than a bug — error-as-data has its own
tradeoffs around context-string interpretability for the LM.
Copy link
Copy Markdown
Owner

@M0nkeyFl0wer M0nkeyFl0wer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught a couple of things on local checkout that block merge — both small fixes, all worth landing as a single follow-up commit.

Failures on a fresh clone

1. pytest tests/test_rlm_adapter.py — 5 tests fail with ModuleNotFoundError: No module named 'rlm'.

The tests use unittest.mock.patch.object(rlm, ...) but rlm isn't declared in pyproject.toml. On your fork it presumably resolves via a venv that has it installed; on a fresh clone of this repo it doesn't.

Fix: add rlm to [project.optional-dependencies] (probably an rlm extra, alongside the existing optional groups), and either guard the test module with pytest.importorskip("rlm") at the top or document the install command in the module docstring. The pattern in tests/test_familiar_live.py (gated live smoke) is a reasonable template.

2. ruff check sme/adapters/rlm_adapter.py — 3 E701 errors.

sme/adapters/rlm_adapter.py:257:34: E701 Multiple statements on one line (colon)
sme/adapters/rlm_adapter.py:258:29: E701 Multiple statements on one line (colon)
sme/adapters/rlm_adapter.py:259:29: E701 Multiple statements on one line (colon)

The if r.get("drawer_id"): tags.append(...) pattern at 257-259 needs to be split into proper if-blocks (or rewritten with a list comp). 5 minutes.

Verified what I could

  • The bcb4799 / 927493e tool-call counter fix I commented on at issue #3 reads cleanly — _capture per drawer vs _tool_call_count per invocation are now properly separated, and the retrieval_path emit format (N tool calls, M drawers) is the right surface.
  • The 162 LOC adapter itself is well-scoped; design lines up with the issue #3 conversation.
  • The 2 RLM baseline JSONs (Qwen-7B, Llama-70B) carry the fixed-format readings — those are good.

Heads-up on a structural thing not blocking this PR

Across PR #5, #6, and #7 we've now landed about 1,085 lines of adapter unit tests that all share the same shape — HTTP-mock fixtures plus deserialization assertions. None of them actually run an SME category against the adapter; they verify wrapper plumbing, not adapter conformance to the SMEAdapter ABC. I'm filing a separate issue to introduce a shared adapter-contract testkit so the next adapter PR can register itself in pytest.parametrize instead of writing 200+ lines of bespoke HTTP mocks.

Not a blocker for this PR — keep your existing tests as the implementation-level reference. The testkit will sit alongside, not replace.

Once the two fixes above land, this is good to merge.

Thanks again for the stack — the RLM adapter is the meaningful new measurement primitive in here (orchestrator-mediated retrieval, the actual Cat 9 surface), and the bcb4799 fix is the kind of self-correction discipline I want to keep encouraging.

M0nkeyFl0wer pushed a commit that referenced this pull request May 1, 2026
Triggered by jphein's PR #7 shipping with 5 ModuleNotFoundError test
failures and 3 ruff E701 errors that no automated check caught — the
contributor's fork has the deps, the public repo doesn't, and there
was no CI to enforce the difference.

Workflow runs on push/PR to main, installs the package with [dev]
extras, runs ruff check (project config is line-length=100,
target=py310) and pytest. Matrix matches pyproject.toml's declared
support: 3.10, 3.11, 3.12.

Note: this workflow is committed but has not yet executed against
GitHub's runners. First execution will be the runtime validation.
If anything fails, the fix lands in a follow-up commit before more
PRs come through.
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval May 3, 2026
…ruff E701

Addresses Ben's CHANGES_REQUESTED on M0nkeyFl0wer/multipass-structural-
memory-eval#7. Two blocking items, fixed in one commit per his ask.

1. tests/test_rlm_adapter.py failed with ModuleNotFoundError: No module
   named 'rlm' on a fresh clone. Now:
   - pyproject.toml gains an opt-in `rlm` extra pointing at
     alexzhang13/rlm (distribution name `rlms`, importable as `rlm`,
     requires Python >=3.11). Not in `all` because the Python floor is
     higher than SME's >=3.10 baseline.
   - The test module guards with `pytest.importorskip("rlm")` so a
     fresh clone without the extra installed skips the whole module
     cleanly instead of erroring during collection.

2. sme/adapters/rlm_adapter.py:257-259 had three E701 ruff failures
   (`if cond: stmt` one-liners). Split into proper if-blocks. Behavior
   unchanged — same tag construction.

A third Copilot comment on PR #7 (`_load_adapter()` drops `kind`
kwarg, making `--kind` a no-op for rlm) was already addressed in
7d459c4: `kind` is intentionally NOT in the rlm drop list, with a
comment explaining the contract. No code change needed for that one.

Verified locally:
- ruff check sme/adapters/rlm_adapter.py — clean
- pytest tests/test_rlm_adapter.py — 5/5 pass
- pytest -q — 96 passed, 3 skipped (the one failure is a live-gated
  daemon integration test, unrelated to this PR)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jphein
Copy link
Copy Markdown
Contributor Author

jphein commented May 3, 2026

Pushed fixup as f06f5e2. Single commit per your ask, addresses both blockers and the open Copilot comment:

Blocker 1 — tests/test_rlm_adapter.py ModuleNotFoundError on fresh clones.

Two halves of the fix:

  • pyproject.toml gains an opt-in rlm extra:

    rlm = [
        "rlms @ git+https://github.com/alexzhang13/rlm.git",
    ]

    Distribution name on the upstream rlm repo is rlms (importable as rlm); not on PyPI under either name, hence the git URL. requires-python = ">=3.11" over there is one minor higher than SME's >=3.10 floor, so I deliberately left it out of all — anyone wanting it has to opt in with pip install -e ".[rlm]". Keeps the lightweight-constitution promise.

  • tests/test_rlm_adapter.py now guards with pytest.importorskip("rlm") at module top, plus a docstring note pointing at the install incantation. Followed your test_familiar_live.py template — module-level skip pattern. Subsequent imports carry # noqa: E402 since importorskip is technically a non-import statement above them.

Blocker 2 — 3 × E701 ruff errors at sme/adapters/rlm_adapter.py:257-259.

Split the three if r.get("..."): tags.append(...) one-liners into proper if-blocks. Behavior unchanged — same tag construction.

Copilot comment 4 (_load_adapter() drops kind kwarg) — already fixed.

This one was on an older cli.py. Commit 7d459c4 (the original RlmAdapter commit) explicitly removed kind from rlm's drop list with a comment explaining the contract:

# Drop kwargs RLM doesn't understand. (RlmAdapter accepts `kind`
# to forward into mempalace_search /search calls, so it is NOT
# in this drop list.)

So --kind does flow through to RlmAdapter for cmd_retrieve. _load_adapter_from_args (used by cat4/cat5/check/cat9) also threads it. cmd_analyze / cmd_cat8 / cmd_cat5 don't surface --kind at the argparse layer, but those are graph-snapshot commands hitting /graph rather than /search, where the daemon's kind filter doesn't apply.

Verified locally:

$ ruff check sme/adapters/rlm_adapter.py
All checks passed!

$ pytest tests/test_rlm_adapter.py -q
.....                                                                    [100%]
5 passed in 0.06s

$ pytest -q
96 passed, 3 skipped in 162.10s

(One additional test fails when PALACE_DAEMON_URL is set — test_kind_content_excludes_stop_hook_checkpoints got 2 vs 2 instead of strictly fewer. That's a daemon-side behavior surface, not an SME PR issue, and gates on PALACE_DAEMON_URL so it'll skip cleanly on your fresh-clone path. Happy to file it on rboarescu/palace-daemon separately if it persists.)

On the testkit heads-up (issue #8): agreed it's the right next move. Codetopo's structural_duplication findings on the per-adapter test files are concrete refactor targets; happy to follow up there once this PR lands.

jphein referenced this pull request in techempower-org/multipass-structural-memory-eval May 3, 2026
The baselines/jp_realm_v0_1_rlm_*.json artifacts on this branch were
produced against the jphein/rlm fork (currently a 3-commit divergence
from alexzhang13/rlm — README header + examples/mempalace_demo.py + a
merge). Code surface in rlm/ is identical today, but pinning to the
fork is the correct call:

  - Reproducibility: fresh-clone reproductions of the baselines run
    the same code path that produced the evidence on PR #7.
  - Provenance: rlm_adapter.py's docstring already says "fork of
    alexzhang13/rlm", so the install path matches the doc.
  - Future divergence: any SME-relevant patches we land on the fork
    flow through to a fresh clone without a pyproject churn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jphein
Copy link
Copy Markdown
Contributor Author

jphein commented May 3, 2026

Small correction on top of the previous comment, pushed as ea08eb0:

The [rlm] extra now points at jphein/rlm, not upstream alexzhang13/rlm:

rlm = [
    "rlms @ git+https://github.com/jphein/rlm.git",
]

Reasoning: the baselines/jp_realm_v0_1_rlm_*.json artifacts in this PR were produced against the fork (/home/jp/Projects/rlmjphein/rlm). Currently a 3-commit divergence from upstream (README header + examples/mempalace_demo.py showing the SME-composition pattern + a merge) — the rlm/ package code itself is identical. But pinning to the fork is the right call:

  • Reproducibility: fresh-clone reproductions of the readings on this PR run the same code path that produced them.
  • Provenance: rlm_adapter.py's docstring already says "fork of alexzhang13/rlm" — the install path now matches.
  • Future divergence: SME-relevant patches landing on the fork flow through automatically without a pyproject churn here.

Local verification re-run:

$ ruff check sme/adapters/rlm_adapter.py
All checks passed!
$ pytest tests/test_rlm_adapter.py -q
5 passed in 0.05s

Apologies for the two-step — should have called this on the first push.

@M0nkeyFl0wer
Copy link
Copy Markdown
Owner

Reviewed thoroughly — adapter code, CLI wiring, baseline JSONs, and the cross-validation implications. Everything is solid.

One code blockercat5 doesn't forward API-mode args (--api-url, --api-key, --kind, --mock-inference, --familiar-timeout) even though check already does via _load_adapter_from_args(). I reproduced this on the PR branch: cat5 --adapter mempalace-daemon --api-url http://example.test:8085 fails with ValueError: MemPalaceDaemonAdapter needs api_url. The fix is straightforward — cmd_cat5() needs to use the same _load_adapter_from_args() path that cmd_check() uses. Not a merge blocker, but worth a quick follow-up PR once this lands; I'll file a tracking issue.

README status table — the invocation count for Qwen was listed as 25/30 zero-call, 2/30 used tool which only sums to 27. docs/ideas.md correctly notes the remaining 3 errored. I'll correct the table in main after this PR merges.

The bigger picture — your mempalace-daemon and familiar adapters complete the four-way comparison the cross-validation plan calls for. The familiar adapter framing (comparing familiar's full pipeline against the bare daemon to measure what the v0.2 reranking/temporal-decay layer adds) is exactly how we want to position it. We've also updated the Next Steps in main to link to docs/cross_validation_2026.md and docs/industry_standards_integration.md — the standards audit doc is where the decisions about what SME rolls its own vs. adopts live, so contributors can see why we chose B-Cubed over V-measure (Amigó et al. 2009 proves B-Cubed satisfies all four formal cluster-eval constraints) and what we surveyed and rejected before settling on our provenance approach for Cat 8b.

One non-blocking question from the review: the transport-failure path in API-mode adapters produces an empty-graph report with exit 0. I'm not treating this as a PR-specific bug — it's a design question for the evaluator-honesty framing. Does an empty result from a transport failure look the same as an empty result from a genuine zero-retrieval? If they look identical, we lose the ability to distinguish "the system ran correctly and found nothing" from "the system didn't run at all." Happy to discuss in an issue rather than blocking this PR.

An invitation: with the cross-validation harness now in main and all four adapters (flat, mempalace-daemon, familiar, ladybugdb) wired up, there's a natural next step: run the RLM adapter through the LongMemEval question set. That would give us a Category 1 / 2c reading for RLM against a standardized corpus and start building the comparative picture across all six adapters. Let me know if that interests you — happy to coordinate on corpus preparation.

Codetopo signal — the structural duplication hits in test_familiar_adapter.py are real, but we've already scoped the fix in issue #8 (shared adapter contract testkit). The fact that we're running codetopo in CI means your test code is being reviewed constructively, not just functionally — I want to keep that signal live.

Thanks for the thorough work — this is exactly the kind of contribution that moves the comparative picture forward. Hope you have a great week.

@jphein jphein force-pushed the feat/rlm-adapter branch from 9b03c29 to 7d081c3 Compare May 15, 2026 06:30
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval May 15, 2026
Addresses Copilot review on PR #7:

- Track _tool_call_count separately from _capture (which grows once
  per drawer returned). Previously, retrieval_path reported
  `len(_capture)` as the tool-call count — a single _mempalace_search
  call returning 5 drawers showed up as "5 tool calls", skewing any
  Cat 9a-shaped invocation-rate read of the per-question histogram.
- retrieval_path now reports both quantities explicitly:
  `rlm_completion (Xms, N tool calls, M drawers)`. Binary
  invocation-rate readings (zero vs non-zero) are unaffected — those
  were always correct because zero captures means no successful
  call. Per-question call-count histograms from the
  baselines/jp_realm_v0_1_rlm_*.json files predate this fix and
  reflect drawer counts; only the binary read on those JSONs is
  reliable.
- _load_adapter() no longer drops `kind` when constructing
  RlmAdapter. RlmAdapter.__init__ explicitly accepts `kind` to
  forward into mempalace_search /search calls; the CLI's --kind flag
  was previously a silent no-op for the rlm adapter.

Deferred (commented in PR review): appending an error entry to
_capture when the tool call raises (currently the failure is
visible via the returned [{error: ..., results: []}] but doesn't
flow into context_string or per-query telemetry). Could be a
feature decision rather than a bug — error-as-data has its own
tradeoffs around context-string interpretability for the LM.
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval May 15, 2026
…ruff E701

Addresses Ben's CHANGES_REQUESTED on M0nkeyFl0wer/multipass-structural-
memory-eval#7. Two blocking items, fixed in one commit per his ask.

1. tests/test_rlm_adapter.py failed with ModuleNotFoundError: No module
   named 'rlm' on a fresh clone. Now:
   - pyproject.toml gains an opt-in `rlm` extra pointing at
     alexzhang13/rlm (distribution name `rlms`, importable as `rlm`,
     requires Python >=3.11). Not in `all` because the Python floor is
     higher than SME's >=3.10 baseline.
   - The test module guards with `pytest.importorskip("rlm")` so a
     fresh clone without the extra installed skips the whole module
     cleanly instead of erroring during collection.

2. sme/adapters/rlm_adapter.py:257-259 had three E701 ruff failures
   (`if cond: stmt` one-liners). Split into proper if-blocks. Behavior
   unchanged — same tag construction.

A third Copilot comment on PR #7 (`_load_adapter()` drops `kind`
kwarg, making `--kind` a no-op for rlm) was already addressed in
7d459c4: `kind` is intentionally NOT in the rlm drop list, with a
comment explaining the contract. No code change needed for that one.

Verified locally:
- ruff check sme/adapters/rlm_adapter.py — clean
- pytest tests/test_rlm_adapter.py — 5/5 pass
- pytest -q — 96 passed, 3 skipped (the one failure is a live-gated
  daemon integration test, unrelated to this PR)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval May 15, 2026
The baselines/jp_realm_v0_1_rlm_*.json artifacts on this branch were
produced against the jphein/rlm fork (currently a 3-commit divergence
from alexzhang13/rlm — README header + examples/mempalace_demo.py + a
merge). Code surface in rlm/ is identical today, but pinning to the
fork is the correct call:

  - Reproducibility: fresh-clone reproductions of the baselines run
    the same code path that produced the evidence on PR #7.
  - Provenance: rlm_adapter.py's docstring already says "fork of
    alexzhang13/rlm", so the install path matches the doc.
  - Future divergence: any SME-relevant patches we land on the fork
    flow through to a fresh clone without a pyproject churn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval May 15, 2026
cat5 accepted --api-url/--api-key/--kind/--mock/--familiar-timeout at the
argparse layer but quietly dropped them before constructing the adapter,
so `sme-eval cat5 --adapter mempalace-daemon --api-url ...` raised
ValueError("MemPalaceDaemonAdapter needs api_url") even with the URL on
the command line.

Refactor cmd_cat5 to use the same _load_adapter_from_args() helper that
cmd_cat4/cmd_check/cmd_cat9 already use. One-line replacement plus 14
lines of bespoke kwargs-building deleted.

Behaviour change: the shared helper drops auto_discover=False from
kwargs (via `if val:`) where the old cat5 code passed it explicitly.
Functionally equivalent because every adapter's auto_discover default
is False anyway, so the resulting constructor call is identical.

Regression coverage in tests/test_cli_adapter_forwarding.py — three
focused tests against the helper and cmd_cat5 directly, covering both
the mempalace-daemon and familiar adapter paths. M0nkeyFl0wer's exact
CLI repro now reaches gap detection.

Closes upstream PR #7 review comment from 2026-05-03.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@M0nkeyFl0wer
Copy link
Copy Markdown
Owner

Hey — wow, you have contributed so much to this project. I loved watching the notifications roll in yesterday. The substrate-floor parity hitting R@5=0.9660 byte-identical to upstream, the FT-300 reproduction landing inside published noise, the AGE write-through spike turning up +9pp R@5 over vector-only, the Step 2-forced finding that invocation-discipline only closes the base-model gap when retrieval is wide enough, the four-regime chunking-sensitivity taxonomy you and @nakata-app converged on — this is the most measurement-dense two weeks the project has had. Substrate-floor calibration against a published number is exactly the cross-validation #9 was asking for; the 9a sub-test proposal in #3 is the right shape and supported by data I didn't expect this fork to produce this fast. Substantive replies on #9 and #3 are coming in their own threads.

To get CI green — the May-17 re-run turned up 17 ruff errors, 16 of which are auto-fixable with ruff check --fix . (F401 unused imports, F541 f-strings without placeholders). The remaining one is the import pytest at tests/test_cli_adapter_forwarding.py:21 — a one-line delete. Once that lands and CI is green, I'll do the final merge pass.

Ruff version drift is on me, not you. `pyproject.toml` pins `ruff>=0.5` unbounded; CI pulled `0.15.13` which is stricter than what was running when this PR opened. I'll land a small upper-pin in a follow-up so this can't bite long-lived PRs again.

One forward-looking note, not a complaint about this PR — the work since the May-3 review is substantial enough that it would have been easier for me to review well as 3-4 smaller PRs (postgres+AGE adapter as one; LongMemEval substrate-floor + FT-300 as another; AGE write-through spike + git-probes corpus as a third; AGENTS.md/Gemini config separately). The structural improvements that land in this PR are genuinely valuable — that's exactly why I'd rather review each thread on its own merits next time. For #7 I'll merge as-is once CI is green; for the next batch of work, can we aim for one structural thread per PR? My review cadence will be tighter on smaller PRs.

Copy link
Copy Markdown
Contributor Author

@jphein jphein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@M0nkeyFl0wer — both merge-blockers from your 2026-05-01 review are resolved.

Fix 1 — rlm dependency + test guard: 6fb6925 adds [rlm] to [project.optional-dependencies] pointing at rlms @ git+https://github.com/jphein/rlm.git, and pytest.importorskip("rlm") at the top of test_rlm_adapter.py. Module docstring documents pip install -e ".[rlm]". Fresh-clone verified: without the extra, 9 tests skip cleanly; with it, 9/9 pass.

Fix 2 — ruff E701: Same commit (6fb6925) splits the three one-liner if r.get(...): tags.append(...) patterns into proper if-blocks. ruff check sme/adapters/rlm_adapter.py passes clean as of today (2026-05-23).

Follow-up commit 348c714 pins the extra to jphein/rlm specifically (not upstream alexzhang13/rlm) since the baseline artifacts were produced against this fork.


Scope growth caveat. Since your review this branch has grown from the original adapter scope to 40 commits / 68 files. The additional work is the full Cat 9a investigation chain: Step 1 retrieval-breadth probe (8-cell table), Step 2-forced invocation decomposition, mempalace-git-probes-v2 200-question corpus, LME substrate-floor parity experiments, AGE write-through spike, and associated baselines/docs. All of it feeds back into M0nkeyFl0wer/multipass-structural-memory-eval#3 and M0nkeyFl0wer/multipass-structural-memory-eval#9.

The security fixes that were in this branch (hardcoded DSN, Cypher escaping) have already landed in your main via the techempower-org/multipass-structural-memory-eval#16 upstream sync, so those commits are duplicates in the diff.

Happy to do whichever you prefer:

  1. Rebase + squash the experimental chain into thematic commits and force-push to reduce the diff
  2. Split into the original RlmAdapter PR (your reviewed scope) + a separate PR for the Cat 9a experimental corpus
  3. Merge as-is if the 40-commit history is acceptable given it's all one investigation thread

Your call — the fixes you blocked on are ready either way. 🫏

jphein and others added 4 commits May 24, 2026 14:45
…yFl0wer#19)

The ABC default returning [] was added in 3754e8a as part of Cat 9
scaffolding, but the upstream issue tracks two follow-throughs that
were still missing:

- Module docstring + class docstring in `base.py` claimed "Two
  optional" methods; promote to "Three optional" so readers see
  `get_harness_manifest` as part of the contract.
- Spec § Adapter Interface (sme_spec_v8.md) listed only
  `get_flat_retrieval` and `get_ontology_source` as optional. Add
  `get_harness_manifest` alongside them with its default-list rationale.

Adds tests/test_adapter_harness_manifest_contract.py: every shipped
adapter must resolve `get_harness_manifest` (either inherited or
overridden) and the ABC default must return []. Class-level checks
avoid instantiating adapters with heavy constructors. Closes the
silent-AttributeError class entirely for any future adapter that
forgets to override.
…Fl0wer#8)

Parametric test module verifying adapters conform to the SMEAdapter
ABC: query returns QueryResult, graph snapshot is internally
consistent, ingest accepts corpus shape. Covers FlatBaselineAdapter
and a MockAdapter; other adapters opt in by registering a factory.
…er#20)

Replace 8-branch if-chain with _AdapterSpec registry. Each adapter
declares its accepted kwargs; unknown kwargs are silently dropped.
Makes the PR #7 class of drop-list-drift regression structurally
impossible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…keyFl0wer#18)

Three new test files for measurement-critical modules that had zero
direct tests. Covers hop-bucket grouping, Cat 8 sub-tests, and the
shared graph projection helper. Prevents silent regressions in
published category readings.
Remove HindsightAdapter, Mem0Adapter, OmegaAdapter, RlmAdapter imports
since those adapter files don't exist in upstream repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jphein and others added 2 commits May 24, 2026 14:55
Builds the scaffold for benchmarking RLM (jphein/rlm fork of
alexzhang13/rlm) against familiar's deterministic pipeline on the
jp-realm-v0.1 corpus.

How it satisfies the SME contract:
  - palace-daemon /search is exposed to RLM via custom_tools as
    `mempalace_search`. Every call's results flow through a
    per-query capture buffer.
  - QueryResult.context_string is the concatenation of every drawer
    text RLM pulled during the run, in call order. The Cat 1 /
    retrieve substring scorer sees exactly what RLM saw.
  - retrieved_entities mirror captures so Cat 7 token counts and
    Cat 8 hop counts have something real (Cat 8 returns empty —
    RLM doesn't maintain a graph view).
  - ingest_corpus is a no-op (RLM consumes a palace it didn't
    author; ingestion happens upstream via mempalace mine).

CLI: --adapter rlm available. Same kwargs-drop pattern as the
other adapters. Backend defaults to portkey @openai/gpt-5-nano
when PORTKEY_API_KEY is set; falls through to direct OpenAI/
Anthropic per RLM's own backend resolution.

Tests (5, all pass) cover the SME contract without burning API
credits — a live benchmark belongs in baselines/, not unit tests.

This is research scaffolding. The actual A/B vs familiar (which
the v0.3 brainstorm flagged as the lever for deciding v0.4
productionization) needs an API key + a separate run that produces
a baseline JSON. Documented in docs/ideas.md.
…ard, cat5 forwarding, invocation_mode

Squashes the following fixes onto the initial RlmAdapter commit:
- Separate per-call tool counter from per-result drawer accumulator
- Add [rlm] optional extra to pyproject.toml (pytest.importorskip guard)
- Point [rlm] extra at jphein/rlm fork for baseline reproducibility
- Fix cmd_cat5 to use _load_adapter_from_args() — resolves M0nkeyFl0wer
  review blocker (--api-url/--api-key/--kind silently dropped)
- Add source_file preservation + invocation_mode for Cat 9a Step 2
- Fix gap_detection variable shadowing (considered → seeded_applicable)
- Add rlm registry forwarding test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jphein jphein force-pushed the feat/rlm-adapter branch from 6cda051 to e585469 Compare May 24, 2026 21:59
@jphein
Copy link
Copy Markdown
Contributor Author

jphein commented May 24, 2026

Rebased onto #30 (allowlist registry) — both review blockers from the May 3 review are resolved:

  1. Fresh-clone ModuleNotFoundError — tests use pytest.importorskip("rlm") + [rlm] optional extra in pyproject.toml
  2. cat5 --adapter mempalace-daemon --api-url silently drops argscmd_cat5 now uses _load_adapter_from_args() like every other subcommand. The allowlist registry in test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) #30 structurally prevents this class of bug going forward.

Also squashed: tool-call counter fix, source_file preservation + invocation_mode for Cat 9a Step 2, gap_detection variable shadowing fix.

Please merge #30 first — once it lands, this PR's diff drops to just the RlmAdapter-specific commits. 🫏

@jphein jphein closed this May 25, 2026
@jphein jphein deleted the feat/rlm-adapter branch May 25, 2026 23:29
M0nkeyFl0wer pushed a commit that referenced this pull request May 26, 2026
…#19, #20)

Adds 146 tests across 6 new test files. The _AdapterSpec dataclass with accepts: frozenset[str] inverts the drop-list pattern to an allowlist registry, making the PR #7 class of drop-list-drift regression structurally impossible.

Closes #8, #18, #19, #20.
jphein added a commit to techempower-org/multipass-structural-memory-eval that referenced this pull request May 26, 2026
Extracts the substantive enhancements from upstream PR #7 that aren't
already on fork main via the PR #24 bundle. Five focused changes:

1. invocation_mode="forced": prepend a system-prompt directive
   requiring at least one mempalace_search call before FINAL. Tests
   whether the invocation-rate ceiling (gemma4: 60% zero-call) is
   the dominant Cat 9a lever on substring-shaped corpora.

2. invocation_mode="grounded": prepend a directive requiring the
   answer to quote at least one retrieved source filename. Tests
   whether the substring-scorer-vs-LLM-synthesis gap is the lever
   on file-shaped corpora (n=200 git-derived probes).

3. source_file round-trip through trimmed dict and context_string —
   the substring scorer matches against filenames from
   expected_sources; dropping source_file meant correct retrieval
   silently scored 0.

4. ingest_corpus returns the full SMEAdapter contract dict
   (errors/warnings empty lists) so harness code doesn't KeyError.

5. RLM kwargs restructured to a dict so custom_system_prompt is
   only set when needed — keeps default behavior unchanged.

Four new tests cover source_file preservation and the three
invocation_mode branches (forced / grounded / default).

The fork-only adapters (Hindsight, Mem0, Omega) and the LongMemEval
script that PR #7 deletes are preserved here — this branch carries
only the RLM-specific deltas, not the upstream-targeted cleanups.

Closes M0nkeyFl0wer#7 (for the
fork's purposes — upstream PR remains pending Ben's re-review).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jphein added a commit to techempower-org/multipass-structural-memory-eval that referenced this pull request May 26, 2026
… 2 (#29)

* feat(rlm): invocation_mode + source_file preservation for Cat 9a Step 2

Extracts the substantive enhancements from upstream PR #7 that aren't
already on fork main via the PR #24 bundle. Five focused changes:

1. invocation_mode="forced": prepend a system-prompt directive
   requiring at least one mempalace_search call before FINAL. Tests
   whether the invocation-rate ceiling (gemma4: 60% zero-call) is
   the dominant Cat 9a lever on substring-shaped corpora.

2. invocation_mode="grounded": prepend a directive requiring the
   answer to quote at least one retrieved source filename. Tests
   whether the substring-scorer-vs-LLM-synthesis gap is the lever
   on file-shaped corpora (n=200 git-derived probes).

3. source_file round-trip through trimmed dict and context_string —
   the substring scorer matches against filenames from
   expected_sources; dropping source_file meant correct retrieval
   silently scored 0.

4. ingest_corpus returns the full SMEAdapter contract dict
   (errors/warnings empty lists) so harness code doesn't KeyError.

5. RLM kwargs restructured to a dict so custom_system_prompt is
   only set when needed — keeps default behavior unchanged.

Four new tests cover source_file preservation and the three
invocation_mode branches (forced / grounded / default).

The fork-only adapters (Hindsight, Mem0, Omega) and the LongMemEval
script that PR #7 deletes are preserved here — this branch carries
only the RLM-specific deltas, not the upstream-targeted cleanups.

Closes M0nkeyFl0wer#7 (for the
fork's purposes — upstream PR remains pending Ben's re-review).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rlm-adapter): lazy-load RLM_SYSTEM_PROMPT inside invocation_mode branch

The adapter eagerly imported `rlm.utils.prompts.RLM_SYSTEM_PROMPT` at
__init__ time, which broke 6 tests that mock only `rlm.RLM` — they
got ModuleNotFoundError: No module named 'rlm.utils' before reaching
the StubRLM. Moving the import inside the
`invocation_mode in ('forced', 'grounded')` branch keeps the default
path (most tests, most prod uses) free of the rlm.utils submodule
dependency, and patches sys.modules in the two mode-specific tests
so they can still verify REPL/FINAL scaffolding is preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants