feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines#7
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an RlmAdapter that uses RLM as the read-side orchestrator (with mempalace_search exposed as a tool), plus CLI plumbing/tests/docs and new jp-realm-v0.1 artifacts to support baseline evaluation runs.
Changes:
- Introduce
sme/adapters/rlm_adapter.pywith tool-call capture intocontext_string/retrieved_entities. - Extend CLI adapter loading + add flags (
--api-key,--kind,--mock/--no-mock,--familiar-timeout) and add/expand adapter test coverage (unit + gated live smoke). - Add jp-realm-v0.1 corpus YAML plus baseline JSON run artifacts and supporting docs/README updates.
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
sme/adapters/rlm_adapter.py |
New RLM-orchestrated adapter with mempalace_search tool registration and capture for scoring. |
sme/cli.py |
Wire new adapters and CLI flags through adapter construction paths (retrieve + cat4/cat5/check/cat9). |
tests/test_rlm_adapter.py |
Unit tests for RlmAdapter tool-call aggregation/reset and error handling. |
sme/adapters/mempalace_daemon.py |
HTTP adapter for palace-daemon (/search + /graph with MCP fallback). |
tests/test_mempalace_daemon_integration.py |
Gated live smoke tests against a running palace-daemon. |
sme/adapters/familiar.py |
Familiar HTTP adapter (eval + graph) with warning/error translation and optional helpers. |
tests/test_familiar_adapter.py |
Unit tests for FamiliarAdapter request/response mapping and error contracts. |
tests/test_familiar_live.py |
Gated live smoke tests against a running familiar instance. |
sme/adapters/_graph_mapping.py |
Shared mapping of daemon /graph payload to Entity/Edge. |
tests/conftest.py |
Adds fake_urlopen_factory for deterministic HTTP mocking in adapter tests. |
tests/fixtures/tiny_questions.yaml |
Small fixture question set for smoke-style runs. |
sme/corpora/jp_realm_v0_1/questions.yaml |
New jp-realm-v0.1 diagnostic corpus (30 questions). |
docs/superpowers/specs/2026-04-26-familiar-adapter-design.md |
Design spec for FamiliarAdapter + CLI/test approach. |
docs/superpowers/specs/2026-04-25-mempalace-daemon-adapter-design.md |
Design spec for MemPalaceDaemonAdapter + CLI/test approach. |
docs/ideas.md |
Records lessons learned + forward-compat notes + RlmAdapter scaffold notes. |
README.md |
Documents fork roadmap and shipped adapters + invocation examples. |
baselines/jp_realm_v0_1_familiar_20260426.json |
Baseline run artifact for jp-realm-v0.1 (familiar). |
baselines/jp_realm_v0_1_familiar_20260426_v2.json |
Baseline run artifact variant (familiar). |
baselines/jp_realm_v0_1_familiar_20260426_v3.json |
Baseline run artifact variant (familiar). |
baselines/jp_realm_v0_1_familiar_20260426_v4_punct_fix.json |
Baseline run artifact variant (familiar). |
baselines/jp_realm_v0_1_familiar_v0.3_candidate_window.json |
Baseline run artifact variant (familiar v0.3). |
baselines/jp_realm_v0_1_familiar_v0.3_reflect.json |
Baseline run artifact variant (familiar v0.3). |
baselines/jp_realm_v0_1_familiar_v0.3_task1_bm25.json |
Baseline run artifact variant (familiar v0.3). |
baselines/jp_realm_v0_1_familiar_v0.3.9_reflect_cumulative.json |
Baseline run artifact variant (familiar v0.3.9). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| params = {"q": query, "limit": str(limit), "kind": self.kind} | ||
| url = f"{self.api_url}/search?" + "&".join(f"{k}={_urlrequest.quote(v)}" for k, v in params.items()) | ||
| req = _urlrequest.Request(url) |
There was a problem hiding this comment.
Verified false positive on Python 3.10+: urllib.request silently re-exports quote from urllib.parse (hasattr(urllib.request, 'quote') returns True; quote.__module__ is 'urllib.parse'). The code works in production — the baselines/jp_realm_v0_1_rlm_*.json artifacts in this PR are evidence. That said, relying on an undocumented re-export is fragile across Python versions, so a cosmetic refactor to urllib.parse.quote is reasonable. Left for a follow-up PR to keep this one focused on substantive fixes.
| try: | ||
| with _urlrequest.urlopen(req, timeout=self.timeout_s) as resp: | ||
| payload = json.loads(resp.read().decode("utf-8")) | ||
| except (_urlerror.URLError, _urlerror.HTTPError, OSError) as e: |
There was a problem hiding this comment.
Deferred — could be a feature decision rather than a bug. Returning [{error: ..., results: []}] to the LM (so it sees the failure in its tool-result payload and can route around it) is the current path; appending an error entry to _capture would put the failure into context_string for SME scoring, which has tradeoffs around interpretability for the LM on the next turn. Will revisit once there's signal on whether SME wants failed-tool-call invocations counted in scoring. Note: binary invocation rate (the headline 9a quantity) is unaffected — _tool_call_count is incremented at the top of _mempalace_search before the try/except in bcb4799, so failed calls are still counted toward 9a even though they don't add drawers to _capture.
| # Strings, not dicts — cli's `'; '.join(path)` expects strings. | ||
| retrieval_path=[ | ||
| f"rlm_completion ({elapsed_ms}ms, {len(self._capture)} tool calls)", | ||
| ], |
| # Drop kwargs RLM doesn't understand. | ||
| for k in ( | ||
| "include_node_tables", | ||
| "include_edge_tables", | ||
| "auto_discover", | ||
| "kg_path", | ||
| "collection_name", | ||
| "default_query_mode", | ||
| "db_path", | ||
| "buffer_pool_size", | ||
| "kind", | ||
| "read_only", | ||
| ): | ||
| kwargs.pop(k, None) | ||
| return RlmAdapter(**kwargs) |
Addresses Copilot review on PR #7: - Track _tool_call_count separately from _capture (which grows once per drawer returned). Previously, retrieval_path reported `len(_capture)` as the tool-call count — a single _mempalace_search call returning 5 drawers showed up as "5 tool calls", skewing any Cat 9a-shaped invocation-rate read of the per-question histogram. - retrieval_path now reports both quantities explicitly: `rlm_completion (Xms, N tool calls, M drawers)`. Binary invocation-rate readings (zero vs non-zero) are unaffected — those were always correct because zero captures means no successful call. Per-question call-count histograms from the baselines/jp_realm_v0_1_rlm_*.json files predate this fix and reflect drawer counts; only the binary read on those JSONs is reliable. - _load_adapter() no longer drops `kind` when constructing RlmAdapter. RlmAdapter.__init__ explicitly accepts `kind` to forward into mempalace_search /search calls; the CLI's --kind flag was previously a silent no-op for the rlm adapter. Deferred (commented in PR review): appending an error entry to _capture when the tool call raises (currently the failure is visible via the returned [{error: ..., results: []}] but doesn't flow into context_string or per-query telemetry). Could be a feature decision rather than a bug — error-as-data has its own tradeoffs around context-string interpretability for the LM.
Addresses Copilot review on PR #7: - Track _tool_call_count separately from _capture (which grows once per drawer returned). Previously, retrieval_path reported `len(_capture)` as the tool-call count — a single _mempalace_search call returning 5 drawers showed up as "5 tool calls", skewing any Cat 9a-shaped invocation-rate read of the per-question histogram. - retrieval_path now reports both quantities explicitly: `rlm_completion (Xms, N tool calls, M drawers)`. Binary invocation-rate readings (zero vs non-zero) are unaffected — those were always correct because zero captures means no successful call. Per-question call-count histograms from the baselines/jp_realm_v0_1_rlm_*.json files predate this fix and reflect drawer counts; only the binary read on those JSONs is reliable. - _load_adapter() no longer drops `kind` when constructing RlmAdapter. RlmAdapter.__init__ explicitly accepts `kind` to forward into mempalace_search /search calls; the CLI's --kind flag was previously a silent no-op for the rlm adapter. Deferred (commented in PR review): appending an error entry to _capture when the tool call raises (currently the failure is visible via the returned [{error: ..., results: []}] but doesn't flow into context_string or per-query telemetry). Could be a feature decision rather than a bug — error-as-data has its own tradeoffs around context-string interpretability for the LM.
M0nkeyFl0wer
left a comment
There was a problem hiding this comment.
Caught a couple of things on local checkout that block merge — both small fixes, all worth landing as a single follow-up commit.
Failures on a fresh clone
1. pytest tests/test_rlm_adapter.py — 5 tests fail with ModuleNotFoundError: No module named 'rlm'.
The tests use unittest.mock.patch.object(rlm, ...) but rlm isn't declared in pyproject.toml. On your fork it presumably resolves via a venv that has it installed; on a fresh clone of this repo it doesn't.
Fix: add rlm to [project.optional-dependencies] (probably an rlm extra, alongside the existing optional groups), and either guard the test module with pytest.importorskip("rlm") at the top or document the install command in the module docstring. The pattern in tests/test_familiar_live.py (gated live smoke) is a reasonable template.
2. ruff check sme/adapters/rlm_adapter.py — 3 E701 errors.
sme/adapters/rlm_adapter.py:257:34: E701 Multiple statements on one line (colon)
sme/adapters/rlm_adapter.py:258:29: E701 Multiple statements on one line (colon)
sme/adapters/rlm_adapter.py:259:29: E701 Multiple statements on one line (colon)
The if r.get("drawer_id"): tags.append(...) pattern at 257-259 needs to be split into proper if-blocks (or rewritten with a list comp). 5 minutes.
Verified what I could
- The bcb4799 / 927493e tool-call counter fix I commented on at issue #3 reads cleanly —
_captureper drawer vs_tool_call_countper invocation are now properly separated, and theretrieval_pathemit format(N tool calls, M drawers)is the right surface. - The 162 LOC adapter itself is well-scoped; design lines up with the issue #3 conversation.
- The 2 RLM baseline JSONs (Qwen-7B, Llama-70B) carry the fixed-format readings — those are good.
Heads-up on a structural thing not blocking this PR
Across PR #5, #6, and #7 we've now landed about 1,085 lines of adapter unit tests that all share the same shape — HTTP-mock fixtures plus deserialization assertions. None of them actually run an SME category against the adapter; they verify wrapper plumbing, not adapter conformance to the SMEAdapter ABC. I'm filing a separate issue to introduce a shared adapter-contract testkit so the next adapter PR can register itself in pytest.parametrize instead of writing 200+ lines of bespoke HTTP mocks.
Not a blocker for this PR — keep your existing tests as the implementation-level reference. The testkit will sit alongside, not replace.
Once the two fixes above land, this is good to merge.
Thanks again for the stack — the RLM adapter is the meaningful new measurement primitive in here (orchestrator-mediated retrieval, the actual Cat 9 surface), and the bcb4799 fix is the kind of self-correction discipline I want to keep encouraging.
Triggered by jphein's PR #7 shipping with 5 ModuleNotFoundError test failures and 3 ruff E701 errors that no automated check caught — the contributor's fork has the deps, the public repo doesn't, and there was no CI to enforce the difference. Workflow runs on push/PR to main, installs the package with [dev] extras, runs ruff check (project config is line-length=100, target=py310) and pytest. Matrix matches pyproject.toml's declared support: 3.10, 3.11, 3.12. Note: this workflow is committed but has not yet executed against GitHub's runners. First execution will be the runtime validation. If anything fails, the fix lands in a follow-up commit before more PRs come through.
…ruff E701
Addresses Ben's CHANGES_REQUESTED on M0nkeyFl0wer/multipass-structural-
memory-eval#7. Two blocking items, fixed in one commit per his ask.
1. tests/test_rlm_adapter.py failed with ModuleNotFoundError: No module
named 'rlm' on a fresh clone. Now:
- pyproject.toml gains an opt-in `rlm` extra pointing at
alexzhang13/rlm (distribution name `rlms`, importable as `rlm`,
requires Python >=3.11). Not in `all` because the Python floor is
higher than SME's >=3.10 baseline.
- The test module guards with `pytest.importorskip("rlm")` so a
fresh clone without the extra installed skips the whole module
cleanly instead of erroring during collection.
2. sme/adapters/rlm_adapter.py:257-259 had three E701 ruff failures
(`if cond: stmt` one-liners). Split into proper if-blocks. Behavior
unchanged — same tag construction.
A third Copilot comment on PR #7 (`_load_adapter()` drops `kind`
kwarg, making `--kind` a no-op for rlm) was already addressed in
7d459c4: `kind` is intentionally NOT in the rlm drop list, with a
comment explaining the contract. No code change needed for that one.
Verified locally:
- ruff check sme/adapters/rlm_adapter.py — clean
- pytest tests/test_rlm_adapter.py — 5/5 pass
- pytest -q — 96 passed, 3 skipped (the one failure is a live-gated
daemon integration test, unrelated to this PR)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed fixup as Blocker 1 — Two halves of the fix:
Blocker 2 — 3 × E701 ruff errors at Split the three Copilot comment 4 ( This one was on an older # Drop kwargs RLM doesn't understand. (RlmAdapter accepts `kind`
# to forward into mempalace_search /search calls, so it is NOT
# in this drop list.)So Verified locally: (One additional test fails when On the testkit heads-up (issue #8): agreed it's the right next move. Codetopo's structural_duplication findings on the per-adapter test files are concrete refactor targets; happy to follow up there once this PR lands. |
The baselines/jp_realm_v0_1_rlm_*.json artifacts on this branch were
produced against the jphein/rlm fork (currently a 3-commit divergence
from alexzhang13/rlm — README header + examples/mempalace_demo.py + a
merge). Code surface in rlm/ is identical today, but pinning to the
fork is the correct call:
- Reproducibility: fresh-clone reproductions of the baselines run
the same code path that produced the evidence on PR #7.
- Provenance: rlm_adapter.py's docstring already says "fork of
alexzhang13/rlm", so the install path matches the doc.
- Future divergence: any SME-relevant patches we land on the fork
flow through to a fresh clone without a pyproject churn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Small correction on top of the previous comment, pushed as The rlm = [
"rlms @ git+https://github.com/jphein/rlm.git",
]Reasoning: the
Local verification re-run: Apologies for the two-step — should have called this on the first push. |
|
Reviewed thoroughly — adapter code, CLI wiring, baseline JSONs, and the cross-validation implications. Everything is solid. One code blocker — README status table — the invocation count for Qwen was listed as The bigger picture — your One non-blocking question from the review: the transport-failure path in API-mode adapters produces an empty-graph report with exit 0. I'm not treating this as a PR-specific bug — it's a design question for the evaluator-honesty framing. Does an empty result from a transport failure look the same as an empty result from a genuine zero-retrieval? If they look identical, we lose the ability to distinguish "the system ran correctly and found nothing" from "the system didn't run at all." Happy to discuss in an issue rather than blocking this PR. An invitation: with the cross-validation harness now in main and all four adapters (flat, mempalace-daemon, familiar, ladybugdb) wired up, there's a natural next step: run the RLM adapter through the LongMemEval question set. That would give us a Category 1 / 2c reading for RLM against a standardized corpus and start building the comparative picture across all six adapters. Let me know if that interests you — happy to coordinate on corpus preparation. Codetopo signal — the structural duplication hits in Thanks for the thorough work — this is exactly the kind of contribution that moves the comparative picture forward. Hope you have a great week. |
Addresses Copilot review on PR #7: - Track _tool_call_count separately from _capture (which grows once per drawer returned). Previously, retrieval_path reported `len(_capture)` as the tool-call count — a single _mempalace_search call returning 5 drawers showed up as "5 tool calls", skewing any Cat 9a-shaped invocation-rate read of the per-question histogram. - retrieval_path now reports both quantities explicitly: `rlm_completion (Xms, N tool calls, M drawers)`. Binary invocation-rate readings (zero vs non-zero) are unaffected — those were always correct because zero captures means no successful call. Per-question call-count histograms from the baselines/jp_realm_v0_1_rlm_*.json files predate this fix and reflect drawer counts; only the binary read on those JSONs is reliable. - _load_adapter() no longer drops `kind` when constructing RlmAdapter. RlmAdapter.__init__ explicitly accepts `kind` to forward into mempalace_search /search calls; the CLI's --kind flag was previously a silent no-op for the rlm adapter. Deferred (commented in PR review): appending an error entry to _capture when the tool call raises (currently the failure is visible via the returned [{error: ..., results: []}] but doesn't flow into context_string or per-query telemetry). Could be a feature decision rather than a bug — error-as-data has its own tradeoffs around context-string interpretability for the LM.
…ruff E701
Addresses Ben's CHANGES_REQUESTED on M0nkeyFl0wer/multipass-structural-
memory-eval#7. Two blocking items, fixed in one commit per his ask.
1. tests/test_rlm_adapter.py failed with ModuleNotFoundError: No module
named 'rlm' on a fresh clone. Now:
- pyproject.toml gains an opt-in `rlm` extra pointing at
alexzhang13/rlm (distribution name `rlms`, importable as `rlm`,
requires Python >=3.11). Not in `all` because the Python floor is
higher than SME's >=3.10 baseline.
- The test module guards with `pytest.importorskip("rlm")` so a
fresh clone without the extra installed skips the whole module
cleanly instead of erroring during collection.
2. sme/adapters/rlm_adapter.py:257-259 had three E701 ruff failures
(`if cond: stmt` one-liners). Split into proper if-blocks. Behavior
unchanged — same tag construction.
A third Copilot comment on PR #7 (`_load_adapter()` drops `kind`
kwarg, making `--kind` a no-op for rlm) was already addressed in
7d459c4: `kind` is intentionally NOT in the rlm drop list, with a
comment explaining the contract. No code change needed for that one.
Verified locally:
- ruff check sme/adapters/rlm_adapter.py — clean
- pytest tests/test_rlm_adapter.py — 5/5 pass
- pytest -q — 96 passed, 3 skipped (the one failure is a live-gated
daemon integration test, unrelated to this PR)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The baselines/jp_realm_v0_1_rlm_*.json artifacts on this branch were
produced against the jphein/rlm fork (currently a 3-commit divergence
from alexzhang13/rlm — README header + examples/mempalace_demo.py + a
merge). Code surface in rlm/ is identical today, but pinning to the
fork is the correct call:
- Reproducibility: fresh-clone reproductions of the baselines run
the same code path that produced the evidence on PR #7.
- Provenance: rlm_adapter.py's docstring already says "fork of
alexzhang13/rlm", so the install path matches the doc.
- Future divergence: any SME-relevant patches we land on the fork
flow through to a fresh clone without a pyproject churn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cat5 accepted --api-url/--api-key/--kind/--mock/--familiar-timeout at the
argparse layer but quietly dropped them before constructing the adapter,
so `sme-eval cat5 --adapter mempalace-daemon --api-url ...` raised
ValueError("MemPalaceDaemonAdapter needs api_url") even with the URL on
the command line.
Refactor cmd_cat5 to use the same _load_adapter_from_args() helper that
cmd_cat4/cmd_check/cmd_cat9 already use. One-line replacement plus 14
lines of bespoke kwargs-building deleted.
Behaviour change: the shared helper drops auto_discover=False from
kwargs (via `if val:`) where the old cat5 code passed it explicitly.
Functionally equivalent because every adapter's auto_discover default
is False anyway, so the resulting constructor call is identical.
Regression coverage in tests/test_cli_adapter_forwarding.py — three
focused tests against the helper and cmd_cat5 directly, covering both
the mempalace-daemon and familiar adapter paths. M0nkeyFl0wer's exact
CLI repro now reaches gap detection.
Closes upstream PR #7 review comment from 2026-05-03.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hey — wow, you have contributed so much to this project. I loved watching the notifications roll in yesterday. The substrate-floor parity hitting R@5=0.9660 byte-identical to upstream, the FT-300 reproduction landing inside published noise, the AGE write-through spike turning up +9pp R@5 over vector-only, the Step 2-forced finding that invocation-discipline only closes the base-model gap when retrieval is wide enough, the four-regime chunking-sensitivity taxonomy you and @nakata-app converged on — this is the most measurement-dense two weeks the project has had. Substrate-floor calibration against a published number is exactly the cross-validation #9 was asking for; the 9a sub-test proposal in #3 is the right shape and supported by data I didn't expect this fork to produce this fast. Substantive replies on #9 and #3 are coming in their own threads. To get CI green — the May-17 re-run turned up 17 ruff errors, 16 of which are auto-fixable with Ruff version drift is on me, not you. `pyproject.toml` pins `ruff>=0.5` unbounded; CI pulled `0.15.13` which is stricter than what was running when this PR opened. I'll land a small upper-pin in a follow-up so this can't bite long-lived PRs again. One forward-looking note, not a complaint about this PR — the work since the May-3 review is substantial enough that it would have been easier for me to review well as 3-4 smaller PRs (postgres+AGE adapter as one; LongMemEval substrate-floor + FT-300 as another; AGE write-through spike + git-probes corpus as a third; AGENTS.md/Gemini config separately). The structural improvements that land in this PR are genuinely valuable — that's exactly why I'd rather review each thread on its own merits next time. For #7 I'll merge as-is once CI is green; for the next batch of work, can we aim for one structural thread per PR? My review cadence will be tighter on smaller PRs. |
jphein
left a comment
There was a problem hiding this comment.
@M0nkeyFl0wer — both merge-blockers from your 2026-05-01 review are resolved.
Fix 1 — rlm dependency + test guard: 6fb6925 adds [rlm] to [project.optional-dependencies] pointing at rlms @ git+https://github.com/jphein/rlm.git, and pytest.importorskip("rlm") at the top of test_rlm_adapter.py. Module docstring documents pip install -e ".[rlm]". Fresh-clone verified: without the extra, 9 tests skip cleanly; with it, 9/9 pass.
Fix 2 — ruff E701: Same commit (6fb6925) splits the three one-liner if r.get(...): tags.append(...) patterns into proper if-blocks. ruff check sme/adapters/rlm_adapter.py passes clean as of today (2026-05-23).
Follow-up commit 348c714 pins the extra to jphein/rlm specifically (not upstream alexzhang13/rlm) since the baseline artifacts were produced against this fork.
Scope growth caveat. Since your review this branch has grown from the original adapter scope to 40 commits / 68 files. The additional work is the full Cat 9a investigation chain: Step 1 retrieval-breadth probe (8-cell table), Step 2-forced invocation decomposition, mempalace-git-probes-v2 200-question corpus, LME substrate-floor parity experiments, AGE write-through spike, and associated baselines/docs. All of it feeds back into M0nkeyFl0wer/multipass-structural-memory-eval#3 and M0nkeyFl0wer/multipass-structural-memory-eval#9.
The security fixes that were in this branch (hardcoded DSN, Cypher escaping) have already landed in your main via the techempower-org/multipass-structural-memory-eval#16 upstream sync, so those commits are duplicates in the diff.
Happy to do whichever you prefer:
- Rebase + squash the experimental chain into thematic commits and force-push to reduce the diff
- Split into the original RlmAdapter PR (your reviewed scope) + a separate PR for the Cat 9a experimental corpus
- Merge as-is if the 40-commit history is acceptable given it's all one investigation thread
Your call — the fixes you blocked on are ready either way. 🫏
…yFl0wer#19) The ABC default returning [] was added in 3754e8a as part of Cat 9 scaffolding, but the upstream issue tracks two follow-throughs that were still missing: - Module docstring + class docstring in `base.py` claimed "Two optional" methods; promote to "Three optional" so readers see `get_harness_manifest` as part of the contract. - Spec § Adapter Interface (sme_spec_v8.md) listed only `get_flat_retrieval` and `get_ontology_source` as optional. Add `get_harness_manifest` alongside them with its default-list rationale. Adds tests/test_adapter_harness_manifest_contract.py: every shipped adapter must resolve `get_harness_manifest` (either inherited or overridden) and the ABC default must return []. Class-level checks avoid instantiating adapters with heavy constructors. Closes the silent-AttributeError class entirely for any future adapter that forgets to override.
…Fl0wer#8) Parametric test module verifying adapters conform to the SMEAdapter ABC: query returns QueryResult, graph snapshot is internally consistent, ingest accepts corpus shape. Covers FlatBaselineAdapter and a MockAdapter; other adapters opt in by registering a factory.
…keyFl0wer#18) Three new test files for measurement-critical modules that had zero direct tests. Covers hop-bucket grouping, Cat 8 sub-tests, and the shared graph projection helper. Prevents silent regressions in published category readings.
Remove HindsightAdapter, Mem0Adapter, OmegaAdapter, RlmAdapter imports since those adapter files don't exist in upstream repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Builds the scaffold for benchmarking RLM (jphein/rlm fork of
alexzhang13/rlm) against familiar's deterministic pipeline on the
jp-realm-v0.1 corpus.
How it satisfies the SME contract:
- palace-daemon /search is exposed to RLM via custom_tools as
`mempalace_search`. Every call's results flow through a
per-query capture buffer.
- QueryResult.context_string is the concatenation of every drawer
text RLM pulled during the run, in call order. The Cat 1 /
retrieve substring scorer sees exactly what RLM saw.
- retrieved_entities mirror captures so Cat 7 token counts and
Cat 8 hop counts have something real (Cat 8 returns empty —
RLM doesn't maintain a graph view).
- ingest_corpus is a no-op (RLM consumes a palace it didn't
author; ingestion happens upstream via mempalace mine).
CLI: --adapter rlm available. Same kwargs-drop pattern as the
other adapters. Backend defaults to portkey @openai/gpt-5-nano
when PORTKEY_API_KEY is set; falls through to direct OpenAI/
Anthropic per RLM's own backend resolution.
Tests (5, all pass) cover the SME contract without burning API
credits — a live benchmark belongs in baselines/, not unit tests.
This is research scaffolding. The actual A/B vs familiar (which
the v0.3 brainstorm flagged as the lever for deciding v0.4
productionization) needs an API key + a separate run that produces
a baseline JSON. Documented in docs/ideas.md.
…ard, cat5 forwarding, invocation_mode Squashes the following fixes onto the initial RlmAdapter commit: - Separate per-call tool counter from per-result drawer accumulator - Add [rlm] optional extra to pyproject.toml (pytest.importorskip guard) - Point [rlm] extra at jphein/rlm fork for baseline reproducibility - Fix cmd_cat5 to use _load_adapter_from_args() — resolves M0nkeyFl0wer review blocker (--api-url/--api-key/--kind silently dropped) - Add source_file preservation + invocation_mode for Cat 9a Step 2 - Fix gap_detection variable shadowing (considered → seeded_applicable) - Add rlm registry forwarding test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Rebased onto #30 (allowlist registry) — both review blockers from the May 3 review are resolved:
Also squashed: tool-call counter fix, Please merge #30 first — once it lands, this PR's diff drops to just the RlmAdapter-specific commits. 🫏 |
Extracts the substantive enhancements from upstream PR #7 that aren't already on fork main via the PR #24 bundle. Five focused changes: 1. invocation_mode="forced": prepend a system-prompt directive requiring at least one mempalace_search call before FINAL. Tests whether the invocation-rate ceiling (gemma4: 60% zero-call) is the dominant Cat 9a lever on substring-shaped corpora. 2. invocation_mode="grounded": prepend a directive requiring the answer to quote at least one retrieved source filename. Tests whether the substring-scorer-vs-LLM-synthesis gap is the lever on file-shaped corpora (n=200 git-derived probes). 3. source_file round-trip through trimmed dict and context_string — the substring scorer matches against filenames from expected_sources; dropping source_file meant correct retrieval silently scored 0. 4. ingest_corpus returns the full SMEAdapter contract dict (errors/warnings empty lists) so harness code doesn't KeyError. 5. RLM kwargs restructured to a dict so custom_system_prompt is only set when needed — keeps default behavior unchanged. Four new tests cover source_file preservation and the three invocation_mode branches (forced / grounded / default). The fork-only adapters (Hindsight, Mem0, Omega) and the LongMemEval script that PR #7 deletes are preserved here — this branch carries only the RLM-specific deltas, not the upstream-targeted cleanups. Closes M0nkeyFl0wer#7 (for the fork's purposes — upstream PR remains pending Ben's re-review). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… 2 (#29) * feat(rlm): invocation_mode + source_file preservation for Cat 9a Step 2 Extracts the substantive enhancements from upstream PR #7 that aren't already on fork main via the PR #24 bundle. Five focused changes: 1. invocation_mode="forced": prepend a system-prompt directive requiring at least one mempalace_search call before FINAL. Tests whether the invocation-rate ceiling (gemma4: 60% zero-call) is the dominant Cat 9a lever on substring-shaped corpora. 2. invocation_mode="grounded": prepend a directive requiring the answer to quote at least one retrieved source filename. Tests whether the substring-scorer-vs-LLM-synthesis gap is the lever on file-shaped corpora (n=200 git-derived probes). 3. source_file round-trip through trimmed dict and context_string — the substring scorer matches against filenames from expected_sources; dropping source_file meant correct retrieval silently scored 0. 4. ingest_corpus returns the full SMEAdapter contract dict (errors/warnings empty lists) so harness code doesn't KeyError. 5. RLM kwargs restructured to a dict so custom_system_prompt is only set when needed — keeps default behavior unchanged. Four new tests cover source_file preservation and the three invocation_mode branches (forced / grounded / default). The fork-only adapters (Hindsight, Mem0, Omega) and the LongMemEval script that PR #7 deletes are preserved here — this branch carries only the RLM-specific deltas, not the upstream-targeted cleanups. Closes M0nkeyFl0wer#7 (for the fork's purposes — upstream PR remains pending Ben's re-review). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rlm-adapter): lazy-load RLM_SYSTEM_PROMPT inside invocation_mode branch The adapter eagerly imported `rlm.utils.prompts.RLM_SYSTEM_PROMPT` at __init__ time, which broke 6 tests that mock only `rlm.RLM` — they got ModuleNotFoundError: No module named 'rlm.utils' before reaching the StubRLM. Moving the import inside the `invocation_mode in ('forced', 'grounded')` branch keeps the default path (most tests, most prod uses) free of the rlm.utils submodule dependency, and patches sys.modules in the two mode-specific tests so they can still verify REPL/FINAL scaffolding is preserved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
RlmAdaptertreats RLM (a fork of alexzhang13/rlm) as the read-side orchestrator rather than a standalone retrieval pipeline. The LLM decides when and how to query the palace-daemon — SME evaluates what comes back.Stacked on #30 — please merge that PR first (allowlist adapter registry + test coverage). Once #30 lands, this PR's diff collapses to just the RlmAdapter-specific commits.
Changes
sme/adapters/rlm_adapter.py— Full adapter: query via RLM tool-calling loop,get_graph_snapshot()via palace-daemon/graph,ingest_corpus()no-op (palace is pre-populated),source_filepreservation +invocation_modefor Cat 9a Step 2 measurementtests/test_rlm_adapter.py— 16 tests withpytest.importorskip("rlm")guard (no CI dependency on the rlm package)pyproject.toml—[rlm]optional extra pointing atjphein/rlmfork for baseline reproducibilitysme/cli.py—rlmentry in_AdapterSpecregistry +cmd_cat5fix to use_load_adapter_from_args()(resolves @M0nkeyFl0wer's May 3 review blocker)sme/categories/gap_detection.py— Fix variable shadowing (considered→seeded_applicable)tests/test_gap_detection.py— 37 tests for gap detection scoringtests/test_cli_adapter_forwarding.py— rlm registry forwarding testReview blockers resolved
pytest tests/test_rlm_adapter.pyfails on fresh clones →importorskip+[rlm]extracat5 --adapter mempalace-daemon --api-url ...silently drops API args →_load_adapter_from_args()_tool_callscounter