feat(eval): FamiliarAdapter + jp-realm-v0.1 corpus + first live readings by jphein · Pull Request #6 · M0nkeyFl0wer/multipass-structural-memory-eval

jphein · 2026-04-30T17:06:46Z

Second in a three-PR stack. Stacked on feat/mempalace-daemon-adapter — please merge that one first; this PR's diff against main will collapse to just the new commits afterward.

Summary

Three logically-grouped pieces ride together because they're tightly coupled — the corpus was authored to exercise both adapters, and the lessons doc is the writeup that came out of running them:

FamiliarAdapter — read-side adapter for familiar.realm.watch, a deterministic retrieve→rerank→decay→compress pipeline that wraps a palace. Pairs cleanly with MemPalaceDaemonAdapter for A/B work.
jp-realm-v0.1 corpus — 30-question hand-authored corpus (27 single-hop, 3 two-hop) covering JP's homelab + project knowledge. Authoring contract follows the spec's existing per-corpus convention; lives at sme/corpora/jp_realm_v0_1/questions.yaml.
First live readings — head-to-head familiar vs daemon baselines, plus the v0.3 reflect-pipeline iterations that came out of running them. docs/ideas.md gains a "Lessons from the first live eval" section synthesizing what surfaced.

Design + plan docs in the PR

docs/superpowers/specs/2026-04-26-familiar-adapter-design.md
docs/superpowers/plans/2026-04-26-familiar-adapter.md

Notable refactor

refactor(adapters): extract /graph payload mapping into shared module (commit 4f213db in this PR) extracts the /graph snapshot mapping that both FamiliarAdapter and MemPalaceDaemonAdapter need, into sme/adapters/_graph_mapping.py. The daemon adapter is updated to use the shared module in this PR rather than the prior PR — keeps the prior PR scoped to "introduce the daemon adapter" and the refactor scoped to "share the part that's now duplicated."

Lessons doc highlights

The new section in docs/ideas.md captures the diagnostic patterns that fell out of the first run — including the system-gap-vs-pipeline-gap split that the q12/q13 case study illustrates (writing the missing drawers via palace-daemon flipped q12 from 0.0 to 1.0 — palace gap; left q13 at 0.0 — pipeline gap). This is the diagnostic the framework was built to provide; nice to see it land cleanly with the simplest possible scoring.

Testing

tests/test_familiar_adapter.py — 26 tests covering query happy/error paths, graph snapshot, ingest stub, mock/no-mock modes, timeout behavior.
tests/test_familiar_live.py — gated live smoke (requires FAMILIAR_LIVE=1), skipped in CI.
All daemon tests from PR 1 still pass after the shared-module refactor.

Test plan

python -m pytest tests/test_familiar_adapter.py tests/test_mempalace_daemon_adapter.py -v — 54 tests pass (26 familiar + 28 daemon)
sme-eval retrieve --adapter familiar --questions sme/corpora/jp_realm_v0_1/questions.yaml --familiar-url $URL — CLI smoke against a running familiar
cat baselines/jp_realm_v0_1_familiar_*.json | jq '.summary.mean_recall' — eyeball the recall progression across the v0.2/v0.3 baselines

The 19 commits split into design/plan, shared-module refactor, scaffold, query, graph snapshot, CLI wiring, then corpus authoring and the v0.2/v0.3 baseline iterations + lessons doc in roughly that order. Happy to walk through any specific commit if useful.

The shipped sme/adapters/mempalace.py opens ChromaDB directly via chromadb.PersistentClient. That's correct for upstream MemPalace single-process installs but violates the daemon-strict invariant of the jphein/palace-daemon + jphein/mempalace fork architecture: two PersistentClient instances against the same palace = the multi-writer corruption scenario the daemon was built to prevent. Documenting a planned mempalace-daemon adapter that: - talks HTTP/MCP only (no filesystem access, no chromadb import) - assembles the graph snapshot from /mcp tool calls (mempalace_list_wings, mempalace_list_rooms, mempalace_list_tunnels, mempalace_kg_query) - routes query() through palace-daemon's GET /search with the new kind= parameter (default kind=content excludes Stop-hook auto-save checkpoints — validated 2026-04-25 against the canonical 151K palace, filter excludes ~637 of 151,448 drawers but ~80%+ of search results before filter) Why it matters: SME's Cat 9 (The Handshake) measures integration-under-production-model — exactly the slice engram-2's "0.984 R@5 vs 17% E2E QA" critique targets. Running Cat 9 through the daemon adapter, A/B before/after applying the kind= filter, would let SME's verdict replace our hand-rolled benchmark and generate publishable data on whether the fix moves the needle. Multi-hour build, not yet started; this entry is the placeholder so the work doesn't drift. Lives on main only — kept off the feat/cat9-9b-scaffolding upstream PR branch per the "fork README handling" pattern (PR branches must carry upstream's README so fork-only content doesn't leak upstream). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Design for a fork-side SMEAdapter that talks HTTP/MCP to a running palace-daemon, plus a planned /graph endpoint to make graph snapshots fast. Adapter ships independently against daemon 1.5.1 via MCP-tool fallback; /graph is coordinated daemon-side work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

12 tasks following TDD: scaffolding, adapter (auth → query happy path → query errors → /graph fast path → MCP fallback → ontology), CLI plumbing, gated integration smoke, end-to-end CLI smoke, README. Daemon-side /graph endpoint explicitly out-of-scope; coordinated separately with JP's active palace-daemon work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three-source auth resolution: kwargs > env file > process env. Raises ValueError at construction if nothing resolves. query() and get_graph_snapshot() raise NotImplementedError pending follow-on tasks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GET /search builds context_string in the same format as the existing MemPalaceAdapter so tiktoken counts stay comparable across adapters. retrieval_path captures kind + scope counts for downstream scoring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Covers WARN-with-results soft signal, NO_RESULTS, AUTH, HTTP 5xx, connection failures, and X-API-Key header propagation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GET /graph in one call; projects to wings, rooms, member_of, tunnel, and KG layers. Mirrors the existing direct adapter's projection minus drawer-level surface (impractical to fetch at scale over HTTP). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When /graph is unavailable (older daemons or prefer_graph_endpoint=False), walk mempalace_list_wings, mempalace_list_rooms per wing, and mempalace_list_tunnels via POST /mcp. Per-wing failures degrade to a partial snapshot rather than aborting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Returns the same readme-typed ontology as the existing MemPalaceAdapter so Cat 8 readings are comparable across the direct-ChromaDB and HTTP paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the mempalace-daemon branch to _load_adapter and threads --api-key / --kind through _load_adapter_from_args, _add_db_or_api_args, and the retrieve subparser. Existing mempalace and flat adapters drop the new kwargs cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Skipped when PALACE_DAEMON_URL is unset so CI without a daemon stays green. Asserts query() returns a QueryResult, snapshot returns >=1 wing, and kind=all >= kind=content for total_before_filter (validates the README's filter claim). Runs in ~29s against the 151K-drawer palace via the MCP fallback path; will drop to <5s once the daemon ships GET /graph. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the fork-roadmap planning subsection with the wired-endpoint description, auth resolution rules, and invocation example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Moved conftest imports to the top of the module, dropped unused os/Path/urllib.error imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The earlier total_before_filter comparison conflated metadata math with filter behaviour. The reliable signal is whether 'CHECKPOINT:' strings appear in the result text — kind='content' must produce strictly fewer than kind='all'. Skips when the live palace has no checkpoints to test against. Discovered against palace-daemon 1.6.0 with 151K drawers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the mempalace-daemon adapter that shipped 2026-04-25. Familiar is the layer above palace-daemon (rerank + decay + compress + grounding); this adapter measures that pipeline's contribution to retrieval quality on top of the palace itself. Cat 9 implementation deferred — but the adapter captures all the HandshakeTrace-shaped data Cat 9 will eventually consume, so the per-question JSON output is forward-compatible with the future sme/categories/handshake.py scorer.

12 TDD tasks, 19 unit tests + 2 gated live tests, 1562 lines. Mirrors the conventions established by the mempalace-daemon adapter (commits 4f5eba3..da21708): refactor common /graph mapping first, then scaffold-test-implement-test-commit per task. Spec: docs/superpowers/specs/2026-04-26-familiar-adapter-design.md

Both MemPalaceDaemonAdapter and the upcoming FamiliarAdapter consume the same /graph payload shape (familiar's /api/familiar/graph is a verbatim proxy of palace-daemon's /graph). Extract project_graph into sme/adapters/_graph_mapping.py so both adapters share the implementation and stay in lock-step if the upstream shape evolves. mempalace_daemon.py: 519 -> 408 lines. Existing 28 mempalace_daemon tests pass unchanged.

Class structure + constructor + base URL trim + Mode-B ingest stub. query() and get_graph_snapshot() raise NotImplementedError; subsequent tasks fill them in. Mirrors MemPalaceDaemonAdapter conventions (base_url, timeout_s, opener for test injection).

POSTs to /api/familiar/eval, deserializes the SME-shape response into QueryResult. mock_inference=True default forwarded as 'mock' field so Cat 1 substring scoring stays deterministic. Error contract (no-raise per SMEAdapter): HTTP 4xx/5xx -> error='familiar endpoint <code>: <body[:200]>' socket.timeout -> error='familiar timeout after <s>s ...' URLError -> error='familiar connection failed: <reason>' invalid JSON -> error='familiar invalid JSON response: ...' unexpected -> error='familiar unexpected: <type>: ...' Warnings translation: server_error + warnings -> '<server> | WARN: <warnings joined>' warnings only -> 'WARN: <warnings joined>' no signals -> error=None Soft warnings (low_confidence, filtered_null_text_*, stuck_loop) and hard warnings (palace_unreachable) both surface with WARN: prefix; data fields still populated so SME's per-question scoring distinguishes 'warned-but-answered' from 'errored'. 13 new tests; 16 passing total.

- get_graph_snapshot: GET /api/familiar/graph (familiar's verbatim proxy of palace-daemon's /graph with a 5-min cache layer) and runs the response through the shared _graph_mapping.project_graph helper. Failure returns ([], []) per the SME contract — a missing snapshot zeroes out structural-category scores rather than raising. - get_flat_retrieval: same eval endpoint, entities only (Cat 7). - get_ontology_source: 'declared' — wings/rooms are author-defined. - get_harness_manifest: forward-compat Cat 9 — returns HTTP-tool + MCP descriptors when sme.harness is importable, else [] for graceful behavior on multipass versions before Cat 9 ships. 7 new tests; 23 passing total. Daemon adapter (28 tests) still green since the shared graph mapping module is unchanged.

…meout Adds the familiar branch to _load_adapter (mirrors mempalace-daemon's branch shape). Drops daemon/ladybug-specific kwargs; remaps --api-url to FamiliarAdapter's base_url constructor arg. Adds mutually-exclusive --mock/--no-mock flags (default --mock for Cat 1 substring-scoring determinism) and --familiar-timeout for tuning per-question wait. CLI integration tests cover the dispatch path: instantiation, mock_inference default, --no-mock pass-through. 3 new tests; 26 familiar + 28 daemon = 54 total.

README: parallel section to the mempalace-daemon adapter, framing familiar as 'palace + v0.2 pipeline' so the score delta vs the daemon adapter captures what the v0.2 work is worth. ideas.md: notes that FamiliarAdapter's per-question records are forward-compatible with the eventual sme/categories/handshake.py scorer — Cat 9 implementation can land later without revisiting the adapter API.

Set FAMILIAR_BASE_URL to run; otherwise skipped. Asserts the QueryResult contract holds end-to-end and graph_snapshot returns lists. Tolerates palace HNSW rebuild state via skip+warn handling — verified 2026-04-26 against https://familiar.jphe.in/ (palace mid-rebuild, query took 17s and adapter correctly translated daemon timeout into WARN: error string).

- tests/test_familiar_adapter.py: drop unused pytest import (ruff F401). - sme/cli.py: --adapter help now lists familiar in the choices. 54/54 adapter tests still green; ruff clean on all files added in this PR (sme/adapters/familiar.py, sme/adapters/_graph_mapping.py, tests/test_familiar_adapter.py, tests/test_familiar_live.py).

Diagnostic Mode-B corpus targeting content known to exist in JP's live palace based on session context, CLAUDE.md, and trace logs. 30 questions across 6 thematic groups: realm.watch ecosystem, mempalace+palace-daemon, surveyed forks, tools/workflow, infra, specific incidents/decisions. expected_sources substrings will match against the adapter's context_string. Run when palace HNSW rebuild completes: sme-eval retrieve --adapter familiar --api-url https://familiar.jphe.in \ --questions sme/corpora/jp_realm_v0_1/questions.yaml --mock --json familiar.json Compare against same corpus through --adapter mempalace-daemon to measure familiar v0.2 pipeline's retrieval delta.

The retrieve subcommand always sets read_only=True in adapter_kwargs (meaningful for ChromaDB / mempalace-daemon), but FamiliarAdapter takes only base_url/timeout_s/mock_inference/opener and rejects unknown kwargs with TypeError. Add read_only to the familiar drop list so `sme-eval retrieve --adapter familiar ...` works.

familiar v0.2 pipeline: 73.33% recall, 17/30 full, 27/30 hit-rate mempalace-daemon raw: 70.00% recall, 15/30 full, 27/30 hit-rate delta: +3.33pp recall, +2 questions; familiar never loses (28 ties). Pipeline contribution surfaces on: - q04_os_realm_watch (rerank lifts os.realm into top-K) - q16_speech_to_cli (decay weight prefers fresh speech-to-cli notes) 3 honest gaps in palace (recall=0.0 in both): q08 kind-filter (internal API detail), q12 rlm, q13 GraphPalace — fork-survey topics synthesized in chat but never written to palace. Run: PALACE_API_KEY=$(grep PALACE_DAEMON_API_KEY ~/.config/palace-daemon/env | cut -d= -f2) \\ venv/bin/sme-eval retrieve --adapter mempalace-daemon --api-url http://disks:8085 \\ --questions sme/corpora/jp_realm_v0_1/questions.yaml --json out.json

…arser Earlier work added these flags to _add_db_or_api_args (used by cat4, cat5, check) but the retrieve subcommand inlines its own arg block and never picked them up. Result: `sme-eval retrieve --mock ...` errored with "unrecognized arguments: --mock". Add the same arg group + kwarg passthrough to retrieve so all FamiliarAdapter knobs are accessible from the retrieve path — the only path that actually uses them today.

v2 (clean): 73.33% recall 17 full hits 27/30 hit-rate v3 (+3 missing drawers): 76.67% recall 18 full hits 28/30 hit-rate v4 (+ punct fix): 76.67% recall 17 full hits 29/30 hit-rate v3 added drawers for q12 (rlm), q13 (GraphPalace), q08 (kind filter) written via palace-daemon /memory POST. Only q12 surfaced (q13 and q08 missed due to embedding ranking). v4 added trailing-punctuation strip in palace-client (familiar commit b676852). q13 flipped 0.0 → 0.5 (now finds GraphPalace); q15 regressed 1.0 → 0.5 (one substring lost from rerank shift). Net: hit-rate up, full-recall flat — punctuation normalization is a useful retrieval tuning lever but ranking is sensitive. Last hard miss (q08 kind_filter) is a true semantic-distance issue: jargon-heavy drawer text doesn't embed close to a conversational question even though both contain the same keywords literally. Vector retrieval limitation, not a familiar bug.

Captures findings that should shape adapter + category work: - substring scoring on context_string is the right MVP - corpus runs distinguish system gaps from pipeline gaps - corpora must be cheap to re-run (ratchet, not benchmark) - retrieve subcommand needed CLI plumbing fixes the helpers hid - per-corpus directories scale cleanly to multi-corpus future

Two attempts to lift retrieval recall on jp-realm-v0.1, both reverted upstream after measuring: 1. jp_realm_v0_1_familiar_v0.3_task1_bm25.json familiar adds cosine+bm25 blend in domainRerank baseScore. Net effect: zero (76.67% → 76.67%). The daemon's PalaceDrawer response uses field name 'bm25_score' not 'bm25', so the blend condition never fired in production. Even when fixed, the mempalace fork ALREADY does identical 0.6/0.4 hybrid in _hybrid_rank — familiar's blend would have been redundant. 2. jp_realm_v0_1_familiar_v0.3_candidate_window.json familiar fetches candidateLimit=20 from daemon then slices to retrievalLimit=5 after rerank/decay/compress. Net REGRESSION (76.67% → 71.67%, hit-rate 29 → 26 of 30). The wider candidate window let wing-match boost (×1.4) elevate same-wing-but-irrelevant drawers above different-wing correct answers — 4 questions regressed (q07, q18, q21, q28) vs 3 improved (q15, q17, q29). Both reverted in familiar.realm.watch (eb0b640 reverted by d2bf8a0; 55d8889 reverted by f4f2cce). Lesson: read-side reranking has hit diminishing returns; the q08-class miss is upstream of familiar. v0.3 attention should shift to the reflect loop (write-side) per the spec's framing.

Mean recall: 76.67% -> 78.33% (+1.67pp) Full hits: 17/30 -> 18/30 Hit-rate: 29/30 -> 29/30 (unchanged) q14_hermes_agent flipped 0.50 -> 1.00 because the reflect smoke turn wrote a drawer about hermes-agent that now surfaces 'agent' in retrieval. Mechanism proven; v0.4's automatic per-session reflect should compound this across many turns. The bm25-blend and candidate-window attempts (both reverted) moved zero questions; reflect moved one. Confirms the spec's framing — write-side gaps are the real lever.

…lative After a full day of real chat usage with auto-reflect enabled (every ≥80-char assistant turn), recall on jp-realm-v0.1 measures: v0.2.1 (entering v0.3): 76.67% recall, 17 full v0.3.0 (single smoke turn): 78.33% recall, 18 full v0.3.9 (a day later): 78.33% recall, 18 full ← no further movement What this tells us: reflect IS writing drawers (memories panel showed 5+ from real chats during the day), but the 30-question jp-realm-v0.1 corpus asks about topics palace already had coverage for. Reflect's contribution lives in topics OUTSIDE the frozen corpus. Implication for v0.4: cumulative reflect impact needs a different measurement vehicle — a reflect-specific corpus, or longitudinal recall on a corpus that grows as JP introduces new topics. Filing this as a v0.4 candidate (corpus authoring, not pipeline change).

Copilot

Pull request overview

Adds two new read-side adapters (mempalace-daemon, familiar) to run SME retrieval + graph-snapshot evaluations against live HTTP services, plus a new JP-specific diagnostic corpus and first baseline artifacts / lessons writeup.

Changes:

Introduce MemPalaceDaemonAdapter (palace-daemon HTTP) and FamiliarAdapter (familiar.realm.watch HTTP) with shared /graph payload projection.
Wire new adapters + flags into the CLI and add extensive mocked + gated-live tests.
Add the jp-realm-v0.1 question corpus plus baseline JSON outputs and documentation updates.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/test_mempalace_daemon_integration.py	Gated live smoke tests for the daemon adapter.
tests/test_mempalace_daemon_adapter.py	Mocked unit tests for daemon adapter query + graph snapshot (fast-path + MCP fallback).
tests/test_familiar_live.py	Gated live smoke tests for FamiliarAdapter.
tests/test_familiar_adapter.py	Mocked unit tests for FamiliarAdapter behavior and CLI loading.
tests/fixtures/tiny_questions.yaml	Minimal smoke corpus fixture.
tests/conftest.py	Adds `fake_urlopen_factory` + `_FakeResponse` for HTTP-mocking urllib adapters.
sme/corpora/jp_realm_v0_1/questions.yaml	Adds the jp-realm-v0.1 diagnostic question corpus.
sme/cli.py	Adds adapter loading branches + new flags (`--api-key`, `--kind`, `--mock/--no-mock`, `--familiar-timeout`) and threads them into `retrieve`.
sme/adapters/mempalace_daemon.py	Implements palace-daemon HTTP adapter (`/search`, `/graph`, MCP fallback).
sme/adapters/familiar.py	Implements familiar HTTP adapter (`/api/familiar/eval`, `/api/familiar/graph`) and payload deserialization.
sme/adapters/_graph_mapping.py	Shared projection from `/graph` payload to `(Entity, Edge)` lists.
docs/superpowers/specs/2026-04-26-familiar-adapter-design.md	Design spec for FamiliarAdapter.
docs/superpowers/specs/2026-04-25-mempalace-daemon-adapter-design.md	Design spec for MemPalaceDaemonAdapter.
docs/ideas.md	Adds “Lessons from the first live eval” section and adapter forward-compat notes.
baselines/jp_realm_v0_1_familiar_v0.3_task1_bm25.json	Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_v0.3_reflect.json	Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_v0.3_candidate_window.json	Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_v0.3.9_reflect_cumulative.json	Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v4_punct_fix.json	Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v3.json	Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v2.json	Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_20260426.json	Baseline output artifact (familiar).
README.md	Documents fork roadmap and usage for the new adapters/flags.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-30T17:11:00Z

+    not os.environ.get("PALACE_DAEMON_URL"),
+    reason="needs a running palace-daemon; set PALACE_DAEMON_URL to enable",


The skip condition only checks PALACE_DAEMON_URL, but the fixture constructs MemPalaceDaemonAdapter() which also requires PALACE_API_KEY (or an env_file). If a developer sets the URL but forgets the key, this will fail the test instead of skipping. Consider extending the skipif to require both PALACE_DAEMON_URL and PALACE_API_KEY (or otherwise skipping with a clearer reason).

Suggested change

not os.environ.get("PALACE_DAEMON_URL"),

reason="needs a running palace-daemon; set PALACE_DAEMON_URL to enable",

not os.environ.get("PALACE_DAEMON_URL")

or not os.environ.get("PALACE_API_KEY"),

reason=(

"needs a running palace-daemon; set PALACE_DAEMON_URL and "

"PALACE_API_KEY to enable"

),

Copilot · 2026-04-30T17:11:01Z

+    def query(self, question: str) -> QueryResult:
+        """POST /api/familiar/eval and deserialize the SME-shape response."""
+        body = {
+            "query": question[:250],
+            "limit": 5,
+            "kind": "content",
+            "mock": self.mock_inference,
+        }
+        status, payload = self._post_json("/api/familiar/eval", body)


FamiliarAdapter.query() hard-codes "limit": 5 and does not accept an n_results kwarg, so the CLI's --n-results setting is silently ignored for this adapter (cmd_retrieve falls back after a TypeError). Consider adding an optional n_results parameter (and threading it through as the request's "limit") so familiar runs are comparable at different K values and the CLI flag behaves consistently across adapters.

Copilot · 2026-04-30T17:11:01Z

+    def get_flat_retrieval(self, question: str, k: int = 5) -> list[Entity]:
+        """Optional: same path as query() but returns only the entities,
+        used by Cat 7 token-budget analysis where the answer text is
+        irrelevant and only retrieval changes."""
+        body = {
+            "query": question[:250],
+            "limit": k,
+            "kind": "content",
+            "mock": self.mock_inference,
+        }
+        status, payload = self._post_json("/api/familiar/eval", body)
+        if status != 200 or not isinstance(payload, dict):
+            return []
+        raw_entities = payload.get("retrieved_entities") or []
+        return [self._entity_from_payload(e) for e in raw_entities]


FamiliarAdapter.get_flat_retrieval() doesn't match the SMEAdapter optional-method contract: SMEAdapter.get_flat_retrieval returns a QueryResult, but this implementation returns list[Entity] and adds a nonstandard k parameter. This will break any generic Cat 7 / tooling that expects a QueryResult shape from get_flat_retrieval. Suggest returning a QueryResult (with context_string/answer empty if desired) and using the same n_results kwarg convention as other adapters.

Suggested change

def get_flat_retrieval(self, question: str, k: int = 5) -> list[Entity]:

"""Optional: same path as query() but returns only the entities,

used by Cat 7 token-budget analysis where the answer text is

irrelevant and only retrieval changes."""

body = {

"query": question[:250],

"limit": k,

"kind": "content",

"mock": self.mock_inference,

}

status, payload = self._post_json("/api/familiar/eval", body)

if status != 200 or not isinstance(payload, dict):

return []

raw_entities = payload.get("retrieved_entities") or []

return [self._entity_from_payload(e) for e in raw_entities]

def get_flat_retrieval(self, question: str, n_results: int = 5) -> QueryResult:

"""Optional: same path as query() but returns the standard

QueryResult shape so generic SME tooling can consume retrieval-only

calls through the base adapter contract."""

body = {

"query": question[:250],

"limit": n_results,

"kind": "content",

"mock": self.mock_inference,

}

status, payload = self._post_json("/api/familiar/eval", body)

if status == -1:

msg = payload.get("_network_error", "unknown network error") if isinstance(payload, dict) else "unknown error"

return QueryResult(

answer="",

context_string="",

retrieved_entities=[],

retrieved_edges=[],

error=f"familiar {msg}",

)

if status != 200:

snippet = json.dumps(payload)[:200] if isinstance(payload, dict) else str(payload)[:200]

return QueryResult(

answer="",

context_string="",

retrieved_entities=[],

retrieved_edges=[],

error=f"familiar endpoint {status}: {snippet}",

)

return self._deserialize_query(payload)

Copilot · 2026-04-30T17:11:01Z

+    def get_ontology_source(self) -> str:
+        """Wings/rooms taxonomy is author-declared (mempalace_get_taxonomy
+        MCP tool), not inferred from data. Same semantics as
+        MemPalaceDaemonAdapter."""
+        return "declared"
+


FamiliarAdapter.get_ontology_source() returns the string "declared", but SMEAdapter.get_ontology_source() is documented/used as a dict with keys like {type, schema, documentation} (e.g. cmd_analyze expects ontology.get(...)). Returning a string will raise at call sites if this adapter is used with ontology-reporting commands. Consider returning a dict like {"type": "declared", "schema": [...], "documentation": "..."} even if schema is empty for now.

Suggested change

def get_ontology_source(self) -> str:

"""Wings/rooms taxonomy is author-declared (mempalace_get_taxonomy

MCP tool), not inferred from data. Same semantics as

MemPalaceDaemonAdapter."""

return "declared"

def get_ontology_source(self) -> dict[str, Any]:

"""Return a declared ontology descriptor compatible with SME

ontology-reporting call sites.

Familiar's wings/rooms taxonomy is author-declared

(mempalace_get_taxonomy MCP tool), not inferred from graph data.

The schema is currently surfaced as empty until this adapter

exposes the declared taxonomy structure directly.

"""

return {

"type": "declared",

"schema": [],

"documentation": (

"Wings/rooms taxonomy is author-declared via the "

"mempalace_get_taxonomy MCP tool and is not inferred from "

"retrieved graph data."

),

}

Copilot · 2026-04-30T17:11:01Z

+import urllib.error
+import urllib.parse
+import urllib.request
+from collections import defaultdict


sme/adapters/mempalace_daemon.py imports defaultdict but doesn't use it (graph projection moved into sme.adapters._graph_mapping). With ruff enabled (pyproject.toml includes ruff in dev deps), this will fail lint due to an unused import. Remove the unused import to keep the module clean.

Suggested change

from collections import defaultdict

Addresses Copilot review on PR #6: - query() now accepts an optional n_results kwarg (default 5) instead of hard-coding limit=5 in the request body. The CLI's --n-results flag was being silently ignored for the familiar adapter; the cmd_retrieve TypeError-fallback ladder was masking it. - get_flat_retrieval() now returns QueryResult per the SMEAdapter base contract (with empty answer/context_string by design — this is the "retrieval-only" reading for Cat 7). Previously returned list[Entity], which would break any generic Cat 7 / tooling that expects a QueryResult shape. - get_ontology_source() now returns a dict with keys {type, schema, documentation} per the SMEAdapter contract. Previously returned the bare string "declared", which would raise at any call site that accesses the dict (e.g. cmd_analyze). Documents the same MemPalace ontology familiar wraps, since familiar doesn't change the underlying schema. Tests updated to assert the new shapes (QueryResult attributes, dict structure on the ontology call).

Addresses Copilot review on PR #5: - Drop unused `from collections import defaultdict` import (graph projection moved into sme.adapters._graph_mapping in the familiar PR; this import was left orphaned). Was failing ruff lint silently. - Wrap `_parse_env_file`'s `path.read_text()` in a try/except for OSError + UnicodeDecodeError so adapter construction falls through to env-var resolution instead of raising on permissions issues, broken symlinks, or non-UTF-8 contents. - Tighten the live-integration skip gate to require both PALACE_DAEMON_URL and PALACE_API_KEY (it previously checked only the URL, so a developer with the URL but not the key would get a hard test failure instead of a skip). Deferred (commented in PR review): only-fall-back-on-404 for get_graph_snapshot. That's a behavior change and the empty-on-error shape may be load-bearing for existing users; will revisit once the PR has reviewer signal on direction.

Addresses Copilot review on PR #6: - query() now accepts an optional n_results kwarg (default 5) instead of hard-coding limit=5 in the request body. The CLI's --n-results flag was being silently ignored for the familiar adapter; the cmd_retrieve TypeError-fallback ladder was masking it. - get_flat_retrieval() now returns QueryResult per the SMEAdapter base contract (with empty answer/context_string by design — this is the "retrieval-only" reading for Cat 7). Previously returned list[Entity], which would break any generic Cat 7 / tooling that expects a QueryResult shape. - get_ontology_source() now returns a dict with keys {type, schema, documentation} per the SMEAdapter contract. Previously returned the bare string "declared", which would raise at any call site that accesses the dict (e.g. cmd_analyze). Documents the same MemPalace ontology familiar wraps, since familiar doesn't change the underlying schema. Tests updated to assert the new shapes (QueryResult attributes, dict structure on the ontology call).

M0nkeyFl0wer

Approved. Stack continues to land cleanly.

Local verification on pr-6:

pytest tests/test_familiar_adapter.py -v — 26/26 pass
Full suite — 90 passed, 5 skipped
ruff check — clean

What I liked:

Refactor before composition. Extracting _project_graph into sme/adapters/_graph_mapping.py:project_graph rather than copy-pasting it into FamiliarAdapter is the right move — it pins the projection contract to a single function that both adapters import. The "Extracted verbatim from MemPalaceDaemonAdapter._project_graph 2026-04-26" comment makes the provenance auditable. Resolves the deferred unused-import lint from PR #5 too.
mock_inference=True by default is the right call for SME runs. Substring scoring needs determinism; LLM-in-the-loop reads belong in a Cat 9 pass. The docstring says this explicitly, which is good — anyone who wonders "why are we mocking the inference?" gets the answer immediately.
Familiar pipeline measurement strategy is well-framed. "POST /api/familiar/eval already returns SME-QueryResult shape natively" — designing familiar v0.2's chunk-C against the SME contract means the adapter is mostly deserialization, which is exactly the surface area we want at the eval boundary. Comparing familiar vs daemon adapter scores on the same corpus (the diff measuring rerank/decay/compress/grounding contribution) is a clean isolation.
jp-realm-v0.1 corpus — 30 questions, expected_sources as substrings (matches the framework's existing matcher), Mode B diagnostic framing (no ingest, queries target known-existing content). The "seed authored 2026-04-26 by Claude with JP" note is honest about provenance.
8 baseline JSON files at consistent envelope shape (total, full_recall, partial_hit, mean_recall, mean_tokens) — the v0.2 → v0.3.x progression is captured, which means future regressions are diff-able rather than re-discoverable.

No requested changes.

Note for future spec work: the substring-on-context_string matcher inherited from the existing framework has the same precision/recall caveats I documented in mempalace#101 (filenames-containing-subjects scoring spuriously high). Worth flagging in a follow-up issue rather than blocking this PR; the corpus authoring contract you've outlined in tests/fixtures/tiny_questions.yaml and the v0.1 questions is independent of the matcher's known limitations.

jphein and others added 30 commits April 30, 2026 10:02

test: add fake_urlopen_factory fixture for HTTP-mocking adapter tests

7019e91

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(adapters): lock MemPalaceDaemonAdapter query error/warning contract

5456c90

Covers WARN-with-results soft signal, NO_RESULTS, AUTH, HTTP 5xx, connection failures, and X-API-Key header propagation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(fixtures): tiny corpus for mempalace-daemon CLI smoke

70b5778

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(readme): mempalace-daemon adapter — planned → shipped

23d541c

Replaces the fork-roadmap planning subsection with the wired-endpoint description, auth resolution rules, and invocation example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: ruff fixes for mempalace_daemon adapter tests

41b248f

Moved conftest imports to the top of the module, dropped unused os/Path/urllib.error imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jphein added 5 commits April 30, 2026 10:02

Copilot AI review requested due to automatic review settings April 30, 2026 17:06

jphein mentioned this pull request Apr 30, 2026

feat(adapters): RlmAdapter + Qwen-7B/Llama-70B baselines #7

Closed

3 tasks

Copilot started reviewing on behalf of jphein April 30, 2026 17:07 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

jphein added 2 commits April 30, 2026 13:53

jphein mentioned this pull request Apr 30, 2026

feat(adapters): MemPalaceDaemonAdapter — read-side adapter for production palaces #5

Merged

3 tasks

M0nkeyFl0wer approved these changes May 1, 2026

View reviewed changes

M0nkeyFl0wer mentioned this pull request May 1, 2026

Adapter contract testkit — extract shared SMEAdapter conformance suite #8

Closed

M0nkeyFl0wer merged commit fb0844c into M0nkeyFl0wer:main May 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): FamiliarAdapter + jp-realm-v0.1 corpus + first live readings#6

feat(eval): FamiliarAdapter + jp-realm-v0.1 corpus + first live readings#6
M0nkeyFl0wer merged 37 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/familiar-adapter-and-jp-realm-corpus

jphein commented Apr 30, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

Copilot AI Apr 30, 2026

Uh oh!

M0nkeyFl0wer left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		not os.environ.get("PALACE_DAEMON_URL"),
		reason="needs a running palace-daemon; set PALACE_DAEMON_URL to enable",

-    not os.environ.get("PALACE_DAEMON_URL"),
-    reason="needs a running palace-daemon; set PALACE_DAEMON_URL to enable",
+    not os.environ.get("PALACE_DAEMON_URL")
+    or not os.environ.get("PALACE_API_KEY"),
+    reason=(
+        "needs a running palace-daemon; set PALACE_DAEMON_URL and "
+        "PALACE_API_KEY to enable"
+    ),

-    def get_flat_retrieval(self, question: str, k: int = 5) -> list[Entity]:
-        """Optional: same path as query() but returns only the entities,
-        used by Cat 7 token-budget analysis where the answer text is
-        irrelevant and only retrieval changes."""
-        body = {
-            "query": question[:250],
-            "limit": k,
-            "kind": "content",
-            "mock": self.mock_inference,
-        }
-        status, payload = self._post_json("/api/familiar/eval", body)
-        if status != 200 or not isinstance(payload, dict):
-            return []
-        raw_entities = payload.get("retrieved_entities") or []
-        return [self._entity_from_payload(e) for e in raw_entities]
+    def get_flat_retrieval(self, question: str, n_results: int = 5) -> QueryResult:
+        """Optional: same path as query() but returns the standard
+        QueryResult shape so generic SME tooling can consume retrieval-only
+        calls through the base adapter contract."""
+        body = {
+            "query": question[:250],
+            "limit": n_results,
+            "kind": "content",
+            "mock": self.mock_inference,
+        }
+        status, payload = self._post_json("/api/familiar/eval", body)
+        if status == -1:
+            msg = payload.get("_network_error", "unknown network error") if isinstance(payload, dict) else "unknown error"
+            return QueryResult(
+                answer="",
+                context_string="",
+                retrieved_entities=[],
+                retrieved_edges=[],
+                error=f"familiar {msg}",
+            )
+        if status != 200:
+            snippet = json.dumps(payload)[:200] if isinstance(payload, dict) else str(payload)[:200]
+            return QueryResult(
+                answer="",
+                context_string="",
+                retrieved_entities=[],
+                retrieved_edges=[],
+                error=f"familiar endpoint {status}: {snippet}",
+            )
+        return self._deserialize_query(payload)

-    def get_ontology_source(self) -> str:
-        """Wings/rooms taxonomy is author-declared (mempalace_get_taxonomy
-        MCP tool), not inferred from data. Same semantics as
-        MemPalaceDaemonAdapter."""
-        return "declared"
+    def get_ontology_source(self) -> dict[str, Any]:
+        """Return a declared ontology descriptor compatible with SME
+        ontology-reporting call sites.
+        Familiar's wings/rooms taxonomy is author-declared
+        (mempalace_get_taxonomy MCP tool), not inferred from graph data.
+        The schema is currently surfaced as empty until this adapter
+        exposes the declared taxonomy structure directly.
+        """
+        return {
+            "type": "declared",
+            "schema": [],
+            "documentation": (
+                "Wings/rooms taxonomy is author-declared via the "
+                "mempalace_get_taxonomy MCP tool and is not inferred from "
+                "retrieved graph data."
+            ),
+        }

Conversation

jphein commented Apr 30, 2026

Summary

Design + plan docs in the PR

Notable refactor

Lessons doc highlights

Testing

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

M0nkeyFl0wer left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants