Skip to content

feat(eval): FamiliarAdapter + jp-realm-v0.1 corpus + first live readings#6

Merged
M0nkeyFl0wer merged 37 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/familiar-adapter-and-jp-realm-corpus
May 1, 2026
Merged

feat(eval): FamiliarAdapter + jp-realm-v0.1 corpus + first live readings#6
M0nkeyFl0wer merged 37 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/familiar-adapter-and-jp-realm-corpus

Conversation

@jphein
Copy link
Copy Markdown
Contributor

@jphein jphein commented Apr 30, 2026

Second in a three-PR stack. Stacked on feat/mempalace-daemon-adapter — please merge that one first; this PR's diff against main will collapse to just the new commits afterward.

Summary

Three logically-grouped pieces ride together because they're tightly coupled — the corpus was authored to exercise both adapters, and the lessons doc is the writeup that came out of running them:

  • FamiliarAdapter — read-side adapter for familiar.realm.watch, a deterministic retrieve→rerank→decay→compress pipeline that wraps a palace. Pairs cleanly with MemPalaceDaemonAdapter for A/B work.
  • jp-realm-v0.1 corpus — 30-question hand-authored corpus (27 single-hop, 3 two-hop) covering JP's homelab + project knowledge. Authoring contract follows the spec's existing per-corpus convention; lives at sme/corpora/jp_realm_v0_1/questions.yaml.
  • First live readings — head-to-head familiar vs daemon baselines, plus the v0.3 reflect-pipeline iterations that came out of running them. docs/ideas.md gains a "Lessons from the first live eval" section synthesizing what surfaced.

Design + plan docs in the PR

  • docs/superpowers/specs/2026-04-26-familiar-adapter-design.md
  • docs/superpowers/plans/2026-04-26-familiar-adapter.md

Notable refactor

refactor(adapters): extract /graph payload mapping into shared module (commit 4f213db in this PR) extracts the /graph snapshot mapping that both FamiliarAdapter and MemPalaceDaemonAdapter need, into sme/adapters/_graph_mapping.py. The daemon adapter is updated to use the shared module in this PR rather than the prior PR — keeps the prior PR scoped to "introduce the daemon adapter" and the refactor scoped to "share the part that's now duplicated."

Lessons doc highlights

The new section in docs/ideas.md captures the diagnostic patterns that fell out of the first run — including the system-gap-vs-pipeline-gap split that the q12/q13 case study illustrates (writing the missing drawers via palace-daemon flipped q12 from 0.0 to 1.0 — palace gap; left q13 at 0.0 — pipeline gap). This is the diagnostic the framework was built to provide; nice to see it land cleanly with the simplest possible scoring.

Testing

  • tests/test_familiar_adapter.py — 26 tests covering query happy/error paths, graph snapshot, ingest stub, mock/no-mock modes, timeout behavior.
  • tests/test_familiar_live.py — gated live smoke (requires FAMILIAR_LIVE=1), skipped in CI.
  • All daemon tests from PR 1 still pass after the shared-module refactor.

Test plan

  • python -m pytest tests/test_familiar_adapter.py tests/test_mempalace_daemon_adapter.py -v — 54 tests pass (26 familiar + 28 daemon)
  • sme-eval retrieve --adapter familiar --questions sme/corpora/jp_realm_v0_1/questions.yaml --familiar-url $URL — CLI smoke against a running familiar
  • cat baselines/jp_realm_v0_1_familiar_*.json | jq '.summary.mean_recall' — eyeball the recall progression across the v0.2/v0.3 baselines

The 19 commits split into design/plan, shared-module refactor, scaffold, query, graph snapshot, CLI wiring, then corpus authoring and the v0.2/v0.3 baseline iterations + lessons doc in roughly that order. Happy to walk through any specific commit if useful.

jphein and others added 30 commits April 30, 2026 10:02
The shipped sme/adapters/mempalace.py opens ChromaDB directly via
chromadb.PersistentClient. That's correct for upstream MemPalace
single-process installs but violates the daemon-strict invariant of
the jphein/palace-daemon + jphein/mempalace fork architecture: two
PersistentClient instances against the same palace = the
multi-writer corruption scenario the daemon was built to prevent.

Documenting a planned mempalace-daemon adapter that:
- talks HTTP/MCP only (no filesystem access, no chromadb import)
- assembles the graph snapshot from /mcp tool calls
  (mempalace_list_wings, mempalace_list_rooms, mempalace_list_tunnels,
   mempalace_kg_query)
- routes query() through palace-daemon's GET /search with the new
  kind= parameter (default kind=content excludes Stop-hook auto-save
  checkpoints — validated 2026-04-25 against the canonical 151K
  palace, filter excludes ~637 of 151,448 drawers but ~80%+ of
  search results before filter)

Why it matters: SME's Cat 9 (The Handshake) measures
integration-under-production-model — exactly the slice engram-2's
"0.984 R@5 vs 17% E2E QA" critique targets. Running Cat 9 through
the daemon adapter, A/B before/after applying the kind= filter, would
let SME's verdict replace our hand-rolled benchmark and generate
publishable data on whether the fix moves the needle.

Multi-hour build, not yet started; this entry is the placeholder so
the work doesn't drift. Lives on main only — kept off the
feat/cat9-9b-scaffolding upstream PR branch per the
"fork README handling" pattern (PR branches must carry upstream's
README so fork-only content doesn't leak upstream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Design for a fork-side SMEAdapter that talks HTTP/MCP to a running
palace-daemon, plus a planned /graph endpoint to make graph snapshots fast.
Adapter ships independently against daemon 1.5.1 via MCP-tool fallback;
/graph is coordinated daemon-side work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 tasks following TDD: scaffolding, adapter (auth → query happy
path → query errors → /graph fast path → MCP fallback → ontology),
CLI plumbing, gated integration smoke, end-to-end CLI smoke, README.
Daemon-side /graph endpoint explicitly out-of-scope; coordinated
separately with JP's active palace-daemon work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three-source auth resolution: kwargs > env file > process env. Raises
ValueError at construction if nothing resolves. query() and
get_graph_snapshot() raise NotImplementedError pending follow-on tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /search builds context_string in the same format as the existing
MemPalaceAdapter so tiktoken counts stay comparable across adapters.
retrieval_path captures kind + scope counts for downstream scoring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers WARN-with-results soft signal, NO_RESULTS, AUTH, HTTP 5xx,
connection failures, and X-API-Key header propagation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /graph in one call; projects to wings, rooms, member_of, tunnel,
and KG layers. Mirrors the existing direct adapter's projection minus
drawer-level surface (impractical to fetch at scale over HTTP).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When /graph is unavailable (older daemons or prefer_graph_endpoint=False),
walk mempalace_list_wings, mempalace_list_rooms per wing, and
mempalace_list_tunnels via POST /mcp. Per-wing failures degrade to a
partial snapshot rather than aborting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returns the same readme-typed ontology as the existing MemPalaceAdapter so
Cat 8 readings are comparable across the direct-ChromaDB and HTTP paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the mempalace-daemon branch to _load_adapter and threads
--api-key / --kind through _load_adapter_from_args, _add_db_or_api_args,
and the retrieve subparser. Existing mempalace and flat adapters drop
the new kwargs cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skipped when PALACE_DAEMON_URL is unset so CI without a daemon stays
green. Asserts query() returns a QueryResult, snapshot returns >=1
wing, and kind=all >= kind=content for total_before_filter (validates
the README's filter claim). Runs in ~29s against the 151K-drawer
palace via the MCP fallback path; will drop to <5s once the daemon
ships GET /graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the fork-roadmap planning subsection with the wired-endpoint
description, auth resolution rules, and invocation example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moved conftest imports to the top of the module, dropped unused
os/Path/urllib.error imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier total_before_filter comparison conflated metadata math
with filter behaviour. The reliable signal is whether 'CHECKPOINT:'
strings appear in the result text — kind='content' must produce
strictly fewer than kind='all'. Skips when the live palace has no
checkpoints to test against.

Discovered against palace-daemon 1.6.0 with 151K drawers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the mempalace-daemon adapter that shipped 2026-04-25. Familiar
is the layer above palace-daemon (rerank + decay + compress + grounding);
this adapter measures that pipeline's contribution to retrieval quality
on top of the palace itself.

Cat 9 implementation deferred — but the adapter captures all the
HandshakeTrace-shaped data Cat 9 will eventually consume, so the
per-question JSON output is forward-compatible with the future
sme/categories/handshake.py scorer.
12 TDD tasks, 19 unit tests + 2 gated live tests, 1562 lines.
Mirrors the conventions established by the mempalace-daemon adapter
(commits 4f5eba3..da21708): refactor common /graph mapping first,
then scaffold-test-implement-test-commit per task.

Spec: docs/superpowers/specs/2026-04-26-familiar-adapter-design.md
Both MemPalaceDaemonAdapter and the upcoming FamiliarAdapter consume
the same /graph payload shape (familiar's /api/familiar/graph is a
verbatim proxy of palace-daemon's /graph). Extract project_graph into
sme/adapters/_graph_mapping.py so both adapters share the implementation
and stay in lock-step if the upstream shape evolves.

mempalace_daemon.py: 519 -> 408 lines.
Existing 28 mempalace_daemon tests pass unchanged.
Class structure + constructor + base URL trim + Mode-B ingest stub.
query() and get_graph_snapshot() raise NotImplementedError; subsequent
tasks fill them in. Mirrors MemPalaceDaemonAdapter conventions
(base_url, timeout_s, opener for test injection).
POSTs to /api/familiar/eval, deserializes the SME-shape response into
QueryResult. mock_inference=True default forwarded as 'mock' field
so Cat 1 substring scoring stays deterministic.

Error contract (no-raise per SMEAdapter):
  HTTP 4xx/5xx       -> error='familiar endpoint <code>: <body[:200]>'
  socket.timeout     -> error='familiar timeout after <s>s ...'
  URLError           -> error='familiar connection failed: <reason>'
  invalid JSON       -> error='familiar invalid JSON response: ...'
  unexpected         -> error='familiar unexpected: <type>: ...'

Warnings translation:
  server_error + warnings -> '<server> | WARN: <warnings joined>'
  warnings only           -> 'WARN: <warnings joined>'
  no signals              -> error=None

Soft warnings (low_confidence, filtered_null_text_*, stuck_loop) and
hard warnings (palace_unreachable) both surface with WARN: prefix;
data fields still populated so SME's per-question scoring distinguishes
'warned-but-answered' from 'errored'.

13 new tests; 16 passing total.
- get_graph_snapshot: GET /api/familiar/graph (familiar's verbatim
  proxy of palace-daemon's /graph with a 5-min cache layer) and runs
  the response through the shared _graph_mapping.project_graph helper.
  Failure returns ([], []) per the SME contract — a missing snapshot
  zeroes out structural-category scores rather than raising.
- get_flat_retrieval: same eval endpoint, entities only (Cat 7).
- get_ontology_source: 'declared' — wings/rooms are author-defined.
- get_harness_manifest: forward-compat Cat 9 — returns HTTP-tool +
  MCP descriptors when sme.harness is importable, else [] for
  graceful behavior on multipass versions before Cat 9 ships.

7 new tests; 23 passing total. Daemon adapter (28 tests) still green
since the shared graph mapping module is unchanged.
…meout

Adds the familiar branch to _load_adapter (mirrors mempalace-daemon's
branch shape). Drops daemon/ladybug-specific kwargs; remaps --api-url
to FamiliarAdapter's base_url constructor arg. Adds mutually-exclusive
--mock/--no-mock flags (default --mock for Cat 1 substring-scoring
determinism) and --familiar-timeout for tuning per-question wait.

CLI integration tests cover the dispatch path: instantiation,
mock_inference default, --no-mock pass-through.

3 new tests; 26 familiar + 28 daemon = 54 total.
README: parallel section to the mempalace-daemon adapter, framing
familiar as 'palace + v0.2 pipeline' so the score delta vs the daemon
adapter captures what the v0.2 work is worth.

ideas.md: notes that FamiliarAdapter's per-question records are
forward-compatible with the eventual sme/categories/handshake.py
scorer — Cat 9 implementation can land later without revisiting
the adapter API.
Set FAMILIAR_BASE_URL to run; otherwise skipped. Asserts the
QueryResult contract holds end-to-end and graph_snapshot returns lists.
Tolerates palace HNSW rebuild state via skip+warn handling — verified
2026-04-26 against https://familiar.jphe.in/ (palace mid-rebuild,
query took 17s and adapter correctly translated daemon timeout into
WARN: error string).
- tests/test_familiar_adapter.py: drop unused pytest import (ruff F401).
- sme/cli.py: --adapter help now lists familiar in the choices.

54/54 adapter tests still green; ruff clean on all files added in this
PR (sme/adapters/familiar.py, sme/adapters/_graph_mapping.py,
tests/test_familiar_adapter.py, tests/test_familiar_live.py).
Diagnostic Mode-B corpus targeting content known to exist in JP's
live palace based on session context, CLAUDE.md, and trace logs.
30 questions across 6 thematic groups: realm.watch ecosystem,
mempalace+palace-daemon, surveyed forks, tools/workflow, infra,
specific incidents/decisions. expected_sources substrings will
match against the adapter's context_string.

Run when palace HNSW rebuild completes:
  sme-eval retrieve --adapter familiar --api-url https://familiar.jphe.in \
      --questions sme/corpora/jp_realm_v0_1/questions.yaml --mock --json familiar.json

Compare against same corpus through --adapter mempalace-daemon to
measure familiar v0.2 pipeline's retrieval delta.
The retrieve subcommand always sets read_only=True in adapter_kwargs
(meaningful for ChromaDB / mempalace-daemon), but FamiliarAdapter
takes only base_url/timeout_s/mock_inference/opener and rejects
unknown kwargs with TypeError. Add read_only to the familiar drop
list so `sme-eval retrieve --adapter familiar ...` works.
familiar v0.2 pipeline:  73.33% recall, 17/30 full, 27/30 hit-rate
mempalace-daemon raw:    70.00% recall, 15/30 full, 27/30 hit-rate
delta: +3.33pp recall, +2 questions; familiar never loses (28 ties).

Pipeline contribution surfaces on:
- q04_os_realm_watch (rerank lifts os.realm into top-K)
- q16_speech_to_cli (decay weight prefers fresh speech-to-cli notes)

3 honest gaps in palace (recall=0.0 in both): q08 kind-filter
(internal API detail), q12 rlm, q13 GraphPalace — fork-survey
topics synthesized in chat but never written to palace.

Run:
  PALACE_API_KEY=$(grep PALACE_DAEMON_API_KEY ~/.config/palace-daemon/env | cut -d= -f2) \\
  venv/bin/sme-eval retrieve --adapter mempalace-daemon --api-url http://disks:8085 \\
      --questions sme/corpora/jp_realm_v0_1/questions.yaml --json out.json
…arser

Earlier work added these flags to _add_db_or_api_args (used by cat4,
cat5, check) but the retrieve subcommand inlines its own arg block
and never picked them up. Result: `sme-eval retrieve --mock ...`
errored with "unrecognized arguments: --mock". Add the same arg
group + kwarg passthrough to retrieve so all FamiliarAdapter knobs
are accessible from the retrieve path — the only path that actually
uses them today.
jphein added 5 commits April 30, 2026 10:02
v2 (clean):              73.33% recall  17 full hits  27/30 hit-rate
v3 (+3 missing drawers): 76.67% recall  18 full hits  28/30 hit-rate
v4 (+ punct fix):        76.67% recall  17 full hits  29/30 hit-rate

v3 added drawers for q12 (rlm), q13 (GraphPalace), q08 (kind filter)
written via palace-daemon /memory POST. Only q12 surfaced
(q13 and q08 missed due to embedding ranking).

v4 added trailing-punctuation strip in palace-client (familiar
commit b676852). q13 flipped 0.0 → 0.5 (now finds GraphPalace);
q15 regressed 1.0 → 0.5 (one substring lost from rerank shift).
Net: hit-rate up, full-recall flat — punctuation normalization
is a useful retrieval tuning lever but ranking is sensitive.

Last hard miss (q08 kind_filter) is a true semantic-distance
issue: jargon-heavy drawer text doesn't embed close to a
conversational question even though both contain the same
keywords literally. Vector retrieval limitation, not a familiar bug.
Captures findings that should shape adapter + category work:
- substring scoring on context_string is the right MVP
- corpus runs distinguish system gaps from pipeline gaps
- corpora must be cheap to re-run (ratchet, not benchmark)
- retrieve subcommand needed CLI plumbing fixes the helpers hid
- per-corpus directories scale cleanly to multi-corpus future
Two attempts to lift retrieval recall on jp-realm-v0.1, both reverted
upstream after measuring:

1. jp_realm_v0_1_familiar_v0.3_task1_bm25.json
   familiar adds cosine+bm25 blend in domainRerank baseScore.
   Net effect: zero (76.67% → 76.67%). The daemon's PalaceDrawer
   response uses field name 'bm25_score' not 'bm25', so the blend
   condition never fired in production. Even when fixed, the
   mempalace fork ALREADY does identical 0.6/0.4 hybrid in
   _hybrid_rank — familiar's blend would have been redundant.

2. jp_realm_v0_1_familiar_v0.3_candidate_window.json
   familiar fetches candidateLimit=20 from daemon then slices to
   retrievalLimit=5 after rerank/decay/compress. Net REGRESSION
   (76.67% → 71.67%, hit-rate 29 → 26 of 30). The wider candidate
   window let wing-match boost (×1.4) elevate same-wing-but-irrelevant
   drawers above different-wing correct answers — 4 questions
   regressed (q07, q18, q21, q28) vs 3 improved (q15, q17, q29).

Both reverted in familiar.realm.watch (eb0b640 reverted by d2bf8a0;
55d8889 reverted by f4f2cce). Lesson: read-side reranking has hit
diminishing returns; the q08-class miss is upstream of familiar.
v0.3 attention should shift to the reflect loop (write-side) per
the spec's framing.
Mean recall:    76.67% -> 78.33%  (+1.67pp)
Full hits:      17/30 -> 18/30
Hit-rate:       29/30 -> 29/30 (unchanged)

q14_hermes_agent flipped 0.50 -> 1.00 because the reflect smoke
turn wrote a drawer about hermes-agent that now surfaces 'agent'
in retrieval. Mechanism proven; v0.4's automatic per-session
reflect should compound this across many turns.

The bm25-blend and candidate-window attempts (both reverted)
moved zero questions; reflect moved one. Confirms the spec's
framing — write-side gaps are the real lever.
…lative

After a full day of real chat usage with auto-reflect enabled (every
≥80-char assistant turn), recall on jp-realm-v0.1 measures:

  v0.2.1 (entering v0.3):        76.67% recall, 17 full
  v0.3.0 (single smoke turn):    78.33% recall, 18 full
  v0.3.9 (a day later):          78.33% recall, 18 full   ← no further movement

What this tells us: reflect IS writing drawers (memories panel
showed 5+ from real chats during the day), but the 30-question
jp-realm-v0.1 corpus asks about topics palace already had coverage
for. Reflect's contribution lives in topics OUTSIDE the frozen
corpus.

Implication for v0.4: cumulative reflect impact needs a different
measurement vehicle — a reflect-specific corpus, or longitudinal
recall on a corpus that grows as JP introduces new topics. Filing
this as a v0.4 candidate (corpus authoring, not pipeline change).
Copilot AI review requested due to automatic review settings April 30, 2026 17:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds two new read-side adapters (mempalace-daemon, familiar) to run SME retrieval + graph-snapshot evaluations against live HTTP services, plus a new JP-specific diagnostic corpus and first baseline artifacts / lessons writeup.

Changes:

  • Introduce MemPalaceDaemonAdapter (palace-daemon HTTP) and FamiliarAdapter (familiar.realm.watch HTTP) with shared /graph payload projection.
  • Wire new adapters + flags into the CLI and add extensive mocked + gated-live tests.
  • Add the jp-realm-v0.1 question corpus plus baseline JSON outputs and documentation updates.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_mempalace_daemon_integration.py Gated live smoke tests for the daemon adapter.
tests/test_mempalace_daemon_adapter.py Mocked unit tests for daemon adapter query + graph snapshot (fast-path + MCP fallback).
tests/test_familiar_live.py Gated live smoke tests for FamiliarAdapter.
tests/test_familiar_adapter.py Mocked unit tests for FamiliarAdapter behavior and CLI loading.
tests/fixtures/tiny_questions.yaml Minimal smoke corpus fixture.
tests/conftest.py Adds fake_urlopen_factory + _FakeResponse for HTTP-mocking urllib adapters.
sme/corpora/jp_realm_v0_1/questions.yaml Adds the jp-realm-v0.1 diagnostic question corpus.
sme/cli.py Adds adapter loading branches + new flags (--api-key, --kind, --mock/--no-mock, --familiar-timeout) and threads them into retrieve.
sme/adapters/mempalace_daemon.py Implements palace-daemon HTTP adapter (/search, /graph, MCP fallback).
sme/adapters/familiar.py Implements familiar HTTP adapter (/api/familiar/eval, /api/familiar/graph) and payload deserialization.
sme/adapters/_graph_mapping.py Shared projection from /graph payload to (Entity, Edge) lists.
docs/superpowers/specs/2026-04-26-familiar-adapter-design.md Design spec for FamiliarAdapter.
docs/superpowers/specs/2026-04-25-mempalace-daemon-adapter-design.md Design spec for MemPalaceDaemonAdapter.
docs/ideas.md Adds “Lessons from the first live eval” section and adapter forward-compat notes.
baselines/jp_realm_v0_1_familiar_v0.3_task1_bm25.json Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_v0.3_reflect.json Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_v0.3_candidate_window.json Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_v0.3.9_reflect_cumulative.json Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v4_punct_fix.json Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v3.json Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_20260426_v2.json Baseline output artifact (familiar).
baselines/jp_realm_v0_1_familiar_20260426.json Baseline output artifact (familiar).
README.md Documents fork roadmap and usage for the new adapters/flags.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +24 to +25
not os.environ.get("PALACE_DAEMON_URL"),
reason="needs a running palace-daemon; set PALACE_DAEMON_URL to enable",
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The skip condition only checks PALACE_DAEMON_URL, but the fixture constructs MemPalaceDaemonAdapter() which also requires PALACE_API_KEY (or an env_file). If a developer sets the URL but forgets the key, this will fail the test instead of skipping. Consider extending the skipif to require both PALACE_DAEMON_URL and PALACE_API_KEY (or otherwise skipping with a clearer reason).

Suggested change
not os.environ.get("PALACE_DAEMON_URL"),
reason="needs a running palace-daemon; set PALACE_DAEMON_URL to enable",
not os.environ.get("PALACE_DAEMON_URL")
or not os.environ.get("PALACE_API_KEY"),
reason=(
"needs a running palace-daemon; set PALACE_DAEMON_URL and "
"PALACE_API_KEY to enable"
),

Copilot uses AI. Check for mistakes.
Comment thread sme/adapters/familiar.py Outdated
Comment on lines +75 to +83
def query(self, question: str) -> QueryResult:
"""POST /api/familiar/eval and deserialize the SME-shape response."""
body = {
"query": question[:250],
"limit": 5,
"kind": "content",
"mock": self.mock_inference,
}
status, payload = self._post_json("/api/familiar/eval", body)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FamiliarAdapter.query() hard-codes "limit": 5 and does not accept an n_results kwarg, so the CLI's --n-results setting is silently ignored for this adapter (cmd_retrieve falls back after a TypeError). Consider adding an optional n_results parameter (and threading it through as the request's "limit") so familiar runs are comparable at different K values and the CLI flag behaves consistently across adapters.

Copilot uses AI. Check for mistakes.
Comment thread sme/adapters/familiar.py Outdated
Comment on lines +117 to +131
def get_flat_retrieval(self, question: str, k: int = 5) -> list[Entity]:
"""Optional: same path as query() but returns only the entities,
used by Cat 7 token-budget analysis where the answer text is
irrelevant and only retrieval changes."""
body = {
"query": question[:250],
"limit": k,
"kind": "content",
"mock": self.mock_inference,
}
status, payload = self._post_json("/api/familiar/eval", body)
if status != 200 or not isinstance(payload, dict):
return []
raw_entities = payload.get("retrieved_entities") or []
return [self._entity_from_payload(e) for e in raw_entities]
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FamiliarAdapter.get_flat_retrieval() doesn't match the SMEAdapter optional-method contract: SMEAdapter.get_flat_retrieval returns a QueryResult, but this implementation returns list[Entity] and adds a nonstandard k parameter. This will break any generic Cat 7 / tooling that expects a QueryResult shape from get_flat_retrieval. Suggest returning a QueryResult (with context_string/answer empty if desired) and using the same n_results kwarg convention as other adapters.

Suggested change
def get_flat_retrieval(self, question: str, k: int = 5) -> list[Entity]:
"""Optional: same path as query() but returns only the entities,
used by Cat 7 token-budget analysis where the answer text is
irrelevant and only retrieval changes."""
body = {
"query": question[:250],
"limit": k,
"kind": "content",
"mock": self.mock_inference,
}
status, payload = self._post_json("/api/familiar/eval", body)
if status != 200 or not isinstance(payload, dict):
return []
raw_entities = payload.get("retrieved_entities") or []
return [self._entity_from_payload(e) for e in raw_entities]
def get_flat_retrieval(self, question: str, n_results: int = 5) -> QueryResult:
"""Optional: same path as query() but returns the standard
QueryResult shape so generic SME tooling can consume retrieval-only
calls through the base adapter contract."""
body = {
"query": question[:250],
"limit": n_results,
"kind": "content",
"mock": self.mock_inference,
}
status, payload = self._post_json("/api/familiar/eval", body)
if status == -1:
msg = payload.get("_network_error", "unknown network error") if isinstance(payload, dict) else "unknown error"
return QueryResult(
answer="",
context_string="",
retrieved_entities=[],
retrieved_edges=[],
error=f"familiar {msg}",
)
if status != 200:
snippet = json.dumps(payload)[:200] if isinstance(payload, dict) else str(payload)[:200]
return QueryResult(
answer="",
context_string="",
retrieved_entities=[],
retrieved_edges=[],
error=f"familiar endpoint {status}: {snippet}",
)
return self._deserialize_query(payload)

Copilot uses AI. Check for mistakes.
Comment thread sme/adapters/familiar.py Outdated
Comment on lines +133 to +138
def get_ontology_source(self) -> str:
"""Wings/rooms taxonomy is author-declared (mempalace_get_taxonomy
MCP tool), not inferred from data. Same semantics as
MemPalaceDaemonAdapter."""
return "declared"

Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FamiliarAdapter.get_ontology_source() returns the string "declared", but SMEAdapter.get_ontology_source() is documented/used as a dict with keys like {type, schema, documentation} (e.g. cmd_analyze expects ontology.get(...)). Returning a string will raise at call sites if this adapter is used with ontology-reporting commands. Consider returning a dict like {"type": "declared", "schema": [...], "documentation": "..."} even if schema is empty for now.

Suggested change
def get_ontology_source(self) -> str:
"""Wings/rooms taxonomy is author-declared (mempalace_get_taxonomy
MCP tool), not inferred from data. Same semantics as
MemPalaceDaemonAdapter."""
return "declared"
def get_ontology_source(self) -> dict[str, Any]:
"""Return a declared ontology descriptor compatible with SME
ontology-reporting call sites.
Familiar's wings/rooms taxonomy is author-declared
(mempalace_get_taxonomy MCP tool), not inferred from graph data.
The schema is currently surfaced as empty until this adapter
exposes the declared taxonomy structure directly.
"""
return {
"type": "declared",
"schema": [],
"documentation": (
"Wings/rooms taxonomy is author-declared via the "
"mempalace_get_taxonomy MCP tool and is not inferred from "
"retrieved graph data."
),
}

Copilot uses AI. Check for mistakes.
Comment thread sme/adapters/mempalace_daemon.py Outdated
import urllib.error
import urllib.parse
import urllib.request
from collections import defaultdict
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sme/adapters/mempalace_daemon.py imports defaultdict but doesn't use it (graph projection moved into sme.adapters._graph_mapping). With ruff enabled (pyproject.toml includes ruff in dev deps), this will fail lint due to an unused import. Remove the unused import to keep the module clean.

Suggested change
from collections import defaultdict

Copilot uses AI. Check for mistakes.
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval Apr 30, 2026
Addresses Copilot review on PR #6:

- query() now accepts an optional n_results kwarg (default 5) instead
  of hard-coding limit=5 in the request body. The CLI's --n-results
  flag was being silently ignored for the familiar adapter; the
  cmd_retrieve TypeError-fallback ladder was masking it.
- get_flat_retrieval() now returns QueryResult per the SMEAdapter
  base contract (with empty answer/context_string by design — this
  is the "retrieval-only" reading for Cat 7). Previously returned
  list[Entity], which would break any generic Cat 7 / tooling that
  expects a QueryResult shape.
- get_ontology_source() now returns a dict with keys
  {type, schema, documentation} per the SMEAdapter contract.
  Previously returned the bare string "declared", which would raise
  at any call site that accesses the dict (e.g. cmd_analyze).
  Documents the same MemPalace ontology familiar wraps, since
  familiar doesn't change the underlying schema.

Tests updated to assert the new shapes (QueryResult attributes,
dict structure on the ontology call).
jphein added 2 commits April 30, 2026 13:53
Addresses Copilot review on PR #5:

- Drop unused `from collections import defaultdict` import (graph
  projection moved into sme.adapters._graph_mapping in the familiar
  PR; this import was left orphaned). Was failing ruff lint silently.
- Wrap `_parse_env_file`'s `path.read_text()` in a try/except for
  OSError + UnicodeDecodeError so adapter construction falls through
  to env-var resolution instead of raising on permissions issues,
  broken symlinks, or non-UTF-8 contents.
- Tighten the live-integration skip gate to require both
  PALACE_DAEMON_URL and PALACE_API_KEY (it previously checked only
  the URL, so a developer with the URL but not the key would get a
  hard test failure instead of a skip).

Deferred (commented in PR review): only-fall-back-on-404 for
get_graph_snapshot. That's a behavior change and the empty-on-error
shape may be load-bearing for existing users; will revisit once the
PR has reviewer signal on direction.
Addresses Copilot review on PR #6:

- query() now accepts an optional n_results kwarg (default 5) instead
  of hard-coding limit=5 in the request body. The CLI's --n-results
  flag was being silently ignored for the familiar adapter; the
  cmd_retrieve TypeError-fallback ladder was masking it.
- get_flat_retrieval() now returns QueryResult per the SMEAdapter
  base contract (with empty answer/context_string by design — this
  is the "retrieval-only" reading for Cat 7). Previously returned
  list[Entity], which would break any generic Cat 7 / tooling that
  expects a QueryResult shape.
- get_ontology_source() now returns a dict with keys
  {type, schema, documentation} per the SMEAdapter contract.
  Previously returned the bare string "declared", which would raise
  at any call site that accesses the dict (e.g. cmd_analyze).
  Documents the same MemPalace ontology familiar wraps, since
  familiar doesn't change the underlying schema.

Tests updated to assert the new shapes (QueryResult attributes,
dict structure on the ontology call).
jphein referenced this pull request in techempower-org/multipass-structural-memory-eval Apr 30, 2026
Addresses Copilot review on PR #6:

- query() now accepts an optional n_results kwarg (default 5) instead
  of hard-coding limit=5 in the request body. The CLI's --n-results
  flag was being silently ignored for the familiar adapter; the
  cmd_retrieve TypeError-fallback ladder was masking it.
- get_flat_retrieval() now returns QueryResult per the SMEAdapter
  base contract (with empty answer/context_string by design — this
  is the "retrieval-only" reading for Cat 7). Previously returned
  list[Entity], which would break any generic Cat 7 / tooling that
  expects a QueryResult shape.
- get_ontology_source() now returns a dict with keys
  {type, schema, documentation} per the SMEAdapter contract.
  Previously returned the bare string "declared", which would raise
  at any call site that accesses the dict (e.g. cmd_analyze).
  Documents the same MemPalace ontology familiar wraps, since
  familiar doesn't change the underlying schema.

Tests updated to assert the new shapes (QueryResult attributes,
dict structure on the ontology call).
Copy link
Copy Markdown
Owner

@M0nkeyFl0wer M0nkeyFl0wer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Stack continues to land cleanly.

Local verification on pr-6:

  • pytest tests/test_familiar_adapter.py -v — 26/26 pass
  • Full suite — 90 passed, 5 skipped
  • ruff check — clean

What I liked:

  • Refactor before composition. Extracting _project_graph into sme/adapters/_graph_mapping.py:project_graph rather than copy-pasting it into FamiliarAdapter is the right move — it pins the projection contract to a single function that both adapters import. The "Extracted verbatim from MemPalaceDaemonAdapter._project_graph 2026-04-26" comment makes the provenance auditable. Resolves the deferred unused-import lint from PR #5 too.
  • mock_inference=True by default is the right call for SME runs. Substring scoring needs determinism; LLM-in-the-loop reads belong in a Cat 9 pass. The docstring says this explicitly, which is good — anyone who wonders "why are we mocking the inference?" gets the answer immediately.
  • Familiar pipeline measurement strategy is well-framed. "POST /api/familiar/eval already returns SME-QueryResult shape natively" — designing familiar v0.2's chunk-C against the SME contract means the adapter is mostly deserialization, which is exactly the surface area we want at the eval boundary. Comparing familiar vs daemon adapter scores on the same corpus (the diff measuring rerank/decay/compress/grounding contribution) is a clean isolation.
  • jp-realm-v0.1 corpus — 30 questions, expected_sources as substrings (matches the framework's existing matcher), Mode B diagnostic framing (no ingest, queries target known-existing content). The "seed authored 2026-04-26 by Claude with JP" note is honest about provenance.
  • 8 baseline JSON files at consistent envelope shape (total, full_recall, partial_hit, mean_recall, mean_tokens) — the v0.2 → v0.3.x progression is captured, which means future regressions are diff-able rather than re-discoverable.

No requested changes.

Note for future spec work: the substring-on-context_string matcher inherited from the existing framework has the same precision/recall caveats I documented in mempalace#101 (filenames-containing-subjects scoring spuriously high). Worth flagging in a follow-up issue rather than blocking this PR; the corpus authoring contract you've outlined in tests/fixtures/tiny_questions.yaml and the v0.1 questions is independent of the matcher's known limitations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants