Skip to content

feat(memory): surface recall scores and add a graded relevance floor#249

Open
prakashUXtech wants to merge 2 commits into
devfrom
feat/recall-graded-floor
Open

feat(memory): surface recall scores and add a graded relevance floor#249
prakashUXtech wants to merge 2 commits into
devfrom
feat/recall-graded-floor

Conversation

@prakashUXtech
Copy link
Copy Markdown
Contributor

What

soul recall scores memories two separate ways, and the two never meet. Each store's search() keeps a candidate if it shares any token with the query, so one shared word is enough to win a slot. BM25 gets computed too, but only inside compute_activation as the spread term, so it never reaches a result. And recall --json has no score field at all.

This PR puts the activation score on every recalled entry and turns the binary gate into a graded cutoff you can configure.

Why

Issue #247 audited the session-start soul recall mandate. Recall already drops zero-overlap matches, so a vague query comes back empty instead of padded. The part #247 flagged is the floor itself. Today it is a binary gate: any token overlap, however slight, takes a slot. Setting a graded threshold means the score has to reach the result first, and right now it is stuck inside the scorer.

Changes

  • activation.py: activation_breakdown() returns the activation total along with its component terms. compute_activation() now calls through to it, so existing callers see no change.
  • recall.py: RecallEngine.recall() scores each candidate once, writes the activation score to recall_score, and drops candidates below relevance_floor. The floor checks raw token overlap rather than the activation total, so a recency or emotion boost cannot carry an off-topic memory past it.
  • search.py: DEFAULT_RELEVANCE_FLOOR and passes_relevance_floor() replace the bare score > 0.0 literal in each store's search().
  • types.py: adds MemoryEntry.recall_score — a runtime-only field that stays out of the .soul file.
  • cli/main.py: recall --json emits a score per result, and a --min-relevance flag exposes the floor.

The default floor stays 0.0, so existing recalls behave the same as before. A higher cutoff is opt-in through relevance_floor or --min-relevance. Raising the default would quietly drop matches that current callers expect to see.

Testing

Smoke test (tests/test_cli/test_recall_score.py): soul recall --json emits a score field per result, a recall returns scored results, and --min-relevance drops a weak match.

End-to-end (tests/test_retrieval.py::TestRecallScoresAndFloor): populate a store, query it, then check that every result carries a score, results come back ordered by score, and a weak match below the floor is dropped while the strong match stays.

Results:

  • uv run pytest tests/test_retrieval.py -q — 25 passed
  • uv run pytest tests/ -q --ignore=tests/e2e — 3067 passed, 1 skipped (the skip is a known pre-existing failure, unrelated to this change)

Docs updated in this PR: memory-architecture.md (recall behavior, new "Recall scores and the graded relevance floor" section), cli-reference.md (--min-relevance flag and the score JSON field), api-reference.md (relevance_floor parameter), and the GAP-ANALYSIS.md recall row.

Closes #247

Recall had two scoring mechanisms that never met. Each store's search()
gated candidates on a binary token-overlap check (score > 0.0), so a
single shared word earned a result slot. BM25 was computed too, but only
inside compute_activation as the spread term — it never reached a result,
and recall --json had no score field at all.

This plumbs the activation score onto every recalled entry and turns the
binary gate into a graded, configurable cutoff.

- activation.py: add ActivationBreakdown and activation_breakdown(), which
  returns the activation total alongside its terms. compute_activation()
  delegates to it, so existing callers are unchanged.
- recall.py: RecallEngine.recall() scores each candidate once, writes the
  activation score to the entry's recall_score field, and drops candidates
  whose token overlap falls below relevance_floor. The floor checks raw
  overlap, not the activation total, so recency or emotion cannot lift an
  off-topic memory past it.
- search.py: DEFAULT_RELEVANCE_FLOOR and passes_relevance_floor() replace
  the bare score > 0.0 literal in each store's search().
- types.py: MemoryEntry gains recall_score, a runtime-only field
  (exclude=True) so query-dependent scores stay out of the .soul file.
- cli/main.py: soul recall --json emits a score per result; a new
  --min-relevance flag exposes the graded floor.

The default floor is 0.0, which keeps the historical "any overlap"
behaviour — raising the default would silently drop matches existing
callers expect. The graded cutoff is opt-in via relevance_floor /
--min-relevance.

Closes #247
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

Issues (must fix)

  • No evidence of local testing found. Please include terminal output or screenshots.

Heads up

  • Large PR detected (801 lines across 16 files). Consider splitting into smaller PRs.

Please update your PR to address these points.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

Security scan: review needed

Potentially dangerous code patterns detected in changed files. A maintainer should verify these are intentional and safe.### src/soul_protocol/runtime/memory/manager.py

2282:            search_strategy: Optional SearchStrategy for pluggable retrieval (v0.2.2).

src/soul_protocol/runtime/memory/recall.py

33:#   Access timestamps are updated on retrieval (strengthens future recall).

src/soul_protocol/runtime/soul.py

833:            search_strategy: Optional SearchStrategy for pluggable retrieval (v0.2.2).
1050:            search_strategy: Optional SearchStrategy for pluggable retrieval (v0.2.2).
1321:    def last_retrieval(self) -> RetrievalTrace | None:

…nce floor

A graph-augmented candidate enters the result set because it matched a
related graph entity term, not the original user query. The relevance
floor was then checked against the original query, where that candidate's
token overlap is typically ~0 — so any positive relevance_floor silently
dropped every graph-augmented result and neutered graph augmentation.

recall() now captures the candidate IDs the text search produced before
graph augmentation runs, and applies the floor only to those. Graph-
augmented entries were already validated by the entity-term search and
pass through regardless of the floor.

Also:
- passes_relevance_floor() docstring spells out the boundary semantics:
  the comparison is inclusive (score >= floor) and a negative floor is
  clamped to the historical strict-positive gate rather than honoured.
- Note the recall_score trade-off at its assignment: it mutates the live
  store object, which is safe (exclude=True, no .soul corruption) but can
  race across concurrent recalls. Copying only there would be a half-
  measure since the access-metadata reinforcement loop mutates the same
  shared objects on purpose.
- test_floor_above_all_matches_returns_empty keeps relevance_floor=1.01
  with a comment: the floor is inclusive, so 1.0 still keeps a perfect
  match — rejecting even a perfect match needs a floor just above 1.0.

Tests: add TestGraphAugmentationExemptFromFloor — a graph-augmented entry
with zero query overlap survives a 0.5 floor, while a weak direct match
is still dropped (the exemption is scoped, not a blanket bypass). Docs:
memory-architecture.md and api-reference.md note the exemption and the
inclusive-comparison semantics.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant