feat(memory): surface recall scores and add a graded relevance floor#249
Open
prakashUXtech wants to merge 2 commits into
Open
feat(memory): surface recall scores and add a graded relevance floor#249prakashUXtech wants to merge 2 commits into
prakashUXtech wants to merge 2 commits into
Conversation
Recall had two scoring mechanisms that never met. Each store's search() gated candidates on a binary token-overlap check (score > 0.0), so a single shared word earned a result slot. BM25 was computed too, but only inside compute_activation as the spread term — it never reached a result, and recall --json had no score field at all. This plumbs the activation score onto every recalled entry and turns the binary gate into a graded, configurable cutoff. - activation.py: add ActivationBreakdown and activation_breakdown(), which returns the activation total alongside its terms. compute_activation() delegates to it, so existing callers are unchanged. - recall.py: RecallEngine.recall() scores each candidate once, writes the activation score to the entry's recall_score field, and drops candidates whose token overlap falls below relevance_floor. The floor checks raw overlap, not the activation total, so recency or emotion cannot lift an off-topic memory past it. - search.py: DEFAULT_RELEVANCE_FLOOR and passes_relevance_floor() replace the bare score > 0.0 literal in each store's search(). - types.py: MemoryEntry gains recall_score, a runtime-only field (exclude=True) so query-dependent scores stay out of the .soul file. - cli/main.py: soul recall --json emits a score per result; a new --min-relevance flag exposes the graded floor. The default floor is 0.0, which keeps the historical "any overlap" behaviour — raising the default would silently drop matches existing callers expect. The graded cutoff is opt-in via relevance_floor / --min-relevance. Closes #247
Issues (must fix)
Heads up
Please update your PR to address these points. |
Security scan: review neededPotentially dangerous code patterns detected in changed files. A maintainer should verify these are intentional and safe.### src/soul_protocol/runtime/memory/manager.py src/soul_protocol/runtime/memory/recall.pysrc/soul_protocol/runtime/soul.py |
…nce floor A graph-augmented candidate enters the result set because it matched a related graph entity term, not the original user query. The relevance floor was then checked against the original query, where that candidate's token overlap is typically ~0 — so any positive relevance_floor silently dropped every graph-augmented result and neutered graph augmentation. recall() now captures the candidate IDs the text search produced before graph augmentation runs, and applies the floor only to those. Graph- augmented entries were already validated by the entity-term search and pass through regardless of the floor. Also: - passes_relevance_floor() docstring spells out the boundary semantics: the comparison is inclusive (score >= floor) and a negative floor is clamped to the historical strict-positive gate rather than honoured. - Note the recall_score trade-off at its assignment: it mutates the live store object, which is safe (exclude=True, no .soul corruption) but can race across concurrent recalls. Copying only there would be a half- measure since the access-metadata reinforcement loop mutates the same shared objects on purpose. - test_floor_above_all_matches_returns_empty keeps relevance_floor=1.01 with a comment: the floor is inclusive, so 1.0 still keeps a perfect match — rejecting even a perfect match needs a floor just above 1.0. Tests: add TestGraphAugmentationExemptFromFloor — a graph-augmented entry with zero query overlap survives a 0.5 floor, while a weak direct match is still dropped (the exemption is scoped, not a blanket bypass). Docs: memory-architecture.md and api-reference.md note the exemption and the inclusive-comparison semantics.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
soul recallscores memories two separate ways, and the two never meet. Each store'ssearch()keeps a candidate if it shares any token with the query, so one shared word is enough to win a slot. BM25 gets computed too, but only insidecompute_activationas thespreadterm, so it never reaches a result. Andrecall --jsonhas no score field at all.This PR puts the activation score on every recalled entry and turns the binary gate into a graded cutoff you can configure.
Why
Issue #247 audited the session-start
soul recallmandate. Recall already drops zero-overlap matches, so a vague query comes back empty instead of padded. The part #247 flagged is the floor itself. Today it is a binary gate: any token overlap, however slight, takes a slot. Setting a graded threshold means the score has to reach the result first, and right now it is stuck inside the scorer.Changes
activation.py:activation_breakdown()returns the activation total along with its component terms.compute_activation()now calls through to it, so existing callers see no change.recall.py:RecallEngine.recall()scores each candidate once, writes the activation score torecall_score, and drops candidates belowrelevance_floor. The floor checks raw token overlap rather than the activation total, so a recency or emotion boost cannot carry an off-topic memory past it.search.py:DEFAULT_RELEVANCE_FLOORandpasses_relevance_floor()replace the barescore > 0.0literal in each store'ssearch().types.py: addsMemoryEntry.recall_score— a runtime-only field that stays out of the.soulfile.cli/main.py:recall --jsonemits ascoreper result, and a--min-relevanceflag exposes the floor.The default floor stays
0.0, so existing recalls behave the same as before. A higher cutoff is opt-in throughrelevance_flooror--min-relevance. Raising the default would quietly drop matches that current callers expect to see.Testing
Smoke test (
tests/test_cli/test_recall_score.py):soul recall --jsonemits ascorefield per result, a recall returns scored results, and--min-relevancedrops a weak match.End-to-end (
tests/test_retrieval.py::TestRecallScoresAndFloor): populate a store, query it, then check that every result carries a score, results come back ordered by score, and a weak match below the floor is dropped while the strong match stays.Results:
uv run pytest tests/test_retrieval.py -q— 25 passeduv run pytest tests/ -q --ignore=tests/e2e— 3067 passed, 1 skipped (the skip is a known pre-existing failure, unrelated to this change)Docs updated in this PR:
memory-architecture.md(recall behavior, new "Recall scores and the graded relevance floor" section),cli-reference.md(--min-relevanceflag and thescoreJSON field),api-reference.md(relevance_floorparameter), and theGAP-ANALYSIS.mdrecall row.Closes #247