feat: L1 extraction quality improvements — reduce LLM dependency#83

Open
yuanrengu wants to merge 2 commits into Tencent:main from yuanrengu:feat/l1-quality-improvements
Conversation

@yuanrengu yuanrengu commented May 24, 2026

Closes #82

Summary

Four improvements to the L1 memory extraction pipeline to reduce LLM dependency and improve extraction quality.


1. Restore & enhance L1 quality gate (sanitize.ts)

  • Re-enable commented-out length filters (CJK ≥ 4 chars, alpha ≥ 10 chars)
  • Re-enable prompt injection detection
  • Add conversational filler filter (好的/OK/thanks/got it)
  • Fix CJK injection patterns (broader .{0,10} matching)
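A minimal sketch of how the length and filler filters could fit together. Only the thresholds (CJK ≥ 4 chars, alpha ≥ 10 chars) and the filler examples come from this description; the function names, the rule for choosing between the CJK and alphabetic thresholds, and the punctuation handling are assumptions for illustration and are not the code in sanitize.ts. Prompt injection detection is left out.

```typescript
// Hypothetical sketch; names and the CJK-vs-alpha decision rule are assumptions.
const FILLERS: string[] = ["好的", "ok", "thanks", "got it"];

function countCjk(text: string): number {
  // Counts characters in the CJK Unified Ideographs block.
  return (text.match(/[\u4e00-\u9fff]/g) || []).length;
}

function passesQualityGate(raw: string): boolean {
  const text = raw.trim();

  // Reject pure conversational filler, ignoring case and trailing punctuation.
  const normalized = text.toLowerCase().replace(/[\s.!?。!?,,]+$/, "");
  if (FILLERS.indexOf(normalized) !== -1) return false;

  // Assumption: any CJK character means the CJK threshold applies.
  const cjk = countCjk(text);
  if (cjk > 0) return cjk >= 4;

  // Otherwise require at least 10 ASCII letters.
  const alpha = (text.match(/[A-Za-z]/g) || []).length;
  return alpha >= 10;
}
```

With these assumptions, `passesQualityGate("OK!")` is rejected as filler, while `passesQualityGate("I prefer dark mode")` passes the alphabetic threshold.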

2. Rule-based pre-extraction layer (pre-extractor.ts — new)

  • 10 persona patterns: 我喜欢/我是/我的职业是/我擅长/我认为…
  • 8 instruction patterns: 以后/记住/禁止/从现在开始/使用X语言回复…
  • Date + action-verb episodic detection
  • HIGH-confidence items are merged directly; MEDIUM-confidence hint injection is deferred to a follow-up PR
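The rule-based layer can be pictured roughly as follows. This is a sketch under stated assumptions: the four rules, the `Candidate` shape, and the `preExtract` name are hypothetical, and the regexes are simplified stand-ins for the 10 persona and 8 instruction patterns in pre-extractor.ts, which are not reproduced here.

```typescript
// Hypothetical sketch; rule set, types, and function name are illustrative only.
type MemoryType = "persona" | "instruction";
type Confidence = "HIGH" | "MEDIUM";

interface Rule { type: MemoryType; pattern: RegExp; confidence: Confidence; }
interface Candidate { type: MemoryType; content: string; confidence: Confidence; }

const RULES: Rule[] = [
  { type: "persona", pattern: /我喜欢.{1,20}/, confidence: "HIGH" },
  { type: "persona", pattern: /我的职业是.{1,20}/, confidence: "HIGH" },
  { type: "instruction", pattern: /(以后|从现在开始).{1,30}/, confidence: "HIGH" },
  { type: "instruction", pattern: /记住.{1,30}/, confidence: "MEDIUM" },
];

function preExtract(messages: string[]): { direct: Candidate[]; hints: Candidate[] } {
  const direct: Candidate[] = []; // HIGH confidence: merged without asking the LLM
  const hints: Candidate[] = [];  // MEDIUM confidence: collected but unused in this PR
  for (const message of messages) {
    for (const rule of RULES) {
      const match = rule.pattern.exec(message);
      if (!match) continue;
      const candidate: Candidate = {
        type: rule.type,
        content: match[0],
        confidence: rule.confidence,
      };
      (rule.confidence === "HIGH" ? direct : hints).push(candidate);
    }
  }
  return { direct, hints };
}
```

Keeping HIGH and MEDIUM results in separate lists mirrors the split described above: only the first list is merged directly, and the second is held back for the follow-up PR.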

3. Self-correction retry on JSON parse failure (l1-extractor.ts)

  • First parse failure triggers one retry with error feedback
  • Only truly discarded if both attempts fail
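As an illustration of the retry behavior, here is a small sketch. The `LlmCall` type, the `extractWithRetry` name, and the wording of the feedback message are assumptions; the actual logic lives in callLlmExtraction and may differ.

```typescript
// Hypothetical sketch; the model call is injected so the retry logic stands alone.
type LlmCall = (prompt: string) => Promise<string>;

async function extractWithRetry(callLlm: LlmCall, prompt: string): Promise<unknown> {
  const first = await callLlm(prompt);
  try {
    return JSON.parse(first);
  } catch (err) {
    // Feed the parse error back to the model and try exactly once more.
    const reason = err instanceof Error ? err.message : String(err);
    const retryPrompt =
      prompt +
      "\n\nYour previous reply was not valid JSON (" + reason + "). " +
      "Reply with valid JSON only.";
    const second = await callLlm(retryPrompt);
    try {
      return JSON.parse(second);
    } catch {
      return null; // both attempts failed, so the result is discarded
    }
  }
}
```

The model is called at most twice per extraction, so a persistently malformed reply cannot cause an unbounded retry loop.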

4. Post-LLM confidence check (l1-extractor.ts)

  • Source traceability: ≥30% keywords must appear in source messages
  • Type consistency: persona must reference 用户/我, instruction must contain AI/directive keywords
  • Trivial content rejection for episodic boilerplate
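The source traceability rule could look roughly like this. Only the 30% threshold comes from the description; the keyword definition (Latin words of three or more letters plus CJK character bigrams) and both function names are assumptions made for the sketch, not the tokenization used in l1-extractor.ts.

```typescript
// Hypothetical sketch; the keyword definition is an assumption.
function keywords(text: string): string[] {
  const latin: string[] = text.toLowerCase().match(/[a-z]{3,}/g) || [];
  const cjkRuns: string[] = text.match(/[\u4e00-\u9fff]+/g) || [];
  const bigrams: string[] = [];
  for (const run of cjkRuns) {
    // Overlapping two-character windows stand in for CJK word segmentation.
    for (let i = 0; i + 1 < run.length; i++) bigrams.push(run.slice(i, i + 2));
  }
  return latin.concat(bigrams);
}

function isTraceable(memory: string, sourceMessages: string[], minRatio = 0.3): boolean {
  const keys = keywords(memory);
  if (keys.length === 0) return false; // nothing to trace
  const source = sourceMessages.join("\n").toLowerCase();
  const hits = keys.filter((k) => source.indexOf(k) !== -1).length;
  return hits / keys.length >= minRatio;
}
```

A memory such as "用户喜欢喝咖啡" extracted from the message "我喜欢喝咖啡" shares four of its six bigrams with the source and passes, while a memory with no overlapping keywords is rejected as untraceable.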

Additional fixes

  • Non-greedy regex fix in pre-extractor patterns (greedy .{0,N} replaced with non-greedy .{0,N}?)
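To show why the quantifier change matters, the sketch below compares a greedy and a non-greedy bounded wildcard on the same input. The two patterns and the `capture` helper are illustrative and are not taken from pre-extractor.ts.

```typescript
// Illustrative patterns only; both stop at a full-width comma or full stop.
const greedy = /我喜欢(.{0,10})[,。]/;
const lazy = /我喜欢(.{0,10}?)[,。]/;

// Returns the first capture group, or null when the pattern does not match.
function capture(pattern: RegExp, text: string): string | null {
  const match = pattern.exec(text);
  return match ? match[1] : null;
}

const sample = "我喜欢咖啡,也喜欢茶。";
const greedyResult = capture(greedy, sample); // "咖啡,也喜欢茶"
const lazyResult = capture(lazy, sample);     // "咖啡"
```

The greedy form runs to the last delimiter within its 10-character window and swallows the second clause, whereas the non-greedy form stops at the first delimiter.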

Testing

54/55 unit tests pass.

Four optimizations to reduce LLM dependency in the memory pipeline:

1. Restore & enhance L1 quality gate (sanitize.ts)
   - Re-enable commented-out length filters (CJK >= 4, alpha >= 10)
   - Re-enable prompt injection detection
   - Add conversational filler filter (好的/OK/thanks/got it)

2. Add rule-based pre-extraction layer (pre-extractor.ts — new)
   - 10 persona patterns (喜欢/是/职业/擅长/认为)
   - 8 instruction patterns (以后/记住/禁止/语言切换)
   - Date+verb episodic detection
   - HIGH-confidence items bypass LLM entirely; MEDIUM as hints

3. Self-correction retry on JSON parse failure (l1-extractor.ts)
   - Parse failures trigger one retry with error feedback
   - Reduces silent memory loss from malformed LLM output

4. Post-LLM confidence check (l1-extractor.ts)
   - Source traceability: >=30% keywords must appear in source messages
   - Type consistency: persona must ref user, instruction must ref AI
   - Trivial content rejection: filter vague episodic statements

Fixes: non-greedy regex in pre-extractor patterns, broader CJK injection detection
@YOMXXX
Contributor

YOMXXX commented May 24, 2026

Reviewer triage notes from a local pass:

Verification I ran locally on this branch:

Blocking / high-risk findings:

  1. preExtractMemories(qualifiedMessages) runs before the background/new-message split. That means rule-based direct extraction can store memories from background context, even though the L1 prompt explicitly treats background messages as context-only and forbids extracting from them. This can duplicate old memories or store context as if it were new. I would move pre-extraction to newMessages only, after the split.

  2. The PR description says MEDIUM-confidence items are passed as hints to the LLM, but the current diff only logs preResult.hints; they are not injected into formatExtractionPrompt() or otherwise used. Either wire the hints into the prompt or remove that behavior claim.

  3. The new extraction behavior has no dedicated tests in this PR. Reviewer-critical cases worth adding before merge:

    • background messages are not pre-extracted as new memories
    • HIGH-confidence direct extraction is merged once and deduped
    • MEDIUM hints are actually visible to the LLM prompt, if kept
    • malformed JSON triggers exactly one retry
    • confidence filtering does not reject valid persona/instruction memories

The idea is valuable, but I would not merge this as-is because the background extraction issue changes memory semantics.

@yuanrengu
Author

Thanks for the thorough review and for catching these issues. I believe the main points are now addressed.

  1. preExtractMemories running before the background/new-message split: fixed

Moved the call to after the split, and now it scans only newMessages. This prevents rule-based extraction from storing background context as new memories, matching the L1 prompt's contract.

I also removed the hints-only logging block since it had no functional effect.
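The ordering change can be sketched as below. The `Message` shape, the cursor-based `splitByCursor` helper, and the injected `preExtract` callback are assumptions for illustration; the real split in l1-extractor.ts may be determined differently.

```typescript
// Hypothetical sketch; how the pipeline actually tells background from new is assumed.
interface Message { id: number; text: string; }

function splitByCursor(messages: Message[], lastProcessedId: number) {
  return {
    background: messages.filter((m) => m.id <= lastProcessedId), // context only
    newMessages: messages.filter((m) => m.id > lastProcessedId), // eligible for extraction
  };
}

function runPreExtraction(
  messages: Message[],
  lastProcessedId: number,
  preExtract: (msgs: Message[]) => string[],
): string[] {
  // Split first, then hand only the new messages to the rule-based extractor.
  const { newMessages } = splitByCursor(messages, lastProcessedId);
  return preExtract(newMessages);
}
```

Because the extractor never receives the background list, a preference stated in an earlier, already-processed message cannot be stored a second time.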

  2. MEDIUM-confidence hints: deferred

Removed the inaccurate claim from the PR description. The hints field and PreExtractionResult type remain as-is for a possible follow-up PR, but this PR no longer claims or implements prompt hint injection.

  3. Package size guard: fixed

Dropped "src/" from the "files" array in package.json, matching the pattern from #76 and #71.

npm pack --dry-run now produces a 353.5 kB / 43-file tarball, down from 655.8 kB / 140 files.

  4. Tests: added

Added src/core/record/pre-extractor.test.ts, covering:

  • Rule-based pre-extraction is invoked only with newMessages, so background-only content is not extracted as new memories
  • HIGH-confidence persona / instruction detection
  • Deduplication via mergeExtractedMemories, ensuring the same content from rule extraction and LLM results is merged once
  • Malformed JSON triggers exactly one retry in callLlmExtraction
  • passesConfidenceCheck accepts valid persona and instruction memories, while rejecting too-short CJK content and persona entries without a user reference
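For readers unfamiliar with the deduplication case, the idea can be sketched as follows. The `Memory` shape, the `mergeOnce` name, and the normalization (trim plus lowercase) are assumptions; this is not the implementation of mergeExtractedMemories.

```typescript
// Hypothetical sketch; the dedup key and normalization are assumptions.
interface Memory { type: string; content: string; }

function mergeOnce(ruleBased: Memory[], fromLlm: Memory[]): Memory[] {
  const seen: Record<string, boolean> = {};
  const merged: Memory[] = [];
  // Rule-based items come first, so they win when the LLM repeats them.
  for (const memory of ruleBased.concat(fromLlm)) {
    const key = memory.type + "\u0000" + memory.content.trim().toLowerCase();
    if (seen[key]) continue;
    seen[key] = true;
    merged.push(memory);
  }
  return merged;
}
```

In this sketch, a persona memory found by both the rules and the LLM appears once in the result, and memories that differ in type or content are all kept.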

All 17 tests pass with npm test, and npm run build also passes.

Let me know if anything else needs attention.

- Move preExtractMemories to newMessages only (after background/new split)
  to prevent extracting memories from background context that should
  only serve as conversational context for the LLM

- Remove MEDIUM-confidence hints logging (hints not wired to LLM prompt;
  keeping types as interface for follow-up PR)

- Remove src/ from package.json files field to fix Size Guard limit
  (matches pattern from Tencent#76 and Tencent#71)

- Export callLlmExtraction and passesConfidenceCheck for testability

- Add pre-extractor.test.ts covering:
  - Background messages not pre-extracted
  - HIGH-confidence dedup via mergeExtractedMemories
  - Malformed JSON triggers exactly one retry
  - Confidence filtering does not reject valid persona/instruction
@Maxwell-Code07
Copy link
Copy Markdown
Collaborator

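Thank you for the quality improvements to the L1 extraction pipeline! We will review this internally and get back to you as soon as possible.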

