feat: L1 extraction quality improvements — reduce LLM dependency by yuanrengu · Pull Request #83 · Tencent/TencentDB-Agent-Memory

yuanrengu · 2026-05-24T06:29:20Z

Closes #82

Summary

Four improvements to the L1 memory extraction pipeline to reduce LLM dependency and improve extraction quality.

1. Restore & enhance L1 quality gate (`sanitize.ts`)

Re-enable commented-out length filters (CJK ≥ 4 chars, alpha ≥ 10 chars)
Re-enable prompt injection detection
Add conversational filler filter (好的/OK/thanks/got it)
Fix CJK injection patterns (broader .{0,10} matching)
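
To make the length and filler rules above concrete, here is a minimal sketch of such a gate. It is not the contents of `sanitize.ts`: the function name `passesQualityGate`, the filler list, and the way the two thresholds are combined (pass if either is met) are assumptions made for illustration.

```typescript
// Hypothetical sketch of a length + filler gate (not the actual sanitize.ts).
const CJK_CHAR = /[\u4e00-\u9fff]/g;
const ALPHA_CHAR = /[A-Za-z]/g;

// Assumed filler list, seeded from the examples in the PR description.
const FILLER = new Set(["好的", "ok", "thanks", "got it"]);

function passesQualityGate(text: string): boolean {
  const trimmed = text.trim();
  // Drop trailing punctuation so "OK!" and "好的。" still count as filler.
  const normalized = trimmed.toLowerCase().replace(/[\s!.?,。！？，]+$/u, "");
  if (FILLER.has(normalized)) return false;

  const cjkCount = (trimmed.match(CJK_CHAR) ?? []).length;
  const alphaCount = (trimmed.match(ALPHA_CHAR) ?? []).length;
  // Assumed reading of the rule: enough CJK characters OR enough letters.
  return cjkCount >= 4 || alphaCount >= 10;
}
```

Under these assumptions, short acknowledgements and near-empty messages are dropped before they reach the LLM, while any message with at least 4 CJK characters or 10 letters is kept.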

2. Rule-based pre-extraction layer (`pre-extractor.ts` — new)

10 persona patterns: 我喜欢/我是/我的职业是/我擅长/我认为…
8 instruction patterns: 以后/记住/禁止/从现在开始/使用X语言回复…
Date + action-verb episodic detection
HIGH-confidence items are merged directly; MEDIUM-confidence hint injection is deferred to a follow-up PR
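
As a rough illustration of the rule layer described above, the sketch below matches a handful of trigger phrases and sorts results by confidence. The rule table, the `preExtract` name, and the `direct`/`hints` result shape are hypothetical; the real `pre-extractor.ts` has more patterns and may be structured differently.

```typescript
type MemoryType = "persona" | "instruction";
type Confidence = "HIGH" | "MEDIUM";

interface Rule { type: MemoryType; confidence: Confidence; pattern: RegExp; }
interface PreExtracted { type: MemoryType; confidence: Confidence; content: string; }

// Stop each match at the first sentence terminator or end of input.
const END = "(?=[。！？!?\\n]|$)";

// Illustrative rules only, built from trigger phrases listed in the summary.
const RULES: Rule[] = [
  { type: "persona", confidence: "HIGH", pattern: new RegExp(`我的职业是.{1,20}?${END}`) },
  { type: "persona", confidence: "HIGH", pattern: new RegExp(`我喜欢.{1,20}?${END}`) },
  { type: "persona", confidence: "MEDIUM", pattern: new RegExp(`我认为.{1,30}?${END}`) },
  { type: "instruction", confidence: "HIGH", pattern: new RegExp(`(?:以后|从现在开始|记住).{1,30}?${END}`) },
];

function preExtract(messages: string[]): { direct: PreExtracted[]; hints: PreExtracted[] } {
  const direct: PreExtracted[] = [];
  const hints: PreExtracted[] = [];
  for (const message of messages) {
    for (const rule of RULES) {
      const match = rule.pattern.exec(message);
      if (!match) continue;
      const item: PreExtracted = { type: rule.type, confidence: rule.confidence, content: match[0] };
      // HIGH-confidence items are candidates for direct merging; the rest are kept aside.
      (rule.confidence === "HIGH" ? direct : hints).push(item);
    }
  }
  return { direct, hints };
}
```

In this sketch `hints` is only collected, never consumed, which mirrors the PR's decision to defer hint injection.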

3. Self-correction retry on JSON parse failure (`l1-extractor.ts`)

First parse failure triggers one retry with error feedback
The LLM output is discarded only if both attempts fail
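
The retry behaviour can be sketched as follows. This is a simplified stand-in, not the PR's `callLlmExtraction`: the `LlmCall` type, the `extractWithRetry` name, the wording of the feedback prompt, and the result shape are all assumptions.

```typescript
// Hypothetical signature: takes a prompt, resolves to the raw model text.
type LlmCall = (prompt: string) => Promise<string>;

type ParseResult = { ok: true; value: unknown } | { ok: false };

async function extractWithRetry(call: LlmCall, prompt: string): Promise<ParseResult> {
  const first = await call(prompt);
  try {
    return { ok: true, value: JSON.parse(first) };
  } catch (err) {
    // Feed the parse error back to the model and try exactly once more.
    const reason = err instanceof Error ? err.message : String(err);
    const retryPrompt =
      `${prompt}\n\nYour previous reply was not valid JSON (${reason}). ` +
      `Reply again with valid JSON only.`;
    const second = await call(retryPrompt);
    try {
      return { ok: true, value: JSON.parse(second) };
    } catch {
      // Both attempts failed: the caller discards this extraction.
      return { ok: false };
    }
  }
}
```

Capping the loop at one retry bounds the extra LLM cost to a single additional call per malformed response.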

4. Post-LLM confidence check (`l1-extractor.ts`)

Source traceability: ≥30% keywords must appear in source messages
Type consistency: persona must reference 用户/我, instruction must contain AI/directive keywords
Trivial content rejection for episodic boilerplate
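
The source-traceability rule can be illustrated with a small keyword-overlap check. The tokenizer here (lowercased Latin words plus CJK bigrams) and the `isTraceable` name are assumptions; the PR does not say how `passesConfidenceCheck` derives its keywords.

```typescript
// Assumed tokenizer: lowercase Latin/digit words plus overlapping CJK bigrams.
function keywords(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]{2,}/g) ?? [];
  const cjkRuns: string[] = text.match(/[\u4e00-\u9fff]+/g) ?? [];
  const grams: string[] = [];
  for (const run of cjkRuns) {
    if (run.length === 1) grams.push(run);
    for (let i = 0; i + 1 < run.length; i++) grams.push(run.slice(i, i + 2));
  }
  return Array.from(new Set(words.concat(grams)));
}

// A memory is traceable if at least minRatio of its keywords occur in the source.
function isTraceable(memory: string, sourceMessages: string[], minRatio = 0.3): boolean {
  const kws = keywords(memory);
  if (kws.length === 0) return false;
  const source = sourceMessages.join("\n").toLowerCase();
  const hits = kws.filter((kw) => source.includes(kw)).length;
  return hits / kws.length >= minRatio;
}
```

A check like this rejects memories the model invented outright, while still tolerating rephrasing such as "我" becoming "用户".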

Additional fixes

Non-greedy regex fix in pre-extractor patterns (greedy .{0,N} → .{0,N}?)
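
The difference between the greedy and non-greedy quantifier is easy to see on a message with two sentences. The patterns and sample text below are made up for illustration and are not taken from `pre-extractor.ts`.

```typescript
// Returns the first match of `pattern` in `input`, or undefined if none.
function firstMatch(pattern: RegExp, input: string): string | undefined {
  return pattern.exec(input)?.[0];
}

const sample = "记住用中文回复。另外，记住每周五提醒我写周报。";

// Greedy: .{0,30} consumes as much as it can, so the match runs to the LAST "。".
const greedyMatch = firstMatch(/记住.{0,30}。/, sample);

// Non-greedy: .{0,30}? consumes as little as it can, so the match stops at the FIRST "。".
const lazyMatch = firstMatch(/记住.{0,30}?。/, sample);

console.log(greedyMatch); // the whole sample string
console.log(lazyMatch);   // "记住用中文回复。"
```

With the greedy form, two separate instructions would be captured as one memory; the non-greedy form keeps each match within a single sentence.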

Testing

54/55 unit tests pass.

Four optimizations to reduce LLM dependency in the memory pipeline:

1. Restore & enhance L1 quality gate (sanitize.ts)
- Re-enable commented-out length filters (CJK >= 4, alpha >= 10)
- Re-enable prompt injection detection
- Add conversational filler filter (好的/OK/thanks/got it)

2. Add rule-based pre-extraction layer (pre-extractor.ts, new)
- 10 persona patterns (喜欢/是/职业/擅长/认为)
- 8 instruction patterns (以后/记住/禁止/语言切换)
- Date+verb episodic detection
- HIGH-confidence items bypass LLM entirely; MEDIUM as hints

3. Self-correction retry on JSON parse failure (l1-extractor.ts)
- Parse failures trigger one retry with error feedback
- Reduces silent memory loss from malformed LLM output

4. Post-LLM confidence check (l1-extractor.ts)
- Source traceability: >=30% keywords must appear in source messages
- Type consistency: persona must ref user, instruction must ref AI
- Trivial content rejection: filter vague episodic statements

Fixes: non-greedy regex in pre-extractor patterns, broader CJK injection detection

YOMXXX · 2026-05-24T16:07:11Z

Reviewer triage notes from a local pass:

Verification I ran locally on this branch:

npm test ✅ (currently only 1 test file / 6 tests on this branch)
npm run build ✅
npm pack --dry-run ⚠️ produces a 655.8 kB tarball / 140 files, which would exceed the repo Size Guard limit of 512 kB if the full CI matrix runs. This likely needs the package file-list fix already used in #76 (feat(capture): make timestamps timezone configurable) and #71 (feat(recall): cap injected memory context), or a rebase after one of those lands.

Blocking / high-risk findings:

preExtractMemories(qualifiedMessages) runs before the background/new-message split. That means rule-based direct extraction can store memories from background context, even though the L1 prompt explicitly treats background messages as context-only and forbids extracting from them. This can duplicate old memories or store context as if it were new. I would move pre-extraction to newMessages only, after the split.
The PR description says MEDIUM-confidence items are passed as hints to the LLM, but the current diff only logs preResult.hints; they are not injected into formatExtractionPrompt() or otherwise used. Either wire the hints into the prompt or remove that behavior claim.
The new extraction behavior has no dedicated tests in this PR. Reviewer-critical cases worth adding before merge:
- background messages are not pre-extracted as new memories
- HIGH-confidence direct extraction is merged once and deduped
- MEDIUM hints are actually visible to the LLM prompt, if kept
- malformed JSON triggers exactly one retry
- confidence filtering does not reject valid persona/instruction memories

The idea is valuable, but I would not merge this as-is because the background extraction issue changes memory semantics.

yuanrengu · 2026-05-25T00:23:31Z

Thanks for the thorough review and for catching these issues. I believe the main points are now addressed.

preExtractMemories running before the background/new-message split — fixed

Moved the call to after the split, and now it scans only newMessages. This prevents rule-based extraction from storing background context as new memories, matching the L1 prompt's contract.

I also removed the hints-only logging block since it had no functional effect.
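
The ordering fix can be sketched like this: split first, then hand only the new messages to the rule layer. The `Message` shape, the cursor-based split rule, and the function names are hypothetical; the PR does not show how the background/new split is actually computed.

```typescript
interface Message { id: number; content: string; }

// Assumption: messages with id <= lastProcessedId were handled by an earlier
// run and now serve only as background context.
function splitByCursor(messages: Message[], lastProcessedId: number) {
  const background = messages.filter((m) => m.id <= lastProcessedId);
  const newMessages = messages.filter((m) => m.id > lastProcessedId);
  return { background, newMessages };
}

// The rule layer is injected so this sketch stays independent of any real extractor.
function extractFromNewOnly(
  messages: Message[],
  lastProcessedId: number,
  preExtract: (msgs: Message[]) => string[],
): string[] {
  const { newMessages } = splitByCursor(messages, lastProcessedId);
  // Background messages never reach the rule layer, so they cannot be re-stored.
  return preExtract(newMessages);
}
```

Passing `newMessages` explicitly, rather than filtering inside the extractor, keeps the "context-only" contract for background messages visible at the call site.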

MEDIUM-confidence hints — deferred

Removed the inaccurate claim from the PR description. The hints field and PreExtractionResult type remain as-is for a possible follow-up PR, but this PR no longer claims or implements prompt hint injection.

Package size guard — fixed

Dropped "src/" from the "files" array in package.json, matching the pattern from #76 and #71.

npm pack --dry-run now produces a 353.5 kB / 43-file tarball, down from 655.8 kB / 140 files.

Tests — added

Added src/core/record/pre-extractor.test.ts, covering:

Rule-based pre-extraction is invoked only with newMessages, so background-only content is not extracted as new memories
HIGH-confidence persona / instruction detection
Deduplication via mergeExtractedMemories, ensuring the same content from rule extraction and LLM results is merged once
Malformed JSON triggers exactly one retry in callLlmExtraction
passesConfidenceCheck accepts valid persona and instruction memories, while rejecting too-short CJK content and persona entries without a user reference

All 17 tests pass with npm test, and npm run build also passes.

Let me know if anything else needs attention.

- Move preExtractMemories to newMessages only (after background/new split) to prevent extracting memories from background context that should only serve as conversational context for the LLM
- Remove MEDIUM-confidence hints logging (hints not wired to LLM prompt; keeping types as interface for follow-up PR)
- Remove src/ from package.json files field to fix Size Guard limit (matches pattern from Tencent#76 and Tencent#71)
- Export callLlmExtraction and passesConfidenceCheck for testability
- Add pre-extractor.test.ts covering:
  - Background messages not pre-extracted
  - HIGH-confidence dedup via mergeExtractedMemories
  - Malformed JSON triggers exactly one retry
  - Confidence filtering does not reject valid persona/instruction

Maxwell-Code07 · 2026-05-25T02:38:43Z

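Thank you for the quality improvements to the L1 extraction pipeline! We will review this internally and get back to you as soon as possible.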

yuanrengu wants to merge 2 commits into Tencent:main from yuanrengu:feat/l1-quality-improvements
