Bug: extractMinMessages=2 + broken autoCaptureSeenTextCount accumulation → every single-turn conversation falls into the regex fallback, polluting the whole store with dirty data #417

@lintianyuan666

Plugin Version

1.1.0

OpenClaw Version

2026.3.28

Bug Description

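With smartExtraction: true and a normal extractMinMessages: 2 configuration, auto-capture sends almost every conversation to the regex fallback, which writes raw text. As a result, every memory in the store is dirty data (l0_abstract == text, with no LLM distillation).
The code has two accumulation paths that try to reach extractMinMessages, and both fail: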

───

Path A: autoCaptureSeenTextCount diffing (broken)

// index.ts lines 2169-2176
const previousSeenCount = autoCaptureSeenTextCount.get(sessionKey) ?? 0;
let newTexts = eligibleTexts; // ← on each agent_end, eligibleTexts holds only the current event's messages, not the accumulated history
if (pendingIngressTexts.length > 0) {
  newTexts = pendingIngressTexts;
} else if (previousSeenCount > 0 && eligibleTexts.length > previousSeenCount) {
  newTexts = eligibleTexts.slice(previousSeenCount); // ← never triggers, because eligibleTexts.length and previousSeenCount are both 1
}
autoCaptureSeenTextCount.set(sessionKey, eligibleTexts.length); // ← overwritten with 1 on every event, so the diffing never works

In the single-turn DM scenario:

• Event 1: eligibleTexts=1, previousSeenCount=0 → newTexts=1 → smart extraction is skipped (needs ≥2)
• Event 2: eligibleTexts=1, previousSeenCount=1 → 1 > 1 is false → newTexts=1 → skipped again
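
The two events above can be replayed in isolation. The following is a minimal sketch, not the plugin's actual code: `selectNewTexts` is a hypothetical wrapper around the quoted snippet, and the session key and message texts are made up for illustration. It assumes each agent_end event delivers exactly one eligible text and that no pending ingress texts are present.

```typescript
// Hypothetical, simplified stand-in for the Path A logic quoted from index.ts.
const autoCaptureSeenTextCount = new Map<string, number>();

function selectNewTexts(
  sessionKey: string,
  eligibleTexts: string[],
  pendingIngressTexts: string[] = [],
): string[] {
  const previousSeenCount = autoCaptureSeenTextCount.get(sessionKey) ?? 0;
  let newTexts = eligibleTexts;
  if (pendingIngressTexts.length > 0) {
    newTexts = pendingIngressTexts;
  } else if (previousSeenCount > 0 && eligibleTexts.length > previousSeenCount) {
    newTexts = eligibleTexts.slice(previousSeenCount);
  }
  // The stored count is replaced, not incremented, so it stays at 1.
  autoCaptureSeenTextCount.set(sessionKey, eligibleTexts.length);
  return newTexts;
}

const extractMinMessages = 2;
const sessionKey = "agent:main:dm-demo"; // made-up key for illustration

// Three consecutive single-turn DM events, one eligible text each (assumption).
const counts: number[] = [];
for (const text of ["first DM", "second DM", "third DM"]) {
  counts.push(selectNewTexts(sessionKey, [text]).length);
}

// True only if some event would have reached the smart-extraction threshold.
const smartExtractionEverRuns = counts.some((n) => n >= extractMinMessages);
console.log(counts, smartExtractionEverRuns); // [ 1, 1, 1 ] false
```

Under these assumptions the threshold is never reached, no matter how many single-turn events arrive.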

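Supporting logs:

08:44:28 smart-extractor: extracted 3 candidates ← historical accumulation worked once (across sessions or in a specific pattern)
08:46:41 regex fallback found 1 capturable text(s) ← everything afterwards goes through regex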

───

Path B: pendingIngressTexts cross-message accumulation (fails on cold start)

// message_received hook: accumulation entry point
const conversationKey = buildAutoCaptureConversationKeyFromIngress(channelId, conversationId);
queue.push(normalized); // ← the ingress message sent by the user

// agent_end hook: consumption exit point
const conversationKey = buildAutoCaptureConversationKeyFromSessionKey(sessionKey); // ← format: "agent:::"
const pendingIngressTexts = autoCapturePendingIngressTexts.get(conversationKey) ?? [];

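Problem: when pendingIngressTexts.length > 0, the pending queue replaces the current texts, but this can only be meaningful when previousSeenCount > 0 (otherwise the pending queue only ever contains the single ingress message that just arrived).

Moreover, the pending queue is only "considered" when previousSeenCount > 0 && eligibleTexts.length > previousSeenCount. The first conversation never has a previousSeenCount, so it always uses eligibleTexts and never reaches 2.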
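
The cold-start behaviour of Path B can be modelled with a small queue. This is a rough sketch under stated assumptions, not the plugin's implementation: `onMessageReceived`, `onAgentEnd`, and `pendingByConversation` are hypothetical names, both hooks are assumed to resolve to the same conversation key, and the queue is assumed to be drained each time agent_end consumes it (which is what would leave it holding only the message that just arrived).

```typescript
// Hypothetical model of the pending-ingress queue described for Path B.
const pendingByConversation = new Map<string, string[]>();

// Stand-in for the message_received hook: queue one ingress text.
function onMessageReceived(conversationKey: string, text: string): void {
  const queue = pendingByConversation.get(conversationKey) ?? [];
  queue.push(text);
  pendingByConversation.set(conversationKey, queue);
}

// Stand-in for the agent_end hook: prefer pending texts over eligibleTexts.
function onAgentEnd(conversationKey: string, eligibleTexts: string[]): string[] {
  const pending = pendingByConversation.get(conversationKey) ?? [];
  pendingByConversation.delete(conversationKey); // assumption: drained on consume
  return pending.length > 0 ? pending : eligibleTexts;
}

const minMessages = 2; // mirrors extractMinMessages: 2
const conversationKey = "demo-conversation"; // made-up key for illustration

// Each single-turn DM is one message_received followed by one agent_end.
const lengths: number[] = [];
for (const text of ["hello", "how are you", "bye"]) {
  onMessageReceived(conversationKey, text);
  lengths.push(onAgentEnd(conversationKey, [text]).length);
}

const reachedThreshold = lengths.some((n) => n >= minMessages);
console.log(lengths, reachedThreshold); // [ 1, 1, 1 ] false
```

If the queue were not drained, or if the two hooks built different keys, the numbers would differ. The sketch only illustrates why one ingress message per turn cannot reach a threshold of 2.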

───

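Result

| Conversation pattern | eligibleTexts | smartExtraction | regex fallback | Result |
| --- | --- | --- | --- | --- |
| Single-turn DM (1 user msg) | 1 | ❌ skipped (<2) | ✅ triggered | ⚠️ dirty data |
| Multi-turn history accumulated successfully | ≥2 | ✅ triggered | ❌ not triggered | ✅ normal |
| LLM extraction fails | ≥2 | ❌ failed | ✅ triggered | ⚠️ dirty data |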

───

Logs:
memory-pro: smart-extractor: extracted 3 candidate(s) ← smart extraction succeeded
memory-pro: smart-extractor: created [cases] Memory-lanceDB-pro dirty data issue
memory-pro: smart-extractor: created [preferences] Model preference: Yunwu GPT-4o
memory-pro: smart-extracted 2 created, 0 merged, 1 skipped ← normal
regex fallback found 1 capturable text(s) ← single-turn DM falls into the fallback
memory-lancedb-pro: auto-captured 1 memories for agent main in scope agent:main ← dirty data

Expected Behavior

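Change the semantics of extractMinMessages
Change extractMinMessages from "the number of eligible texts per turn" to "the minimum number of conversation rounds that must accumulate before smart extraction triggers", and do real cumulative counting at the session level instead of relying on the per-event diffing hack.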
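
One possible shape for such session-level accumulation is sketched below. It is an illustration only, not a proposed patch: `createRoundAccumulator`, `SessionBuffer`, and the session key are hypothetical names, and the choice to reset the buffer after each flush is an assumption. What the caller does while the result is null (for example, whether it still runs the regex fallback) is left open here.

```typescript
// Hypothetical per-session buffer: rounds seen so far plus their texts.
interface SessionBuffer {
  rounds: number;
  texts: string[];
}

// Returns a function that records one conversation round per call and hands
// back the accumulated texts once extractMinMessages rounds have been seen.
function createRoundAccumulator(extractMinMessages: number) {
  const buffers = new Map<string, SessionBuffer>();

  return function addRound(sessionKey: string, newTexts: string[]): string[] | null {
    const buf = buffers.get(sessionKey) ?? { rounds: 0, texts: [] };
    buf.rounds += 1;
    buf.texts.push(...newTexts);

    if (buf.rounds < extractMinMessages) {
      buffers.set(sessionKey, buf); // keep accumulating
      return null;
    }

    buffers.delete(sessionKey); // assumption: start fresh after a flush
    return buf.texts;
  };
}

// Usage: with a threshold of 2, the second single-turn round releases both texts.
const addRound = createRoundAccumulator(2);
const first = addRound("agent:main:dm-demo", ["first DM"]);
const second = addRound("agent:main:dm-demo", ["second DM"]);
const third = addRound("agent:main:dm-demo", ["third DM"]);
console.log(first, second, third); // null [ 'first DM', 'second DM' ] null
```

Counting rounds rather than the length of the current event's text list means the threshold can be reached even when every event carries a single message.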

Steps to Reproduce

See above.

Error Logs / Screenshots

Embedding Provider

None

OS / Platform

No response
