
fix: resolve 5 memory leak issues causing heap OOM (Issue #598)#603

Open
jlin53882 wants to merge 3 commits into CortexReach:master from jlin53882:fix/issue-598-memory-leak


@jlin53882
Contributor

fix: resolve 5 memory leak issues causing heap OOM (Issue #598)

Background

Issue #598 reports that OpenClaw crashes with "JavaScript heap out of memory" after running for about 10 hours.

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
[2516] 37166117 ms: Mark-Compact 4089.9 (4139.9) -> 3989.5 (4115.0) MB

Fixes (5 issues, all adversarially reviewed by Codex)


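Fix 1 (store.ts): Unbounded Promise chain growth (CRITICAL)

Problem: runSerializedUpdate() links each update with previous.then(() => lock), forming an unbounded Promise chain. When writes arrive faster than they complete, the chain grows without limit and V8 cannot reclaim it.

Fix: a tail-reset semaphore that removes the Promise chain entirely: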

// Before: unbounded chain
private updateQueue: Promise<void> = Promise.resolve();
this.updateQueue = previous.then(() => lock);

// After: boolean flag + FIFO queue
private _updating = false;
private _waitQueue: Array<() => void> = [];

// Keeps maxConcurrent=1; the delete+add serialization semantics of update() are unchanged

Verification: Codex confirmed that the tail-reset approach preserves the single-writer semantics of the delete+add rollback, with no concurrency risk.
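A minimal sketch of how a flag plus a FIFO wait queue can serialize async work without chaining promises. The class name SerializedUpdater and the method runSerialized are hypothetical and are not taken from store.ts; only the _updating and _waitQueue fields come from the description above.

```typescript
// Hypothetical sketch, not the actual store.ts implementation.
class SerializedUpdater {
  private _updating = false;
  private _waitQueue: Array<() => void> = [];

  async runSerialized<T>(task: () => Promise<T>): Promise<T> {
    if (this._updating) {
      // Another task holds the slot: park until it hands the slot over.
      await new Promise<void>((resolve) => this._waitQueue.push(resolve));
    } else {
      this._updating = true;
    }
    try {
      return await task();
    } finally {
      const next = this._waitQueue.shift();
      if (next) {
        // Hand the slot directly to the next waiter; _updating stays true,
        // so a newly arriving caller cannot jump the queue.
        next();
      } else {
        // Tail reset: the queue is drained, so nothing from past updates
        // remains referenced.
        this._updating = false;
      }
    }
  }
}
```

Each waiter holds only its own resolver, so memory use tracks the number of callers currently waiting rather than the total number of updates ever issued.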


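Fix 2 (access-tracker.ts): Retry accumulation (HIGH)

Problem: existing + delta treats the access count as a retry count. When the same ID keeps failing, the delta keeps growing (5 → 10 → 15).

Fix:

  • A separate _retryCount: Map<string, number> tracks retry attempts
  • _maxRetries = 5 cap; the entry is dropped once exceeded
  • pending.set(id, delta): always requeue only the original delta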
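The three points above can be sketched as follows. This is an illustration under assumptions: the class RetryingTracker, the injected writeBack callback, and the flush method are hypothetical stand-ins, and only _retryCount, _maxRetries, and pending.set(id, delta) come from the PR text.

```typescript
// Hypothetical sketch of bounded retries with a separate retry counter.
type WriteBack = (id: string, delta: number) => Promise<void>;

class RetryingTracker {
  readonly pending = new Map<string, number>();
  private readonly _retryCount = new Map<string, number>();
  private readonly _maxRetries = 5;
  private readonly writeBack: WriteBack;

  constructor(writeBack: WriteBack) {
    this.writeBack = writeBack;
  }

  recordAccess(id: string): void {
    this.pending.set(id, (this.pending.get(id) ?? 0) + 1);
  }

  async flush(): Promise<void> {
    const batch = Array.from(this.pending);
    this.pending.clear();
    for (const [id, delta] of batch) {
      try {
        await this.writeBack(id, delta);
        this._retryCount.delete(id);
      } catch {
        const retryCount = (this._retryCount.get(id) ?? 0) + 1;
        if (retryCount > this._maxRetries) {
          // Give up: the retry budget for this ID is exhausted.
          this._retryCount.delete(id);
        } else {
          this._retryCount.set(id, retryCount);
          // Mirrors the PR text: requeue the original delta, not a sum.
          this.pending.set(id, delta);
        }
      }
    }
  }
}
```

With a write-back that always fails, the delta for an ID stays constant across flushes and the entry disappears after the sixth failed attempt.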

Hidden bug: destroy() called pending.clear() directly, so the last batch of access counts was lost. It now runs the final flush through Promise.race([doFlush(), timeout]) with a 3s timeout.
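A minimal sketch of a final flush bounded by a hard timeout. The helpers flushWithTimeout and destroyTracker are hypothetical names; the PR only states that destroy() races doFlush() against a 3s timeout and then clears its state.

```typescript
// Hypothetical sketch: race a flush against a timer, then always clear state.
async function flushWithTimeout(
  doFlush: () => Promise<void>,
  timeoutMs: number,
): Promise<"flushed" | "timed_out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed_out">((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });
  try {
    return await Promise.race([doFlush().then(() => "flushed" as const), timeout]);
  } finally {
    // Clear the timer so it cannot keep the process alive after a fast flush.
    if (timer !== undefined) clearTimeout(timer);
  }
}

async function destroyTracker(
  pending: Map<string, number>,
  doFlush: () => Promise<void>,
  timeoutMs = 3000,
): Promise<void> {
  try {
    await flushWithTimeout(doFlush, timeoutMs);
  } catch {
    // Best effort: a failing final flush must not block shutdown.
  } finally {
    pending.clear();
  }
}
```

Because the cleanup sits behind the race rather than behind the flush alone, it still runs when the underlying store call never settles.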


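Fix 3 (embedder.ts): Passive TTL cleanup (MEDIUM)

Problem: expired entries are deleted only when accessed, so rarely used data stays in memory indefinitely.

Fix: when set() is called and the cache is nearly full, _evictExpired() is invoked to purge expired entries. Writes stay O(1) whenever the sweep is not triggered, and no timer is added (avoiding timer leaks).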
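A minimal sketch of eviction on write, assuming a Map-backed cache. The class TtlCache, its constructor parameters, and the "sweep only when at capacity" trigger are assumptions for illustration; the PR only names _evictExpired() and says it runs from set() when the cache is close to full.

```typescript
// Hypothetical sketch of a TTL cache that sweeps expired entries from set().
interface CacheEntry<V> {
  value: V;
  expiresAt: number;
}

class TtlCache<V> {
  private readonly _map = new Map<string, CacheEntry<V>>();
  private readonly maxSize: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxSize: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get size(): number {
    return this._map.size;
  }

  get(key: string): V | undefined {
    const entry = this._map.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this._map.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    if (!this._map.has(key) && this._map.size >= this.maxSize) {
      this._evictExpired();
      if (this._map.size >= this.maxSize) {
        // Nothing had expired: fall back to dropping the oldest insertion.
        const oldest = this._map.keys().next().value;
        if (oldest !== undefined) this._map.delete(oldest);
      }
    }
    this._map.delete(key); // re-insert so the key moves to the newest position
    this._map.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  private _evictExpired(): void {
    const t = this.now();
    this._map.forEach((entry, key) => {
      if (entry.expiresAt <= t) this._map.delete(key);
    });
  }
}
```

The sweep costs O(n) only on the writes that hit the capacity check, and there is no interval whose lifetime has to be managed.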


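Fix 4 (retrieval-stats.ts): O(n) shift() (MEDIUM → performance optimization)

Problem: Array.shift() is O(n), which creates GC pressure at 1000 records.

Fix: a ring buffer with O(1) writes.

Corrected after Codex review: getStats() originally used _records.length (the capacity) as n, which distorted the statistics while the buffer was not yet full. It now uses _getRecords().length.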
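A minimal sketch of a fixed-capacity ring buffer and of why n must be the live record count. RingBuffer, getRecords, and average are hypothetical names; the actual retrieval-stats.ts fields are only known from the description above.

```typescript
// Hypothetical sketch of an O(1)-write ring buffer.
class RingBuffer<T> {
  private readonly _records: Array<T | undefined>;
  private _head = 0; // index of the next write
  private _count = 0; // number of live records, at most the capacity

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this._records = new Array<T | undefined>(capacity).fill(undefined);
  }

  push(item: T): void {
    // Overwrites the oldest slot once the buffer is full; no elements shift.
    this._records[this._head] = item;
    this._head = (this._head + 1) % this._records.length;
    if (this._count < this._records.length) this._count++;
  }

  // Returns live records from oldest to newest.
  getRecords(): T[] {
    const cap = this._records.length;
    const start = (this._head - this._count + cap) % cap;
    const out: T[] = [];
    for (let i = 0; i < this._count; i++) {
      out.push(this._records[(start + i) % cap] as T);
    }
    return out;
  }
}

function average(buf: RingBuffer<number>): number {
  const records = buf.getRecords();
  const n = records.length; // live count, not the backing array's capacity
  return n === 0 ? 0 : records.reduce((sum, x) => sum + x, 0) / n;
}
```

Dividing by the capacity instead would report 7.5 for a buffer of capacity 4 holding only 10 and 20, where the average of the stored values is 15.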


Fix 5 (noise-prototypes.ts): DEDUP too permissive (MEDIUM)

Problem: DEDUP_THRESHOLD = 0.95 is too high and leaves a gap from the isNoise() threshold of 0.82, so similar noise keeps accumulating.

Fix: DEDUP_THRESHOLD = 0.90 (lowered by 0.05).
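To illustrate the effect of the threshold, here is a sketch that assumes deduplication compares cosine similarity between embedding vectors and rejects a candidate at or above the threshold. The functions cosine and addIfNovel are hypothetical, and the comparison used in noise-prototypes.ts may differ.

```typescript
// Hypothetical sketch: a candidate joins the bank only if it is not too
// similar to any existing prototype.
const DEDUP_THRESHOLD = 0.9;

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function addIfNovel(
  bank: number[][],
  candidate: number[],
  threshold: number = DEDUP_THRESHOLD,
): boolean {
  for (const prototype of bank) {
    if (cosine(prototype, candidate) >= threshold) return false; // near-duplicate
  }
  bank.push(candidate);
  return true;
}
```

Under this assumption, a candidate with similarity of about 0.92 to an existing prototype is stored when the threshold is 0.95 and rejected when it is 0.90, which is the range of near-duplicates the change targets.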


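Codex adversarial review (two rounds)

Round 1: design document review

Problems found in the original design draft:

  • ❌ The original design changed Store to maxConcurrent=10, which would break the serialization semantics of update() → corrected to keep maxConcurrent=1
  • ❌ A setInterval in EmbeddingCache would introduce a timer lifecycle leak → corrected to clean up opportunistically during set()

Round 2: code review

Two new bugs found:

  • 🔴 access-tracker.ts destroy(): finally cannot handle a promise that never resolves → added a 3s timeout
  • 🟠 retrieval-stats.ts getStats(): used _records.length instead of _count → fixed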

Test results

node --test test/access-tracker.test.mjs  → 59/59 ✅
npm run test:core-regression             → ✅

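⚠️ Suggested for joint evaluation: PR #430 (singleton state + hook dedup)

The heap OOM in Issue #598 may have another root cause.

PR #430 (CLOSED, not MERGED) points out:

During startup, OpenClaw calls register() multiple times (5× for scope init, 4× per inbound message on cache miss), which causes:

  1. Heavy resources (MemoryStore, embedder, SmartExtractor) are created repeatedly, and session Map state is lost
  2. Handlers registered through api.registerHook() accumulate without bound; a single /reset can trigger 200+ duplicate handler calls, each issuing an Ollama embedding request

The fix in PR #430 (not merged):

  • _singletonState: every heavy resource is created only once
  • _hookEventDedup: prevents unbounded accumulation of hook handlers

These two issues (#430 + #598) are complementary. Maintainers should evaluate whether to implement them together or as a follow-up PR.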


Related resources


Fixes + Codex adversarial review | 2026-04-13

1. store.ts: Replace unbounded Promise chain (tail-reset semaphore)
   - _updating flag + FIFO _waitQueue instead of updateQueue promise chain
   - Maintains serialized semantics, no concurrent writes

2. access-tracker.ts: Fix retry count vs delta amplification
   - Separate _retryCount map (not accumulated delta)
   - _maxRetries=5 cap, drops after exceeded
   - getById returns null -> drop silently (not retry)
   - destroy() now calls doFlush().finally() before clearing

3. embedder.ts: TTL eviction on every set() when near capacity
   - _evictExpired() scans and removes expired entries on full cache
   - Avoids unbounded growth from stale entries

4. retrieval-stats.ts: Ring buffer replaces O(n) Array.shift()
   - O(1) write, O(n) read (same as before for getStats)
   - Bounded memory, no GC pressure from shift()

5. noise-prototypes.ts: Lower DEDUP_THRESHOLD 0.95 -> 0.90
   - Reduces near-duplicate noise from accumulating in bank
   - Closer to the actual isNoise() threshold of 0.82
1. retrieval-stats.ts: getStats() now uses _getRecords().length
   (not _records.length) for n. Prevents systematic underestimation
   of avg/p95 when ring buffer is not yet full.

2. access-tracker.ts: destroy() now wraps final flush in Promise.race
   with a 3s hard timeout. Guarantees pending/_retryCount are always
   cleared even if store.getById()/update() hangs indefinitely.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 810adf92c8


} else {
this._retryCount.set(id, retryCount);
// Requeue with the original delta only (NOT accumulated) for next flush.
this.pending.set(id, delta);

P1: Merge retry delta with newly recorded accesses

When doFlush() retries a failed ID, it now does this.pending.set(id, delta), which overwrites any fresh accesses that were recorded for the same ID while the flush was in flight. In a transient store failure scenario (slow/failing getById or update plus concurrent recordAccess() calls), this drops real access events and undercounts reinforcement metadata. Requeueing should add to the current pending value instead of replacing it.
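The suggestion amounts to adding the retried delta to whatever is already pending for that ID. A minimal sketch, where the helper name requeue is hypothetical and pending is modelled as a plain Map:

```typescript
// Hypothetical helper: merge a retried delta with accesses recorded meanwhile.
function requeue(pending: Map<string, number>, id: string, delta: number): void {
  pending.set(id, (pending.get(id) ?? 0) + delta);
}
```

If two new accesses for an ID arrive while a flush carrying a delta of 5 is in flight and then fails, the ID ends up with 7 pending instead of being overwritten back to 5.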


if (retryCount > this._maxRetries) {
// Exceeded max retries — drop and log error.
this._retryCount.delete(id);
this.logger.error(

P2: Guard optional logger.error before calling it

The AccessTrackerOptions logger contract only requires warn (with optional info), but this new retry-drop branch calls this.logger.error(...) unconditionally. Any caller that provides the documented minimal logger will throw at runtime once retries exceed _maxRetries, converting a handled write-back failure into an unexpected flush failure.
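One way to address this is to fall back to warn when error is absent. The interface below is an assumption modelled on the comment (warn required, info optional, and error treated as optional); it is not the actual AccessTrackerOptions type, and logDrop is a hypothetical helper.

```typescript
// Hypothetical logger shape: only warn is guaranteed to exist.
interface MinimalLogger {
  warn: (...args: unknown[]) => void;
  info?: (...args: unknown[]) => void;
  error?: (...args: unknown[]) => void;
}

function logDrop(logger: MinimalLogger, message: string): void {
  // Prefer error when the caller supplied it, otherwise use warn.
  // call() preserves `this` for logger implementations that rely on it.
  (logger.error ?? logger.warn).call(logger, message);
}
```

A caller that passes only { warn } then gets the drop message through warn instead of a TypeError inside the flush path.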


@jlin53882
Contributor Author

CI cli-smoke failure analysis (known issue, tracked in Issue #590, unrelated to this PR)

The CI run for this PR shows the cli-smoke test failing with the following error:

cli-smoke.mjs:316: AssertionError: undefined !== 1
assert.equal(recallResult.details.count, 1);

Root cause analysis

This is a regression on upstream master (commit 0988a46: "skip 75ms retry when store is empty"), already tracked in Issue #590: CortexReach/memory-lancedb-pro#590

Failure chain

  1. 0988a46 adds a countStore callback parameter to retrieveWithRetry()
  2. retrieveWithRetry calls runtimeContext.store.count() internally
  3. The mock store in cli-smoke.mjs only defines async patchMetadata() {} and has no count() method
  4. Calling undefined() throws a TypeError → execution enters the catch block
  5. The result is { details: { error: "recall_failed" } } (no count field)
  6. undefined !== 1 → the test fails
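The chain above can be reproduced in isolation with a simplified stand-in. Everything in this sketch is hypothetical (recallWithCount, StoreLike, and the way count is derived); it only imitates the shape of the failure and is not the code of retrieveWithRetry() or cli-smoke.mjs.

```typescript
// Hypothetical reproduction of the failure shape described above.
interface RecallResult {
  details: { count?: number; error?: string };
}

interface StoreLike {
  patchMetadata: () => Promise<void>;
  count?: () => Promise<number>; // absent on the stale mock
}

async function recallWithCount(store: StoreLike, hits: string[]): Promise<RecallResult> {
  try {
    // On a mock without count(), this call throws a TypeError at runtime.
    const total = await store.count!();
    const count = total === 0 ? 0 : hits.length;
    return { details: { count } };
  } catch {
    // The error is swallowed, so the result carries no count field.
    return { details: { error: "recall_failed" } };
  }
}
```

With a store that only has patchMetadata, details.count is undefined, which matches the assertion message; adding a count() method to the mock makes the count field appear again.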

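Evidence

  • Nearly all of the last 30 CI runs failed (failure), including runs on other branches (fix/issue-492-v4, fix/issue-415-stale-threshold, etc.)
  • The only successful run (db:24323865547) predates the merge of 0988a46
  • The mock store in cli-smoke.mjs has not been updated since 2026-02-26 (commit f00acee)

Recommendation

Keep the fixes in this PR as they are. The CI problem should be fixed by the upstream maintainers by updating the mock store in cli-smoke.mjs to add async count() { return N; }. See: CortexReach/memory-lancedb-pro#590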
