fix: resolve 5 memory leak issues causing heap OOM (Issue #598) #603
jlin53882 wants to merge 3 commits into CortexReach:master
…gateway blocking (issue-598)
1. store.ts: Replace unbounded Promise chain (tail-reset semaphore)
   - _updating flag + FIFO _waitQueue instead of updateQueue promise chain
   - Maintains serialized semantics, no concurrent writes
2. access-tracker.ts: Fix retry count vs delta amplification
   - Separate _retryCount map (not accumulated delta)
   - _maxRetries=5 cap, drops after exceeded
   - getById returns null -> drop silently (not retry)
   - destroy() now calls doFlush().finally() before clearing
3. embedder.ts: TTL eviction on every set() when near capacity
   - _evictExpired() scans and removes expired entries on full cache
   - Avoids unbounded growth from stale entries
4. retrieval-stats.ts: Ring buffer replaces O(n) Array.shift()
   - O(1) write, O(n) read (same as before for getStats)
   - Bounded memory, no GC pressure from shift()
5. noise-prototypes.ts: Lower DEDUP_THRESHOLD 0.95 -> 0.90
   - Reduces near-duplicate noise from accumulating in bank
   - Closer to the actual isNoise() threshold of 0.82
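Item 3 in the list above (sweeping expired entries during set() once the cache is near capacity) could look roughly like the sketch below. This is not the embedder.ts code: the class name TtlCache, the 90% sweep threshold, the injectable clock, and the oldest-entry fallback are all assumptions made for illustration.

```typescript
// Hypothetical TTL cache. Names and the 90% sweep threshold are assumptions,
// not taken from embedder.ts.
interface Entry<V> {
  value: V;
  expiresAt: number;
}

class TtlCache<V> {
  private readonly entries = new Map<string, Entry<V>>();
  private readonly maxSize: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxSize: number, ttlMs: number, now: () => number = Date.now) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get size(): number {
    return this.entries.size;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazy removal on access
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Sweep only when near capacity, so most writes skip the scan.
    if (this.entries.size >= Math.floor(this.maxSize * 0.9)) {
      this.evictExpired();
    }
    // Still full after the sweep: drop the oldest inserted entry.
    if (!this.entries.has(key) && this.entries.size >= this.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  private evictExpired(): void {
    const t = this.now();
    // Deleting the current key while iterating a Map with forEach is safe.
    this.entries.forEach((entry, key) => {
      if (entry.expiresAt <= t) this.entries.delete(key);
    });
  }
}
```

No timer is involved, so there is nothing to tear down on shutdown. The trade-off is that a write that triggers the sweep pays for a full scan of the map.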
1. retrieval-stats.ts: getStats() now uses _getRecords().length (not _records.length) for n.
   Prevents systematic underestimation of avg/p95 when ring buffer is not yet full.
2. access-tracker.ts: destroy() now wraps final flush in Promise.race with a 3s hard timeout.
   Guarantees pending/_retryCount are always cleared even if store.getById()/update() hangs indefinitely.
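To illustrate item 1 above, here is a minimal ring buffer sketch whose statistics use the number of valid samples rather than the allocated capacity. The class name LatencyRing, its fields, and the nearest-rank p95 rule are assumptions; the actual retrieval-stats.ts may differ.

```typescript
// Hypothetical ring buffer of latency samples (names are illustrative).
class LatencyRing {
  private readonly slots: number[];
  private head = 0; // index of the next write
  private count = 0; // number of valid samples, at most slots.length

  constructor(capacity: number) {
    this.slots = new Array<number>(capacity).fill(0);
  }

  // O(1) write: overwrite the oldest slot instead of calling Array.shift().
  record(ms: number): void {
    this.slots[this.head] = ms;
    this.head = (this.head + 1) % this.slots.length;
    if (this.count < this.slots.length) this.count++;
  }

  // Valid samples only, ordered oldest to newest.
  private samples(): number[] {
    const start = this.count < this.slots.length ? 0 : this.head;
    const out: number[] = [];
    for (let i = 0; i < this.count; i++) {
      const v = this.slots[(start + i) % this.slots.length];
      if (v !== undefined) out.push(v);
    }
    return out;
  }

  getStats(): { n: number; avg: number; p95: number } {
    const sorted = this.samples().sort((a, b) => a - b);
    const n = sorted.length; // valid count, not capacity
    if (n === 0) return { n: 0, avg: 0, p95: 0 };
    const sum = sorted.reduce((acc, v) => acc + v, 0);
    const p95 = sorted[Math.min(n - 1, Math.floor(n * 0.95))] ?? 0;
    return { n, avg: sum / n, p95 };
  }
}
```

If n were taken from the capacity instead, the unused zero-filled slots of a partly filled buffer would be counted as samples and pull the average down, which is the underestimation the commit message describes.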
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 810adf92c8
    } else {
      this._retryCount.set(id, retryCount);
      // Requeue with the original delta only (NOT accumulated) for next flush.
      this.pending.set(id, delta);
Merge retry delta with newly recorded accesses
When doFlush() retries a failed ID, it now does this.pending.set(id, delta), which overwrites any fresh accesses that were recorded for the same ID while the flush was in flight. In a transient store failure scenario (slow/failing getById or update plus concurrent recordAccess() calls), this drops real access events and undercounts reinforcement metadata. Requeueing should add to the current pending value instead of replacing it.
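A minimal sketch of the merge this comment asks for, assuming the flush takes a snapshot of the pending map and clears it before writing. PendingAccessCounts, takeSnapshot, and requeue are hypothetical names, not the tracker's real API.

```typescript
// Hypothetical stand-in for the tracker's pending map (names are assumptions).
class PendingAccessCounts {
  readonly pending = new Map<string, number>();

  recordAccess(id: string, delta = 1): void {
    this.pending.set(id, (this.pending.get(id) ?? 0) + delta);
  }

  // A flush takes ownership of the current counts and empties the map.
  takeSnapshot(): Map<string, number> {
    const snapshot = new Map(this.pending);
    this.pending.clear();
    return snapshot;
  }

  // On failure, add the failed delta to whatever arrived during the flush
  // instead of overwriting it.
  requeue(id: string, failedDelta: number): void {
    const arrivedDuringFlush = this.pending.get(id) ?? 0;
    this.pending.set(id, arrivedDuringFlush + failedDelta);
  }
}
```

Because each flush starts from a cleared map, adding the failed delta back does not reintroduce the amplification the PR fixes: a delta that fails repeatedly is requeued at its own value each time, and only accesses that really happened in between are added.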
    if (retryCount > this._maxRetries) {
      // Exceeded max retries — drop and log error.
      this._retryCount.delete(id);
      this.logger.error(
Guard optional logger.error before calling it
The AccessTrackerOptions logger contract only requires warn (with optional info), but this new retry-drop branch calls this.logger.error(...) unconditionally. Any caller that provides the documented minimal logger will throw at runtime once retries exceed _maxRetries, converting a handled write-back failure into an unexpected flush failure.
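One way to guard the call is to fall back to warn when error is absent. The interface below is an assumption modeled on the contract described in this comment (required warn, optional info), with an optional error added for illustration; logError is a hypothetical helper.

```typescript
// Assumed logger shape: warn is required, info and error are optional.
interface TrackerLogger {
  warn: (...args: unknown[]) => void;
  info?: (...args: unknown[]) => void;
  error?: (...args: unknown[]) => void;
}

// Hypothetical helper: use error when provided, otherwise fall back to warn.
function logError(logger: TrackerLogger, ...args: unknown[]): void {
  if (logger.error) {
    logger.error(...args);
  } else {
    logger.warn(...args);
  }
}
```

Calling the method through the logger object, rather than through a detached reference, keeps `this` intact for loggers implemented as class instances.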
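CI cli-smoke failure analysis (known issue, tracked in Issue #590, unrelated to this PR)
This PR's CI run shows …
Root cause analysis
This is a regression bug on the official master (commit …)
Problem chain:
Evidence:
Recommendation: keep this PR's fixes as they are. This CI problem should be fixed by the official maintainers (update …)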
fix: resolve 5 memory leak issues causing heap OOM (Issue #598)
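Background
Issue #598 reports that OpenClaw crashes with JavaScript heap out of memory after running for about 10 hours.
Fixes (5 issues, all put through Codex adversarial review)
Fix 1 - store.ts: unbounded Promise chain growth (CRITICAL)
Problem: runSerializedUpdate() chains previous.then(() => lock) into an unbounded Promise chain. When writes arrive faster than they complete, the chain grows without limit and V8 cannot reclaim it.
Fix: a tail-reset semaphore that removes the Promise chain entirely.
Verification: Codex confirmed that the tail-reset approach preserves the single-writer semantics of the delete+add rollback, with no concurrency risk.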
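The flag-plus-FIFO-queue idea behind Fix 1 can be sketched as follows. This is an illustration under assumptions, not the store.ts implementation: SerializedUpdater and run are hypothetical names, and only the updating flag and wait queue mirror the _updating and _waitQueue fields named in the commit message.

```typescript
// Hypothetical serializer: one task runs at a time, waiters are served FIFO.
class SerializedUpdater {
  private updating = false;
  private readonly waitQueue: Array<() => void> = [];

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.updating) {
      // Another task holds the lock: park until it is handed over.
      await new Promise<void>((resolve) => this.waitQueue.push(resolve));
    } else {
      this.updating = true;
    }
    try {
      return await task();
    } finally {
      const next = this.waitQueue.shift();
      if (next) {
        next(); // hand the lock to the next waiter; the flag stays true
      } else {
        this.updating = false; // queue drained: reset to the idle state
      }
    }
  }
}
```

Each waiter holds only its own resolver, so nothing links finished updates to later ones, and the queue is empty again once the backlog drains. Leaving the flag set during the handover means a caller arriving between two tasks still queues behind the waiter that was just released.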
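Fix 2 - access-tracker.ts: retry accumulation (HIGH)
Problem: existing + delta treats the access count as a retry count. When the same ID keeps failing, the delta keeps growing from 5 to 10 to 15.
Fix:
- _retryCount: Map<string, number> tracks the number of retries
- _maxRetries = 5 cap; the entry is dropped once the cap is exceeded
- pending.set(id, delta): always requeue only the original delta
Hidden bug: destroy() called pending.clear() directly, so the last batch of access counts was lost. It now uses Promise.race([doFlush(), timeout]) with a 3s timeout.
Fix 3 - embedder.ts: passive-only TTL cleanup (MEDIUM)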
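Problem: expired entries are deleted only on access, so rarely accessed data occupies memory forever.
Fix: when set() is called and the cache is near capacity, proactively call _evictExpired() to clear expired entries. This keeps the write cost at O(1) and adds no timer (avoiding timer leaks).
Fix 4 - retrieval-stats.ts: O(n) shift() (MEDIUM → performance optimization)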
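Problem: Array.shift() is O(n) and creates GC pressure at 1000 records.
Fix: ring buffer with O(1) writes.
Correction after Codex review: getStats() originally used _records.length (the capacity) as n, which skewed the statistics while the buffer was not yet full. It now uses _getRecords().length.
Fix 5 - noise-prototypes.ts: DEDUP too permissive (MEDIUM)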
Problem: DEDUP_THRESHOLD = 0.95 is too high and leaves a gap to the isNoise() threshold of 0.82, so similar noise keeps accumulating.
Fix: DEDUP_THRESHOLD = 0.90 (lowered by 0.05).
Codex adversarial review (two rounds)
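Round 1: design document review
Problems found in the original design draft:
- maxConcurrent=10 → would break the serialized semantics of update() → changed to keep maxConcurrent=1
- setInterval → would introduce a timer lifecycle leak → changed to opportunistic cleanup during set()
Round 2: actual code review
Two new bugs found:
- access-tracker.ts destroy(): finally cannot handle a promise that never resolves → added a 3s timeout
- retrieval-stats.ts getStats(): used _records.length instead of _count → fixed
Test results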
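The heap OOM in Issue #598 may have another root cause:
PR #430 (CLOSED, not MERGED) points out:
The fixes in PR #430 (not merged):
- _singletonState: all heavy resources are created only once
- _hookEventDedup: prevents hook handlers from accumulating without bound
These two issues (#430 + #598) are complementary. Maintainers are advised to evaluate whether they should be implemented together or handled in a follow-up PR.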
Related resources
Fixes + Codex adversarial review | 2026-04-13