fix: resolve 5 memory leak issues causing heap OOM (Issue #598) by jlin53882 · Pull Request #603 · CortexReach/memory-lancedb-pro

jlin53882 · 2026-04-13T09:09:15Z

fix: resolve 5 memory leak issues causing heap OOM (Issue #598)

Background

Issue #598 reports that OpenClaw crashes with a JavaScript heap out of memory error after running for about 10 hours.

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
[2516] 37166117 ms: Mark-Compact 4089.9 (4139.9) -> 3989.5 (4115.0) MB

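Fixes (5 issues, all put through Codex adversarial review)

Fix 1 - store.ts: unbounded Promise chain growth (CRITICAL)

Problem: runSerializedUpdate() links calls with previous.then(() => lock), forming an unbounded Promise chain. When writes arrive faster than they complete, the chain grows without limit and V8 cannot reclaim it.

Fix: a tail-reset semaphore that removes the Promise chain entirely: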

// Before: unbounded chain
private updateQueue: Promise<void> = Promise.resolve();
this.updateQueue = previous.then(() => lock);

// After: boolean flag + FIFO queue
private _updating = false;
private _waitQueue: Array<() => void> = [];

// Keeps maxConcurrent=1; the serialized delete+add semantics of update() are unchanged

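Verification: Codex confirmed that the tail-reset design preserves the single-writer semantics that the delete+add rollback relies on, with no concurrency risk.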
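The flag-plus-FIFO idea can be sketched roughly as follows. This is an illustration only, not the code in store.ts: the class name SerialGate and the methods acquire, release and runSerialized are hypothetical, and the only assumption carried over from the description is "one writer at a time, waiters served in arrival order".

```typescript
// Hypothetical sketch of a single-writer gate built from a boolean flag and a
// FIFO queue of waiters. Names are illustrative, not taken from store.ts.
class SerialGate {
  private updating = false;
  private waitQueue: Array<() => void> = [];

  private async acquire(): Promise<void> {
    if (!this.updating) {
      this.updating = true; // gate was idle: take it immediately
      return;
    }
    // Gate is busy: park until the current holder hands it over.
    await new Promise<void>((resolve) => this.waitQueue.push(resolve));
  }

  private release(): void {
    const next = this.waitQueue.shift();
    if (next) {
      next(); // hand over directly; `updating` stays true for the next holder
    } else {
      this.updating = false; // queue drained: reset to the idle state
    }
  }

  async runSerialized<T>(action: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await action();
    } finally {
      this.release(); // always release, even when the action throws
    }
  }
}
```

Because each waiter is a standalone resolver in an array, a finished call leaves nothing behind once it is shifted out, whereas a `.then()` chain keeps linking every new call to the previous promise.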

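Fix 2 - access-tracker.ts: retry accumulation (HIGH)

Problem: existing + delta treated the access count as a retry count. When the same ID kept failing, the delta kept growing: 5→10→15.

Fix:

A separate _retryCount: Map<string, number> tracks the number of retries
_maxRetries = 5 is the cap; the entry is dropped once it is exceeded
pending.set(id, delta): only the original delta is ever requeued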
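The three points above can be sketched as below. This is an assumption-laden illustration rather than the code in access-tracker.ts: RetryingFlusher, WriteFn, record and flush are hypothetical names, and the sketch assumes no new accesses are recorded while a flush is in flight.

```typescript
// Hypothetical sketch: retries are counted in their own map, so a failing ID
// is requeued with its original delta instead of an ever-growing one.
type WriteFn = (id: string, delta: number) => Promise<void>;

class RetryingFlusher {
  readonly pending = new Map<string, number>();
  private readonly retryCount = new Map<string, number>();
  private readonly write: WriteFn;
  private readonly maxRetries: number;

  constructor(write: WriteFn, maxRetries = 5) {
    this.write = write;
    this.maxRetries = maxRetries;
  }

  record(id: string, delta = 1): void {
    this.pending.set(id, (this.pending.get(id) ?? 0) + delta);
  }

  async flush(): Promise<void> {
    const batch = new Map(this.pending);
    this.pending.clear();
    for (const [id, delta] of batch) {
      try {
        await this.write(id, delta);
        this.retryCount.delete(id); // success: forget the retry history
      } catch {
        const retries = (this.retryCount.get(id) ?? 0) + 1;
        if (retries > this.maxRetries) {
          this.retryCount.delete(id); // cap exceeded: give up on this ID
          continue;
        }
        this.retryCount.set(id, retries);
        this.pending.set(id, delta); // requeue the original delta only
      }
    }
  }
}
```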

Hidden bug: destroy() called pending.clear() directly, so the last batch of access counts was lost. Fixed by running a final flush through Promise.race([doFlush(), timeout]) with a 3s timeout.
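A bounded final flush of this kind might look like the sketch below. The helper name flushWithTimeout and its return values are hypothetical; only the Promise.race pattern and the 3s default come from the description.

```typescript
// Hypothetical helper: race a flush against a hard timeout so shutdown cannot
// hang forever on a store call that never settles.
async function flushWithTimeout(
  doFlush: () => Promise<void>,
  timeoutMs = 3000,
): Promise<"flushed" | "timed_out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed_out">((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });
  try {
    return await Promise.race([
      doFlush().then(() => "flushed" as const),
      timeout,
    ]);
  } finally {
    // Clear the timer so a fast flush does not leave a live timer behind.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A destroy() built on this would await the helper inside try/finally and clear its maps in the finally branch, so the cleanup runs whether the flush finished, failed, or timed out.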

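Fix 3 - embedder.ts: passive-only TTL cleanup (MEDIUM)

Problem: expired entries were deleted only when accessed, so rarely read data stayed in memory indefinitely.

Fix: when set() is called and the cache is near capacity, _evictExpired() is called proactively to remove expired entries. Writes stay O(1) except when an eviction scan runs, and no timer is added (avoiding a timer leak).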
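The sketch below shows one way to evict expired entries from inside set() without a timer. It is not the EmbeddingCache implementation: TtlCache, its constructor parameters, the injectable clock, and the fallback of dropping the oldest insertion when nothing has expired are all assumptions made for the example.

```typescript
// Hypothetical sketch: a TTL cache that sweeps expired entries from set()
// once it is full, so cold entries do not need a timer to be reclaimed.
interface Entry<V> {
  value: V;
  expiresAt: number;
}

class TtlCache<V> {
  private readonly entries = new Map<string, Entry<V>>();
  private readonly maxSize: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxSize: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get size(): number {
    return this.entries.size;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // passive cleanup on access
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    if (!this.entries.has(key) && this.entries.size >= this.maxSize) {
      this.evictExpired(); // active cleanup, only when the cache is full
      if (this.entries.size >= this.maxSize) {
        // Nothing had expired: drop the oldest insertion (assumed policy).
        const oldest = this.entries.keys().next().value;
        if (oldest !== undefined) this.entries.delete(oldest);
      }
    }
    this.entries.delete(key); // re-inserting moves the key to the newest slot
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  private evictExpired(): void {
    const t = this.now();
    for (const [key, entry] of this.entries) {
      if (entry.expiresAt <= t) this.entries.delete(key);
    }
  }
}
```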

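Fix 4 - retrieval-stats.ts: O(n) shift() (MEDIUM → performance optimization)

Problem: Array.shift() is O(n), which creates GC pressure at 1000 records.

Fix: a ring buffer with O(1) writes.

Correction after Codex review: getStats() originally used _records.length (the capacity) as n, which skewed the statistics while the buffer was not yet full. It now uses _getRecords().length.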
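A fixed-capacity ring buffer, and the reason the divisor must be the number of valid records rather than the capacity, can be sketched as follows. RingBuffer, toArray and average are hypothetical names for illustration, not the identifiers used in retrieval-stats.ts.

```typescript
// Hypothetical sketch: O(1) writes into a preallocated array, with a separate
// count of valid entries so statistics ignore the unused slots.
class RingBuffer<T> {
  private readonly records: (T | undefined)[];
  private readonly capacity: number;
  private head = 0; // next write position
  private count = 0; // number of valid entries, never above capacity

  constructor(capacity: number) {
    this.capacity = capacity;
    this.records = new Array<T | undefined>(capacity);
  }

  push(item: T): void {
    this.records[this.head] = item; // overwrites the oldest entry once full
    this.head = (this.head + 1) % this.capacity;
    if (this.count < this.capacity) this.count++;
  }

  // Returns the valid entries, oldest first.
  toArray(): T[] {
    const out: T[] = [];
    const start = (this.head - this.count + this.capacity) % this.capacity;
    for (let i = 0; i < this.count; i++) {
      out.push(this.records[(start + i) % this.capacity] as T);
    }
    return out;
  }
}

function average(buffer: RingBuffer<number>): number {
  const values = buffer.toArray();
  if (values.length === 0) return 0;
  // Divide by the number of valid records, not by the buffer capacity.
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```

With capacity 4 and only the values 10 and 20 written, dividing by the capacity would give 7.5 instead of 15, which is the underestimation the review caught.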

Fix 5 - noise-prototypes.ts: DEDUP too lenient (MEDIUM)

Problem: DEDUP_THRESHOLD = 0.95 is too high and leaves a gap to the isNoise() threshold of 0.82, so similar noise keeps accumulating.

Fix: DEDUP_THRESHOLD = 0.90 (lowered by 0.05).
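To see why the lower threshold slows accumulation, consider the sketch below. It assumes the prototype bank deduplicates by cosine similarity and treats a candidate as a duplicate at or above the threshold; cosine and addIfNovel are hypothetical helpers, and the real comparison in noise-prototypes.ts may differ.

```typescript
// Hypothetical sketch: a candidate joins the bank only if it is not too
// similar to any prototype already stored.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function addIfNovel(
  bank: number[][],
  candidate: number[],
  dedupThreshold: number,
): boolean {
  const isDuplicate = bank.some((p) => cosine(p, candidate) >= dedupThreshold);
  if (isDuplicate) return false; // near-duplicate: bank stays the same size
  bank.push(candidate);
  return true;
}
```

A candidate with similarity of about 0.92 to an existing prototype is added under a 0.95 threshold but rejected under 0.90, so fewer near-duplicates enter the bank.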

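Codex adversarial review (two rounds)

Round 1: design document review

Problems found in the original design draft:

❌ The original design changed Store to maxConcurrent=10 → this would break the serialized semantics of update() → corrected to keep maxConcurrent=1
❌ EmbeddingCache setInterval → this would introduce a timer lifecycle leak → corrected to clean up opportunistically in set()

Round 2: actual code review

2 new bugs found:

🔴 access-tracker.ts destroy(): finally cannot handle a flush that never resolves → 3s timeout added
🟠 retrieval-stats.ts getStats(): used _records.length instead of _count → fixed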

Test results

node --test test/access-tracker.test.mjs  → 59/59 ✅
npm run test:core-regression             → ✅

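⚠️ Suggested for joint evaluation: PR #430 (singleton state + hook dedup)

The heap OOM in Issue #598 may have a second root cause.

PR #430 (CLOSED, not MERGED) points out:

During startup, OpenClaw calls register() multiple times (5× scope init, 4× per inbound message on cache-miss), which causes:

Heavy resources (MemoryStore, embedder, SmartExtractor) to be created repeatedly, and session Map state to be lost

Handlers registered via api.registerHook() to accumulate without bound; a single /reset can trigger 200+ duplicate handler calls, each of which issues an Ollama embedding request

The fix in PR #430 (not merged):

_singletonState: all heavy resources are created only once
_hookEventDedup: prevents hook handlers from accumulating without bound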
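The general shape of those two guards can be illustrated as below. This is not the code from PR #430, whose implementation is not shown here: HeavyState, getState and registerOnce are hypothetical names, and the sketch assumes hooks can be identified by a string key.

```typescript
// Hypothetical sketch: module-level state that survives repeated register()
// calls, plus a key set that makes hook registration idempotent.
interface HeavyState {
  createdAt: number;
}

let singletonState: HeavyState | undefined;
const registeredHooks = new Set<string>();

function getState(create: () => HeavyState): HeavyState {
  if (!singletonState) singletonState = create(); // built on the first call only
  return singletonState;
}

function registerOnce(key: string, register: () => void): boolean {
  if (registeredHooks.has(key)) return false; // already registered: skip
  registeredHooks.add(key);
  register();
  return true;
}
```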

These two problems (#430 + #598) are complementary. Maintainers are advised to evaluate whether they should be implemented together or handled in a follow-up PR.

Related resources

Issue #598: Memory leak issues causing heap out of memory
PR #430 (closed, not merged): fix: singleton state + handler dedup to prevent resource leak and hook accumulation
PR #143 (closed, not merged): fix: add per-ID mutex lock for update() to prevent concurrent corruption, https://github.com/CortexReach/memory-lancedb-pro/pull/143 (per-ID mutex, guards against data corruption)

Fix + Codex adversarial review | 2026-04-13

…gateway blocking (issue-598)

1. store.ts: Replace unbounded Promise chain (tail-reset semaphore)
- _updating flag + FIFO _waitQueue instead of updateQueue promise chain
- Maintains serialized semantics, no concurrent writes
2. access-tracker.ts: Fix retry count vs delta amplification
- Separate _retryCount map (not accumulated delta)
- _maxRetries=5 cap, drops after exceeded
- getById returns null -> drop silently (not retry)
- destroy() now calls doFlush().finally() before clearing
3. embedder.ts: TTL eviction on every set() when near capacity
- _evictExpired() scans and removes expired entries on full cache
- Avoids unbounded growth from stale entries
4. retrieval-stats.ts: Ring buffer replaces O(n) Array.shift()
- O(1) write, O(n) read (same as before for getStats)
- Bounded memory, no GC pressure from shift()
5. noise-prototypes.ts: Lower DEDUP_THRESHOLD 0.95 -> 0.90
- Reduces near-duplicate noise from accumulating in bank
- Closer to the actual isNoise() threshold of 0.82

1. retrieval-stats.ts: getStats() now uses _getRecords().length (not _records.length) for n. Prevents systematic underestimation of avg/p95 when ring buffer is not yet full.
2. access-tracker.ts: destroy() now wraps final flush in Promise.race with a 3s hard timeout. Guarantees pending/_retryCount are always cleared even if store.getById()/update() hangs indefinitely.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 810adf92c8

chatgpt-codex-connector · 2026-04-13T09:14:04Z

src/access-tracker.ts

+        } else {
+          this._retryCount.set(id, retryCount);
+          // Requeue with the original delta only (NOT accumulated) for next flush.
+          this.pending.set(id, delta);


Merge retry delta with newly recorded accesses

When doFlush() retries a failed ID, it now does this.pending.set(id, delta), which overwrites any fresh accesses that were recorded for the same ID while the flush was in flight. In a transient store failure scenario (slow/failing getById or update plus concurrent recordAccess() calls), this drops real access events and undercounts reinforcement metadata. Requeueing should add to the current pending value instead of replacing it.
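The suggested change amounts to adding the failed delta to whatever is pending at requeue time. A minimal sketch, with requeueMerged as a hypothetical helper name:

```typescript
// Hypothetical helper: merge a failed delta back into the pending map instead
// of overwriting accesses that were recorded while the flush was in flight.
function requeueMerged(
  pending: Map<string, number>,
  id: string,
  failedDelta: number,
): void {
  pending.set(id, (pending.get(id) ?? 0) + failedDelta);
}
```

If 2 new accesses arrived for an ID during a flush whose delta of 5 failed, the pending value becomes 7 rather than 5.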

chatgpt-codex-connector · 2026-04-13T09:14:04Z

src/access-tracker.ts

+        if (retryCount > this._maxRetries) {
+          // Exceeded max retries — drop and log error.
+          this._retryCount.delete(id);
+          this.logger.error(


Guard optional logger.error before calling it

The AccessTrackerOptions logger contract only requires warn (with optional info), but this new retry-drop branch calls this.logger.error(...) unconditionally. Any caller that provides the documented minimal logger will throw at runtime once retries exceed _maxRetries, converting a handled write-back failure into an unexpected flush failure.
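One way to honor a warn-only logger is to fall back to warn when error is absent. The sketch below assumes a logger shape in which warn is required and info and error are optional; TrackerLogger and logError are hypothetical names, and the actual AccessTrackerOptions type may be declared differently.

```typescript
// Hypothetical logger shape: only `warn` is guaranteed to exist.
interface TrackerLogger {
  warn: (message: string) => void;
  info?: (message: string) => void;
  error?: (message: string) => void;
}

// Use `error` when the caller supplied it, otherwise fall back to `warn`.
function logError(logger: TrackerLogger, message: string): void {
  if (logger.error) {
    logger.error(message);
  } else {
    logger.warn(message);
  }
}
```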

jlin53882 · 2026-04-13T09:26:20Z

CI cli-smoke failure analysis (known issue, tracked in Issue #590, unrelated to this PR)

The CI run for this PR shows the cli-smoke test failing with the following error:

cli-smoke.mjs:316: AssertionError: undefined !== 1
assert.equal(recallResult.details.count, 1);

Root cause analysis

This is a regression on upstream master (commit 0988a46: "skip 75ms retry when store is empty"), already tracked in Issue #590: CortexReach/memory-lancedb-pro#590.

Failure chain:

0988a46 added a countStore callback parameter to retrieveWithRetry()
retrieveWithRetry calls runtimeContext.store.count() internally
The mock store in cli-smoke.mjs only has async patchMetadata() {} and no count() method
Calling undefined() → TypeError → enters the catch block
Returns { details: { error: "recall_failed" } } (no count field)
undefined !== 1 → the test fails

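Evidence:

Almost all of the last 30 CI runs failed (failure), including runs on other branches (fix/issue-492-v4, fix/issue-415-stale-threshold, etc.)
The only successful run (db:24323865547) predates the merge of 0988a46
The mock store in cli-smoke.mjs has not been updated since 2026-02-26 (commit f00acee)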

Recommendation

Keep this PR's fixes as they are. This CI problem should be fixed by the upstream maintainers (by updating the mock store in cli-smoke.mjs to add async count() { return N; }). See: CortexReach/memory-lancedb-pro#590
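The shape of such a mock might look like the sketch below. makeMockStore is a hypothetical factory, and the value returned by count() is left to the caller because the comment above does not say what N should be.

```typescript
// Hypothetical mock store: adds the count() method that the failure chain
// above shows retrieveWithRetry() now calls.
function makeMockStore(recordCount: number) {
  return {
    async patchMetadata(): Promise<void> {},
    async count(): Promise<number> {
      return recordCount;
    },
  };
}
```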

jlin53882 added 3 commits April 13, 2026 17:08

fix: skip before_prompt_build hooks for subagent sessions to prevent …

cd695ba

…gateway blocking (issue-598)

jlin53882 wants to merge 3 commits into CortexReach:master from jlin53882:fix/issue-598-memory-leak
