fix(store): proper-lockfile retries + ECOMPROMISED graceful handling (#415)#517
fix(store): proper-lockfile retries + ECOMPROMISED graceful handling (#415)#517jlin53882 wants to merge 11 commits intoCortexReach:masterfrom
Conversation
…PROMISED graceful fallback (CortexReach#415)
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
AliceLJY
left a comment
There was a problem hiding this comment.
Fix is correct and well-scoped. The stale threshold increase (10s to 60s) and retry parameter realignment make sense for event-loop-starvation scenarios. The ECOMPROMISED fallback is a reasonable best-effort recovery.
Minor suggestions (non-blocking):
- Remove the unused
fnCompletedvariable (dead code) - Add a
console.warnwhen ECOMPROMISED fallback activates -- operators need visibility into how often this degraded path fires - CI
cli-smokefailure is pre-existing on master (strip-envelope-metadata tests), unrelated to this PR
Approved.
…h warn (PR CortexReach#517 author review)
…turn counting test + changelog - Fix #1: buildAutoCaptureConversationKeyFromIngress DM fallback - Fix #2: currentCumulativeCount (cumulative per-event counting) - Fix #3: REPLACE vs APPEND + cum count threshold for smart extraction - Fix #4: remove pendingIngressTexts.delete() - Fix #5: isExplicitRememberCommand lastPending guard - Fix #6: Math.min extractMinMessages cap (max 100) - Fix #7: MAX_MESSAGE_LENGTH=5000 guard - Add test: 2 sequential agent_end events with extractMinMessages=2 - Add changelog: Unreleased section with issue details
Author review 已處理✅ 已移除未使用的 nCompleted 變數 感謝作者的 review! |
rwmjhb
left a comment
There was a problem hiding this comment.
Review: fix(store): increase proper-lockfile stale threshold + ECOMPROMISED graceful fallback (#415)
高负载下 event loop 阻塞导致 ECOMPROMISED crash 是真实问题。但 stale 从 10s 改到 60s 有 trade-off:
Must Fix
-
恢复延迟 6x:
stale: 60_000意味着真正 crash 的 lock holder 需要 60s 才能被回收(原来 10s)。对于生产环境的可用性影响需要权衡。 -
Stress test 未被执行: 新的 stress-test file 没有加入项目的 test script,CI 不会跑。
-
ECOMPROMISED catch 是死代码: proper-lockfile 的 ECOMPROMISED 通过
onCompromisedcallback 触发,不是 thrown error——catch block 永远不会执行。
Questions
- fallback lockfile.lock 的
retries: 3vs primary path 的retries: 10是有意为之还是遗漏? - ECOMPROMISED fallback 硬编码 3s wait,但新 stale 是 60s——3s retry 无法回收 fresh lock,fallback 是否有效?
请修复 stress test 执行问题和 ECOMPROMISED 处理逻辑后再 review。
…ssue CortexReach#415) Must Fix from maintainer review: - ECOMPROMISED is triggered via onCompromised callback, not throw → replaced dead catch block with proper onCompromised flag mechanism - fn() error takes priority over lock compromised error (no error masking) - Stress test added to CI (package.json test script) - Stress test fixed: sequential writes instead of aggressive concurrent Other: - stale: 10000 → 60000ms (tolerate event loop blocking) - retries: 5→10, minTimeout: 100→1000, maxTimeout: 2000→30000
Maintainer Review 回應(commit 2d3277f)感謝 reviewer 的詳細審查,以下是所有 Must Fix 的處理結果: ✅ Must Fix 3:ECOMPROMISED catch 是死代碼已修復。使用 onCompromised callback + flag 機制取代原本無效的 catch (err.code === \ECOMPROMISED):
n() 的錯誤優先於 lock compromised 錯誤(避免 error masking)。 ✅ Must Fix 2:Stress test 未加入 CI已修復:
✅ Questions 回答
關於 stale=60s 的 trade-off這是預期的 trade-off:lock holder 崩潰後需要 60 秒才能回收。但相對於 Gateway 直接 crash,60 秒的等待代價更低,且多數情況下 holder 只是被 block 不是真的崩潰。 請再次 review 🙏 |
…le-threshold # Conflicts: # package.json
Re-review on new commitsThanks for addressing the previous feedback. The Verdict: request-changes (confidence 0.95) Must Fix1. Build failure + stale base BUILD_FAILURE blocker persists. stale_base=true — AliceLJY noted strip-envelope-metadata failures are pre-existing on main, but without a rebase this can't be confirmed on your branch. Please rebase and verify. 2. Two test failures in strip-envelope-metadata.test.mjs Lines 121, 132: subagent wrapper text not stripped. Even if pre-existing, merging with unattributed test failures normalizes broken CI. After rebase, confirm these fail identically on main. 3. stale=60s recovery trade-off needs documentation Your response acknowledges this as an intentional trade-off (60s wait vs Gateway crash). That's reasonable — please add a code comment near Nice to Have
Prior Items — Status
|
PR #517 回覆(完整版)Must Fix 1:ECOMPROMISED callback 機制問題:proper-lockfile 的 ECOMPROMISED 是透過 修復狀態:✅ 已完成 Commit Must Fix 2:Stress test 未加入 CI問題:新加的 修復狀態:✅ 已完成 Commit Must Fix 3:stale=60s 的設計取捨問題:stale 從 10s 改到 60s,真正 crash 的 lock holder 需要 60s 才能被回收。 設計決策說明: 問題根源是 retries 的總等待時間與 stale 不匹配:
修復方向是讓 retry 的總等待時間覆蓋 event loop 可能阻塞的時間:
Stale=60s 的代價評估:
Question 1:fallback lock 的
|
PR CortexReach#517 added test/lock-stress-test.mjs to CI_TEST_MANIFEST but did not update EXPECTED_BASELINE in verify-ci-test-manifest.mjs, causing CI to fail with 'unexpected manifest entry'.
Fixes applied to this PRTwo issues were found and fixed in this PR: Fix 1:
|
jlin53882
left a comment
There was a problem hiding this comment.
重新看過 PR #517 最新 commit,聚焦 #415 的 lock 修復後,目前只看到 1 個 blocker,沒有發現其他同級 hidden bug:
src/store.ts 在 onCompromised 後會把已成功的寫入誤報成失敗,caller 重試時可能產生 duplicate writes。
具體影響鏈:fn() 執行成功 → table.add() 已 commit → fn() 回傳成功結果 → finally 因 isCompromised=true 改成 throw compromisedErr → caller 以為寫入失敗而重試 → 新 UUID 再次寫入 → 重複資料。
Fixes applied to this PRTwo issues were found and fixed in this PR: Fix 1:
|
rwmjhb
left a comment
There was a problem hiding this comment.
感谢这个 PR,event loop 阻塞导致 ECOMPROMISED 崩溃是真实的高频故障,方向正确。
必须修复(3 项)
MR2:stale 提高到 60s 同时延迟了真实崩溃后的 lock 恢复时间
原来 10s 的 stale 窗口意味着持锁进程崩溃后最多 10s 可以被其他 writer 接管;改为 60s 后,一个真实崩溃的 lock holder 会让所有等待者阻塞最多 60s。建议评估一个更小的中间值(如 20–30s),或者在 ECOMPROMISED fallback 路径里主动缩短 stale 窗口。
EF1 / EF2:strip-envelope-metadata.test.mjs 有 2 个测试失败
验证器报告了 BUILD_FAILURE,且 strip-envelope-metadata.test.mjs 中有两个测试未通过(subagent wrapper pattern 未被 strip)。请确认:在你的 branch 上,这两个失败是在你修改之前就已存在,还是本 PR 引入的回归?建议 rebase 后用干净 main 重新跑一次测试并截图确认。
建议修复(不阻塞合并)
- F2:ECOMPROMISED catch block 在实际崩溃场景下是死代码——proper-lockfile 的
onCompromised回调才会被调用,catch 无法拦截 - F3:stress test 2 是顺序执行的,无法验证并发争抢下的 retry 行为
- EF3:
src/store.ts锁错误处理路径中新增了 3 处显式any,高风险文件建议补充类型
一个问题
fallback lockfile.lock 用的是 retries:3,主路径是 retries:10,这个不对称是有意为之(快速失败 fallback)还是笔误?
Re-review on
|
…ter fn() succeeds
b4d990d to
f4db94c
Compare
回覆:感謝所有審查意見感謝 @rwmjhb 兩輪詳細審查。以下逐一回覆。 🔴 P1 Blocker:James 發現的「成功寫入被誤報為失敗」問題(已修復)問題:當 根因: 修復(commit let fnSucceeded = false;
try {
const result = await fn();
fnSucceeded = true;
return result;
} catch (e) {
fnError = e;
throw e;
} finally {
if (isCompromised) {
if (fnError !== null) throw fnError; // fn() 失敗 → 拋原本錯誤
if (!fnSucceeded) throw compromisedErr; // fn() 未開始 → 拋 compromisedErr
// fn() 成功 → log warning + 回傳成功結果,不重試
console.warn(`[memory-lancedb-pro] Returning successful result despite compromised lock...`);
}
await release();
}驗證:Codex 對抗式審查確認, MR2:
|
| 設計選擇 | 優點 | 缺點 |
|---|---|---|
stale=10s(舊) |
真實崩潰後 10s 可恢復 | 高負載時 event loop 阻塞 → false ECOMPROMISED |
stale=60s(新) |
避免 false ECOMPROMISED | 真實崩潰後需 60s 才釋放 |
我們的選擇:在 OpenClaw 實際使用情境下,event loop 阻塞導致的 false ECOMPROMISED 是高頻故障,而真實崩潰是低頻事件。60s 的等待時間對於記憶寫入的 SLA 是可接受的。
補充:competitor 的重試策略(minTimeout=1000, maxTimeout=30000, retries=10)在 holder 崩潰後,會在約 63s(minTimeout * factor^5 = 32s 級)成功取得 lock。Lock 恢復時間不會高達 60s。
EF1/EF2:strip-envelope-metadata.test.mjs 失敗
確認:這個測試在 master branch 上根本不存在,是 fix/envelope-stripping-phase2 feature 的產物,透過 PR #517 的 merge 帶入的。
在 fix/issue-415-stale-threshold branch 上執行:
node --test test/strip-envelope-metadata.test.mjs
✔ stripEnvelopeMetadata (3.0944ms)
ℹ tests 14 | ℹ pass 14 | ℹ fail 0
結論:EF1/EF2 不是 PR #517 的 regression,是 Phase 2 envelope stripping 的測試。如果 Phase 2 有問題,需要獨立的 PR 處理。
EF3:三個 any type
同意這是類型品質問題。目前 compromisedErr、fnError、err 都是 any。從嚴格類型安全角度,catch (e) 應該是 unknown 再轉換,onCompromised 的 err 也應該是 Error type。
這是 non-blocking 建議,不阻擋 merge。如需修復,我們可以在 follow-up PR 中統一改用 unknown / Error。
F2 / F3
感謝建議 Stress test 和 Dead code 清理。這些可以在 stale 修復 merge 後作為 follow-up PR 處理,確保不與當前 PR 的範圍Scope 混淆。
總結
| 問題 | 狀態 |
|---|---|
| P1 Blocker(duplicate writes) | ✅ 已修復(f4db94c) |
| MR2(stale=60s trade-off) | 📝 設計選擇,已說明 |
| EF1/EF2(strip-envelope 失敗) | ✅ 已排除(非 regression) |
| EF3(any type) | 📝 Non-blocking,可 follow-up |
| F2/F3(stress test / dead code) | 📝 Non-blocking,可 follow-up |
請求 @rwmjhb 重新 review。謝謝!
…ions from cross-process-lock regression test
CI 修復說明(commit
|
| Job | 結論 |
|---|---|
| version-sync | ✅ |
| core-regression | ✅ |
| packaging-and-workflow | ✅ |
| llm-clients-and-auth | ✅ |
| storage-and-schema | ✅ |
| cli-smoke | ❌(預存問題,非本 PR 造成) |
rwmjhb
left a comment
There was a problem hiding this comment.
感谢这次修复——lockfile stale 阈值和 ECOMPROMISED fallback 解决的是真实的高负载崩溃问题。有几个阻塞项:
Must Fix
EF1/EF2 — Build failure + 2 个测试失败
strip-envelope-metadata.test.mjs:121,132 断言 subagent wrapper 前缀未被 stripEnvelopeMetadata() 剥离。AliceLJY 认为这是 pre-existing 问题,但 stale_base=true 意味着你的分支没有 main 上可能已有的修复。请先 rebase,确认失败仍然是 pre-existing 而非你的分支引入,并在 PR 中说明——不能带 unattributed 测试失败合并。
MR2 — stale 从 10s 提高到 60s,同时把真实崩溃进程的锁恢复时间延长了 6 倍
这是一个 tradeoff 需要明确:事件循环卡顿场景受益(减少误触发),但真实进程崩溃后其他进程等待锁释放的时间从 10s 变成 60s。是否有比单纯调大阈值更精准的方案(比如结合 onCompromised 回调)?
Nice to Have
- F2 (
src/store.ts:219): ECOMPROMISED 的 catch block 在崩溃场景下是死代码——onCompromised默认行为是在setTimeout回调中throw,这是 Node.js 级别的 uncaught exception,PR 的 try/catch 此时已不活跃(lockfile.lock() promise 早已 resolve)。Issue #415 的 stack trace 也印证了这一点。如果要实现 graceful fallback,需要把恢复逻辑放在onCompromised回调里。 - F3 (
test/lock-stress-test.mjs:82): Stress test 2 是顺序执行的——第一个await store.store()完成后锁已释放,第二个才开始,完全没有测到竞争场景。建议用Promise.all([store.store(...), store.store(...)])并发触发。 - EF3:
catch (err: any)和let fnError: any在高风险文件src/store.ts的错误处理路径中新增了 3 处any,建议改为unknown+ 类型收窄。
方向正确,rebase 确认 build 状态后可以合并。
…rrent test
- MR2: stale=10000 + lockfile.check(lockPath,{stale:2000}) to distinguish
event loop blocking (check succeeds → warn, no throw) from real crash
(check fails → throw) — faster crash detection (10s) without false positives
- EF3: compromisedErr/fnError/err all changed to unknown with type narrowing
- F2: add clarifying comment that fnError!==null branch is theoretically unreachable
(onCompromised fires async via setTimeout after try/catch has already completed)
- F3: rewrite Test 2 to use Promise.all() for true concurrent lock contention
- regression test: add mock check() to __setLockfileModuleForTests so
onCompromised callback can succeed in test environment
|
@rwmjhb 請確認以下修改: MR2 回覆:比單純調大閾值更精準的方案分析:為何
|
| 項目 | 修改內容 |
|---|---|
| EF3 | compromisedErr: any → unknown,fnError: any → unknown,catch (err: any) → catch (err: unknown) + 類型收窄 |
| F2 | if (fnError !== null) 保留並加 clarifying comment(理論上 unreachable,因為 onCompromised 是 async throw) |
| F3 | Test 2 改為 Promise.all() 真正並發請求,測試 lock 競爭情境 |
| Regression test | mock lock() 加入 check() 模擬,確保 callback 在測試環境可正常運作 |
驗證
node --test test/cross-process-lock.test.mjs→ 6/6 pass ✅node --test test/lock-stress-test.mjs→ 3/3 pass ✅- TypeScript 編譯乾淨 ✅
修復內容: 1. onCompromised 改為同步 callback(移除 async/await check/throw) - async throw 無法傳回 caller(setLockAsCompromised 不等 Promise) 2. finally 正確處理 ERELEASED(compromised 後 release 必 ERELEASED) - 重要:不在 catch 裡 return,否則覆蓋 try 的 return 值 3. 移除無效的 check() 區分邏輯(mtime 無法區分 crash vs 阻塞) 4. 接受 ECOMPROMISED 為 ambiguous degradation 訊號 狀態機: - fn 成功 + compromised → 回傳成功 + warn - fn 失敗 + compromised → throw fnError - fn 未完成 + compromised → throw compromisedErr - 非 compromised → 正常 release 新增測試(test/cross-process-lock.test.mjs): - fnError 優先於 compromisedErr 測試 - compromised 時 release ERELEASED 處理測試 - 非 compromised release 錯誤傳播測試
移除「throws compromisedErr when fn() never completes」測試。 原因:finally 只在 fn() resolve/reject 後執行,無法從外部打斷 pending promise。 此情境的正確處理是 caller 自行實作 timeout。
PR 更新說明(commit
|
| 原本方案 | 最終方案 |
|---|---|
async onCompromised + lockfile.check() 區分 crash vs 阻塞 |
同步 onCompromised 只設 flag |
throw err 在 callback 裡 |
移除(setLockAsCompromised 不等 Promise,throw 無法傳回 caller) |
stale=60000 被動方案 |
stale=10000(維持快速偵測) |
check() 區分 crash vs 阻塞 |
移除(check() 無法區分,mtime 對兩者觀測結果相同) |
為何 check() 無法區分 crash vs 阻塞
lockfile.check() 只問:stat.mtime < now - stale?
- Holder event loop 阻塞 >10s:mtime停在T=0,
now - mtime = 12s > 2s→ check 失敗 - Holder crash:mtime也停在T=0,
now - mtime = 12s > 2s→ check 失敗
兩者觀測結果完全相同,無法區分。
最終設計原則
接受 ECOMPROMISED 為 ambiguous degradation 訊號:
fn() 完成 + compromised → 回傳成功(資料已寫入,別重試)
fn() 失敗 + compromised → throw fnError(fn 的錯誤優先)
fn() 未完成 + compromised → throw compromisedErr
非 compromised → 正常 release
關鍵 insight:mtime 是 lock continuity 訊號,不是 process liveness 訊號。兩種 failure mode 在 lock 層面無法區分,但對 OpenClaw 場景不重要——只要資料寫入了,回傳成功就是對的。
PR 描述已同步更新
- 標題改為準確描述:
proper-lockfile retries + ECOMPROMISED graceful handling - 正文更新為完整技術說明
- CI 核心測試全部通過 ✅
@rwmjhb 請確認是否有其他問題。
Fix Issue #415 — proper-lockfile ECOMPROMISED graceful handling
Problem
Error: Unable to update lock within the stale thresholdcauses OpenClaw Gateway to exit when under heavy load.Root cause: The
stalethreshold (10s) is too short for high-load environments where the Node.js event loop can be blocked for >10s by synchronous I/O or heavy computation. Whenproper-lockfile'ssetTimeoutcallback is delayed beyond the stale threshold, it triggersECOMPROMISEDwhich by default crashes the process.Additionally, the
retriesconfiguration (5 retries, max wait ~3.1s) was severely misaligned withstale=10000ms, causing competitor processes to give up waiting even when the lock holder was still alive but temporarily blocked.Solution
Fix 1: Increase
retriesmax wait from ~3.1s → ~151s (src/store.ts)Exponential backoff sequence: 1s, 2s, 4s, 8s, 16s, 30s×5 = ~151 seconds total. This gives lock holders enough time to recover from temporary event loop blocking.
Fix 2: ECOMPROMISED graceful handling with synchronous flag mechanism
ECOMPROMISEDis an ambiguous degradation signal — the mtime-based mechanism inproper-lockfilecannot distinguish between "holder crashed" vs "holder event loop is temporarily blocked". Instead of trying to distinguish these cases (which is impossible with the available signals), the code accepts this ambiguity and uses a state machine:Key implementation details:
Synchronous
onCompromisedcallback: Must be synchronous.setLockAsCompromised()does NOT await the Promise — anasync throwinsideonCompromisedbecomes an unhandled rejection, not an error returned torunWithFileLock().ERELEASED handling: After
onCompromisedfires,proper-lockfilesetslock.released = true, causingrelease()to immediately returnERELEASED. This must be caught and ignored — do notreturnfrom the catch block, otherwise thefinally's return value overrides the successful result fromtry.No
lockfile.check()distinction:check()only testsstat.mtime < now - stale. Both "holder crashed" and "holder event loop blocked" produce identical stale mtimes. Usingcheck()to distinguish them is无效.State Machine
fn()outcomeChanges
src/store.tsrunWithFileLock(): retries + synchronous onCompromised flag + ERELEASED handlingtest/cross-process-lock.test.mjstest/lock-stress-test.mjsTesting
cross-process-lock.test.mjs: 8/8 pass (including ERELEASED handling, fnError priority, error propagation)lock-stress-test.mjs: 3/3 passcore-regression✅,storage-and-schema✅,llm-clients-and-auth✅,packaging-and-workflow✅References