fix: prevent delta-sync wedge on Tana Local API HTTP 500 poison node #86
When Tana's Local API /nodes/search returns HTTP 500 while serializing a specific node in the changed set, the whole page failed. The delta watermark only advanced on a fully successful sync, so the same poisoned request was retried every cycle and never recovered (observed stuck 8+ days). supertag-cli was unchanged in the relevant window, so the trigger is a Tana-side serializer bug.
DeltaSyncService isolates the offending node via offset/limit bisection, skips just that node (reported as poisonNodesSkipped), and advances the watermark so sync keeps progressing. A full `supertag sync index` re-captures skipped nodes from the export.
HTTP 500 is treated as a skippable poison node; 400/401/404/network errors still propagate.
DeltaSyncPoller tracks consecutive failed cycles and escalates to a loud warning after a threshold, exposed via getFailureState(). This catches the "silently wedged for days" failure mode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
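A minimal sketch of the offset/limit bisection described in this commit message, assuming a paged fetch function that throws an error carrying a numeric `status`. The names `FetchPage`, `HttpError`, `ResilientPage`, `isHttp500` and `poisonOffsets` are hypothetical and not taken from the diff; only `fetchPageResilient` and the 500-only skip rule come from this PR, and the real signature in the service may differ.

```typescript
// Hypothetical node and error shapes, for illustration only.
type TanaNode = { id: string };

class HttpError extends Error {
  readonly status: number;
  constructor(status: number) {
    super(`HTTP ${status}`);
    this.status = status;
  }
}

// Assumed shape of one paged request against the changed set.
type FetchPage = (offset: number, limit: number) => Promise<TanaNode[]>;

interface ResilientPage {
  nodes: TanaNode[];       // everything that could be fetched
  poisonOffsets: number[]; // offsets of single nodes that still returned 500
}

function isHttp500(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { status?: number }).status === 500;
}

async function fetchPageResilient(
  fetchPage: FetchPage,
  offset: number,
  limit: number,
): Promise<ResilientPage> {
  try {
    return { nodes: await fetchPage(offset, limit), poisonOffsets: [] };
  } catch (err) {
    // Only HTTP 500 is bisected; 400/401/404/network errors propagate.
    if (!isHttp500(err)) throw err;
    // A window of one node that still fails is the culprit: skip it.
    if (limit <= 1) return { nodes: [], poisonOffsets: [offset] };
    // Otherwise split the window and try each half on its own.
    const half = Math.floor(limit / 2);
    const left = await fetchPageResilient(fetchPage, offset, half);
    const right = await fetchPageResilient(fetchPage, offset + half, limit - half);
    return {
      nodes: [...left.nodes, ...right.nodes],
      poisonOffsets: [...left.poisonOffsets, ...right.poisonOffsets],
    };
  }
}
```

With one bad node in a window, the halves that do not contain it succeed on the first try, so only the path down to the bad offset costs extra requests. A caller could then add `poisonOffsets.length` to `poisonNodesSkipped` and advance the watermark as usual.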
🔍 Review: APPROVE (self-PR, can't formally approve)
Well-diagnosed production incident fix. Bisection approach to isolate poison nodes is clean and efficient (O(log PAGE_SIZE) to pin a single offender). The three-layer fix (isolation → watermark advance → failure-streak escalation) addresses the immediate wedge, the repeat-retry loop, and the silent-failure observability gap.
Observations (non-blocking)
Clean code, clear comments, solid tests. Ship it.
Reviewed by Ivy (cortex)
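The failure-streak escalation mentioned in the review could look roughly like the sketch below. It is an illustration under assumptions, not the code in the diff: the class name `FailureStreak`, the `FailureState` shape, the constructor parameters and the message wording are all hypothetical. Only the idea of counting consecutive failed cycles, escalating after a threshold, and exposing the state via `getFailureState()` comes from this PR.

```typescript
// Hypothetical state shape; the real getFailureState() return type is not shown in this PR text.
interface FailureState {
  consecutiveFailures: number;
  escalated: boolean;
}

class FailureStreak {
  private consecutiveFailures = 0;
  private readonly threshold: number;
  private readonly warn: (message: string) => void;

  // threshold and warn are assumptions; the PR does not state the actual threshold value.
  constructor(threshold: number, warn: (message: string) => void) {
    this.threshold = threshold;
    this.warn = warn;
  }

  // A successful cycle ends the streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // A failed cycle extends the streak and picks the message to emit.
  recordFailure(reason: string): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      // Loud, actionable warning instead of the same per-cycle error.
      this.warn(
        `delta-sync failed ${this.consecutiveFailures} cycles in a row; ` +
          "run 'supertag sync index' to recover",
      );
    } else {
      this.warn(`delta-sync cycle failed: ${reason}`);
    }
  }

  getFailureState(): FailureState {
    return {
      consecutiveFailures: this.consecutiveFailures,
      escalated: this.consecutiveFailures >= this.threshold,
    };
  }
}
```

Keeping the counter separate from the sync logic means a poller only has to call `recordSuccess()` or `recordFailure()` once per cycle, and anything that surfaces health can read `getFailureState()` without parsing logs.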
…e resilience (#86)
- Fix: macOS binaries shipped unsigned → SIGKILL (137) on Apple Silicon; CI now codesigns + verifies + smoke-runs all binaries (#84/#88)
- Fix: delta-sync wedge on Tana Local API HTTP 500 poison node; isolate/skip + advance watermark + failure-streak escalation (#86)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
A user reported delta-sync failing with HTTP 500 from Tana's Local API for 8+ days.
Diagnosis:
- `/nodes/search` 500s while serializing one specific node in the changed set.
- `edited.since` returns the whole changed set, so one poison node 500s the entire page.
- `RETRYABLE_STATUS_CODES` is `{502,503,504}` only, so 500 fell through as a non-retryable hard failure with no isolation or fallback.
Fix
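As a rough illustration of the status-code split this fix relies on (502/503/504 stay retryable, 500 is isolated and skipped, everything else propagates), a minimal sketch follows. The function `classifyStatus` and the `Disposition` type are hypothetical names, and representing `RETRYABLE_STATUS_CODES` as a `Set` is an assumption; only the `{502,503,504}` membership and the handling of 500 come from the PR description.

```typescript
// Membership is from the PR description; the Set representation is an assumption.
const RETRYABLE_STATUS_CODES = new Set<number>([502, 503, 504]);

// Hypothetical labels for what the sync layer does with a failed response.
type Disposition = "retry" | "isolate-and-skip" | "propagate";

function classifyStatus(status: number): Disposition {
  // Transient gateway-style errors: retry the same request.
  if (RETRYABLE_STATUS_CODES.has(status)) return "retry";
  // A 500 on a page is treated as a poison node: bisect, skip, keep paging.
  if (status === 500) return "isolate-and-skip";
  // 400/401/404 and anything else remain hard failures.
  return "propagate";
}
```

Keeping 500 out of the retry set matters here: retrying the same page would hit the same poison node every time, which is the wedge described under Problem.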
DeltaSyncService (`src/services/delta-sync.ts`)
- `fetchPageResilient()`: on HTTP 500, bisect the offset/limit window to isolate the offending node, then skip just that node and keep paging. Recurses down to `limit=1` to pin the exact culprit.
- `poisonNodesSkipped` field on `DeltaSyncResult`.
DeltaSyncPoller (`src/mcp/delta-sync-poller.ts`)
- Tracks `consecutiveFailures`; escalates to a loud, actionable warning ("failed N cycles → run `supertag sync index`") after a threshold instead of an identical per-cycle error. Exposed via `getFailureState()`.
Recovery for affected users
`supertag sync index` (full sync) bypasses the delta `/nodes/search` path entirely and re-captures any skipped node from the export.
Tests
Note: the pre-existing `Transcript CLI Commands` test is a timing/hook-timeout flake unrelated to this change (passes on retry).
🤖 Generated with Claude Code