
fix: prevent delta-sync wedge on Tana Local API HTTP 500 poison node #86

Merged
jcfischer merged 1 commit into main from fix/delta-sync-poison-node-resilience
May 31, 2026

@jcfischer
Owner

Problem

A user reported delta-sync failing with HTTP 500 from Tana's Local API for 8+ days.

Diagnosis:

  • supertag-cli was unchanged in the relevant window (last release ~6 weeks before onset) → trigger is Tana-side: /nodes/search 500s while serializing one specific node in the changed set.
  • edited.since returns the whole changed set, so one poison node 500s the entire page.
  • The delta watermark only advanced on a fully successful sync → the same poisoned request was retried every cycle and never recovered. Permanent wedge.
  • RETRYABLE_STATUS_CODES is {502,503,504} only — 500 fell through as a non-retryable hard failure with no isolation or fallback.

Fix

DeltaSyncService (src/services/delta-sync.ts)

  • fetchPageResilient() — on HTTP 500, bisect the offset/limit window to isolate the offending node, then skip just that node and keep paging. Recurses down to limit=1 to pin the exact culprit.
  • Watermark advances when nodes were found or poison nodes skipped → sync un-wedges instead of re-requesting the same bad node forever.
  • New poisonNodesSkipped field on DeltaSyncResult.
  • 500 → skippable poison node; 400/401/404/network still propagate as real failures.
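
The bisection described for fetchPageResilient() can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from src/services/delta-sync.ts: the `TanaNode` shape, the `FetchPage` signature, the `ResilientPage` return type, and the `httpStatusOf` helper are all hypothetical, and the fetcher is assumed to throw an error carrying a numeric `status` property.

```typescript
// Hypothetical shapes for illustration; the real types may differ.
interface TanaNode {
  id: string;
}

interface ResilientPage {
  nodes: TanaNode[];
  poisonNodesSkipped: number;
}

// Assumed fetcher contract: resolves with the nodes in [offset, offset + limit)
// or rejects with an error that has a numeric `status` property.
type FetchPage = (offset: number, limit: number) => Promise<TanaNode[]>;

// Extracts an HTTP status from an unknown thrown value, if it has one.
function httpStatusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

async function fetchPageResilient(
  fetchPage: FetchPage,
  offset: number,
  limit: number,
): Promise<ResilientPage> {
  try {
    return { nodes: await fetchPage(offset, limit), poisonNodesSkipped: 0 };
  } catch (err) {
    // Only HTTP 500 is treated as a poison node; 400/401/404 and network
    // errors (no status) propagate as real failures.
    if (httpStatusOf(err) !== 500) throw err;

    // A window of one that still returns 500 is the culprit: skip it.
    if (limit <= 1) return { nodes: [], poisonNodesSkipped: 1 };

    // Otherwise split the window in half and retry each half independently.
    const half = Math.floor(limit / 2);
    const left = await fetchPageResilient(fetchPage, offset, half);
    const right = await fetchPageResilient(fetchPage, offset + half, limit - half);
    return {
      nodes: [...left.nodes, ...right.nodes],
      poisonNodesSkipped: left.poisonNodesSkipped + right.poisonNodesSkipped,
    };
  }
}
```

With a single poison node in the window, only the halves that contain it fail, so every healthy node in the page is still returned and the skip count ends up at one.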

DeltaSyncPoller (src/mcp/delta-sync-poller.ts)

  • Tracks consecutiveFailures; escalates to a loud, actionable warning ("failed N cycles → run supertag sync index") after a threshold instead of an identical per-cycle error. Exposed via getFailureState().
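
The failure-streak escalation could look something like the sketch below. It is only an illustration: the `FailureStreak` class, the `FailureState` shape returned by `getFailureState()`, the default threshold of 5, and the choice to repeat the warning on every cycle past the threshold are assumptions, since the PR description does not spell them out.

```typescript
// Hypothetical default; the PR does not state the real threshold.
const DEFAULT_FAILURE_THRESHOLD = 5;

// Assumed shape; the real getFailureState() may return something different.
interface FailureState {
  consecutiveFailures: number;
  escalated: boolean;
}

class FailureStreak {
  private consecutiveFailures = 0;
  private readonly threshold: number;
  private readonly log: (message: string) => void;

  constructor(
    threshold: number = DEFAULT_FAILURE_THRESHOLD,
    log: (message: string) => void = (message) => console.warn(message),
  ) {
    this.threshold = threshold;
    this.log = log;
  }

  // A successful cycle ends the streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // A failed cycle extends the streak. Once the threshold is reached, the
  // plain per-cycle error is replaced by an actionable warning.
  recordFailure(error: Error): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.log(
        `delta-sync failed ${this.consecutiveFailures} cycles in a row ` +
          `(${error.message}); run "supertag sync index" to recover`,
      );
    } else {
      this.log(`delta-sync cycle failed: ${error.message}`);
    }
  }

  getFailureState(): FailureState {
    return {
      consecutiveFailures: this.consecutiveFailures,
      escalated: this.consecutiveFailures >= this.threshold,
    };
  }
}
```

Keeping the counter in a small object of its own makes the reset-after-success behaviour easy to assert in a poller test without running a real sync cycle.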

Recovery for affected users

supertag sync index (full sync) bypasses the delta /nodes/search path entirely and re-captures any skipped node from the export.

Tests

  • 3 new: isolates/skips a single poison node, propagates non-500 errors, watermark advances past poison.
  • Typecheck clean; 119 delta-sync + poller tests pass.

Note: pre-existing Transcript CLI Commands test is a timing/hook-timeout flake unrelated to this change (passes on retry).

🤖 Generated with Claude Code

When Tana's Local API /nodes/search returns HTTP 500 while serializing a
specific node in the changed set, the whole page failed. The delta
watermark only advanced on a fully successful sync, so the same poisoned
request was retried every cycle and never recovered (observed stuck 8+
days). supertag-cli was unchanged in the relevant window, so the trigger
is a Tana-side serializer bug.

DeltaSyncService isolates the offending node via offset/limit bisection,
skips just that node (reported as poisonNodesSkipped), and advances the
watermark so sync keeps progressing. A full `supertag sync index`
re-captures skipped nodes from the export. HTTP 500 is treated as a
skippable poison node; 400/401/404/network errors still propagate.

DeltaSyncPoller tracks consecutive failed cycles and escalates to a loud
warning after a threshold, exposed via getFailureState(). Catches the
"silently wedged for days" failure mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jcfischer
Owner Author

🔍 Review: APPROVE (self-PR, can't formally approve)

Well-diagnosed production incident fix. Bisection approach to isolate poison nodes is clean and efficient (O(log PAGE_SIZE) to pin a single offender). The three-layer fix (isolation → watermark advance → failure-streak escalation) addresses the immediate wedge, the repeat-retry loop, and the silent-failure observability gap.

Observations (non-blocking)

  1. Poison clusters — if multiple adjacent nodes are poison (unlikely but possible), each triggers independent bisection. The MAX_PAGE_ITERATIONS cap covers pathological cases, but a cluster of ~10 would cause ~70 recursive fetch attempts. Worth a log line if >3 poison nodes are skipped in a single cycle to surface unexpected API degradation early.

  2. Watermark after poison-only cycle — advancing to Date.now() when only poison nodes were skipped (zero real nodes found) means any node that changed between old watermark and now won't appear in the next delta unless it changes again. Low practical risk (poison is rare + full sync recovers), but a comment acknowledging this trade-off would aid future readers.

  3. getFailureState() — nice API surface for future status reporting. No test exercises it directly in this PR; consider a lightweight assertion in the poller test that it resets after a successful cycle.

  4. Short-page exit with reduced flag — good catch that bisected pages shouldn't trigger the short-page end-of-results heuristic.
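
The watermark rule discussed in observation 2 can be written as a small pure function, which also gives a natural place for the suggested trade-off comment. This is a sketch under assumptions: the `CycleOutcome` shape and the `nextWatermark` name are hypothetical, and advancing to the current time in both the "nodes found" and the "poison only" case is inferred from the discussion rather than taken from the actual implementation.

```typescript
// Hypothetical summary of one completed delta-sync cycle.
interface CycleOutcome {
  nodesFound: number;
  poisonNodesSkipped: number;
}

// Returns the watermark (epoch milliseconds) to store after a cycle that
// completed without a propagated error. A cycle that throws never reaches
// this function, so its watermark stays where it was.
function nextWatermark(
  previous: number,
  outcome: CycleOutcome,
  now: number = Date.now(),
): number {
  const madeProgress = outcome.nodesFound > 0 || outcome.poisonNodesSkipped > 0;
  // Trade-off: advancing on a poison-only cycle un-wedges the sync, but the
  // skipped node will not reappear in a delta unless it changes again.
  // A full `supertag sync index` re-captures it from the export.
  return madeProgress ? now : previous;
}
```

Passing `now` as a parameter keeps the function deterministic in tests while defaulting to `Date.now()` in normal use.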

Clean code, clear comments, solid tests. Ship it.


Reviewed by Ivy (cortex)

@jcfischer jcfischer merged commit cf2f95a into main May 31, 2026
1 check passed
@jcfischer jcfischer deleted the fix/delta-sync-poison-node-resilience branch May 31, 2026 10:20
jcfischer added a commit that referenced this pull request May 31, 2026
…e resilience (#86)

- Fix: macOS binaries shipped unsigned → SIGKILL (137) on Apple Silicon; CI now codesigns + verifies + smoke-runs all binaries (#84/#88)
- Fix: delta-sync wedge on Tana Local API HTTP 500 poison node; isolate/skip + advance watermark + failure-streak escalation (#86)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>