fix(recover): re-enqueue orphan StateReady tasks #4

Merged
gocronx merged 1 commit into master from fix/recover-orphan-ready-tasks on May 8, 2026

gocronx commented on May 8, 2026

Closes #3.

What

When a process crashes between BLPOP and the state update inside PopTask, the task ID is removed from the ready list but the task in Redis is still in StateReady. On restart, recover() saw StateReady and skipped the task — assuming the ID was still on the ready list — so the task sat as an orphan until its TTL expired.
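
The crash window can be reproduced with a small in-memory model. This is only a sketch: `memStore`, `popTask`, `isOrphan`, and the `StateRunning` value are hypothetical stand-ins rather than the project's actual types, and a slice plus a map take the place of the Redis ready list and task records.

```go
package main

// State is a hypothetical task state; only StateReady is named in this PR.
type State int

const (
	StateReady State = iota
	StateRunning // assumed name for "claimed by a worker"
)

// memStore stands in for Redis: one ready list plus a task-state map.
type memStore struct {
	ready  []string
	states map[string]State
}

// popTask mimics the two-step pop. crashAfterPop simulates the process
// dying after the list pop (BLPOP) but before the state update.
func (s *memStore) popTask(crashAfterPop bool) (id string, ok bool) {
	if len(s.ready) == 0 {
		return "", false
	}
	id, s.ready = s.ready[0], s.ready[1:] // step 1: ID leaves the ready list
	if crashAfterPop {
		return "", false // crash: step 2 never runs
	}
	s.states[id] = StateRunning // step 2: state update
	return id, true
}

// isOrphan reports whether a task is StateReady but absent from the ready list.
func (s *memStore) isOrphan(id string) bool {
	if st, found := s.states[id]; !found || st != StateReady {
		return false
	}
	for _, r := range s.ready {
		if r == id {
			return false
		}
	}
	return true
}
```

Calling `popTask(true)` on a store holding a single ready task leaves that task in StateReady with an empty ready list, which is exactly the state the old recover() skipped over.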

This change has recover() pre-fetch each topic's ready list once and RPUSH any StateReady task whose ID is missing.
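
A minimal sketch of that check for a single topic, assuming the ready list and task states have already been loaded into memory. `requeueOrphans` and `StateRunning` are illustrative names, not the signatures of the real `ReadyListIDs` and `RequeueReady` helpers, and the append stands in for the RPUSH.

```go
package main

import "sort"

// State is a hypothetical task state; only StateReady is named in this PR.
type State int

const (
	StateReady State = iota
	StateRunning // assumed name for any non-ready state
)

// requeueOrphans returns the ready list with every orphan appended,
// plus the IDs it re-enqueued.
func requeueOrphans(ready []string, states map[string]State) (updated, requeued []string) {
	// Pre-fetch: build the membership set once (one LRANGE in the real store).
	onList := make(map[string]struct{}, len(ready))
	for _, id := range ready {
		onList[id] = struct{}{}
	}

	// Sort the IDs only to keep this sketch deterministic.
	ids := make([]string, 0, len(states))
	for id := range states {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	updated = append([]string(nil), ready...)
	for _, id := range ids {
		if states[id] != StateReady {
			continue // not ready: handled by other recovery paths
		}
		if _, ok := onList[id]; ok {
			continue // live task: already queued, do not duplicate
		}
		updated = append(updated, id) // orphan: push back onto the ready list
		requeued = append(requeued, id)
	}
	return updated, requeued
}
```

The set lookup is what keeps the per-task check constant-time after the single list read.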

Why this approach

  • Minimal surface area. No public API change, no schema change, no new Redis primitives. Two small store helpers (ReadyListIDs, RequeueReady) and a few lines in recover().
  • Cheap. One LRANGE per topic on startup; orphan check is O(1) per task after that.
  • Best-effort. If the LRANGE fails we fall back to "treat the set as empty" — orphans get re-enqueued (correct), live IDs get duplicated (harmless: the duplicate becomes a stale ID once the original is popped).
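
The best-effort fallback and the "stale ID" argument can be sketched as follows. All names here (`readySet`, `popReady`, `errFetch`) are hypothetical, and the sketch assumes the pop path skips IDs whose task is no longer ready, which the description implies but does not show.

```go
package main

import "errors"

// errFetch is a hypothetical stand-in for an error returned by LRANGE.
var errFetch = errors.New("ready list fetch failed")

// readySet builds the membership set from fetch, which stands in for LRANGE.
// On error it returns an empty set so recovery still proceeds.
func readySet(fetch func() ([]string, error)) map[string]struct{} {
	set := make(map[string]struct{})
	ids, err := fetch()
	if err != nil {
		return set // treat the set as empty: every StateReady task looks orphaned
	}
	for _, id := range ids {
		set[id] = struct{}{}
	}
	return set
}

// popReady pops IDs until it finds one whose task is still ready.
// An ID whose task was already claimed is stale and is dropped.
func popReady(ready []string, isReady map[string]bool) (id string, rest []string, ok bool) {
	for len(ready) > 0 {
		id, ready = ready[0], ready[1:]
		if isReady[id] {
			isReady[id] = false // the task leaves StateReady once claimed
			return id, ready, true
		}
		// Stale ID: an earlier pop already claimed this task.
	}
	return "", nil, false
}
```

If the fetch fails while a live ID is still queued, the list ends up holding that ID twice; the first pop claims the task and the second finds a stale ID and discards it, so the task runs once.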

A fully crash-safe pop (BLMOVE + processing list + reaper, or Redis Streams with consumer groups) is a larger redesign and intentionally out of scope here.

Test plan

  • TestQueue_Recover_OrphanReadyTask — plants a StateReady task with no ready-list entry, asserts it gets popped after Start
  • TestQueue_Recover_LiveReadyTaskNotDuplicated — plants a StateReady task that is on the ready list, asserts the list size stays at 1 after Start
  • Full test suite passes locally (go test -count=1 ./...)

The pre-existing flake in TestStress_TimeWheel_Concurrent_4Writers under -race is reproducible on master and unrelated — happy to look at it separately.

When a process crashes between BLPOP and the state update inside
PopTask, the task ID is removed from the ready list but the task
in Redis is still in StateReady. recover() previously skipped such
tasks on the assumption that StateReady implied ready-list membership,
leaving the task as an orphan until its TTL expired.

This change has recover() pre-fetch each topic's ready list once and
RPUSH any StateReady task whose ID is missing.

Tests cover both the orphan and live-task cases — the live-task test
guards against duplicate enqueues on restart.

gocronx merged commit a1179de into master on May 8, 2026
3 checks passed
gocronx deleted the fix/recover-orphan-ready-tasks branch on May 8, 2026, 15:17
Development

Successfully merging this pull request may close these issues.

PopTask: tasks orphaned in StateReady when process crashes mid-pop
