Symptom
TestStress_TimeWheel_Concurrent_4Writers is flaky under -race on master. Repro on my machine (Apple Silicon, Go 1.26.2, Redis 7):
$ go test -count=3 -race -timeout 360s -run "TestStress_TimeWheel_Concurrent_4Writers" .
--- FAIL: TestStress_TimeWheel_Concurrent_4Writers (11.27s)
benchmark_test.go:311: timeout: fired 999998/1000000
--- FAIL: TestStress_TimeWheel_Concurrent_4Writers (11.34s)
benchmark_test.go:311: timeout: fired 999999/1000000
3 runs → 2 failures, both within 1–2 entries of the full 1,000,000. Without -race it's much more reliable but I've still seen 999999/1000000 once. TestStress_TimeWheel_1M_Add exhibits the same pattern, less frequently.
Likely cause (rough hypothesis)
The test fires 1M tasks with delays from 1ms to 1000ms; the ticker interval is 1ms and the deadline is 10s. Under -race the seqflow handler goroutine is slower, so the last few entries can land in slots that the cursor has already passed in the same tick window: effectively round=0 entries enqueued just after slot.entries was processed. The post-drain block at timewheel.go:218-238 is supposed to catch exactly this, but it only catches adds that arrive during the same Handle call. It misses adds that the producer queued just before the slot was processed but that the handler had not yet drained.
I haven't verified this — putting it down so we have a starting point.
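To make the hypothesis concrete, here is a toy, single-threaded model of a wheel. It is not the code in timewheel.go: the wheel type, place, advance and demo are all hypothetical names invented for this sketch, and the real handler may treat a missed slot differently. It only shows what happens if a round=0 entry is stored in a slot after the cursor has already processed it.

```go
package main

// entry is a toy task: it fires when the cursor reaches its slot with rounds == 0.
type entry struct {
	id     int
	rounds int
}

// wheel is a deliberately simplified model (assumption: NOT the real structure).
type wheel struct {
	slots  [][]entry
	cursor int
}

func newWheel(n int) *wheel { return &wheel{slots: make([][]entry, n)} }

// place stores an entry in the given slot without looking at the cursor.
func (w *wheel) place(slot int, e entry) {
	w.slots[slot] = append(w.slots[slot], e)
}

// advance moves the cursor one slot forward, processes that slot and
// returns the ids that fired on this tick.
func (w *wheel) advance() []int {
	w.cursor = (w.cursor + 1) % len(w.slots)
	var fired []int
	kept := w.slots[w.cursor][:0]
	for _, e := range w.slots[w.cursor] {
		if e.rounds == 0 {
			fired = append(fired, e.id)
		} else {
			e.rounds--
			kept = append(kept, e)
		}
	}
	w.slots[w.cursor] = kept
	return fired
}

// ticksUntilFired advances the wheel until id fires, or returns -1 after maxTicks.
func ticksUntilFired(w *wheel, id, maxTicks int) int {
	for i := 1; i <= maxTicks; i++ {
		for _, f := range w.advance() {
			if f == id {
				return i
			}
		}
	}
	return -1
}

// demo compares an add applied immediately with one applied only after the
// target slot has been processed. It expects 0 < delay < slots.
func demo(slots, delay int) (onTime, late int) {
	a := newWheel(slots)
	a.place((a.cursor+delay)%slots, entry{id: 1})
	onTime = ticksUntilFired(a, 1, 4*slots)

	b := newWheel(slots)
	target := (b.cursor + delay) % slots // slot computed by the producer
	for i := 0; i < delay; i++ {
		b.advance() // the handler reaches and processes the target slot first
	}
	b.place(target, entry{id: 1}) // the add is applied only now
	late = delay + ticksUntilFired(b, 1, 4*slots)
	return onTime, late
}
```

In this model demo(8, 3) returns 3 and 11: the late entry is not lost, it fires one full rotation later. Whether the real wheel behaves the same way, or drops the entry entirely, is what the longer-deadline experiment would tell us.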
Suggestions
- Bump the test deadline from 10s to e.g. 30s and see whether the missing entries eventually fire (if so, it's a slowness issue, not a correctness one)
- If they never fire, it's a real off-by-one in the wheel handler that's worth fixing
- Either way, gating the stress test behind -tags=stress or a SEQDELAY_STRESS=1 env var would keep CI green while preserving the assertion locally
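A minimal sketch of the env-var gate from the last suggestion. The helper names stressEnabled and skipUnlessStress and the example test are hypothetical, and the package clause is a placeholder; in the repo these would live in the same package as benchmark_test.go.

```go
package main // placeholder: use the package that contains benchmark_test.go

import (
	"os"
	"testing"
)

// stressEnabled reports whether stress tests were opted into.
// getenv is injected so the check can be exercised without touching the
// real environment; only the literal value "1" enables the tests.
func stressEnabled(getenv func(string) string) bool {
	return getenv("SEQDELAY_STRESS") == "1"
}

// skipUnlessStress skips the calling test unless SEQDELAY_STRESS=1 is set.
func skipUnlessStress(t testing.TB) {
	t.Helper()
	if !stressEnabled(os.Getenv) {
		t.Skip("stress test skipped; set SEQDELAY_STRESS=1 to run it")
	}
}

// TestStress_Example is a hypothetical stand-in showing where the gate goes.
func TestStress_Example(t *testing.T) {
	skipUnlessStress(t)
	// ... existing stress test body, assertions unchanged ...
}
```

With this in place, CI runs plain go test and skips the stress tests, while SEQDELAY_STRESS=1 go test -race -run TestStress . still runs them locally. The build-tag variant would instead put //go:build stress at the top of the stress test file and be run with go test -tags=stress.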
Out of scope
Not blocking #4 — that PR doesn't touch the wheel and the failure reproduces on master.