Move waitForShardReady outside transaction in finishMoveKeys#12981
Move waitForShardReady outside transaction in finishMoveKeys#12981saintstack wants to merge 6 commits into
Conversation
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
There was a problem hiding this comment.
Pull request overview
This PR adjusts finishMoveKeys() to avoid holding a transaction open while waiting up to SERVER_READY_QUORUM_TIMEOUT (15s) for destination servers to become readable, which can exceed the transaction lifetime and trigger transaction_too_old. It does so by splitting the logic into a “read metadata” transaction, then waiting outside a transaction, and finally a “re-verify + commit” transaction.
Changes:
- Capture the transaction read version, then
tr.reset()before the potentially longwaitForShardReady()quorum wait. - Introduce a second transaction that re-checks ownership via
checkMoveKeysLock, re-readskeyServers, and commits the metadata update. - Add logging/probing when the destination team changes during the out-of-transaction wait.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
2 similar comments
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
|
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
2 similar comments
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
|
@saintstack Haven't reviewed this in depth. But curious how are we testing this? In simulation, can we add artificial delay at some places? So before this change: DD gets stuck, 100K is not clean. After this change: DD makes progress, simulation test completes, 100K is clean. |
@spraza I like this suggestion... doing. |
6867999 to
30720fb
Compare
|
Rebase and bundle the test suggested by @spraza |
Result of foundationdb-pr-clang-73 on Linux RHEL 9
|
|
Failure is GcGenerations. |
SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction that must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers are slow to respond (e.g. after a team rebuild unsticks moves), the 15s wait guarantees transaction_too_old. Split the single transaction into two: 1. First transaction reads keyServers metadata and server list 2. waitForShardReady runs outside any transaction (15s timeout is safe) 3. Second transaction re-verifies dest is unchanged, then commits If dest changed while waiting (another DD reassigned the shard), the code retries from the top of the inner loop.
Document why the split is safe: checkMoveKeysLock + dest re-verification in the second transaction ensures stale metadata is never committed.
Fix two issues in the second transaction of the finishMoveKeys split: 1. Restore tr.trState->taskID = TaskPriority::MoveKeys after tr.reset() which resets taskID to the default. Without this the verification and commit run at wrong priority. 2. Check dest across all subranges in the convergence check, not just the first entry. Also verify the iteration boundary (endKey) has not changed. A split or partial reassignment during the wait could leave some subranges with a different dest which the single-entry check would miss.
Injects a 6s delay (5% probability) after shard readiness is confirmed, simulating a slow storage server. This exercises the fix that moved waitForShardReady outside the transaction in finishMoveKeys. Verified with BUGGIFY_WITH_PROB(1.0) on MoveKeysCycle.toml: - With tr.reset() disabled (old behavior): test FAILS — transaction_too_old storms, Cycle workload achieves 2-9 txns vs 89 required, Passed=0 - With tr.reset() enabled (fix): test PASSES — no retry storms, Passed=1
30720fb to
fae8743
Compare
Result of foundationdb-pr-cluster-tests-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-clang-73 on Linux RHEL 9
|
|
This is putting even more lines of code and complexity into a function that is already pushing 400+ lines with multiple levels of nested loops, try/catch blocks, and conditional logic. Why is this SERVER_READY_QUORUM_TIMEOUT a problem now all of a sudden? Do we have data on how often this condition happens? I suspect this relates to having too many moves going on in general per graphs we have seen, reports of OOMs, etc. I am not convinced this sort of apparent workaround with obvious complexity cost is the right approach. I think front-end admission control needs attention |
Result of foundationdb-pr-cluster-tests-73 on Linux RHEL 9
|
gxglass
left a comment
There was a problem hiding this comment.
See comments earlier in review about complexity vs benefit on this. What is the data saying how often this happens? Why is it happening more now? The default explanation per graphs is that too much data (TBs) is in flight, with the result being individual moves now share bandwidth, CPU, etc and take longer, hitting timeouts. Better concurrency control would seem to avoid all of that, and the need for complex workarounds like this for conditions that should be rare.
Adds a simulation workload that reproduces the finishMoveKeys pipeline stall from incidents p127 and p102. The test creates conditions where destination storage servers cannot respond to getShardState requests promptly, causing waitForShardReady to consume the 5s transaction budget and triggering a self-reinforcing retry storm. What is simulated vs real: - Real: the finishMoveKeys code path, FlowLock slots, retry logic, dataMoves accumulation, startMoveKeys outpacing finishMoveKeys, OOM from fetchKeys arena growth. All downstream behavior is genuine FDB code. - Simulated: the reason getShardState is slow. In production, dest SSes are CPU-saturated from fetchKeys + client reads + compaction. In simulation, we use a FlowLock + queue-proportional delay. The effect is identical (getShardState takes seconds instead of microseconds). - getShardState is normally instantaneous (check shard state, return version). It only becomes slow when the SS event loop is saturated. It is ONLY called by waitForShardReady inside finishMoveKeys -- normal client transactions never touch it. - The simulation is optimistic: in production ALL operations on an overloaded SS are slow, not just getShardState. If a fix works here, it works in production where conditions are worse. Mechanism: - A FlowLock (capacity=2) on each SS limits concurrent getShardState processing. When the lock is acquired, processing is slowed proportional to queue depth (0.5s per waiter). This models real CPU contention: the SS can only serve N requests at a time, and each takes longer under pressure. - storage_fetch_keys_delay=2.0 keeps fetchKeys running long enough that the concurrency limit stays active during mass shard movement. - Self-reinforcing: more retries -> longer queue -> slower processing -> transaction_too_old -> more retries. Test structure: - Loads 500K keys with tiny shards (5KB min, 20KB max) -> ~17,000 shards - Excludes 3/9 storage servers -> ~8,000 shard moves - Monitors dataMoves accumulation and pipeline throughput for 900s Results: - 89 consecutive stall detections (full 900s observe window, never recovered) - finishMoveShards: 57% abort rate (72,197/127,566 aborted) - 102 GB arena reserved (unbounded fetchKeys memory growth) - Matches p102 OOM pattern: fetchKeys arenas accumulate because finishMoveKeys cannot clear dataMoves entries Fix validation results: - apple#12981 (waitForShardReady outside txn): FIXES IT. Queue drains fully (13,900 -> 78). All moves complete. Root cause eliminated. - apple#12991 (retry backoff): MAKES IT WORSE. Stall becomes more permanent because slower retries prevent even lucky successes from draining queue. - apple#13112 (global in-flight cap at 200): LIMITS DAMAGE. InFlight capped, memory growth halved (49GB vs 102GB), queue partially drains. But stall persists (56% abort rate). Defense-in-depth, not a fix. Run command: fdbserver -r simulation -f tests/slow/DDPipelineStall.toml --memory 16GiB --knob-max_trace_lines 100000000 New knob: BUGGIFY_GET_SHARD_STATE_DELAY (default 0, no production effect). When set to N in simulation, limits getShardState to N concurrent operations per storage server. Files: - fdbserver/workloads/DDPipelineStall.cpp - tests/slow/DDPipelineStall.toml - fdbserver/core/include/fdbserver/core/Knobs.h - fdbserver/core/ServerKnobs.cpp - fdbserver/storageserver/storageserver.actor.cpp - tests/CMakeLists.txt
|
#13364 is forward-port of a version of this patch.
nod. Unfortunately, this is where the problem is.
None beyond the rash of outages we witnessed recently and that if we let the spotlight roam, it seems to find other cases. Yes, too many moves while the cluster is under heavy load brings on issues.
Front-end admission sounds like a good addition too -- belt and suspenders -- though I'd ascribe admission-control the workaround label given this PR addresses an obvious error and likely cause of outages. |
SERVER_READY_QUORUM_TIMEOUT (15s) was used INSIDE a transaction that must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers are slow to respond (e.g. after a team rebuild unsticks moves), the 15s wait guarantees transaction_too_old.
Split the single transaction into two:
If dest changed while waiting (another DD reassigned the shard), the code retries from the top of the inner loop.
20260414-005707-stack_logs-bddd63d245354453 compressed=True data_size=50997449 duration=4877944 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:04 sanity=False started=100000 stopped=20260414-015611 submitted=20260414-005707 timeout=5400 username=stack_logsThis is one of a few fixes to address transaction_too_old seen trying to do inRestore moves on startup incident on 7.3