DD finishMoveKeys: move waitForShardReady outside transaction#13364
DD finishMoveKeys: move waitForShardReady outside transaction#13364saintstack wants to merge 5 commits into
Conversation
…ode paths)
SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction that
must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When
destination servers are slow to respond, the wait alone consumes the
txn budget — and the surrounding transaction has additional reads
(\xff/serverTags, \xff/keyServers, \xff/dataMoves with the metadata
knob ON, serverList per dest) and writes. Result: commits start to
fail with transaction_too_old and their retries too.
We saw this issue in recent incidents:
- cluster1: SHARD_ENCODE_LOCATION_METADATA=true compounded
into a isRestore replay loop after DD died.
- cluster2: same trigger, knob OFF, OOMed but recovered.
DD has TWO finish-move functions, dispatched on the metadata knob in
rawFinishMovement: finishMoveKeys (knob OFF) and finishMoveShards
(knob ON). cluster1 had the knob ON, so its code path was finishMoveShards.
This patch applies the same fix to BOTH.
For each function: split the single transaction into two, with the wait
in between:
Transaction 1: read keyServers/serverTags/serverList (and dataMoves
metadata for finishMoveShards)
Save the read version, drop the transaction (tr.reset())
Wait: waitForShardReady — runs OUTSIDE any transaction;
the 15s timeout is now safe
Transaction 2: re-verify state hasn't changed (dest still ours,
dataMove still in Running phase for finishMoveShards),
then commit metadata writes
If the destination changed during the wait (another DD reassigned the
shard), the inner loop retries from the top — same as today's behaviour
on transient errors, just without burning the txn budget on the wait
itself.
Validation:
- k8s rig:
3 transaction_too_old events, vs 360,289 in the previous run without
the patch. 0 OOMs, 0 cap engagements ('cap' refers to PR apple#13112).
- Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated
on the code path; transaction_too_old goes to zero, residual
TryFinishMoveShardsError events are not_committed which retry cleanly.
The convergence-check trace events (FinishMoveShardsDestChanged,
*DataMoveDeletedAfterWait, *PhaseChangedAfterWait) fire 0 times in
our runs — the safety check is conservative without false-positive
retries.
e563bce to
a835ec7
Compare
|
(Address copilot feedback suggesting we sort dest before comparing...) |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
|
Some initial review comments. I'd prefer this on main only for a few reasons:
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Yes and yes usually, but was trying target backport to 7.3 so minimizing change.
Hard to discuss here an offline doc. Lets talk offline.
I was trying to do better than opinions by going off and spending a bunch of time reproducing the incident so we could see which patches actually rather speculatively helped. FDB already has admission control against 'too much work' whether constraint on loading by RateKeeper, the bound on how many concurrent starts and stop datamoves are allowed, through to DD's limit of how many datamoves each SS can have running at any one time. An incident occurred when an operator performed a task that they have done many times before without incident. In fact, the cluster ran for hours at its configured limit but then it went out of equilibrium when an exclude completed and a team rebuild was triggered. Adding another 'admission control' that bounds datamove is a useful just-in-case but why now do we need it (as you'd say yourself). It may mitigate. But then exclude moves bypass this 'cap' mechanism (though they add to the overall total count) and it was exclude moves that triggered the incident. The 'cap' does not address the cause of the cascade where the finish datamove transaction is unable to complete because getShardState puts an already involved transaction over the 5s limit. Thats what this PR is about.
Yeah. Sorry. Didn't really do compares. Was focused on pass/fail (the test runs take a while to setup and then run long enough to allow for assessment). My admission control test ran with the max set to 100, and then 200... which was probably too constraining. I could do reruns? |
alecgrieser
left a comment
There was a problem hiding this comment.
This LGTM. I guess I agree with @gxglass in the abstract that it would be nice if this were also refactored a bit to avoid (another) large method in the code base with a lot of duplicate work. But I also think this is a pretty clear win, and that we do want this on 7.3. It seems like with just the admission control fixes, we will still get into cases where DD can't make progress, though not spiraling to infinity, but we still very much care about making DD succeed more (which is what this PR does). I guess the extra ablation experiment is useful if not everyone is convinced
| if (checkDest != destServers) { destChanged = true; break; } | ||
| } | ||
| if (destChanged) { | ||
| CODE_PROBE(true, "finishMoveShards dest changed during waitForShardReady"); |
There was a problem hiding this comment.
Have you looked at the simulation code coverage to confirm whether we've been able to hit this?
There was a problem hiding this comment.
Tried. Nada. Let me mix in some buggify ... it's appropriate adding it in here. Will be back to you... Thanks.
There was a problem hiding this comment.
I ran a bunch of seeds with buggify via MoveKeysCycle, MoveKeysClean, and then MoveKeysSideband trying to trip this code probe but no luck. I suppose it makes sense. CancelConflictingDataMoves cancels existing move if a conflicting one so we don't get here (at least not in single DD test scenario). Would need two DDs contending.... I could try writing a test? Thanks @alecgrieser
Co-authored-by: Alec Grieser <alloc@apple.com>
Co-authored-by: Alec Grieser <alloc@apple.com>
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Those limits aren't sufficient to prevent a) unbounded work on startup, b) OOMs, and c) having to mitigate overload at miscellaneous points in the pipeline (such as this one) on an ad-hoc basis that probably would not not be needed if unbounded work was not let into the system.
That's a bug in the existing check, not a problem with the design. In all likelihood the reason for that bug is that I let AI talk me into setting the limit < 700 rather than just above it. Another interpretation is that maintenance "excludes" should not reuse Some simple fixes with the pipeline admission control are certainly warranted here. It should apply to exclude-driven moves, and whatever simulation problems (lack of progress due to insufficient pipeline capacity, probably) motivated the current threshold should be addressed. I will look at this when I'm back
Offline doc explains the mechanism by which it should mitigate this condition. Summary is that overloaded storage servers stop being overloaded and then finishMoveKeys stop failing en masse. |
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
|
@ploxiln perhaps of interest => one of these running the Joshua test ``` |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
So fail=3 passed=9997 ended=10000 ... but tail shows nothing but "Ensemble stopped". Yeah, I've seen that a few times recently, and I thought it may be due to timeouts, but the joshua-fdb transaction-timeout is all the way up to 512s by now ... (could it be due to retries? ... or something else? no idea ...) |
Result of foundationdb-pr-clang on Linux RHEL 9
|
|
After buggifying SHARD_READY_DELAY |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
(Forward port of #12981 -- though it needs update to match this version)
SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction that must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers are slow to respond, the wait alone consumes the txn budget — and the surrounding transaction has additional reads (\xff/serverTags, \xff/keyServers, \xff/dataMoves with the SHARD_ENCODE_LOCATION_METADATA knob ON, serverList per dest) as well as writes. Result: commits start to fail with transaction_too_old as do the retries.
We saw this issue in recent incidents:
DD has TWO finish-move functions, dispatched on the metadata knob in rawFinishMovement: finishMoveKeys (knob OFF) and finishMoveShards (knob ON). cluster1 had the knob ON, so its code path was finishMoveShards. This patch applies the same fix to BOTH.
For each function: split the single transaction into two, with the wait in between:
If the destination changed during the wait (another DD reassigned the shard), the inner loop retries from the top — same as today's behaviour on transient errors, just without burning the txn budget on the wait itself.
Notes
What finishMoveKeys / finishMoveShards actually does
A single transaction does ~10–14 async round-trips to FDB:
On a healthy cluster the whole transaction averages ~1.8 seconds — already 36 % of the 5 s budget, with waitForShardReady returning in milliseconds when the dest is already ready. With the metadata knob ON, steps 1 and 9 add two extra round-trips that further reduce headroom.
waitForShardReady (step 5) polls each dest SS via getShardState at intervals of SHARD_READY_DELAY (default 0.25 s) until a quorum reports ready, with an outer cap of SERVER_READY_QUORUM_TIMEOUT=15 s. The 15 s cap is rarely the trigger in practice — transaction_too_old fires at ~5 s for the whole txn first.
What happed on cluster1
dest SSes were CPU-saturated from concurrent fetchKeys operations on large shards (100–500 MB). At 80 % CPU the SS event loop couldn't process any RPC promptly. The entire finishMoveShards transaction slowed down, not just step 5: reads (steps 1–3) hit slow SSes, waitForShardReady (step 5) saw more "not ready" responses, writes (steps 6–7) hit the same storage layer. The 1.8 s baseline became 5+ seconds, transaction_too_old fired, retries hit the same wall, and the storm was self-sustaining.
The dest-overload was the trigger; the multi-step transaction with waitForShardReady embedded inside it was the latent bug. With the metadata knob ON, steps 1 and 9 also actively contributed by inflating the critical-path transaction — beyond the well-known restart-fatal-via-isRestore-replay issue.
What the simulation and the k8s emulation do
We reproduced the 'cascade' of unfinished moves phenomenon in a k8s test rig and in simulation.
https://github.com/saintstack/foundationdb/tree/dd-pipeline-stall-test is a simulation test that manufactures a cluster1 like situation. https://github.com/saintstack/fdb-kubernetes-tests/tree/backup_recreate is a k8s test that does a similar reproduction.
Two failure modes emerge from the same recipe:
Both converge on the same death: DD OOMs from accumulated actor state, restarts, isRestore replays accumulated \xff/dataMoves/, DD OOMs again.
The convergence-check trace events (FinishMoveShardsDestChanged, *DataMoveDeletedAfterWait, *PhaseChangedAfterWait) fire 0 times in our runs — the safety check is conservative without false-positive retries.
How we triggered the cascade across rigs
All three paths end at the same transaction_too_old. The k8s run with the extended fix exercised the actual production code path (finishMoveShards, knob ON) and showed the cascade trigger eliminated.
Tests
Here are the k8s test runs synopsized:
Running #12981 in Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated on the code path; transaction_too_old goes to zero, residual TryFinishMoveShardsError events are not_committed which retry cleanly.
Why this matters
The latent bug has existed for years. A reasonable concern is whether it's worth the added complexity. Two points of evidence:
Two incidents. Both had this exact trigger. cluster1 plus the metadata knob compounded into a fatal isRestore replay loop on every DD restart.
Admission control complements but doesn't substitute. PR #13112 (cap) and PR #12981 (root cause) work on different parts of the failure chain. Whether the cap alone is sufficient depends on how persistent the trigger is. Intermittent trigger (production team-rebuild bursts → some slow polls → eventual recovery): cap can ride out the burst (Run 35 alone drained 2.3 TB). Sustained trigger (every poll slow because dests stay saturated): simulation shows the cap bounds memory but the queue doesn't drain. Deploying both gives PR #12981 removing the structural anti-pattern and PR #13112 as defense-in-depth.
Why not a simpler fix? Setting SERVER_READY_QUORUM_TIMEOUT below 5 s is two characters in a knob file, but that's not what fires in production. The budget is blown by accumulated 0.25 s polls × N + RPC time well before the 15 s outer timeout, so trimming it changes very little. To cover the real failure mode that way you'd also have to shrink SHARD_READY_DELAY, raising the rate of move failures under transient slowness. The two-transaction restructure costs more lines but eliminates the structural problem regardless of any knob value.