
DD finishMoveKeys: move waitForShardReady outside transaction #13364

Open
saintstack wants to merge 5 commits into apple:main from saintstack:pr12981-extended-refactor

@saintstack

@saintstack saintstack commented Jun 17, 2026

Contributor

(Forward port of #12981 -- though it needs update to match this version)

SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction that must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers are slow to respond, the wait alone consumes the txn budget — and the surrounding transaction has additional reads (\xff/serverTags, \xff/keyServers, \xff/dataMoves with the SHARD_ENCODE_LOCATION_METADATA knob ON, serverList per dest) as well as writes. Result: commits start to fail with transaction_too_old as do the retries.

We saw this issue in recent incidents:

  • cluster1: SHARD_ENCODE_LOCATION_METADATA=true compounded into an isRestore replay loop after DD died.
  • cluster2: same trigger, knob OFF, OOMed but recovered.

DD has TWO finish-move functions, dispatched on the metadata knob in rawFinishMovement: finishMoveKeys (knob OFF) and finishMoveShards (knob ON). cluster1 had the knob ON, so its code path was finishMoveShards. This patch applies the same fix to BOTH.

For each function: split the single transaction into two, with the wait in between:

  Transaction 1: read keyServers/serverTags/serverList (and dataMoves
                 metadata for finishMoveShards)
  Save the read version, drop the transaction (tr.reset())
  Wait:          waitForShardReady — runs OUTSIDE any transaction;
                 the 15s timeout is now safe
  Transaction 2: re-verify state hasn't changed (dest still ours,
                 dataMove still in Running phase for finishMoveShards),
                 then commit metadata writes

If the destination changed during the wait (another DD reassigned the shard), the inner loop retries from the top — same as today's behaviour on transient errors, just without burning the txn budget on the wait itself.
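
As a rough illustration of the control flow above, here is a minimal Python sketch (not the actual Flow/C++ code in MoveKeys.cpp). `FakeMetadata`, `finish_move`, and the `wait_for_shard_ready` callback are hypothetical stand-ins. The only point is that the wait sits between a read phase and a re-verify-then-write phase, and that a changed destination sends the loop back to the top.

```python
from dataclasses import dataclass


class DestChanged(Exception):
    """Raised when every attempt saw the destination change during the wait."""


@dataclass
class FakeMetadata:
    # Hypothetical in-memory stand-in for one shard's \xff/keyServers entry.
    dest: tuple = ("ss1", "ss2", "ss3")
    finished: bool = False


def finish_move(meta, wait_for_shard_ready, max_attempts=3):
    """Return the attempt number on which the move was finished."""
    for attempt in range(1, max_attempts + 1):
        # "Transaction 1": read-only snapshot of the destination team.
        dest_snapshot = tuple(sorted(meta.dest))
        # The read transaction is dropped here; nothing below holds a read version.

        # Wait: runs outside any transaction, so it may exceed the ~5 s budget.
        wait_for_shard_ready(dest_snapshot)

        # "Transaction 2": re-verify the snapshot, then write.
        if tuple(sorted(meta.dest)) != dest_snapshot:
            continue  # destination changed during the wait: retry from the top
        meta.finished = True
        return attempt
    raise DestChanged(f"gave up after {max_attempts} attempts")
```

In this toy model a wait callback that reassigns `meta.dest` on its first call makes the first attempt fail verification and the second succeed, which is the retry behaviour the description refers to.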

Notes

What finishMoveKeys / finishMoveShards actually does

A single transaction does ~10–14 async round-trips to FDB:

  1. Read \xff/dataMoves/<id>: fetch metadata (knob ON only)
  2. Read \xff/serverTags/
  3. Read \xff/keyServers/ range via krmGetRanges
  4. Decode src/dest, validate
  5. waitForShardReady for each destination server
  6. Write updated \xff/keyServers/ via krmSetRangeCoalescing
  7. Write updated \xff/serverKeys/ (~6 servers per move)
  8. Delete checkpoints
  9. Clear \xff/dataMoves/<id> (knob ON only)
  10. Commit

On a healthy cluster the whole transaction averages ~1.8 seconds — already 36 % of the 5 s budget, with waitForShardReady returning in milliseconds when the dest is already ready. With the metadata knob ON, steps 1 and 9 add two extra round-trips that further reduce headroom.

waitForShardReady (step 5) polls each dest SS via getShardState at intervals of SHARD_READY_DELAY (default 0.25 s) until a quorum reports ready, with an outer cap of SERVER_READY_QUORUM_TIMEOUT=15 s. The 15 s cap is rarely the trigger in practice — transaction_too_old fires at ~5 s for the whole txn first.
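
To make the polling behaviour concrete, the following Python sketch models the wait with a simulated clock. It is an assumption-laden toy, not the real waitForShardReady: `simulate_wait_for_shard_ready` and its `not_ready_polls` input are hypothetical, and only the 0.25 s delay and 15 s cap come from the text above.

```python
def simulate_wait_for_shard_ready(not_ready_polls, poll_delay=0.25, quorum_timeout=15.0):
    """Return (ready, polls_made, elapsed_seconds) for a simulated wait.

    not_ready_polls is the (hypothetical) number of polls that come back
    "not ready" before a quorum finally reports ready.
    """
    elapsed = 0.0
    polls = 0
    while elapsed < quorum_timeout:
        if polls >= not_ready_polls:
            return True, polls, elapsed
        # One not-ready poll: sleep for the poll delay and try again.
        polls += 1
        elapsed += poll_delay
    # Outer cap reached without a ready quorum.
    return False, polls, elapsed
```

With the defaults, 20 not-ready polls already account for 5.0 s of waiting, while the 15 s cap is only reached after 60 polls. That is one way to see why the cap is rarely what fires first when the wait runs inside a ~5 s transaction.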

What happened on cluster1

dest SSes were CPU-saturated from concurrent fetchKeys operations on large shards (100–500 MB). At 80 % CPU the SS event loop couldn't process any RPC promptly. The entire finishMoveShards transaction slowed down, not just step 5: reads (steps 1–3) hit slow SSes, waitForShardReady (step 5) saw more "not ready" responses, writes (steps 6–7) hit the same storage layer. The 1.8 s baseline became 5+ seconds, transaction_too_old fired, retries hit the same wall, and the storm was self-sustaining.

The dest-overload was the trigger; the multi-step transaction with waitForShardReady embedded inside it was the latent bug. With the metadata knob ON, steps 1 and 9 also actively contributed by inflating the critical-path transaction — beyond the well-known restart-fatal-via-isRestore-replay issue.

What the simulation and the k8s emulation do

We reproduced the 'cascade' of unfinished moves in a k8s test rig and in simulation.
https://github.com/saintstack/foundationdb/tree/dd-pipeline-stall-test is a simulation test that manufactures a cluster1-like situation. https://github.com/saintstack/fdb-kubernetes-tests/tree/backup_recreate is a k8s test that does a similar reproduction.

Two failure modes emerge from the same recipe:

Mode | Trigger | Distinctive error pattern
cluster1 cascade | mako + exclude | 96–99% transaction_too_old
Rebalance overload | mako only | 70% not_committed, 16% transaction_too_old

Both converge on the same death: DD OOMs from accumulated actor state, restarts, isRestore replays accumulated \xff/dataMoves/, DD OOMs again.

The convergence-check trace events (FinishMoveShardsDestChanged, *DataMoveDeletedAfterWait, *PhaseChangedAfterWait) fire 0 times in our runs — the safety check is conservative without false-positive retries.

How we triggered the cascade across rigs

Rig | What slows waitForShardReady
cluster1 | Dest SS event loop CPU-saturated → all reads/RPCs slow; getShardState returns not-ready across polls
K8s rig | SHARD_READY_DELAY raised from 0.25 s to 2.5 s → 2 not-ready polls = 5 s
Simulation | buggify_get_shard_state_delay = 2 (concurrency limit) simulates SS event loop saturation

All three paths end at the same transaction_too_old. The k8s run with the extended fix exercised the actual production code path (finishMoveShards, knob ON) and showed the cascade trigger eliminated.
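
The poll arithmetic behind these triggers can be checked with a small Python sketch. `not_ready_polls_to_exhaust` is a hypothetical helper, and treating the 1.8 s healthy-cluster average from the Notes as fixed "other work" is a simplifying assumption, not a measurement.

```python
import math


def not_ready_polls_to_exhaust(budget_s, poll_delay_s, other_work_s=0.0):
    """Smallest number of not-ready polls whose delays use up the remaining budget."""
    remaining = budget_s - other_work_s
    return max(0, math.ceil(remaining / poll_delay_s))


# K8s rig: SHARD_READY_DELAY raised to 2.5 s, counting poll delays only.
k8s_polls = not_ready_polls_to_exhaust(5.0, 2.5)

# Default SHARD_READY_DELAY of 0.25 s, assuming 1.8 s of other transaction work.
default_polls = not_ready_polls_to_exhaust(5.0, 0.25, other_work_s=1.8)
```

Under these assumptions the rig needs 2 not-ready polls to reach the 5 s budget, matching the table, while the default delay needs 13 once the 1.8 s of other work is counted.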

Tests

Here are the k8s test runs synopsized:

PR | Mako-only k8s run | Mako+exclude k8s run
PR #12981 | - | ✅ test-5jmt4wg6: 3 tx_too_old (vs 360,289 in the previous run without the patch), 0 Phase-4 OOMs, 0 cap engagements in Phase 4 ('cap' refers to PR #13112)
PR #13112 alone | Run 34: ✅ 1 restart in 4 h, self-recovered | Run 35: ✅ 2.3 TB drained, 274 cap engagements

Running #12981 in Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated on the code path; transaction_too_old goes to zero, residual TryFinishMoveShardsError events are not_committed which retry cleanly.

Why this matters

The latent bug has existed for years. A reasonable concern is whether it's worth the added complexity. Two points of evidence:

Two incidents. Both had this exact trigger. cluster1 plus the metadata knob compounded into a fatal isRestore replay loop on every DD restart.

Admission control complements but doesn't substitute. PR #13112 (cap) and PR #12981 (root cause) work on different parts of the failure chain. Whether the cap alone is sufficient depends on how persistent the trigger is. Intermittent trigger (production team-rebuild bursts → some slow polls → eventual recovery): cap can ride out the burst (Run 35 alone drained 2.3 TB). Sustained trigger (every poll slow because dests stay saturated): simulation shows the cap bounds memory but the queue doesn't drain. Deploying both means PR #12981 removes the structural anti-pattern while PR #13112 provides defense-in-depth.

Why not a simpler fix? Setting SERVER_READY_QUORUM_TIMEOUT below 5 s is two characters in a knob file, but that's not what fires in production. The budget is blown by accumulated 0.25 s polls × N + RPC time well before the 15 s outer timeout, so trimming it changes very little. To cover the real failure mode that way you'd also have to shrink SHARD_READY_DELAY, raising the rate of move failures under transient slowness. The two-transaction restructure costs more lines but eliminates the structural problem regardless of any knob value.

…ode paths)

SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction that
must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When
destination servers are slow to respond, the wait alone consumes the
txn budget — and the surrounding transaction has additional reads
(\xff/serverTags, \xff/keyServers, \xff/dataMoves with the metadata
knob ON, serverList per dest) and writes. Result: commits start to
fail with transaction_too_old and their retries too.

We saw this issue in recent incidents:
- cluster1: SHARD_ENCODE_LOCATION_METADATA=true compounded
  into an isRestore replay loop after DD died.
- cluster2: same trigger, knob OFF, OOMed but recovered.

DD has TWO finish-move functions, dispatched on the metadata knob in
rawFinishMovement: finishMoveKeys (knob OFF) and finishMoveShards
(knob ON). cluster1 had the knob ON, so its code path was finishMoveShards.
This patch applies the same fix to BOTH.

For each function: split the single transaction into two, with the wait
in between:

  Transaction 1: read keyServers/serverTags/serverList (and dataMoves
                 metadata for finishMoveShards)
  Save the read version, drop the transaction (tr.reset())
  Wait:          waitForShardReady — runs OUTSIDE any transaction;
                 the 15s timeout is now safe
  Transaction 2: re-verify state hasn't changed (dest still ours,
                 dataMove still in Running phase for finishMoveShards),
                 then commit metadata writes

If the destination changed during the wait (another DD reassigned the
shard), the inner loop retries from the top — same as today's behaviour
on transient errors, just without burning the txn budget on the wait
itself.

Validation:
- k8s rig:
  3 transaction_too_old events, vs 360,289 in the previous run without
  the patch. 0 OOMs, 0 cap engagements ('cap' refers to PR apple#13112).
- Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated
  on the code path; transaction_too_old goes to zero, residual
  TryFinishMoveShardsError events are not_committed which retry cleanly.

The convergence-check trace events (FinishMoveShardsDestChanged,
*DataMoveDeletedAfterWait, *PhaseChangedAfterWait) fire 0 times in
our runs — the safety check is conservative without false-positive
retries.
@saintstack saintstack force-pushed the pr12981-extended-refactor branch from e563bce to a835ec7 Compare June 17, 2026 19:48
@saintstack

Contributor Author

(Address copilot feedback suggesting we sort dest before comparing...)
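
A minimal Python sketch of why sorting matters for this comparison, assuming the two reads can return the same destination servers in a different order. `dest_changed` and the string server IDs are hypothetical; the real code compares server UIDs in C++.

```python
def dest_changed(before, after):
    """True when the destination team differs, ignoring the order of servers."""
    return sorted(before) != sorted(after)


# Same servers in a different order: not a change.
same_team = dest_changed(["ss2", "ss1", "ss3"], ["ss1", "ss2", "ss3"])

# One server replaced: a real change.
new_team = dest_changed(["ss1", "ss2", "ss3"], ["ss1", "ss2", "ss4"])
```

Without the sort, a plain list comparison would report the first case as a change and trigger a needless retry.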

@saintstack saintstack requested review from gxglass and sbodagala June 17, 2026 19:49
@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: e563bce
  • Duration 0:22:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: e563bce
  • Duration 0:32:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: e563bce
  • Duration 0:45:24
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: e563bce
  • Duration 0:45:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: e563bce
  • Duration 0:48:20
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack saintstack requested a review from alecgrieser June 17, 2026 20:21
@foundationdb-ci

Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: e563bce
  • Duration 0:55:51
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: a835ec7
  • Duration 0:40:09
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: a835ec7
  • Duration 0:45:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: a835ec7
  • Duration 0:46:07
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass

gxglass commented Jun 17, 2026

Collaborator

Some initial review comments. I'd prefer this on main only for a few reasons:

  1. Echoing a concern on a prior PR, this involves a super long and complicated function that is growing here (1759 - 1324 => 400+ lines) and I'd like to refactor that to simplify it. I realize this sort of thing (giant ~untestable-in-isolation functions) is somewhat endemic in the code base, but it doesn't mean we shouldn't make improvements as we go. We keep running into thinly tested code in complicated actor chains with actual bugs surfacing in production (e.g. PR 13200, PR 13312 in the last six weeks), so this isn't just a stylistic/code smell preference.

  2. I'm not convinced this is actually needed on release-7.3. The offline doc has language (obviously agent written) which to me suggests that the agent is overly indexed on this change, for example it refers to this as fixing the root cause of various problems. In my opinion the root cause was letting too much work into the system, and this change improves a long-standing weakness of the design that gets exacerbated when there is too much work. "Too much work" explains various other problems, notably a) excess recovery work on startup, b) OOMs due to just flat out too many actors. I also don't see an ablation experiment to assess system performance (as opposed to "runs to completion with minimal OOMs") without this change. Specifically, with admission control only, how long did the storage migration take, and how does that compare to expected/optimal? If it's already reasonably close to expected/optimal, then I think release-7.3 need not wait for this. By "reasonably close" I mean anything sane, say 20% of expected speed or faster. We have not been doing migrations for months and $NORMAL_MIGRATION_SPEED/0.2 probably implies something measured in single digit days.

@foundationdb-ci

Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: e563bce
  • Duration 1:08:38
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: a835ec7
  • Duration 1:22:05
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: a835ec7
  • Duration 1:23:40
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: a835ec7
  • Duration 1:50:23
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: a835ec7
  • Duration 2:29:47
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack

Contributor Author

> I realize this sort of thing (giant ~untestable-in-isolation functions) is somewhat endemic in the code base, but it doesn't mean we shouldn't make improvements as we go.

Yes and yes usually, but I was trying to target a backport to 7.3, so I was minimizing change.

> The offline doc has language (obviously agent written) which to me suggests that the agent is overly indexed on this change, for example it refers to this as fixing the root cause of various problems.

It's hard to discuss an offline doc here. Let's talk offline.

> In my opinion the root cause was letting too much work into the system...

I was trying to do better than opinions by spending a bunch of time reproducing the incident so we could see which patches actually, rather than speculatively, helped.

FDB already has admission control against 'too much work' whether constraint on loading by RateKeeper, the bound on how many concurrent starts and stop datamoves are allowed, through to DD's limit of how many datamoves each SS can have running at any one time. An incident occurred when an operator performed a task that they have done many times before without incident. In fact, the cluster ran for hours at its configured limit but then it went out of equilibrium when an exclude completed and a team rebuild was triggered. Adding another 'admission control' that bounds datamoves is a useful just-in-case, but why do we need it now (as you'd say yourself)? It may mitigate. But then exclude moves bypass this 'cap' mechanism (though they add to the overall total count) and it was exclude moves that triggered the incident. The 'cap' does not address the cause of the cascade where the finish datamove transaction is unable to complete because getShardState puts an already involved transaction over the 5s limit. That's what this PR is about.

> I also don't see an ablation experiment to assess system performance (as opposed to "runs to completion with minimal OOMs") without this change. Specifically, with admission control only, how fast did the storage migration take, and how does that compare to expected/optimal?

Yeah. Sorry. I didn't really do comparisons. I was focused on pass/fail (the test runs take a while to set up and then must run long enough to allow for assessment). My admission control test ran with the max set to 100, and then 200... which was probably too constraining. I could do reruns?

@alecgrieser alecgrieser left a comment

Contributor

This LGTM. I guess I agree with @gxglass in the abstract that it would be nice if this were also refactored a bit to avoid (another) large method in the code base with a lot of duplicate work. But I also think this is a pretty clear win, and that we do want this on 7.3. It seems like with just the admission control fixes, we will still get into cases where DD can't make progress, though not spiraling to infinity, but we still very much care about making DD succeed more (which is what this PR does). I guess the extra ablation experiment is useful if not everyone is convinced.

Comment thread fdbserver/core/MoveKeys.cpp Outdated
Comment thread fdbserver/core/MoveKeys.cpp Outdated
        if (checkDest != destServers) { destChanged = true; break; }
    }
    if (destChanged) {
        CODE_PROBE(true, "finishMoveShards dest changed during waitForShardReady");

Contributor

Have you looked at the simulation code coverage to confirm whether we've been able to hit this?

Contributor Author

Tried. Nada. Let me mix in some buggify ... it's appropriate adding it in here. Will be back to you... Thanks.

Contributor Author

I ran a bunch of seeds with buggify via MoveKeysCycle, MoveKeysClean, and then MoveKeysSideband trying to trip this code probe but no luck. I suppose it makes sense. CancelConflictingDataMoves cancels an existing move if there is a conflicting one, so we don't get here (at least not in a single-DD test scenario). It would need two DDs contending.... I could try writing a test? Thanks @alecgrieser

saintstack and others added 2 commits June 18, 2026 10:57
Co-authored-by: Alec Grieser <alloc@apple.com>
Co-authored-by: Alec Grieser <alloc@apple.com>
@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 7c2372f
  • Duration 0:38:20
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: b4504cd
  • Duration 0:39:24
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: 7c2372f
  • Duration 0:46:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: b4504cd
  • Duration 0:48:30
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: b4504cd
  • Duration 0:54:06
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 7c2372f
  • Duration 1:03:35
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 7c2372f
  • Duration 1:11:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: b4504cd
  • Duration 1:12:55
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: b4504cd
  • Duration 1:18:25
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: 7c2372f
  • Duration 1:20:19
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass

gxglass commented Jun 18, 2026

Collaborator

> FDB already has admission control against 'too much work' whether constraint on loading by RateKeeper, the bound on how many concurrent starts and stop datamoves are allowed, through to DD's limit of how many datamoves each SS can have running at any one time.

Those limits aren't sufficient to prevent a) unbounded work on startup, b) OOMs, and c) having to mitigate overload at miscellaneous points in the pipeline (such as this one) on an ad-hoc basis, which probably would not be needed if unbounded work was not let into the system.

> But then exclude moves bypass this 'cap' mechanism (though they add to the overall total count) and it was exclude moves that triggered the incident.

That's a bug in the existing check, not a problem with the design. In all likelihood the reason for that bug is that I let AI talk me into setting the limit < 700 rather than just above it. Another interpretation is that maintenance "excludes" should not reuse PRIORITY_TEAM_UNHEALTHY in the first place, but that is out of scope here. (The team isn't literally unhealthy, it's only been requested to disband itself due to maintenance. "Unhealthy" implies "subject to damage from potentially unknown causes, which might progressively escalate out of our control", and therefore deserves higher priority than mere planned maintenance, which should be low priority.)

Some simple fixes with the pipeline admission control are certainly warranted here. It should apply to exclude-driven moves, and whatever simulation problems (lack of progress due to insufficient pipeline capacity, probably) motivated the current threshold should be addressed. I will look at this when I'm back.

> The 'cap' does not address the cause of the cascade where the finish datamove transaction is unable to complete because getShardState puts an already involved transaction over the 5s limit. That's what this PR is about.

Offline doc explains the mechanism by which it should mitigate this condition. Summary is that overloaded storage servers stop being overloaded and then finishMoveKeys stop failing en masse.

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: b4504cd
  • Duration 1:27:45
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 7c2372f
  • Duration 1:27:43
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: 7c2372f
  • Duration 1:33:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack

Contributor Author

@ploxiln perhaps of interest => one of these running the Joshua test:

```
[Container] 2026/06/18 18:45:40.655093 Running command python3 -m joshua.joshua tail --errors --xml ${ENSEMBLE_ID} | tee "/tmp/joshua_tail_${ENSEMBLE_ID}.log"
Results for test ensemble: 20260618-184537-pr13364-clang-b4504cd0-1556-357336c2d6af199a
Ensemble stopped
```

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 3e8f6a6
  • Duration 0:22:06
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: 3e8f6a6
  • Duration 0:32:15
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: 3e8f6a6
  • Duration 0:45:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 3e8f6a6
  • Duration 0:52:36
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 3e8f6a6
  • Duration 0:56:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 3e8f6a6
  • Duration 1:06:51
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: 3e8f6a6
  • Duration 1:12:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@ploxiln

ploxiln commented Jun 18, 2026

Collaborator

> @ploxiln perhaps of interest

[Container] 2026/06/18 20:11:53.718374 Running command python3 -m joshua.joshua tail --errors --xml ${ENSEMBLE_ID} | tee "/tmp/joshua_tail_${ENSEMBLE_ID}.log"
Results for test ensemble: 20260618-201151-pr13364-clang-3e8f6a68-1557-3180f074af7f66cb
<Trace>Ensemble stopped
</Trace>
[Container] 2026/06/18 20:21:25.295361 Running command if [ -x /usr/local/bin/repro.sh ]; then cat "/tmp/joshua_tail_${ENSEMBLE_ID}.log" | /usr/local/bin/repro.sh; fi

[Container] 2026/06/18 20:21:25.306511 Running command python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID}
  20260618-201151-pr13364-clang-3e8f6a68-1557-3180f074af7f66cb compressed=True data_size=37212026 duration=269869 ended=10000 fail=3 fail_fast=10 max_runs=10000 pass=9997 priority=100 remaining=0 runtime=0:09:34 sanity=False started=10000 stopped=20260618-202125 submitted=20260618-201151 timeout=5400 username=pr13364-clang-3e8f6a68-15571

So fail=3 pass=9997 ended=10000, but the tail shows nothing except "Ensemble stopped". I've seen that a few times recently and thought it might be due to timeouts, but the joshua-fdb transaction timeout is already up to 512s (could it be due to retries, or something else? No idea).
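
For reference, the FAILED verdict on that run follows from the gate command quoted in the CI result: it greps the `joshua list --stopped` line for `pass=10[0-9][0-9][0-9]`, so a 10000-run ensemble with any failures cannot match. A minimal sketch of that check in Python, assuming the summary line is whitespace-separated `key=value` tokens as in the log above (`parse_summary` and `ci_verdict` are illustrative names, not part of joshua):

```python
import re

# Same pattern the CI step greps for: "pass=10" followed by three digits.
PASS_PATTERN = re.compile(r"pass=10[0-9][0-9][0-9]")


def parse_summary(line):
    """Split a summary line into (ensemble_id, {key: value}).

    Assumes the first token is the ensemble id and the rest are key=value.
    """
    tokens = line.split()
    ensemble_id = tokens[0]
    fields = dict(tok.split("=", 1) for tok in tokens[1:] if "=" in tok)
    return ensemble_id, fields


def ci_verdict(line):
    """Mirror the shell gate: PASS only if the pass-count pattern is present."""
    return "PASS" if PASS_PATTERN.search(line) else "FAIL"


# Abbreviated copy of the summary line from the log above.
summary = (
    "20260618-201151-pr13364-clang-3e8f6a68-1557-3180f074af7f66cb "
    "ended=10000 fail=3 fail_fast=10 max_runs=10000 pass=9997 remaining=0"
)

ensemble_id, fields = parse_summary(summary)
print(ensemble_id, fields["pass"], fields["fail"], ci_verdict(summary))
```

This only explains why the job was marked failed; it says nothing about why `tail --errors` returned no failing test output for the three failures.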

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 3e8f6a6
  • Duration 0:49:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack

Contributor Author

After buggifying SHARD_READY_DELAY:

20260619-231502-stack_buggify-f3f54da3f2f1c70d compressed=True data_size=37218191 duration=3276709 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:27:21 sanity=False started=100000 stopped=20260619-234223 submitted=20260619-231502 timeout=5400 username=stack_buggify

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: b1dfe07
  • Duration 0:21:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: b1dfe07
  • Duration 0:31:48
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: b1dfe07
  • Duration 0:45:41
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: b1dfe07
  • Duration 0:46:11
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: b1dfe07
  • Duration 0:52:07
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: b1dfe07
  • Duration 0:53:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: b1dfe07
  • Duration 1:05:02
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)


5 participants