Skip to content

Move waitForShardReady outside transaction in finishMoveKeys#12981

Open
saintstack wants to merge 6 commits into
apple:release-7.3from
saintstack:dd_7.3.finish_movekeys_timeout
Open

Move waitForShardReady outside transaction in finishMoveKeys#12981
saintstack wants to merge 6 commits into
apple:release-7.3from
saintstack:dd_7.3.finish_movekeys_timeout

Conversation

@saintstack

@saintstack saintstack commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

SERVER_READY_QUORUM_TIMEOUT (15s) was used INSIDE a transaction that must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers are slow to respond (e.g. after a team rebuild unsticks moves), the 15s wait guarantees transaction_too_old.

Split the single transaction into two:

  1. First transaction reads keyServers metadata and server list
  2. waitForShardReady runs outside any transaction (15s timeout is safe)
  3. Second transaction re-verifies dest is unchanged, then commits

If dest changed while waiting (another DD reassigned the shard), the code retries from the top of the inner loop.

20260414-005707-stack_logs-bddd63d245354453 compressed=True data_size=50997449 duration=4877944 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:04 sanity=False started=100000 stopped=20260414-015611 submitted=20260414-005707 timeout=5400 username=stack_logs

This is one of a few fixes to address transaction_too_old seen trying to do inRestore moves on startup incident on 7.3

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 1e09f90
  • Duration 0:04:08
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 1e09f90
  • Duration 0:08:20
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 1e09f90
  • Duration 0:08:25
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 1e09f90
  • Duration 0:08:40
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: c740ef1
  • Duration 0:04:14
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: c740ef1
  • Duration 0:08:08
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: c740ef1
  • Duration 0:08:26
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: c740ef1
  • Duration 0:08:35
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 1e09f90
  • Duration 0:32:48
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 1e09f90
  • Duration 0:39:42
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 0ea8e6f
  • Duration 0:04:01
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 0ea8e6f
  • Duration 0:08:24
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 0ea8e6f
  • Duration 0:08:34
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 0ea8e6f
  • Duration 0:08:38
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts finishMoveKeys() to avoid holding a transaction open while waiting up to SERVER_READY_QUORUM_TIMEOUT (15s) for destination servers to become readable, which can exceed the transaction lifetime and trigger transaction_too_old. It does so by splitting the logic into a “read metadata” transaction, then waiting outside a transaction, and finally a “re-verify + commit” transaction.

Changes:

  • Capture the transaction read version, then tr.reset() before the potentially long waitForShardReady() quorum wait.
  • Introduce a second transaction that re-checks ownership via checkMoveKeysLock, re-reads keyServers, and commits the metadata update.
  • Add logging/probing when the destination team changes during the out-of-transaction wait.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread fdbserver/MoveKeys.actor.cpp
Comment thread fdbserver/MoveKeys.actor.cpp Outdated
@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 0ea8e6f
  • Duration 0:32:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6867999
  • Duration 0:57:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

2 similar comments
@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6867999
  • Duration 0:57:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6867999
  • Duration 0:57:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack

Copy link
Copy Markdown
Contributor Author

20260414-215456-stack_waitForShardReady-b353c4d6beaa525c compressed=True data_size=51148181 duration=4615810 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:50:31 sanity=False started=100000 stopped=20260414-224527 submitted=20260414-215456 timeout=5400 username=stack_waitForShardReady

@saintstack saintstack closed this Apr 14, 2026
@saintstack saintstack reopened this Apr 14, 2026
@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6867999
  • Duration 3:00:54
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

2 similar comments
@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6867999
  • Duration 3:00:54
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6867999
  • Duration 3:00:54
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass

gxglass commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

Will come back later but for now in case anything in here makes you want to tweak this further,

image

@spraza

spraza commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

@saintstack Haven't reviewed this in depth. But curious how are we testing this? In simulation, can we add artificial delay at some places? So before this change: DD gets stuck, 100K is not clean. After this change: DD makes progress, simulation test completes, 100K is clean.

@saintstack

Copy link
Copy Markdown
Contributor Author

Haven't reviewed this in depth. But curious how are we testing this? In simulation, can we add artificial delay at some places? So before this change: DD gets stuck, 100K is not clean. After this change: DD makes progress, simulation test completes, 100K is clean.

@spraza I like this suggestion... doing.

@saintstack saintstack force-pushed the dd_7.3.finish_movekeys_timeout branch from 6867999 to 30720fb Compare April 23, 2026 19:41
@saintstack

Copy link
Copy Markdown
Contributor Author

Rebase and bundle the test suggested by @spraza

@saintstack saintstack requested a review from gxglass April 23, 2026 19:42
@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-73 on Linux RHEL 9

  • Commit ID: 30720fb
  • Duration 1:04:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack

Copy link
Copy Markdown
Contributor Author

20260423-194829-stack_all-4a07e58c2470ad40 compressed=True data_size=51163954 duration=3930274 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:53:53 sanity=False started=100000 stopped=20260423-204222 submitted=20260423-194829 timeout=5400 username=stack_all

Failure is GcGenerations.

michael stack added 6 commits April 23, 2026 14:56
SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction
that must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS).
When destination servers are slow to respond (e.g. after a team
rebuild unsticks moves), the 15s wait guarantees transaction_too_old.

Split the single transaction into two:
1. First transaction reads keyServers metadata and server list
2. waitForShardReady runs outside any transaction (15s timeout is safe)
3. Second transaction re-verifies dest is unchanged, then commits

If dest changed while waiting (another DD reassigned the shard),
the code retries from the top of the inner loop.
Document why the split is safe: checkMoveKeysLock + dest re-verification
in the second transaction ensures stale metadata is never committed.
Fix two issues in the second transaction of the finishMoveKeys split:

1. Restore tr.trState->taskID = TaskPriority::MoveKeys after tr.reset()
   which resets taskID to the default. Without this the verification and
   commit run at wrong priority.

2. Check dest across all subranges in the convergence check, not just
   the first entry. Also verify the iteration boundary (endKey) has not
   changed. A split or partial reassignment during the wait could leave
   some subranges with a different dest which the single-entry check
   would miss.
Injects a 6s delay (5% probability) after shard readiness is confirmed,
simulating a slow storage server. This exercises the fix that moved
waitForShardReady outside the transaction in finishMoveKeys.

Verified with BUGGIFY_WITH_PROB(1.0) on MoveKeysCycle.toml:
- With tr.reset() disabled (old behavior): test FAILS — transaction_too_old
  storms, Cycle workload achieves 2-9 txns vs 89 required, Passed=0
- With tr.reset() enabled (fix): test PASSES — no retry storms, Passed=1
@saintstack saintstack force-pushed the dd_7.3.finish_movekeys_timeout branch from 30720fb to fae8743 Compare April 23, 2026 21:56
@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests-73 on Linux RHEL 9

  • Commit ID: 30720fb
  • Duration 2:23:16
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 30720fb
  • Duration 3:00:56
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: fae8743
  • Duration 0:46:30
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-73 on Linux RHEL 9

  • Commit ID: fae8743
  • Duration 1:24:02
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass

gxglass commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator

This is putting even more lines of code and complexity into a function that is already pushing 400+ lines with multiple levels of nested loops, try/catch blocks, and conditional logic.

Why is this SERVER_READY_QUORUM_TIMEOUT a problem now all of a sudden? Do we have data on how often this condition happens? I suspect this relates to having too many moves going on in general per graphs we have seen, reports of OOMs, etc.

I am not convinced this sort of apparent workaround with obvious complexity cost is the right approach. I think front-end admission control needs attention

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests-73 on Linux RHEL 9

  • Commit ID: fae8743
  • Duration 2:25:16
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@gxglass gxglass left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments earlier in review about complexity vs benefit on this. What is the data saying how often this happens? Why is it happening more now? The default explanation per graphs is that too much data (TBs) is in flight, with the result being individual moves now share bandwidth, CPU, etc and take longer, hitting timeouts. Better concurrency control would seem to avoid all of that, and the need for complex workarounds like this for conditions that should be rare.

saintstack pushed a commit to saintstack/foundationdb that referenced this pull request Jun 1, 2026
Adds a simulation workload that reproduces the finishMoveKeys pipeline stall
from incidents p127 and p102. The test creates conditions where destination
storage servers cannot respond to getShardState requests promptly, causing
waitForShardReady to consume the 5s transaction budget and triggering a
self-reinforcing retry storm.

What is simulated vs real:
- Real: the finishMoveKeys code path, FlowLock slots, retry logic, dataMoves
  accumulation, startMoveKeys outpacing finishMoveKeys, OOM from fetchKeys
  arena growth. All downstream behavior is genuine FDB code.
- Simulated: the reason getShardState is slow. In production, dest SSes are
  CPU-saturated from fetchKeys + client reads + compaction. In simulation,
  we use a FlowLock + queue-proportional delay. The effect is identical
  (getShardState takes seconds instead of microseconds).
- getShardState is normally instantaneous (check shard state, return version).
  It only becomes slow when the SS event loop is saturated. It is ONLY called
  by waitForShardReady inside finishMoveKeys -- normal client transactions
  never touch it.
- The simulation is optimistic: in production ALL operations on an overloaded
  SS are slow, not just getShardState. If a fix works here, it works in
  production where conditions are worse.

Mechanism:
- A FlowLock (capacity=2) on each SS limits concurrent getShardState
  processing. When the lock is acquired, processing is slowed proportional
  to queue depth (0.5s per waiter). This models real CPU contention: the SS
  can only serve N requests at a time, and each takes longer under pressure.
- storage_fetch_keys_delay=2.0 keeps fetchKeys running long enough that the
  concurrency limit stays active during mass shard movement.
- Self-reinforcing: more retries -> longer queue -> slower processing ->
  transaction_too_old -> more retries.

Test structure:
- Loads 500K keys with tiny shards (5KB min, 20KB max) -> ~17,000 shards
- Excludes 3/9 storage servers -> ~8,000 shard moves
- Monitors dataMoves accumulation and pipeline throughput for 900s

Results:
- 89 consecutive stall detections (full 900s observe window, never recovered)
- finishMoveShards: 57% abort rate (72,197/127,566 aborted)
- 102 GB arena reserved (unbounded fetchKeys memory growth)
- Matches p102 OOM pattern: fetchKeys arenas accumulate because
  finishMoveKeys cannot clear dataMoves entries

Fix validation results:
- apple#12981 (waitForShardReady outside txn): FIXES IT. Queue drains fully
  (13,900 -> 78). All moves complete. Root cause eliminated.
- apple#12991 (retry backoff): MAKES IT WORSE. Stall becomes more permanent
  because slower retries prevent even lucky successes from draining queue.
- apple#13112 (global in-flight cap at 200): LIMITS DAMAGE. InFlight capped,
  memory growth halved (49GB vs 102GB), queue partially drains. But stall
  persists (56% abort rate). Defense-in-depth, not a fix.

Run command:
  fdbserver -r simulation -f tests/slow/DDPipelineStall.toml     --memory 16GiB --knob-max_trace_lines 100000000

New knob: BUGGIFY_GET_SHARD_STATE_DELAY (default 0, no production effect).
When set to N in simulation, limits getShardState to N concurrent operations
per storage server.

Files:
- fdbserver/workloads/DDPipelineStall.cpp
- tests/slow/DDPipelineStall.toml
- fdbserver/core/include/fdbserver/core/Knobs.h
- fdbserver/core/ServerKnobs.cpp
- fdbserver/storageserver/storageserver.actor.cpp
- tests/CMakeLists.txt
@saintstack

Copy link
Copy Markdown
Contributor Author

#13364 is forward-port of a version of this patch.

This is putting even more lines of code and complexity into a function that is already pushing 400+ lines with multiple levels of nested loops, try/catch blocks, and conditional logic.

nod. Unfortunately, this is where the problem is.

Why is this SERVER_READY_QUORUM_TIMEOUT a problem now all of a sudden? Do we have data on how often this condition happens? I suspect this relates to having too many moves going on in general per graphs we have seen, reports of OOMs, etc.

None beyond the rash of outages we witnessed recently and that if we let the spotlight roam, it seems to find other cases.

Yes, too many moves while the cluster is under heavy load brings on issues.

I am not convinced this sort of apparent workaround with obvious complexity cost is the right approach. I think front-end admission control needs attention

Front-end admission sounds like a good addition too -- belt and suspenders -- though I'd ascribe admission-control the workaround label given this PR addresses an obvious error and likely cause of outages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants