Flaky test report: committed-code failures on 2026-03-26 #220

@andrross

Description

Overview

This report covers test failures observed in committed-code CI runs (Timer and Post Merge Action builds) during the 24-hour period ending 2026-03-26T10:00Z. Each of the 8 distinct failing tests was re-run locally using the exact seed from the failing build; none of the failures reproduced, which is consistent with flaky, environment-dependent behavior.

Summary Table

| # | Test | Builds Affected (all-time) | First Failure | Recent Build | Reproduced? | Trend |
|---|------|----------------------------|---------------|--------------|-------------|-------|
| 1 | RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation | 207 | 2024-09-02 | 73305 | No | Stable (chronic) |
| 2 | AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness | 132 | 2024-08-31 | 73285 | No | Worsening |
| 3 | RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes | 80 | 2024-04-03 | 73269 | No | Worsening |
| 4 | ReindexBasicTests.testCopyMany | 75 | 2024-03-28 | 73298 | No | Stable (chronic) |
| 5 | TelemetryMetricsEnabledSanityIT.testGaugeWithValueAndTagSupplier | 17 | 2024-07-15 | 73291 | No | Worsening |
| 6 | RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing | 13 | 2024-05-20 | 73308 | No | Stable (sporadic) |
| 7 | RemoteSegmentMetadataHandlerTests.testWriteContent | 8 | 2024-04-17 | 73253 | No | Stable (sporadic) |
| 8 | WarmIndexSegmentReplicationIT.testCancelPrimaryAllocation | 7 | 2025-07-29 | 73321 | No | Stable (low frequency) |

Detailed Findings

1. RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation

  • Recent builds: 73305, 73285
  • Seed: 40B52E4EC0D5EB09 (build 73305)
  • Error: Suite timeout exceeded (>= 1200000 msec)
  • Reproduced locally: No
  • Historical: 207 unique builds since Sep 2024. Fails every month (6-29 builds/month), making it the most prolific flaky test in the codebase.
  • Pattern: Stable/chronic. The test consistently hits the 20-minute suite timeout. Failure rate has been roughly constant since it first appeared.

2. AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

  • Recent build: 73285
  • Seed: 435D01AFDFDA08E8
  • Error: Expected: <120> but: was <118> — shard count assertion off by 2
  • Reproduced locally: No
  • Historical: 132 unique builds since Aug 2024. Was rare initially (1 build in Aug 2024), then surged starting Apr 2025 (9/month), peaking at 18 in Aug 2025. Recent months: 9-16 builds/month.
  • Pattern: Worsening. Failure frequency has increased significantly over the past year.

3. RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes

  • Recent build: 73269
  • Seed: B7520D853200C6E1
  • Error: replica shards haven't caught up with primary expected:<23> but was:<20> — replication lag assertion
  • Reproduced locally: No
  • Historical: 80 unique builds since Apr 2024. Was sporadic early on, then had a major spike Jun-Aug 2025 (29, 21, and 9 builds). Quieted down Sep 2025 - Jan 2026, but resurfaced Feb-Mar 2026 (4 and 12 builds).
  • Pattern: Worsening. After a quiet period, failures are increasing again in recent months.

4. ReindexBasicTests.testCopyMany

  • Recent build: 73298
  • Seed: 83FD5806D44833B4
  • Error: AcknowledgedResponse failed - not acked — index deletion during teardown not acknowledged
  • Reproduced locally: No
  • Historical: 75 unique builds since Mar 2024. Failures occur in bursts: Jul-Sep 2024 (8, 8, and 7 builds), Apr 2025 (9), Jul 2025 (8), with quiet months in between.
  • Pattern: Stable/chronic. Bursty failure pattern with periodic spikes, likely tied to CI infrastructure load.

5. TelemetryMetricsEnabledSanityIT.testGaugeWithValueAndTagSupplier

  • Recent build: 73291
  • Seed: 39FEF6A84A1F4247
  • Error: expected:<3.0> but was:<5.0> — gauge value assertion mismatch
  • Reproduced locally: No
  • Historical: 17 unique builds since Jul 2024. Was very rare (0-2/month) until Mar 2026 when it spiked to 8 builds.
  • Pattern: Worsening. Significant recent spike suggests a possible regression or increased sensitivity.

6. RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

  • Recent build: 73308
  • Seed: E8CC87037191812A
  • Error: [move_allocation] can't move 0, failed to find it on node — shard relocation race condition
  • Reproduced locally: No
  • Historical: 13 unique builds since May 2024. Sporadic: 0-3 builds/month with many zero months. Recent spike of 4 in Mar 2026.
  • Pattern: Stable/sporadic. Low frequency but persistent over a long period.

7. RemoteSegmentMetadataHandlerTests.testWriteContent

  • Recent build: 73253
  • Seed: 4B7C17CA8BD5205C
  • Error: MockDirectoryWrapper: cannot close: there are still 1 open files — file handle leak in test teardown
  • Reproduced locally: No
  • Historical: 8 unique builds since Apr 2024. Very sporadic: 0-2 builds/month with long gaps (no failures Nov 2024 - Jun 2025).
  • Pattern: Stable/sporadic. Rare but persistent file handle leak during test cleanup.

8. WarmIndexSegmentReplicationIT.testCancelPrimaryAllocation

  • Recent build: 73321
  • Seed: A797612A56A64D1C
  • Error: RefreshFailedEngineException[Refresh failed]; nested: CorruptIndexException[misplaced codec footer (file truncated?)]
  • Reproduced locally: No
  • Historical: 7 unique builds since Jul 2025. Low frequency: 0-2 builds/month.
  • Pattern: Stable/low frequency. Relatively new flaky test (< 1 year old), appears to be a timing-dependent corruption during segment replication.

Reproduction Methodology

Each test was run locally on the current main branch using the exact seed from the failing CI build:

```shell
./gradlew <module>:<testType> --tests '<fully.qualified.TestClass.testMethod>' -Dtests.seed=<SEED>
```

None of the 8 failures reproduced, which is consistent with environment-dependent flakiness (timing, resource contention, CI node characteristics).
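As a minimal sketch of the re-run step, the helper below emits the exact local command for a given test and seed. The `:server` module path and `internalClusterTest` task name are assumptions for illustration; substitute the real Gradle project path, test task, and fully-qualified test class for each entry in the table above.

```shell
# Sketch: build the local re-run command for one failing test + CI seed.
# NOTE: module path and test task below are assumptions, not taken from the report.
rerun_cmd() {
  local module="$1" test="$2" seed="$3"
  # Reuses the report's template: ./gradlew <module>:<testType> --tests '<test>' -Dtests.seed=<SEED>
  printf "./gradlew %s:internalClusterTest --tests '%s' -Dtests.seed=%s\n" \
    "$module" "$test" "$seed"
}

# Example: seed 40B52E4EC0D5EB09 from build 73305 (finding 1).
rerun_cmd ":server" \
  "RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation" \
  "40B52E4EC0D5EB09"
```

Running the printed command repeatedly (or adding `-Dtests.iters=<N>` for multiple iterations) is a reasonable next step when a single seeded run does not reproduce the failure.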
