Overview
This report covers test failures observed in committed-code CI runs (Timer and Post Merge Action builds) during the 24-hour period ending 2026-03-26T10:00Z. All 8 distinct failing tests were re-run locally using the exact seed from the failing build; none of the failures reproduced, which is consistent with flaky, environment-dependent behavior.
Summary Table
| # | Test | Builds Affected (all-time) | First Failure | Recent Build | Reproduced? | Trend |
|---|---|---|---|---|---|---|
| 1 | RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation | 207 | 2024-09-02 | 73305 | No | Stable (chronic) |
| 2 | AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness | 132 | 2024-08-31 | 73285 | No | Worsening |
| 3 | RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes | 80 | 2024-04-03 | 73269 | No | Worsening |
| 4 | ReindexBasicTests.testCopyMany | 75 | 2024-03-28 | 73298 | No | Stable (chronic) |
| 5 | TelemetryMetricsEnabledSanityIT.testGaugeWithValueAndTagSupplier | 17 | 2024-07-15 | 73291 | No | Worsening |
| 6 | RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing | 13 | 2024-05-20 | 73308 | No | Stable (sporadic) |
| 7 | RemoteSegmentMetadataHandlerTests.testWriteContent | 8 | 2024-04-17 | 73253 | No | Stable (sporadic) |
| 8 | WarmIndexSegmentReplicationIT.testCancelPrimaryAllocation | 7 | 2025-07-29 | 73321 | No | Stable (low frequency) |
Detailed Findings
1. RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation
- Recent builds: 73305, 73285
- Seed: 40B52E4EC0D5EB09 (build 73305)
- Error: Suite timeout exceeded (>= 1200000 msec)
- Reproduced locally: No
- Historical: 207 unique builds since Sep 2024. Fails consistently every month (6-29 builds/month). This is the most prolific flaky test in the codebase.
- Pattern: Stable/chronic. The test consistently hits the 20-minute suite timeout. Failure rate has been roughly constant since it first appeared.
2. AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness
- Recent build: 73285
- Seed: 435D01AFDFDA08E8
- Error: Expected: <120> but: was <118> (shard count assertion off by 2)
- Reproduced locally: No
- Historical: 132 unique builds since Aug 2024. Was rare initially (1 build in Aug 2024), then surged starting Apr 2025 (9/month), peaking at 18 in Aug 2025. Recent months: 9-16 builds/month.
- Pattern: Worsening. Failure frequency has increased significantly over the past year.
3. RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes
- Recent build: 73269
- Seed: B7520D853200C6E1
- Error: replica shards haven't caught up with primary expected:<23> but was:<20> (replication lag assertion)
- Reproduced locally: No
- Historical: 80 unique builds since Apr 2024. Was sporadic early on, then had a major spike Jun-Aug 2025 (29, 21, 9 builds). Quieted down Sep-Jan, but resurfaced Feb-Mar 2026 (4, 12 builds).
- Pattern: Worsening. After a quiet period, failures are increasing again in recent months.
4. ReindexBasicTests.testCopyMany
- Recent build: 73298
- Seed: 83FD5806D44833B4
- Error: AcknowledgedResponse failed - not acked (index deletion during teardown not acknowledged)
- Reproduced locally: No
- Historical: 75 unique builds since Mar 2024. Failures occur in bursts: Jul-Sep 2024 (8,8,7), Apr 2025 (9), Jul 2025 (8). Quiet months in between.
- Pattern: Stable/chronic. Bursty failure pattern with periodic spikes, likely tied to CI infrastructure load.
5. TelemetryMetricsEnabledSanityIT.testGaugeWithValueAndTagSupplier
- Recent build: 73291
- Seed: 39FEF6A84A1F4247
- Error: expected:<3.0> but was:<5.0> (gauge value assertion mismatch)
- Reproduced locally: No
- Historical: 17 unique builds since Jul 2024. Was very rare (0-2/month) until Mar 2026 when it spiked to 8 builds.
- Pattern: Worsening. Significant recent spike suggests a possible regression or increased sensitivity.
6. RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing
- Recent build: 73308
- Seed: E8CC87037191812A
- Error: [move_allocation] can't move 0, failed to find it on node (shard relocation race condition)
- Reproduced locally: No
- Historical: 13 unique builds since May 2024. Sporadic: 0-3 builds/month with many zero months. Recent spike of 4 in Mar 2026.
- Pattern: Stable/sporadic. Low frequency but persistent over a long period.
7. RemoteSegmentMetadataHandlerTests.testWriteContent
- Recent build: 73253
- Seed: 4B7C17CA8BD5205C
- Error: MockDirectoryWrapper: cannot close: there are still 1 open files (file handle leak in test teardown)
- Reproduced locally: No
- Historical: 8 unique builds since Apr 2024. Very sporadic: 0-2 builds/month with long gaps (no failures Nov 2024 - Jun 2025).
- Pattern: Stable/sporadic. Rare but persistent file handle leak during test cleanup.
8. WarmIndexSegmentReplicationIT.testCancelPrimaryAllocation
- Recent build: 73321
- Seed: A797612A56A64D1C
- Error: RefreshFailedEngineException[Refresh failed]; nested: CorruptIndexException[misplaced codec footer (file truncated?)]
- Reproduced locally: No
- Historical: 7 unique builds since Jul 2025. Low frequency: 0-2 builds/month.
- Pattern: Stable/low frequency. Relatively new flaky test (< 1 year old), appears to be a timing-dependent corruption during segment replication.
Reproduction Methodology
Each test was run locally on the current main branch using the exact seed from the failing CI build:
```shell
./gradlew <module>:<testType> --tests '<fully.qualified.TestClass.testMethod>' -Dtests.seed=<SEED>
```
None of the 8 failures reproduced, which is consistent with environment-dependent flakiness (timing, resource contention, CI node characteristics).
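As a concrete illustration, finding #1 could be re-run as below. The module (`server`), task (`internalClusterTest`), and package (`org.opensearch.remotestore`) are assumptions based on the typical OpenSearch source layout, not taken from the failing build; verify the class location before running. Repeating the test via `tests.iters` can raise the odds of surfacing a timing-dependent failure that a single run misses:

```shell
# Hypothetical reproduction attempt for finding #1.
# Module and package names are assumed; confirm where the test class
# actually lives in the checkout before running.
./gradlew server:internalClusterTest \
  --tests 'org.opensearch.remotestore.RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation' \
  -Dtests.seed=40B52E4EC0D5EB09 \
  -Dtests.iters=5   # repeat the test to probe for timing-dependent flakiness
```

Even with the exact seed pinned, timing, resource contention, and CI node characteristics are not reproduced locally, which is why a clean local run does not rule out genuine flakiness.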