Flaky test report: committed-code failures on 2026-03-26 #220

@andrross

Description

Overview

This report covers test failures observed in committed-code CI runs (Timer and Post Merge Action builds) during the 24-hour period ending 2026-03-26T10:00Z. Each of the 8 distinct failing tests was re-run locally using the exact seed from the failing build; none of the failures reproduced, which is consistent with flaky, environment-dependent behavior.

Summary Table

| # | Test | Builds Affected (all-time) | First Failure | Recent Build | Reproduced? | Trend |
|---|------|----------------------------|---------------|--------------|-------------|-------|
| 1 | RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation | 207 | 2024-09-02 | 73305 | No | Stable (chronic) |
| 2 | AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness | 132 | 2024-08-31 | 73285 | No | Worsening |
| 3 | RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes | 80 | 2024-04-03 | 73269 | No | Worsening |
| 4 | ReindexBasicTests.testCopyMany | 75 | 2024-03-28 | 73298 | No | Stable (chronic) |
| 5 | TelemetryMetricsEnabledSanityIT.testGaugeWithValueAndTagSupplier | 17 | 2024-07-15 | 73291 | No | Worsening |
| 6 | RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing | 13 | 2024-05-20 | 73308 | No | Stable (sporadic) |
| 7 | RemoteSegmentMetadataHandlerTests.testWriteContent | 8 | 2024-04-17 | 73253 | No | Stable (sporadic) |
| 8 | WarmIndexSegmentReplicationIT.testCancelPrimaryAllocation | 7 | 2025-07-29 | 73321 | No | Stable (low frequency) |

Detailed Findings

1. RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation

  • Recent builds: 73305, 73285
  • Seed: 40B52E4EC0D5EB09 (build 73305)
  • Error: Suite timeout exceeded (>= 1200000 msec)
  • Reproduced locally: No
  • Historical: 207 unique builds since Sep 2024. Fails every month (6-29 builds/month), making it the most prolific flaky test in the codebase.
  • Pattern: Stable/chronic. The test consistently hits the 20-minute suite timeout. Failure rate has been roughly constant since it first appeared.

2. AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

  • Recent build: 73285
  • Seed: 435D01AFDFDA08E8
  • Error: Expected: <120> but: was <118> — shard count assertion off by 2
  • Reproduced locally: No
  • Historical: 132 unique builds since Aug 2024. Was rare initially (1 build in Aug 2024), then surged starting Apr 2025 (9/month), peaking at 18 in Aug 2025. Recent months: 9-16 builds/month.
  • Pattern: Worsening. Failure frequency has increased significantly over the past year.

3. RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes

  • Recent build: 73269
  • Seed: B7520D853200C6E1
  • Error: replica shards haven't caught up with primary expected:<23> but was:<20> — replication lag assertion
  • Reproduced locally: No
  • Historical: 80 unique builds since Apr 2024. Was sporadic early on, then had a major spike Jun-Aug 2025 (29, 21, and 9 builds). Quieted down Sep 2025 - Jan 2026, but resurfaced Feb-Mar 2026 (4 and 12 builds).
  • Pattern: Worsening. After a quiet period, failures are increasing again in recent months.

4. ReindexBasicTests.testCopyMany

  • Recent build: 73298
  • Seed: 83FD5806D44833B4
  • Error: AcknowledgedResponse failed - not acked — index deletion during teardown not acknowledged
  • Reproduced locally: No
  • Historical: 75 unique builds since Mar 2024. Failures occur in bursts: Jul-Sep 2024 (8, 8, and 7 builds), Apr 2025 (9), Jul 2025 (8), with quiet months in between.
  • Pattern: Stable/chronic. Bursty failure pattern with periodic spikes, likely tied to CI infrastructure load.

5. TelemetryMetricsEnabledSanityIT.testGaugeWithValueAndTagSupplier

  • Recent build: 73291
  • Seed: 39FEF6A84A1F4247
  • Error: expected:<3.0> but was:<5.0> — gauge value assertion mismatch
  • Reproduced locally: No
  • Historical: 17 unique builds since Jul 2024. Was very rare (0-2/month) until Mar 2026 when it spiked to 8 builds.
  • Pattern: Worsening. Significant recent spike suggests a possible regression or increased sensitivity.

6. RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

  • Recent build: 73308
  • Seed: E8CC87037191812A
  • Error: [move_allocation] can't move 0, failed to find it on node — shard relocation race condition
  • Reproduced locally: No
  • Historical: 13 unique builds since May 2024. Sporadic: 0-3 builds/month with many zero months. Recent spike of 4 in Mar 2026.
  • Pattern: Stable/sporadic. Low frequency but persistent over a long period.

7. RemoteSegmentMetadataHandlerTests.testWriteContent

  • Recent build: 73253
  • Seed: 4B7C17CA8BD5205C
  • Error: MockDirectoryWrapper: cannot close: there are still 1 open files — file handle leak in test teardown
  • Reproduced locally: No
  • Historical: 8 unique builds since Apr 2024. Very sporadic: 0-2 builds/month with long gaps (no failures Nov 2024 - Jun 2025).
  • Pattern: Stable/sporadic. Rare but persistent file handle leak during test cleanup.

8. WarmIndexSegmentReplicationIT.testCancelPrimaryAllocation

  • Recent build: 73321
  • Seed: A797612A56A64D1C
  • Error: RefreshFailedEngineException[Refresh failed]; nested: CorruptIndexException[misplaced codec footer (file truncated?)]
  • Reproduced locally: No
  • Historical: 7 unique builds since Jul 2025. Low frequency: 0-2 builds/month.
  • Pattern: Stable/low frequency. Relatively new flaky test (< 1 year old), appears to be a timing-dependent corruption during segment replication.

Reproduction Methodology

Each test was run locally on the current main branch using the exact seed from the failing CI build:

```shell
./gradlew <module>:<testType> --tests '<fully.qualified.TestClass.testMethod>' -Dtests.seed=<SEED>
```

None of the 8 failures reproduced, which is consistent with environment-dependent flakiness (timing, resource contention, CI node characteristics).
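As a minimal sketch of the re-run step, the helper below emits the exact local command for a given test and seed. The `:server` module path and `internalClusterTest` task name are assumptions for illustration; substitute the real Gradle project path, test task, and fully-qualified test class for each entry in the table above.

```shell
# Sketch: build the local re-run command for one failing test + CI seed.
# NOTE: module path and test task below are assumptions, not taken from the report.
rerun_cmd() {
  local module="$1" test="$2" seed="$3"
  # Reuses the report's template: ./gradlew <module>:<testType> --tests '<test>' -Dtests.seed=<SEED>
  printf "./gradlew %s:internalClusterTest --tests '%s' -Dtests.seed=%s\n" \
    "$module" "$test" "$seed"
}

# Example: seed 40B52E4EC0D5EB09 from build 73305 (finding 1).
rerun_cmd ":server" \
  "RemoteRestoreSnapshotIT.testClusterManagerFailoverDuringSnapshotCreation" \
  "40B52E4EC0D5EB09"
```

Running the printed command repeatedly (or adding `-Dtests.iters=<N>` for multiple iterations) is a reasonable next step when a single seeded run does not reproduce the failure.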
