Flaky test report: committed-code failures on 2026-03-25 #219

@andrross

Committed-code failures detected on 2026-03-25

The following tests failed in gradle-check builds that ran against committed code (Timer runs on main or Post Merge Actions) within the past 24 hours. Historical failure data across all build types (including PR builds) is included to assess flake rates.

Failing Tests

1. MixedClusterClientYamlTestSuiteIT — 310_match_bool_prefix/multi_match multiple fields complete term

  • Recent build: #73223
  • First failure: 2024-03-25
  • Total unique builds affected: 144
  • Pattern: Chronic flaky test active for 2 years. Major spike in Sep 2024 (54 builds), then settled to a steady 1–5 builds/month through 2026. Still consistently failing every month. Stable (persistent low-rate flake).

2. MixedClusterClientYamlTestSuiteIT — 310_match_bool_prefix/multi_match multiple fields partial term

  • Recent build: #73223
  • First failure: 2024-03-25
  • Total unique builds affected: 137
  • Pattern: Nearly identical to the "complete term" variant above. Same Sep 2024 spike (55 builds), same persistent low-rate tail. Stable (persistent low-rate flake).

3. MixedClusterClientYamlTestSuiteIT — 110_strict_allow_templates (Index documents with setting dynamic parameter)

  • Recent build: #73215
  • First failure: 2024-06-26 (MixedCluster variant); 2024-08-06 (ClientYamlTestSuiteIT variant)
  • Total unique builds affected: 48 (MixedCluster) + 58 (ClientYamlTestSuiteIT) = ~106 across variants
  • Pattern: Sporadic flake. Had a large spike in Sep 2024 (39 builds for MixedCluster variant), then mostly quiet with occasional 1–4 builds/month. Jan 2026 saw a spike of 13 builds in the ClientYamlTestSuiteIT variant. Worsening slightly — Jan 2026 spike suggests renewed instability.

4. AwarenessAllocationIT — testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

  • Recent build: #73244
  • First failure: 2024-08-31
  • Total unique builds affected: 131
  • Pattern: Dormant until Apr 2025, then became a high-frequency flake: 9→5→3→9→18→10→16→12→9→14→16→8 builds/month from Apr 2025 through Mar 2026. Worsening — escalated significantly since mid-2025 and remains at high levels.

5. RemoteSegmentMetadataHandlerTests — testWriteContent

  • Recent build: #73253
  • First failure: 2024-04-17
  • Total unique builds affected: 8
  • Pattern: Very rare flake — only 8 builds in nearly 2 years. Scattered across months with long quiet periods. Stable (rare, low-impact flake).

6. RemoteSegmentMetadataHandlerTests — classMethod

  • Recent build: #73253
  • First failure: 2024-04-17
  • Total unique builds affected: 21
  • Pattern: Low-frequency flake, 1–3 builds/month when it appears. Mar 2026 has 3 occurrences so far. Likely a suite-level setup/teardown issue that surfaces when testWriteContent or other tests in the class fail. Stable (low-rate, correlated with other test failures in the class).

7. AzureBlobStoreRepositoryTests — testWriteRead

  • Recent build: #73222
  • First failure: 2024-04-29
  • Total unique builds affected: 75
  • Pattern: Persistent flake for nearly 2 years. Rate has been increasing: 1–2 builds/month in early history, rising to 5–9 builds/month since Nov 2025. Worsening — clear upward trend in recent months.

8. NodeJoinLeftIT — testClusterStabilityWhenDisconnectDuringSlowNodeLeftTask

  • Recent build: #73232
  • First failure: 2025-06-09
  • Total unique builds affected: 8
  • Pattern: Rare flake, only 8 builds in ~10 months. Appears sporadically with 1–2 builds in scattered months. Stable (rare flake).

9. RemoteRestoreSnapshotIT — testClusterManagerFailoverDuringSnapshotCreation (writable_warm_index=true)

  • Recent build: #73248
  • First failure: 2025-06-02
  • Total unique builds affected: 48
  • Pattern: Consistent flake since introduction in Jun 2025. Running at 1–8 builds/month with no sign of improvement. Mar 2026 already at 8 builds. Worsening — Mar 2026 is on track to be the worst month.

10. RemoteRestoreSnapshotIT — classMethod

  • Recent build: #73248
  • First failure: 2024-08-30
  • Total unique builds affected: 125
  • Pattern: High-frequency flake. Escalated from 1–3 builds/month in late 2024 to 9–16 builds/month since mid-2025. Jan 2026 peaked at 16 builds. Worsening — significant escalation over the past year.
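The "Stable" and "Worsening" labels above are qualitative. As a rough illustration of how such a label could be made reproducible, the sketch below compares the later half of a monthly series with its earlier half, using the Apr 2025 through Mar 2026 counts quoted for AwarenessAllocationIT (test 4). The function name `classify_trend` and the 1.25 ratio threshold are assumptions made for this example; the report does not say how its labels were derived.

```python
from statistics import mean

# Monthly affected-build counts for AwarenessAllocationIT,
# Apr 2025 through Mar 2026, as quoted in the pattern for test 4.
monthly_builds = [9, 5, 3, 9, 18, 10, 16, 12, 9, 14, 16, 8]


def classify_trend(counts, ratio_threshold=1.25):
    """Label a series by comparing its later half to its earlier half.

    ratio_threshold is an illustrative assumption, not a value
    taken from the report.
    """
    if len(counts) < 2:
        raise ValueError("need at least two data points")
    half = len(counts) // 2
    earlier = mean(counts[:half])
    later = mean(counts[half:])
    if earlier == 0:
        return "worsening" if later > 0 else "stable"
    ratio = later / earlier
    if ratio >= ratio_threshold:
        return "worsening"
    if ratio <= 1 / ratio_threshold:
        return "improving"
    return "stable"


print(sum(monthly_builds), classify_trend(monthly_builds))
```

With these counts the earlier six months average 9 builds and the later six average 12.5, so the sketch agrees with the report's "Worsening" label. The twelve months sum to 129, which would leave 2 of the 131 total builds before Apr 2025.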

Summary Table

| # | Suite | Test | Recent Build | First Seen | Builds Affected | Trend |
|---|---|---|---|---|---|---|
| 1 | MixedClusterClientYamlTestSuiteIT | 310_match_bool_prefix complete term | #73223 | 2024-03 | 144 | Stable |
| 2 | MixedClusterClientYamlTestSuiteIT | 310_match_bool_prefix partial term | #73223 | 2024-03 | 137 | Stable |
| 3 | AwarenessAllocationIT | testThreeZone...LoadAwareness | #73244 | 2024-08 | 131 | ⚠️ Worsening |
| 4 | RemoteRestoreSnapshotIT | classMethod | #73248 | 2024-08 | 125 | ⚠️ Worsening |
| 5 | MixedClusterClientYamlTestSuiteIT | 110_strict_allow_templates | #73215 | 2024-06 | ~106 | ⚠️ Worsening |
| 6 | AzureBlobStoreRepositoryTests | testWriteRead | #73222 | 2024-04 | 75 | ⚠️ Worsening |
| 7 | RemoteRestoreSnapshotIT | testClusterManagerFailover (warm=true) | #73248 | 2025-06 | 48 | ⚠️ Worsening |
| 8 | RemoteSegmentMetadataHandlerTests | classMethod | #73253 | 2024-04 | 21 | Stable |
| 9 | NodeJoinLeftIT | testClusterStability...SlowNodeLeftTask | #73232 | 2025-06 | 8 | Stable |
| 10 | RemoteSegmentMetadataHandlerTests | testWriteContent | #73253 | 2024-04 | 8 | Stable |
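As a quick consistency check, the table's rows can be tallied by trend and checked for their sort order. The sketch below copies the suite, test, builds-affected, and trend values from the table. The tuple layout and variable names are choices made for this example, and the approximate "~106" is entered as 106.

```python
from collections import Counter

# (suite, test, builds_affected, trend), copied from the summary table.
# Row 5's "~106" is approximate in the report; 106 is used here.
rows = [
    ("MixedClusterClientYamlTestSuiteIT", "310_match_bool_prefix complete term", 144, "Stable"),
    ("MixedClusterClientYamlTestSuiteIT", "310_match_bool_prefix partial term", 137, "Stable"),
    ("AwarenessAllocationIT", "testThreeZone...LoadAwareness", 131, "Worsening"),
    ("RemoteRestoreSnapshotIT", "classMethod", 125, "Worsening"),
    ("MixedClusterClientYamlTestSuiteIT", "110_strict_allow_templates", 106, "Worsening"),
    ("AzureBlobStoreRepositoryTests", "testWriteRead", 75, "Worsening"),
    ("RemoteRestoreSnapshotIT", "testClusterManagerFailover (warm=true)", 48, "Worsening"),
    ("RemoteSegmentMetadataHandlerTests", "classMethod", 21, "Stable"),
    ("NodeJoinLeftIT", "testClusterStability...SlowNodeLeftTask", 8, "Stable"),
    ("RemoteSegmentMetadataHandlerTests", "testWriteContent", 8, "Stable"),
]

# Count how many tests carry each trend label.
trend_counts = Counter(trend for _, _, _, trend in rows)

# The table is ordered by builds affected, highest first.
builds = [affected for _, _, affected, _ in rows]
is_sorted_desc = builds == sorted(builds, reverse=True)

print(trend_counts["Worsening"], trend_counts["Stable"], is_sorted_desc)
```

Tallied this way, the table lists five tests as worsening and five as stable, ordered by builds affected.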

Key Observations

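  • 5 of 10 tests are worsening: AwarenessAllocationIT, RemoteRestoreSnapshotIT (classMethod and testClusterManagerFailover), and AzureBlobStoreRepositoryTests show clear upward trends in failure frequency, and 110_strict_allow_templates is worsening slightly after a Jan 2026 spike in its ClientYamlTestSuiteIT variant.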
  • The MixedClusterClientYamlTestSuiteIT 310_match_bool_prefix tests are the longest-running flakes (2 years, first seen 2024-03-25) but have stabilized at a low rate.
  • RemoteRestoreSnapshotIT.classMethod is likely a suite-level issue that correlates with individual test failures in the class — fixing the underlying test flakes would likely resolve this.
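One way to test the suite-level hypothesis would be to measure how often a classMethod failure shares a build with an individual test failure in the same class. The sketch below is a minimal illustration under assumptions: the helper `overlap_fraction` is hypothetical, and the build numbers are placeholders, not data from the metrics cluster.

```python
def overlap_fraction(class_level_builds, test_level_builds):
    """Fraction of class-level (classMethod) failures whose build also
    recorded at least one individual test failure in the same class."""
    class_level = set(class_level_builds)
    if not class_level:
        return 0.0
    shared = class_level & set(test_level_builds)
    return len(shared) / len(class_level)


# Placeholder build numbers for illustration only; NOT real build data.
class_method_builds = {101, 102, 103, 104}
individual_test_builds = {102, 103, 104, 105}

print(overlap_fraction(class_method_builds, individual_test_builds))  # 0.75
```

A fraction near 1.0 on real data would support treating classMethod as a symptom of the individual test flakes. A low fraction would suggest an independent setup or teardown problem.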

Data sourced from the OpenSearch metrics cluster on 2026-03-25.
