test(router): add retry to flaky indexers_sync e2e tests #9286

Open
KrishnanPrash wants to merge 1 commit into main from kprashanth/flaky-retry

Conversation

KrishnanPrash (Contributor) commented May 7, 2026

Overview

  • The test_indexers_sync router e2e tests intermittently fail in CI when the Rust JetStream indexer transiently loses its message stream (Message stream ended unexpectedly).

  • The Rust side handles this silently (log + retry loop), but the resulting incomplete indexer state causes Python-side assertion failures (e.g. assert successful 1 == 25).

Repro:

pytest -xvs tests/router/test_router_e2e_with_mockers.py::test_indexers_sync[file]

Before: test_indexers_sync[file] intermittently fails with assertion errors on the indexer state comparison.

After: Failed assertions auto-retry up to 3 times. Infrastructure errors (RuntimeError, TimeoutError) still surface immediately.

Details

  • Added @pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"]) to:
    • test_indexers_sync in test_router_e2e_with_mockers.py (3 variants: jetstream, nats_core, file)
    • test_vllm_indexers_sync in test_router_e2e_with_vllm.py
    • test_sglang_indexers_sync in test_router_e2e_with_sglang.py
    • test_trtllm_indexers_sync in test_router_e2e_with_trtllm.py
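
To make the intended retry behaviour concrete, here is a minimal pure-Python sketch of the semantics described above: assertion failures are retried up to `reruns` times, and any other exception propagates on the first attempt. `run_with_reruns` and `make_flaky` are hypothetical names used only for illustration. This is not how the rerun plugin is implemented, and it filters on exception type, whereas the `only_rerun` marker argument takes string patterns.

```python
import time


def run_with_reruns(test_fn, reruns=3, only_rerun=(AssertionError,), reruns_delay=0.0):
    """Hypothetical helper: run test_fn, retrying only on the listed exception types.

    Returns the number of attempts it took for test_fn to pass.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            test_fn()
            return attempts
        except only_rerun:
            # A retryable failure: give up once the rerun budget is spent.
            if attempts > reruns:
                raise
            time.sleep(reruns_delay)
        # Any other exception type (e.g. RuntimeError) is not caught here,
        # so it surfaces immediately without a rerun.


def make_flaky(failures_before_pass, exc_type):
    """Hypothetical stand-in for a test that fails a fixed number of times."""
    state = {"calls": 0}

    def test_fn():
        state["calls"] += 1
        if state["calls"] <= failures_before_pass:
            raise exc_type("simulated failure")

    return test_fn, state
```

With the defaults, a function that raises AssertionError twice and then passes completes on the third attempt, while a function that raises RuntimeError is called exactly once.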

Summary by CodeRabbit

  • Tests
    • Enabled automatic test reruns for end-to-end indexer synchronization tests across multiple router configurations, improving test reliability by automatically retrying on assertion failures.

Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
@KrishnanPrash KrishnanPrash requested review from a team as code owners May 7, 2026 22:41
@github-actions github-actions Bot added the test label May 7, 2026
coderabbitai Bot (Contributor) commented May 7, 2026

Review Change Stack

Walkthrough

The PR adds @pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"]) decorators to four router end-to-end test functions across different backend implementations (mockers, sglang, trtllm, vllm) to automatically retry tests on assertion failures due to documented intermittent event-count mismatches.

Changes

Router E2E Flaky Test Configuration

| Layer / File(s) | Summary |
| --- | --- |
| Test Flakiness Markers: tests/router/test_router_e2e_with_mockers.py, tests/router/test_router_e2e_with_sglang.py, tests/router/test_router_e2e_with_trtllm.py, tests/router/test_router_e2e_with_vllm.py | Four indexer synchronization tests are marked with @pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"]) to automatically rerun on assertion failures across mockers, sglang, trtllm, and vllm backends. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding retry logic to flaky indexers_sync e2e tests across router test files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description includes all required sections: Overview, Details, and Related Issues placeholder, with clear context about the intermittent failures, reproduction steps, and implementation details. |


coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (3)
tests/router/test_router_e2e_with_mockers.py (1)

1199-1200: 🏗️ Heavy lift

Module-level pre_merge + timeout(300) + reruns=3 = up to 1200 s worst-case on the PR gate.

test_indexers_sync inherits pre_merge from the module-level pytestmark. With @pytest.mark.timeout(300) and reruns=3, the worst-case wall time for a completely flaky run is 4 × 300 s = 1200 s (~20 min) on the gate that blocks every PR. The in-code comment above this decorator (lines 1195–1198) even acknowledges the root cause is an ongoing race that "needs root-cause investigation, not a retry."

Consider moving test_indexers_sync to post_merge or adding reruns_delay=10 at minimum to give the Rust JetStream reconnection time to stabilize before the next attempt:

🔧 Minimal mitigation (add reruns_delay)
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])
+@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)

As per coding guidelines: "Prefer post_merge for E2E tests unless they guard a critical path, as E2E tests involve more components and tend to be flakier" and "Only use pre_merge marker for absolutely critical tests that justify blocking every PR."
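
The worst-case figures quoted in this review follow from simple arithmetic, sketched below. It assumes each attempt can consume the full per-test timeout and that `reruns_delay` is paid once between consecutive attempts. `worst_case_seconds` is a hypothetical helper, and the timeout values are the ones quoted in these review comments (the sglang test name is taken from the PR description).

```python
def worst_case_seconds(timeout_s, reruns, reruns_delay_s=0):
    """Upper bound on wall time: (reruns + 1) full-timeout attempts plus the delays between them."""
    attempts = reruns + 1
    return attempts * timeout_s + reruns * reruns_delay_s


# Per-test timeouts in seconds, as quoted in the review comments.
TIMEOUTS = {
    "test_indexers_sync": 300,
    "test_vllm_indexers_sync": 360,
    "test_sglang_indexers_sync": 150,
}

for name, timeout_s in TIMEOUTS.items():
    without_delay = worst_case_seconds(timeout_s, reruns=3)
    with_delay = worst_case_seconds(timeout_s, reruns=3, reruns_delay_s=10)
    print(f"{name}: {without_delay} s without delay, {with_delay} s with reruns_delay=10")
```

Under these assumptions, adding reruns_delay=10 costs at most 30 s on top of each worst case, which is small next to the 600 to 1440 s already at stake.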

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/router/test_router_e2e_with_mockers.py` around lines 1199 - 1200,
test_indexers_sync inherits `@pytest.mark.pre_merge` from the module-level
pytestmark and also carries `@pytest.mark.timeout(300)` and
`@pytest.mark.flaky(reruns=3)`, so it can block PRs for up to ~1200s; fix by
either changing the marker for this test to post_merge (move test_indexers_sync
out of module-level pre_merge or apply `@pytest.mark.post_merge` directly to that
test) or add a retry delay to the flaky decorator (e.g., set reruns_delay=10 on
the flaky marker applied to test_indexers_sync or module pytestmark) so retries
have time to recover from JetStream reconnection before re-running.
tests/router/test_router_e2e_with_vllm.py (1)

657-657: 🏗️ Heavy lift

E2E pre_merge test with reruns=3 amplifies worst-case pre-merge CI blocking to ~24 min.

test_vllm_indexers_sync is pre_merge with @pytest.mark.timeout(360). Adding reruns=3 makes the worst-case wall time 4 × 360 s = 1440 s (~24 min) per run on the gate that blocks every PR. The coding guidelines specify: "Prefer post_merge for E2E tests unless they guard a critical path" and "tests averaging over 60 seconds should default to post_merge."

Consider moving test_vllm_indexers_sync to post_merge to keep pre-merge gates fast, or at minimum add a reruns_delay so that rapid back-to-back retries are avoided while the Rust JetStream stream is reconnecting.

🔧 Minimal mitigation (add reruns_delay)
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])
+@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)

As per coding guidelines: "Prefer post_merge for E2E tests unless they guard a critical path, as E2E tests involve more components and tend to be flakier."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/router/test_router_e2e_with_vllm.py` at line 657, The pre-merge E2E
test test_vllm_indexers_sync currently uses
`@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])` with
`@pytest.mark.timeout(360)`, which can multiply CI wall-time; either change its
marker to a post-merge marker (move the test into the post_merge suite or add
`@pytest.mark.post_merge`) so it no longer runs in the fast pre-merge gate, or
keep it pre-merge but add a reruns_delay to the flaky decorator (e.g.,
`@pytest.mark.flaky(reruns=3, reruns_delay=<seconds>, only_rerun=["AssertionError"])`)
to avoid rapid back-to-back retries while the Rust JetStream stream reconnects.
tests/router/test_router_e2e_with_sglang.py (1)

379-379: ⚡ Quick win

The @pytest.mark.skip_in_nightly and @pytest.mark.flaky target distinct failure modes — confirm this is intentional.

The skip_in_nightly marker (with the associated comment at lines 361–366) exists because the test hangs at the C level in the nightly environment, where pytest-timeout cannot interrupt it. That hang would surface as a TimeoutError/Failed report — not an AssertionError — so only_rerun=["AssertionError"] correctly won't retry it. The new flaky marker only targets the JetStream event-count mismatch race (which does raise AssertionError).

This is the correct design; just noting it explicitly so reviewers don't conflate the two mechanisms. The worst-case pre-merge wall time of 4 × 150 s = 600 s is the most impactful side-effect here — consider adding reruns_delay=10 to allow the Rust stream to reconnect between attempts:

🔧 Optional: add reruns_delay
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])
+@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/router/test_router_e2e_with_sglang.py` at line 379, The flaky marker on
the test currently sets reruns=3 and only_rerun=["AssertionError"]; keep that
behavior but add a short delay between retries so that reruns are not wasted
while the Rust/JetStream connection is still recovering: update the
`@pytest.mark.flaky` decorator (the one with reruns=3 and
only_rerun=["AssertionError"]) to include reruns_delay=10; leave the existing
`@pytest.mark.skip_in_nightly` marker in place since it targets a different
C-level hang failure mode.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 055174e9-1f45-46b9-bb0f-81eb35c5522a

📥 Commits

Reviewing files that changed from the base of the PR and between 7714a20 and bad7ecc.

📒 Files selected for processing (4)
  • tests/router/test_router_e2e_with_mockers.py
  • tests/router/test_router_e2e_with_sglang.py
  • tests/router/test_router_e2e_with_trtllm.py
  • tests/router/test_router_e2e_with_vllm.py

Comment thread tests/router/test_router_e2e_with_mockers.py
keivenchang (Contributor) left a comment

Thank you!!!

PeaBrane (Contributor) left a comment

Let's root cause this instead of retries. I'll try to have a look on my end
