test(router): add retry to flaky indexers_sync e2e tests by KrishnanPrash · Pull Request #9286 · ai-dynamo/dynamo

KrishnanPrash · 2026-05-07T22:41:11Z

Overview

The test_indexers_sync router e2e tests intermittently fail in CI when the Rust JetStream indexer transiently loses its message stream (Message stream ended unexpectedly).
The Rust side handles this silently (log + retry loop), but the resulting incomplete indexer state causes Python-side assertion failures (e.g. assert successful 1 == 25).

Repro:
```
# Quote the test ID so the shell does not treat [file] as a glob pattern.
pytest -xvs "tests/router/test_router_e2e_with_mockers.py::test_indexers_sync[file]"
```
Before: test_indexers_sync[file] intermittently fails with assertion errors on indexer state comparison

After: Failed assertions auto-retry up to 3 times. Infrastructure errors (RuntimeError, TimeoutError) still surface immediately
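
The Before/After behavior can be illustrated with a small stand-alone model of the rerun policy. This is a sketch under stated assumptions, not the plugin's implementation: `run_with_reruns`, `make_flaky_check`, and `infra_failure` are hypothetical names, and the model assumes a failure is retryable when a pattern from `only_rerun` matches the text `"<ExceptionName>: <message>"`. The real plugin's matching rules may differ in detail.

```python
import re


def run_with_reruns(test_fn, reruns=3, only_rerun=("AssertionError",)):
    """Toy model of a rerun policy; returns the number of attempts used.

    Assumption: a failure is retryable when any pattern in `only_rerun`
    matches the text "<ExceptionName>: <message>".
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            test_fn()
            return attempt
        except Exception as exc:
            failure_text = f"{type(exc).__name__}: {exc}"
            retryable = any(re.search(p, failure_text) for p in only_rerun)
            if not retryable or attempt > reruns:
                raise  # non-matching error, or reruns exhausted


def make_flaky_check(failures_before_pass):
    """Hypothetical stand-in for a test whose assertion fails a few times."""
    state = {"calls": 0}

    def check():
        state["calls"] += 1
        assert state["calls"] > failures_before_pass, "indexer state mismatch"

    return check


def infra_failure():
    """Hypothetical infrastructure error that should not be retried."""
    raise RuntimeError("worker did not start")


print(run_with_reruns(make_flaky_check(2)))  # passes on the third attempt
try:
    run_with_reruns(infra_failure)
except RuntimeError as exc:
    print("not retried:", exc)
```

In this model an assertion that fails twice and then passes costs three attempts, while the `RuntimeError` propagates after a single call.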

Details
- Added @pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"]) to:
  - test_indexers_sync in test_router_e2e_with_mockers.py (3 variants:
    jetstream, nats_core, file)
  - test_vllm_indexers_sync in test_router_e2e_with_vllm.py
  - test_sglang_indexers_sync in test_router_e2e_with_sglang.py
  - test_trtllm_indexers_sync in test_router_e2e_with_trtllm.py
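
As a rough sketch of what the decorated test might look like: the `reruns` and `only_rerun` keywords match the pytest-rerunfailures plugin, which is assumed to be installed in CI for the marker to have any effect. The `store` parameter name and the use of `parametrize` for the three variants are assumptions made for illustration; the source only shows the variant ids.

```python
import pytest


# Assumption: `store` is a hypothetical parameter name; the source only shows
# the three variant ids (jetstream, nats_core, file).
@pytest.mark.parametrize("store", ["jetstream", "nats_core", "file"])
@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])
def test_indexers_sync(store):
    # Body elided: the real test compares indexer state.
    ...


# Markers are stored on the function, so every parametrized variant carries
# the same flaky settings.
print(sorted(m.name for m in test_indexers_sync.pytestmark))
```

Because the marker sits on the function rather than on one parameter set, all three variants get the same retry policy.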

Summary by CodeRabbit

Tests
- Enabled automatic test reruns for end-to-end indexer synchronization tests across multiple router configurations, improving test reliability by automatically retrying on assertion failures.

Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>

coderabbitai · 2026-05-07T22:48:12Z

Walkthrough

The PR adds @pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"]) decorators to four router end-to-end test functions across different backend implementations (mockers, sglang, trtllm, vllm) to automatically retry tests on assertion failures due to documented intermittent event-count mismatches.

Changes

Router E2E Flaky Test Configuration

| Layer / File(s) | Summary |
|---|---|
| Test Flakiness Markers: `tests/router/test_router_e2e_with_mockers.py`, `tests/router/test_router_e2e_with_sglang.py`, `tests/router/test_router_e2e_with_trtllm.py`, `tests/router/test_router_e2e_with_vllm.py` | Four indexer synchronization tests are marked with `@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])` to automatically rerun on assertion failures across mockers, sglang, trtllm, and vllm backends. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding retry logic to flaky indexers_sync e2e tests across router test files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description includes all required sections: Overview, Details, and Related Issues placeholder, with clear context about the intermittent failures, reproduction steps, and implementation details. |

coderabbitai

🧹 Nitpick comments (3)

tests/router/test_router_e2e_with_mockers.py (1)
1199-1200: 🏗️ Heavy lift

Module-level pre_merge + timeout(300) + reruns=3 = up to 1200 s worst-case on the PR gate.

test_indexers_sync inherits pre_merge from the module-level pytestmark. With @pytest.mark.timeout(300) and reruns=3, the worst-case wall time for a completely flaky run is 4 × 300 s = 1200 s (~20 min) on the gate that blocks every PR. The in-code comment above this decorator (lines 1195–1198) even acknowledges the root cause is an ongoing race that "needs root-cause investigation, not a retry."

Consider moving test_indexers_sync to post_merge or adding reruns_delay=10 at minimum to give the Rust JetStream reconnection time to stabilize before the next attempt:
🔧 Minimal mitigation (add reruns_delay)
```diff
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])
+@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)
```
As per coding guidelines: "Prefer post_merge for E2E tests unless they guard a critical path, as E2E tests involve more components and tend to be flakier" and "Only use pre_merge marker for absolutely critical tests that justify blocking every PR."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/router/test_router_e2e_with_mockers.py` around lines 1199-1200:
test_indexers_sync inherits `pre_merge` from the module-level pytestmark and is
also marked `@pytest.mark.timeout(300)` and `@pytest.mark.flaky(reruns=3)`, so it
can block PRs for up to ~1200 s. Fix by either moving the test to post_merge
(take test_indexers_sync out of the module-level pre_merge or apply
`@pytest.mark.post_merge` directly to it) or adding a retry delay to the flaky
decorator (e.g., `reruns_delay=10` on the flaky marker applied to
test_indexers_sync) so retries have time to recover from JetStream
reconnection before re-running.
tests/router/test_router_e2e_with_vllm.py (1)
657-657: 🏗️ Heavy lift

E2E pre_merge test with reruns=3 amplifies worst-case pre-merge CI blocking to ~24 min.

test_vllm_indexers_sync is pre_merge with @pytest.mark.timeout(360). Adding reruns=3 makes the worst-case wall time 4 × 360 s = 1440 s (~24 min) per run on the gate that blocks every PR. The coding guidelines specify: "Prefer post_merge for E2E tests unless they guard a critical path" and "tests averaging over 60 seconds should default to post_merge."

Consider moving test_vllm_indexers_sync to post_merge to keep pre-merge gates fast, or at minimum add a reruns_delay so that rapid back-to-back retries are avoided while the Rust JetStream stream is reconnecting.
🔧 Minimal mitigation (add reruns_delay)
```diff
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])
+@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)
```
As per coding guidelines: "Prefer post_merge for E2E tests unless they guard a critical path, as E2E tests involve more components and tend to be flakier."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/router/test_router_e2e_with_vllm.py` at line 657: the pre-merge E2E
test test_vllm_indexers_sync currently uses
`@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])` with
`@pytest.mark.timeout(360)`, which can multiply CI wall time. Either change its
marker to post-merge (move the test into the post_merge suite or add
`@pytest.mark.post_merge`) so it no longer runs in the fast pre-merge gate, or
keep it pre-merge but add a reruns_delay to the flaky decorator (e.g.,
`@pytest.mark.flaky(reruns=3, reruns_delay=<seconds>, only_rerun=["AssertionError"])`)
to avoid rapid back-to-back retries while the Rust JetStream stream reconnects.
tests/router/test_router_e2e_with_sglang.py (1)
379-379: ⚡ Quick win

The @pytest.mark.skip_in_nightly and @pytest.mark.flaky target distinct failure modes — confirm this is intentional.

The skip_in_nightly marker (with the associated comment at lines 361–366) exists because the test hangs at the C level in the nightly environment, where pytest-timeout cannot interrupt it. That hang would surface as a TimeoutError/Failed report — not an AssertionError — so only_rerun=["AssertionError"] correctly won't retry it. The new flaky marker only targets the JetStream event-count mismatch race (which does raise AssertionError).

This is the correct design; just noting it explicitly so reviewers don't conflate the two mechanisms. The worst-case pre-merge wall time of 4 × 150 s = 600 s is the most impactful side-effect here — consider adding reruns_delay=10 to allow the Rust stream to reconnect between attempts:
🔧 Optional: add reruns_delay
```diff
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])
+@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)
```
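
The worst-case figures quoted in the three comments above (1200 s, 1440 s, 600 s) follow from the same arithmetic, sketched here. The helper name `worst_case_seconds` is hypothetical, and the bound assumes every attempt runs for up to the full timeout and ignores setup and teardown. It also shows that `reruns_delay` adds a small amount to the bound rather than reducing it.

```python
def worst_case_seconds(timeout_s, reruns, reruns_delay_s=0):
    """Upper bound on wall time if every attempt uses its full timeout."""
    attempts = reruns + 1  # the first run plus each rerun
    return attempts * timeout_s + reruns * reruns_delay_s


# Timeouts as quoted in the review comments for each test file.
for name, timeout_s in [("mockers", 300), ("vllm", 360), ("sglang", 150)]:
    no_delay = worst_case_seconds(timeout_s, reruns=3)
    with_delay = worst_case_seconds(timeout_s, reruns=3, reruns_delay_s=10)
    print(f"{name}: {no_delay} s without delay, {with_delay} s with reruns_delay=10")
```

With `reruns_delay=10` the bound grows by 30 s per test, which is small next to the timeouts themselves.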
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/router/test_router_e2e_with_sglang.py` at line 379: the flaky marker on
the test currently sets reruns=3 and only_rerun=["AssertionError"]. Keep that
behavior but add a fixed delay between retries so the Rust/JetStream connection
can recover before the next attempt: update the `@pytest.mark.flaky` decorator
to include reruns_delay=10. Leave the existing `@pytest.mark.skip_in_nightly`
marker in place, since it targets a different failure mode (a C-level hang).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 055174e9-1f45-46b9-bb0f-81eb35c5522a

📥 Commits

Reviewing files that changed from the base of the PR and between 7714a20 and bad7ecc.

📒 Files selected for processing (4)

tests/router/test_router_e2e_with_mockers.py
tests/router/test_router_e2e_with_sglang.py
tests/router/test_router_e2e_with_trtllm.py
tests/router/test_router_e2e_with_vllm.py

keivenchang

Thank you!!!

PeaBrane

Let's root cause this instead of retries. I'll try to have a look on my end

simplify

bad7ecc

Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>

KrishnanPrash requested review from a team as code owners May 7, 2026 22:41

pull-request-size Bot added the size/XS label May 7, 2026

github-actions Bot added the test label May 7, 2026

coderabbitai Bot reviewed May 7, 2026

keivenchang reviewed May 7, 2026

Comment thread tests/router/test_router_e2e_with_mockers.py

keivenchang approved these changes May 7, 2026

PeaBrane requested changes May 7, 2026

KrishnanPrash wants to merge 1 commit into main from kprashanth/flaky-retry
