test(router): add retry to flaky indexers_sync e2e tests#9286
test(router): add retry to flaky indexers_sync e2e tests#9286KrishnanPrash wants to merge 1 commit intomainfrom
Conversation
WalkthroughThe PR adds ChangesRouter E2E Flaky Test Configuration
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tests/router/test_router_e2e_with_mockers.py (1)
1199-1200: 🏗️ Heavy liftModule-level
pre_merge+timeout(300)+reruns=3= up to 1200 s worst-case on the PR gate.
test_indexers_syncinheritspre_mergefrom the module-levelpytestmark. With@pytest.mark.timeout(300)andreruns=3, the worst-case wall time for a completely flaky run is 4 × 300 s = 1200 s (~20 min) on the gate that blocks every PR. The in-code comment above this decorator (lines 1195–1198) even acknowledges the root cause is an ongoing race that "needs root-cause investigation, not a retry."Consider moving
test_indexers_synctopost_mergeor addingreruns_delay=10at minimum to give the Rust JetStream reconnection time to stabilize before the next attempt:🔧 Minimal mitigation (add reruns_delay)
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"]) +@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)As per coding guidelines: "Prefer
post_mergefor E2E tests unless they guard a critical path, as E2E tests involve more components and tend to be flakier" and "Only usepre_mergemarker for absolutely critical tests that justify blocking every PR."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/router/test_router_e2e_with_mockers.py` around lines 1199 - 1200, The module-level pytestmark uses `@pytest.mark.pre_merge` plus `@pytest.mark.timeout`(300) and `@pytest.mark.flaky`(reruns=3) which makes test_indexers_sync potentially block PRs for up to ~1200s; fix by either changing the marker for this test to post_merge (move test_indexers_sync out of module-level pre_merge or apply `@pytest.mark.post_merge` directly to that test) or add a retry delay to the flaky decorator (e.g., set reruns_delay=10 on the flaky marker applied to test_indexers_sync or module pytestmark) so retries have time to recover from JetStream reconnection before re-running.tests/router/test_router_e2e_with_vllm.py (1)
657-657: 🏗️ Heavy liftE2E
pre_mergetest withreruns=3amplifies worst-case pre-merge CI blocking to ~24 min.
test_vllm_indexers_syncispre_mergewith@pytest.mark.timeout(360). Addingreruns=3makes the worst-case wall time 4 × 360 s = 1440 s (~24 min) per run on the gate that blocks every PR. The coding guidelines specify: "Preferpost_mergefor E2E tests unless they guard a critical path" and "tests averaging over 60 seconds should default topost_merge."Consider moving
test_vllm_indexers_synctopost_mergeto keep pre-merge gates fast, or at minimum add areruns_delayso that rapid back-to-back retries are avoided while the Rust JetStream stream is reconnecting.🔧 Minimal mitigation (add reruns_delay)
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"]) +@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)As per coding guidelines: "Prefer
post_mergefor E2E tests unless they guard a critical path, as E2E tests involve more components and tend to be flakier."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/router/test_router_e2e_with_vllm.py` at line 657, The pre-merge E2E test test_vllm_indexers_sync currently uses `@pytest.mark.flaky`(reruns=3, only_rerun=["AssertionError"]) with `@pytest.mark.timeout`(360), which can multiply CI wall-time; either change its marker to a post-merge marker (move the test into the post_merge suite or add `@pytest.mark.post_merge`) so it no longer runs in the fast pre-merge gate, or keep it pre-merge but add a reruns_delay to the flaky decorator (e.g., `@pytest.mark.flaky`(reruns=3, reruns_delay=<seconds>, only_rerun=["AssertionError"])) to avoid rapid back-to-back retries while the Rust JetStream stream reconnects.tests/router/test_router_e2e_with_sglang.py (1)
379-379: ⚡ Quick winThe
@pytest.mark.skip_in_nightlyand@pytest.mark.flakytarget distinct failure modes — confirm this is intentional.The
skip_in_nightlymarker (with the associated comment at lines 361–366) exists because the test hangs at the C level in the nightly environment, wherepytest-timeoutcannot interrupt it. That hang would surface as aTimeoutError/Failedreport — not anAssertionError— soonly_rerun=["AssertionError"]correctly won't retry it. The new flaky marker only targets the JetStream event-count mismatch race (which does raiseAssertionError).This is the correct design; just noting it explicitly so reviewers don't conflate the two mechanisms. The worst-case pre-merge wall time of 4 × 150 s = 600 s is the most impactful side-effect here — consider adding
reruns_delay=10to allow the Rust stream to reconnect between attempts:🔧 Optional: add reruns_delay
-@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"]) +@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"], reruns_delay=10)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/router/test_router_e2e_with_sglang.py` at line 379, The flaky marker on the test currently sets reruns=3 and only_rerun=["AssertionError"]; keep that behavior but add a small retry backoff to reduce wasted wall time and allow the Rust/JetStream connection to recover between attempts—update the `@pytest.mark.flaky` decorator (the one with reruns=3 and only_rerun=["AssertionError"]) to include reruns_delay=10; leave the existing `@pytest.mark.skip_in_nightly` marker in place since it targets a different C-level hang failure mode.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/router/test_router_e2e_with_mockers.py`:
- Around line 1199-1200: The module-level pytestmark uses `@pytest.mark.pre_merge`
plus `@pytest.mark.timeout`(300) and `@pytest.mark.flaky`(reruns=3) which makes
test_indexers_sync potentially block PRs for up to ~1200s; fix by either
changing the marker for this test to post_merge (move test_indexers_sync out of
module-level pre_merge or apply `@pytest.mark.post_merge` directly to that test)
or add a retry delay to the flaky decorator (e.g., set reruns_delay=10 on the
flaky marker applied to test_indexers_sync or module pytestmark) so retries have
time to recover from JetStream reconnection before re-running.
In `@tests/router/test_router_e2e_with_sglang.py`:
- Line 379: The flaky marker on the test currently sets reruns=3 and
only_rerun=["AssertionError"]; keep that behavior but add a small retry backoff
to reduce wasted wall time and allow the Rust/JetStream connection to recover
between attempts—update the `@pytest.mark.flaky` decorator (the one with reruns=3
and only_rerun=["AssertionError"]) to include reruns_delay=10; leave the
existing `@pytest.mark.skip_in_nightly` marker in place since it targets a
different C-level hang failure mode.
In `@tests/router/test_router_e2e_with_vllm.py`:
- Line 657: The pre-merge E2E test test_vllm_indexers_sync currently uses
`@pytest.mark.flaky`(reruns=3, only_rerun=["AssertionError"]) with
`@pytest.mark.timeout`(360), which can multiply CI wall-time; either change its
marker to a post-merge marker (move the test into the post_merge suite or add
`@pytest.mark.post_merge`) so it no longer runs in the fast pre-merge gate, or
keep it pre-merge but add a reruns_delay to the flaky decorator (e.g.,
`@pytest.mark.flaky`(reruns=3, reruns_delay=<seconds>,
only_rerun=["AssertionError"])) to avoid rapid back-to-back retries while the
Rust JetStream stream reconnects.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 055174e9-1f45-46b9-bb0f-81eb35c5522a
📒 Files selected for processing (4)
tests/router/test_router_e2e_with_mockers.pytests/router/test_router_e2e_with_sglang.pytests/router/test_router_e2e_with_trtllm.pytests/router/test_router_e2e_with_vllm.py
Overview
The
test_indexers_syncrouter e2e tests intermittently fail in CI when the Rust JetStream indexer transiently loses its message stream (Message stream ended unexpectedly).The Rust side handles this silently (log + retry loop), but the resulting incomplete indexer state causes Python-side assertion failures (e.g.
assert successful 1 == 25).Repro:
Before:
test_indexers_sync[file]intermittently fails with assertion errors on indexer state comparisonAfter: Failed assertions auto-retry up to 3 times. Infrastructure errors (
RuntimeError,TimeoutError) still surface immediatelyDetails
@pytest.mark.flaky(reruns=3, only_rerun=["AssertionError"])to:test_indexers_sync in test_router_e2e_with_mockers.py(3 variants:jetstream, nats_core, file)
test_vllm_indexers_sync in test_router_e2e_with_vllm.pytest_sglang_indexers_sync in test_router_e2e_with_sglang.pytest_trtllm_indexers_sync in test_router_e2e_with_trtllm.pySummary by CodeRabbit