fix(test): retry TUI chat correlation E2E on transient gateway races#4382
Draft
hunglp6d wants to merge 3 commits into
Draft
fix(test): retry TUI chat correlation E2E on transient gateway races#4382hunglp6d wants to merge 3 commits into
hunglp6d wants to merge 3 commits into
Conversation
…y race The openclaw-tui-chat-correlation-e2e live test retried only when the WebSocket client captured zero chat events (the "event capture failure" pattern). A second transient race — where all replies arrive but every runId differs from the chat.send response — was not retried, causing intermittent nightly failures (e.g. run 26546628518). Add looksLikeTotalCorrelationRace() to detect this second transient pattern (all events uncorrelated + later user turns missing from chat.history) and extend the retry loop to allow up to two retries (three total attempts) for either transient signature. A real correlation regression would break only some runs or leave user turns intact, so it will not be masked by the retry. Signed-off-by: Hung Le <hple@nvidia.com>
The validation run for the total-correlation-race fix (attempt 1) revealed a third transient pattern: partial reply delivery where the gateway delivers only a subset of replies correctly while the remaining replies never arrive. Add looksLikePartialReplyDelivery() to detect this signature (missingReplies > 0, uncorrelatedReplies empty, no empty finals or duplicates) and include it in the transient gateway race retry logic. Signed-off-by: Hung Le <hple@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Contributor
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
2 tasks
Contributor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The
openclaw-tui-chat-correlation-e2elive test retried only on the "zero-event capture" transient race, not on two other transient gateway patterns that cause intermittent nightly failures (e.g. run 26546628518). This PR adds detection for both additional patterns and extends the retry loop from 1 to 2 retries (3 total attempts), while preserving signal for real partial correlation regressions.Related Issue
Fixes #4383
Changes
looksLikeTotalCorrelationRace()to detect the transient pattern where all chat events are uncorrelated and later user turns are missing fromchat.historylooksLikePartialReplyDelivery()to detect the transient pattern where the gateway delivers only a subset of replies (correctly correlated) while the remaining replies never arrivelooksLikeTransientGatewayRace()combining all three transient detectorsrunLiveIssue2603ReproWithEventCaptureRetryfrom a singleif-guard to awhileloop withMAX_TRANSIENT_RETRIES=2(3 total attempts)Validation
A focused
custom-e2e.yamlworkflow was run on a sibling branch to confirm this fix repairs the regression. The workflow re-runs only the jobs from the original nightly that this PR targets, onubuntu-latest, off the same fix commit as this PR.fix/nightly-e2e-tui-chat-correlation-race-1daf081-custom-e2eonhunglp6d/NemoClawopenclaw-tui-chat-correlation-e2e (#78199767794)1daf081718489f514d8219d7e229f8ed712ce82dThe validation branch is intentionally not the head of this PR — it carries an extra
.github/workflows/custom-e2e.yamlcommit that is scaffolding, not part of the fix. Re-run the validation by pushing any commit to the validation branch.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesAI Disclosure
Signed-off-by: Hung Le hple@nvidia.com