Skip to content

fix(proxy): fail over pre-visible refresh connect stalls#822

Open
Komzpa wants to merge 3 commits into
mainfrom
fix/previsible-refresh-connect-failover
Open

fix(proxy): fail over pre-visible refresh connect stalls#822
Komzpa wants to merge 3 commits into
mainfrom
fix/previsible-refresh-connect-failover

Conversation

@Komzpa
Copy link
Copy Markdown
Collaborator

@Komzpa Komzpa commented May 26, 2026

Summary

  • fail over compact forced-refresh/connect stalls to another eligible account instead of surfacing before output
  • fail over HTTP Responses stream refresh/connect and post-401 forced-refresh/connect stalls before visible output
  • keep strict preferred-owner and file-pin paths fail-closed so pinned resources do not silently move accounts

Tests

  • uv run pytest tests/unit/test_proxy_utils.py::test_stream_refresh_timeout_before_visible_output_fails_over tests/unit/test_proxy_utils.py::test_stream_forced_refresh_timeout_before_visible_output_fails_over tests/unit/test_proxy_utils.py::test_compact_responses_forced_refresh_connection_reset_fails_over tests/unit/test_proxy_utils.py::test_compact_responses_refresh_connection_reset_fails_over tests/unit/test_proxy_utils.py::test_compact_responses_refresh_non_transient_client_error_does_not_penalize_accounts -q
  • uv run ruff check app/modules/proxy/service.py tests/unit/test_proxy_utils.py
  • uv run ty check app/modules/proxy/service.py tests/unit/test_proxy_utils.py
  • git diff --check

Follow-up

  • thread-goal, codex-control, transcribe, and files still have unary refresh/connect no-failover gaps; those should use a shared unary retry helper rather than four copy-paste loops.

No linked issue; addresses the owner-review file-pinned compact failover blocker in this PR.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 453eedba27

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/modules/proxy/service.py
@Soju06 Soju06 added the 🤖 codex: needs work [@codex review] raised an issue label May 26, 2026
@Komzpa Komzpa force-pushed the fix/previsible-refresh-connect-failover branch from 453eedb to cd91a88 Compare May 26, 2026 17:39
@Komzpa Komzpa removed the 🤖 codex: needs work [@codex review] raised an issue label May 26, 2026
@Komzpa
Copy link
Copy Markdown
Collaborator Author

Komzpa commented May 26, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@Komzpa
Copy link
Copy Markdown
Collaborator Author

Komzpa commented May 26, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@Soju06
Copy link
Copy Markdown
Owner

Soju06 commented May 27, 2026

Merge blocker found in owner review: compact initial refresh/connect failover can break file-pinned account ownership.

The forced-refresh path now correctly fail-closes when file_preferred_account_id is set, but the initial compact freshness failure path still treats transient refresh/connect errors as eligible for account failover. If a compact request includes an input_file.file_id pinned to account A and account A hits a transient initial refresh/connect error, this PR can exclude A and send the compact request through account B.

That violates the same file-owner invariant the forced-refresh guard is trying to preserve. Please either fail-closed for file-pinned compact requests in the initial refresh/connect transient branch too, or add a convincing test/proof that cross-account compact retry is safe there.

Also note current merge gates are not clean: fresh Codex review attempts hit usage limits, and this observable failover behavior change has no OpenSpec/issue trace.

@Soju06 Soju06 added the hermes: needs-followup Hermes left a blocker/comment that needs follow-up observation label May 27, 2026
@Soju06
Copy link
Copy Markdown
Owner

Soju06 commented May 27, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@Soju06
Copy link
Copy Markdown
Owner

Soju06 commented May 27, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@Komzpa
Copy link
Copy Markdown
Collaborator Author

Komzpa commented May 31, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Soju06 Soju06 added the 🤖 codex: ok [@codex review] says no issues found. label May 31, 2026
@Komzpa Komzpa removed the hermes: needs-followup Hermes left a blocker/comment that needs follow-up observation label Jun 1, 2026
@Soju06
Copy link
Copy Markdown
Owner

Soju06 commented Jun 2, 2026

Merge blocker: file-pinned compact requests can still cross accounts on the actual upstream compact connect path

The new guards cover _ensure_fresh_with_budget(...) refresh/connect failures and the 401 forced-refresh branch, but the ordinary _call_compact() path still has a gap. app/core/clients/proxy.py converts an upstream compact aiohttp.ClientError / timeout during session.post(...) into ProxyResponseError(..., failure_phase="connect", retryable_same_contract=True). In app/modules/proxy/service.py, that then goes through the generic compact ProxyResponseError handling: one same-contract retry, then _handle_stream_error(...), failover_decision(...), excluded_account_ids.add(account.id), and the next outer attempt can select another account. There is no file_preferred_account_id fail-closed guard on that branch.

I reproduced this locally with a pinned input_file.file_id: account A's compact connect raised a retryable upstream_unavailable, the service retried A once, excluded A, selected account B, and returned success from B:

seen_excluded [set(), {'acc_file_pin_connect_a'}]
compact_account_ids ['acc_file_pin_connect_a', 'acc_file_pin_connect_a', 'acc_file_pin_connect_b']
request_log_account_id acc_file_pin_connect_b
request_log_status success

That still violates the new OpenSpec requirement that a file-pinned compact request MUST surface upstream-unavailable instead of replaying on another account when the pinned account cannot open the upstream compact connection before output. Please add the same file-pin fail-closed guard/regression for the retryable_same_contract / failure_phase="connect" compact ProxyResponseError path, not only the freshness branches.

@Soju06 Soju06 added the hermes: needs-followup Hermes left a blocker/comment that needs follow-up observation label Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🤖 codex: ok [@codex review] says no issues found. hermes: needs-followup Hermes left a blocker/comment that needs follow-up observation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants