fix(reliable): fail fast on SESSION_EXPIRED in provider retry loop#2200

Merged
senamakel merged 4 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/reliable-session-expired-non-retryable
May 19, 2026

Conversation

Contributor

@YellowSnnowmann YellowSnnowmann commented May 19, 2026

Summary

  • Treat SESSION_EXPIRED errors as non-retryable in ReliableProvider classification.
  • Stop wasting retry budget on auth-state failures that can only be resolved by re-auth/sign-in.
  • Reduce noisy aggregate failures like repeated attempt 1/3 ... attempt 3/3 for the same expired-session condition.
  • Add regression coverage in reliable_tests.rs to verify session-expired errors short-circuit retries.
  • Keep existing retry behavior unchanged for transient upstream failures (429/5xx/timeouts).

Problem

  • The reliable provider layer retried SESSION_EXPIRED as if it were a transient provider/network failure.
  • That caused repeated failed attempts with no chance of recovery, slower user feedback, and noisy Sentry events.
  • The expected behavior for an expired backend session is immediate failure, so the app can prompt sign-in/re-auth.

Solution

  • Updated is_non_retryable in src/openhuman/inference/provider/reliable.rs to classify messages matching is_session_expired_message(...) as non-retryable.
  • This ensures the retry loop exits after the first failed attempt for expired-session boundaries.
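The classification order described above can be sketched roughly as follows. This is a minimal illustration, not the code from reliable.rs: `is_session_expired_message` here is a hypothetical stand-in for the observability helper (its real patterns are not shown in this PR), and `legacy_is_non_retryable` is an assumed simplification of the pre-existing status/message heuristics.

```rust
use std::fmt::Display;

/// Hypothetical stand-in for the centralized observability helper.
/// Assumption: the canonical pattern includes the `SESSION_EXPIRED` marker.
fn is_session_expired_message(msg: &str) -> bool {
    msg.contains("SESSION_EXPIRED")
}

/// Assumed simplification of the older heuristics: any 4xx code found in the
/// message is treated as non-retryable, except 408 and 429.
fn legacy_is_non_retryable(msg: &str) -> bool {
    msg.split(|c: char| !c.is_ascii_digit())
        .filter_map(|tok| tok.parse::<u16>().ok())
        .any(|code| (400..500).contains(&code) && code != 408 && code != 429)
}

/// Session-expired check runs first; everything else falls through to the
/// existing heuristics, so transient failures keep their retry behavior.
fn is_non_retryable(err: &dyn Display) -> bool {
    let msg = err.to_string();
    if is_session_expired_message(&msg) {
        return true;
    }
    legacy_is_non_retryable(&msg)
}
```

With this ordering, a message carrying the marker is rejected before any status parsing happens, while messages such as "HTTP 503 service unavailable" or "HTTP 429 too many requests" remain retryable in the sketch.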

Added tests in src/openhuman/inference/provider/reliable_tests.rs:

  • classification test for SESSION_EXPIRED
  • end-to-end retry-loop test asserting that only one call occurs and that the aggregate error is marked non_retryable.
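The shape of the retry-loop assertion can be illustrated with a small synchronous sketch. The names `call_with_retries` and `Aggregate` are hypothetical and chosen for illustration; the actual tests use ReliableProvider with a mock provider under Tokio, which is not reproduced here.

```rust
/// Simplified classifier, assuming the `SESSION_EXPIRED` marker is the only
/// non-retryable condition of interest for this sketch.
fn is_non_retryable(msg: &str) -> bool {
    msg.contains("SESSION_EXPIRED")
}

/// Hypothetical aggregate error returned once the loop gives up.
#[derive(Debug)]
struct Aggregate {
    attempts_made: usize,
    non_retryable: bool,
    message: String,
}

/// Calls `call` up to `max_attempts` times, stopping at the first success or
/// at the first error classified as non-retryable.
fn call_with_retries<F>(max_attempts: usize, mut call: F) -> Result<String, Aggregate>
where
    F: FnMut() -> Result<String, String>,
{
    let mut failures = Vec::new();
    for attempt in 1..=max_attempts {
        match call() {
            Ok(text) => return Ok(text),
            Err(msg) => {
                let fatal = is_non_retryable(&msg);
                failures.push(format!("attempt {attempt}/{max_attempts}: {msg}"));
                if fatal {
                    // Fail fast: no further attempts for auth-state failures.
                    return Err(Aggregate {
                        attempts_made: failures.len(),
                        non_retryable: true,
                        message: failures.join("; "),
                    });
                }
            }
        }
    }
    Err(Aggregate {
        attempts_made: failures.len(),
        non_retryable: false,
        message: failures.join("; "),
    })
}
```

A test against this sketch would count invocations of the closure and check that a SESSION_EXPIRED error yields exactly one call and no "attempt 2/3" text in the aggregate, while a transient error still consumes the full budget.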

Tradeoff: this relies on canonical session-expired message patterns, but those are already centralized in observability and used across the core.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by .github/workflows/coverage.yml. Run pnpm test:coverage and pnpm test:rust locally; PRs below 80% on changed lines will not merge.
  • Coverage matrix updated — added/removed/renamed feature rows in docs/TEST-COVERAGE-MATRIX.md reflect this change (or N/A: behaviour-only change)
  • All affected feature IDs from the matrix are listed in the PR description under ## Related
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces (docs/RELEASE-MANUAL-SMOKE.md)
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Runtime/platform impact: Rust core provider reliability path (desktop app core behavior) only.
  • User impact: faster, clearer failure on expired session; fewer redundant retries before sign-in flow is needed.
  • Observability impact: reduced noise for this auth-state class; errors are treated as expected non-retryable flow.
  • Performance: avoids unnecessary retry delays/work for unrecoverable auth-state failures.
  • Security/compatibility: no new permissions, migrations, or external dependencies.

Related

Summary by CodeRabbit

  • Bug Fixes

    • "Session expired" errors are now treated as non-retryable, causing failed requests (both standard and streaming) to abort immediately instead of entering retry/poll loops, improving responsiveness and error clarity.
  • Tests

    • Added unit and streaming tests to verify immediate abort behavior and proper failure aggregation for session-expired errors.


Contributor

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e3f6c07-4902-418e-b2e6-fb273fde6e34

📥 Commits

Reviewing files that changed from the base of the PR and between 6e57893 and a0bdee0.

📒 Files selected for processing (2)
  • src/openhuman/inference/provider/reliable.rs
  • src/openhuman/inference/provider/reliable_tests.rs

Walkthrough

Adds an upfront "session expired" detection to ReliableProvider's non-retryable classifiers (sync and streaming), and adds unit and integration tests that assert ReliableProvider aborts retries/polling immediately on the SESSION_EXPIRED marker.

Changes

Session-expired boundary detection in ReliableProvider

| Layer / File(s) | Summary |
| --- | --- |
| Error classification: session-expired shortcut (src/openhuman/inference/provider/reliable.rs) | is_non_retryable converts the error to a string, checks for the session-expired marker and returns true immediately if present, then falls back to the existing reqwest::Error status and message-digit heuristics. is_stream_error_non_retryable mirrors this for StreamError::Provider(msg) by returning non-retryable when the marker is found. |
| Unit test: exact SESSION_EXPIRED assertion (src/openhuman/inference/provider/reliable_tests.rs, lines 219–221) | Updated the is_non_retryable test to assert against the exact SESSION_EXPIRED string used by the classifier. |
| Integration tests: abort-on-session-expired, non-streaming (src/openhuman/inference/provider/reliable_tests.rs, lines 259–295) | New Tokio test session_expired_aborts_retries verifies ReliableProvider fails fast on SESSION_EXPIRED: single provider call, aggregated error marked non_retryable, and no later-attempt text present. |
| Integration tests: abort-on-session-expired, streaming (src/openhuman/inference/provider/reliable_tests.rs, lines 297–419) | Added StreamingErrorMock and session_expired_aborts_retries_streaming to confirm streaming retry/polling is short-circuited on SESSION_EXPIRED (one stream created, one poll, terminal aggregated streaming error returned). |
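For the streaming side, the mirrored check might look like the sketch below. Only `StreamError::Provider(msg)` is named in the walkthrough; the `Io` variant, the helper body, and the omission of any other pre-existing streaming checks are assumptions made for illustration.

```rust
/// Assumed shape of the streaming error type. Only `Provider` is named in
/// the walkthrough; `Io` is a hypothetical placeholder variant.
#[derive(Debug)]
enum StreamError {
    Provider(String),
    Io(String),
}

/// Hypothetical stand-in for the observability helper.
fn is_session_expired_message(msg: &str) -> bool {
    msg.contains("SESSION_EXPIRED")
}

/// Provider errors carrying the session-expired marker end the stream
/// retry/poll loop immediately. Other pre-existing checks are left out here.
fn is_stream_error_non_retryable(err: &StreamError) -> bool {
    match err {
        StreamError::Provider(msg) => is_session_expired_message(msg),
        StreamError::Io(_) => false,
    }
}
```

Keeping the marker check in one shared helper is what gives the streaming and non-streaming paths the same fail-fast behavior for expired sessions.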

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • tinyhumansai/openhuman#1763: Introduced/implemented the is_session_expired_message observability helper used by this PR.
  • tinyhumansai/openhuman#1719: Uses the same session-expired detection; relates to how session-expired is classified across observability and retry logic.
  • tinyhumansai/openhuman#2022: Related changes to streaming retry/failover logic that interact with is_stream_error_non_retryable behavior.

Suggested reviewers

  • senamakel

Poem

🐰 A session expired, so we stop the chase,
One swift fail, no retries to trace.
The provider sighs, the loop is through,
One clear error — honest and true.
Hoppity-hop, the rabbit says "phew!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(reliable): fail fast on SESSION_EXPIRED in provider retry loop' accurately describes the main change: treating SESSION_EXPIRED errors as non-retryable to exit the retry loop immediately. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@YellowSnnowmann YellowSnnowmann marked this pull request as ready for review May 19, 2026 11:20
@YellowSnnowmann YellowSnnowmann requested a review from a team May 19, 2026 11:20
@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 19, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/openhuman/inference/provider/reliable_tests.rs (1)

259-295: ⚡ Quick win

Add a streaming fail-fast regression test for SESSION_EXPIRED.

This new test is good for simple_chat, but the streaming retry classifier is separate. A focused streaming test would lock in the expected fail-fast auth behavior across both execution paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/inference/provider/reliable_tests.rs` around lines 259 - 295,
Add a new tokio::test (e.g., session_expired_aborts_retries_streaming) that
mirrors session_expired_aborts_retries but exercises the provider's streaming
path: construct a ReliableProvider with the same MockProvider (calls Arc,
fail_until_attempt = usize::MAX, error containing "SESSION_EXPIRED"), invoke the
streaming API on ReliableProvider (the streaming equivalent of simple_chat),
await the error and assert that only one call was made to MockProvider, the
error is classified as non_retryable, and the aggregate message does not include
further attempts; reuse the same assertions as session_expired_aborts_retries
but target the streaming method to lock in fail-fast auth behavior for the
streaming retry classifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/inference/provider/reliable.rs`:
- Around line 16-22: The streaming retry logic misses the SESSION_EXPIRED check:
update the streaming-specific non-retryable path by adding the same
session-expired classification used in is_non_retryable to
is_stream_error_non_retryable (or the function handling streaming retry
decisions) so that
crate::core::observability::is_session_expired_message(&err.to_string()) returns
true and causes an immediate non-retryable result for streaming requests; locate
the streaming retry branch that currently calls is_stream_error_non_retryable
and add the session-expired check there (or delegate to is_non_retryable) to
ensure parity with non-streaming behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e5299b6a-46f0-4818-9ade-8e13589e99e6

📥 Commits

Reviewing files that changed from the base of the PR and between 4384cd1 and 6e57893.

📒 Files selected for processing (2)
  • src/openhuman/inference/provider/reliable.rs
  • src/openhuman/inference/provider/reliable_tests.rs

Comment thread src/openhuman/inference/provider/reliable.rs
@senamakel senamakel merged commit d6a99fc into tinyhumansai:main May 19, 2026
27 checks passed
AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026