fix(channels): surface structured rate-limit metadata on chat_error (#2606)#2652
fix(channels): surface structured rate-limit metadata on chat_error (#2606)#2652CodeGhost21 wants to merge 2 commits into
Conversation
…inyhumansai#2606) Issue tinyhumansai#2606 follows up on PR tinyhumansai#2371: that PR landed retry-after wording inside the user-facing message text, but the structured data was discarded at the channel layer. The wire payload still only carried `message: String` and `error_type: String`, so the frontend could not render a real countdown, a retry button, a provider/source label, or a fallback CTA without regexing the message. This change moves the metadata onto the wire as additive optional fields on `WebChannelEvent`, populated by a new `ClassifiedError` struct returned from `classify_inference_error`: - `error_source` — "provider" | "openhuman_budget" | "agent_loop" | "openhuman_billing" | "transport" | "config" - `error_retryable` — false for non-retryable business 429s ("plan does not include", insufficient balance, Z.AI 1311/1113), auth, model, context, OpenHuman billing - `error_retry_after_ms` — verbatim from Retry-After / retry_after - `error_provider` — extracted from "<provider> API error (...)" - `error_fallback_available` — Some(false) when reliable.rs has already exhausted the model_fallbacks chain ("All providers/models failed") All fields use `skip_serializing_if = "Option::is_none"` so older clients that only read `message` / `error_type` keep working unchanged. Coverage: - 11 new unit tests on the classifier (each branch, each new field, non-retryable 429 business-quota cases, provider extraction, fallback-exhausted aggregate, JSON omit-when-None contract). - 2 new wire-shape tests that drive `start_chat` end-to-end through the WebChannelEvent bus and assert the structured fields land on both the struct and the serialized JSON. - Added a serial test lock so the three tests that share `TEST_FORCED_RUN_CHAT_TASK_ERROR` no longer race under `cargo test`'s default parallelism.
📝 WalkthroughWalkthroughAdds five optional error metadata fields to WebChannelEvent, introduces a typed ClassifiedError, refactors classify_inference_error to return it (including provider, retryability, retry-after, fallback info), and updates event emissions and tests to populate or initialize the new fields. ChangesStructured Error Metadata for Rate Limits
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/channels/providers/web_tests.rs (1)
29-35:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid fire-and-forget forced-error reset in
Drop(can clobber the next test)
TestForcedRunChatTaskErrorGuard::dropschedules async cleanup viatokio::spawn(set_test_forced_run_chat_task_error(None).await). Because locals drop in reverse order, the guard can be dropped (and the spawned reset started) beforeFORCED_ERROR_TEST_LOCKis released, letting the next test setSome(...)and then having the earlier spawned task overwrite it back toNonewhilerun_chat_taskhasn’t consumed it yet (TEST_FORCED_RUN_CHAT_TASK_ERRORis read viaslot.take()).
Remove the asyncDrop/spawn cleanup and instead perform an awaited reset inside each serialized test (applies to the three tests that currently createTestForcedRunChatTaskErrorGuard).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web_tests.rs` around lines 29 - 35, The Drop implementation for TestForcedRunChatTaskErrorGuard must not spawn an async reset because that fire-and-forget reset (tokio::spawn calling set_test_forced_run_chat_task_error(None)) can race with FORCED_ERROR_TEST_LOCK and clobber a subsequent test's forced error; remove the async Drop::drop implementation entirely and instead update each test that constructs TestForcedRunChatTaskErrorGuard to explicitly await set_test_forced_run_chat_task_error(None) at the end of the test while still holding the FORCED_ERROR_TEST_LOCK (so run_chat_task’s TEST_FORCED_RUN_CHAT_TASK_ERROR slot.take() is not interfered with). Ensure references to set_test_forced_run_chat_task_error, TestForcedRunChatTaskErrorGuard, FORCED_ERROR_TEST_LOCK, run_chat_task, and TEST_FORCED_RUN_CHAT_TASK_ERROR are used to find and modify the guard and the three tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 496-508: The code branch that builds a ClassifiedError for
402/payment messages incorrectly sets source to the hardcoded string
"openhuman_billing", which misattributes upstream provider billing errors; in
the ClassifiedError constructor (the block creating ClassifiedError with
error_type "budget_exhausted" and calling with_provider_detail), replace the
hardcoded source with the actual provider identifier (use the existing provider
variable) or another provider-specific source instead of "openhuman_billing" so
upstream 402/payment errors preserve the provider as the error source.
- Around line 449-469: The current 429 branch sets retryable =
!is_non_retryable_rate_limit_text(&lower) but always builds a
transient/retry-focused summary; change it so you first compute a boolean (e.g.,
let non_retryable = is_non_retryable_rate_limit_text(&lower)) and then pick the
message accordingly: if non_retryable produce a non-retry message that directs
the user to billing/settings/plan (no "retry in this thread" hint), otherwise
keep the transient retry summary that uses retry_after_hint(retry_secs); pass
the chosen summary into with_provider_detail(...) and keep retryable set to
!non_retryable and retry_after_ms as-is. Ensure you reference
is_non_retryable_rate_limit_text, retry_after_hint, retry_secs, and
ClassifiedError::message/retryable when editing.
---
Outside diff comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 29-35: The Drop implementation for TestForcedRunChatTaskErrorGuard
must not spawn an async reset because that fire-and-forget reset (tokio::spawn
calling set_test_forced_run_chat_task_error(None)) can race with
FORCED_ERROR_TEST_LOCK and clobber a subsequent test's forced error; remove the
async Drop::drop implementation entirely and instead update each test that
constructs TestForcedRunChatTaskErrorGuard to explicitly await
set_test_forced_run_chat_task_error(None) at the end of the test while still
holding the FORCED_ERROR_TEST_LOCK (so run_chat_task’s
TEST_FORCED_RUN_CHAT_TASK_ERROR slot.take() is not interfered with). Ensure
references to set_test_forced_run_chat_task_error,
TestForcedRunChatTaskErrorGuard, FORCED_ERROR_TEST_LOCK, run_chat_task, and
TEST_FORCED_RUN_CHAT_TASK_ERROR are used to find and modify the guard and the
three tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5e36b968-60fa-4bdc-a35e-c54ea354d53b
📒 Files selected for processing (5)
src/core/socketio.rssrc/openhuman/channels/proactive.rssrc/openhuman/channels/providers/presentation.rssrc/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rs
… copy (tinyhumansai#2606) Three follow-ups from CodeRabbit on PR tinyhumansai#2652: 1. The 402 / payment-required arm hardcoded source="openhuman_billing", misattributing upstream provider 402s (openrouter, openai, ...) to OpenHuman's own credit system. Switch the source based on the extracted provider envelope: when the leading envelope is OpenHuman (or absent), keep "openhuman_billing"; for any other identified upstream, emit source="provider" so the FE doesn't point the user at OpenHuman credits when their provider plan / balance is the issue. 2. The 429 arm always produced the transient "you can retry in this thread" copy regardless of retryable. Compute non_retryable first and pick the message accordingly: non-retryable cases (plan limit, insufficient balance, Z.AI 1311/1113) now route the user to billing / settings / plan instead of suggesting same-thread retry. retry_after_ms and retryable wire shape unchanged. 3. TestForcedRunChatTaskErrorGuard::drop spawned a fire-and-forget tokio task to reset TEST_FORCED_RUN_CHAT_TASK_ERROR. The spawned task could race FORCED_ERROR_TEST_LOCK and clobber a subsequent test's forced error. Remove the guard struct entirely and add an inline `set_test_forced_run_chat_task_error(None).await` at the end of each of the three tests, while still holding the lock. Adds regression tests for both source branches and both message branches.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/channels/providers/web.rs (1)
508-525:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve
fallback_availableon 402 classification.Line 525 resets
fallback_availabletoNone, so an aggregate likeAll providers/models failed ... 402 Payment Requireddrops the already-computedSome(false)from Lines 401-405. That loses the wire signalsrc/core/socketio.rsdocuments for exhausted fallback chains.Suggested fix
ClassifiedError { error_type: "budget_exhausted", message: with_provider_detail("Insufficient credits. Please top up to continue.", err), source, retryable: false, retry_after_ms: None, provider, - fallback_available: None, + fallback_available, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web.rs` around lines 508 - 525, The 402 classification is overwriting the previously-computed fallback_available with None; update the ClassifiedError construction in this 402 branch to preserve the existing fallback_available value (do not set fallback_available: None) so the earlier computed Some(false)/Some(true) is retained; locate the ClassifiedError creation in src/openhuman/channels/providers/web.rs (the block that sets error_type "budget_exhausted" and message via with_provider_detail) and remove or replace the literal None so the existing fallback_available variable is passed through unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 508-525: The 402 classification is overwriting the
previously-computed fallback_available with None; update the ClassifiedError
construction in this 402 branch to preserve the existing fallback_available
value (do not set fallback_available: None) so the earlier computed
Some(false)/Some(true) is retained; locate the ClassifiedError creation in
src/openhuman/channels/providers/web.rs (the block that sets error_type
"budget_exhausted" and message via with_provider_detail) and remove or replace
the literal None so the existing fallback_available variable is passed through
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d8257f8d-8cdf-4cf0-9a3b-0d45e4b8afd7
📒 Files selected for processing (2)
src/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rs
Summary
chat_errorso the frontend can render a real countdown, retry button, and provider/source label without regexing the message.WebChannelEvent:error_source,error_retryable,error_retry_after_ms,error_provider,error_fallback_available. Older FE clients that read onlymessage/error_typekeep working unchanged (skip_serializing_if = "Option::is_none").fallback_available: falseoncereliable.rs::format_failure_aggregatehas exhausted themodel_fallbackschain so the FE doesn't promise a fallback that doesn't exist.Problem
Issue #2606 follows up on PR #2371. That PR landed retry-after wording inside the user-facing message text — but the structured data was thrown away at the channel-classifier boundary.
WebChannelEventstill only carriedmessage: Stringanderror_type: String, so the frontend could not:Users are still reporting confusing rate-limit copy because of this gap (see issue #2606's acceptance criteria).
Solution
New
ClassifiedErrorstruct returned fromclassify_inference_errorcarries the typed metadata, and the chat-error publish path populates the newWebChannelEventfields from it.error_sourceerror_retryableprovidertrueretry_after_msparsed from bodyRetry-After:/retry_after:providerfalseopenhuman_budgettrueagent_looptrueopenhuman_billingfalsetransporttrueconfigfalseProvider name extracted best-effort from the
\"<provider> API error (...)\"prefix thatinference::provider::ops::api_errorformats. Fallback-exhausted signal extracted from the "All providers/models failed" aggregate thatreliable.rs::format_failure_aggregateemits.Submission Checklist
None-omit contract test.## Related— N/A: no matrix row touched.Closes #NNN— see## Related.Impact
channels::providers::web::classify_inference_errorand thechat_errorpublish instart_chat) plus theWebChannelEventwire struct.chat_errorSSE/Socket.IO frame and can be picked up by the FE in a subsequent PR to render a real countdown/retry/fallback UI.messagealready carries. Provider name extraction is allow-listed to ASCII alphanumeric/_/-.error_typetokens are unchanged for existing consumers. New fields are allOption<…>withskip_serializing_if = \"Option::is_none\", so older FE clients see exactly the same JSON shape they do today.Live verification
Beyond the unit + wire-shape tests, ran a full end-to-end live test:
HTTP/1.1 429 Too Many Requests,Retry-After: 30, body{\"error\":{\"message\":\"...\",\"code\":\"rate_limit_exceeded\"},\"retry_after\":30}on/v1/chat/completions.openhuman-core run --port 7892against an isolated workspace with that endpoint configured as acustom_openaiprovider.openhuman.channel_web_chat, listen to/eventsviacurl -N.The chat_error SSE frame received:
This exercises the full real path:
reliable.rs::is_rate_limited→ops::api_errorformatting →classify_inference_error→WebChannelEventserialization → SSE.Known follow-ups (out of scope)
inference::provider::ops::api_errorcurrently preserves only the response BODY, not theRetry-AfterHTTP header. When the upstream sends only the header (no bodyretry_after),retry_after_mswill beNone. Follow-up should thread the header into the error chain.Related
Retry-Afterheader fromops::api_error; FE consumer for the new structured fields.AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/2606-structured-rate-limit-metadata(branched fromorigin/mainafter fresh fetch)Validation Run
pnpm --filter openhuman-app format:check— N/A: no frontend changes.pnpm typecheck— fails on 4 pre-existing errors inapp/src/components/settings/panels/devices/PairPhoneModal.tsx,app/src/lib/tunnel/crypto.ts,app/src/pages/ios/PairScreen.tsx(missing iOS-only depsqrcode.react,@noble/ciphers,@tauri-apps/plugin-barcode-scanner). These exist onorigin/mainand are unrelated to this PR's Rust-only changes. Pushed with--no-verifyper CLAUDE.md guidance.cargo test --lib openhuman::channels::providers::web::tests→ 52 passed, 0 failed (39 pre-existing + 13 new).cargo test --lib openhuman::channels::providers→ 834 passed, 0 failed under default parallelism (added a serial test lock for the three tests sharingTEST_FORCED_RUN_CHAT_TASK_ERROR).cargo fmt --manifest-path Cargo.tomlapplied;cargo check --lib --bin openhuman-coreclean.cargo check --manifest-path app/src-tauri/Cargo.tomlclean.Validation Blocked
command:N/A — fullcargo test --libhad 42 pre-existing failures inopenhuman::memory::tree_global::*andopenhuman::memory_tree::*modules, confirmed reproducible onorigin/mainwith this branch stashed.error:pre-existing memory_tree test failures unrelated to channels work.impact:none on this PR — those modules are not touched.Behavior Changes
chat_errorWebChannelEvent now carries structurederror_source,error_retryable,error_retry_after_ms,error_provider,error_fallback_availablefields.Parity Contract
error_typetokens andmessagetext are unchanged for every existing branch. New fields are additiveOption<…>withskip_serializing_if. Three existing classifier tests still pass unchanged (only the destructuring shape was migrated from tuple to struct)._distinguishes_action_budget_from_provider_429and_max_iterations_gets_dedicated_branch).Duplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Bug Fixes
Tests