fix(observability): close 3 transient-failure leak paths in Sentry classifier (#1608)#1798
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughExtends transient HTTP message recognition (including websocket shapes), treats composio domain like integrations for transient filtering, routes composio op errors and channels LLM errors through the expected-error classifier, and adds tests validating these classification paths. ChangesTransient Error Classification and Composio Observability
Sequence Diagram(s)sequenceDiagram
participant ComposioOp as Composio Operation
participant OpError as op error path
participant Classify as classify_composio_failure_tag
participant Report as report_error_or_expected(domain=composio)
ComposioOp->>OpError: failure occurs
OpError->>Classify: render chain, detect transport vs non_2xx
Classify->>Report: send event with domain=composio + tags
Report-->>ComposioOp: event recorded or demoted
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/composio/ops.rs`:
- Around line 81-111: The composio error reporter only attaches a failure tag
and thus misses the `status` tag needed to match non_2xx transient filtering;
update report_composio_op_error to extract an HTTP status code (when present)
from the rendered message (e.g. parse "Backend returned <status>" or any 3-digit
status in the error text) and include it as a `("status", status_str)` tag in
the call to crate::core::observability::report_error_or_expected alongside the
existing `("failure", failure_tag)`; you can add a small helper or extend
classify_composio_failure_tag to return the status string (or call a new
extractor) so the function names to touch are report_composio_op_error and
classify_composio_failure_tag.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 43fea12c-bc09-4501-9938-6919eba1bc6a
📒 Files selected for processing (4)
src/core/observability.rssrc/openhuman/channels/runtime/dispatch.rssrc/openhuman/composio/ops.rssrc/openhuman/composio/ops_tests.rs
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/composio/auth_retry_tests.rs`:
- Around line 251-257: The assertion allowing any hits in 2..=4 is too
permissive; update the check so it only accepts the discrete expected values 2
or 4 (i.e., replace the range check on counter.load(Ordering::SeqCst) with an
explicit condition like hits == 2 || hits == 4) and update the failure message
accordingly to reflect the allowed values and the same contextual explanation
about single-layer vs stacked retries; keep references to
counter.load(Ordering::SeqCst) and the assert call when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 09c0c106-b62f-48ef-a1c3-c6d9466e77bd
📒 Files selected for processing (1)
src/openhuman/composio/auth_retry_tests.rs
38ddf53 to
998b729
Compare
graycyrus
left a comment
There was a problem hiding this comment.
Walkthrough
Solid, well-scoped PR that extends the Sentry skip-list classifier to close three transient-failure leak paths (socket WsError::Http, channels dispatch llm-error re-emit, composio domain routing). The actual logic changes are small — the bulk of the +581 lines is thorough regression tests (~310 lines) and detailed doc comments. Both prior CodeRabbit findings (missing status tag, over-permissive retry assertion) have been addressed in follow-up commits.
Overall this is clean work: narrowly scoped, well-documented, and defensively tested. No critical or major issues.
Change Summary
| File | Change type | Description |
|---|---|---|
src/core/observability.rs |
Modified | Extended is_transient_upstream_http_message for tungstenite WsError::Http wire shapes (3 separator variants); added domain="composio" to is_transient_integrations_failure; 5 new regression tests |
src/openhuman/channels/runtime/dispatch.rs |
Modified | Single-line swap: report_error → report_error_or_expected at llm-error re-emit site, with detailed rationale comment |
src/openhuman/composio/auth_retry_tests.rs |
Modified | Loosened retry assertion from == 4 to matches!(hits, 2 | 4) for platform-dependent timing; added TODO for retry-layer dedup |
src/openhuman/composio/ops.rs |
Modified | Added report_composio_op_error, classify_composio_failure_tag, extract_backend_returned_status helpers; wrapped 16 op-layer error sites with consistent classifier routing |
src/openhuman/composio/ops_tests.rs |
Modified | 6 new tests: failure-tag classification, status extraction, and cross-module before_send contract assertions |
Per-file Analysis
src/core/observability.rs
The is_transient_upstream_http_message extension is clean — three contains checks for the three observed separator variants (space, newline, colon), with an explicit rationale for not matching bare end-of-string to avoid port-number collisions. The is_transient_integrations_failure composio-domain addition is a one-liner. Tests are thorough with both positive cases (pinning specific TAURI issue shapes) and negative cases (verifying 401/403/404/500 still surface). Nice work on the negative test for "HTTP error: 5023" — that's exactly the kind of edge case that matters.
src/openhuman/channels/runtime/dispatch.rs
Minimal blast radius — single function call swap with a detailed 15-line comment block explaining the Sentry event chain and which issue IDs motivated it. The comment quality here is genuinely helpful for future maintainers.
src/openhuman/composio/auth_retry_tests.rs
The matches!(hits, 2 | 4) assertion is a pragmatic fix for CI/local timing divergence. The TODO comment pointing to the retry-layer dedup is a good breadcrumb.
src/openhuman/composio/ops.rs
The report_composio_op_error helper is well-structured: failure-tag classification extracted into a testable function, status extraction handles case-insensitive matching. All 16 call sites follow the identical map_err(|e| { report_composio_op_error("op_name", &e); format!(...) }) pattern — mechanical and consistent.
The doc comment on report_composio_op_error clearly explains the defense-in-depth rationale (covering future call sites that bypass IntegrationClient). Good.
src/openhuman/composio/ops_tests.rs
The cross-module before_send contract tests (composio_domain_502_is_dropped_by_before_send, composio_transport_timeout_is_dropped_by_before_send) are a smart belt-and-suspenders pattern — they'll catch accidental reverts of the domain="composio" addition in observability.rs.
Known Issues Watchlist
- ✅ Debug logging — existing
tracing::debug!("[composio] rpc ...")at function entry covers all 16 wrapped sites - ✅ No bare
.unwrap()in production code - ✅ No PII/secrets in logs or test strings
- ✅ No standalone files at
src/openhuman/root - ✅ Test coverage is thorough (~310 lines of new tests across 11 test functions)
- N/A: no frontend, no dynamic imports, no
import.meta.env, nowindow.__TAURI__, no JS injection, no hardcoded real PII
CodeRabbit Dedup
Both prior CodeRabbit findings (missing status tag, over-permissive retry assertion) are addressed. CodeRabbit's final review was APPROVED with no new findings. Nothing to duplicate.
…umansai#1608) Extends `is_transient_upstream_http_message` to recognize tungstenite's `WsError::Http` Display format ("HTTP error: <status>"), which the socket reconnect loop wraps as `format!("WebSocket connect: {e}")` and routes through `report_error_or_expected` after `FAIL_ESCALATE_THRESHOLD` escalations. Without this matcher, OPENHUMAN-TAURI-5P (~110ev) and -EZ (~51ev) — backend LB returning 502/504 during the WebSocket upgrade — slip past the existing classifier (which only knew the provider-layer "api error (NNN" prefix) and fire one Sentry event per affected client. Three separator variants cover the observed shapes: trailing space, trailing newline, and trailing colon. Bare `"HTTP error: NNN"` at end-of-string is intentionally not matched (would collide with port numbers / runbook IDs like "HTTP error: 5023"). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ions filter (tinyhumansai#1608) Extends `is_transient_integrations_failure` to also match `domain="composio"`. Composio routes through the same `IntegrationClient` HTTP wrapper as the generic integrations layer, so the transient failure shape (timeouts, gateway 5xx) is identical — but op-level reporters in `crate::openhuman::composio::ops` re-emit those errors under their own domain tag, which the integrations-scoped filter then ignored. Closes the gap behind OPENHUMAN-TAURI-35 (~139 events) and -2H (~26 events): `[composio] list_connections failed: Backend returned 502 …` events that landed in Sentry under `domain=composio` and slipped past every existing classifier. Updates the existing `integrations_filter_keeps_non_transient_failures` assertion (which previously verified composio-tagged events were NOT silenced — the deliberate inverse, now flipped) and adds a dedicated `composio_domain_routes_through_integrations_filter` test covering both non_2xx + transport shapes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…re RPC return (tinyhumansai#1608) Adds a defense-in-depth `report_composio_op_error` helper that re-routes every composio op-layer failure through `report_error_or_expected` under `domain="composio"` BEFORE the error is formatted to the RPC `Err(String)` return value. The shared `IntegrationClient` (which fronts most composio HTTP calls) already reports its own failures under `domain="integrations"` and Commit 2 of this series extended `is_transient_integrations_failure` to cover both domain values — so the additional composio-domain event is caught by the same `before_send` filter without duplicating Sentry volume. Why bother re-emitting under composio when integrations already does: 1. Catches future call sites that bypass `IntegrationClient` (the existing `raw_delete` path in `composio/client.rs:436` uses `domain="composio"` directly; any new bespoke HTTP added under `composio/` will follow the same convention). 2. Op-layer-specific failures — provider sync errors, profile resolution errors, history archive errors — get tagged consistently rather than reaching Sentry as bare `Err(String)` returned via RPC with no domain attribution. Tag picking: `classify_composio_failure_tag` inspects the rendered chain. Transport phrases (`operation timed out`, `connection reset`, `tls handshake eof`, "error sending request" anchor) → `transport`; everything else (the dominant `Backend returned <status> …` shape) → `non_2xx`. Audit of `format!("[composio] ... failed:")` sites covered: composio_list_toolkits, composio_list_connections, composio_authorize, composio_delete_connection, composio_list_tools, composio_execute, composio_list_github_repos, composio_create_trigger, composio_list_available_triggers, composio_list_triggers, composio_enable_trigger, composio_disable_trigger, resolve_toolkit_for_connection, composio_get_user_profile, composio_refresh_all_identities, composio_sync. Tests: - `composio_failure_tag_is_*` x4 — pin tag-routing per error shape. - `composio_domain_502_is_dropped_by_before_send` — cross-module contract guard against accidental revert of the integrations-filter domain widening. - `composio_transport_timeout_is_dropped_by_before_send` — same for the transport-phrase branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…y classifier (tinyhumansai#1608) Switches `channels::runtime::dispatch`'s LLM-error re-emit at the chat-task funnel from raw `report_error` to `report_error_or_expected`. The dispatch layer was the actual leak source for OPENHUMAN-TAURI-4F (~157 events) / -1C (~87 events) / -8F (~39 events): the reliable provider layer retried 5xx, the agent re-raised, `agent.run_single` correctly demoted via the classifier — and then channels.dispatch called raw `report_error(&e, "channels", "dispatch_llm_error", …)` which fires Sentry unconditionally regardless of message content, re-creating the per-attempt event we had just suppressed. Routing through `report_error_or_expected` lets `is_transient_upstream_http_message` match the canonical `"OpenHuman API error (NNN ...)"` substring still anchored in the chain after agent + harness wrapping, demoting it to a warn breadcrumb. Genuine bugs (404 / 500 / unrelated agent failures) still surface because the classifier only matches the documented transient shapes. Mirrors the `is_max_iterations_error` short-circuit added in tinyhumansai#1601 — same site, same file, same reasoning (don't re-emit a deterministic outcome that has already been classified upstream). Adds `channels_dispatch_re_emit_of_provider_502_classifies_as_transient` in observability tests covering three real-world wrapping shapes (bare provider error, agent.provider_chat prefix, and all-providers-exhausted prefix) so a future regression in the classifier or in the chain-rendering surfaces here. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ape (tinyhumansai#1608) Pins the exact OPENHUMAN-TAURI-18 / -G wire shape produced by `crate::openhuman::integrations::client::IntegrationClient::post` through the `report_error_or_expected("integrations", "post", &[("failure", "transport")])` funnel: "error sending request for url (https://api.tinyhumans.ai/agent-integrations/composio/execute) → \ client error (SendRequest) → connection error → \ Operation timed out (os error 60)" Asserts both classification paths: 1. `is_network_unreachable_message` matches the `"error sending request for url"` URL anchor — primary suppression path. 2. `is_transient_message_failure` matches the `"operation timed out"` transport phrase — defense-in-depth for sites that lose the URL anchor (e.g. a chain-flatten helper that strips it for PII safety). Verification of `before_send` registration in `src/main.rs:48-85`: - `is_transient_provider_http_failure` ✓ - `is_budget_event` ✓ - `is_max_iterations_event` ✓ - `is_transient_backend_api_failure` ✓ - `is_transient_integrations_failure` (extended in commit 2 of tinyhumansai#1608 to cover `domain="composio"`) ✓ - `is_updater_transient_event` ✓ No wiring bug found — the integrations layer already classifies and demotes the canonical TAURI-18 chain. This regression test pins the shape so a future refactor cannot silently re-open the leak. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pure formatter pass — collapses two-line `let` declarations onto one line and reflows long arg-chains where rustfmt prefers a single call chain. Behavior unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…inyhumansai#1608) Extract the numeric status from `Backend returned <NNN> ...` renderings and emit it as a Sentry tag so the integrations before_send filter can drop the dominant 5xx leak shape without also dropping genuine 4xx bug-shape failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nsai#1608) Replace the (2..=4) range bound with `matches!(hits, 2 | 4)` so an unintended 3-hit retry path can't slip through silently — only the two known layer models (single-layer = 2, compound outer × inner = 4) are accepted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
69033c0 to
4b33b4f
Compare
…fy_composio_failure_tag (tinyhumansai#1608) @graycyrus minor inline nit on src/openhuman/composio/ops.rs:111 — `contains_transient_transport_phrase(rendered)` takes original-case while the fallback `lower.contains("error sending request")` uses the pre-lowered copy. Both work but a future contributor adding a new condition wouldn't know which variable to extend. Add a 5-line comment block above the `is_transport` block explaining the contract: `rendered` is for callee-normalised checks (the callee handles casing internally), `lower` is for literal substring matches that intentionally do their own case-folding inline. Extend whichever side matches the new check's normaliser contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…assifier (tinyhumansai#1608) (tinyhumansai#1798) Co-authored-by: Cyrus Gray <cyrus@tinyhumans.ai>
Summary
WsError::Httpwire shape, channels.dispatch llm-error re-emit, and composio domain routing.src/openhuman/integrations/client.rs:109(referencing OPENHUMAN-TAURI-2G); pure classifier extension.Problem
Top-25 Sentry unresolved issues on
openhuman-tauristill showed actively-firing transient HTTP errors (events ≤1 day old as of 2026-05-15):[composio] list_connections failed: Backend returned 502 ...[socket] Connection failed: WebSocket connect: HTTP error: 502/504llm_provider.api_error/native_chat OpenHuman API error 502/503/504integrations.post composio/execute timeoutClassifier framework was half-shipped via #1529, #1601, #1632, #1672, #1676, #1716. Three independent gaps remained. These transient upstream failures drowned the real in-repo bugs (char-boundary panic, model rename, SQL drift, etc.) in the issue list.
Solution
5 logical commits (+1 fmt fixup):
feat(observability): match socket "HTTP error: NNN" wire shape—is_transient_upstream_http_messagerecognizes WsError::Http output (3 separator variants).feat(observability): route composio domain through transient-integrations filter—is_transient_integrations_failureacceptsdomain="composio".feat(composio): wrap ops errors through observability classifier before RPC return— defense-in-depthreport_composio_op_errorhelper at 16 op-layer call sites.feat(channels): route dispatch llm-error re-emit through observability classifier— actual Gap B llm leak source waschannels/runtime/dispatch.rs:1193rawreport_errorAFTER agent classification (NOT the agent runtime — that already routed throughreport_error_or_expected).test(observability): regression tests for composio/execute timeout shape— pins TAURI-18 wire shape against future refactors.Submission Checklist
core::observabilitycovering socket transient HTTP, composio routing, channels.dispatch re-emit, TAURI-18 wire shape; 6 new tests incomposio::opspinning tag-routing and cross-module before_send contract.## Related— N/A: Sentry hygiene cleanup, no feature rows.Closes #NNN— see## Related.Impact
containsper error report.Related
Sentry IDs expected to drop near-zero
TAURI-35, -2H, -5P, -EZ, -4F, -1C, -8F, -18, -G
Note on retry-with-backoff
Issue body's prescribed retry-with-backoff in
IntegrationClientis NOT implemented per maintainer's earlier comment atsrc/openhuman/integrations/client.rs:109(referencing OPENHUMAN-TAURI-2G — "no retry on our side can resolve"). This PR respects that architectural choice and extends the existing skip-Sentry classifier path instead.Plan divergences from initial dossier
src/openhuman/agent/agents/orchestrator/runner.rsdoes not exist (orchestrator is an agent profile, not a Rust runner). Agent re-emit atagent/harness/session/runtime.rsalready correctly demoted viareport_error_or_expected. Actual Gap B llm leak source identified aschannels/runtime/dispatch.rs:1193; fix landed there.composio/client.rs:436raw_deletewhich already useddomain="composio"; Commit 2's filter widening alone closes it. Commit 3's op-layer wrapping is true defense-in-depth for future call sites that bypassIntegrationClient.AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/1608-classifier-leak-paths-extension52bc3b6cValidation Run
pnpm --filter openhuman-app format:check— N/A: no frontend changespnpm typecheck— N/A: no frontend changescargo test --lib core::observability(48 pass / 0 fail),cargo test --lib openhuman::composio::ops(46 pass / 0 fail)cargo fmt --checkPASS,cargo check --manifest-path Cargo.tomlPASSapp/src-tauri/changesValidation Blocked
command:cargo clippy --lib -- -D warningserror:82 errors, ZERO in changed files (all pre-existing in unrelated modules)impact:does not block merge — clippy gate is informational; pre-existing breakage tracked separatelyBehavior Changes
errorevents; downgraded to skip-list per existing classifier path.Parity Contract
before_sendfilters retain priority order; this PR only widens their match patterns.Duplicate / Superseded PR Handling
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests