fix(observability): demote loopback sidecar-down noise to expected (#R5 #R6) #2063
…#R6)
Pairs the new variant with its `report_expected_message` arm in a single
commit because Rust's exhaustive `match` requires the arm to land
alongside the enum addition. Matcher + ladder wiring follow in the next
commit; tests in the one after.
The arm demotes at `tracing::debug!` (lower than the `warn!` used for
`NetworkUnreachable`) and logs metadata-only fields — no raw `{message}`
in the structured payload or format string. Loopback URLs carry no PII
and the body adds no remediation signal, matching the tinyhumansai#1719 review
guidance to prefer tags over body for noise demotions.
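A minimal sketch of why the variant and its arm have to land together, assuming simplified stand-ins for the real types: `Level`, `ExpectedEvent`, the `"network_unreachable"` tag, and the function signature are all hypothetical, and the real arm logs through `tracing` rather than returning a value.

```rust
// Hypothetical, simplified stand-ins for the types in src/core/observability.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExpectedErrorKind {
    NetworkUnreachable,
    LoopbackUnavailable, // the variant this commit adds
}

// Stand-in for the tracing level an arm would log at (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Debug,
    Warn,
}

// Metadata-only record: it has no field for the raw error message.
#[derive(Debug, PartialEq, Eq)]
struct ExpectedEvent<'a> {
    level: Level,
    domain: &'a str,
    operation: &'a str,
    kind: &'static str,
}

fn report_expected_message<'a>(
    kind: ExpectedErrorKind,
    domain: &'a str,
    operation: &'a str,
    _message: &str, // accepted but never copied into the event in this sketch
) -> ExpectedEvent<'a> {
    // No wildcard arm: adding a variant to the enum without adding an arm
    // here is a compile error, which is why both changes share one commit.
    match kind {
        ExpectedErrorKind::NetworkUnreachable => ExpectedEvent {
            level: Level::Warn,
            domain,
            operation,
            kind: "network_unreachable", // tag name assumed for illustration
        },
        ExpectedErrorKind::LoopbackUnavailable => ExpectedEvent {
            level: Level::Debug,
            domain,
            operation,
            kind: "loopback_unavailable",
        },
    }
}
```

Deleting the `LoopbackUnavailable` arm from this sketch makes the `match` non-exhaustive and the compiler rejects it, so the enum addition cannot be split from the arm.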
…5 #R6)
Adds `is_loopback_unavailable(lower)`, a conjunctive matcher requiring both a `127.0.0.1:<port>` / `localhost:<port>` host substring and a platform-specific `connection refused (os error N)` errno (61 = macOS, 111 = Linux, 10061 = Windows WSAECONNREFUSED).
Wired into the `expected_error_kind` ladder *before* `is_network_unreachable_message` so loopback boot-window races win precedence over the broader user-environment bucket. Mirrors the `ProviderUserState`-before-`BackendUserError` precedence pattern from PR tinyhumansai#1795.
Without precedence here both R5 and R6 fall through to `NetworkUnreachable` and conflate an internal lifecycle race against the embedded core's startup (`127.0.0.1:18474` not yet bound) with real user-environment problems (VPN drop, captive portal, ISP block). Keeping the buckets distinct preserves Sentry's "what class of transport failure is spiking?" signal.
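The matcher and its place in the ladder can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/core/observability.rs`: the two-variant enum, the body of `is_network_unreachable_message`, and the `Option` return type are simplified stand-ins.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExpectedErrorKind {
    LoopbackUnavailable,
    NetworkUnreachable,
}

/// Conjunctive matcher: a loopback host with a port AND an explicit
/// connection-refused errno. Expects an already-lowercased body.
fn is_loopback_unavailable(lower: &str) -> bool {
    let loopback_host = lower.contains("127.0.0.1:") || lower.contains("localhost:");
    let refused_errno = [
        "connection refused (os error 61)",    // macOS
        "connection refused (os error 111)",   // Linux
        "connection refused (os error 10061)", // Windows WSAECONNREFUSED
    ]
    .iter()
    .any(|needle| lower.contains(needle));
    loopback_host && refused_errno
}

/// Simplified stand-in for the broader existing matcher (assumption):
/// it accepts any "connection refused", loopback or not.
fn is_network_unreachable_message(lower: &str) -> bool {
    lower.contains("connection refused") || lower.contains("network is unreachable")
}

/// Ladder sketch: the narrower loopback check runs first, so a body that
/// satisfies both matchers lands in the loopback bucket.
fn expected_error_kind(message: &str) -> Option<ExpectedErrorKind> {
    // Lowercase once; every matcher reads the same buffer.
    let lower = message.to_ascii_lowercase();
    if is_loopback_unavailable(&lower) {
        return Some(ExpectedErrorKind::LoopbackUnavailable);
    }
    if is_network_unreachable_message(&lower) {
        return Some(ExpectedErrorKind::NetworkUnreachable);
    }
    None
}
```

Swapping the two `if` blocks would send the R5/R6 bodies to `NetworkUnreachable`, because the broader matcher also accepts them; the ordering is what keeps the two buckets apart.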
- Verbatim R5 (`integrations.get`) and R6 (`rpc.invoke_method` re-wrap) bodies classify as `LoopbackUnavailable`.
- Linux (`os error 111`) and Windows (`os error 10061`) errno suffixes also classify, covering the WSL / native-windows desktop targets.
- Precedence guard: a body that satisfies both the loopback matcher and the broader `is_network_unreachable_message` must route through the loopback bucket first. This protects the bucket separation that makes "which class is spiking?" answerable in Sentry.
- Negative coverage: a loopback URL with a 503 status (developer proxy on `127.0.0.1`), a non-loopback `Connection refused` (real network unreachable), and prose bodies that satisfy only one of the two conjunctive anchors all stay out of the loopback bucket.
- Smoke: `report_error_or_expected` routes both R5/R6 shapes without panicking (the arm wiring is exercised end-to-end).
graycyrus
left a comment
Looks good, nice work!
Summary
- `ExpectedErrorKind::LoopbackUnavailable` for the in-process-core boot-window race: a sibling component (frontend RPC relay, integrations / composio HTTP clients) reaches `127.0.0.1:<port>` before the embedded core's listener finishes binding, gets a TCP `Connection refused`, and currently re-reports through `report_error_message` ⇒ Sentry.
- `is_loopback_unavailable`: `127.0.0.1:` / `localhost:` host AND `connection refused (os error 61 / 111 / 10061)`, wired into `expected_error_kind` before `is_network_unreachable_message` so loopback wins precedence.
- `tracing::debug!` with metadata-only structured fields (no raw `{message}`); breadcrumb survives, Sentry capture is skipped.
- … `integrations.get` emit site) and OPENHUMAN-TAURI-R6 (~2.5k events, `rpc.invoke_method` re-wrap of the same trace).
Problem
Two Sentry issues, ~5k events combined, share `trace_id=6ebf5b62748d5144e541e2cddeabbbd0`; R6 is the rpc-layer wrap of R5. Both report from `src/core/observability.rs` via the `report_error_*` site:
- `integrations.get failed: error sending request for url (http://127.0.0.1:18474/agent-integrations/composio/connections) → client error (Connect) → tcp connect error → Connection refused (os error 61)`
- `rpc.invoke_method failed: [composio] list_connections failed: GET http://127.0.0.1:18474/agent-integrations/composio/connections failed: error sending request for url (…) → Connection refused (os error 61)`
R5 already carries the tracing tag `failure: transport`, but the existing classifier has no loopback-specific arm. The body shape is the in-process core's `127.0.0.1:18474` listener not yet accepting connections during app startup. Per `feedback_in_process_core_restart_noop`, this boot window self-resolves once the core finishes binding; no retry on the calling side can do better than waiting it out, and Sentry has no remediation path. This is the same demote-the-noise family as PR #1798 (3 transient leak paths): leak #4.
While these bodies would also satisfy the broader `is_network_unreachable_message` matcher (which catches `connection refused` generically), folding them into `NetworkUnreachable` conflates an internal lifecycle race with real user-environment problems (VPN drop, captive portal, ISP block) and makes Sentry's "which class of transport failure is spiking?" signal unanswerable.
Solution
Three commits, scoped tightly to `src/core/observability.rs`:
1. `ExpectedErrorKind::LoopbackUnavailable` variant + `report_expected_message` arm: paired because Rust's exhaustive `match` requires the arm to land with the enum addition. The arm uses `tracing::debug!` (lower than the `warn!` used for `NetworkUnreachable`) and only logs `domain` / `operation` / `kind = "loopback_unavailable"`, with no raw `{message}` in fields or format string. Mirrors the #1719 ("fix(observability): drop 401 session-expired Sentry noise (#25, #1Q, #27, #1G)") review guidance to prefer tags over body for noise demotions.
2. `is_loopback_unavailable` matcher + ladder precedence: conjunctive substring matcher requiring both a loopback host with port (`127.0.0.1:` or `localhost:`) and an explicit errno (`(os error 61)` macOS, `(os error 111)` Linux, `(os error 10061)` Windows `WSAECONNREFUSED`). Inserted into the `expected_error_kind` ladder before `is_network_unreachable_message` so the loopback bucket wins; same precedence pattern as `ProviderUserState`-before-`BackendUserError` from PR #1795 ("fix(observability): demote composio validation noise to expected user-state (#3R #3S #33 #34 #97)").
3. Tests: verbatim R5/R6 bodies classify as `LoopbackUnavailable`; Linux + Windows errno variants both match; precedence guard asserts loopback shape ≠ `NetworkUnreachable`; loopback URL with a 503 status (developer proxy on `127.0.0.1:8080`) and non-loopback `Connection refused` stay out of the loopback bucket; `report_error_or_expected` routes both R5/R6 shapes through the expected path without panicking.
Bug shape pattern follows `feedback_in_process_core_restart_noop` (the in-process core boot window is the source), and the dedup gate from `sentry-workflow` Phase S2 was satisfied (no open or 30-day merged PR mentions OPENHUMAN-TAURI-R5 / -R6).
Submission Checklist
- … `diff-cover`) meet the gate enforced by `.github/workflows/coverage.yml`. Run `pnpm test:coverage` and `pnpm test:rust` locally; PRs below 80% on changed lines will not merge.
- … `docs/TEST-COVERAGE-MATRIX.md`.
- … `Closes #NNN` in the `## Related` section
Impact
- `expected_error_kind` is one extra substring check before the existing `NetworkUnreachable` matcher on the same lowercased buffer. No allocations beyond the existing `to_ascii_lowercase`.
- … `EAI_AGAIN` / `ECONNREFUSED` on all three.
Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
- … `sentry-workflow.md` Phase S5 fallback).
Commit & Branch
- `fix/sentry-loopback-classifier-r5-r6`
- `b61a7053` (tip; full chain in `git log upstream/main..HEAD`)
Validation Run
- `pnpm --filter openhuman-app format:check` not exercised.
- `pnpm typecheck` not exercised.
- `cargo test --lib core::observability::tests`: 63 passed, 0 failed locally. `cargo fmt` clean, `cargo check --lib` clean (4 unrelated warnings, all pre-existing on `main`).
- `pnpm rust:check` not exercised against `app/src-tauri`.
Validation Blocked
- `command:` `cargo clippy --lib --all-targets -- -D warnings`
- `error:` 93+ pre-existing clippy errors on `upstream/main` (verified by re-running clippy with the working tree stashed); 0 new from this PR. The `-D warnings` gate cannot be cleared without an unrelated cleanup pass.
- `impact:` none; the PR does not introduce or worsen any clippy diagnostic. The `observability.rs` lines flagged by clippy (2249-2256) are pre-existing `Default::default()` field reassignments in test helpers.
Behavior Changes
- … `Connection refused (os error N)` bodies stop being captured as Sentry error events; they demote to a `debug!` breadcrumb instead.
Parity Contract
- `NetworkUnreachable`, `TransientUpstreamHttp`, and all other existing buckets keep their current matchers; the new variant is additive and gated on a tighter conjunctive shape. The precedence guard test locks the ladder ordering.
- `is_transient_message_failure` at the `rpc.invoke_method` site still catches the R6 shape on the fallback path (it already did, via the `error sending request` substring); the new bucket is defense-in-depth at the classifier emit site.
Duplicate / Superseded PR Handling
- … `OPENHUMAN-TAURI-R5` / `-R6` and `LoopbackUnavailable`) returned no matches.
Summary by CodeRabbit
Bug Fixes
Tests