Skip to content

fix(observability): demote loopback sidecar-down noise to expected (#R5 #R6)#2063

Merged
graycyrus merged 3 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-loopback-classifier-r5-r6
May 19, 2026
Merged

fix(observability): demote loopback sidecar-down noise to expected (#R5 #R6)#2063
graycyrus merged 3 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-loopback-classifier-r5-r6

Conversation

@oxoxDev
Copy link
Copy Markdown
Contributor

@oxoxDev oxoxDev commented May 18, 2026

Summary

  • New ExpectedErrorKind::LoopbackUnavailable for the in-process-core boot-window race: a sibling component (frontend RPC relay, integrations / composio HTTP clients) reaches 127.0.0.1:<port> before the embedded core's listener finishes binding, gets a TCP Connection refused, and currently re-reports through report_error_message ⇒ Sentry.
  • Conjunctive matcher is_loopback_unavailable127.0.0.1: / localhost: host AND connection refused (os error 61 / 111 / 10061) — wired into expected_error_kind before is_network_unreachable_message so loopback wins precedence.
  • Demote arm uses tracing::debug! with metadata-only structured fields (no raw {message}); breadcrumb survives, Sentry capture is skipped.
  • Closes OPENHUMAN-TAURI-R5 (~2.5k events, integrations.get emit site) and OPENHUMAN-TAURI-R6 (~2.5k events, rpc.invoke_method re-wrap of the same trace).

Problem

Two Sentry issues, ~5k events combined, share trace_id=6ebf5b62748d5144e541e2cddeabbbd0 — R6 is the rpc-layer wrap of R5. Both report from src/core/observability.rs via the report_error_* site:

ID Events Body shape
OPENHUMAN-TAURI-R5 ~2526 integrations.get failed: error sending request for url (http://127.0.0.1:18474/agent-integrations/composio/connections) → client error (Connect) → tcp connect error → Connection refused (os error 61)
OPENHUMAN-TAURI-R6 ~2471 rpc.invoke_method failed: [composio] list_connections failed: GET http://127.0.0.1:18474/agent-integrations/composio/connections failed: error sending request for url (…) → Connection refused (os error 61)

R5 already carries the tracing tag failure: transport, but the existing classifier has no loopback-specific arm — the body shape is the in-process core's 127.0.0.1:18474 listener not yet accepting connections during app startup. Per feedback_in_process_core_restart_noop this boot window self-resolves once the core finishes binding; no retry on the calling side can do better than waiting it out, and Sentry has no remediation path. This is the same demote-the-noise family as PR #1798 (3 transient leak paths) — leak #4.

While these bodies would also satisfy the broader is_network_unreachable_message matcher (which catches connection refused generically), folding them into NetworkUnreachable conflates an internal lifecycle race with real user-environment problems (VPN drop, captive portal, ISP block) and makes Sentry's "which class of transport failure is spiking?" signal un-answerable.

Solution

Three commits, scoped tightly to src/core/observability.rs:

  1. ExpectedErrorKind::LoopbackUnavailable variant + report_expected_message arm — paired because Rust's exhaustive match requires the arm to land with the enum addition. The arm uses tracing::debug! (lower than warn! used for NetworkUnreachable) and only logs domain / operation / kind = "loopback_unavailable" — no raw {message} in fields or format string. Mirrors fix(observability): drop 401 session-expired Sentry noise (#25, #1Q, #27, #1G) #1719 review guidance to prefer tags over body for noise demotions.

  2. is_loopback_unavailable matcher + ladder precedence — conjunctive substring matcher requiring both a loopback host with port (127.0.0.1: or localhost:) and an explicit errno ((os error 61) macOS, (os error 111) Linux, (os error 10061) Windows WSAECONNREFUSED). Inserted into the expected_error_kind ladder before is_network_unreachable_message so the loopback bucket wins — same precedence pattern as ProviderUserState-before-BackendUserError from PR fix(observability): demote composio validation noise to expected user-state (#3R #3S #33 #34 #97) #1795.

  3. Tests — verbatim R5/R6 bodies classify as LoopbackUnavailable; Linux + Windows errno variants both match; precedence guard asserts loopback shape ≠ NetworkUnreachable; loopback URL with a 503 status (developer proxy on 127.0.0.1:8080) and non-loopback Connection refused stay out of the loopback bucket; report_error_or_expected routes both R5/R6 shapes through the expected path without panicking.

Bug shape pattern follows feedback_in_process_core_restart_noop — the in-process core boot window is the source — and the dedup gate from sentry-workflow Phase S2 was satisfied (no open or 30-day merged PR mentions OPENHUMAN-TAURI-R5 / -R6).

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by .github/workflows/coverage.yml. Run pnpm test:coverage and pnpm test:rust locally; PRs below 80% on changed lines will not merge.
  • N/A: behaviour-only demote of a noise bucket — no feature row maps to this in docs/TEST-COVERAGE-MATRIX.md.
  • N/A: no feature-matrix row affected by this PR.
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • N/A: this PR does not touch release-cut surfaces — no manual smoke checklist update needed.
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Sentry noise: ~5k events / window across R5 + R6 stop reaching Sentry. Both still log as breadcrumbs (debug level) so trace correlation survives.
  • Runtime: zero — expected_error_kind is one extra substring check before the existing NetworkUnreachable matcher on the same lowercased buffer. No allocations beyond the existing to_ascii_lowercase.
  • Platforms: desktop (macOS / Linux / Windows). The errno set covers EAI_AGAIN / ECONNREFUSED on all three.
  • Security / migration: none — observability-only, no schema or API change.

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A — Sentry-only triage, no Linear ticket created (per sentry-workflow.md Phase S5 fallback).
  • URL: N/A

Commit & Branch

  • Branch: fix/sentry-loopback-classifier-r5-r6
  • Commit SHA: b61a7053 (tip; full chain in git log upstream/main..HEAD)

Validation Run

  • N/A: no frontend changes in this PR — pnpm --filter openhuman-app format:check not exercised.
  • N/A: no TS changes — pnpm typecheck not exercised.
  • Focused tests: cargo test --lib core::observability::tests — 63 passed, 0 failed locally.
  • Rust fmt/check (if changed): cargo fmt clean, cargo check --lib clean (4 unrelated warnings, all pre-existing on main).
  • N/A: no Tauri shell changes — pnpm rust:check not exercised against app/src-tauri.

Validation Blocked

  • command: cargo clippy --lib --all-targets -- -D warnings
  • error: 93+ pre-existing clippy errors on upstream/main (verified by re-running clippy with the working tree stashed); 0 new from this PR. The -D warnings gate cannot be cleared without an unrelated cleanup pass.
  • impact: none — the PR does not introduce or worsen any clippy diagnostic; observability.rs lines flagged by clippy (2249-2256) are pre-existing Default::default() field reassignments in test helpers.

Behavior Changes

  • Intended behavior change: loopback Connection refused (os error N) bodies stop being captured as Sentry error events; they demote to a debug! breadcrumb instead.
  • User-visible effect: none — no log line the user sees changes; only the Sentry dashboard noise drops.

Parity Contract

  • Legacy behavior preserved: NetworkUnreachable, TransientUpstreamHttp, and all other existing buckets keep their current matchers — the new variant is additive and gated on a tighter conjunctive shape. The precedence guard test locks the ladder ordering.
  • Guard/fallback/dispatch parity checks: existing is_transient_message_failure at the rpc.invoke_method site still catches the R6 shape on the fallback path (it already did, via the error sending request substring) — the new bucket is defense-in-depth at the classifier emit site.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none — sentry-workflow Phase S2 dedup (open + 30-day merged search on OPENHUMAN-TAURI-R5 / -R6 and LoopbackUnavailable) returned no matches.
  • Canonical PR: this PR.
  • Resolution: N/A

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced error classification for loopback connection failures to provide more accurate diagnostics.
  • Tests

    • Extended test coverage for loopback error detection across Linux, WSL, and Windows platforms.

Review Change Stack

oxoxDev added 3 commits May 18, 2026 13:00
…#R6)

Pairs the new variant with its `report_expected_message` arm in a single
commit because Rust's exhaustive `match` requires the arm to land
alongside the enum addition. Matcher + ladder wiring follow in the next
commit; tests in the one after.

The arm demotes at `tracing::debug!` (lower than the `warn!` used for
`NetworkUnreachable`) and logs metadata-only fields — no raw `{message}`
in the structured payload or format string. Loopback URLs carry no PII
and the body adds no remediation signal, matching the tinyhumansai#1719 review
guidance to prefer tags over body for noise demotions.
…5 #R6)

Adds `is_loopback_unavailable(lower)` — a conjunctive matcher requiring
both a `127.0.0.1:<port>` / `localhost:<port>` host substring and a
platform-specific `connection refused (os error N)` errno (61 = macOS,
111 = Linux, 10061 = Windows WSAECONNREFUSED). Wired into the
`expected_error_kind` ladder *before* `is_network_unreachable_message`
so loopback boot-window races win precedence over the broader
user-environment bucket. Mirrors the `ProviderUserState`-before-
`BackendUserError` precedence pattern from PR tinyhumansai#1795.

Without precedence here both R5 and R6 fall through to
`NetworkUnreachable` and conflate an internal lifecycle race against the
embedded core's startup (`127.0.0.1:18474` not yet bound) with real
user-environment problems (VPN drop, captive portal, ISP block).
Keeping the buckets distinct preserves Sentry's "what class of transport
failure is spiking?" signal.
- Verbatim R5 (`integrations.get`) and R6 (`rpc.invoke_method` re-wrap)
  bodies classify as `LoopbackUnavailable`.
- Linux (`os error 111`) and Windows (`os error 10061`) errno suffixes
  also classify, covering the WSL / native-windows desktop targets.
- Precedence guard: a body that satisfies both the loopback matcher and
  the broader `is_network_unreachable_message` must route through the
  loopback bucket first — protects the bucket separation that makes
  "which class is spiking?" answerable in Sentry.
- Negative coverage: a loopback URL with a 503 status (developer proxy
  on `127.0.0.1`), a non-loopback `Connection refused` (real network
  unreachable), and prose bodies that satisfy only one of the two
  conjunctive anchors all stay out of the loopback bucket.
- Smoke: `report_error_or_expected` routes both R5/R6 shapes without
  panicking (the arm wiring is exercised end-to-end).
@oxoxDev oxoxDev requested a review from a team May 18, 2026 07:58
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

📝 Walkthrough

Walkthrough

This PR extends the error observability system in src/core/observability.rs with a new expected error classification for core boot-window loopback connection failures (127.0.0.1:<port> / localhost:<port> "Connection refused"), routing them separately from general network problems and reporting them as debug breadcrumbs rather than full error bodies.

Changes

Loopback Unavailable Error Classification

Layer / File(s) Summary
Loopback detection variant and matcher
src/core/observability.rs
ExpectedErrorKind::LoopbackUnavailable variant is added with documentation for boot-window connection-refused scenarios. The is_loopback_unavailable matcher requires both loopback host+port anchors and OS-specific errno phrases. Classification precedence in expected_error_kind checks loopback detection before NetworkUnreachable, with documentation clarifying the routing.
Loopback error reporting
src/core/observability.rs
report_expected_message adds a LoopbackUnavailable arm that logs a debug breadcrumb with domain/operation/kind tags, excluding raw message/body.
Loopback classification test suite
src/core/observability.rs
Tests validate R5/R6 loopback connect-refused classification across platform variants, precedence over NetworkUnreachable, errno handling, rejection of non-transport loopback failures, non-loopback connect-refused routing, conjunctive matcher requirements, and safe report_error_or_expected routing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • tinyhumansai/openhuman#1763: Modifies ExpectedErrorKind and the expected reporting path in the same observability module, adding different classified variants (SessionExpired).
  • tinyhumansai/openhuman#1798: Modifies Sentry expected/transient error classification and routing in src/core/observability.rs, adding new categorization logic and corresponding tests.

Suggested labels

working

Suggested reviewers

  • graycyrus
  • senamakel

Poem

🐇 A boot-window hop so light,
Loopback whispers through the night,
Connection refused, but all is well—
A debug breadcrumb's gentle knell. 🌙✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: introducing a new error classifier to demote loopback connection refusals to expected/debug breadcrumbs for Sentry, with specific references to the affected R5/R6 issues.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 18, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants