fix(observability): Wave 4 classifier — socket transport + custom-provider config-rejection (~366 events, 13 IDs) #2309
… (Wave 4 Lane O)

Extend `is_provider_config_rejection_message` PHRASES with 8 new substrings covering wire shapes that the Wave 1-3 phrases miss:

- `not available in your region` — R1 (region block)
- `modelnotallowed` — R4 (Doubao/ChatGLM allowlist)
- `invalid_authentication_error` — YC (user key rejected by upstream)
- `requires more credits` — S5 (OpenRouter 402 out-of-credits)
- `invalid model name passed in model=` — Y0 (litellm proxy pre-routing reject)
- `no active credentials for provider` — JN + KB (upstream API key gap)
- `litellm.badrequesterror` — JK (github_copilot OAuth gap)
- `not_found_error` — J2 + J5 + J4 (litellm envelope `type`)

Each is a deterministic user-state error (wrong model, wrong region, bad key, out of credits, missing OAuth scope) — the reliable-provider stack already falls back to OpenHuman's hosted tier, so the UX is intact; only the Sentry spam was leaking. Closes ~250 events across 11 issue IDs.

Pinned tests against the literal Sentry event bodies from each ID so a future provider rename doesn't silently un-classify them.

Closes OPENHUMAN-TAURI-R1
Closes OPENHUMAN-TAURI-R4
Closes OPENHUMAN-TAURI-YC
Closes OPENHUMAN-TAURI-S5
Closes OPENHUMAN-TAURI-Y0
Closes OPENHUMAN-TAURI-JN
Closes OPENHUMAN-TAURI-KB
Closes OPENHUMAN-TAURI-JK
Closes OPENHUMAN-TAURI-J2
Closes OPENHUMAN-TAURI-J5
Closes OPENHUMAN-TAURI-J4
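A minimal sketch of how the eight phrases in this commit message might be matched, assuming the matcher lowercases the message once and then does plain substring scans. The names `WAVE4_PHRASES` and `is_config_rejection_sketch` are hypothetical stand-ins, not the real `PHRASES` const or `is_provider_config_rejection_message`; the seven Wave 1-3 phrases are left out because they are not listed here.

```rust
/// Wave 4 anchors only (hypothetical name). Entries are lowercase because
/// the haystack is lowercased once before scanning.
const WAVE4_PHRASES: &[&str] = &[
    "not available in your region",        // R1
    "modelnotallowed",                     // R4
    "invalid_authentication_error",        // YC
    "requires more credits",               // S5
    "invalid model name passed in model=", // Y0
    "no active credentials for provider",  // JN + KB
    "litellm.badrequesterror",             // JK
    "not_found_error",                     // J2 + J5 + J4
];

/// Hypothetical stand-in for `is_provider_config_rejection_message`:
/// true when any anchor appears in the ASCII-lowercased message.
pub fn is_config_rejection_sketch(message: &str) -> bool {
    let lower = message.to_ascii_lowercase();
    WAVE4_PHRASES.iter().any(|phrase| lower.contains(*phrase))
}
```

Because `to_ascii_lowercase` leaves non-ASCII bytes untouched, a body that mixes ASCII keys with non-ASCII text (such as the R4 shape) can still match on its ASCII `reason` field.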
…ve 4 Lane N)
Extend `is_network_unreachable_message` with three substring arms for wire shapes the existing `dns error` / status-bearing matchers miss:

- `failed to lookup address` — libc `getaddrinfo()` rendering when tungstenite wraps the resolver failure as an `IO error` without the `dns error` prefix (OPENHUMAN-TAURI-44, ~50 events).
- `nodename nor servname` — companion phrase from the macOS/BSD libc resolver; same OPENHUMAN-TAURI-44 wire shape, second anchor.
- `http error: 200 ok` — tungstenite's `WsError::Http(200)` rendering when a captive portal / corporate proxy intercepts the WS upgrade handshake and returns a plain HTML 200 page (OPENHUMAN-TAURI-4P, ~66 events). Tungstenite-only: reqwest renders HTTP 200 as `HTTP status server error (200)`, so there is no collision with the regular HTTP path.

A precedence test (`http_200_classifier_does_not_silence_unrelated_log_lines`)
pins the substring against benign `HTTP/1.1 200 OK` / `status: 200 OK`
prose so a future broadening does not silence success traces.
Sentry has no remediation path for any of these — the user must change
their network (firewall / proxy / DNS). Closes ~116 additional events.
Closes OPENHUMAN-TAURI-44
Closes OPENHUMAN-TAURI-4P
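The three arms described in this commit message could be sketched roughly as follows. This assumes the same lowercase-then-substring approach; `is_network_unreachable_wave4` is a hypothetical name that covers only the new arms, and the existing `dns error` / `connection refused` arms of the real function are not reproduced.

```rust
/// Hypothetical stand-in showing only the three Wave 4 arms of
/// `is_network_unreachable_message`.
pub fn is_network_unreachable_wave4(message: &str) -> bool {
    let lower = message.to_ascii_lowercase();
    // libc getaddrinfo() failure, first and second anchor (issue 44).
    lower.contains("failed to lookup address")
        || lower.contains("nodename nor servname")
        // tungstenite HTTP-200 handshake rendering (issue 4P). The
        // "http error: " prefix keeps plain "200 OK" prose from matching.
        || lower.contains("http error: 200 ok")
}
```

Under this sketch, benign strings such as `HTTP/1.1 200 OK` or `status: 200 OK` do not match, because neither contains the `http error: ` prefix. That is the property the precedence test is meant to pin.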
📝 Walkthrough
This PR extends two error classification matchers to recognize additional Wave 4 Sentry integration patterns: network transport failures (POSIX resolver and WebSocket captive-portal variants) and provider configuration rejections (region blocking, invalid auth, credit limits, and litellm error formats), with corresponding regression tests for each.
Changes
Wave 4 Error Pattern Classification
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Summary
- … `expected_error_kind` matcher ladder with 11 new substring arms across two existing buckets — no new variants, no new emit sites, no behavior change beyond reclassifying known wire shapes that were leaking past Wave 1-3 anchors.
- … `report_error_or_expected` at `ws_loop.rs:191`, so the threshold-escalation event simply demotes to a `warn` breadcrumb instead of paging.
- … `is_provider_config_rejection_message` `PHRASES` const) and Lane N (`is_network_unreachable_message` substring arms) — each independently revertible.

Problem
After the Wave 1-3 sweep landed, fresh Sentry triage surfaced 13 unresolved IDs whose bodies match the spirit of an existing classifier bucket but use wire-shape variants the current substring anchors miss:
Lane O — custom-provider config rejection (~250 events)
The `ProviderConfigRejection` variant + `is_provider_config_rejection_message` exist precisely for "user pointed OpenHuman at a custom_openai endpoint with a model / temperature / region / credential that provider doesn't accept." But the 7 phrases shipped in Wave 1-3 (#2079 / #2076 / #2202) only cover the DeepSeek / OpenRouter abstract-tier-leak and Moonshot temperature shapes. New surfaces:

- R1: `{"error":{"message":"This model is not available in your region.","code":403}}`
- R4: `{"code":403,"reason":"ModelNotAllowed","message":"模型不允许访问"…}` (Doubao/ChatGLM)
- YC: `{"error":{"type":"invalid_authentication_error"…}}`
- S5: `{"error":{"message":"This request requires more credits, or fewer max_tokens…"}}` (OpenRouter 402)
- Y0: `{'error': '/chat/completions: Invalid model name passed in model=reasoning-v1…'}`
- JN: `{"error":{"message":"No active credentials for provider: openai"…}}`
- KB: `No active credentials for provider` shape from OpenHuman backend re-emit
- JK: `litellm.BadRequestError: Github_copilotException - Bad Request…`
- J2 / J5 / J4: `{"error":{"message":"model 'llama3.3' not found","type":"not_found_error"…}}`

Lane N — socket WebSocket-connect transport (~116 events)
`is_network_unreachable_message` already catches `connection refused` / `dns error` / `network is unreachable` but two real shapes escape:

- 44: `[socket] Connection failed: WebSocket connect: IO error: failed to lookup address information: nodename nor servname provided, or not known`
- 4P: `[socket] Connection failed: WebSocket connect: HTTP error: 200 OK` (captive portal / corporate proxy intercepting the WS upgrade handshake)

Every one of these is deterministic user-environment / user-configuration state — the maintainers have no remediation. Sentry has no signal to act on. Every event was pure noise.
Solution
Lane O — `src/openhuman/inference/provider/config_rejection.rs:55-79`

Append 8 new phrases to the `PHRASES` const (case-insensitive substring match, same precedence as existing 7):

- `not available in your region`
- `modelnotallowed`
- `invalid_authentication_error`
- `requires more credits`
- `invalid model name passed in model=`
- `no active credentials for provider`
- `litellm.badrequesterror`
- `not_found_error`

Each anchor is intentionally narrow (e.g. `passed in model=` not bare `invalid model name`; `litellm.badrequesterror` not bare `litellm`) so a stray log line elsewhere can't accidentally demote a real provider/backend bug. The HTTP-layer wrapper (`is_provider_config_rejection_http`) still guards on `provider != openhuman_backend::PROVIDER_LABEL`, so a model rejection from our own backend (which would be a real regression) still reaches Sentry.

Lane N — `src/core/observability.rs:299-319`

Three new substring arms appended to `is_network_unreachable_message`:

- `failed to lookup address` + `nodename nor servname` — libc `getaddrinfo()` failure renderings on macOS / BSD / POSIX resolvers when tungstenite wraps as `IO error` without the reqwest `dns error` prefix.
- `http error: 200 ok` — tungstenite-only render. Reqwest renders HTTP 200 as `"HTTP status server error (200)"`, so no collision with the regular HTTP call path. A negative precedence test (`http_200_classifier_does_not_silence_unrelated_log_lines`) pins this against benign `HTTP/1.1 200 OK` / `status: 200 OK` prose so a future broadening cannot silence success traces.

Design trade-off — classifier vs. root-fix validation layer
The Lane O bugs do have a more durable root fix: a pre-save provider validation layer (test the API key, validate the model id against `/v1/models`, surface region-blocks at config-save time). That's a real product initiative requiring UX design, per-provider model-list infrastructure, and meaningful spec work — out of scope for a triage sweep. This PR follows the Wave 1-3 maintainer precedent (classifier-first noise suppression) so the Sentry signal/noise ratio improves immediately; the validation layer remains tracked as a separate follow-up. If/when it lands, these classifier arms become redundant belt-and-suspenders and can be deleted without conflict.

Lane N has no root-fix alternative — offline / firewall / captive-portal / DNS failures are pure user-environment state. Classifier-demote is the correct disposition.
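The provider-label guard mentioned in the Lane O solution can be illustrated with a small sketch. Everything here is an assumption for illustration: `BACKEND_PROVIDER_LABEL` and its value stand in for `openhuman_backend::PROVIDER_LABEL`, `Disposition` and `classify` are hypothetical names, and only two of the phrases are checked. The point it shows is that a matching body is demoted only when it comes from a third-party provider.

```rust
/// Hypothetical stand-in for `openhuman_backend::PROVIDER_LABEL`;
/// the real value is not given in this PR.
const BACKEND_PROVIDER_LABEL: &str = "openhuman";

#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    /// Expected user-state error: keep as a warn-level breadcrumb.
    Breadcrumb,
    /// Unclassified or first-party error: escalate so it reaches Sentry.
    Escalate,
}

/// Demote only when the body matches a rejection phrase AND the provider
/// is not the first-party backend (hypothetical sketch of the guard).
pub fn classify(provider: &str, body: &str) -> Disposition {
    let lower = body.to_ascii_lowercase();
    let looks_like_rejection = lower.contains("not available in your region")
        || lower.contains("requires more credits");
    if provider != BACKEND_PROVIDER_LABEL && looks_like_rejection {
        Disposition::Breadcrumb
    } else {
        Disposition::Escalate
    }
}
```

With this shape, the same region-block body escalates when it comes from the first-party backend, so a rejection from our own backend still reaches Sentry.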
Submission Checklist
- … `diff-cover`) meet the gate enforced by `.github/workflows/coverage.yml`. 11 positive tests pinned to real Sentry bodies + 1 negative precedence test cover every new substring arm; only the rustdoc comments and the negative `unrelated_*` test body are non-executable lines.
- … `docs/TEST-COVERAGE-MATRIX.md` reflect this change
- … `## Related`
- … `docs/RELEASE-MANUAL-SMOKE.md`)
- … `Closes #NNN` in the `## Related` section

Impact
- `report_error_or_expected` now classifies these 13 wire shapes as `ExpectedErrorKind::*` instead of escalating as `tracing::error`, so Sentry stops receiving the events.
- `is_provider_config_rejection_message` lowercases once and runs 15 substring scans (was 7); `is_network_unreachable_message` runs 11 substring scans (was 8). Both already on the error-path, never on the hot path.
- … `to_ascii_lowercase()`.

Related