Skip to content

fix(observability): Wave 4 classifier — socket transport + custom-provider config-rejection (~366 events, 13 IDs)#2309

Merged
senamakel merged 2 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-wave4-classifier-no
May 20, 2026
Merged

fix(observability): Wave 4 classifier — socket transport + custom-provider config-rejection (~366 events, 13 IDs)#2309
senamakel merged 2 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-wave4-classifier-no

Conversation

@oxoxDev
Copy link
Copy Markdown
Contributor

@oxoxDev oxoxDev commented May 20, 2026

Summary

  • Extends expected_error_kind matcher ladder with 11 new substring arms across two existing buckets — no new variants, no new emit sites, no behavior change beyond reclassifying known wire shapes that were leaking past Wave 1-3 anchors.
  • Closes 13 Sentry IDs (~366 events) — all deterministic user-environment / user-config errors with no remediation path (offline, DNS fail, captive portal, bad API key, out of credits, wrong model id, missing OAuth scope, region block).
  • Pure classifier — no UI, retry, or fallback logic touched. The reliable-provider stack still falls back to OpenHuman's hosted tier for the Lane O bodies; users keep their chat. Lane N socket failures already gated through report_error_or_expected at ws_loop.rs:191, so the threshold-escalation event simply demotes to a warn breadcrumb instead of paging.
  • Two micro-commits — Lane O (is_provider_config_rejection_message PHRASES const) and Lane N (is_network_unreachable_message substring arms) — each independently revertible.

Problem

After the Wave 1-3 sweep landed, fresh Sentry triage surfaced 13 unresolved IDs whose bodies match the spirit of an existing classifier bucket but use wire-shape variants the current substring anchors miss:

Lane O — custom-provider config rejection (~250 events)

The ProviderConfigRejection variant + is_provider_config_rejection_message exist precisely for "user pointed OpenHuman at a custom_openai endpoint with a model / temperature / region / credential that provider doesn't accept." But the 7 phrases shipped in Wave 1-3 (#2079 / #2076 / #2202) only cover the DeepSeek / OpenRouter abstract-tier-leak and Moonshot temperature shapes. New surfaces:

Sentry ID Events Real wire body
R1 44 {"error":{"message":"This model is not available in your region.","code":403}}
R4 14 {"code":403,"reason":"ModelNotAllowed","message":"模型不允许访问"…} (Doubao/ChatGLM)
YC 16 {"error":{"type":"invalid_authentication_error"…}}
S5 14 {"error":{"message":"This request requires more credits, or fewer max_tokens…"}} (OpenRouter 402)
Y0 13 {'error': '/chat/completions: Invalid model name passed in model=reasoning-v1…'}
JN 14 {"error":{"message":"No active credentials for provider: openai"…}}
KB 16 Same No active credentials for provider shape from OpenHuman backend re-emit
JK 17 litellm.BadRequestError: Github_copilotException - Bad Request…
J2 / J5 / J4 62 {"error":{"message":"model 'llama3.3' not found","type":"not_found_error"…}}

Lane N — socket WebSocket-connect transport (~116 events)

is_network_unreachable_message already catches connection refused / dns error / network is unreachable but two real shapes escape:

Sentry ID Events Real wire body
44 50 [socket] Connection failed: WebSocket connect: IO error: failed to lookup address information: nodename nor servname provided, or not known
4P 66 [socket] Connection failed: WebSocket connect: HTTP error: 200 OK (captive portal / corporate proxy intercepting the WS upgrade handshake)

Every one of these is deterministic user-environment / user-configuration state — the maintainers have no remediation. Sentry has no signal to act on. Every event was pure noise.

Solution

Lane O — src/openhuman/inference/provider/config_rejection.rs:55-79

Append 8 new phrases to the PHRASES const (case-insensitive substring match, same precedence as existing 7):

"not available in your region",        // R1
"modelnotallowed",                     // R4
"invalid_authentication_error",        // YC
"requires more credits",               // S5
"invalid model name passed in model=", // Y0
"no active credentials for provider",  // JN + KB
"litellm.badrequesterror",             // JK
"not_found_error",                     // J2 + J5 + J4

Each anchor is intentionally narrow (e.g. passed in model= not bare invalid model name; litellm.badrequesterror not bare litellm) so a stray log line elsewhere can't accidentally demote a real provider/backend bug. The HTTP-layer wrapper (is_provider_config_rejection_http) still guards on provider != openhuman_backend::PROVIDER_LABEL, so a model rejection from our own backend (which would be a real regression) still reaches Sentry.

Lane N — src/core/observability.rs:299-319

Three new substring arms appended to is_network_unreachable_message:

  • failed to lookup address + nodename nor servname — libc getaddrinfo() failure renderings on macOS / BSD / POSIX resolvers when tungstenite wraps as IO error without the reqwest dns error prefix.
  • http error: 200 ok — tungstenite-only render. Reqwest renders HTTP 200 as "HTTP status server error (200)", so no collision with the regular HTTP call path. A negative precedence test (http_200_classifier_does_not_silence_unrelated_log_lines) pins this against benign HTTP/1.1 200 OK / status: 200 OK prose so a future broadening cannot silence success traces.

Design trade-off — classifier vs. root-fix validation layer

The Lane O bugs do have a more durable root fix: a pre-save provider validation layer (test the API key, validate the model id against /v1/models, surface region-blocks at config-save time). That's a real product initiative requiring UX design, per-provider model-list infrastructure, and meaningful spec work — out of scope for a triage sweep. This PR follows the Wave 1-3 maintainer precedent (classifier-first noise suppression) so the Sentry signal/noise ratio improves immediately; the validation layer remains tracked as a separate follow-up. If/when it lands, these classifier arms become redundant belt-and-suspenders and can be deleted without conflict.

Lane N has no root-fix alternative — offline / firewall / captive-portal / DNS failures are pure user-environment state. Classifier-demote is the correct disposition.

Submission Checklist

If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by .github/workflows/coverage.yml. 11 positive tests pinned to real Sentry bodies + 1 negative precedence test cover every new substring arm; only the rustdoc comments and the negative unrelated_* test body are non-executable lines.
  • N/A: behaviour-only change — Coverage matrix updated — added/removed/renamed feature rows in docs/TEST-COVERAGE-MATRIX.md reflect this change
  • All affected feature IDs from the matrix are listed in the PR description under ## Related
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • N/A: classifier-only change with no UI / release-cut surface touched — Manual smoke checklist updated if this touches release-cut surfaces (docs/RELEASE-MANUAL-SMOKE.md)
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Runtime: no behavior change for users — chat / agent / web-channel / socket paths continue to fall back / retry exactly as before. The only delta is that report_error_or_expected now classifies these 13 wire shapes as ExpectedErrorKind::* instead of escalating as tracing::error, so Sentry stops receiving the events.
  • Performance: negligible — is_provider_config_rejection_message lowercases once and runs 15 substring scans (was 7); is_network_unreachable_message runs 11 substring scans (was 8). Both already on the error-path, never on the hot path.
  • Security: no new attack surface. Substring matchers run on already-rendered error messages — no parsing, no allocation beyond the existing to_ascii_lowercase().
  • Migration: none.
  • Compatibility: forward-compatible. New phrases are append-only — no existing matcher behavior changes. If a future maintainer ships the root-fix validation layer, deleting these arms is a clean local edit.

Related

  • Closes OPENHUMAN-TAURI-R1
  • Closes OPENHUMAN-TAURI-R4
  • Closes OPENHUMAN-TAURI-YC
  • Closes OPENHUMAN-TAURI-S5
  • Closes OPENHUMAN-TAURI-Y0
  • Closes OPENHUMAN-TAURI-JN
  • Closes OPENHUMAN-TAURI-KB
  • Closes OPENHUMAN-TAURI-JK
  • Closes OPENHUMAN-TAURI-J2
  • Closes OPENHUMAN-TAURI-J5
  • Closes OPENHUMAN-TAURI-J4
  • Closes OPENHUMAN-TAURI-44
  • Closes OPENHUMAN-TAURI-4P
  • Follow-up PR(s)/TODOs: pre-save provider-validation layer (validate API key + model id + region at config-save time) tracked as separate product initiative; would supersede the Lane O arms.

AI Authored PR Metadata (required for Codex/Linear PRs)

Keep this section for AI-authored PRs. For human-only PRs, mark each field N/A.

Linear Issue

  • Key: N/A (Sentry-driven triage; no Linear ticket)
  • URL: N/A

Commit & Branch

  • Branch: fix/sentry-wave4-classifier-no
  • Commit SHA: e16414a (tip)

Agent

  • Claude Code (Sonnet 4.6) running locally

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced detection of network connectivity failures and provider configuration rejection scenarios for more accurate error classification.
  • Tests

    • Added comprehensive test coverage for network-related transport failures and provider configuration rejection cases.

Review Change Stack

oxoxDev added 2 commits May 20, 2026 13:50
… (Wave 4 Lane O)

Extend `is_provider_config_rejection_message` PHRASES with 8 new
substrings covering wire shapes that the Wave 1-3 phrases miss:

- `not available in your region`            — R1 (region block)
- `modelnotallowed`                          — R4 (Doubao/ChatGLM allowlist)
- `invalid_authentication_error`             — YC (user key rejected by upstream)
- `requires more credits`                    — S5 (OpenRouter 402 out-of-credits)
- `invalid model name passed in model=`      — Y0 (litellm proxy pre-routing reject)
- `no active credentials for provider`       — JN + KB (upstream API key gap)
- `litellm.badrequesterror`                  — JK (github_copilot OAuth gap)
- `not_found_error`                          — J2 + J5 + J4 (litellm envelope `type`)

Each is a deterministic user-state error (wrong model, wrong region, bad
key, out of credits, missing OAuth scope) — the reliable-provider stack
already falls back to OpenHuman's hosted tier, so the UX is intact; only
the Sentry spam was leaking. Closes ~250 events across 11 issue IDs.

Pinned tests against the literal Sentry event bodies from each ID so a
future provider rename doesn't silently un-classify them.

Closes OPENHUMAN-TAURI-R1
Closes OPENHUMAN-TAURI-R4
Closes OPENHUMAN-TAURI-YC
Closes OPENHUMAN-TAURI-S5
Closes OPENHUMAN-TAURI-Y0
Closes OPENHUMAN-TAURI-JN
Closes OPENHUMAN-TAURI-KB
Closes OPENHUMAN-TAURI-JK
Closes OPENHUMAN-TAURI-J2
Closes OPENHUMAN-TAURI-J5
Closes OPENHUMAN-TAURI-J4
…ve 4 Lane N)

Extend `is_network_unreachable_message` with three substring arms for
wire shapes the existing `dns error` / status-bearing matchers miss:

- `failed to lookup address`   — libc `getaddrinfo()` rendering when
                                  tungstenite wraps the resolver fail as
                                  an `IO error` without the `dns error`
                                  prefix (OPENHUMAN-TAURI-44 ~50 events).
- `nodename nor servname`      — companion phrase from the macOS/BSD libc
                                  resolver — same OPENHUMAN-TAURI-44
                                  wire shape, second anchor.
- `http error: 200 ok`         — tungstenite's `WsError::Http(200)`
                                  rendering when a captive portal /
                                  corporate proxy intercepts the WS
                                  upgrade handshake and returns a plain
                                  HTML 200 page (OPENHUMAN-TAURI-4P
                                  ~66 events). Tungstenite-only — reqwest
                                  renders HTTP 200 as `HTTP status server
                                  error (200)` so there is no collision
                                  with the regular HTTP path.

A precedence test (`http_200_classifier_does_not_silence_unrelated_log_lines`)
pins the substring against benign `HTTP/1.1 200 OK` / `status: 200 OK`
prose so a future broadening does not silence success traces.

Sentry has no remediation path for any of these — the user must change
their network (firewall / proxy / DNS). Closes ~116 additional events.

Closes OPENHUMAN-TAURI-44
Closes OPENHUMAN-TAURI-4P
@oxoxDev oxoxDev requested a review from a team May 20, 2026 08:22
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

This PR extends two error classification matchers to recognize additional Wave 4 Sentry integration patterns: network transport failures (POSIX resolver and WebSocket captive-portal variants) and provider configuration rejections (region blocking, invalid auth, credit limits, and litellm error formats), with corresponding regression tests for each.

Changes

Wave 4 Error Pattern Classification

Layer / File(s) Summary
Network unreachable transport patterns
src/core/observability.rs
is_network_unreachable_message adds substring matching for POSIX resolver failures (failed to lookup address, nodename nor servname) and WebSocket proxy errors (http error: 200 ok); tests verify the new matchers and assert the http error: 200 ok classifier does not misclassify unrelated HTTP success strings.
Provider config rejection patterns
src/openhuman/inference/provider/config_rejection.rs
Module documentation enumerates additional Wave 4 provider-rejection cases; the PHRASES matcher list extends with region-block, invalid-authentication, litellm, and credential-missing tokens; regression tests assert classification of concrete custom-openai and litellm error bodies.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2063: Both modify src/core/observability.rs expected-error classification; this PR expands is_network_unreachable_message for Wave 4 variants while #2063 adjusted matcher precedence and LoopbackUnavailable handling.
  • tinyhumansai/openhuman#2239: Introduced the provider config-rejection classification and demotion logic; this PR extends its is_provider_config_rejection_message matcher with additional Wave 4 phrase patterns.

Suggested labels

working

Suggested reviewers

  • graycyrus

Poem

🐰 Four Waves bring errors, patterns in the fold,
From POSIX whispers to captive-portal gold.
We catch the rejections that users have made—
Substring by substring, a classification parade!
Wave four shapes recognized, tests light the way. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically summarizes the main changes: extending Wave 4 classifiers for socket transport and custom-provider config-rejection with specific metrics (~366 events, 13 IDs).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants