
fix(observability,database): silence expected provider/channel errors and add SQLite busy timeout for WhatsApp store#2107

Merged
senamakel merged 7 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/resolve-or-silent-sentry-bugs
May 20, 2026

Conversation

@YellowSnnowmann
Contributor

@YellowSnnowmann YellowSnnowmann commented May 18, 2026

Summary

  • Demoted expected custom_openai upstream 400 errors (upstream_error / “Bad request to upstream provider”) from Sentry error events to structured info logs.
  • Extended suppression across all OpenAI-compatible provider paths (responses_api, chat_completions, native_chat, streaming_chat, stream_chat) to reduce duplicate llm_provider noise.
  • Updated expected-error classification so wrapped agent/runtime variants of upstream 400 failures are also treated as provider user-state.
  • Switched channel supervision error handling to report_error_or_expected(...) so transient network disconnects can be classified/silenced instead of always triggering alerts.
  • Added SQLite busy_timeout support for WhatsApp data store connections to reduce database is locked failures under concurrent ingestion/read activity.
  • Added regression tests for expected-error classification and supervision transport-wrapped network failures.

Problem

  • Sentry was receiving high-volume, low-actionability noise from:

    • transient channel/network disconnects,
    • deterministic custom_openai upstream 400 provider/user-state failures,
    • duplicate reporting through both provider and higher-level wrappers.
  • WhatsApp ingestion/read concurrency could trigger immediate SQLITE_BUSY (database is locked) failures because SQLite connections did not wait before failing.

Solution

  • Added targeted classification for exact custom_openai upstream 400 error envelopes and routed them to info-level observability instead of Sentry error capture.
  • Applied the classifier consistently across all OpenAI-compatible provider request paths to avoid duplicate provider noise regardless of request mode.
  • Expanded expected_error_kind matching so wrapped higher-level runtime/agent variants still classify as ProviderUserState.
  • Updated supervision loop reporting to use report_error_or_expected(...) with channel tagging, preserving actionable signal while suppressing expected transient transport failures.
  • Configured rusqlite busy_timeout (15s) in the WhatsApp data store to align with other SQLite store patterns and reduce lock-related ingestion flakiness.
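The supervision change above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the names `report_error_or_expected` and `expected_error_kind` come from this PR, but their signatures, the `Routed` type, and the substring hints are assumptions made for the example.

```rust
/// Hypothetical classification of expected failures; the real type in
/// src/core/observability.rs may look different.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ExpectedErrorKind {
    NetworkUnreachable,
}

/// Where a report ends up. `Info` stands in for a structured info log,
/// `Error` for a Sentry error event. Both carry the channel tag.
#[derive(Debug, PartialEq, Eq)]
enum Routed {
    Info { kind: ExpectedErrorKind, channel: String },
    Error { channel: String },
}

/// Classify a formatted error message. The hint strings are assumptions.
fn expected_error_kind(message: &str) -> Option<ExpectedErrorKind> {
    let lower = message.to_lowercase();
    const NETWORK_HINTS: [&str; 3] =
        ["network is unreachable", "connection reset", "dns error"];
    if NETWORK_HINTS.iter().any(|hint| lower.contains(*hint)) {
        return Some(ExpectedErrorKind::NetworkUnreachable);
    }
    None
}

/// Demote expected failures to info; everything else stays an error.
fn report_error_or_expected(message: &str, channel: &str) -> Routed {
    match expected_error_kind(message) {
        Some(kind) => Routed::Info { kind, channel: channel.to_string() },
        None => Routed::Error { channel: channel.to_string() },
    }
}
```

With this shape, a transient disconnect in a channel listener is tagged and logged at info level, while an unrecognised failure on the same channel still raises an alert.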

Submission Checklist

If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by .github/workflows/coverage.yml. Run pnpm test:coverage and pnpm test:rust locally; PRs below 80% on changed lines will not merge.
  • Coverage matrix updated — added/removed/renamed feature rows in docs/TEST-COVERAGE-MATRIX.md reflect this change (or N/A: behaviour-only change)
  • All affected feature IDs from the matrix are listed in the PR description under ## Related
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces (docs/RELEASE-MANUAL-SMOKE.md)
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Runtime/platform impact:

    • Rust core behavior changes for observability and SQLite store handling.
    • Affects desktop flows using:
      • channel supervision,
      • provider inference,
      • WhatsApp data store operations.
  • Performance impact:

    • Reduced noisy Sentry reporting.
    • SQLite operations may wait up to timeout instead of failing immediately under contention.
  • Security/compatibility impact:

    • No new external dependencies.
    • No API contract changes.
    • Fully backward-compatible behavior with improved resilience and cleaner telemetry.
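To make the "wait up to timeout instead of failing immediately" trade-off concrete, here is a small std-only model of busy-timeout behaviour. It is a sketch under stated assumptions: the PR itself sets the timeout through rusqlite (which exposes `Connection::busy_timeout`), whereas `StoreError`, `with_busy_timeout`, and `contended_write` below are hypothetical stand-ins that only imitate the retry-until-deadline idea.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the "database is locked" condition.
#[derive(Debug, PartialEq, Eq)]
enum StoreError {
    Busy,
}

/// Retry `op` while it reports `Busy`, giving up once `timeout` has elapsed.
/// A zero timeout reproduces the fail-immediately behaviour.
fn with_busy_timeout<T>(
    timeout: Duration,
    mut op: impl FnMut() -> Result<T, StoreError>,
) -> Result<T, StoreError> {
    let deadline = Instant::now() + timeout;
    loop {
        match op() {
            // Still within the deadline: back off briefly and try again.
            Err(StoreError::Busy) if Instant::now() < deadline => {
                thread::sleep(Duration::from_millis(5));
            }
            // Success, or Busy after the deadline: hand the result back.
            other => return other,
        }
    }
}

/// Builds an operation that is "locked" for the first `busy_calls` attempts
/// and then succeeds, returning the attempt number.
fn contended_write(busy_calls: u32) -> impl FnMut() -> Result<u32, StoreError> {
    let mut attempts = 0;
    move || {
        attempts += 1;
        if attempts <= busy_calls {
            Err(StoreError::Busy)
        } else {
            Ok(attempts)
        }
    }
}
```

In this model a write that is blocked for two attempts fails at once with a zero timeout but succeeds on the third attempt when given a non-zero one, at the cost of the caller waiting during contention.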

Related

Summary by CodeRabbit

  • Improvements

    • Enhanced error classification to better distinguish user-related upstream failures from system errors.
    • Improved supervised channel error reporting for more reliable restart handling and clearer transport-related classification.
    • Added specialized handling and logging for a known upstream “bad request” pattern to avoid noisy alerts.
    • Configured DB connections to wait on locks to reduce immediate contention failures.
  • Tests

    • Added tests validating new error classifications and supervision message formatting.

Review Change Stack

@coderabbitai
Contributor

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bd34c804-4010-4f6c-b1f8-1d4e10dcf5b1

📥 Commits

Reviewing files that changed from the base of the PR and between 980184b and 5954fbe.

📒 Files selected for processing (1)
  • src/core/observability.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/core/observability.rs

Walkthrough

Adds detection and classification for a custom_openai upstream 400 error shape across provider request paths and observability, suppresses Sentry for that known user-state error, switches supervised listener error reporting to the centralized reporter with metadata, and configures a 15s SQLite busy timeout.

Changes

Custom OpenAI Upstream Error Handling and Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Detection and Logging Helpers: src/openhuman/inference/provider/ops.rs | New is_custom_openai_upstream_bad_request_http_400 predicate and log_custom_openai_upstream_bad_request_http_400 info-level event to tag this known non-2xx/provider_user_state case. |
| API Error Handling and Sentry Suppression: src/openhuman/inference/provider/ops.rs | api_error computes the custom-openai flag and routes matching 400 responses to the new logging helper instead of default Sentry reporting. |
| Observability Classification Extension: src/core/observability.rs | is_provider_user_state_message now recognizes the "custom_openai API error (400" envelope when it also contains the inner "bad request to upstream provider" and "upstream_error" substrings. Unit tests added for raw and wrapped payloads, plus regression negatives. |
| Supervised Error Reporting Integration: src/openhuman/channels/runtime/supervision.rs | Replaces the direct tracing error in spawn_supervised_listener with report_error_or_expected, formatting the error and adding channel metadata. A test asserts NetworkUnreachable classification for a Discord gateway failure string. |
| Provider Request Path Integration: src/openhuman/inference/provider/compatible.rs | Five paths (chat_via_responses, stream_native_chat, chat_with_system, chat, stream_chat_with_system) add else if branches that call the custom-openai logging helper for the identified 400 shape before generic failure reporting. |
| SQLite Connection Timeout Configuration: src/openhuman/whatsapp_data/store.rs | Introduces SQLITE_BUSY_TIMEOUT (15s) and sets busy_timeout on DB connections in open_conn so they wait on contention. |
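The branch ordering described for the provider paths can be illustrated with a short sketch. Only the predicate name is taken from this PR; its signature, the `Report` type, `is_budget_exhausted`, and `report_failure` are hypothetical, and the budget check in particular is an assumed placeholder for the existing pattern the new branch sits beside.

```rust
/// Hypothetical outcome of handling a non-2xx provider response.
#[derive(Debug, PartialEq, Eq)]
enum Report {
    /// Structured info log, tagged with a reason.
    InfoLog(&'static str),
    /// Default path: capture as a Sentry error.
    SentryError,
}

/// Assumed shape check: provider, status, and both inner substrings must match.
fn is_custom_openai_upstream_bad_request_http_400(
    provider: &str,
    status: u16,
    body: &str,
) -> bool {
    let lower = body.to_lowercase();
    provider == "custom_openai"
        && status == 400
        && lower.contains("bad request to upstream provider")
        && lower.contains("upstream_error")
}

/// Placeholder for the pre-existing expected-error check (assumption).
fn is_budget_exhausted(body: &str) -> bool {
    body.to_lowercase().contains("budget_exhausted")
}

/// Expected shapes are checked first; generic reporting is the fallback.
fn report_failure(provider: &str, status: u16, body: &str) -> Report {
    if is_budget_exhausted(body) {
        Report::InfoLog("budget_exhausted")
    } else if is_custom_openai_upstream_bad_request_http_400(provider, status, body) {
        Report::InfoLog("provider_user_state")
    } else {
        Report::SentryError
    }
}
```

Placing the new check in an `else if` ahead of the generic branch means each request path reports a matching 400 once, at info level, rather than falling through to error capture.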

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • graycyrus
  • senamakel

Poem

🐰 A tiny rabbit hops by the log,
Finds a 400 wrapped in a fog.
"It's user-state," the rabbit declares,
Sentry stays quiet — fewer stares.
Timeouts wait patiently, debugging at ease.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main changes: demoting expected provider/channel errors from Sentry noise and adding SQLite busy timeout for WhatsApp store, which align with the core objectives across multiple modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@YellowSnnowmann YellowSnnowmann marked this pull request as ready for review May 18, 2026 13:18
@YellowSnnowmann YellowSnnowmann requested a review from a team May 18, 2026 13:18
@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 18, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/core/observability.rs`:
- Around line 329-331: The current global matcher if lower.contains("bad request
to upstream provider") && lower.contains("upstream_error") returns true too
broadly; tighten it so it only demotes the known custom_openai 400 envelope by
additionally asserting the envelope indicates the custom OpenAI provider and a
400 status (e.g., check provider == "custom_openai" or provider_name field and
http_status/status_code == 400) alongside the existing lower.contains(...)
checks; update the conditional that contains lower.contains(...) to include
these provider/status predicates so unrelated errors are not silenced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2b7d06fd-5d8d-4f17-b879-c1b53f801a73

📥 Commits

Reviewing files that changed from the base of the PR and between 70fdedc and 980184b.

📒 Files selected for processing (5)
  • src/core/observability.rs
  • src/openhuman/channels/runtime/supervision.rs
  • src/openhuman/inference/provider/compatible.rs
  • src/openhuman/inference/provider/ops.rs
  • src/openhuman/whatsapp_data/store.rs

Comment thread src/core/observability.rs Outdated
Per CodeRabbit feedback on PR tinyhumansai#2107: the previous matcher demoted any
error containing both "bad request to upstream provider" and
"upstream_error" — too broad. Anchor it to the canonical envelope
prefix "custom_openai api error (400" so it can't silence unrelated
errors that happen to mention either substring (e.g. a future provider
whose wire shape reuses one of them).

Adds a regression test confirming an error that contains both inner
substrings WITHOUT the custom_openai 400 anchor is no longer silenced.
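A minimal sketch of the anchored matcher this commit message describes, assuming the check operates on a lowercased message string. The function name `is_custom_openai_upstream_400_message` is hypothetical, and the sample messages used with it are invented for illustration rather than copied from real logs.

```rust
/// Returns true only when all three conditions hold:
/// the canonical envelope prefix plus both inner substrings.
fn is_custom_openai_upstream_400_message(message: &str) -> bool {
    let lower = message.to_lowercase();
    // Anchor: without this prefix the message is never demoted.
    lower.contains("custom_openai api error (400")
        // Inner substrings from the upstream error body.
        && lower.contains("bad request to upstream provider")
        && lower.contains("upstream_error")
}
```

Because the check uses `contains` rather than `starts_with`, a message wrapped by a higher-level agent or runtime error still matches, while a different provider whose body happens to reuse both inner substrings does not.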
coderabbitai Bot previously approved these changes May 18, 2026
Contributor

@graycyrus graycyrus left a comment


Review — graycyrus

Walkthrough: This PR systematically reduces Sentry noise by (1) classifying custom_openai upstream 400 errors as expected provider/user-state across all five OpenAI-compatible request paths, (2) routing channel supervision errors through the expected-error classifier so transient network disconnects are demoted from error to info, and (3) adding SQLite busy_timeout to the WhatsApp data store to handle lock contention gracefully. Well-structured change that follows existing patterns closely.

Overall: Clean. The three-condition anchor in observability.rs is well-scoped, the provider path coverage is complete, and the test suite covers both positive and negative classification cases. The busy_timeout addition aligns with memory/tree/store.rs. Nothing actionable from this reviewer: CodeRabbit's one finding (an overly broad matcher) was addressed by anchoring on the "custom_openai api error (400" prefix and adding a regression test.

| File | Change | Notes |
| --- | --- | --- |
| src/core/observability.rs | New pattern matcher + 2 tests | Three-condition anchor prevents false positives; negative test pins safety |
| src/openhuman/channels/runtime/supervision.rs | tracing::error! → report_error_or_expected() | Uses {e:#} for full error chain; good for classifier visibility |
| src/openhuman/inference/provider/compatible.rs | 5× upstream-400 check blocks | Mirrors existing budget_exhausted pattern across all paths |
| src/openhuman/inference/provider/ops.rs | is_ + log_ helpers, api_error() integration | Consistent with budget_exhausted helpers |
| src/openhuman/whatsapp_data/store.rs | busy_timeout(15s) | Matches memory/tree/store.rs pattern |
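The note about the full error chain matters because a classifier that only sees the outermost message would miss the network cause underneath. The sketch below shows the idea with the standard library only; `render_chain`, `GatewayError`, and `TransportError` are hypothetical, and it is an assumption that the `{e:#}` format in the PR produces a comparable colon-separated chain (as anyhow-style errors do).

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical low-level transport failure.
#[derive(Debug)]
struct TransportError {
    detail: String,
}

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.detail)
    }
}

impl Error for TransportError {}

/// Hypothetical channel-level wrapper that hides the cause in `source()`.
#[derive(Debug)]
struct GatewayError {
    source: TransportError,
}

impl fmt::Display for GatewayError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "discord gateway listener failed")
    }
}

impl Error for GatewayError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Join the outer message and every underlying cause with ": ".
fn render_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut cause = err.source();
    while let Some(inner) = cause {
        out.push_str(": ");
        out.push_str(&inner.to_string());
        cause = inner.source();
    }
    out
}
```

Passing the rendered chain, rather than the plain `Display` output, to a substring-based classifier is what lets a wrapped transport failure be recognised as a network error.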

# Conflicts:
#	src/openhuman/inference/provider/ops.rs
senamakel added 2 commits May 19, 2026 19:47
# Conflicts:
#	src/openhuman/inference/provider/compatible.rs
#	src/openhuman/inference/provider/ops.rs
# Conflicts:
#	src/openhuman/whatsapp_data/store.rs
@senamakel senamakel merged commit a40272e into tinyhumansai:main May 20, 2026
26 checks passed
mtkik pushed a commit to mtkik/openhuman-meet that referenced this pull request May 21, 2026
… and add SQLite busy timeout for WhatsApp store (tinyhumansai#2107)

Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
CodeGhost21 pushed a commit to CodeGhost21/openhuman that referenced this pull request May 22, 2026
… and add SQLite busy timeout for WhatsApp store (tinyhumansai#2107)

Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026
… and add SQLite busy timeout for WhatsApp store (tinyhumansai#2107)

Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>

Labels

working A PR that is being worked on by the team.


3 participants