fix(observability,database): silence expected provider/channel errors and add SQLite busy timeout for WhatsApp store#2107
Conversation
…and add test for expected error classification
…eam bad request errors with tests
…ent locking issues
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds detection and classification for a custom_openai upstream 400 error shape across provider request paths and observability, suppresses Sentry for that known user-state error, switches supervised listener error reporting to the centralized reporter with metadata, and configures a 15s SQLite busy timeout. ChangesCustom OpenAI Upstream Error Handling and Infrastructure
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/core/observability.rs`:
- Around line 329-331: The current global matcher if lower.contains("bad request
to upstream provider") && lower.contains("upstream_error") returns true too
broadly; tighten it so it only demotes the known custom_openai 400 envelope by
additionally asserting the envelope indicates the custom OpenAI provider and a
400 status (e.g., check provider == "custom_openai" or provider_name field and
http_status/status_code == 400) alongside the existing lower.contains(...)
checks; update the conditional that contains lower.contains(...) to include
these provider/status predicates so unrelated errors are not silenced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2b7d06fd-5d8d-4f17-b879-c1b53f801a73
📒 Files selected for processing (5)
src/core/observability.rssrc/openhuman/channels/runtime/supervision.rssrc/openhuman/inference/provider/compatible.rssrc/openhuman/inference/provider/ops.rssrc/openhuman/whatsapp_data/store.rs
Per CodeRabbit feedback on PR tinyhumansai#2107: the previous matcher demoted any error containing both "bad request to upstream provider" and "upstream_error" — too broad. Anchor it to the canonical envelope prefix "custom_openai api error (400" so it can't silence unrelated errors that happen to mention either substring (e.g. a future provider whose wire shape reuses one of them). Adds a regression test confirming an error that contains both inner substrings WITHOUT the custom_openai 400 anchor is no longer silenced.
graycyrus
left a comment
There was a problem hiding this comment.
Review — graycyrus
Walkthrough: This PR systematically reduces Sentry noise by (1) classifying custom_openai upstream 400 errors as expected provider/user-state across all five OpenAI-compatible request paths, (2) routing channel supervision errors through the expected-error classifier so transient network disconnects are demoted from error to info, and (3) adding SQLite busy_timeout to the WhatsApp data store to handle lock contention gracefully. Well-structured change that follows existing patterns closely.
Overall: Clean. The three-condition anchor in observability.rs is well-scoped, the provider path coverage is complete, and the test suite covers both positive and negative classification cases. The busy_timeout addition aligns with memory/tree/store.rs. Nothing actionable from this reviewer — CodeRabbit's one finding (overly-broad matcher) was addressed with the custom_openai api error (400 prefix anchor and a regression test.
| File | Change | Notes |
|---|---|---|
src/core/observability.rs |
New pattern matcher + 2 tests | Three-condition anchor prevents false positives; negative test pins safety |
src/openhuman/channels/runtime/supervision.rs |
tracing::error! → report_error_or_expected() |
Uses {e:#} for full error chain — good for classifier visibility |
src/openhuman/inference/provider/compatible.rs |
5× upstream-400 check blocks | Mirrors existing budget_exhausted pattern across all paths |
src/openhuman/inference/provider/ops.rs |
is_ + log_ helpers, api_error() integration |
Consistent with budget_exhausted helpers |
src/openhuman/whatsapp_data/store.rs |
busy_timeout(15s) |
Matches memory/tree/store.rs pattern |
# Conflicts: # src/openhuman/inference/provider/ops.rs
# Conflicts: # src/openhuman/inference/provider/compatible.rs # src/openhuman/inference/provider/ops.rs
# Conflicts: # src/openhuman/whatsapp_data/store.rs
… and add SQLite busy timeout for WhatsApp store (tinyhumansai#2107) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
… and add SQLite busy timeout for WhatsApp store (tinyhumansai#2107) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
… and add SQLite busy timeout for WhatsApp store (tinyhumansai#2107) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
Summary
custom_openaiupstream400errors (upstream_error/ “Bad request to upstream provider”) from Sentry error events to structured info logs.responses_api,chat_completions,native_chat,streaming_chat,stream_chat) to reduce duplicatellm_providernoise.400failures are also treated as provider user-state.report_error_or_expected(...)so transient network disconnects can be classified/silenced instead of always triggering alerts.busy_timeoutsupport for WhatsApp data store connections to reducedatabase is lockedfailures under concurrent ingestion/read activity.Problem
Sentry was receiving high-volume, low-actionability noise from:
custom_openaiupstream400provider/user-state failures,WhatsApp ingestion/read concurrency could trigger immediate
SQLITE_BUSY(database is locked) failures because SQLite connections did not wait before failing.Solution
custom_openaiupstream400error envelopes and routed them to info-level observability instead of Sentry error capture.expected_error_kindmatching so wrapped higher-level runtime/agent variants still classify asProviderUserState.report_error_or_expected(...)with channel tagging, preserving actionable signal while suppressing expected transient transport failures.rusqlitebusy_timeout(15s) in the WhatsApp data store to align with other SQLite store patterns and reduce lock-related ingestion flakiness.Submission Checklist
diff-cover) meet the gate enforced by.github/workflows/coverage.yml. Runpnpm test:coverageandpnpm test:rustlocally; PRs below 80% on changed lines will not merge.docs/TEST-COVERAGE-MATRIX.mdreflect this change (orN/A: behaviour-only change)## Relateddocs/RELEASE-MANUAL-SMOKE.md)Closes #NNNin the## RelatedsectionImpact
Runtime/platform impact:
Performance impact:
Security/compatibility impact:
Related
Summary by CodeRabbit
Improvements
Tests