fix(observability): demote transient OpenAI embeddings 429s to expected and reduce Sentry noise#2294
Conversation
…o handle transient failures gracefully
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughUpdates OpenAI embedding non-2xx error formatting to include parentheses around HTTP status and clarifies transient reporting via ChangesEmbedding API transient error handling
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
graycyrus
left a comment
There was a problem hiding this comment.
Nice, focused fix — the parenthesized format is clearly needed for the is_transient_upstream_http_message classifier (line 240 in observability.rs: lower.contains(&format!("api error ({code}"))) and the test is a welcome addition.
One thing worth tightening up on the test — see inline comment.
graycyrus
left a comment
There was a problem hiding this comment.
Re-review — all prior changes addressed.
The new commit (6acfcd02) adds the api error (429 substring assertion I flagged in the prior review. The test now has three layers of defense:
- Canonical
(429 Too Many Requests)shape check - Classifier-arm substring guard (
api error (429) - Broad
is_transient_message_failureclassifier pass
This is a clean, focused fix — format string tweak + solid regression test. No new issues found.
| File | Change |
|---|---|
openai.rs |
Format {status} → ({status}) in error string + updated comment |
openai_tests.rs |
New embed_429_uses_canonical_transient_format test with three assertions |
graycyrus
left a comment
There was a problem hiding this comment.
Looks good, nice work!
M3gA-Mind
left a comment
There was a problem hiding this comment.
Looks good, nice work!
…ed and reduce Sentry noise (tinyhumansai#2294)
…ed and reduce Sentry noise (tinyhumansai#2294)
…ed and reduce Sentry noise (tinyhumansai#2294)
Summary
Updated OpenAI embeddings error reporting to route non-2xx failures through report_error_or_expected instead of always emitting hard error events.
Standardized embedding HTTP error text to canonical format: Embedding API error (): .
Ensured transient upstream statuses (especially 429 Too Many Requests) match observability classifiers and are demoted to warning breadcrumbs.
Added a focused regression test for 429 formatting/classification to prevent future Sentry-noise regressions.
Problem
OpenAI embeddings calls can return transient 429 rate-limit responses during normal load.
These errors are retried by memory-tree/background flows, but each attempt was still producing Sentry error events.
The previous message shape (Embedding API error : ...) did not consistently match the transient HTTP classifier expectations, increasing alert noise and masking actionable failures.
Solution
Switched embeddings non-2xx reporting to report_error_or_expected(...), allowing transient failures to be classified as expected noise.
Changed the emitted error string to canonical classifier-compatible shape: Embedding API error (): .
Kept failure tags (model, status, failure=non_2xx) for diagnostics while reducing noisy escalation.
Added embed_429_uses_canonical_transient_format test to assert:
Submission Checklist
If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.
Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by
.github/workflows/coverage.yml. Run pnpm test:coverage and pnpm test:rust locally; PRs below 80% on changed lines will not merge.Coverage matrix updated — added/removed/renamed feature rows in
docs/TEST-COVERAGE-MATRIX.mdreflect this change (or N/A: behaviour-only change)All affected feature IDs from the matrix are listed in the PR description under ## Related (N/A if no feature IDs are affected)
No new external network dependencies introduced (mock backend used per Testing Strategy)
Manual smoke checklist updated if this touches release-cut surfaces (
docs/RELEASE-MANUAL-SMOKE.md) (N/A if not release-cut)Linked issue closed via Closes #NNN in the ## Related section (N/A until issue is linked)
Impact
Runtime/platform impact: Rust core observability path only (desktop app behavior unchanged for users except reduced false-positive error noise).
Performance: negligible runtime overhead change; likely reduced observability/event volume under transient rate limiting.
Security/migration/compatibility: no migration or API compatibility changes; no new dependencies introduced.
Related
Summary by CodeRabbit
Bug Fixes
Tests