fix: pre-CEF single-instance mutex guard on Windows + provider retry for 502s #1723

Merged
senamakel merged 3 commits into tinyhumansai:main from YellowSnnowmann:fix/cef-init-race-windows-and-provider-retry
May 15, 2026

@YellowSnnowmann YellowSnnowmann commented May 14, 2026

Summary

  • Windows fatal panic fix: adds a named Win32 mutex guard at the top of run() so secondary instances exit before cef::initialize() is ever called — eliminates Sentry OPENHUMAN-TAURI-A (598 fatal panics, Windows-only)
  • Provider 502 retry fix: wraps the raw backend provider in ReliableProvider inside create_intelligent_routing_provider so transient 502/503/504 errors are retried instead of surfacing as fatal agent.run_single failures
  • No macOS / Linux behaviour changes; the CEF guard is #[cfg(windows)] only

Problem

1. OPENHUMAN-TAURI-A — Windows CEF init race (598 events)

tauri_plugin_single_instance detects duplicate launches inside its .setup() hook. But .setup() runs after Builder::build(), which calls CefRuntime::init → cef::initialize(). When a second instance launches while the primary is running, cef::initialize() returns 0 (primary holds the CEF user-data-dir cache lock). The vendored runtime then hits assert_eq!(result, 1) → fatal panic:

assertion `left == right` failed
  left: 0
 right: 1

The macOS path is protected by cef_preflight::check_default_cache() which inspects Chromium's SingletonLock symlink before the builder. Windows had no equivalent (that module uses nix and Unix symlinks). The tauri_plugin_single_instance comment in Cargo.toml claimed the plugin fires before builder work — it doesn't; it fires in setup().

Sentry: https://tinyhumans.sentry.io/issues/7458830272/

2. Agent 502s surfacing as fatal

create_intelligent_routing_provider passed a raw OpenAiCompatibleProvider as the remote arm of IntelligentRoutingProvider — no retry wrapper. A single transient 502 from the backend propagated directly to run_single and logged [observability] agent.run_single failed: OpenHuman API error (502 Bad Gateway): error code: 502. The ReliableProvider retry layer (used by every other provider path) was bypassed entirely.

Solution

1. Windows pre-build mutex guard (app/src-tauri/src/lib.rs)

At the very top of run(), before any CEF or Tauri builder work:

#[cfg(windows)]
let _cef_init_mutex_guard = {
    // CreateMutexW("com.openhuman.app-cef-init")
    // ERROR_ALREADY_EXISTS → std::process::exit(0)
    // Primary → hold OwnedMutex (RAII) for lifetime of run()
};
  • Mutex name -cef-init is distinct from the plugin's -sim mutex — no interference with WM_COPYDATA forwarding for the fully-started case
  • Added Win32_System_Threading feature to windows-sys in Cargo.toml
  • Pattern mirrors macOS cef_preflight::check_default_cache() exactly
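
For readers who want to see the shape of the guard, here is a minimal sketch of the named-mutex check described above. It is not the code from this PR: the PR uses windows-sys, while this sketch declares the three kernel32 functions by hand so it has no crate dependency. The names `Launch`, `classify` and `win::acquire` are hypothetical, and the decision step is split out as a pure function so it can be exercised on any platform.

```rust
/// Win32 error code reported when a named object already existed.
const ERROR_ALREADY_EXISTS: u32 = 183;

/// Outcome of trying to claim the startup mutex (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Launch {
    /// This process created the mutex and may continue into CEF init.
    Primary,
    /// Another process already owns the name; exit before touching CEF.
    Secondary,
    /// The mutex could not be created at all; carries the OS error code.
    Failed(u32),
}

/// Pure decision step, kept apart from the FFI so it is testable anywhere.
pub fn classify(handle_is_null: bool, last_error: u32) -> Launch {
    if handle_is_null {
        Launch::Failed(last_error)
    } else if last_error == ERROR_ALREADY_EXISTS {
        Launch::Secondary
    } else {
        Launch::Primary
    }
}

#[cfg(windows)]
pub mod win {
    use super::{classify, Launch};
    use std::ffi::{c_void, OsStr};
    use std::os::windows::ffi::OsStrExt;
    use std::ptr;

    // Hand-written kernel32 declarations (assumption: used here instead of
    // windows-sys). On edition 2024 this block must be `unsafe extern`.
    #[link(name = "kernel32")]
    extern "system" {
        fn CreateMutexW(attrs: *const c_void, initial_owner: i32, name: *const u16) -> *mut c_void;
        fn GetLastError() -> u32;
        fn CloseHandle(handle: *mut c_void) -> i32;
    }

    /// RAII wrapper: the named mutex stays alive until this value is dropped.
    pub struct OwnedMutex(*mut c_void);

    impl Drop for OwnedMutex {
        fn drop(&mut self) {
            unsafe { CloseHandle(self.0) };
        }
    }

    /// Creates (or opens) the named mutex and reports which role we have.
    pub fn acquire(name: &str) -> (Launch, Option<OwnedMutex>) {
        // Win32 wide-string APIs expect NUL-terminated UTF-16.
        let wide: Vec<u16> = OsStr::new(name).encode_wide().chain(Some(0)).collect();
        let handle = unsafe { CreateMutexW(ptr::null(), 0, wide.as_ptr()) };
        // Read the error immediately, before any other call can overwrite it.
        let last_error = unsafe { GetLastError() };
        let launch = classify(handle.is_null(), last_error);
        let guard = if handle.is_null() { None } else { Some(OwnedMutex(handle)) };
        (launch, guard)
    }
}
```

In a `run()` like the one described here, the caller would invoke `win::acquire("com.openhuman.app-cef-init")`, drop the guard and call `std::process::exit(0)` on `Launch::Secondary`, and otherwise bind the guard to a local so it lives for the rest of `run()`. Note that `std::process::exit` does not run destructors, so a secondary instance should drop its guard explicitly first.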

2. ReliableProvider wrap (src/openhuman/providers/ops.rs)

// Before: raw provider → IntelligentRoutingProvider (no retries)
// After:  raw provider → ReliableProvider → IntelligentRoutingProvider
let reliable_remote = ReliableProvider::new(
    vec![(INFERENCE_BACKEND_ID.to_string(), raw_remote)],
    config.reliability.provider_retries,   // default: 2
    config.reliability.provider_backoff_ms, // default: 500ms
).with_model_fallbacks(config.reliability.model_fallbacks.clone());
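
To illustrate what a retry layer of this kind does, here is a small synchronous sketch. It is not ReliableProvider's implementation: `HttpError`, `is_transient` and `call_with_retries` are hypothetical names, and doubling the delay after each failed attempt is an assumption made for the example, since the PR only names a retry count and a base backoff.

```rust
use std::{thread, time::Duration};

/// Hypothetical error type standing in for a provider failure.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HttpError {
    pub status: u16,
}

/// Gateway-style failures are treated as transient; anything else is not.
pub fn is_transient(err: &HttpError) -> bool {
    matches!(err.status, 502 | 503 | 504)
}

/// Calls `op` once, then up to `retries` more times on transient errors.
/// Sleeps `backoff_ms` before the first retry and doubles the delay after
/// each failed attempt (the doubling is an assumption of this sketch).
pub fn call_with_retries<T>(
    retries: u32,
    backoff_ms: u64,
    mut op: impl FnMut() -> Result<T, HttpError>,
) -> Result<T, HttpError> {
    let mut delay = backoff_ms;
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < retries && is_transient(&err) => {
                attempt += 1;
                thread::sleep(Duration::from_millis(delay));
                delay = delay.saturating_mul(2);
            }
            // Non-transient error, or the retry budget is spent.
            Err(err) => return Err(err),
        }
    }
}
```

With `retries = 2`, a call that fails twice with 502 and then succeeds returns the successful value after three attempts, while a 400 is returned immediately without retrying.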

Submission Checklist

  • N/A: Tests added — the Windows mutex guard is #[cfg(windows)] platform code with no testable surface on macOS CI; the provider retry path is covered by existing ReliableProvider tests which already exercise 502 retry behaviour
  • N/A: Diff coverage ≥ 80% — both changes are thin wiring/guard code; underlying logic (ReliableProvider, Win32 mutex) is already tested
  • N/A: Coverage matrix updated — no new feature rows; bug fixes to existing provider and startup paths
  • N/A: Feature IDs — no matrix feature rows affected
  • N/A: No new external network dependencies introduced
  • N/A: Manual smoke checklist — not a release-cut surface change
  • N/A: Linked issue — Sentry issue referenced in Problem section above (no GitHub issue number)

Impact

  • Windows: secondary launches now exit cleanly before CEF is initialised; primary experience unchanged
  • All platforms: transient 502s from the OpenHuman backend inference API are retried up to 2× before surfacing as errors
  • No migration, no schema change, no API surface change

Related


AI Authored PR Metadata

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/cef-init-race-windows-and-provider-retry
  • Commit SHA: PLACEHOLDER

Validation Run

  • pnpm --filter openhuman-app format:check — passed (pre-push hook)
  • pnpm typecheck — N/A: no TS changes
  • Focused tests: N/A — Windows-only platform code, no Rust unit tests for mutex guard
  • Rust fmt/check (if changed): cargo fmt applied by pre-push hook; cargo check passed
  • Tauri fmt/check (if changed): passed

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: secondary Windows launches exit before CEF init; 502s retried by provider layer
  • User-visible effect: no more fatal crash dialog on double-launch (Windows); fewer agent turn failures on transient backend outages

Parity Contract

  • Legacy behavior preserved: macOS/Linux startup unchanged; ReliableProvider config (retries/backoff) unchanged
  • Guard/fallback/dispatch parity checks: IntelligentRoutingProvider remote arm now goes through same retry layer as create_resilient_provider_with_options

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none
  • Canonical PR: this PR
  • Resolution: N/A

Summary by CodeRabbit

  • New Features
    • Single-instance protection on Windows to ensure only one app instance runs; duplicate launches exit early to avoid conflicts and improve stability.
    • Backend provider now includes a reliability layer with automatic retries, backoff, and model fallbacks for more robust routing and fewer transient failures.
    • Improved startup logging around instance and provider initialization to aid diagnostics.

@YellowSnnowmann YellowSnnowmann requested a review from a team May 14, 2026 09:14

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b82d6dd7-1249-4285-b433-3b2e768d8a64

📥 Commits

Reviewing files that changed from the base of the PR and between 5c68f15 and ff2457c.

📒 Files selected for processing (1)
  • app/src-tauri/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/src-tauri/Cargo.toml

📝 Walkthrough

Adds a Windows pre-CEF named mutex to enforce single-instance behavior before CEF initializes, and wraps the OpenHuman backend provider with a ReliableProvider configured from runtime reliability settings before routing.

Changes

Infrastructure and Resilience Improvements

| Layer / File(s) | Summary |
|---|---|
| Windows pre-CEF single-instance guard (app/src-tauri/Cargo.toml, app/src-tauri/src/lib.rs) | Adds Win32_System_Threading feature and installs a named Win32 mutex (com.openhuman.app-cef-init) in run() before CEF; secondary instances detect existing mutex, close handle, log, and exit, while the primary retains the handle via RAII. |
| Provider reliability wrapping (src/openhuman/providers/ops.rs) | Constructs the backend as raw_backend, wraps it in reliable::ReliableProvider using config.reliability (retries, backoff, model_fallbacks), and uses the reliable wrapper for routing and RouterProvider entries. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • senamakel

Poem

🐰 I found a mutex, snug and neat,
Before CEF wakes from its sleep.
I guard the first, the rest step back,
Retries hum steady on the track.
Hop, code, and ship—this rabbit’s pleased!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main changes: a Windows pre-CEF single-instance mutex guard and provider retry logic for 5xx errors, matching the PR's core objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/providers/ops.rs`:
- Around line 368-380: Add verbose debug diagnostics around the ReliableProvider
wrapper initialization: log a grep-friendly prefix (e.g. "reliable:init") and
include INFERENCE_BACKEND_ID, the chosen retries and backoff
(config.reliability.provider_retries, provider_backoff_ms), and the
model_fallbacks value when constructing reliable::ReliableProvider in the block
that creates reliable_remote (after calling create_backend_inference_provider
and before/after ReliableProvider::new().with_model_fallbacks). Use the
project's tracing/log facility (tracing::debug! or log::debug!) at debug/trace
level so retry/backoff configuration and the fact that the provider was wrapped
are recorded for diagnostics.
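
As a rough illustration of the diagnostic line this comment asks for, the sketch below builds the message as a plain string. The function name `reliable_init_line`, the `[reliable:init]` bracket format, and the shape assumed for `model_fallbacks` (a list of model name and fallback list pairs) are all assumptions for the example, not taken from the PR.

```rust
/// Builds a grep-friendly line describing how the reliability wrapper was
/// configured. The `model_fallbacks` shape is assumed for this sketch.
pub fn reliable_init_line(
    backend_id: &str,
    retries: u32,
    backoff_ms: u64,
    model_fallbacks: &[(String, Vec<String>)],
) -> String {
    format!(
        "[reliable:init] backend={backend_id} retries={retries} \
         backoff_ms={backoff_ms} model_fallbacks={model_fallbacks:?}"
    )
}
```

In the real code the resulting text would presumably be emitted through the project's logging facility at debug level, as the comment suggests, rather than returned as a string.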

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6d00c1aa-4e6c-47bd-a29e-fa7121e2a6a1

📥 Commits

Reviewing files that changed from the base of the PR and between 2672706 and 771aac7.

📒 Files selected for processing (3)
  • app/src-tauri/Cargo.toml
  • app/src-tauri/src/lib.rs
  • src/openhuman/providers/ops.rs

Comment thread src/openhuman/providers/ops.rs Outdated
…for 502s

Two independent production fixes:

1. Windows CEF init race (Sentry OPENHUMAN-TAURI-A, 598 events):
   `tauri_plugin_single_instance` detects duplicate launches inside
   `.setup()`, which runs AFTER `Builder::build()` triggers
   `CefRuntime::init` → `cef::initialize()`. On a second launch,
   `cef::initialize()` returns 0 (primary holds the CEF cache lock)
   and the vendored runtime asserts `result == 1`, panicking with
   `assertion left == right failed  left: 0  right: 1` (fatal,
   Windows-only). Added a `#[cfg(windows)]` pre-build named Win32
   mutex guard (`com.openhuman.app-cef-init`) at the top of `run()`,
   mirroring the macOS `cef_preflight::check_default_cache()` pattern.
   Secondary instances now exit cleanly before touching CEF. Added
   `Win32_System_Threading` feature to `windows-sys` accordingly.

2. Agent 502 surfacing as fatal (Sentry agent.run_single failed):
   `create_intelligent_routing_provider` wrapped the backend in a raw
   `OpenAiCompatibleProvider` with no retry logic. A single transient
   502 from the backend bypassed `ReliableProvider` entirely and
   propagated as a fatal error to `run_single`. Now wraps the raw
   provider in `ReliableProvider` (same `reliability.provider_retries`
   / `provider_backoff_ms` config as all other provider paths).
@YellowSnnowmann YellowSnnowmann force-pushed the fix/cef-init-race-windows-and-provider-retry branch from 771aac7 to 5c68f15 Compare May 14, 2026 09:23
coderabbitai Bot previously approved these changes May 14, 2026
…sys 0.59

CreateMutexW's SECURITY_ATTRIBUTES parameter is individually gated behind
the Win32_Security feature in windows-sys 0.59 in addition to the module-level
Win32_System_Threading gate. Without it the Windows E2E build fails with
"no `CreateMutexW` in `Win32::System::Threading`".