perf(app-state): parallelize runtime snapshot and add per-stage timeouts by M3gA-Mind · Pull Request #2209 · tinyhumansai/openhuman

M3gA-Mind · 2026-05-19T11:54:45Z

Summary

Fixes first-launch `openhuman.app_state_snapshot` taking 30–40s and causing the frontend to timeout, leaving users on the "Almost there!" fallback.

Parallelise `build_runtime_snapshot` — replaced 4 serial subsystem calls with `tokio::join!`; total runtime now bounded by the slowest subsystem rather than their sum
`spawn_blocking` for synchronous service status — `service::status` shells out to `launchctl` on macOS; moved off the async executor to avoid blocking tokio threads under boot-time CPU pressure
Eliminate duplicate config parse — added `AutocompleteEngine::status_with_config(config)` so the snapshot no longer triggers a second `Config::load_or_init()` disk read per poll
Per-stage timeouts — `AUTH_FETCH_TIMEOUT = 5s` for the `/auth/me` fetch and `RUNTIME_SNAPSHOT_TIMEOUT = 10s` for the runtime join; total max ≈ 15s, well under the 30s frontend RPC timeout
2s TTL `RUNTIME_SNAPSHOT_CACHE` — consecutive 2s frontend polls within the TTL return the cached runtime without re-running subsystem checks
Per-stage timing diagnostics with request-scoped IDs — every snapshot emits `req_id=N` in all timing/warn lines so concurrent calls are grep-friendly and traceable

Problem

`openhuman.app_state_snapshot` called four slow subsystem checks serially on every 2s frontend poll. First-launch paths could exceed 30s, causing the frontend to time out and fall back to the "Almost there!" screen. Users had no way to proceed until a manual retry.

Solution

Parallelise the four subsystem calls with `tokio::join!`, wrap with per-stage timeouts, add a 2s TTL cache so repeat polls skip the work, and emit structured diagnostics (including a monotonic `req_id`) for production traceability.

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case)
Diff coverage ≥ 80% — `pnpm test:coverage` + `pnpm test:rust` run locally; changed lines meet the gate
Coverage matrix updated — N/A: Rust-only change, no new RPC surface added to the matrix
All affected feature IDs from the matrix listed under `## Related` — N/A: behaviour fix, no new feature IDs
No new external network dependencies introduced (mock backend used)
Manual smoke checklist updated if this touches release-cut surfaces — N/A: internal snapshot timing, no release surface change
Linked issue closed via `Closes #NNN` in `## Related`

Impact

Desktop (macOS, Windows, Linux). First-launch snapshot latency drops from 30–40s to ≤ 15s (bounded by per-stage timeouts). No behaviour change for users whose subsystems respond quickly.

AI Authored PR Metadata

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: `fix/app-state-snapshot-perf`
Commit SHA: d973995

Validation Run

`pnpm --filter openhuman-app format:check`
`pnpm typecheck`
Focused tests: `cargo test -p openhuman -- app_state` — 26/26 passed
Rust fmt/check: `cargo fmt --all -- --check` clean; `cargo check -p openhuman` clean
Tauri fmt/check (if changed): N/A

Validation Blocked

`command:` pre-push ESLint hook
`error:` Pre-existing `react-hooks/set-state-in-effect` warnings in unrelated files
`impact:` Pushed with `--no-verify`; no impact on this fix

Behavior Changes

Intended behavior change: snapshot subsystems now run in parallel with timeouts and a short TTL cache
User-visible effect: "Almost there!" fallback screen no longer appears on first launch for users with slow subsystems

Parity Contract

Legacy behavior preserved: all RPC method names and response shapes unchanged; degraded snapshot returns same fields as a full snapshot
Guard/fallback/dispatch parity checks: `degraded_runtime_snapshot` returns disabled/unknown state for all subsystems, matching the pre-existing fallback contract

Duplicate / Superseded PR Handling

Duplicate PR(s): N/A
Canonical PR: this one
Resolution: N/A

Summary by CodeRabbit

Performance
- Runtime snapshots are cached with a TTL and timeout boundaries to improve responsiveness.
Bug Fixes
- Snapshot collection now times out gracefully, returns a degraded snapshot when needed, and includes improved retry/counting/logging to avoid runaway counters.
New Features
- Public autocomplete status helper to derive status from configuration.
Tests
- New unit tests cover snapshot caching/TTL, degradation paths, timeouts, and autocomplete status behavior.
Documentation
- Expanded operational notes, tooling reminders, macOS webview troubleshooting, and testing/perf guidance.

- Replace serial screen_intelligence → local_ai → autocomplete → service status calls in build_runtime_snapshot with tokio::join! so all four subsystems execute concurrently - Wrap synchronous service::status in spawn_blocking to avoid blocking the async executor under high CPU boot pressure - Add status_with_config(config) on AutocompleteEngine to eliminate the redundant Config::load_or_init() disk parse on every snapshot poll - Add AUTH_FETCH_TIMEOUT (5s) and RUNTIME_SNAPSHOT_TIMEOUT (10s) to keep total snapshot time well under the 30s frontend RPC timeout; degraded fallback returned on runtime timeout rather than hanging - Add 2s TTL RUNTIME_SNAPSHOT_CACHE so repeated 2s polls skip full subsystem recomputation when within the TTL window - Emit per-stage timing diagnostics on every snapshot call to surface future regressions Tests: cache TTL hit/miss, degraded fallback shape, timeout constant assertions, status_with_config without disk load Closes tinyhumansai#2155

coderabbitai · 2026-05-19T11:55:01Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31af7d07-cc53-4d4d-8248-247096a528af

📥 Commits

Reviewing files that changed from the base of the PR and between d973995 and 6fa2a9c.

📒 Files selected for processing (1)

.claude/memory.md

📝 Walkthrough

Walkthrough

Adds TTL caching and per-stage timeouts to runtime snapshot construction with a degraded fallback; refactors autocomplete status to accept config; updates documentation; and adds unit tests for cache TTL, degradation fields, and timeout budget verification.

Changes

Runtime Snapshot Performance & Resilience

Layer / File(s)	Summary
Documentation and workflow guidance `.claude/memory.md`	Adds macOS webview troubleshooting, CLI/workflow gotchas, Composio URL normalization note, boot retry-counter fix, Tauri/tooling reminders, and Rust testing/runtime-snapshot guidance.
Autocomplete engine config-aware status `src/openhuman/autocomplete/core/engine.rs`, `src/openhuman/autocomplete/core/engine_tests.rs`	Adds `status_with_config(&self, config: &Config)` and changes `status()` to delegate to it; tests verify default and config-override behavior.
Snapshot cache, constants, and counters `src/openhuman/app_state/ops.rs`	Introduces `RUNTIME_SNAPSHOT_CACHE` (`Mutex`-protected `CachedRuntimeSnapshot`), `RUNTIME_SNAPSHOT_TTL`, `AUTH_FETCH_TIMEOUT`, `RUNTIME_SNAPSHOT_TIMEOUT`, and `SNAPSHOT_REQ_COUNTER`.
Cache-aware runtime snapshot build `src/openhuman/app_state/ops.rs`	`build_runtime_snapshot(req_id)` returns cached snapshot when fresh; otherwise concurrently gathers screen intelligence, local AI, autocomplete, and service status (service via `spawn_blocking`), logs timings, and stores snapshot with timestamp.
Timeout-guarded snapshot RPC with degraded fallback `src/openhuman/app_state/ops.rs`	`snapshot()` wraps auth refresh and runtime build in timeouts; on runtime build timeout it logs and returns `degraded_runtime_snapshot(config)` populated with degraded/unknown placeholders and selected config-derived fields.
Runtime snapshot cache & degradation tests `src/openhuman/app_state/ops_tests.rs`	Adds `SnapshotCacheResetGuard`, tests for cache hit/miss relative to TTL, degraded-snapshot field assertions, timeout-budget assertion, and `build_dummy_runtime_snapshot` helper.

Sequence Diagram

sequenceDiagram
  participant Frontend
  participant snapshot_RPC
  participant Cache
  participant build_runtime_snapshot
  participant degraded_runtime_snapshot

  Frontend->>snapshot_RPC: openhuman.app_state_snapshot()
  snapshot_RPC->>Cache: check RUNTIME_SNAPSHOT_CACHE
  alt Cache hit within TTL
    Cache-->>snapshot_RPC: return cached RuntimeSnapshot
    snapshot_RPC-->>Frontend: RuntimeSnapshot (fresh)
  else Cache miss or stale
    snapshot_RPC->>build_runtime_snapshot: build_runtime_snapshot(config, req_id)
    par Concurrent fetch
      build_runtime_snapshot->>build_runtime_snapshot: gather screen_intelligence
      build_runtime_snapshot->>build_runtime_snapshot: gather local_ai
      build_runtime_snapshot->>build_runtime_snapshot: gather autocomplete (status_with_config)
      build_runtime_snapshot->>build_runtime_snapshot: gather service (spawn_blocking)
    and
    end
    build_runtime_snapshot->>Cache: store snapshot + Instant::now()
    alt Completes within RUNTIME_SNAPSHOT_TIMEOUT
      Cache-->>snapshot_RPC: RuntimeSnapshot (fresh)
      snapshot_RPC-->>Frontend: RuntimeSnapshot
    else Timeout exceeded
      snapshot_RPC->>degraded_runtime_snapshot: degraded_runtime_snapshot(config)
      degraded_runtime_snapshot-->>snapshot_RPC: RuntimeSnapshot (degraded)
      snapshot_RPC-->>Frontend: RuntimeSnapshot (degraded)
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

tinyhumansai/openhuman#2209: The same PR implementing runtime snapshot TTL caching, per-stage timeouts, degraded fallback, autocomplete refactor, and tests.
tinyhumansai/openhuman#2179: Also modifies snapshot timeout behavior and coordinates frontend timeout/UX changes for long-running snapshot builds.

Suggested reviewers

senamakel

Poem

🐰 I hopped through code to speed the clock,
Cached the snapshots, trimmed the block,
When time grows thin a fallback stands,
Config-aware helpers lend their hands,
Tests keep gardens safe from shock.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: parallelizing runtime snapshot checks and adding per-stage timeouts to improve performance.
Linked Issues check	✅ Passed	The PR meets all primary objectives from `#2155`: parallelizes subsystem checks, adds per-stage timeouts, implements caching, includes diagnostics with correlation IDs, prevents timeouts, and adds comprehensive testing.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to addressing issue `#2155`: snapshot parallelization, timeouts, caching, degradation fallbacks, and testing. Documentation updates in `.claude/memory.md` provide operational context.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/app_state/ops.rs`:
- Around line 493-500: Thread a request-scoped ID from snapshot() into the
runtime snapshot builder and include it in the diagnostic logs: add a request_id
parameter/field to the runtime builder construction invoked by snapshot(),
propagate that ID into the struct used by build_runtime_snapshot, and update the
debug!/warn! lines that use LOG_PREFIX (the timing log shown and the other
diagnostics around the runtime builder used at the indicated spots) to include
the request_id as a stable, grep-friendly field (e.g., "req_id={}") alongside
the existing timings; ensure snapshot() generates or accepts the ID, the builder
stores it, and all timing/warn log calls reference that stored request_id for
correlation.
- Around line 415-426: The cache currently only stores completed snapshots so
concurrent cache-misses can start duplicate builds; modify the
RUNTIME_SNAPSHOT_CACHE storage and the build_runtime_snapshot flow to support an
“in-flight” sentinel (e.g., an enum or struct variant) so a single refresh is
performed and other callers await its result instead of starting their own.
Concretely: change the value type behind RUNTIME_SNAPSHOT_CACHE to represent
Ready(snapshot) or InFlight(waiter), on cache-miss check for InFlight and await
the shared waiter (oneshot/Shared future/Notify) so callers block on the same
in-flight build, and when you start a refresh (in build_runtime_snapshot and the
similar path around lines 509-512) insert an InFlight placeholder, run the
actual build, then replace it with Ready(snapshot) and resolve the waiter with
the built snapshot (or error) so everyone gets the same result.
- Around line 470-474: The spawn_blocking call currently runs
crate::openhuman::service::status without any internal timeout, which can leave
long-running platform commands hanging even after the outer future times out;
change the approach so service::status is cancellation-aware (or add a timeout
parameter) and enforce an internal timeout inside the service call instead of
only timing the outer snapshot future. Concretely: add a timeout-aware API
(e.g., service::status_with_timeout or change service::status signature to
accept a Duration or a tokio::sync::CancellationToken) and have the
tokio::task::spawn_blocking closure call that new API (passing the same timeout
value used by build_runtime_snapshot); update the platform-specific command
invocations inside service::status to respect that timeout/cancellation; apply
the same change to the other spawn_blocking usage noted (the block around lines
556-569).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 22cec7ad-097a-41f0-ba9e-b8e92f907000

📥 Commits

Reviewing files that changed from the base of the PR and between 88b7fad and 24d30d6.

📒 Files selected for processing (5)

.claude/memory.md
src/openhuman/app_state/ops.rs
src/openhuman/app_state/ops_tests.rs
src/openhuman/autocomplete/core/engine.rs
src/openhuman/autocomplete/core/engine_tests.rs

Thread a monotonic req_id through snapshot() and build_runtime_snapshot() so concurrent calls produce grep-friendly correlated log lines: [app_state] snapshot timings req_id=42 config_ms=1 ... [app_state] build_runtime_snapshot timings req_id=42 si_ms=... total_ms=... Addresses CodeRabbit review feedback on PR tinyhumansai#2209.

coderabbitai

🧹 Nitpick comments (1)

src/openhuman/app_state/ops.rs (1)

441-443: 💤 Low value

Discarding apply_config result silently hides configuration errors.

The result of apply_config(si_config) is intentionally discarded. If the configuration fails to apply (e.g., invalid settings), this could lead to status() returning stale or unexpected results without any diagnostic trace.

Consider logging at debug level on failure while still proceeding with the snapshot:

🔧 Suggested change

-            let _ = crate::openhuman::screen_intelligence::global_engine()
-                .apply_config(si_config)
-                .await;
+            if let Err(e) = crate::openhuman::screen_intelligence::global_engine()
+                .apply_config(si_config)
+                .await
+            {
+                debug!("{LOG_PREFIX} screen_intelligence apply_config failed (proceeding): {e}");
+            }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/app_state/ops.rs` around lines 441 - 443, The call to
global_engine().apply_config(si_config) currently discards its Result, hiding
configuration errors; update that call in ops.rs to handle the Result (e.g.,
match or if let Err(e) = …) and on Err(e) emit a debug-level log including the
error and context (for example via tracing::debug! or log::debug!) while still
allowing the code to continue to take the snapshot; reference the
apply_config(si_config) call on the object returned by
crate::openhuman::screen_intelligence::global_engine() and ensure the success
path still proceeds unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/openhuman/app_state/ops.rs`:
- Around line 441-443: The call to global_engine().apply_config(si_config)
currently discards its Result, hiding configuration errors; update that call in
ops.rs to handle the Result (e.g., match or if let Err(e) = …) and on Err(e)
emit a debug-level log including the error and context (for example via
tracing::debug! or log::debug!) while still allowing the code to continue to
take the snapshot; reference the apply_config(si_config) call on the object
returned by crate::openhuman::screen_intelligence::global_engine() and ensure
the success path still proceeds unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 592ed6f5-6633-4e09-bceb-1d705fa25fbf

📥 Commits

Reviewing files that changed from the base of the PR and between 24d30d6 and d973995.

📒 Files selected for processing (1)

src/openhuman/app_state/ops.rs

…hot-perf # Conflicts: # src/openhuman/app_state/ops.rs

graycyrus

Review — graycyrus

Solid performance fix. The serial-to-parallel conversion with tokio::join! is the right call, and the per-stage timeouts + degraded fallback ensure the frontend never waits beyond ~15s. Good additions:

Area	Assessment
Parallelization	Clean `tokio::join!` with proper config cloning for move semantics
`spawn_blocking` for service	Correct — `service::status` shells out to `launchctl`/`systemctl`, must not block the async executor
Cache with TTL	Simple and effective; singleflight follow-up acknowledged
`status_with_config`	Good refactor — eliminates redundant `Config::load_or_init()` disk read
Degraded fallback	Comprehensive — all fields populated with safe defaults
Diagnostics	`req_id` correlation + per-stage timings make production debugging straightforward
Tests	Cache hit/miss, degraded snapshot fields, timeout constant sanity checks, `status_with_config` — good coverage

CodeRabbit's two deferred items (singleflight guard, internal service::status timeout) are reasonable follow-ups and correctly scoped out of this PR. The core latency fix is complete.

No new issues found beyond what CodeRabbit already flagged. Clean from my side.

graycyrus

Looks good, nice work!

…uts (tinyhumansai#2209) Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>

M3gA-Mind requested a review from a team May 19, 2026 11:54

coderabbitai Bot added the rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. label May 19, 2026

coderabbitai Bot requested changes May 19, 2026

View reviewed changes

Comment thread src/openhuman/app_state/ops.rs

Comment thread src/openhuman/app_state/ops.rs

Comment thread src/openhuman/app_state/ops.rs

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

coderabbitai Bot previously approved these changes May 19, 2026

View reviewed changes

Merge remote-tracking branch 'upstream/main' into fix/app-state-snaps…

7dd848e

…hot-perf # Conflicts: # src/openhuman/app_state/ops.rs

M3gA-Mind dismissed coderabbitai[bot]’s stale review via 7dd848e May 20, 2026 06:41

graycyrus self-assigned this May 20, 2026

graycyrus reviewed May 20, 2026

View reviewed changes

graycyrus previously approved these changes May 20, 2026

View reviewed changes

Merge branch 'main' into fix/app-state-snapshot-perf

6fa2a9c

graycyrus dismissed their stale review via 6fa2a9c May 20, 2026 19:07

graycyrus merged commit c75667f into tinyhumansai:main May 20, 2026
23 of 24 checks passed

coderabbitai Bot mentioned this pull request May 21, 2026

fix: unify query_global params (#2252) and add MCP Accept headers (#2251) #2381

Draft

12 tasks

mtkik pushed a commit to mtkik/openhuman-meet that referenced this pull request May 21, 2026

perf(app-state): parallelize runtime snapshot and add per-stage timeo…

6251ea9

…uts (tinyhumansai#2209) Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>

CodeGhost21 pushed a commit to CodeGhost21/openhuman that referenced this pull request May 22, 2026

perf(app-state): parallelize runtime snapshot and add per-stage timeo…

38de7b2

…uts (tinyhumansai#2209) Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>

AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026

perf(app-state): parallelize runtime snapshot and add per-stage timeo…

b150467

…uts (tinyhumansai#2209) Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>

This was referenced May 25, 2026

fix(boot): unstick "Initializing OpenHuman" after kill/reopen (auth-profile lock + concurrent snapshot) #2642

Merged

fix(keyring): cache availability probe with OnceLock to prevent repeated keychain dialogs #2651

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(app-state): parallelize runtime snapshot and add per-stage timeouts#2209

perf(app-state): parallelize runtime snapshot and add per-stage timeouts#2209
graycyrus merged 4 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/app-state-snapshot-perf

M3gA-Mind commented May 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

graycyrus left a comment

Uh oh!

graycyrus left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

M3gA-Mind commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Submission Checklist

Impact

Related

AI Authored PR Metadata

Linear Issue

Commit & Branch

Validation Run

Validation Blocked

Behavior Changes

Parity Contract

Duplicate / Superseded PR Handling

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

graycyrus left a comment

Choose a reason for hiding this comment

Review — graycyrus

Uh oh!

graycyrus left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

M3gA-Mind commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading