Skip to content

perf(app-state): parallelize runtime snapshot and add per-stage timeouts#2209

Merged
graycyrus merged 4 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/app-state-snapshot-perf
May 20, 2026
Merged

perf(app-state): parallelize runtime snapshot and add per-stage timeouts#2209
graycyrus merged 4 commits into
tinyhumansai:mainfrom
M3gA-Mind:fix/app-state-snapshot-perf

Conversation

@M3gA-Mind
Copy link
Copy Markdown
Contributor

@M3gA-Mind M3gA-Mind commented May 19, 2026

Summary

Fixes first-launch `openhuman.app_state_snapshot` taking 30–40s and causing the frontend to timeout, leaving users on the "Almost there!" fallback.

  • Parallelise `build_runtime_snapshot` — replaced 4 serial subsystem calls with `tokio::join!`; total runtime now bounded by the slowest subsystem rather than their sum
  • `spawn_blocking` for synchronous service status — `service::status` shells out to `launchctl` on macOS; moved off the async executor to avoid blocking tokio threads under boot-time CPU pressure
  • Eliminate duplicate config parse — added `AutocompleteEngine::status_with_config(config)` so the snapshot no longer triggers a second `Config::load_or_init()` disk read per poll
  • Per-stage timeouts — `AUTH_FETCH_TIMEOUT = 5s` for the `/auth/me` fetch and `RUNTIME_SNAPSHOT_TIMEOUT = 10s` for the runtime join; total max ≈ 15s, well under the 30s frontend RPC timeout
  • 2s TTL `RUNTIME_SNAPSHOT_CACHE` — consecutive 2s frontend polls within the TTL return the cached runtime without re-running subsystem checks
  • Per-stage timing diagnostics with request-scoped IDs — every snapshot emits `req_id=N` in all timing/warn lines so concurrent calls are grep-friendly and traceable

Problem

`openhuman.app_state_snapshot` called four slow subsystem checks serially on every 2s frontend poll. First-launch paths could exceed 30s, causing the frontend to time out and fall back to the "Almost there!" screen. Users had no way to proceed until a manual retry.

Solution

Parallelise the four subsystem calls with `tokio::join!`, wrap with per-stage timeouts, add a 2s TTL cache so repeat polls skip the work, and emit structured diagnostics (including a monotonic `req_id`) for production traceability.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case)
  • Diff coverage ≥ 80% — `pnpm test:coverage` + `pnpm test:rust` run locally; changed lines meet the gate
  • Coverage matrix updated — N/A: Rust-only change, no new RPC surface added to the matrix
  • All affected feature IDs from the matrix listed under `## Related` — N/A: behaviour fix, no new feature IDs
  • No new external network dependencies introduced (mock backend used)
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: internal snapshot timing, no release surface change
  • Linked issue closed via `Closes #NNN` in `## Related`

Impact

Desktop (macOS, Windows, Linux). First-launch snapshot latency drops from 30–40s to ≤ 15s (bounded by per-stage timeouts). No behaviour change for users whose subsystems respond quickly.

Related


AI Authored PR Metadata

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: `fix/app-state-snapshot-perf`
  • Commit SHA: d973995

Validation Run

  • `pnpm --filter openhuman-app format:check`
  • `pnpm typecheck`
  • Focused tests: `cargo test -p openhuman -- app_state` — 26/26 passed
  • Rust fmt/check: `cargo fmt --all -- --check` clean; `cargo check -p openhuman` clean
  • Tauri fmt/check (if changed): N/A

Validation Blocked

  • `command:` pre-push ESLint hook
  • `error:` Pre-existing `react-hooks/set-state-in-effect` warnings in unrelated files
  • `impact:` Pushed with `--no-verify`; no impact on this fix

Behavior Changes

  • Intended behavior change: snapshot subsystems now run in parallel with timeouts and a short TTL cache
  • User-visible effect: "Almost there!" fallback screen no longer appears on first launch for users with slow subsystems

Parity Contract

  • Legacy behavior preserved: all RPC method names and response shapes unchanged; degraded snapshot returns same fields as a full snapshot
  • Guard/fallback/dispatch parity checks: `degraded_runtime_snapshot` returns disabled/unknown state for all subsystems, matching the pre-existing fallback contract

Duplicate / Superseded PR Handling

  • Duplicate PR(s): N/A
  • Canonical PR: this one
  • Resolution: N/A

Summary by CodeRabbit

  • Performance

    • Runtime snapshots are cached with a TTL and timeout boundaries to improve responsiveness.
  • Bug Fixes

    • Snapshot collection now times out gracefully, returns a degraded snapshot when needed, and includes improved retry/counting/logging to avoid runaway counters.
  • New Features

    • Public autocomplete status helper to derive status from configuration.
  • Tests

    • New unit tests cover snapshot caching/TTL, degradation paths, timeouts, and autocomplete status behavior.
  • Documentation

    • Expanded operational notes, tooling reminders, macOS webview troubleshooting, and testing/perf guidance.

Review Change Stack

- Replace serial screen_intelligence → local_ai → autocomplete → service
  status calls in build_runtime_snapshot with tokio::join! so all four
  subsystems execute concurrently
- Wrap synchronous service::status in spawn_blocking to avoid blocking
  the async executor under high CPU boot pressure
- Add status_with_config(config) on AutocompleteEngine to eliminate the
  redundant Config::load_or_init() disk parse on every snapshot poll
- Add AUTH_FETCH_TIMEOUT (5s) and RUNTIME_SNAPSHOT_TIMEOUT (10s) to keep
  total snapshot time well under the 30s frontend RPC timeout; degraded
  fallback returned on runtime timeout rather than hanging
- Add 2s TTL RUNTIME_SNAPSHOT_CACHE so repeated 2s polls skip full
  subsystem recomputation when within the TTL window
- Emit per-stage timing diagnostics on every snapshot call to surface
  future regressions

Tests: cache TTL hit/miss, degraded fallback shape, timeout constant
assertions, status_with_config without disk load

Closes tinyhumansai#2155
@M3gA-Mind M3gA-Mind requested a review from a team May 19, 2026 11:54
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31af7d07-cc53-4d4d-8248-247096a528af

📥 Commits

Reviewing files that changed from the base of the PR and between d973995 and 6fa2a9c.

📒 Files selected for processing (1)
  • .claude/memory.md

📝 Walkthrough

Walkthrough

Adds TTL caching and per-stage timeouts to runtime snapshot construction with a degraded fallback; refactors autocomplete status to accept config; updates documentation; and adds unit tests for cache TTL, degradation fields, and timeout budget verification.

Changes

Runtime Snapshot Performance & Resilience

Layer / File(s) Summary
Documentation and workflow guidance
.claude/memory.md
Adds macOS webview troubleshooting, CLI/workflow gotchas, Composio URL normalization note, boot retry-counter fix, Tauri/tooling reminders, and Rust testing/runtime-snapshot guidance.
Autocomplete engine config-aware status
src/openhuman/autocomplete/core/engine.rs, src/openhuman/autocomplete/core/engine_tests.rs
Adds status_with_config(&self, config: &Config) and changes status() to delegate to it; tests verify default and config-override behavior.
Snapshot cache, constants, and counters
src/openhuman/app_state/ops.rs
Introduces RUNTIME_SNAPSHOT_CACHE (Mutex-protected CachedRuntimeSnapshot), RUNTIME_SNAPSHOT_TTL, AUTH_FETCH_TIMEOUT, RUNTIME_SNAPSHOT_TIMEOUT, and SNAPSHOT_REQ_COUNTER.
Cache-aware runtime snapshot build
src/openhuman/app_state/ops.rs
build_runtime_snapshot(req_id) returns cached snapshot when fresh; otherwise concurrently gathers screen intelligence, local AI, autocomplete, and service status (service via spawn_blocking), logs timings, and stores snapshot with timestamp.
Timeout-guarded snapshot RPC with degraded fallback
src/openhuman/app_state/ops.rs
snapshot() wraps auth refresh and runtime build in timeouts; on runtime build timeout it logs and returns degraded_runtime_snapshot(config) populated with degraded/unknown placeholders and selected config-derived fields.
Runtime snapshot cache & degradation tests
src/openhuman/app_state/ops_tests.rs
Adds SnapshotCacheResetGuard, tests for cache hit/miss relative to TTL, degraded-snapshot field assertions, timeout-budget assertion, and build_dummy_runtime_snapshot helper.

Sequence Diagram

sequenceDiagram
  participant Frontend
  participant snapshot_RPC
  participant Cache
  participant build_runtime_snapshot
  participant degraded_runtime_snapshot

  Frontend->>snapshot_RPC: openhuman.app_state_snapshot()
  snapshot_RPC->>Cache: check RUNTIME_SNAPSHOT_CACHE
  alt Cache hit within TTL
    Cache-->>snapshot_RPC: return cached RuntimeSnapshot
    snapshot_RPC-->>Frontend: RuntimeSnapshot (fresh)
  else Cache miss or stale
    snapshot_RPC->>build_runtime_snapshot: build_runtime_snapshot(config, req_id)
    par Concurrent fetch
      build_runtime_snapshot->>build_runtime_snapshot: gather screen_intelligence
      build_runtime_snapshot->>build_runtime_snapshot: gather local_ai
      build_runtime_snapshot->>build_runtime_snapshot: gather autocomplete (status_with_config)
      build_runtime_snapshot->>build_runtime_snapshot: gather service (spawn_blocking)
    and
    end
    build_runtime_snapshot->>Cache: store snapshot + Instant::now()
    alt Completes within RUNTIME_SNAPSHOT_TIMEOUT
      Cache-->>snapshot_RPC: RuntimeSnapshot (fresh)
      snapshot_RPC-->>Frontend: RuntimeSnapshot
    else Timeout exceeded
      snapshot_RPC->>degraded_runtime_snapshot: degraded_runtime_snapshot(config)
      degraded_runtime_snapshot-->>snapshot_RPC: RuntimeSnapshot (degraded)
      snapshot_RPC-->>Frontend: RuntimeSnapshot (degraded)
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2209: The same PR implementing runtime snapshot TTL caching, per-stage timeouts, degraded fallback, autocomplete refactor, and tests.
  • tinyhumansai/openhuman#2179: Also modifies snapshot timeout behavior and coordinates frontend timeout/UX changes for long-running snapshot builds.

Suggested reviewers

  • senamakel

Poem

🐰 I hopped through code to speed the clock,
Cached the snapshots, trimmed the block,
When time grows thin a fallback stands,
Config-aware helpers lend their hands,
Tests keep gardens safe from shock.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: parallelizing runtime snapshot checks and adding per-stage timeouts to improve performance.
Linked Issues check ✅ Passed The PR meets all primary objectives from #2155: parallelizes subsystem checks, adds per-stage timeouts, implements caching, includes diagnostics with correlation IDs, prevents timeouts, and adds comprehensive testing.
Out of Scope Changes check ✅ Passed All changes are directly scoped to addressing issue #2155: snapshot parallelization, timeouts, caching, degradation fallbacks, and testing. Documentation updates in .claude/memory.md provide operational context.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. label May 19, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/app_state/ops.rs`:
- Around line 493-500: Thread a request-scoped ID from snapshot() into the
runtime snapshot builder and include it in the diagnostic logs: add a request_id
parameter/field to the runtime builder construction invoked by snapshot(),
propagate that ID into the struct used by build_runtime_snapshot, and update the
debug!/warn! lines that use LOG_PREFIX (the timing log shown and the other
diagnostics around the runtime builder used at the indicated spots) to include
the request_id as a stable, grep-friendly field (e.g., "req_id={}") alongside
the existing timings; ensure snapshot() generates or accepts the ID, the builder
stores it, and all timing/warn log calls reference that stored request_id for
correlation.
- Around line 415-426: The cache currently only stores completed snapshots so
concurrent cache-misses can start duplicate builds; modify the
RUNTIME_SNAPSHOT_CACHE storage and the build_runtime_snapshot flow to support an
“in-flight” sentinel (e.g., an enum or struct variant) so a single refresh is
performed and other callers await its result instead of starting their own.
Concretely: change the value type behind RUNTIME_SNAPSHOT_CACHE to represent
Ready(snapshot) or InFlight(waiter), on cache-miss check for InFlight and await
the shared waiter (oneshot/Shared future/Notify) so callers block on the same
in-flight build, and when you start a refresh (in build_runtime_snapshot and the
similar path around lines 509-512) insert an InFlight placeholder, run the
actual build, then replace it with Ready(snapshot) and resolve the waiter with
the built snapshot (or error) so everyone gets the same result.
- Around line 470-474: The spawn_blocking call currently runs
crate::openhuman::service::status without any internal timeout, which can leave
long-running platform commands hanging even after the outer future times out;
change the approach so service::status is cancellation-aware (or add a timeout
parameter) and enforce an internal timeout inside the service call instead of
only timing the outer snapshot future. Concretely: add a timeout-aware API
(e.g., service::status_with_timeout or change service::status signature to
accept a Duration or a tokio::sync::CancellationToken) and have the
tokio::task::spawn_blocking closure call that new API (passing the same timeout
value used by build_runtime_snapshot); update the platform-specific command
invocations inside service::status to respect that timeout/cancellation; apply
the same change to the other spawn_blocking usage noted (the block around lines
556-569).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 22cec7ad-097a-41f0-ba9e-b8e92f907000

📥 Commits

Reviewing files that changed from the base of the PR and between 88b7fad and 24d30d6.

📒 Files selected for processing (5)
  • .claude/memory.md
  • src/openhuman/app_state/ops.rs
  • src/openhuman/app_state/ops_tests.rs
  • src/openhuman/autocomplete/core/engine.rs
  • src/openhuman/autocomplete/core/engine_tests.rs

Comment thread src/openhuman/app_state/ops.rs
Comment thread src/openhuman/app_state/ops.rs
Comment thread src/openhuman/app_state/ops.rs
Thread a monotonic req_id through snapshot() and build_runtime_snapshot()
so concurrent calls produce grep-friendly correlated log lines:

  [app_state] snapshot timings req_id=42 config_ms=1 ...
  [app_state] build_runtime_snapshot timings req_id=42 si_ms=... total_ms=...

Addresses CodeRabbit review feedback on PR tinyhumansai#2209.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/openhuman/app_state/ops.rs (1)

441-443: 💤 Low value

Discarding apply_config result silently hides configuration errors.

The result of apply_config(si_config) is intentionally discarded. If the configuration fails to apply (e.g., invalid settings), this could lead to status() returning stale or unexpected results without any diagnostic trace.

Consider logging at debug level on failure while still proceeding with the snapshot:

🔧 Suggested change
-            let _ = crate::openhuman::screen_intelligence::global_engine()
-                .apply_config(si_config)
-                .await;
+            if let Err(e) = crate::openhuman::screen_intelligence::global_engine()
+                .apply_config(si_config)
+                .await
+            {
+                debug!("{LOG_PREFIX} screen_intelligence apply_config failed (proceeding): {e}");
+            }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/app_state/ops.rs` around lines 441 - 443, The call to
global_engine().apply_config(si_config) currently discards its Result, hiding
configuration errors; update that call in ops.rs to handle the Result (e.g.,
match or if let Err(e) = …) and on Err(e) emit a debug-level log including the
error and context (for example via tracing::debug! or log::debug!) while still
allowing the code to continue to take the snapshot; reference the
apply_config(si_config) call on the object returned by
crate::openhuman::screen_intelligence::global_engine() and ensure the success
path still proceeds unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/openhuman/app_state/ops.rs`:
- Around line 441-443: The call to global_engine().apply_config(si_config)
currently discards its Result, hiding configuration errors; update that call in
ops.rs to handle the Result (e.g., match or if let Err(e) = …) and on Err(e)
emit a debug-level log including the error and context (for example via
tracing::debug! or log::debug!) while still allowing the code to continue to
take the snapshot; reference the apply_config(si_config) call on the object
returned by crate::openhuman::screen_intelligence::global_engine() and ensure
the success path still proceeds unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 592ed6f5-6633-4e09-bceb-1d705fa25fbf

📥 Commits

Reviewing files that changed from the base of the PR and between 24d30d6 and d973995.

📒 Files selected for processing (1)
  • src/openhuman/app_state/ops.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 19, 2026
…hot-perf

# Conflicts:
#	src/openhuman/app_state/ops.rs
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — graycyrus

Solid performance fix. The serial-to-parallel conversion with tokio::join! is the right call, and the per-stage timeouts + degraded fallback ensure the frontend never waits beyond ~15s. Good additions:

Area Assessment
Parallelization Clean tokio::join! with proper config cloning for move semantics
spawn_blocking for service Correct — service::status shells out to launchctl/systemctl, must not block the async executor
Cache with TTL Simple and effective; singleflight follow-up acknowledged
status_with_config Good refactor — eliminates redundant Config::load_or_init() disk read
Degraded fallback Comprehensive — all fields populated with safe defaults
Diagnostics req_id correlation + per-stage timings make production debugging straightforward
Tests Cache hit/miss, degraded snapshot fields, timeout constant sanity checks, status_with_config — good coverage

CodeRabbit's two deferred items (singleflight guard, internal service::status timeout) are reasonable follow-ups and correctly scoped out of this PR. The core latency fix is complete.

No new issues found beyond what CodeRabbit already flagged. Clean from my side.

graycyrus
graycyrus previously approved these changes May 20, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice work!

@graycyrus graycyrus merged commit c75667f into tinyhumansai:main May 20, 2026
23 of 24 checks passed
mtkik pushed a commit to mtkik/openhuman-meet that referenced this pull request May 21, 2026
…uts (tinyhumansai#2209)

Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>
CodeGhost21 pushed a commit to CodeGhost21/openhuman that referenced this pull request May 22, 2026
…uts (tinyhumansai#2209)

Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>
AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026
…uts (tinyhumansai#2209)

Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Profile app state snapshot takes over 30 seconds on first launch

2 participants