perf(app-state): parallelize runtime snapshot and add per-stage timeouts#2209
Conversation
- Replace serial screen_intelligence → local_ai → autocomplete → service status calls in build_runtime_snapshot with tokio::join! so all four subsystems execute concurrently - Wrap synchronous service::status in spawn_blocking to avoid blocking the async executor under high CPU boot pressure - Add status_with_config(config) on AutocompleteEngine to eliminate the redundant Config::load_or_init() disk parse on every snapshot poll - Add AUTH_FETCH_TIMEOUT (5s) and RUNTIME_SNAPSHOT_TIMEOUT (10s) to keep total snapshot time well under the 30s frontend RPC timeout; degraded fallback returned on runtime timeout rather than hanging - Add 2s TTL RUNTIME_SNAPSHOT_CACHE so repeated 2s polls skip full subsystem recomputation when within the TTL window - Emit per-stage timing diagnostics on every snapshot call to surface future regressions Tests: cache TTL hit/miss, degraded fallback shape, timeout constant assertions, status_with_config without disk load Closes tinyhumansai#2155
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds TTL caching and per-stage timeouts to runtime snapshot construction with a degraded fallback; refactors autocomplete status to accept config; updates documentation; and adds unit tests for cache TTL, degradation fields, and timeout budget verification. ChangesRuntime Snapshot Performance & Resilience
Sequence DiagramsequenceDiagram
participant Frontend
participant snapshot_RPC
participant Cache
participant build_runtime_snapshot
participant degraded_runtime_snapshot
Frontend->>snapshot_RPC: openhuman.app_state_snapshot()
snapshot_RPC->>Cache: check RUNTIME_SNAPSHOT_CACHE
alt Cache hit within TTL
Cache-->>snapshot_RPC: return cached RuntimeSnapshot
snapshot_RPC-->>Frontend: RuntimeSnapshot (fresh)
else Cache miss or stale
snapshot_RPC->>build_runtime_snapshot: build_runtime_snapshot(config, req_id)
par Concurrent fetch
build_runtime_snapshot->>build_runtime_snapshot: gather screen_intelligence
build_runtime_snapshot->>build_runtime_snapshot: gather local_ai
build_runtime_snapshot->>build_runtime_snapshot: gather autocomplete (status_with_config)
build_runtime_snapshot->>build_runtime_snapshot: gather service (spawn_blocking)
and
end
build_runtime_snapshot->>Cache: store snapshot + Instant::now()
alt Completes within RUNTIME_SNAPSHOT_TIMEOUT
Cache-->>snapshot_RPC: RuntimeSnapshot (fresh)
snapshot_RPC-->>Frontend: RuntimeSnapshot
else Timeout exceeded
snapshot_RPC->>degraded_runtime_snapshot: degraded_runtime_snapshot(config)
degraded_runtime_snapshot-->>snapshot_RPC: RuntimeSnapshot (degraded)
snapshot_RPC-->>Frontend: RuntimeSnapshot (degraded)
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/app_state/ops.rs`:
- Around line 493-500: Thread a request-scoped ID from snapshot() into the
runtime snapshot builder and include it in the diagnostic logs: add a request_id
parameter/field to the runtime builder construction invoked by snapshot(),
propagate that ID into the struct used by build_runtime_snapshot, and update the
debug!/warn! lines that use LOG_PREFIX (the timing log shown and the other
diagnostics around the runtime builder used at the indicated spots) to include
the request_id as a stable, grep-friendly field (e.g., "req_id={}") alongside
the existing timings; ensure snapshot() generates or accepts the ID, the builder
stores it, and all timing/warn log calls reference that stored request_id for
correlation.
- Around line 415-426: The cache currently only stores completed snapshots so
concurrent cache-misses can start duplicate builds; modify the
RUNTIME_SNAPSHOT_CACHE storage and the build_runtime_snapshot flow to support an
“in-flight” sentinel (e.g., an enum or struct variant) so a single refresh is
performed and other callers await its result instead of starting their own.
Concretely: change the value type behind RUNTIME_SNAPSHOT_CACHE to represent
Ready(snapshot) or InFlight(waiter), on cache-miss check for InFlight and await
the shared waiter (oneshot/Shared future/Notify) so callers block on the same
in-flight build, and when you start a refresh (in build_runtime_snapshot and the
similar path around lines 509-512) insert an InFlight placeholder, run the
actual build, then replace it with Ready(snapshot) and resolve the waiter with
the built snapshot (or error) so everyone gets the same result.
- Around line 470-474: The spawn_blocking call currently runs
crate::openhuman::service::status without any internal timeout, which can leave
long-running platform commands hanging even after the outer future times out;
change the approach so service::status is cancellation-aware (or add a timeout
parameter) and enforce an internal timeout inside the service call instead of
only timing the outer snapshot future. Concretely: add a timeout-aware API
(e.g., service::status_with_timeout or change service::status signature to
accept a Duration or a tokio::sync::CancellationToken) and have the
tokio::task::spawn_blocking closure call that new API (passing the same timeout
value used by build_runtime_snapshot); update the platform-specific command
invocations inside service::status to respect that timeout/cancellation; apply
the same change to the other spawn_blocking usage noted (the block around lines
556-569).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 22cec7ad-097a-41f0-ba9e-b8e92f907000
📒 Files selected for processing (5)
.claude/memory.mdsrc/openhuman/app_state/ops.rssrc/openhuman/app_state/ops_tests.rssrc/openhuman/autocomplete/core/engine.rssrc/openhuman/autocomplete/core/engine_tests.rs
Thread a monotonic req_id through snapshot() and build_runtime_snapshot() so concurrent calls produce grep-friendly correlated log lines: [app_state] snapshot timings req_id=42 config_ms=1 ... [app_state] build_runtime_snapshot timings req_id=42 si_ms=... total_ms=... Addresses CodeRabbit review feedback on PR tinyhumansai#2209.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/openhuman/app_state/ops.rs (1)
441-443: 💤 Low valueDiscarding
apply_configresult silently hides configuration errors.The result of
apply_config(si_config)is intentionally discarded. If the configuration fails to apply (e.g., invalid settings), this could lead tostatus()returning stale or unexpected results without any diagnostic trace.Consider logging at
debuglevel on failure while still proceeding with the snapshot:🔧 Suggested change
- let _ = crate::openhuman::screen_intelligence::global_engine() - .apply_config(si_config) - .await; + if let Err(e) = crate::openhuman::screen_intelligence::global_engine() + .apply_config(si_config) + .await + { + debug!("{LOG_PREFIX} screen_intelligence apply_config failed (proceeding): {e}"); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/app_state/ops.rs` around lines 441 - 443, The call to global_engine().apply_config(si_config) currently discards its Result, hiding configuration errors; update that call in ops.rs to handle the Result (e.g., match or if let Err(e) = …) and on Err(e) emit a debug-level log including the error and context (for example via tracing::debug! or log::debug!) while still allowing the code to continue to take the snapshot; reference the apply_config(si_config) call on the object returned by crate::openhuman::screen_intelligence::global_engine() and ensure the success path still proceeds unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/openhuman/app_state/ops.rs`:
- Around line 441-443: The call to global_engine().apply_config(si_config)
currently discards its Result, hiding configuration errors; update that call in
ops.rs to handle the Result (e.g., match or if let Err(e) = …) and on Err(e)
emit a debug-level log including the error and context (for example via
tracing::debug! or log::debug!) while still allowing the code to continue to
take the snapshot; reference the apply_config(si_config) call on the object
returned by crate::openhuman::screen_intelligence::global_engine() and ensure
the success path still proceeds unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 592ed6f5-6633-4e09-bceb-1d705fa25fbf
📒 Files selected for processing (1)
src/openhuman/app_state/ops.rs
…hot-perf # Conflicts: # src/openhuman/app_state/ops.rs
graycyrus
left a comment
There was a problem hiding this comment.
Review — graycyrus
Solid performance fix. The serial-to-parallel conversion with tokio::join! is the right call, and the per-stage timeouts + degraded fallback ensure the frontend never waits beyond ~15s. Good additions:
| Area | Assessment |
|---|---|
| Parallelization | Clean tokio::join! with proper config cloning for move semantics |
spawn_blocking for service |
Correct — service::status shells out to launchctl/systemctl, must not block the async executor |
| Cache with TTL | Simple and effective; singleflight follow-up acknowledged |
status_with_config |
Good refactor — eliminates redundant Config::load_or_init() disk read |
| Degraded fallback | Comprehensive — all fields populated with safe defaults |
| Diagnostics | req_id correlation + per-stage timings make production debugging straightforward |
| Tests | Cache hit/miss, degraded snapshot fields, timeout constant sanity checks, status_with_config — good coverage |
CodeRabbit's two deferred items (singleflight guard, internal service::status timeout) are reasonable follow-ups and correctly scoped out of this PR. The core latency fix is complete.
No new issues found beyond what CodeRabbit already flagged. Clean from my side.
graycyrus
left a comment
There was a problem hiding this comment.
Looks good, nice work!
…uts (tinyhumansai#2209) Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>
…uts (tinyhumansai#2209) Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>
…uts (tinyhumansai#2209) Co-authored-by: Cyrus Gray <144336577+graycyrus@users.noreply.github.com>
Summary
Fixes first-launch `openhuman.app_state_snapshot` taking 30–40s and causing the frontend to timeout, leaving users on the "Almost there!" fallback.
Problem
`openhuman.app_state_snapshot` called four slow subsystem checks serially on every 2s frontend poll. First-launch paths could exceed 30s, causing the frontend to time out and fall back to the "Almost there!" screen. Users had no way to proceed until a manual retry.
Solution
Parallelise the four subsystem calls with `tokio::join!`, wrap with per-stage timeouts, add a 2s TTL cache so repeat polls skip the work, and emit structured diagnostics (including a monotonic `req_id`) for production traceability.
Submission Checklist
Impact
Desktop (macOS, Windows, Linux). First-launch snapshot latency drops from 30–40s to ≤ 15s (bounded by per-stage timeouts). No behaviour change for users whose subsystems respond quickly.
Related
AI Authored PR Metadata
Linear Issue
Commit & Branch
Validation Run
Validation Blocked
Behavior Changes
Parity Contract
Duplicate / Superseded PR Handling
Summary by CodeRabbit
Performance
Bug Fixes
New Features
Tests
Documentation