Harden per-user actor model: panic isolation, supervision, idle eviction #30

@aruokhai

Context

The per-user actor pattern in cosigner-runtime/src/cosigner/ (registry + run_actor + CosignerCommand mpsc dispatch) is solid in shape but has three resilience gaps. They surfaced during a research spike on whether to adopt the actix actor framework. The conclusion was to keep the homegrown pattern and close these gaps directly, since they are the only meaningful capabilities actix would have added.

Gaps

1. Handler panic kills the per-user actor permanently

In cosigner-runtime/src/cosigner/actor.rs at lines 38, 91, 243, 276 the spawn_blocking join results are unwrapped via .unwrap_or_else(|e| panic!(...)). A panic in any handler propagates up and terminates the actor's recv loop. The OwnedHandle in the registry then points at a dead mailbox, and every subsequent request for that user fails with Status::internal("user actor unavailable") (see registry.rs:136-137) until the process restarts.

Fix: Convert the panic into a per-request error via the existing oneshot reply, and let the actor loop keep running:

match join_result {
    // Handler ran to completion: forward its result unchanged.
    Ok(Ok(resp))    => { let _ = reply.send(Ok(resp)); }
    Ok(Err(status)) => { let _ = reply.send(Err(status)); }
    // Handler panicked inside spawn_blocking: fail this request only.
    Err(e) if e.is_panic() => {
        tracing::error!(user_id = %uid, error = %e, "handler panicked");
        let _ = reply.send(Err(Status::internal("handler panicked")));
    }
    // Task cancelled: the actor is shutting down, so the reply is dropped.
    Err(_) => {}
}

Apply at all four sites in actor.rs.
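
The same idea can be exercised without tokio. The following is a minimal std-only sketch, not the actor.rs code: std::thread and std::sync::mpsc stand in for the tokio task and oneshot reply, catch_unwind stands in for the spawn_blocking join result, and Command, handle, ask, and spawn_actor are hypothetical names introduced for illustration.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;

/// Hypothetical command: an input plus a reply channel (stand-in for the oneshot).
pub struct Command {
    pub input: u32,
    pub reply: mpsc::Sender<Result<u32, String>>,
}

/// Hypothetical handler that panics on one specific input.
fn handle(input: u32) -> u32 {
    if input == 13 {
        panic!("simulated handler bug");
    }
    input * 2
}

/// Actor loop: a panicking handler becomes an Err reply and the loop keeps receiving.
pub fn run_actor(rx: mpsc::Receiver<Command>) {
    for cmd in rx {
        let outcome = catch_unwind(AssertUnwindSafe(|| handle(cmd.input)))
            .map_err(|_| "handler panicked".to_string());
        // The caller may have gone away; a failed send is ignored.
        let _ = cmd.reply.send(outcome);
    }
}

/// Spawns the actor on its own thread and returns its mailbox sender.
pub fn spawn_actor() -> mpsc::Sender<Command> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || run_actor(rx));
    tx
}

/// Sends one request and blocks until the reply arrives.
pub fn ask(tx: &mpsc::Sender<Command>, input: u32) -> Result<u32, String> {
    let (reply, reply_rx) = mpsc::channel();
    tx.send(Command { input, reply })
        .map_err(|_| "actor unavailable".to_string())?;
    reply_rx.recv().map_err(|_| "reply dropped".to_string())?
}
```

Calling ask with the panicking input returns an error for that request only; a later ask on the same sender still succeeds, which is the behaviour the first acceptance criterion asks for.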

2. No actor-level supervision / restart

If the actor task itself dies (a panic outside a handler or, once gap 1 is fixed, unrecoverable state corruption), nothing restarts it. registry.rs:65-96 spawns once via tokio::spawn(run_actor(...)) and never re-checks the JoinHandle.

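Fix: Wrap the spawn in a supervisor loop that rebuilds the CosignerInstance via new_user_instance() and respawns on panic. This requires making the instance constructor callable inside the supervisor closure (currently the instance is moved into the task once) and keeping the mailbox receiver alive across restarts, since an mpsc receiver cannot be cloned. The sketch assumes run_actor is changed to take the receiver behind an Arc<Mutex<...>>:

// Shared so a restarted actor resumes draining the same mailbox.
let rx = Arc::new(tokio::sync::Mutex::new(rx));
let join = tokio::spawn(async move {
    loop {
        // `?` is not usable here: the block returns (), so handle the error explicitly.
        let instance = match registry.new_user_instance(&uid).await {
            Ok(instance) => instance,
            Err(_) => {
                tracing::error!(user_id = %uid, "failed to rebuild instance, giving up");
                break;
            }
        };
        let task = tokio::spawn(run_actor(Arc::clone(&rx), instance));
        match task.await {
            Ok(()) => break, // clean shutdown
            Err(e) if e.is_panic() => {
                tracing::error!(user_id = %uid, "actor panicked, restarting");
                continue;
            }
            Err(_) => break, // task cancelled
        }
    }
});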

Note: this requires deciding what to do with the in-flight oneshot replies. Their senders are dropped along with the panicked task, so awaiting callers see a closed reply channel, surfacing as Cancelled on the caller side. That is probably acceptable.
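
The restart behaviour described above can be checked in isolation with a std-only sketch. This is an analogue under stated assumptions rather than the registry code: threads replace tokio tasks, JoinHandle::join plays the role of awaiting the task, and Command, Instance, and spawn_supervised are hypothetical names. The generation counter exists only to make the rebuild observable.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical commands; `Crash` simulates a panic outside any handler.
pub enum Command {
    Generation(mpsc::Sender<u32>),
    Crash,
    Shutdown,
}

/// Stand-in for CosignerInstance; `generation` counts how often it was rebuilt.
struct Instance {
    generation: u32,
}

fn run_actor(rx: Arc<Mutex<mpsc::Receiver<Command>>>, instance: Instance) {
    loop {
        // The lock guard is released at the end of this statement, so a panic
        // further down cannot poison the mutex for the next actor.
        let cmd = rx.lock().unwrap().recv();
        match cmd {
            Ok(Command::Generation(reply)) => {
                let _ = reply.send(instance.generation);
            }
            Ok(Command::Crash) => panic!("simulated actor bug"),
            // Explicit shutdown, or every sender was dropped.
            Ok(Command::Shutdown) | Err(_) => return,
        }
    }
}

/// Spawns a supervisor that restarts the actor after a panic.
/// The supervisor's join handle yields the number of restarts performed.
pub fn spawn_supervised() -> (mpsc::Sender<Command>, thread::JoinHandle<u32>) {
    let (tx, rx) = mpsc::channel();
    // Shared so that a restarted actor drains the same mailbox.
    let rx = Arc::new(Mutex::new(rx));
    let supervisor = thread::spawn(move || {
        let mut restarts = 0;
        loop {
            // Rebuild a fresh instance for every (re)start.
            let instance = Instance { generation: restarts };
            let actor_rx = Arc::clone(&rx);
            let actor = thread::spawn(move || run_actor(actor_rx, instance));
            match actor.join() {
                Ok(()) => return restarts, // clean shutdown
                Err(_) => restarts += 1,   // actor panicked: loop and respawn
            }
        }
    });
    (tx, supervisor)
}
```

Messages queued behind the crashing command are still delivered, because the receiver outlives the actor that panicked. The sketch restarts immediately and without limit; a real supervisor would probably want a backoff or a restart cap so that a deterministic crash does not spin.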

3. No idle eviction

Actors stay resident for the lifetime of the process. Each holds a wasmtime Store with linear memory for DKG/signing session state, so with many users memory grows without bound.

Fix: Track a last_active: Instant per actor, update on each recv(), and have the existing auto-settle tick at main.rs:152-170 also sweep idle actors past a threshold (e.g. 30 min). Send CosignerCommand::Shutdown (command.rs:78) and remove from the DashMap.
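
The sweep logic can be sketched on its own. This is a simplified illustration, not the registry code: a plain HashMap behind &mut self replaces the DashMap, the entry omits the mailbox sender, and Registry, ActorEntry, touch, sweep_idle, and is_resident are hypothetical names. Passing `now` explicitly is a choice made here so the logic can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical registry entry; the real one would also hold the mailbox sender.
pub struct ActorEntry {
    pub last_active: Instant,
}

pub struct Registry {
    actors: HashMap<String, ActorEntry>,
    idle_threshold: Duration,
}

impl Registry {
    pub fn new(idle_threshold: Duration) -> Self {
        Registry {
            actors: HashMap::new(),
            idle_threshold,
        }
    }

    /// Records activity for a user, creating the entry lazily
    /// (stand-in for spawning the actor on first request).
    pub fn touch(&mut self, user_id: &str, now: Instant) {
        self.actors
            .entry(user_id.to_string())
            .or_insert(ActorEntry { last_active: now })
            .last_active = now;
    }

    /// Removes every actor idle for at least the threshold and returns
    /// the evicted user ids in sorted order.
    pub fn sweep_idle(&mut self, now: Instant) -> Vec<String> {
        let threshold = self.idle_threshold;
        let mut evicted: Vec<String> = self
            .actors
            .iter()
            .filter(|(_, entry)| now.saturating_duration_since(entry.last_active) >= threshold)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &evicted {
            // Real code would send CosignerCommand::Shutdown before removing.
            self.actors.remove(id);
        }
        evicted.sort();
        evicted
    }

    pub fn is_resident(&self, user_id: &str) -> bool {
        self.actors.contains_key(user_id)
    }
}
```

With a 30-minute threshold, a user last seen 31 minutes ago is evicted while one seen 11 minutes ago stays resident, and the next touch for the evicted user recreates the entry, mirroring the lazy re-spawn in the acceptance list.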

Out of scope

  • Migration to the actix actor framework: explicitly rejected. Its runtime semantics fight enclave deployment, multi-party rendezvous (dispatch_rendezvous!) doesn't fit its reply model, and spawn_blocking would still be needed for WASM CPU work.
  • Mailbox-full handling on try_send (main.rs:164-166) — current silent-drop is acceptable for TickAutoSettle.

Acceptance

  • Handler panics produce a typed error to the caller and leave the actor alive (test with a panicking handler stub)
  • Actor-task panics trigger a restart with a fresh CosignerInstance (test by injecting a panic! into run_actor; a real std::process::abort would kill the whole process and bypass the supervisor)
  • Idle actors are evicted after a configurable threshold and re-spawned lazily on next request
