Context
The per-user actor pattern in cosigner-runtime/src/cosigner/ (registry + run_actor + CosignerCommand mpsc dispatch) is solid in shape but has three resilience gaps. They surfaced during a research spike on whether to adopt the actix actor framework. The conclusion was to keep the homegrown pattern and close these gaps directly, since they are the only meaningful capabilities actix would have added.
Gaps
1. Handler panic kills the per-user actor permanently
In cosigner-runtime/src/cosigner/actor.rs at lines 38, 91, 243, 276 the spawn_blocking join results are unwrapped via .unwrap_or_else(|e| panic!(...)). A panic in any handler propagates up and terminates the actor's recv loop. The OwnedHandle in the registry then points at a dead mailbox, and every subsequent request for that user fails with Status::internal("user actor unavailable") (see registry.rs:136-137) until the process restarts.
Fix: Convert the panic into a per-request error via the existing oneshot reply, and let the actor loop keep running:
    match join_result {
        Ok(Ok(resp)) => { let _ = reply.send(Ok(resp)); }
        Ok(Err(status)) => { let _ = reply.send(Err(status)); }
        Err(e) if e.is_panic() => {
            tracing::error!(user_id = %uid, "handler panicked");
            let _ = reply.send(Err(Status::internal("handler panicked")));
        }
        Err(_) => { /* task cancelled: actor is shutting down */ }
    }
Apply at all four sites in actor.rs.
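To check the intended behaviour in isolation, here is a minimal std-only sketch of the same idea. It is not the cosigner code: std::thread::spawn(...).join() stands in for the spawn_blocking join, a std mpsc Sender stands in for the oneshot reply, and Status, Command, handle, and call are hypothetical names invented for the illustration. The point it shows is that a panicking handler yields an error for that one request while the loop keeps serving later requests.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Stand-in for tonic's `Status`; only the message matters here.
#[derive(Debug, PartialEq)]
pub struct Status(pub String);

/// One request: an input plus a reply channel (stand-in for the oneshot sender).
pub struct Command {
    pub input: i32,
    pub reply: Sender<Result<i32, Status>>,
}

/// Hypothetical handler: panics on negative input, returns an error on zero.
fn handle(input: i32) -> Result<i32, Status> {
    if input < 0 {
        panic!("negative input");
    }
    if input == 0 {
        return Err(Status("zero is not allowed".into()));
    }
    Ok(input * 2)
}

/// Actor loop: runs each handler on its own thread and turns a panic into a
/// per-request error instead of letting it unwind through the loop.
pub fn run_actor(rx: mpsc::Receiver<Command>) {
    for cmd in rx {
        let input = cmd.input;
        let join_result = thread::spawn(move || handle(input)).join();
        let outcome = match join_result {
            Ok(result) => result, // handler finished; pass its Ok or Err through
            Err(_) => Err(Status("handler panicked".into())),
        };
        // The caller may have gone away; a failed send is not fatal.
        let _ = cmd.reply.send(outcome);
    }
}

/// Sends one request to the actor and waits for its reply.
pub fn call(tx: &Sender<Command>, input: i32) -> Result<i32, Status> {
    let (reply, reply_rx) = mpsc::channel();
    tx.send(Command { input, reply }).expect("actor mailbox closed");
    reply_rx.recv().expect("actor dropped the reply")
}
```

A regression test for the real actor could follow the same shape: send a request that forces a handler panic, then assert that the next request for the same user still succeeds.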
2. No actor-level supervision / restart
If the actor task itself dies (panic outside a handler, or after the fix above, an unrecoverable state corruption), there's no restart. registry.rs:65-96 spawns once via tokio::spawn(run_actor(...)) and never re-checks the JoinHandle.
Fix: Wrap the spawn in a supervisor loop that rebuilds the CosignerInstance via new_user_instance() and respawns on panic. Requires making the instance constructor callable inside the supervisor closure (currently the instance is moved into the task once).
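    let join = tokio::spawn(async move {
        loop {
            // Rebuild fresh per-user state on every (re)start.
            let instance = match registry.new_user_instance(&uid).await {
                Ok(instance) => instance,
                Err(_) => {
                    tracing::error!(user_id = %uid, "could not rebuild instance, supervisor exiting");
                    break;
                }
            };
            // Assumption: `shared_rx` is a shared handle such as
            // Arc<Mutex<mpsc::Receiver<CosignerCommand>>>, because the mailbox
            // receiver is not Clone and must survive each actor run.
            let task = tokio::spawn(run_actor(Arc::clone(&shared_rx), instance));
            match task.await {
                Ok(()) => break, // clean shutdown
                Err(e) if e.is_panic() => {
                    tracing::error!(user_id = %uid, "actor panicked, restarting");
                    continue;
                }
                Err(_) => break, // task cancelled
            }
        }
    });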
Note: requires deciding what to do with the in-flight oneshot replies (they'll be dropped, surfacing as Cancelled on the caller side — probably acceptable).
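The restart loop and the dropped-reply behaviour can be tried out with a std-only sketch. Everything in it is an assumption made for illustration: threads stand in for tokio tasks, the receiver is shared through Arc<Mutex<...>> so it survives a restart, and Command, Instance, supervise, and ask are hypothetical names rather than the real cosigner types. The generation counter only exists to make the restart observable.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// A request carrying its own reply channel (stand-in for the oneshot sender).
pub struct Command {
    pub poison: bool,
    pub reply: Sender<u32>,
}

/// Hypothetical per-user state; `generation` records which restart built it.
pub struct Instance {
    pub generation: u32,
}

type Mailbox = Arc<Mutex<Receiver<Command>>>;

/// One actor run. Returns normally once every sender has been dropped.
fn run_actor(mailbox: Mailbox, instance: Instance) {
    loop {
        // The lock guard is released at the end of this statement, so a panic
        // further down cannot poison the mutex.
        let next = mailbox.lock().unwrap().recv();
        let Ok(cmd) = next else { return };
        if cmd.poison {
            // Unwinding drops `cmd.reply`, so the caller sees a closed channel.
            panic!("simulated state corruption");
        }
        let _ = cmd.reply.send(instance.generation);
    }
}

/// Supervisor: rebuilds the instance and respawns the actor after each panic.
/// Returns the number of restarts once the actor shuts down cleanly.
pub fn supervise(rx: Receiver<Command>) -> thread::JoinHandle<u32> {
    let mailbox: Mailbox = Arc::new(Mutex::new(rx));
    thread::spawn(move || {
        let mut generation = 0;
        loop {
            let instance = Instance { generation };
            let worker_mailbox = Arc::clone(&mailbox);
            let worker = thread::spawn(move || run_actor(worker_mailbox, instance));
            match worker.join() {
                Ok(()) => return generation, // clean shutdown
                Err(_) => generation += 1,   // panicked: loop and restart
            }
        }
    })
}

/// Sends one command and returns the reply, or `None` if the reply was dropped.
pub fn ask(tx: &Sender<Command>, poison: bool) -> Option<u32> {
    let (reply, reply_rx) = mpsc::channel();
    tx.send(Command { poison, reply }).ok()?;
    reply_rx.recv().ok()
}
```

In this sketch the request that triggers the panic gets no reply (the caller sees the channel close), while requests queued afterwards are served by the rebuilt instance. That matches the "dropped, probably acceptable" behaviour described in the note.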
3. No idle eviction
Actors stay resident for the lifetime of the process. Each holds a wasmtime Store with linear memory for DKG/signing session state. With many users, memory grows unbounded.
Fix: Track a last_active: Instant per actor, update on each recv(), and have the existing auto-settle tick at main.rs:152-170 also sweep idle actors past a threshold (e.g. 30 min). Send CosignerCommand::Shutdown (command.rs:78) and remove from the DashMap.
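A rough sketch of the sweep step, under stated assumptions: a plain HashMap keyed by a String user ID stands in for the registry's DashMap, and ActorEntry and sweep_idle are hypothetical names. Taking now as a parameter keeps the function deterministic and easy to test; the tick handler would pass Instant::now().

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical registry entry; the real one would also hold the actor handle.
pub struct ActorEntry {
    pub last_active: Instant,
}

/// Removes every actor idle for longer than `max_idle` and returns the evicted
/// user IDs so the caller can send each of them a shutdown command.
pub fn sweep_idle(
    actors: &mut HashMap<String, ActorEntry>,
    now: Instant,
    max_idle: Duration,
) -> Vec<String> {
    let mut evicted = Vec::new();
    actors.retain(|user_id, entry| {
        // Yields zero rather than failing if `last_active` is later than `now`.
        let idle = now.saturating_duration_since(entry.last_active);
        let keep = idle <= max_idle;
        if !keep {
            evicted.push(user_id.clone());
        }
        keep
    });
    evicted.sort(); // deterministic order for logging and tests
    evicted
}
```

One thing this sketch does not cover is the race between the sweep and a request that arrives for the same user at that moment. The real implementation would need to decide whether such a request respawns the actor or is rejected.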
Out of scope
- Migration to the actix actor framework: explicitly rejected; runtime semantics fight enclave deployment, multi-party rendezvous (dispatch_rendezvous!) doesn't fit its reply model, and spawn_blocking would still be needed for WASM CPU work.
- Mailbox-full handling on try_send (main.rs:164-166): current silent-drop is acceptable for TickAutoSettle.
Acceptance
CosignerInstance (test with std::process::abort-style injection in run_actor)