Harden per-user actor model: panic isolation, supervision, idle eviction #30

## Context

The per-user actor pattern in [cosigner-runtime/src/cosigner/](cosigner-runtime/src/cosigner/) (registry + `run_actor` + `CosignerCommand` mpsc dispatch) is solid in shape but has three resilience gaps. They surfaced during a research spike on whether to adopt the `actix` actor framework. The conclusion was to keep the homegrown pattern and close these gaps directly, since they're the only meaningful capabilities `actix` would have added.

## Gaps

### 1. Handler panic kills the per-user actor permanently

In [cosigner-runtime/src/cosigner/actor.rs](cosigner-runtime/src/cosigner/actor.rs) at lines 38, 91, 243, 276 the `spawn_blocking` join results are unwrapped via `.unwrap_or_else(|e| panic!(...))`. A panic in any handler propagates up and terminates the actor's `recv` loop. The `OwnedHandle` in the registry then points at a dead mailbox, and every subsequent request for that user fails with `Status::internal("user actor unavailable")` (see [registry.rs:136-137](cosigner-runtime/src/cosigner/registry.rs#L136-L137)) until the process restarts.

**Fix**: Convert the panic into a per-request error via the existing oneshot reply, and let the actor loop keep running:

```rust
match join_result {
    Ok(Ok(resp))    => { let _ = reply.send(Ok(resp)); }
    Ok(Err(status)) => { let _ = reply.send(Err(status)); }
    Err(e) if e.is_panic() => {
        tracing::error!(user_id = %uid, "handler panicked");
        let _ = reply.send(Err(Status::internal("handler panicked")));
    }
    Err(_) => { /* task cancelled — actor is shutting down */ }
}
```

Apply at all four sites in `actor.rs`.
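
The same mapping can be exercised without tokio. Below is a minimal std-only sketch, assuming a plain thread stands in for `spawn_blocking` (both report a handler panic through the join result). `HandlerError` and `run_isolated` are hypothetical names for illustration, not items in `actor.rs`.

```rust
use std::thread;

/// Hypothetical per-request error; the real code would reply with a `Status`.
#[derive(Debug, PartialEq)]
enum HandlerError {
    Failed(String),
    Panicked,
}

/// Runs `handler` on its own thread and turns a panic into an error value
/// instead of letting it unwind into the caller's loop.
fn run_isolated<T, F>(handler: F) -> Result<T, HandlerError>
where
    F: FnOnce() -> Result<T, String> + Send + 'static,
    T: Send + 'static,
{
    match thread::spawn(handler).join() {
        Ok(Ok(resp)) => Ok(resp),
        Ok(Err(msg)) => Err(HandlerError::Failed(msg)),
        // `join` returns `Err` only when the thread panicked.
        Err(_) => Err(HandlerError::Panicked),
    }
}
```

A caller that invokes `run_isolated` repeatedly keeps working after one handler panics, which is the property the actor's `recv` loop needs.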

### 2. No actor-level supervision / restart

If the actor task itself dies (a panic outside a handler or, even after the fix above, unrecoverable state corruption), nothing restarts it. [registry.rs:65-96](cosigner-runtime/src/cosigner/registry.rs#L65-L96) spawns once via `tokio::spawn(run_actor(...))` and never re-checks the `JoinHandle`.

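**Fix**: Wrap the spawn in a supervisor loop that rebuilds the `CosignerInstance` via `new_user_instance()` and restarts the actor on panic. This requires two changes: the instance constructor must be callable inside the supervisor closure (currently the instance is moved into the task once), and the mailbox receiver must stay owned by the supervisor, since an mpsc receiver cannot be cloned and would otherwise be lost with the panicked task. One way to sketch this, assuming the `futures` crate is available and `run_actor` is changed to borrow the receiver:

```rust
use futures::FutureExt; // for `catch_unwind`
use std::panic::AssertUnwindSafe;

let join = tokio::spawn(async move {
    loop {
        // Rebuild the instance on every (re)start; give up if construction fails.
        // (Logging `?e` assumes the constructor's error type implements `Debug`.)
        let instance = match registry.new_user_instance(&uid).await {
            Ok(instance) => instance,
            Err(e) => {
                tracing::error!(user_id = %uid, error = ?e, "instance rebuild failed");
                break;
            }
        };
        // `rx` stays owned by the supervisor so the mailbox survives a panic.
        // Assumption: `run_actor` takes `&mut` receiver instead of owning it.
        match AssertUnwindSafe(run_actor(&mut rx, instance)).catch_unwind().await {
            Ok(()) => break, // clean shutdown
            Err(_) => {
                tracing::error!(user_id = %uid, "actor panicked, restarting");
                continue;
            }
        }
    }
});
```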

Note: requires deciding what to do with the in-flight oneshot replies (they'll be dropped, surfacing as `Cancelled` on the caller side — probably acceptable).
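
To make the dropped-reply behaviour concrete, here is a small sketch using `std::sync::mpsc` as a stand-in for the oneshot reply channel (an assumption made so the example runs without tokio). When the sender is dropped without sending, the waiting side gets an error instead of hanging. `await_reply` and `simulate` are hypothetical helpers.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical caller-side helper: waits for the reply and maps a dropped
/// sender to an error the caller can translate into a status.
fn await_reply(rx: Receiver<u32>) -> Result<u32, &'static str> {
    rx.recv().map_err(|_| "actor dropped the reply")
}

/// Simulates an actor that either replies with `Some(value)` or dies
/// (drops the sender) before replying when given `None`.
fn simulate(reply: Option<u32>) -> Result<u32, &'static str> {
    let (tx, rx) = channel();
    match reply {
        Some(value) => {
            let _ = tx.send(value);
        }
        None => drop(tx),
    }
    await_reply(rx)
}
```

The caller-side mapping is the only place that needs to decide which status a lost reply becomes.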

### 3. No idle eviction

Actors stay resident for the lifetime of the process. Each holds a wasmtime `Store` with linear memory for DKG/signing session state, so resident memory grows with the number of distinct users seen since startup and is never reclaimed.

**Fix**: Track a `last_active: Instant` per actor, update on each `recv()`, and have the existing auto-settle tick at [main.rs:152-170](cosigner-runtime/src/main.rs#L152-L170) also sweep idle actors past a threshold (e.g. 30 min). Send `CosignerCommand::Shutdown` ([command.rs:78](cosigner-runtime/src/cosigner/command.rs#L78)) and remove from the `DashMap`.
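
The sweep itself can be sketched in isolation. This is a minimal illustration, assuming a plain `HashMap` as a stand-in for the registry's `DashMap` and a hypothetical `ActorEntry` type; the real entry would also carry the mailbox sender used to deliver `CosignerCommand::Shutdown`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical registry entry; the real one would also hold the actor's mailbox sender.
struct ActorEntry {
    last_active: Instant,
}

/// Removes every entry idle for at least `threshold` as of `now` and returns
/// the evicted user ids, so the caller can send each actor a shutdown command.
fn sweep_idle(
    actors: &mut HashMap<String, ActorEntry>,
    now: Instant,
    threshold: Duration,
) -> Vec<String> {
    let mut evicted = Vec::new();
    actors.retain(|user_id, entry| {
        let idle = now.saturating_duration_since(entry.last_active);
        if idle >= threshold {
            evicted.push(user_id.clone());
            false
        } else {
            true
        }
    });
    evicted.sort(); // deterministic order for logging and tests
    evicted
}
```

Taking `now` as a parameter keeps the sweep testable without sleeping, and the threshold stays a plain `Duration` so it can come from configuration.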

## Out of scope

- Migration to `actix` actor framework — explicitly rejected; runtime semantics fight enclave deployment, multi-party rendezvous (`dispatch_rendezvous!`) doesn't fit its reply model, and `spawn_blocking` would still be needed for WASM CPU work.
- Mailbox-full handling on `try_send` ([main.rs:164-166](cosigner-runtime/src/main.rs#L164-L166)) — current silent-drop is acceptable for `TickAutoSettle`.

## Acceptance

- [ ] Handler panics produce a typed error to the caller and leave the actor alive (test with a panicking handler stub)
- [ ] Actor-task panics trigger a restart with a fresh `CosignerInstance` (test with a `panic!` injected into `run_actor`; `std::process::abort` would terminate the whole process and cannot be supervised)
- [ ] Idle actors are evicted after a configurable threshold and re-spawned lazily on next request
