Refactor disable requests query to async + RC actor lifecycle handling#833

Open
Andyz26 wants to merge 3 commits into master from andyz/moveDisableReqQueryToAsync

Andyz26 (Collaborator) commented Mar 10, 2026

Summary

  • Move loadAllDisableTaskExecutorsRequests out of the blocking preStart() and into an async call on default-blocking-io-dispatcher whose result is piped
    back to the actor, with retry (up to 5 attempts, 5s delay). Previously, a data store timeout during this synchronous call caused an
    ActorInitializationException, which MantisActorSupervisorStrategy handles with a permanent stop(), silently killing the ExecutorStateManagerActor
    with no recovery path.
  • Add child actor death detection in ResourceClusterActor: watch the children, handle Terminated by nulling the stale refs (which activates the
    existing null-check error paths), and schedule delayed re-creation.
  • Add a Terminated handler in ResourceClustersManagerActor: remove the stale map entry and stop the surviving sibling actor so that the full pair is
    cleanly recreated on the next request.
  • Add observability metrics (executorStateManagerInitFailure, childActorTerminated, childActorRecreated) with actorType tags.
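
The retry behaviour in the first bullet can be sketched in plain Java, without Akka. This is only an illustration under stated assumptions: AsyncRetrySketch, loadWithRetry, and the simulated data store call are hypothetical names that do not come from this PR, and a ScheduledExecutorService stands in for default-blocking-io-dispatcher. The actual change pipes the result back to the actor rather than returning a future to a caller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper (not Mantis code): runs a blocking load off the caller's thread
// and retries it with a fixed delay, so a startup hook never blocks or throws.
class AsyncRetrySketch {

    // Returns immediately; the future completes with the first successful result,
    // or fails with the last error after `maxAttempts` attempts in total.
    static <T> CompletableFuture<T> loadWithRetry(
            Supplier<T> load, int maxAttempts, long delayMillis,
            ScheduledExecutorService ioExecutor) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(load, 1, maxAttempts, delayMillis, ioExecutor, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<T> load, int attemptNo, int maxAttempts, long delayMillis,
            ScheduledExecutorService ioExecutor, CompletableFuture<T> result) {
        CompletableFuture.supplyAsync(load, ioExecutor).whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (attemptNo >= maxAttempts) {
                // Out of attempts: surface the failure instead of retrying forever.
                result.completeExceptionally(error);
            } else {
                // Schedule the next attempt after the fixed delay.
                ioExecutor.schedule(
                        () -> attempt(load, attemptNo + 1, maxAttempts, delayMillis, ioExecutor, result),
                        delayMillis, TimeUnit.MILLISECONDS);
            }
        });
    }

    public static void main(String[] args) {
        ScheduledExecutorService io = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        // Simulated data store call: fails twice, then succeeds.
        Supplier<String> flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("DEADLINE_EXCEEDED (simulated)");
            }
            return "disable-requests-loaded";
        };
        // The PR describes 5 attempts with a 5s delay; 10 ms keeps this demo fast.
        String value = loadWithRetry(flaky, 5, 10, io).join();
        System.out.println(value + " after " + calls.get() + " attempts");
        io.shutdown();
    }
}
```

The point of the shape is that the caller gets a future back immediately, so a slow or failing data store shows up later as a failed future rather than as an exception thrown during initialization.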

Root Cause

ExecutorStateManagerActor.preStart() made a synchronous, blocking gRPC call to DGW via mantisJobStore.loadAllDisableTaskExecutorsRequests(). When the data store
exceeded its SLO:

  1. CompletionException: DEADLINE_EXCEEDED is thrown in preStart()
  2. Akka wraps it as an ActorInitializationException
  3. MantisActorSupervisorStrategy responds with SupervisorStrategy.stop() (permanent, no restart)
  4. ResourceClusterActor holds a dead ActorRef, so all forwarded messages go silently to dead letters
  5. There is no watch/Terminated handling at any level, so the entire resource cluster becomes non-functional with no detection or recovery
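
Step 1 of this chain, and the reason the async approach avoids it, can be illustrated with plain JDK futures. This is a sketch under assumptions, not the actor code: PreStartFailureSketch, blockingInit, and asyncInit are hypothetical names, and a pre-failed CompletableFuture stands in for the data store call that exceeded its deadline.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in (not Mantis code) contrasting blocking and async initialization.
class PreStartFailureSketch {

    // Simulated data store call that has already failed with a deadline error.
    static CompletableFuture<List<String>> loadDisableRequests() {
        CompletableFuture<List<String>> future = new CompletableFuture<>();
        future.completeExceptionally(new TimeoutException("DEADLINE_EXCEEDED (simulated)"));
        return future;
    }

    // Blocking style: join() rethrows the failure as a CompletionException,
    // so the exception escapes the startup hook itself.
    static int blockingInit() {
        return loadDisableRequests().join().size();
    }

    // Async style: the startup hook returns normally; the failure arrives later
    // as an ordinary value that the caller can log, count, or retry on.
    static CompletableFuture<String> asyncInit() {
        return loadDisableRequests()
                .thenApply(requests -> "loaded " + requests.size() + " requests")
                .exceptionally(error -> "init failed: " + unwrap(error).getMessage());
    }

    // Strips CompletionException wrappers to reach the underlying cause.
    static Throwable unwrap(Throwable error) {
        Throwable current = error;
        while (current instanceof CompletionException && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        try {
            blockingInit();
        } catch (CompletionException e) {
            System.out.println("blocking init threw: " + e.getCause());
        }
        System.out.println(asyncInit().join());
    }
}
```

In the blocking variant the exception propagates out of initialization, which is the situation that led to the permanent stop described above. In the async variant the same failure is just data, which leaves room for retries and metrics.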

Andyz26 force-pushed the andyz/moveDisableReqQueryToAsync branch from 0aaebb1 to fabbad0 on March 10, 2026 23:44
github-actions bot commented Mar 10, 2026

Test Results

774 tests  +3   763 ✅ +3   10m 12s ⏱️ +2s
162 suites +1    11 💤 ±0 
162 files   +1     0 ❌ ±0 

Results for commit 70120bd. ± Comparison against base commit 406e1d7.

♻️ This comment has been updated with latest results.
