
[SPARK-56235][CORE] Add reverse index in TaskSetManager to avoid O(N) scans in executorLost#55030

Open
DenineLu wants to merge 1 commit into apache:master from DenineLu:optimize-executor-lost

Conversation

@DenineLu
Contributor

@DenineLu DenineLu commented Mar 26, 2026

What changes were proposed in this pull request?

This PR adds a reverse index executorIdToTaskIds: HashMap[String, OpenHashSet[Long]] in TaskSetManager to efficiently look up tasks by executor ID, replacing the O(N) full scans over taskInfos in executorLost() with O(K) direct lookups, where K is the number of tasks on the lost executor.

Changes:

  • Added executorIdToTaskIds field in TaskSetManager, populated at task launch in prepareLaunchingTask()
  • Rewrote the two loops in executorLost() to iterate only over tasks on the lost executor via the reverse index

Why are the changes needed?

In a production Spark job (Spark 3.5.1, dynamic allocation enabled, shuffle tracking disabled) with a single stage containing 5 million tasks, we observed that near the end of the stage the Spark UI showed the last few tasks stuck in the RUNNING state for 1-2 hours.
However, executor thread dumps confirmed that no task threads were actually running: the tasks had already completed on the executor side, but the Driver had not yet processed their completion messages.

CPU profiling of the Driver JVM (5-minute snapshot) revealed that TaskSetManager.executorLost() was consuming 99.5% of all CPU samples, due to O(N) full scans over the taskInfos HashMap (N = 5,000,000 entries).

The executorLost() method scans the entire taskInfos map to find tasks on the lost executor:

```scala
// Before: O(N) — scans ALL task attempts to find those on the lost executor
for ((tid, info) <- taskInfos if info.executorId == execId) { ... }
```
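With the reverse index, the loop only needs to visit the K entries recorded for execId. The sketch below contrasts the two lookups on simplified state (plain mutable collections instead of Spark's OpenHashSet and TaskInfo; both methods return the same task ids, only the amount of work differs):

```scala
import scala.collection.mutable

object ExecutorLostSketch {
  val taskInfos = mutable.HashMap.empty[Long, String] // tid -> executorId
  val executorIdToTaskIds = mutable.HashMap.empty[String, mutable.HashSet[Long]]

  def launch(tid: Long, execId: String): Unit = {
    taskInfos(tid) = execId
    executorIdToTaskIds.getOrElseUpdate(execId, mutable.HashSet.empty) += tid
  }

  // Before: O(N) — filter every task attempt in taskInfos.
  def lostTidsScan(execId: String): Set[Long] =
    taskInfos.collect { case (tid, e) if e == execId => tid }.toSet

  // After: O(K) — read only this executor's entry in the reverse index.
  def lostTidsIndexed(execId: String): Set[Long] =
    executorIdToTaskIds.get(execId).map(_.toSet).getOrElse(Set.empty)
}
```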

The stall is amplified when the following conditions coincide:

  1. Long-tail tasks at stage end — a few remaining tasks take longer than spark.dynamicAllocation.executorIdleTimeout (default 60s) to complete. Most executors have finished their work and sit idle, while these slow tasks are still running.
  2. Batch executor removal — once the idle timeout fires, a burst of RemoveExecutor messages is sent to the DriverEndpoint RPC queue.
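Back-of-the-envelope arithmetic on the figures above shows why the combination is pathological (this assumes the 5M task attempts are spread evenly over the 10K removed executors; it is an estimate, not a measurement):

```scala
object CostSketch {
  val removedExecutors = 10000L   // batch of RemoveExecutor messages at stage tail
  val taskAttempts     = 5000000L // N: entries in taskInfos

  // Before: each executorLost() call scans all N entries.
  val visitsBefore = removedExecutors * taskAttempts

  // After: each call touches only its own K = N / removedExecutors tasks,
  // so the whole batch visits each entry once overall.
  val visitsAfter = removedExecutors * (taskAttempts / removedExecutors)
}
```

Under these assumptions the batch goes from ~5×10^10 map-entry visits to ~5×10^6, a 10,000× reduction in Driver-side work.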

After this PR, the same workload (5M tasks, 10K executors, dynamic allocation enabled) no longer exhibits the stall: execution time dropped from 117 minutes to 45 minutes, and the executorLost hotspot at the stage tail is gone.

(Before/after comparison screenshots attached: Job Timeline; Driver CPU Top Threads at the stage tail.)

Memory overhead (measured with jmap -histo:live):

| Metric | Value |
| --- | --- |
| Total added memory | ~81 MB |
| vs taskInfos overhead (829 MB) | ~10% |
| vs Driver heap old gen used (9.3 GB) | < 1% |

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added two tests in TaskSetManagerSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

Member

@sunchao sunchao left a comment


LGTM, thanks!
