Optimization Strategy for Allocating DNs in Rack Awareness #9298
Replies: 1 comment 2 replies
-
I think this makes sense in the given context. However, there may be some tradeoffs. One I can think of: if we always pick a DN under the rack with the highest number of writable DNs, we might create hotspots, since clients will keep picking the same rack (the same set of DNs), lowering write throughput for that particular rack. ContainerPlacementPolicy is also used when deciding which datanode to replicate to, so it might also direct a lot of replication traffic to that rack, consuming much of its network bandwidth. I'm not saying this is the wrong approach, but we may need stronger justification (e.g. analysis of the tradeoffs) when designing a new placement policy. For example, we can look at how the literature evaluates placement policies, such as the CopySet paper, and rigorous discussions like HDFS-1094, where they calculated the probability of data loss. We also need to consider tradeoffs such as data durability and load balancing. However, since the container placement policy is pluggable, you could probably create a new ContainerPlacementPolicy first and see how it performs?
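To make the hotspot concern concrete, here is a minimal sketch (the rack layout is hypothetical, and this is not Ozone's actual ContainerPlacementPolicy API) showing that a greedy "rack with the most writable DNs" rule deterministically funnels every placement to the same rack:

```python
import collections

# Hypothetical rack layout: rack name -> number of writable DataNodes.
racks = {"rack-a": 8, "rack-b": 5, "rack-c": 3}

def pick_rack_greedy(racks):
    """Always choose the rack with the most writable DNs
    (the strategy discussed above, taken to its deterministic extreme)."""
    return max(racks, key=racks.get)

# Simulate 1000 placements: every single one lands on the same rack,
# so that rack's DNs carry all of the write and replication traffic.
counts = collections.Counter(pick_rack_greedy(racks) for _ in range(1000))
print(counts)
```

Until rack-a's writable count drops below rack-b's, the other racks receive no placements at all, which is the load-concentration tradeoff described above.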
-
Hi all,
We recently encountered an issue in our production environment and would like to share the details:
Background
Rack awareness is enabled in our cluster.
The overall storage utilization of the cluster is high (above 70%).
The number of machines within each rack is uneven, with significant differences between racks.
Problem Description
Assume a rack contains N DataNodes, and N-1 of them have already reached the storage threshold and are no longer writable. In this situation, the remaining writable DataNode shows a significantly higher iowait compared to the others.
This single DataNode ends up handling a disproportionately large amount of write traffic, which causes a noticeable drop in overall cluster write throughput.
We believe this behavior is caused by the current DataNode selection strategy when rack awareness is enabled.
If N-1 DataNodes in a rack are already full, then whenever the rack selection logic chooses that rack, the only writable DataNode in it will always be selected. This leads to severe load concentration.
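The load concentration described above can be sketched with a small simulation (the rack counts and DN names here are hypothetical, not our actual topology): if racks are picked uniformly and then a writable DN is picked within the rack, the lone writable DN absorbs its entire rack's share of the traffic:

```python
import random

random.seed(0)
N = 10  # DataNodes per rack (assumed for illustration)

# writable[rack] = list of writable DN ids.
# Rack 0 has N-1 DNs at the storage threshold, leaving only one writable DN.
writable = {0: ["dn-0-9"],
            1: [f"dn-1-{i}" for i in range(N)],
            2: [f"dn-2-{i}" for i in range(N)]}

def place_block():
    rack = random.choice(list(writable))   # racks picked uniformly
    return random.choice(writable[rack])   # then a writable DN in that rack

loads = {}
for _ in range(30_000):
    dn = place_block()
    loads[dn] = loads.get(dn, 0) + 1

# The lone writable DN receives its whole rack's share (~1/3 of all writes),
# roughly N times the load of any single DN in a healthy rack.
print(loads["dn-0-9"], loads["dn-1-0"])
```

With three racks, dn-0-9 ends up near 10,000 placements while each DN in a healthy rack gets around 1,000, matching the iowait imbalance we observed.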
Proposal
Have we considered designing an enhanced rack-selection strategy that incorporates an additional factor — the number of writable DataNodes in each rack?
In theory, a rack with more writable DataNodes should have a proportionally higher probability of being selected. This would naturally lead to better load balancing and help avoid situations where a single DataNode becomes a hotspot under high cluster utilization.
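As a rough sketch of the proposal (hypothetical rack counts, not a real implementation), weighting rack selection by the number of writable DNs equalizes the expected per-DN load:

```python
import random

random.seed(0)
# Hypothetical cluster: rack name -> number of writable DataNodes.
writable_counts = {"rack-a": 1, "rack-b": 10, "rack-c": 10}

def pick_rack_weighted(counts):
    """Select a rack with probability proportional to its writable DN count."""
    racks = list(counts)
    return random.choices(racks, weights=[counts[r] for r in racks])[0]

picks = {r: 0 for r in writable_counts}
for _ in range(21_000):
    picks[pick_rack_weighted(writable_counts)] += 1

# rack-a (1 writable DN) now gets ~1/21 of placements (~1000), so its single
# DN carries the same per-DN share as each DN in the 10-DN racks.
print(picks)
```

Under uniform rack selection, rack-a's single DN would instead receive a full third of the traffic, which is exactly the hotspot scenario described above.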