native_datafusion doesn't use all available parallelism for scan #3817

@comphead

What is the problem the feature request solves?

I observed that Comet does not fully utilize the Spark cluster's parallelism.
Input: 1200 HDFS files; number of Spark planned tasks: 1800. Every file is splittable, so Spark uses all 1800 tasks for scanning and writing the shuffle, whereas Comet uses only 1200 tasks and leaves 600 idle.
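
To make the arithmetic concrete, here is a minimal sketch of one way such a gap could arise. It is not Spark's or Comet's planning code, and the issue does not confirm the cause. The file sizes, the 128 MiB split size, and the `count_scan_tasks` helper are all hypothetical, chosen only so that the totals match the numbers reported above.

```python
import math

MIB = 1024 * 1024


def count_scan_tasks(file_sizes, split_bytes, split_files):
    """Return how many scan tasks a simplified planner would create.

    split_files=True:  each file is cut into byte ranges of at most
                       split_bytes, with one task per range.
    split_files=False: one task per file, regardless of size.
    This ignores bin packing of small ranges, which a real planner may do.
    """
    if not split_files:
        return len(file_sizes)
    return sum(max(1, math.ceil(size / split_bytes)) for size in file_sizes)


# Hypothetical input: 1200 files, half below and half above the split size.
file_sizes = [100 * MIB] * 600 + [200 * MIB] * 600
split_bytes = 128 * MIB

with_splitting = count_scan_tasks(file_sizes, split_bytes, split_files=True)
per_file = count_scan_tasks(file_sizes, split_bytes, split_files=False)

# Tasks with splitting, tasks without, and the number left idle.
print(with_splitting, per_file, with_splitting - per_file)
```

Under these assumed sizes, splitting yields 1800 tasks while a per-file assignment yields 1200, leaving 600 task slots unused. Whether the native_datafusion scan actually behaves this way on HDFS still needs to be verified.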

I was not able to reproduce this locally; I will try on a local HDFS later.

Describe the potential solution

No response

Additional context

No response

Labels

enhancement (New feature or request)
