fix: Native_datafusion reports correct files and bytes scanned #3798

Open
0lai0 wants to merge 1 commit into apache:main from 0lai0:reports_scanned_twice

0lai0 (Contributor) commented Mar 26, 2026

Which issue does this PR close?

Closes #3791

Rationale for this change

In CometScanExec, calling getFilePartitions() unconditionally executes sendDriverMetrics(). Because getFilePartitions() can be evaluated multiple times, both during planning (e.g., converting to CometNativeScanExec) and during execution (e.g., fetching partitions), SQLMetric accumulators such as numFiles and filesSize were incremented once per call. As a result, the Spark UI showed double-counted values.
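
A minimal sketch of the failure mode described above. It does not use Spark: `Metric`, `ScanNode`, and `fileSizes` are hypothetical stand-ins (assumptions for illustration only), and only the method names `getFilePartitions` and `sendDriverMetrics` are taken from the description.

```scala
// Hypothetical stand-ins; this is NOT Spark's SQLMetric or Comet's CometScanExec.
object DoubleCountSketch {
  // Accumulator-style metric that only supports additive updates.
  final class Metric {
    private var v = 0L
    def add(x: Long): Unit = v += x
    def value: Long = v
  }

  // fileSizes holds the size in bytes of each selected file (hypothetical input).
  final class ScanNode(fileSizes: Seq[Long]) {
    val numFiles = new Metric
    val filesSize = new Metric

    // Adds to the metrics every time it runs.
    private def sendDriverMetrics(): Unit = {
      numFiles.add(fileSizes.size.toLong)
      filesSize.add(fileSizes.sum)
    }

    // Reports metrics unconditionally, so each call counts the files again.
    def getFilePartitions(): Seq[Long] = {
      sendDriverMetrics()
      fileSizes
    }
  }
}
```

With two files, calling `getFilePartitions()` twice on the same node leaves `numFiles.value` at 4 rather than 2, which is the doubling reported in the linked issue.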

What changes are included in this PR?

  • Replaced metrics(...).add() with metrics(...).set() in CometScanExec to ensure idempotency when reporting metrics.
  • Wrapped the driver metric updates and Spark listener event dispatching inside a lazy val. This prevents both double-counting during Catalyst transformations (makeCopy) and sending redundant UI events.
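
The two changes can be sketched without Spark as follows. This is an illustration under stated assumptions, not the code in this PR: `Metric`, `ScanNode`, `fileSizes`, `eventsPosted`, and `driverMetricsSent` are hypothetical names, and the real change operates on SQLMetric inside CometScanExec.

```scala
// Hypothetical stand-ins; this is NOT Spark's SQLMetric or Comet's CometScanExec.
object DriverMetricsSketch {
  final class Metric {
    private var v = 0L
    def add(x: Long): Unit = v += x // accumulates: repeating the call changes the value
    def set(x: Long): Unit = v = x  // overwrites: repeating the call is harmless
    def value: Long = v
  }

  final class ScanNode(fileSizes: Seq[Long]) {
    val numFiles = new Metric
    val filesSize = new Metric
    // Stand-in for "a listener event was posted to the UI".
    var eventsPosted: Int = 0

    // A lazy val body runs at most once per instance, on first access.
    private lazy val driverMetricsSent: Unit = {
      numFiles.set(fileSizes.size.toLong)
      filesSize.set(fileSizes.sum)
      eventsPosted += 1
    }

    def getFilePartitions(): Seq[Long] = {
      driverMetricsSent // forces the lazy val; later calls do nothing
      fileSizes
    }
  }
}
```

The two guards cover different cases. The lazy val is per instance, so a copied node would evaluate it again; using `set` means a repeated report of the same values still leaves the metric unchanged.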

How are these changes tested?

  • Added a dedicated end-to-end test in CometExecSuite.
  • The test writes a small Parquet dataset, runs several actions in sequence (count and collect) to force repeated plan evaluation, and asserts that numFiles is exactly 2, with no duplication.

andygrove (Member) left a comment

LGTM. Thanks @0lai0.

andygrove requested a review from comphead March 26, 2026 13:07

mbutrovich (Contributor) left a comment

So basically it's an artifact of wrapping CometNativeScan in CometScan, which we hopefully won't do in the future anyway.

Thanks for the fix in the meantime, @0lai0!

comphead (Contributor) left a comment

Thanks @0lai0 I'll quickly check it out today

comphead (Contributor) left a comment

Somehow the UI now shows 0 for both metrics:

number of files read: 0
size of files read: 0.0 B

spark.range(100).repartition(2).write.mode("overwrite").parquet(path)

withSQLConf(
CometConf.COMET_ENABLED.key -> "true",

Please include --conf spark.comet.scan.impl=native_datafusion

0lai0 (Contributor, Author) commented Mar 27, 2026

Thank you all for the feedback. I’ll investigate this matter and fix it.

Development

Successfully merging this pull request may close these issues.

native_datafusion reports twice more files and bytes scanned

4 participants