@400Ping
Contributor

@400Ping 400Ping commented Dec 22, 2025

Purpose of PR

  • Add a pinned host buffer pool and wire it into the dual-stream pipeline so that each chunk is staged through double-buffered pinned memory before its H2D copy. This reduces malloc/free overhead and GPU idle time.
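
As a rough illustration of the double-buffering idea described above, here is a minimal CPU-only sketch. It is not the code from this PR: the function name `double_buffered_sum` is hypothetical, plain `Vec<f64>` buffers stand in for pinned host memory, and a consumer thread stands in for the H2D copy plus kernel launch. The point is only that two reusable staging buffers let the producer fill one while the consumer works on the other.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical sketch: sums `data` chunk by chunk using exactly two
/// reusable staging buffers. Ordinary Vecs stand in for pinned host
/// buffers, and the consumer thread stands in for the copy + compute stage.
pub fn double_buffered_sum(data: &[f64], chunk_len: usize) -> f64 {
    assert!(chunk_len > 0, "chunk_len must be non-zero");

    // `full` carries filled buffers to the consumer; `free` returns them.
    let (full_tx, full_rx) = mpsc::channel::<Vec<f64>>();
    let (free_tx, free_rx) = mpsc::channel::<Vec<f64>>();

    // Allocate the two staging buffers once, up front.
    for _ in 0..2 {
        free_tx
            .send(Vec::with_capacity(chunk_len))
            .expect("receiver is still alive");
    }

    let consumer = thread::spawn(move || {
        let mut total = 0.0;
        for buf in full_rx {
            total += buf.iter().sum::<f64>();
            // Hand the buffer back for reuse; the producer may already be done.
            let _ = free_tx.send(buf);
        }
        total
    });

    for chunk in data.chunks(chunk_len) {
        // Blocks only when both buffers are still in use by the consumer.
        let mut buf = free_rx.recv().expect("consumer exited early");
        buf.clear();
        buf.extend_from_slice(chunk);
        full_tx.send(buf).expect("consumer exited early");
    }
    drop(full_tx); // signals the consumer that no more chunks are coming

    consumer.join().expect("consumer panicked")
}
```

Because the buffers are recycled through the `free` channel, no allocation happens after the first two, which mirrors the "reduces malloc/free" goal stated in the description.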

Related Issues or PRs

Closes #703

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

@400Ping 400Ping marked this pull request as draft December 22, 2025 13:39
@400Ping 400Ping marked this pull request as ready for review December 22, 2025 13:51
@400Ping
Contributor Author

400Ping commented Dec 22, 2025

@400Ping 400Ping changed the title from "[QDP] Double-buffered async I/O for read_parquet_batch" to "[QDP] Pinned host buffer + dual-stream event pipeline to overlap copy and compute" Dec 22, 2025
@rich7420
Contributor

rich7420 commented Dec 23, 2025

Thanks @400Ping for the patch!

  1. Why is the FFI function defined a second time?
  2. Some tests failed locally due to a tensor shape problem; otherwise, the tests' expected output should be fixed.

@400Ping
Contributor Author

400Ping commented Dec 23, 2025

> Thanks @400Ping for the patch!
>
>   1. Why is the FFI function defined a second time?
>   2. Some tests failed locally due to a tensor shape problem; otherwise, the tests' expected output should be fixed.

My bad, just fixed it.

@400Ping 400Ping marked this pull request as draft December 24, 2025 07:54
@400Ping 400Ping marked this pull request as ready for review December 25, 2025 10:18
Contributor

@rich7420 rich7420 left a comment

@400Ping thanks for the patch!
left some comments

@rich7420 rich7420 marked this pull request as draft December 25, 2025 13:24
@400Ping 400Ping changed the title from "[QDP] Pinned host buffer + dual-stream event pipeline to overlap copy and compute" to "[QDP] Double-buffered pinned I/O pipeline and faster Parquet decode" Dec 25, 2025
@400Ping 400Ping marked this pull request as ready for review December 25, 2025 22:51
@rich7420
Contributor

I think maybe we could add some unit tests for this.

@ryankert01
Contributor

This PR contains two improvements. Based on the benchmark results, I suspect that one of them is not contributing to the speedup. What's your experience?

Signed-off-by: 400Ping <fourhundredping@gmail.com>
This reverts commit 3556b5a.
Signed-off-by: 400Ping <fourhundredping@gmail.com>
@400Ping
Contributor Author

400Ping commented Dec 29, 2025

> This PR contains two improvements. Based on the benchmark results, I suspect that one of them is not contributing to the speedup. What's your experience?

I think both help. The second one is what @rich7420 and @guan404ming suggested: switching to a different decompression technique to improve performance. Overall, though, I think most of the speedup comes from the first one.

Signed-off-by: 400Ping <fourhundredping@gmail.com>
@400Ping
Contributor Author

400Ping commented Jan 1, 2026

cc @guan404ming @ryankert01

@ryankert01
Contributor

ryankert01 commented Jan 1, 2026

Please fix pre-commit. I tested locally and got a 2.8% speedup on the Arrow IPC case.
I will look into it next week.

Signed-off-by: 400Ping <fourhundredping@gmail.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a double-buffered pinned host memory I/O pipeline to improve GPU data transfer performance. The key optimization is adding a reusable pool of pinned host buffers to eliminate repeated CUDA allocation/deallocation overhead in the streaming Parquet decode path.

  • Implements PinnedBufferPool with automatic RAII-based buffer management
  • Refactors PipelineContext to support multiple event slots for double-buffered synchronization
  • Renames PinnedBuffer to PinnedHostBuffer for clarity
  • Moves norm buffer allocation from per-pipeline to per-chunk
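
To make the "automatic RAII-based buffer management" point concrete, here is a minimal sketch of a pool whose buffers return themselves on drop. It is an illustration under stated assumptions, not the `PinnedBufferPool` from this PR: the names `HostBufferPool`, `PooledBuffer`, `try_acquire`, and `available` are hypothetical, and ordinary `Vec<f64>` storage stands in for CUDA pinned host memory.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::{Arc, Mutex};

/// Hypothetical pool of fixed-size host buffers (Vecs stand in for pinned memory).
pub struct HostBufferPool {
    free: Mutex<Vec<Vec<f64>>>,
}

/// RAII handle: the buffer goes back to the pool when the handle is dropped.
pub struct PooledBuffer {
    buf: Vec<f64>,
    pool: Arc<HostBufferPool>,
}

impl HostBufferPool {
    /// Allocates `count` buffers of `len` elements once, up front.
    pub fn new(count: usize, len: usize) -> Arc<Self> {
        let free = (0..count).map(|_| vec![0.0; len]).collect();
        Arc::new(Self { free: Mutex::new(free) })
    }

    /// Returns a buffer if one is free; `None` if the pool is empty
    /// (or if the lock is poisoned, which this sketch treats the same way).
    pub fn try_acquire(self: &Arc<Self>) -> Option<PooledBuffer> {
        let buf = self.free.lock().ok()?.pop()?;
        Some(PooledBuffer { buf, pool: Arc::clone(self) })
    }

    /// Number of buffers currently available for acquisition.
    pub fn available(&self) -> usize {
        self.free.lock().map(|free| free.len()).unwrap_or(0)
    }
}

impl Drop for PooledBuffer {
    fn drop(&mut self) {
        // Move the storage out (leaving an empty Vec, which allocates nothing)
        // and hand it back so the next acquire reuses the same allocation.
        let buf = std::mem::take(&mut self.buf);
        if let Ok(mut free) = self.pool.free.lock() {
            free.push(buf);
        }
    }
}

impl Deref for PooledBuffer {
    type Target = [f64];
    fn deref(&self) -> &[f64] {
        &self.buf
    }
}

impl DerefMut for PooledBuffer {
    fn deref_mut(&mut self) -> &mut [f64] {
        &mut self.buf
    }
}
```

A caller would acquire a handle, fill it through the slice it dereferences to, and simply let it go out of scope; there is no explicit release call to forget on an error path.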

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Show a summary per file
| File | Description |
| --- | --- |
| qdp/qdp-core/src/gpu/buffer_pool.rs | New pinned host buffer pool with acquire/release semantics and automatic return-to-pool on drop |
| qdp/qdp-core/src/gpu/pipeline.rs | Extended PipelineContext to support multiple event slots; integrated pinned buffer pool; improved error handling with Result returns |
| qdp/qdp-core/src/gpu/memory.rs | Renamed PinnedBuffer to PinnedHostBuffer and added immutable slice accessor |
| qdp/qdp-core/src/lib.rs | Integrated buffer pool types; moved norm buffer allocation to per-chunk scope; updated pipeline event handling |
| qdp/qdp-core/src/gpu/mod.rs | Exposed new buffer_pool module and its public types |
| qdp/qdp-core/src/gpu/cuda_ffi.rs | Removed redundant cfg attribute (already applied at module level) |
| qdp/qdp-kernels/tests/amplitude_encode.rs | Refactored test loop to use idiomatic iterator pattern instead of direct indexing |

@guan404ming
Member

guan404ming commented Jan 1, 2026

I agree that the comment regarding .unwrap() is valid.
Do we want to handle this more gracefully, or is the current panic-on-poison behavior expected? If so, we can document it.

@400Ping
Contributor Author

400Ping commented Jan 1, 2026

> I agree that the comment regarding .unwrap() is valid. Do we want to handle this more gracefully, or is the current panic-on-poison behavior expected? If so, we can document it.

I think I will change the code to handle it more gracefully and add some comments to document this behavior.
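
For readers unfamiliar with the `.unwrap()` concern: `Mutex::lock` in the Rust standard library returns an `Err` if another thread panicked while holding the lock, so `lock().unwrap()` turns that into a second panic. The sketch below shows two common alternatives. It is not the change made in this PR; `PoolError`, `lock_or_err`, and `lock_or_recover` are hypothetical names used only for illustration.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
pub enum PoolError {
    LockPoisoned,
}

/// Option 1: surface poisoning to the caller as an ordinary error
/// instead of panicking.
pub fn lock_or_err<T>(m: &Mutex<T>) -> Result<MutexGuard<'_, T>, PoolError> {
    m.lock().map_err(|_| PoolError::LockPoisoned)
}

/// Option 2: recover the guard from the poison error and keep going.
/// This is only sound if the protected data cannot be left in a
/// half-updated state by the panicking thread.
pub fn lock_or_recover<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

Which option fits depends on the invariant the lock protects: a simple free list of buffers can usually be recovered safely, while more complex state may be better reported as an error.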

Signed-off-by: 400Ping <fourhundredping@gmail.com>
@guan404ming
Member

The conflicts need to be resolved; otherwise this looks good to me overall!

Contributor

@ryankert01 ryankert01 left a comment

Looks good, with some suggestions.

Contributor

@rich7420 rich7420 left a comment

@400Ping thanks for the update

@rich7420
Contributor

rich7420 commented Jan 3, 2026

Nit: lines 38-41 in qdp/qdp-core/src/lib.rs appear to be unused and could be removed.

Signed-off-by: 400Ping <fourhundredping@gmail.com>
@rich7420
Contributor

rich7420 commented Jan 3, 2026

Thanks for the update!
Many code paths in this PR return Err(), but they do not synchronize the compute stream before returning.
I think we should address this in another PR.
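
The concern above is that an early `return Err(...)` can leave work in flight on the compute stream. One way this could be handled, sketched here with a hypothetical `Stream` trait and a `MockStream` in place of a real CUDA stream, is a guard that synchronizes when it goes out of scope, so every exit path is covered. This is an assumption about a possible follow-up, not code from this PR.

```rust
use std::cell::Cell;

/// Hypothetical abstraction over a stream that can be synchronized.
pub trait Stream {
    fn synchronize(&self);
}

/// Guard that synchronizes the stream when dropped, including on
/// early returns and `?` propagation.
pub struct SyncOnDrop<'a, S: Stream>(pub &'a S);

impl<S: Stream> Drop for SyncOnDrop<'_, S> {
    fn drop(&mut self) {
        self.0.synchronize();
    }
}

/// Mock stream that only counts how often it was synchronized.
#[derive(Default)]
pub struct MockStream {
    pub syncs: Cell<u32>,
}

impl Stream for MockStream {
    fn synchronize(&self) {
        self.syncs.set(self.syncs.get() + 1);
    }
}

/// Stand-in for a per-chunk processing step that may fail part-way.
pub fn process_chunk(stream: &MockStream, fail: bool) -> Result<(), String> {
    let _guard = SyncOnDrop(stream);
    if fail {
        // The guard still runs here, so the stream is synchronized.
        return Err("simulated failure".to_string());
    }
    Ok(())
}
```

The trade-off is that a drop-based sync cannot report its own failure, so a real implementation might prefer an explicit cleanup call on the error path where the sync result matters.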

@rich7420
Contributor

rich7420 commented Jan 3, 2026

Other than that, overall LGTM.

@400Ping
Contributor Author

400Ping commented Jan 3, 2026

> Thanks for the update! Many code paths in this PR return Err(), but they do not synchronize the compute stream before returning. I think we should address this in another PR.

Should I do this in this PR or open a follow-up?

@rich7420
Contributor

rich7420 commented Jan 4, 2026

In a follow-up.

@guan404ming
Member

Let's refine it after we get into main.

@guan404ming guan404ming merged commit 6ca875d into apache:dev-qdp Jan 4, 2026
2 checks passed
guan404ming pushed a commit that referenced this pull request Jan 6, 2026
…751)

* Double-buffered async I/O for read_parquet_batch

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* fix python binding error

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* fix build error

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* Revert "fix build error"

This reverts commit 3556b5a.

* fix build errors

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update unit test and boundary check

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* remove improvement 2

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* fix qdp-core error

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* fix pre-commit

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* [Fix] fix pre-commit errors & warnings

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* fix rust linters

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* [Fix] handle buffer pool lock poisoning

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* [Chore] fix rust linters

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* Remove unused func

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* update

Signed-off-by: 400Ping <fourhundredping@gmail.com>

---------

Signed-off-by: 400Ping <fourhundredping@gmail.com>