
async cuda stream batch decode #1028

Merged
rapids-bot[bot] merged 72 commits into release/26.04 from feature/async-cuda-stream-batch-decode
Mar 24, 2026

Conversation

@cdinea
Contributor

@cdinea cdinea commented Feb 17, 2026

Summary

Refactors the nvImageCodec batch decoding pipeline in cuslide2 to decouple decode scheduling from completion waiting and to support user-specified CUDA streams, enabling pipelined GPU decode execution across batches.

Problem

The existing decode_batch_regions_nvimgcodec() function is fully synchronous — it schedules decode work via nvimgcodecDecoderDecode(), immediately blocks on nvimgcodecFutureWaitForAll(), and returns results in a single call. This prevents the caller from overlapping I/O preparation of the next batch with the GPU decode of the current batch. Additionally, all decode operations are hardcoded to the default CUDA stream (stream 0), preventing integration with stream-ordered memory allocators or user-managed stream pipelines.

Solution

Split Async API

Introduces two new functions that decompose batch decoding into non-blocking scheduling and blocking completion:

Function                  nvImageCodec API                                         Blocking
schedule_batch_decode()   nvimgcodecDecoderDecode()                                No — returns immediately with a BatchDecodeState
wait_batch_decode()       nvimgcodecFutureWaitForAll() + cudaStreamSynchronize()   Yes — blocks until all regions are decoded

This allows NvImageCodecProcessor::request() to schedule a batch and return immediately, while NvImageCodecProcessor::wait_batch() retrieves results from the oldest in-flight batch — achieving single-batch prefetch pipelining.

Custom CUDA Stream Passthrough

All decode entry points now accept an optional cudaStream_t cuda_stream parameter (defaults to nullptr → default stream). The stream is propagated to nvimgcodecImageInfo_t::cuda_stream so nvImageCodec enqueues GPU decode kernels onto the caller's stream, and cudaStreamSynchronize() is called only when decoding to device memory.

Resource Lifetime Management

Introduces BatchDecodeState using the pimpl pattern to safely carry in-flight decode resources (nvimgcodecFuture_t, RAII-wrapped code streams and images, allocated output buffers) across the async scheduling/waiting boundary. Move-only semantics ensure single ownership, and RAII wrappers with custom deleters (nvimgcodecCodeStreamDestroy, nvimgcodecImageDestroy) guarantee cleanup on exceptions or early returns.

Changes

File                      Description
nvimgcodec_decoder.h      New schedule_batch_decode() / wait_batch_decode() API, BatchDecodeState struct, cudaStream_t parameters
nvimgcodec_decoder.cpp    Split async implementation, BatchDecodeStateImpl (pimpl), RAII helpers (UniqueCodeStream, UniqueImage)
nvimgcodec_processor.h    pending_batches_ / pending_requests_ queues, schedule_roi_batch() return type
nvimgcodec_processor.cpp  Async request() / wait_batch() flow using split API
ifd.cpp                   Include reordering and namespace qualification fixes

Testing

  • Full project builds successfully (./run build_local all release)
  • Unit tests
  • Benchmark: sync vs async batch decode throughput

@cdinea cdinea requested review from a team as code owners February 17, 2026 03:24
@copy-pr-bot

copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cdinea cdinea self-assigned this Feb 17, 2026
@cdinea cdinea added the breaking (Introduces a breaking change) and improvement (Improves an existing functionality) labels Feb 17, 2026
@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from dbf16b1 to 4f3c3e4 Compare February 18, 2026 05:02
@cdinea
Contributor Author

cdinea commented Feb 18, 2026

/ok to test 4f3c3e4

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from 4f3c3e4 to 1d81c3c Compare February 18, 2026 19:31
@cdinea
Contributor Author

cdinea commented Feb 18, 2026

/ok to test 1d81c3c

1 similar comment
@cdinea
Contributor Author

cdinea commented Feb 18, 2026

/ok to test 1d81c3c

@cdinea
Contributor Author

cdinea commented Feb 19, 2026

/ok to test e4d3ec8

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from e4d3ec8 to 1ab1a60 Compare February 19, 2026 18:51
@cdinea
Contributor Author

cdinea commented Feb 19, 2026

/ok to test 1ab1a60

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from 1ab1a60 to 85bd8a7 Compare February 19, 2026 19:03
@cdinea
Contributor Author

cdinea commented Feb 19, 2026

/ok to test 85bd8a7

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from faba493 to 55c6f39 Compare February 20, 2026 22:48
@cdinea
Contributor Author

cdinea commented Feb 20, 2026

/ok to test 55c6f39

@cdinea
Contributor Author

cdinea commented Feb 21, 2026

/ok to test 2a2d7ae

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from 1e1f8e0 to d1f8eda Compare February 24, 2026 01:58
@cdinea
Contributor Author

cdinea commented Feb 24, 2026

/ok to test d1f8eda

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from d0604fa to 191623b Compare March 17, 2026 21:09
@cdinea
Contributor Author

cdinea commented Mar 17, 2026

/ok to test 191623b

Contributor

@grlee77 grlee77 left a comment


Thanks for addressing my comments @cdinea

Contributor

@gigony gigony left a comment


Thanks @cdinea for addressing the comment!
It looks good to me. I have a minor comment on API doc.

Comment on lines +152 to +177
/**
* Schedule batch decoding of multiple regions asynchronously
*
* This function prepares and schedules a batch decode operation but does not wait
* for completion. Use wait_batch_decode() to wait for completion and get results.
*
* @param ifd_info IFD information (resolution level to decode from)
* @param main_code_stream Main TIFF code stream (from TiffFileParser)
* @param regions Vector of ROI specifications (all from the same IFD)
* @param out_device Output device ("cpu" or "cuda")
* @param cuda_stream Optional CUDA stream for asynchronous execution (nullptr = default stream)
* @return BatchDecodeState containing the future and all necessary state
*
* @note The returned state must be passed to wait_batch_decode() to complete the operation
* @note Caller is responsible for cleaning up resources via wait_batch_decode()
*/
/**
* Schedule batch decoding with optional caller-provided output buffers.
*
* When @p output_buffers is non-empty the decoder writes directly into those
* buffers (one per region, same order). The caller retains ownership; the
* returned BatchDecodeState will NOT free them on destruction.
*
* When @p output_buffers is empty (default), the function allocates its own
* buffers, which are freed by wait_batch_decode() or ~BatchDecodeState.
*/
Contributor


Please check this API doc; it may need reformatting.

Contributor Author


thank you for the review @gigony

Contributor Author


addressed in this commit 7fe6054

@@ -134,19 +153,18 @@ class NvImageCodecProcessor : public cucim::loader::BatchDataProcessor
// Request queue
std::mutex request_mutex_;


this is no longer used?

Contributor Author


thank you for the feedback @jantonguirao - request_mutex_ is still used. It is actively locked in nvimgcodec_processor.cpp to protect the next_decode_index_ counter when building a batch of ROI decode requests. Without it, concurrent calls to this function could race on next_decode_index_, leading to duplicate or skipped tile decodes.

Comment on lines 30 to +31
typedef void* nvimgcodecCodeStream_t;
typedef void* cudaStream_t;


If any translation unit includes this header while also including <cuda_runtime.h> (even transitively), this creates a redefinition conflict. Better to forward-declare or use #ifndef cudaStream_t guard, or scope the stub differently.

Contributor Author


thank you for the review @jantonguirao - I think the typedef void* cudaStream_t is inside the #else branch of #ifdef CUCIM_HAS_NVIMGCODEC, while the <cuda_runtime.h> include (which provides the real cudaStream_t) is in the corresponding #ifdef CUCIM_HAS_NVIMGCODEC branch. The two definitions live in mutually exclusive preprocessor branches, so they can never both be active.

Comment on lines +315 to +316
std::atexit([]() {
NvImageCodecTiffParserManager::instance().shutdown();


NvImageCodecTiffParserManager is a singleton so this fires once, but if the singleton is ever reset/recreated in tests, multiple atexit handlers accumulate. Consider a std::once_flag guard around the std::atexit call.

Contributor Author


thank you for the feedback @jantonguirao - I believe a function-local static object is constructed exactly once for the lifetime of the program and cannot be reset or recreated; the C++ standard guarantees this. So the constructor runs once, std::atexit is registered once, and no accumulation can happen.

Comment on lines 393 to +394
const uint32_t load_size =
std::min(static_cast<uint64_t>(batch_size) * (1 + prefetch_factor), adjusted_location_len);
std::min(static_cast<uint64_t>(1), adjusted_location_len);


this seems wrong. The min(1, x) will always be 1. Is it intentional?

Contributor Author


thank you for the feedback @jantonguirao - because the load function does nothing (all actual work is done by the NvImageCodecProcessor batch processor), the load_size passed to loader->request() only controls how many times this no-op is called to set up bookkeeping (queued_item_count_, batch_item_counts_, etc.) for the initial prefetch. Since the no-op enqueues zero thread pool tasks, the value 1 is intentional: it seeds exactly one bookkeeping entry to kick off the batch processor pipeline. So you are right that min(1, x) is always 1 (when x >= 1), but it is intentionally always 1. The std::min is just a safety guard for the degenerate case of adjusted_location_len == 0.

(void)index;
// With the zero-copy path, decoded data is written directly into the
// raster ring buffer. There is no intermediate cache to query.
return nullptr;

Since the zero-copy model makes this method genuinely unsupported (not just no-op), the override should make that explicit:

  std::shared_ptr<cucim::cache::ImageCacheValue> wait_for_processing(uint32_t) override {
      // Zero-copy path: decoded data is written directly into the raster ring
      // buffer via OutputBufferProvider. This method must never be called on
      // NvImageCodecProcessor — use next_data() instead.
      throw std::logic_error("NvImageCodecProcessor: wait_for_processing() is not supported in zero-copy mode");
  }

Or at minimum add an assert(false) so it fails loudly in debug builds rather than returning a nullptr that propagates silently.

Contributor Author


thank you for the feedback @jantonguirao - I think the suggestion is reasonable, but a throw might be the wrong fix.
The nullptr return is actually part of the contract, not a bug. The cuslide2 plugin itself never calls wait_for_processing on NvImageCodecProcessor; it's only overridden because BatchDataProcessor declares it as a virtual method. So you might be right that if someone mistakenly called it in a cuslide2 context, the nullptr would silently propagate. A throw, however, would break the interface contract (callers expect nullptr for the shutdown path), while an assert(false) in debug builds, as you suggested, would be a reasonable safety net without changing runtime behavior. That said, this is a minor defensive-coding preference, and whether to adopt it is a judgment call; @gigony @grlee77, feel free to chime in.

Contributor Author


@jantonguirao @gigony I addressed this feedback in commit 6bff13c by adding the #include to the includes. I also replaced the comment-only no-op with an assert(false) that fires in debug builds if this method is ever accidentally called, while still returning nullptr in release builds to preserve the base class interface contract.

cdinea and others added 2 commits March 23, 2026 13:42
The two adjacent /** */ blocks were causing Doxygen to only associate the
second block with the function, orphaning the @param documentation.
Consolidate into a single block and add missing @param output_buffers.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Mar 23, 2026

/ok to test a760d63

This method should never be called on NvImageCodecProcessor since the
zero-copy path writes directly into the raster ring buffer. The assert
catches accidental misuse in debug builds while preserving the base
class interface contract (nullptr return) in release builds.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Mar 23, 2026

./ok to test 6bff13c

@cdinea
Contributor Author

cdinea commented Mar 24, 2026

/ok to test 6bff13c

Member

@jakirkham jakirkham left a comment


Thanks Cristiana! 🙏

@jakirkham
Member

Looks like everything has been addressed and everyone has approved

Let's go ahead and merge. Thanks all! 🙏

/merge

@jakirkham
Member

/merge

@rapids-bot rapids-bot bot merged commit 594d918 into release/26.04 Mar 24, 2026
68 checks passed
@jakirkham jakirkham deleted the feature/async-cuda-stream-batch-decode branch March 24, 2026 19:05

Labels

breaking: Introduces a breaking change
improvement: Improves an existing functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants