
async cuda stream batch decode #1028

Merged
rapids-bot[bot] merged 72 commits into release/26.04 from feature/async-cuda-stream-batch-decode
Mar 24, 2026

Conversation

@cdinea
Contributor

@cdinea cdinea commented Feb 17, 2026

Summary

Refactors the nvImageCodec batch decoding pipeline in cuslide2 to decouple decode scheduling from completion waiting and to support user-specified CUDA streams, enabling pipelined GPU decode execution across batches.

Problem

The existing decode_batch_regions_nvimgcodec() function is fully synchronous — it schedules decode work via nvimgcodecDecoderDecode(), immediately blocks on nvimgcodecFutureWaitForAll(), and returns results in a single call. This prevents the caller from overlapping I/O preparation of the next batch with the GPU decode of the current batch. Additionally, all decode operations are hardcoded to the default CUDA stream (stream 0), preventing integration with stream-ordered memory allocators or user-managed stream pipelines.

Solution

Split Async API

Introduces two new functions that decompose batch decoding into non-blocking scheduling and blocking completion:

Function                  nvImageCodec API                                         Blocking
schedule_batch_decode()   nvimgcodecDecoderDecode()                                No — returns immediately with a BatchDecodeState
wait_batch_decode()       nvimgcodecFutureWaitForAll() + cudaStreamSynchronize()   Yes — blocks until all regions are decoded

This allows NvImageCodecProcessor::request() to schedule a batch and return immediately, while NvImageCodecProcessor::wait_batch() retrieves results from the oldest in-flight batch — achieving single-batch prefetch pipelining.

Custom CUDA Stream Passthrough

All decode entry points now accept an optional cudaStream_t cuda_stream parameter (defaults to nullptr → default stream). The stream is propagated to nvimgcodecImageInfo_t::cuda_stream so nvImageCodec enqueues GPU decode kernels onto the caller's stream, and cudaStreamSynchronize() is called only when decoding to device memory.

Resource Lifetime Management

Introduces BatchDecodeState using the pimpl pattern to safely carry in-flight decode resources (nvimgcodecFuture_t, RAII-wrapped code streams and images, allocated output buffers) across the async scheduling/waiting boundary. Move-only semantics ensure single ownership, and RAII wrappers with custom deleters (nvimgcodecCodeStreamDestroy, nvimgcodecImageDestroy) guarantee cleanup on exceptions or early returns.

Changes

File                      Description
nvimgcodec_decoder.h      New schedule_batch_decode() / wait_batch_decode() API, BatchDecodeState struct, cudaStream_t parameters
nvimgcodec_decoder.cpp    Split async implementation, BatchDecodeStateImpl (pimpl), RAII helpers (UniqueCodeStream, UniqueImage)
nvimgcodec_processor.h    pending_batches_ / pending_requests_ queues, schedule_roi_batch() return type
nvimgcodec_processor.cpp  Async request() / wait_batch() flow using split API
ifd.cpp                   Include reordering and namespace qualification fixes

Testing

  • Full project builds successfully (./run build_local all release)
  • Unit tests
  • Benchmark: sync vs async batch decode throughput

@cdinea cdinea requested review from a team as code owners February 17, 2026 03:24
@copy-pr-bot

copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cdinea cdinea self-assigned this Feb 17, 2026
@cdinea cdinea added the breaking (Introduces a breaking change) and improvement (Improves an existing functionality) labels Feb 17, 2026
@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from dbf16b1 to 4f3c3e4 Compare February 18, 2026 05:02
@cdinea
Contributor Author

cdinea commented Feb 18, 2026

/ok to test 4f3c3e4

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from 4f3c3e4 to 1d81c3c Compare February 18, 2026 19:31
@cdinea
Contributor Author

cdinea commented Feb 18, 2026

/ok to test 1d81c3c

1 similar comment
@cdinea
Contributor Author

cdinea commented Feb 18, 2026

/ok to test 1d81c3c

@cdinea
Contributor Author

cdinea commented Feb 19, 2026

/ok to test e4d3ec8

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from e4d3ec8 to 1ab1a60 Compare February 19, 2026 18:51
@cdinea
Contributor Author

cdinea commented Feb 19, 2026

/ok to test 1ab1a60

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from 1ab1a60 to 85bd8a7 Compare February 19, 2026 19:03
@cdinea
Contributor Author

cdinea commented Feb 19, 2026

/ok to test 85bd8a7

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from faba493 to 55c6f39 Compare February 20, 2026 22:48
@cdinea
Contributor Author

cdinea commented Feb 20, 2026

/ok to test 55c6f39

@cdinea
Contributor Author

cdinea commented Feb 21, 2026

/ok to test 2a2d7ae

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from 1e1f8e0 to d1f8eda Compare February 24, 2026 01:58
@cdinea
Contributor Author

cdinea commented Feb 24, 2026

/ok to test d1f8eda

@cdinea cdinea force-pushed the feature/async-cuda-stream-batch-decode branch from d0604fa to 191623b Compare March 17, 2026 21:09
@cdinea
Contributor Author

cdinea commented Mar 17, 2026

/ok to test 191623b

Contributor

@grlee77 grlee77 left a comment


Thanks for addressing my comments @cdinea

Contributor

@gigony gigony left a comment


Thanks @cdinea for addressing the comment!
It looks good to me. I have a minor comment on API doc.

Comment on lines +152 to +177
/**
* Schedule batch decoding of multiple regions asynchronously
*
* This function prepares and schedules a batch decode operation but does not wait
* for completion. Use wait_batch_decode() to wait for completion and get results.
*
* @param ifd_info IFD information (resolution level to decode from)
* @param main_code_stream Main TIFF code stream (from TiffFileParser)
* @param regions Vector of ROI specifications (all from the same IFD)
* @param out_device Output device ("cpu" or "cuda")
* @param cuda_stream Optional CUDA stream for asynchronous execution (nullptr = default stream)
* @return BatchDecodeState containing the future and all necessary state
*
* @note The returned state must be passed to wait_batch_decode() to complete the operation
* @note Caller is responsible for cleaning up resources via wait_batch_decode()
*/
/**
* Schedule batch decoding with optional caller-provided output buffers.
*
* When @p output_buffers is non-empty the decoder writes directly into those
* buffers (one per region, same order). The caller retains ownership; the
* returned BatchDecodeState will NOT free them on destruction.
*
* When @p output_buffers is empty (default), the function allocates its own
* buffers, which are freed by wait_batch_decode() or ~BatchDecodeState.
*/
Contributor


Please check this API doc; it may need reformatting.

Contributor Author


thank you for the review @gigony

Contributor Author


addressed in this commit 7fe6054

@@ -134,19 +153,18 @@ class NvImageCodecProcessor : public cucim::loader::BatchDataProcessor
// Request queue
std::mutex request_mutex_;


this is no longer used?

Contributor Author


thank you for the feedback @jantonguirao - request_mutex_ is still used. It is actively locked in nvimgcodec_processor.cpp to protect the next_decode_index_ counter when building a batch of ROI decode requests. Without it, concurrent calls to this function could race on next_decode_index_, leading to duplicate or skipped tile decodes.

Comment on lines 30 to +31
typedef void* nvimgcodecCodeStream_t;
typedef void* cudaStream_t;


If any translation unit includes this header while also including <cuda_runtime.h> (even transitively), this creates a redefinition conflict. Better to forward-declare or use #ifndef cudaStream_t guard, or scope the stub differently.

Contributor Author


thank you for the review @jantonguirao - I think the typedef void* cudaStream_t is inside the #else branch of #ifdef CUCIM_HAS_NVIMGCODEC, while the <cuda_runtime.h> include (which provides the real cudaStream_t) is in the corresponding #ifdef CUCIM_HAS_NVIMGCODEC branch. The two definitions live in mutually exclusive preprocessor branches, so they can never both be active.

Comment on lines +315 to +316
std::atexit([]() {
NvImageCodecTiffParserManager::instance().shutdown();


NvImageCodecTiffParserManager is a singleton so this fires once, but if the singleton is ever reset/recreated in tests, multiple atexit handlers accumulate. Consider a std::once_flag guard around the std::atexit call.

Contributor Author


thank you for the feedback @jantonguirao - I believe a function-local static object is constructed exactly once for the lifetime of the program and cannot be reset or recreated; the C++ standard guarantees this. So the constructor runs once, std::atexit is registered once, and no accumulation can happen.

Comment on lines 393 to +394
const uint32_t load_size =
std::min(static_cast<uint64_t>(batch_size) * (1 + prefetch_factor), adjusted_location_len);
std::min(static_cast<uint64_t>(1), adjusted_location_len);


this seems wrong. The min(1, x) will always be 1. Is it intentional?

Contributor Author


thank you for the feedback @jantonguirao - because the load function does nothing (all actual work is done by the NvImageCodecProcessor batch processor), the load_size passed to loader->request() only controls how many times this no-op is called to set up bookkeeping (queued_item_count_, batch_item_counts_, etc.) for the initial prefetch. Since the no-op enqueues zero thread pool tasks, the value 1 is intentional: it seeds exactly one bookkeeping entry to kick off the batch processor pipeline. So you are right that min(1, x) is always 1 (when x >= 1), but it is intentionally always 1. The std::min is just a safety guard for the degenerate case of adjusted_location_len == 0.

(void)index;
// With the zero-copy path, decoded data is written directly into the
// raster ring buffer. There is no intermediate cache to query.
return nullptr;

Since the zero-copy model makes this method genuinely unsupported (not just no-op), the override should make that explicit:

  std::shared_ptr<cucim::cache::ImageCacheValue> wait_for_processing(uint32_t) override {
      // Zero-copy path: decoded data is written directly into the raster ring
      // buffer via OutputBufferProvider. This method must never be called on
      // NvImageCodecProcessor — use next_data() instead.
      throw std::logic_error("NvImageCodecProcessor: wait_for_processing() is not supported in zero-copy mode");
  }

Or at minimum add an assert(false) so it fails loudly in debug builds rather than returning a nullptr that propagates silently.

Contributor Author


thank you for the feedback @jantonguirao - I think the suggestion is reasonable, but a throw might be the wrong fix.
The nullptr return is actually part of the contract, not a bug. The cuslide2 plugin itself never calls wait_for_processing on NvImageCodecProcessor; it's only overridden because BatchDataProcessor declares it as a virtual method. So you might be right that if someone mistakenly called it in a cuslide2 context, the nullptr would silently propagate. A throw, however, would break the interface contract (callers expect nullptr for the shutdown path), while an assert(false) in debug builds, as you suggested, would be a reasonable safety net without changing runtime behavior. That said, this is a minor defensive-coding preference, and whether to adopt it is a judgment call; @gigony @grlee77, feel free to chime in.

Contributor Author


@jantonguirao @gigony I addressed this feedback in commit 6bff13c by adding the #include to the includes. I also replaced the comment-only no-op with an assert(false) that fires in debug builds if this method is ever accidentally called, while still returning nullptr in release builds to preserve the base class interface contract.

cdinea and others added 2 commits March 23, 2026 13:42
The two adjacent /** */ blocks were causing Doxygen to only associate the
second block with the function, orphaning the @param documentation.
Consolidate into a single block and add missing @param output_buffers.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Mar 23, 2026

/ok to test a760d63

This method should never be called on NvImageCodecProcessor since the
zero-copy path writes directly into the raster ring buffer. The assert
catches accidental misuse in debug builds while preserving the base
class interface contract (nullptr return) in release builds.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Mar 23, 2026

./ok to test 6bff13c

@cdinea
Contributor Author

cdinea commented Mar 24, 2026

/ok to test 6bff13c

Member

@jakirkham jakirkham left a comment


Thanks Cristiana! 🙏

@jakirkham
Member

Looks like everything has been addressed and everyone has approved

Let's go ahead and merge. Thanks all! 🙏

/merge

@jakirkham
Member

/merge

@rapids-bot rapids-bot bot merged commit 594d918 into release/26.04 Mar 24, 2026
68 checks passed
@jakirkham jakirkham deleted the feature/async-cuda-stream-batch-decode branch March 24, 2026 19:05

Labels

breaking: Introduces a breaking change
improvement: Improves an existing functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants