async cuda stream batch decode #1028
Conversation
…rocessor.cpp Co-authored-by: Gregory Lee <grelee@nvidia.com>
…rocessor.h Co-authored-by: Gregory Lee <grelee@nvidia.com>
```cpp
/**
 * Schedule batch decoding of multiple regions asynchronously.
 *
 * This function prepares and schedules a batch decode operation but does not wait
 * for completion. Use wait_batch_decode() to wait for completion and get results.
 *
 * @param ifd_info IFD information (resolution level to decode from)
 * @param main_code_stream Main TIFF code stream (from TiffFileParser)
 * @param regions Vector of ROI specifications (all from the same IFD)
 * @param out_device Output device ("cpu" or "cuda")
 * @param cuda_stream Optional CUDA stream for asynchronous execution (nullptr = default stream)
 * @return BatchDecodeState containing the future and all necessary state
 *
 * @note The returned state must be passed to wait_batch_decode() to complete the operation
 * @note Caller is responsible for cleaning up resources via wait_batch_decode()
 */

/**
 * Schedule batch decoding with optional caller-provided output buffers.
 *
 * When @p output_buffers is non-empty, the decoder writes directly into those
 * buffers (one per region, same order). The caller retains ownership; the
 * returned BatchDecodeState will NOT free them on destruction.
 *
 * When @p output_buffers is empty (default), the function allocates its own
 * buffers, which are freed by wait_batch_decode() or ~BatchDecodeState.
 */
```
```cpp
// @@ -134,19 +153,18 @@ class NvImageCodecProcessor : public cucim::loader::BatchDataProcessor

    // Request queue
    std::mutex request_mutex_;
```
thank you for the feedback @jantonguirao - request_mutex_ is still used. It's actively locked in nvimgcodec_processor.cpp to protect the next_decode_index_ counter when building a batch of ROI decode requests. Without it, concurrent calls to that code path could race on next_decode_index_, leading to duplicate or skipped tile decodes.
```cpp
typedef void* nvimgcodecCodeStream_t;
typedef void* cudaStream_t;
```
If any translation unit includes this header while also including <cuda_runtime.h> (even transitively), this creates a redefinition conflict. Better to forward-declare, add a guard around the typedef, or scope the stub differently.
thank you for the review @jantonguirao - I think the typedef void* cudaStream_t is inside the #else branch of #ifdef CUCIM_HAS_NVIMGCODEC, while the <cuda_runtime.h> include (which provides the real cudaStream_t) is in the corresponding #ifdef CUCIM_HAS_NVIMGCODEC branch. So both definitions are in mutually exclusive preprocessor branches — they can never both be active.
```cpp
std::atexit([]() {
    NvImageCodecTiffParserManager::instance().shutdown();
});
```
NvImageCodecTiffParserManager is a singleton so this fires once, but if the singleton is ever reset/recreated in tests, multiple atexit handlers accumulate. Consider a std::once_flag guard around the std::atexit call.
thank you for the feedback @jantonguirao - I believe a function-local static object is constructed exactly once for the lifetime of the program — it cannot be reset or recreated; I think the C++ standard guarantees this. So the constructor runs once, std::atexit is registered once, and no accumulation can happen.
```diff
 const uint32_t load_size =
-    std::min(static_cast<uint64_t>(batch_size) * (1 + prefetch_factor), adjusted_location_len);
+    std::min(static_cast<uint64_t>(1), adjusted_location_len);
```
this seems wrong. The min(1, x) will always be 1. Is it intentional?
thank you for the feedback @jantonguirao - because the load function does nothing (all actual work is done by the NvImageCodecProcessor batch processor), the load_size passed to loader->request() only controls how many times this no-op is called to set up bookkeeping (queued_item_count_, batch_item_counts_, etc.) for the initial prefetch. Since the no-op enqueues zero thread pool tasks, the value 1 is intentional — it seeds exactly one bookkeeping entry to kick off the batch processor pipeline. So you are right that min(1, x) is always 1 (when x >= 1), but it's intentionally always 1. The std::min is just a safety guard for the degenerate case of adjusted_location_len == 0.
```cpp
(void)index;
// With the zero-copy path, decoded data is written directly into the
// raster ring buffer. There is no intermediate cache to query.
return nullptr;
```
Since the zero-copy model makes this method genuinely unsupported (not just a no-op), the override should make that explicit:

```cpp
std::shared_ptr<cucim::cache::ImageCacheValue> wait_for_processing(uint32_t) override
{
    // Zero-copy path: decoded data is written directly into the raster ring
    // buffer via OutputBufferProvider. This method must never be called on
    // NvImageCodecProcessor — use next_data() instead.
    throw std::logic_error("NvImageCodecProcessor: wait_for_processing() is not supported in zero-copy mode");
}
```

Or at minimum add an assert(false) so it fails loudly in debug builds rather than returning a nullptr that propagates silently.
thank you for the feedback @jantonguirao - I think the suggestion is reasonable, but a throw might be the wrong fix. The nullptr return is actually part of the contract, not a bug. The cuslide2 plugin itself never calls wait_for_processing on NvImageCodecProcessor; it's only overridden because BatchDataProcessor declares it as a virtual method. So you might be right that if someone mistakenly calls it in a cuslide2 context, the nullptr would silently propagate. A throw, however, would break the interface contract (callers expect nullptr for the shutdown path), whereas an assert(false) in debug builds, as you suggested, would be a reasonable safety net without changing runtime behavior. That said, this is a minor defensive-coding preference, and whether to adopt it is a judgment call — @gigony @grlee77 might be able to chime in.
@jantonguirao @gigony I addressed this feedback in commit 6bff13c by adding the #include to the includes. Also replaced the comment-only no-op with an assert(false) that fires in debug builds if this method is ever accidentally called, while still returning nullptr in release builds to preserve the base class interface contract.
This method should never be called on NvImageCodecProcessor since the zero-copy path writes directly into the raster ring buffer. The assert catches accidental misuse in debug builds while preserving the base class interface contract (nullptr return) in release builds. Made-with: Cursor
Looks like everything has been addressed and everyone has approved. Let's go ahead and merge. Thanks all! 🙏 /merge
Summary

Refactors the nvImageCodec batch decoding pipeline in cuslide2 to decouple decode scheduling from completion waiting and to support user-specified CUDA streams, enabling pipelined GPU decode execution across batches.

Problem

The existing decode_batch_regions_nvimgcodec() function is fully synchronous — it schedules decode work via nvimgcodecDecoderDecode(), immediately blocks on nvimgcodecFutureWaitForAll(), and returns results in a single call. This prevents the caller from overlapping I/O preparation of the next batch with the GPU decode of the current batch. Additionally, all decode operations are hardcoded to the default CUDA stream (stream 0), preventing integration with stream-ordered memory allocators or user-managed stream pipelines.

Solution

Split Async API

Introduces two new functions that decompose batch decoding into non-blocking scheduling and blocking completion:

- schedule_batch_decode() — calls nvimgcodecDecoderDecode() and returns a BatchDecodeState
- wait_batch_decode() — calls nvimgcodecFutureWaitForAll() + cudaStreamSynchronize() and returns the results

This allows NvImageCodecProcessor::request() to schedule a batch and return immediately, while NvImageCodecProcessor::wait_batch() retrieves results from the oldest in-flight batch — achieving single-batch prefetch pipelining.

Custom CUDA Stream Passthrough

All decode entry points now accept an optional cudaStream_t cuda_stream parameter (defaults to nullptr → default stream). The stream is propagated to nvimgcodecImageInfo_t::cuda_stream so nvImageCodec enqueues GPU decode kernels onto the caller's stream, and cudaStreamSynchronize() is called only when decoding to device memory.

Resource Lifetime Management

Introduces BatchDecodeState using the pimpl pattern to safely carry in-flight decode resources (nvimgcodecFuture_t, RAII-wrapped code streams and images, allocated output buffers) across the async scheduling/waiting boundary. Move-only semantics ensure single ownership, and RAII wrappers with custom deleters (nvimgcodecCodeStreamDestroy, nvimgcodecImageDestroy) guarantee cleanup on exceptions or early returns.

Changes

- nvimgcodec_decoder.h — schedule_batch_decode() / wait_batch_decode() API, BatchDecodeState struct, cudaStream_t parameters
- nvimgcodec_decoder.cpp — BatchDecodeStateImpl (pimpl), RAII helpers (UniqueCodeStream, UniqueImage)
- nvimgcodec_processor.h — pending_batches_ / pending_requests_ queues, schedule_roi_batch() return type
- nvimgcodec_processor.cpp — request() / wait_batch() flow using split API
- ifd.cpp

Testing

- Built locally (./run build_local all release)