Conversation
/ok to test 87618ac
/ok to test 5ee97fa
gigony
left a comment
Thanks @cdinea !
I reviewed partial commits (https://github.com/rapidsai/cucim/pull/1034/changes/f4333cc2b6e39e4a27fa1d1efc30be2b1b576ec8..34c17ab793ea0653dc42070b607c61a9bab7d383) and it looks good to me.
…r_provider for zero-copy decode
/ok to test b5f8660
Apply pre-commit hook auto-fixes: remove unnecessary f-string prefixes, fix line length, and add missing blank lines. Made-with: Cursor
/ok to test e986c14
Remove unused region_overlap assignments (F841), fix unnecessary f-string prefixes, fix line length, and remove extra blank line. Made-with: Cursor
/ok to test 7c69564
jakirkham
left a comment
Thanks Cristiana! 🙏
Generally looks good
Had a few comments to clean up a couple things
cucim_free(host_raster);
throw std::runtime_error("Failed to allocate GPU buffer for tile-cached output");
}
cudaMemcpy(gpu_buf, host_raster, one_raster_size, cudaMemcpyHostToDevice);
In the future we may want to look at using stream to make this async
Thank you for the feedback @jakirkham - I agree, cudaMemcpyAsync with a stream would allow overlapping the H2D transfer with other work. For now this only runs on the single-location tile-cached path (location_len == 1, batch_size == 1), so there's no pipeline to overlap with, but it would be a worthwhile optimization if we later add prefetching or pipelining to this path. The multi-location batch path already avoids this copy entirely via the zero-copy OutputBufferProvider. Keeping track of this RFE here: #1064
if (out_device.type() == cucim::io::DeviceType::kCUDA)
{
    uint8_t* gpu_buf = nullptr;
    cudaError_t err = cudaMalloc(reinterpret_cast<void**>(&gpu_buf), one_raster_size);
Similarly in the future we could look at making this allocation async
Thank you for the feedback @jakirkham - do you suggest adding cudaMallocAsync here? The batch decode path already uses a custom CUDA stream (decode_stream_ in NvImageCodecProcessor, created with cudaStreamNonBlocking) for async operations. This single-location tile-cached path doesn't have a stream because it runs synchronously in IFD::read() outside the NvImageCodecProcessor pipeline. Plumbing a stream into this path (or using cudaMallocAsync / cudaMemcpyAsync) would be a natural follow-up - tracking this here: #1065
print(f" Cache type: {CuImage.cache().type}")
print(f" Stat recording: {CuImage.cache().record()}")

import numpy as np
Can we move this import to the top-level?
Thank you @jakirkham - addressed this feedback in fbb7e6e
except Exception as e:
    print(f" ❌ Caching test failed: {e}")
    import traceback
Thank you for the feedback @jakirkham - addressed in a2cc720
Raises:
    RuntimeError: If caching assertions fail.
"""
import numpy as np
Let's move this out of the function and to the top-level
Thank you for the feedback @jakirkham - this was already addressed in commit fbb7e6e
Move inline `import numpy as np` statements to the module top-level per Python conventions and reviewer feedback. Made-with: Cursor
/ok to test a2cc720
Add a non-cached baseline read (direct decode, no cache) before enabling the tile cache, then verify pixel-exact match against the cached cold read. This catches any tile decomposition or assembly bugs that produce different output from the direct nvImageCodec path. Made-with: Cursor
Assert that the overlapping read produces both cache hits (shared tiles from previous region) and cache misses (new tiles). Compare the cached overlapping result pixel-for-pixel against a non-cached direct decode of the same region. Made-with: Cursor
1. Replace manual image_cache.unlock() calls with a CacheLockGuard RAII struct that auto-unlocks on destruction. This prevents lock leaks if SharedMemoryImageCache::allocate() or create_value() throw while the per-tile lock is held. 2. Add bits_per_sample_ == 8 && samples_per_pixel_ == 3 to the tile-caching eligibility check, matching the format constraints in is_read_optimizable(). Non-8-bit or non-RGB images now fall through to the direct decode path. Made-with: Cursor
/ok to test 1e19e3c
Made-with: Cursor
/ok to test 23c3fee
The compile-time guard for TIFF tag metadata used #if defined() on enum values (NVIMGCODEC_METADATA_KIND_TIFF_TAG, etc.), which always evaluates to false since these are C enum constants, not preprocessor macros. This caused tile_width_/tile_height_ to remain 0 and disabled tile-level caching entirely. Replace the broken guard with an unconditional #define. Also fix the overlapping-read cache test: the "non-cached baseline" read went through the global cache, pre-populating tiles before the cached read and causing a false assertion failure on zero new misses. Add a deprecation notice to the README for cuslide in favor of cuslide2. Made-with: Cursor
/ok to test 379e3d6
Made-with: Cursor
/ok to test 29c2136
/ok to test 96dd52a
jakirkham
left a comment
Thanks Cristiana! 🙏
Only remaining suggestion from me is to include the version in the deprecation notice
For anything that remains unaddressed, we will make sure to file an issue so we can continue to iterate on the good feedback we have gotten here (thanks all for taking the time to look and share your thoughts 🙏)
Co-authored-by: jakirkham <jakirkham@gmail.com>
/ok to test 48b0416
mkepa-nv
left a comment
I think there is a bug now with how the user-provided buffer is used. IIRC Joaquin noted before that the user-provided buffer is currently never used? But it would be good to fix that anyway
if (caller_owns_buffer)
{
    host_raster = output_buffer;
@cdinea The user-provided buffer can be on device, right? Then this code will not work (it assumes host_raster is in host memory, as it uses memcpy and memset to work with it).
Thank you for the feedback @mkepa-nv - I added a caller_buffer_on_device check: if the caller's buffer is on CUDA, a temporary host staging buffer is allocated for assembly, and the result is cudaMemcpy'd into the caller's device buffer at the end. Only when the caller's buffer is on the CPU does the code write into it directly. Pushed this change in commit ccdb2b6
The tile assembly loop uses host-side memcpy/memset, so host_raster must point to CPU memory. Previously, when the caller provided a pre-allocated GPU buffer, host_raster was set directly to that device pointer, which would segfault. Now a caller_buffer_on_device check ensures a temporary host staging buffer is used for assembly, with the result cudaMemcpy'd into the caller's device buffer at the end. Made-with: Cursor
Made-with: Cursor
/ok to test f032db2
The hardcoded (128,128) offset did not extend past the cached tile grid for 240x240 tiles, causing zero new misses. Derive the shift and read size from the actual tile dimensions so the overlapping read always produces both cache hits and misses regardless of tile size. Made-with: Cursor
/ok to test c8acab3
jakirkham
left a comment
Based on offline discussion it sounds like we want to include cuslide's removal date. Have tried to capture that below
Co-authored-by: jakirkham <jakirkham@gmail.com>
/ok to test 533786c
jakirkham
left a comment
Reapproving. Thanks again Cristiana! Also thanks again to the reviewers 🙏
/merge
Description
Implements tile-level caching for cuslide2's single-region read_region() path.
Key changes