Caching for cuslide2#1034

Merged
rapids-bot[bot] merged 28 commits into release/26.04 from feature/caching
Apr 1, 2026

Conversation

@cdinea
Contributor

@cdinea cdinea commented Mar 3, 2026

Description

Implements tile-level caching for cuslide2's single-region read_region() path.

Key changes

  • Tile-level caching (ifd.cpp)
  • Decomposes each read_region() ROI into its constituent TIFF tiles, performs per-tile cache lookups via cuCIM's existing ImageCache API, decodes only cache-miss tiles via nvImageCodec, inserts decoded tiles into the cache, and assembles the final output raster.
  • Reuses cuslide's cache infrastructure (create_key, find, lock/unlock, allocate, insert) — no new cache backend code.
  • Fixed hash_value_ to include the file handle hash (file_hash ^ splitmix64(ifd_index)) for cross-file cache key uniqueness.
  • Edge tiles are clipped to actual image bounds; decode failures fill with background_value.
  • Tiles always cached in host memory; GPU output transferred via a single cudaMemcpy(H2D) after assembly.
  • Falls back to direct ROI decode when caching is not applicable (strip-based images, out-of-bounds ROI, or no_cache mode).
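The tile decomposition and cache-key scheme above can be sketched as follows. This is a minimal illustration, not cuCIM's actual code: splitmix64 uses the standard public-domain constants, but cache_hash and tiles_for_roi (and their row-major tile indexing) are hypothetical names invented for this sketch.

```cpp
#include <cstdint>
#include <vector>

// splitmix64 finalizer (standard public-domain constants). The PR mixes the
// IFD index through it before XOR-ing with the file hash so cache keys stay
// unique across files and across IFDs within a file.
uint64_t splitmix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Combine the file handle hash with the IFD index, as described above.
uint64_t cache_hash(uint64_t file_hash, uint64_t ifd_index)
{
    return file_hash ^ splitmix64(ifd_index);
}

// Decompose an ROI into the row-major indices of the TIFF tiles it touches.
// Edge tiles are clipped to the image bounds by the assembly code (not shown).
std::vector<uint32_t> tiles_for_roi(uint32_t x, uint32_t y, uint32_t w, uint32_t h,
                                    uint32_t tile_w, uint32_t tile_h,
                                    uint32_t tiles_per_row)
{
    std::vector<uint32_t> tiles;
    for (uint32_t ty = y / tile_h; ty <= (y + h - 1) / tile_h; ++ty)
        for (uint32_t tx = x / tile_w; tx <= (x + w - 1) / tile_w; ++tx)
            tiles.push_back(ty * tiles_per_row + tx);
    return tiles;
}
```

Each index returned by tiles_for_roi would be looked up in the ImageCache; only misses go through nvImageCodec decode before assembly.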

@cdinea cdinea requested review from a team as code owners March 3, 2026 16:53
@copy-pr-bot

copy-pr-bot Bot commented Mar 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cdinea cdinea self-assigned this Mar 3, 2026
@cdinea cdinea added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Mar 3, 2026
@cdinea
Contributor Author

cdinea commented Mar 3, 2026

/ok to test 87618ac

@cdinea cdinea changed the title [WIP] Caching [WIP] Caching for cuslide2 Mar 3, 2026
@cdinea
Contributor Author

cdinea commented Mar 3, 2026

/ok to test 5ee97fa

@cdinea cdinea changed the title [WIP] Caching for cuslide2 Caching for cuslide2 Mar 16, 2026
@jakirkham jakirkham changed the base branch from main to release/26.04 March 17, 2026 18:37
@jakirkham jakirkham requested review from a team as code owners March 17, 2026 18:37
@jakirkham jakirkham requested a review from bdice March 17, 2026 18:37
Contributor

@gigony gigony left a comment


@cdinea
Contributor Author

cdinea commented Mar 29, 2026

/ok to test b5f8660

Apply pre-commit hook auto-fixes: remove unnecessary f-string prefixes,
fix line length, and add missing blank lines.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Mar 29, 2026

/ok to test e986c14

Remove unused region_overlap assignments (F841), fix unnecessary
f-string prefixes, fix line length, and remove extra blank line.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Mar 29, 2026

/ok to test 7c69564

Member

@jakirkham jakirkham left a comment


Thanks Cristiana! 🙏

Generally looks good

Had a few comments to clean up a couple things

cucim_free(host_raster);
throw std::runtime_error("Failed to allocate GPU buffer for tile-cached output");
}
cudaMemcpy(gpu_buf, host_raster, one_raster_size, cudaMemcpyHostToDevice);
Member


In the future we may want to look at using stream to make this async

Contributor Author


thank you for the feedback @jakirkham - I agree, cudaMemcpyAsync with a stream would allow overlapping the H2D transfer with other work. For now this only runs on the single-location tile-cached path (location_len == 1, batch_size == 1), so there's no pipeline to overlap with, but it would be a worthwhile optimization if we later add prefetching or pipelining to this path. The multi-location batch path already avoids this copy entirely via zero-copy OutputBufferProvider - keeping track of this RFE here #1064

if (out_device.type() == cucim::io::DeviceType::kCUDA)
{
uint8_t* gpu_buf = nullptr;
cudaError_t err = cudaMalloc(reinterpret_cast<void**>(&gpu_buf), one_raster_size);
Member


Similarly in the future we could look at making this allocation async

Contributor Author


thank you for the feedback @jakirkham - do you suggest adding cudaMallocAsync here? The batch decode path already uses a custom CUDA stream (decode_stream_ in NvImageCodecProcessor, created with cudaStreamNonBlocking) for async operations. This single-location tile-cached path doesn't have a stream because it runs synchronously in IFD::read() outside the NvImageCodecProcessor pipeline. Plumbing a stream into this path (or using cudaMallocAsync / cudaMemcpyAsync) would be a natural follow-up - tracking this here #1065

Comment thread scripts/test_philips_tiff.py Outdated
print(f" Cache type: {CuImage.cache().type}")
print(f" Stat recording: {CuImage.cache().record()}")

import numpy as np
Member


Can we move this import to the top-level?

Contributor Author


thank you @jakirkham - addressed this feedback here fbb7e6e

Comment thread scripts/test_philips_tiff.py Outdated

except Exception as e:
print(f" ❌ Caching test failed: {e}")
import traceback
Member


Same with this import

Contributor Author


thank you for the feedback @jakirkham - addressed this feedback here a2cc720

Comment thread scripts/test_aperio_svs.py Outdated
Raises:
RuntimeError: If caching assertions fail.
"""
import numpy as np
Member


Let's move this out of the function and to the top-level

Contributor Author


thank you for the feedback @jakirkham - feedback already addressed in commit fbb7e6e

cdinea added 2 commits March 30, 2026 12:33
Move inline `import numpy as np` statements to the module top-level
per Python conventions and reviewer feedback.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Mar 30, 2026

/ok to test a2cc720

cdinea added 3 commits March 31, 2026 11:11
Add a non-cached baseline read (direct decode, no cache) before
enabling the tile cache, then verify pixel-exact match against the
cached cold read. This catches any tile decomposition or assembly
bugs that produce different output from the direct nvImageCodec path.

Made-with: Cursor
Assert that the overlapping read produces both cache hits (shared tiles
from previous region) and cache misses (new tiles). Compare the cached
overlapping result pixel-for-pixel against a non-cached direct decode
of the same region.

Made-with: Cursor
1. Replace manual image_cache.unlock() calls with a CacheLockGuard
   RAII struct that auto-unlocks on destruction. This prevents lock
   leaks if SharedMemoryImageCache::allocate() or create_value()
   throw while the per-tile lock is held.

2. Add bits_per_sample_ == 8 && samples_per_pixel_ == 3 to the
   tile-caching eligibility check, matching the format constraints
   in is_read_optimizable(). Non-8-bit or non-RGB images now fall
   through to the direct decode path.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Mar 31, 2026

/ok to test 1e19e3c

@cdinea
Contributor Author

cdinea commented Mar 31, 2026

/ok to test 23c3fee

The compile-time guard for TIFF tag metadata used #if defined() on enum
values (NVIMGCODEC_METADATA_KIND_TIFF_TAG, etc.), which always evaluates
to false since these are C enum constants, not preprocessor macros. This
caused tile_width_/tile_height_ to remain 0 and disabled tile-level
caching entirely. Replace the broken guard with an unconditional #define.

Also fix the overlapping-read cache test: the "non-cached baseline" read
went through the global cache, pre-populating tiles before the cached
read and causing a false assertion failure on zero new misses.

Add a deprecation notice to the README for cuslide in favor of cuslide2.

Made-with: Cursor
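The guard bug described above is easy to reproduce in isolation: `#if defined(X)` is true only when X is a preprocessor macro, so a C enum constant never satisfies it. The enum name and value below are stand-ins for the real nvImageCodec constants:

```cpp
// Stand-in for the real nvImageCodec enum constant; the value is irrelevant.
enum { NVIMGCODEC_METADATA_KIND_TIFF_TAG = 1 };

// defined() only tests preprocessor macros, never enumerators, so this guard
// is always false: the TIFF-tag metadata code was silently compiled out,
// leaving tile_width_/tile_height_ at 0 and disabling tile caching.
bool tiff_tag_guard_enabled()
{
#if defined(NVIMGCODEC_METADATA_KIND_TIFF_TAG)
    return true;
#else
    return false;
#endif
}
```

The fix replaces the broken `#if defined()` guard with an unconditional `#define`, so the guarded code always compiles.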
@cdinea
Contributor Author

cdinea commented Mar 31, 2026

/ok to test 379e3d6

@cdinea
Contributor Author

cdinea commented Mar 31, 2026

/ok to test 29c2136

@jakirkham
Member

/ok to test 96dd52a

Member

@jakirkham jakirkham left a comment


Thanks Cristiana! 🙏

Only remaining suggestion from me is to put the version of the deprecation

For anything that remains unaddressed, we will make sure to file an issue so we can continue to iterate on the good feedback we have gotten here (thanks all for taking the time to look and share your thoughts 🙏)

Comment thread README.md Outdated
cdinea and others added 2 commits March 31, 2026 22:52
@cdinea
Contributor Author

cdinea commented Apr 1, 2026

/ok to test 48b0416


@mkepa-nv mkepa-nv left a comment


I think there is a bug now with how the user-provided buffer is used. IIRC Joaquin noted before that the user-provided buffer is never used currently? But it would be good to fix that anyway


if (caller_owns_buffer)
{
host_raster = output_buffer;


@cdinea The user-provided buffer can be on device, right? Then this code will not work (it assumes host_raster is in host memory, as it uses memcpy and memset to work with it).

Contributor Author


thank you for the feedback @mkepa-nv - I added a caller_buffer_on_device check: if the caller's buffer is on CUDA, a temporary host staging buffer is allocated for assembly, and the result is cudaMemcpy'd into the caller's device buffer at the end. Only when the caller's buffer is on the CPU does the code write into it directly. Pushed this change in commit ccdb2b6

cdinea added 2 commits April 1, 2026 09:58
The tile assembly loop uses host-side memcpy/memset, so host_raster must
point to CPU memory. Previously, when the caller provided a pre-allocated
GPU buffer, host_raster was set directly to that device pointer, which
would segfault. Now a caller_buffer_on_device check ensures a temporary
host staging buffer is used for assembly, with the result cudaMemcpy'd
into the caller's device buffer at the end.

Made-with: Cursor
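The control flow in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual ifd.cpp code: device_copy stands in for cudaMemcpy(..., cudaMemcpyHostToDevice) so the sketch builds without CUDA, and the function signature is hypothetical.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Stand-in for cudaMemcpy(dst, src, n, cudaMemcpyHostToDevice).
void device_copy(void* dst, const void* src, size_t n) { std::memcpy(dst, src, n); }

// Tile assembly uses host-side memcpy/memset, so host_raster must point to
// CPU memory. If the caller's buffer lives on the device, assemble into a
// temporary host staging buffer and do one H2D copy at the end; only a CPU
// caller buffer is written in place.
uint8_t* assemble(uint8_t* caller_buffer, bool caller_buffer_on_device, size_t raster_size)
{
    bool use_staging = (caller_buffer == nullptr) || caller_buffer_on_device;
    uint8_t* host_raster = use_staging ? static_cast<uint8_t*>(std::malloc(raster_size))
                                       : caller_buffer;

    std::memset(host_raster, 0, raster_size); // ...per-tile memcpy assembly happens here...

    if (caller_buffer_on_device)
    {
        device_copy(caller_buffer, host_raster, raster_size); // single H2D transfer
        std::free(host_raster);
        return caller_buffer;
    }
    return host_raster;
}
```

Before the fix, host_raster was set directly to the caller's device pointer, so the host-side memset/memcpy would have dereferenced device memory and crashed.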
@cdinea
Contributor Author

cdinea commented Apr 1, 2026

/ok to test f032db2

The hardcoded (128,128) offset did not extend past the cached tile grid
for 240x240 tiles, causing zero new misses. Derive the shift and read
size from the actual tile dimensions so the overlapping read always
produces both cache hits and misses regardless of tile size.

Made-with: Cursor
@cdinea
Contributor Author

cdinea commented Apr 1, 2026

/ok to test c8acab3

Member

@jakirkham jakirkham left a comment


Based on offline discussion it sounds like we want to include cuslide's removal date. Have tried to capture that below

Comment thread README.md Outdated
Co-authored-by: jakirkham <jakirkham@gmail.com>
@cdinea
Contributor Author

cdinea commented Apr 1, 2026

/ok to test 533786c

Member

@jakirkham jakirkham left a comment


Reapproving. Thanks again Cristiana! Also thanks again to the reviewers 🙏

@cdinea
Contributor Author

cdinea commented Apr 1, 2026

/merge

@rapids-bot rapids-bot Bot merged commit c121adf into release/26.04 Apr 1, 2026
131 of 132 checks passed
@jakirkham jakirkham deleted the feature/caching branch April 4, 2026 22:36