Signed-off-by: Polina Binder <pbinder@nvidia.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
## Summary

Introduces a chunked dataset format for SCDL, splitting large monolithic datasets into smaller chunks for more efficient loading and future remote storage support.

## Key Features

- Chunked storage exposed through the existing `SingleCellMemMapDataset` interface

## Changes

- `partition_scdl.py`
- `single_cell_memmap_dataset.py`: adds a `to_chunked()` method and chunked loading support

## Usage
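The row addressing that fixed-size chunking enables can be sketched in plain Python. This is a minimal illustration of the idea, not the SCDL implementation; `locate`, `global_idx`, and `chunk_size` are hypothetical names:

```python
def locate(global_idx: int, chunk_size: int) -> tuple[int, int]:
    """Map a global row index to (chunk_id, local_row) for fixed-size chunks."""
    if global_idx < 0 or chunk_size <= 0:
        raise ValueError("global_idx must be >= 0 and chunk_size must be > 0")
    # divmod gives the chunk number and the offset within that chunk in one step
    return divmod(global_idx, chunk_size)

# With 100_000-row chunks, global row 250_000 lands in chunk 2 at local offset 50_000.
```

Because the mapping is pure arithmetic, a reader can look up any row without scanning earlier chunks.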
```python
# Convert an existing SCDL dataset to the chunked format
from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

# Load the existing dataset
ds = SingleCellMemMapDataset("/path/to/scdl")

# Convert to chunked (100k rows per chunk)
chunked_ds = ds.to_chunked("/path/to/chunked_scdl", chunk_size=100_000)

# Use normally
print(len(chunked_ds))       # Same row count
row = chunked_ds.get_row(0)  # Works transparently
```
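Conceptually, a `to_chunked()`-style conversion walks the rows once and emits fixed-size partitions, with the final partition possibly shorter. A pure-Python sketch of that partitioning loop (`split_into_chunks` is a hypothetical helper, not part of SCDL):

```python
from typing import Iterable, Iterator, List

def split_into_chunks(rows: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield successive fixed-size chunks from a row stream; the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    chunk: List = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush any trailing partial chunk
        yield chunk

chunks = list(split_into_chunks(range(10), chunk_size=4))
# chunks -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]; the total row count is preserved
```

Streaming one chunk at a time keeps peak memory bounded by `chunk_size` rows, which is what makes the chunked layout attractive for large datasets and remote storage.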