uint64 header support for cuvs_bench#2130
Conversation
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR introduces shared binary header helpers to enable cuvs-bench to support uint64 headers (16-byte) alongside legacy uint32 headers (8-byte). New functions auto-detect header layout by file-size matching and update vector loading, groundtruth utilities, and dataset conversion scripts to use standardized header I/O instead of manual parsing. ChangesBinary Header Format Upgrade
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cuvs_bench/cuvs_bench/_bin_format.py`:
- Around line 78-90: The function read_bin_header currently uses itemsize in the
legacy size check without validating it, so callers passing itemsize=0 or
non-positive values can cause false positives; add an upfront validation in
read_bin_header to ensure itemsize is an integer > 0 (and optionally raise
TypeError for non-int types and ValueError for <=0), before computing file_size
== LEGACY_HEADER_BYTES + n_rows_32 * n_cols_32 * itemsize; reference the
parameters itemsize and the function read_bin_header and keep the existing
LEGACY_HEADER_BYTES logic after the validation.
- Around line 136-143: The current header writer truncates non-integral
dimensions by calling int(n_rows)/int(n_cols) which can silently corrupt data;
before packing in the branch that writes "<QQ" or "<II", validate that n_rows
and n_cols are integral (accept Python/NumPy integer types or floats that are
exact integers) and reject fractional values by raising a TypeError/ValueError;
use the existing negative check for <0 but add an explicit integrality check for
n_rows and n_cols (reference symbols: n_rows, n_cols, UINT32_MAX,
EXTENDED_HEADER_BYTES, struct.pack("<QQ"/"<II") ) and only cast to int after
passing the check.
In `@python/cuvs_bench/cuvs_bench/backends/_utils.py`:
- Around line 99-106: Validate that subset_size is an integer-like positive
value before using it: after computing itemsize (np.dtype(dtype).itemsize) and
before clamping n_rows, add a check that if subset_size is not None it must be
an integral type (e.g., isinstance(subset_size, numbers.Integral) and not a
bool) and >= 1, otherwise raise ValueError with a clear message including the
received value; keep the existing clamping of n_rows = min(n_rows, subset_size)
and then call read_bin_header/read/reshape as before so downstream code never
receives a float count.
In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py`:
- Around line 80-95: Validate the incoming shape parameter before building
final_shape: in the block using read_bin_header and later where final_shape is
constructed (variables header_dims, final_shape, shape, and the np.memmap call),
require that shape is either None or an iterable of exactly two entries; for
each provided entry ensure it is either None or an integer >= 1, otherwise raise
a ValueError that includes expected "2 dimensions" and the actual shape value;
additionally, when mode indicates writing, ensure provided shape is consistent
with header_dims (or vice versa) and raise a clear error if they conflict so you
never return a memmap with mismatched dimensions.
- Around line 44-46: The public function memmap_bin_file currently removed the
size_dtype kwarg; restore compatibility by accepting size_dtype as an optional
keyword argument (e.g., def memmap_bin_file(..., *, force_uint64=False,
size_dtype=None)) and map it to force_uint64 when provided (prefer explicit
precedence if both given), emitting a DeprecationWarning that size_dtype is
deprecated and will be removed in a future release; update any internal uses
within memmap_bin_file to use the resolved force_uint64 value and ensure callers
passing size_dtype continue to work for one release cycle.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f762686e-0e10-4f6d-8967-b37391a77504
📒 Files selected for processing (6)
python/cuvs_bench/cuvs_bench/_bin_format.pypython/cuvs_bench/cuvs_bench/backends/_utils.pypython/cuvs_bench/cuvs_bench/generate_groundtruth/utils.pypython/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.pypython/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.pypython/cuvs_bench/cuvs_bench/tests/test_utils.py
dantegd
left a comment
There was a problem hiding this comment.
To make this end-to-end for the beyond uint32 rows goal, I think we may need to audit a few non-helper paths too. For example, cpp/bench/ann/src/common/blob.hpp still appears to treat .bin files as two uint32_t header values and uses an 8-byte data offset, so C++ benchmark consumers may not read these extended files correctly yet. Also, python/cuvs_bench/cuvs_bench/generate_groundtruth/main.py still writes ground-truth neighbor IDs as uint32, which may not be enough once base-row IDs can exceed 4B rows. Would it make sense to either update those paths here or call them out as explicit follow-up scope?
|
|
||
| if mode[0] == "r": | ||
| a = np.memmap(bin_file, mode=mode, dtype=size_dtype, shape=(2,)) | ||
| n_rows, n_cols, header_bytes = read_bin_header(bin_file, itemsize) |
There was a problem hiding this comment.
What do you think about adding direct tests for memmap_bin_file with both legacy and extended headers? This is one of the important large-file paths, and this PR changes the read offset detection as well as the write path, so a small round-trip test here would make breakages easy to spot
| """ | ||
| path = tmp_path / "huge.fbin" | ||
| n_rows = UINT32_MAX + 17 | ||
| n_cols = 0 |
There was a problem hiding this comment.
Could we make this use a positive column count and create the expected file size with truncate instead of using n_cols = 0? The zero-column case proves the parser can see a large uint64 value, but it doesn’t really exercise the large positive-shape case this feature is meant to support.
| @pytest.mark.parametrize( | ||
| "ext, dtype, size_dtype", | ||
| [ | ||
| (".fbin", np.float32, np.uint32), |
There was a problem hiding this comment.
Could we include the float16 extension in this parametrized coverage too? load_vectors supports .f16bin, np? So it would be nice to cover that path for both header layouts as well.
|
Hi @dantegd I've addressed the test-related feedback. And thanks for mentioning GT! The default for groundtruth I'll open another PR for the cpp side! |
Closes #2129
This PR adds a extended uint64 header support for
bindata files. It automatically detects based on the filesize, and therefore does not break support for existing files.This is useful for datasets that exceed
uint32max number of rows (4.2B).