`uint64` header support for `cuvs_bench` by jinsolp · Pull Request #2130 · rapidsai/cuvs

jinsolp · 2026-05-26T20:27:04Z

This PR adds extended uint64 header support for bin data files. The header layout is detected automatically from the file size, so existing files remain supported.
This is useful for datasets whose row count exceeds the uint32 maximum (about 4.29B).

coderabbitai · 2026-05-26T20:34:42Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added support for extended binary file headers (uint64) alongside legacy format (uint32) with automatic detection.
- Binary format now automatically validates file integrity and provides detailed error messages for corrupted or truncated files.
Tests
- Expanded test coverage for binary header handling, including round-trip validation across both header formats and large-file scenarios.

Walkthrough

This PR introduces shared binary header helpers to enable cuvs-bench to support uint64 headers (16-byte) alongside legacy uint32 headers (8-byte). New functions auto-detect header layout by file-size matching and update vector loading, groundtruth utilities, and dataset conversion scripts to use standardized header I/O instead of manual parsing.

Changes

Binary Header Format Upgrade

Layer / File(s)	Summary
Binary header read/write helpers `python/cuvs_bench/cuvs_bench/_bin_format.py`	New module defines constants for legacy (8-byte) and extended (16-byte) header sizes. `read_bin_header()` auto-detects layout by unpacking candidate dimensions and verifying file size matches; `write_bin_header()` writes the appropriate header based on dimension fit and `force_uint64` flag.
Vector loading with auto-detected headers `python/cuvs_bench/cuvs_bench/backends/_utils.py`	`load_vectors()` now computes dtype itemsize, calls `read_bin_header()` to detect layout and dimensions, seeks past the header, reads payload bytes, and validates against truncation.
Groundtruth memory-mapping and binary writing `python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py`	`memmap_bin_file()` auto-detects header layout when reading, supports optional shape override, and writes headers via `write_bin_header()`. `write_bin()` also gains `force_uint64` parameter to control extended-header output.
Dataset conversion scripts with standardized headers `python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py`, `python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py`	Both scripts now use `read_bin_header()` and `write_bin_header()` instead of manual header I/O. Functions expose `force_uint64` parameter to control whether extended headers are written.
Comprehensive test suite for header handling `python/cuvs_bench/cuvs_bench/tests/test_utils.py`	Expands coverage to validate header auto-detection, uint64-layout round-trips, overflow behavior, error dispatch for invalid files, and parameterized full and subset reads across both legacy and extended header modes. New `TestBinHeaderHelpers` class exercises the core helper functions directly.
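
For readers who want a concrete picture of the size-matching detection described in the table, here is a minimal sketch. It is not the code in `_bin_format.py`: the names `read_bin_header`, `write_bin_header`, `force_uint64`, `LEGACY_HEADER_BYTES`, `EXTENDED_HEADER_BYTES`, and `UINT32_MAX` come from this thread, but the signatures, the order of the checks (legacy first), and the error messages are assumptions.

```python
import os
import struct

LEGACY_HEADER_BYTES = 8       # two little-endian uint32 values
EXTENDED_HEADER_BYTES = 16    # two little-endian uint64 values
UINT32_MAX = 2**32 - 1


def write_bin_header(f, n_rows, n_cols, force_uint64=False):
    """Write a legacy header if both dims fit in uint32, else an extended one.

    Returns the number of header bytes written.
    """
    if n_rows < 0 or n_cols < 0:
        raise ValueError("dimensions must be non-negative")
    if force_uint64 or n_rows > UINT32_MAX or n_cols > UINT32_MAX:
        f.write(struct.pack("<QQ", n_rows, n_cols))
        return EXTENDED_HEADER_BYTES
    f.write(struct.pack("<II", n_rows, n_cols))
    return LEGACY_HEADER_BYTES


def read_bin_header(path, itemsize):
    """Return (n_rows, n_cols, header_bytes) for the layout whose implied
    file size matches the actual file size."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        raw = f.read(EXTENDED_HEADER_BYTES)

    # Candidate 1: legacy layout (assumed to be tried first in this sketch).
    if len(raw) >= LEGACY_HEADER_BYTES:
        r32, c32 = struct.unpack("<II", raw[:LEGACY_HEADER_BYTES])
        if file_size == LEGACY_HEADER_BYTES + r32 * c32 * itemsize:
            return r32, c32, LEGACY_HEADER_BYTES

    # Candidate 2: extended layout.
    if len(raw) >= EXTENDED_HEADER_BYTES:
        r64, c64 = struct.unpack("<QQ", raw)
        if file_size == EXTENDED_HEADER_BYTES + r64 * c64 * itemsize:
            return r64, c64, EXTENDED_HEADER_BYTES

    raise ValueError(
        f"{path}: size {file_size} matches neither header layout "
        "(file may be truncated or corrupted)"
    )
```

Because detection depends only on the file size, a legacy file never needs rewriting. The sketch does not try to resolve degenerate inputs (for example zero-column files) where both layouts could imply the same size.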

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding uint64 header support for cuvs_bench binary files.
Description check	✅ Passed	The description explains the change (uint64 header support), its benefit (handling larger datasets), and mentions automatic detection to maintain backward compatibility.
Linked Issues check	✅ Passed	The PR successfully implements uint64 header support for cuvs_bench, directly addressing issue `#2129`'s requirement for billion-scale dataset support.
Out of Scope Changes check	✅ Passed	All changes are scoped to implementing uint64 header support with automatic detection and backward compatibility, with comprehensive test coverage.
Docstring Coverage	✅ Passed	Docstring coverage is 81.48% which is sufficient. The required threshold is 80.00%.


coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cuvs_bench/cuvs_bench/_bin_format.py`:
- Around line 78-90: The function read_bin_header currently uses itemsize in the
legacy size check without validating it, so callers passing itemsize=0 or
non-positive values can cause false positives; add an upfront validation in
read_bin_header to ensure itemsize is an integer > 0 (and optionally raise
TypeError for non-int types and ValueError for <=0), before computing file_size
== LEGACY_HEADER_BYTES + n_rows_32 * n_cols_32 * itemsize; reference the
parameters itemsize and the function read_bin_header and keep the existing
LEGACY_HEADER_BYTES logic after the validation.
- Around line 136-143: The current header writer truncates non-integral
dimensions by calling int(n_rows)/int(n_cols) which can silently corrupt data;
before packing in the branch that writes "<QQ" or "<II", validate that n_rows
and n_cols are integral (accept Python/NumPy integer types or floats that are
exact integers) and reject fractional values by raising a TypeError/ValueError;
use the existing negative check for <0 but add an explicit integrality check for
n_rows and n_cols (reference symbols: n_rows, n_cols, UINT32_MAX,
EXTENDED_HEADER_BYTES, struct.pack("<QQ"/"<II") ) and only cast to int after
passing the check.
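
The two `_bin_format.py` findings above could be addressed along these lines. This is a sketch only: `check_itemsize` and `as_exact_int` are hypothetical helper names, and the choice of `TypeError` versus `ValueError` follows the suggestion in the comments rather than the merged code.

```python
import numbers


def check_itemsize(itemsize):
    """Reject non-integer or non-positive itemsize before any size math."""
    if isinstance(itemsize, bool) or not isinstance(itemsize, numbers.Integral):
        raise TypeError(f"itemsize must be an integer, got {itemsize!r}")
    if itemsize <= 0:
        raise ValueError(f"itemsize must be > 0, got {itemsize!r}")
    return int(itemsize)


def as_exact_int(value, name):
    """Convert a dimension to int only if no information would be lost."""
    if isinstance(value, bool) or not isinstance(value, numbers.Real):
        raise TypeError(f"{name} must be a real number, got {value!r}")
    if isinstance(value, numbers.Integral):
        result = int(value)
    elif float(value).is_integer():
        # Floats such as 3.0 are accepted; 3.5, inf, and nan are not.
        result = int(value)
    else:
        raise ValueError(f"{name} must be integral, got {value!r}")
    if result < 0:
        raise ValueError(f"{name} must be non-negative, got {value!r}")
    return result
```

With `itemsize=0` the legacy size check reduces to `file_size == 8`, which is why the validation has to run before the comparison.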

In `@python/cuvs_bench/cuvs_bench/backends/_utils.py`:
- Around line 99-106: Validate that subset_size is an integer-like positive
value before using it: after computing itemsize (np.dtype(dtype).itemsize) and
before clamping n_rows, add a check that if subset_size is not None it must be
an integral type (e.g., isinstance(subset_size, numbers.Integral) and not a
bool) and >= 1, otherwise raise ValueError with a clear message including the
received value; keep the existing clamping of n_rows = min(n_rows, subset_size)
and then call read_bin_header/read/reshape as before so downstream code never
receives a float count.
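
A minimal sketch of the `subset_size` check described above, assuming a standalone helper (`clamp_rows` is a hypothetical name; the real `load_vectors()` may inline this logic):

```python
import numbers


def clamp_rows(n_rows, subset_size=None):
    """Return the number of rows to read, validating subset_size first."""
    if subset_size is None:
        return n_rows
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(subset_size, bool) or not isinstance(
        subset_size, numbers.Integral
    ):
        raise ValueError(
            f"subset_size must be a positive integer, got {subset_size!r}"
        )
    if subset_size < 1:
        raise ValueError(
            f"subset_size must be >= 1, got {subset_size!r}"
        )
    return min(n_rows, int(subset_size))
```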

In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py`:
- Around line 80-95: Validate the incoming shape parameter before building
final_shape: in the block using read_bin_header and later where final_shape is
constructed (variables header_dims, final_shape, shape, and the np.memmap call),
require that shape is either None or an iterable of exactly two entries; for
each provided entry ensure it is either None or an integer >= 1, otherwise raise
a ValueError that includes expected "2 dimensions" and the actual shape value;
additionally, when mode indicates writing, ensure provided shape is consistent
with header_dims (or vice versa) and raise a clear error if they conflict so you
never return a memmap with mismatched dimensions.
- Around line 44-46: The public function memmap_bin_file currently removed the
size_dtype kwarg; restore compatibility by accepting size_dtype as an optional
keyword argument (e.g., def memmap_bin_file(..., *, force_uint64=False,
size_dtype=None)) and map it to force_uint64 when provided (prefer explicit
precedence if both given), emitting a DeprecationWarning that size_dtype is
deprecated and will be removed in a future release; update any internal uses
within memmap_bin_file to use the resolved force_uint64 value and ensure callers
passing size_dtype continue to work for one release cycle.
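
Both `generate_groundtruth/utils.py` findings can be illustrated with small helpers. The following is a sketch under assumptions: `resolve_force_uint64` and `validate_shape` are hypothetical names, `force_uint64=None` is used here as a "not given" sentinel (the comment above shows a default of `False`), and mapping `size_dtype` to `force_uint64` by comparing against `np.uint64` is one possible interpretation.

```python
import numbers
import warnings

import numpy as np


def resolve_force_uint64(force_uint64=None, size_dtype=None):
    """Map the deprecated size_dtype kwarg onto force_uint64.

    An explicitly passed force_uint64 takes precedence over size_dtype.
    """
    if size_dtype is not None:
        warnings.warn(
            "size_dtype is deprecated; use force_uint64 instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if force_uint64 is None:
            force_uint64 = np.dtype(size_dtype) == np.dtype(np.uint64)
    return bool(force_uint64)


def validate_shape(shape):
    """Accept None or a 2-entry iterable of None / integers >= 1."""
    if shape is None:
        return (None, None)
    try:
        dims = tuple(shape)
    except TypeError:
        raise ValueError(f"expected 2 dimensions, got {shape!r}") from None
    if len(dims) != 2:
        raise ValueError(f"expected 2 dimensions, got {shape!r}")
    for d in dims:
        if d is None:
            continue
        if isinstance(d, bool) or not isinstance(d, numbers.Integral) or d < 1:
            raise ValueError(
                f"expected 2 dimensions of None or int >= 1, got {shape!r}"
            )
    return dims
```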


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f762686e-0e10-4f6d-8967-b37391a77504

📥 Commits

Reviewing files that changed from the base of the PR and between 4addf4e and c1df117.

📒 Files selected for processing (6)

python/cuvs_bench/cuvs_bench/_bin_format.py
python/cuvs_bench/cuvs_bench/backends/_utils.py
python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py
python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py
python/cuvs_bench/cuvs_bench/tests/test_utils.py

dantegd

To make this end-to-end for the beyond uint32 rows goal, I think we may need to audit a few non-helper paths too. For example, cpp/bench/ann/src/common/blob.hpp still appears to treat .bin files as two uint32_t header values and uses an 8-byte data offset, so C++ benchmark consumers may not read these extended files correctly yet. Also, python/cuvs_bench/cuvs_bench/generate_groundtruth/main.py still writes ground-truth neighbor IDs as uint32, which may not be enough once base-row IDs can exceed 4B rows. Would it make sense to either update those paths here or call them out as explicit follow-up scope?
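
To illustrate the neighbor-ID concern, here is a small sketch of why uint32 IDs stop being sufficient once base-row indices pass the uint32 maximum. `neighbor_id_dtype` is a hypothetical helper for illustration, not a function in `generate_groundtruth/main.py`.

```python
import numpy as np

UINT32_MAX = int(np.iinfo(np.uint32).max)  # 4294967295


def neighbor_id_dtype(n_base_rows):
    """Pick the narrowest unsigned dtype that can hold every row index."""
    largest_id = n_base_rows - 1
    return np.uint32 if largest_id <= UINT32_MAX else np.uint64


# Casting a too-large ID down to uint32 does not raise; it wraps modulo 2**32.
ids = np.array([UINT32_MAX + 5], dtype=np.uint64)
wrapped = ids.astype(np.uint32)
print(int(wrapped[0]))  # 4
```

The silent wraparound is the risk here: a ground-truth file written as uint32 for a dataset with more than 2**32 rows would contain plausible-looking but wrong neighbor IDs.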

dantegd · 2026-06-04T21:10:36Z


    if mode[0] == "r":
-        a = np.memmap(bin_file, mode=mode, dtype=size_dtype, shape=(2,))
+        n_rows, n_cols, header_bytes = read_bin_header(bin_file, itemsize)


What do you think about adding direct tests for memmap_bin_file with both legacy and extended headers? This is one of the important large-file paths, and this PR changes the read offset detection as well as the write path, so a small round-trip test here would make breakages easy to spot.
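
A round-trip check of the kind requested could look roughly like the sketch below. It does not call the real `memmap_bin_file`; `write_bin`, `open_payload`, and `roundtrip_ok` are stand-ins written for this example, and the reader is handed the header size directly instead of detecting it.

```python
import struct

import numpy as np


def write_bin(path, data, force_uint64=False):
    """Write a 2-D array with a legacy (<II) or extended (<QQ) header."""
    fmt = "<QQ" if force_uint64 else "<II"
    with open(path, "wb") as f:
        f.write(struct.pack(fmt, *data.shape))
        f.write(np.ascontiguousarray(data).tobytes())
    return struct.calcsize(fmt)  # 8 for legacy, 16 for extended


def open_payload(path, dtype, shape, header_bytes):
    """Memory-map the payload, skipping the header via the offset."""
    return np.memmap(
        path, mode="r", dtype=dtype, shape=shape, offset=header_bytes
    )


def roundtrip_ok(path, force_uint64):
    """Write, re-open, and compare; returns (header_bytes, data_matches)."""
    data = np.arange(12, dtype=np.float32).reshape(4, 3)
    header_bytes = write_bin(path, data, force_uint64=force_uint64)
    view = open_payload(path, np.float32, data.shape, header_bytes)
    return header_bytes, bool(np.array_equal(view, data))
```

In a pytest suite this would typically be parametrized over `force_uint64` and the file extension, using the `tmp_path` fixture for the file location.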

dantegd · 2026-06-04T21:13:39Z

+        """
+        path = tmp_path / "huge.fbin"
+        n_rows = UINT32_MAX + 17
+        n_cols = 0


Could we make this use a positive column count and create the expected file size with truncate instead of using n_cols = 0? The zero-column case proves the parser can see a large uint64 value, but it doesn’t really exercise the large positive-shape case this feature is meant to support.
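
One way to build such a file without writing the payload is to write only the header and then extend the file with `truncate`. This is a sketch, with `make_sparse_bin` as a hypothetical helper name; whether the resulting file is actually sparse on disk depends on the filesystem.

```python
import struct

EXTENDED_HEADER_BYTES = 16


def make_sparse_bin(path, n_rows, n_cols, itemsize):
    """Create a file with an extended header and the size a full payload
    would have, without writing the payload bytes."""
    expected_size = EXTENDED_HEADER_BYTES + n_rows * n_cols * itemsize
    with open(path, "wb") as f:
        f.write(struct.pack("<QQ", n_rows, n_cols))
        # Extend the file to its full logical size; on most systems the
        # added region reads back as zeros.
        f.truncate(expected_size)
    return expected_size
```

For the test under discussion, calling this with `n_rows = UINT32_MAX + 17`, `n_cols = 1`, and `itemsize = 4` gives a logical size of roughly 17 GB, so it is only practical where sparse files are supported.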

dantegd · 2026-06-04T21:14:14Z

+    @pytest.mark.parametrize(
+        "ext, dtype, size_dtype",
+        [
+            (".fbin", np.float32, np.uint32),


Could we include the float16 extension in this parametrized coverage too? load_vectors supports .f16bin, no? So it would be nice to cover that path for both header layouts as well.

jinsolp · 2026-06-05T02:42:30Z

Hi @dantegd I've addressed the test-related feedback.

And thanks for mentioning GT! The default for groundtruth is ibin; it now also writes groundtruth files as u64bin if needed.

I'll open another PR for the cpp side!

uint64 header support based on size

c1df117

jinsolp self-assigned this May 26, 2026

jinsolp requested a review from a team as a code owner May 26, 2026 20:27

github-project-automation Bot added this to Unstructured Data Processing May 26, 2026

jinsolp added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 26, 2026

coderabbitai Bot reviewed May 26, 2026


jinsolp moved this to In Progress in Unstructured Data Processing May 27, 2026

jinsolp mentioned this pull request Jun 2, 2026

[FEA] BSDG: Open source the billion-scale synthetic generator #2208

Open

jinsolp added 3 commits June 4, 2026 10:31

Merge branch 'main' into cuvs-bench-uint64-header

24923df

coderabbit reviews

1103bcb

simplify comment

8240e52

dantegd requested changes Jun 4, 2026


jinsolp added 4 commits June 5, 2026 00:37

fix/add tests

751d0f3

gt uint64 support

9a5b9e0

Merge branch 'main' into cuvs-bench-uint64-header

89f2779

shift neighbor index support large dtype

44856bc

jinsolp wants to merge 8 commits into rapidsai:main from jinsolp:cuvs-bench-uint64-header


2 participants
