Switch to public RAFT numpy_serialize APIs#2038
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (6)
💤 Files with no reviewable changes (1)
📝 WalkthroughSummary by CodeRabbit
Note: This release contains internal infrastructure improvements with no changes to end-user functionality. WalkthroughThis PR systematically migrates NumPy serializer API calls across twelve files from internal ChangesNumPy serializer API migration
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@c/src/neighbors/brute_force.cpp`:
- Around line 242-244: The code reads 4 bytes into dtype_string using is.read
and immediately calls raft::numpy_serializer::parse_descr on those bytes;
however it doesn't check whether the read succeeded. Update the block around
dtype_string/is.read to verify the stream read (e.g., check is.gcount() == 4 or
is.fail()/is.good()) before calling raft::numpy_serializer::parse_descr, and on
failure handle the truncated header by returning an error/throwing an exception
or logging and aborting the parse so parse_descr never receives uninitialized
data.
In `@c/src/neighbors/cagra.cpp`:
- Around line 877-880: The code reads 4 bytes into dtype_string then calls
raft::numpy_serializer::parse_descr without validating the read; first check the
read succeeded (e.g., verify is.read(...) and that is.gcount() == 4 or that the
stream is in a good state) before constructing std::string(dtype_string, 4) and
calling raft::numpy_serializer::parse_descr; if the read fails/returns fewer
than 4 bytes, handle the error (throw, return an error code, or log and abort)
instead of passing potentially truncated data to parse_descr.
In `@c/src/neighbors/ivf_flat.cpp`:
- Around line 303-306: The code reads four bytes into dtype_string then calls
raft::numpy_serializer::parse_descr without verifying the read succeeded;
validate the read length immediately after is.read(dtype_string, 4) (e.g., check
is.gcount() == 4 or test stream state like if (!is || is.gcount() != 4)) and on
short reads/failure throw or return a clear error (or set an error status)
before calling raft::numpy_serializer::parse_descr, so malformed or truncated
files are handled safely.
In `@c/src/neighbors/mg_cagra.cpp`:
- Around line 403-406: The 4-byte dtype header read into dtype_string before
calling raft::numpy_serializer::parse_descr is not validated; check the istream
read result (e.g., is.read(...) and then is.gcount() == 4 or !is.fail()) and
handle short/truncated reads by logging/throwing/returning an error instead of
calling parse_descr with partial data; apply the same validation to the other
occurrence that reads 4 bytes so neither is.read -> parse_descr path can receive
garbage.
In `@c/src/neighbors/mg_ivf_flat.cpp`:
- Around line 400-403: The code reads 4 bytes into dtype_string and calls
raft::numpy_serializer::parse_descr without validating the read; update both the
deserialize and distribute code paths to check is.read(...) and the stream state
(e.g., verify gcount() == 4 or is.good()/is) before constructing
std::string(dtype_string,4) and calling parse_descr, and handle truncated reads
by returning/throwing an error or reporting via existing error path; reference
the dtype_string buffer, the is.read call, and the parse_descr invocation so you
change both occurrences (around the calls near deserialize and distribute).
In `@c/src/neighbors/mg_ivf_pq.cpp`:
- Around line 392-395: The code calls is.read into dtype_string and immediately
hands those bytes to raft::numpy_serializer::parse_descr without checking the
read result; update the logic around dtype_string/is.read to validate that 4
bytes were actually read (e.g., check is.gcount() == 4 and/or
is.good()/is.fail()) before calling parse_descr, and handle the short-read error
path (return an error, throw, or log and exit) so parse_descr is never called
with incomplete data; reference the dtype_string buffer, the is.read call, and
raft::numpy_serializer::parse_descr when making the change and ensure is.close()
still runs in all code paths.
In `@cpp/cmake/thirdparty/get_raft.cmake`:
- Around line 9-10: The CMake variables RAFT_FORK and RAFT_PINNED_TAG are
hard-pinned to a personal fork/branch; change the defaults to use the official
upstream (e.g., "rapidsai/raft") and a safe default tag (empty or a release
tag), and expose RAFT_FORK and RAFT_PINNED_TAG as CMake CACHE variables so they
can only be overridden explicitly via -D on the command line; additionally gate
any non-upstream usage behind an opt-in flag (e.g., USE_CUSTOM_RAFT or
RAFT_USE_FORK) so CI/releases use the official repo by default and personal-fork
overrides are explicitly documented and temporary.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 393379db-9f11-4936-953c-40a2df4cec4c
📒 Files selected for processing (15)
c/src/neighbors/brute_force.cppc/src/neighbors/cagra.cppc/src/neighbors/ivf_flat.cppc/src/neighbors/mg_cagra.cppc/src/neighbors/mg_ivf_flat.cppc/src/neighbors/mg_ivf_pq.cppcpp/cmake/thirdparty/get_raft.cmakecpp/include/cuvs/neighbors/cagra.hppcpp/include/cuvs/util/file_io.hppcpp/src/neighbors/brute_force_serialize.cucpp/src/neighbors/detail/cagra/cagra_build.cuhcpp/src/neighbors/detail/cagra/cagra_serialize.cuhcpp/src/neighbors/detail/hnsw.hppcpp/src/neighbors/ivf_flat/ivf_flat_serialize.cuhcpp/src/neighbors/mg/snmg.cuh
cuVS uses
parse_descr,read_header,get_numpy_dtype,write_headerfromraft::detail::numpy_serializeras discussed here. This PR proposes to expose them publicly in RAFT instead and use them throughout cuVS.There are similar issues with the
raft/core/detail/macros.hppto be address in a separate PR.Depends on rapidsai/raft#3003