[REVIEW] New Dataset API Clarifying Ownership by HowardHuang1 · Pull Request #1846 · NVIDIA/cuvs

HowardHuang1 · 2026-02-24T23:21:21Z

Overview

Addressing #1574 and #1571.

Replaced strided_dataset with padded_dataset class. Added support all the way up to CAGRA code.

Old class structure (Classes + Inheritance):

Intermediate Class Structure (Classes + Inheritance):

dataset and dataset_view are now 2 separate parent classes.

Intermediate Class Structure (ContainerType Tags + Composition + Variants):

New Class Structure (ContainerType Tags + Composition):

Inheritance is removed entirely and all dataset types are on the same level of the inheritance tree.
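To make the "ContainerType tags + composition" idea concrete, here is a minimal host-only sketch. Every name in it (`padded_tag`, `owning_storage`, `view_storage`, `toy_dataset`, `as_view`) is hypothetical and chosen for illustration; the PR's actual classes and tags may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical tags; the PR's real tag and class names may differ.
struct padded_tag {};
struct vpq_tag {};

// Storage layer: holds the data members. Owning storage keeps a vector,
// view storage keeps a raw pointer into memory owned by someone else.
template <typename T>
struct owning_storage {
  std::vector<T> buf;
  std::size_t n_rows = 0, dim = 0, stride = 0;
  const T* data() const { return buf.data(); }
};

template <typename T>
struct view_storage {
  const T* ptr = nullptr;
  std::size_t n_rows = 0, dim = 0, stride = 0;
  const T* data() const { return ptr; }
};

// Dataset layer: one template composed from a tag and a storage type.
// No base class and no virtual functions are involved.
template <typename Tag, typename Storage>
struct toy_dataset {
  using container_tag = Tag;
  Storage storage;
  std::size_t n_rows() const { return storage.n_rows; }
  std::size_t dim() const { return storage.dim; }
  std::size_t stride() const { return storage.stride; }
};

template <typename T>
using toy_padded_dataset = toy_dataset<padded_tag, owning_storage<T>>;
template <typename T>
using toy_padded_dataset_view = toy_dataset<padded_tag, view_storage<T>>;

// An owning dataset hands out a view; the view never frees anything.
template <typename T>
toy_padded_dataset_view<T> as_view(const toy_padded_dataset<T>& d) {
  return {view_storage<T>{d.storage.data(), d.n_rows(), d.dim(), d.stride()}};
}
```

In this sketch the owning and non-owning variants are siblings built from the same template, so neither derives from the other.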

Ownership

The index and cagra::build / cagra::index do not own raw vector storage; they only take views.

Callers (or the C merged holder) must keep backing memory alive for as long as the index is used.

The old code had a type-erased std::unique_ptr<dataset_view<...>>, i.e. non-owning view handles. The new code uses templates on the index type which determines the type of dataset_view the index holds.
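The ownership contract above can be sketched with toy host-only types. `toy_view`, `toy_index`, and `toy_build` are hypothetical stand-ins and not the cuVS API; the point is only that the index is templated on the view type it stores and that the caller's storage must outlive it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over row-major rows with a padded stride.
struct toy_view {
  const float* ptr;
  std::size_t n_rows, dim, stride;
  const float* row(std::size_t i) const { return ptr + i * stride; }
};

// The index is templated on the view type it stores, instead of holding a
// type-erased pointer. It stores the view by value and never frees the rows.
template <typename DatasetViewT>
class toy_index {
 public:
  explicit toy_index(DatasetViewT v) : dataset_(v) {}
  const DatasetViewT& data() const noexcept { return dataset_; }

 private:
  DatasetViewT dataset_;
};

// Stand-in for build(): takes a view only, so no ownership is transferred.
template <typename DatasetViewT>
toy_index<DatasetViewT> toy_build(DatasetViewT v) {
  return toy_index<DatasetViewT>(v);
}

// Caller-side usage: the caller owns `rows` and must keep it alive while
// the index is in use.
inline float first_element_via_index() {
  std::vector<float> rows(2 * 4, 0.0f);  // 2 rows, dim 3, stride 4
  rows[0] = 7.0f;
  auto idx = toy_build(toy_view{rows.data(), 2, 3, 4});
  return idx.data().row(0)[0];  // valid only because `rows` is still in scope
}
```

If `rows` were destroyed before the last use of `idx`, the view would dangle; nothing in the index type prevents that, which is why the lifetime rule sits with the caller.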

ACE vs. non-ACE paths on Host

ACE path copies datasets that can't entirely fit in CPU memory in chunks onto GPU memory by calling make_padded_dataset. This is 1x memory on CPU and 1x memory on GPU.

Return types:

Used mainly to maintain lifetime of dataset.

cuvs_cagra_c_api_lifetime_holder

unique_ptr<vpq_dataset> vpq_owner
unique_ptr<dataset<>> padded_dataset_owner
raft::device_matrix dataset
cagra::index idx
It is a single C++ struct in cagra.cpp that groups the real cagra::index with any extra heap-owned things the C API had to create so the index’s non-owning views stay valid.

cuvs_cagra_c_api_lifetime_holder is a separate heap object from cagra::index. It is heap-allocated in cagra.cpp with new cuvs_cagra_c_api_lifetime_holder<...>. The C API keeps a raw pointer to it in cuvsCagraIndex.cuvs_cagra_c_api_lifetime_holder. It is not embedded in the index, which is why the C layer needs that second field to delete the holder on destroy.

Heap-allocated bundle for the C API: owns cagra::index and any co-owned device storage (VPQ, padded dataset copy, merge/de-serialize/extend buffers) when the index is not standalone. cuvsCagraIndex.c_api_lifetime_owner points at this. Used for merge, build, deserialize, from_args, extend.

The holder moves the owning device_padded_dataset (as unique_ptr<dataset<>> in padded_dataset_owner) to the heap, and cuvsCagraIndex.merged_owner points at the holder. Destroying the C index later destroys the holder first, so the dataset outlives the index’s use of the view, or the ordering is set up so the view is not used after free.

In cuvsCagraIndexFromArgs in cagra.cpp (C API) where callers are things like the Python cagra.from_graph (via Cython) and the Java CagraIndexImpl, and any C code that uses that function:

The flow is: caller → cuvsCagraIndexFromArgs → _from_args, which writes into the cuvsCagraIndex struct the user passed.

The holder is not returned as a separate C return value. It is allocated on the heap and its address is stored in output_index->merged_owner, and output_index->addr points at the index inside that holder (or at a freestanding index when merged_owner == 0).

So when _from_args returns, the user’s cuvsCagraIndex already holds the pointers that describe where everything lives.

The unique_ptr to the copy of the dataset from make_padded_dataset is not local to _from_args—it is a member of the holder, which is on the heap and stays alive.
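A minimal sketch of the holder pattern described in this section, using toy host-only types. The names `toy_lifetime_holder`, `toy_c_index`, `toy_from_args`, and `toy_destroy` are hypothetical; only the field names `addr` and `merged_owner` are taken from the description above, and the real C API code in cagra.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy stand-ins; none of these are the real cuVS types.
struct toy_view { const float* ptr; std::size_t n; };
struct toy_index { toy_view dataset; };

// Counts destroyed holders so a caller can observe the cleanup.
inline int& holders_destroyed() { static int n = 0; return n; }

// Heap-allocated bundle: owns the storage and the index that views it.
// Members are destroyed in reverse declaration order, so `idx` goes away
// before the storage it points into.
struct toy_lifetime_holder {
  std::unique_ptr<std::vector<float>> padded_dataset_owner;
  toy_index idx;
  ~toy_lifetime_holder() { ++holders_destroyed(); }
};

// C-style handle with two integer fields, mirroring `addr` and `merged_owner`.
struct toy_c_index {
  std::uintptr_t addr;          // points at the index
  std::uintptr_t merged_owner;  // points at the holder, or 0 if standalone
};

// Copies the caller's rows into holder-owned storage and fills the handle.
inline void toy_from_args(const float* src, std::size_t n, toy_c_index* out) {
  auto* h = new toy_lifetime_holder{};
  h->padded_dataset_owner = std::make_unique<std::vector<float>>(src, src + n);
  h->idx.dataset = toy_view{h->padded_dataset_owner->data(), n};
  out->addr = reinterpret_cast<std::uintptr_t>(&h->idx);
  out->merged_owner = reinterpret_cast<std::uintptr_t>(h);
}

// Deletes the holder when there is one, otherwise the freestanding index.
inline void toy_destroy(toy_c_index* c) {
  if (c->merged_owner != 0) {
    delete reinterpret_cast<toy_lifetime_holder*>(c->merged_owner);
  } else {
    delete reinterpret_cast<toy_index*>(c->addr);
  }
  c->addr = 0;
  c->merged_owner = 0;
}
```

Because the copy lives in the holder rather than in a local variable of `toy_from_args`, the view stored in the index stays valid after the function returns.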

Miscellaneous: Extend Serialize Deserialize

Will fill in later

Factories:

make_device_padded_dataset_view
make_host_padded_dataset_view
make_device_padded_dataset
make_host_padded_dataset
make_vpq_dataset
- in pq.hpp and pq.cu
make_merged_dataset
- in cagra.hpp

Places where make_padded_dataset/view are called internally (not by user):

Host non-ACE path

cpp/src/neighbors/cagra_build_inst.cu.in
cagra_from_host_padded in cpp/src/neighbors/iface/iface.hpp
c/src/neighbors/cagra.cpp

Tiered CAGRA

update_cagra_ann_dataset_for_stride
build_upstream_ann

Ownership in Downstream Functions:

build() takes dataset_view only.
Downstream functions search / serialize / deserialize / merge only take views.

Improvements:

build() function previously only supported device dataset inputs. It now supports host dataset inputs.

Breaking Changes for Dataset API:

The following functions are removed since index no longer owns the dataset, index only takes views:

Removed all owning dataset based builds. Build only takes views.
Removed all update_dataset() overloads that take owning dataset. Update_dataset() only takes views.
Removed old functions that took mdspan or derivatives of mdspan.

4 cases where index previously owned dataset [all deprecated paths]:

2 edge case build() paths when attach_dataset_on_build == true and a successful dense attach:

Non-ACE / typical padded attach: rows live under index_owning_dataset_storage_ (type-erased owning wrapper, commonly device_padded_dataset).
ACE in-memory device_matrix attach: rows live under index_owning_dataset_storage_ (optional raw device_matrix).

Compression Param:

Implicit vpq dataset creation within build() when compression==True.
This forces the index to own the new vpq dataset, which violates our new contract that the index never owns a dataset.

Merge:

MERGE path: merge() internally creates merged_dataset on a deprecated internal merged dataset creation path. Here, index takes ownership of merged_dataset by storing it in index_owning_dataset_storage_ .

These paths have since been removed.

Attach Dataset

Previously, in the old code, ACE attach_dataset_on_build called make_device_padded_dataset on host dataset which did a H2D copy in order to attach dataset to final index.
cpp/src/neighbors/detail/cagra/cagra_build.cuh
This has since been removed. attach_dataset_on_build is disabled for host build paths. This avoids a H2D copy.

Compressed Dataset

Removed old code that did compressed dataset creation within build. This should only happen in the factory.

Merged Dataset

Removed implicit memory allocation within merge(), memory allocation now delegated to make_merged_dataset() factory. Removed index ownership within merge.
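The split between a factory that allocates and a merge that only binds a view might look like the following host-only sketch. `toy_make_merged_dataset` and `toy_merge` are hypothetical stand-ins, the graph merge itself is omitted, and the real `make_merged_dataset()` signature is not shown in this description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct toy_view { const float* ptr; std::size_t n_rows, dim; };
struct toy_index { toy_view dataset; };

// Hypothetical factory: allocates the merged rows and returns the owner
// to the caller, who decides how long the memory lives.
inline std::vector<float> toy_make_merged_dataset(toy_view a, toy_view b) {
  assert(a.dim == b.dim);
  std::vector<float> out;
  out.reserve((a.n_rows + b.n_rows) * a.dim);
  out.insert(out.end(), a.ptr, a.ptr + a.n_rows * a.dim);
  out.insert(out.end(), b.ptr, b.ptr + b.n_rows * b.dim);
  return out;
}

// Stand-in for merge(): allocates no row storage and only binds the view
// it was given. Merging the graphs is left out of this sketch.
inline toy_index toy_merge(const toy_index&, const toy_index&, toy_view merged) {
  return toy_index{merged};
}
```

The caller holds the returned vector for as long as the merged index is searched, matching the rule that the index never owns rows.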

Deserialize

Removed index ownership of dataset during deserialization. Now users are expected to create/declare the dataset type to be deserialized and then pass it as a reference to the deserialize() function which will then populate this dataset and return it to the caller.
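As a rough illustration of that deserialize contract, the sketch below round-trips a toy index through a stream and populates storage that the caller declared. All names and the byte layout are assumptions made for the example; the cuVS serialization format and function signatures are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

struct toy_owned_dataset {
  std::vector<float> rows;
  std::uint64_t n_rows = 0, dim = 0;
};
struct toy_view { const float* ptr; std::uint64_t n_rows, dim; };
struct toy_index { toy_view dataset; };

// Writes a tiny header (n_rows, dim) followed by the raw rows.
inline void toy_serialize(std::ostream& os, const toy_index& idx) {
  const toy_view& v = idx.dataset;
  os.write(reinterpret_cast<const char*>(&v.n_rows), sizeof(v.n_rows));
  os.write(reinterpret_cast<const char*>(&v.dim), sizeof(v.dim));
  os.write(reinterpret_cast<const char*>(v.ptr),
           static_cast<std::streamsize>(v.n_rows * v.dim * sizeof(float)));
}

// The caller passes the dataset by reference; this function fills it and
// returns an index that only views it.
inline toy_index toy_deserialize(std::istream& is, toy_owned_dataset& out) {
  is.read(reinterpret_cast<char*>(&out.n_rows), sizeof(out.n_rows));
  is.read(reinterpret_cast<char*>(&out.dim), sizeof(out.dim));
  out.rows.resize(static_cast<std::size_t>(out.n_rows * out.dim));
  is.read(reinterpret_cast<char*>(out.rows.data()),
          static_cast<std::streamsize>(out.rows.size() * sizeof(float)));
  return toy_index{toy_view{out.rows.data(), out.n_rows, out.dim}};
}
```

The returned index points into `out.rows`, so the caller must keep the `toy_owned_dataset` alive and must not resize it while the index is in use.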

Helpers

cagra_required_row_width
matrix_actual_row_width
matrix_row_width_matches_cagra_required
convert_dataset_view_to_padded_for_graph_build
convert_host_to_device_index
attach_device_dataset_on_host_index
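The row-width helpers are listed without signatures, so the following is only a guess at the kind of arithmetic involved. It assumes rows must be padded to a 16-byte boundary, and the names `toy_required_row_width`, `toy_row_width_matches`, and `toy_make_padded` are hypothetical host-only stand-ins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption for illustration: each row must span a multiple of 16 bytes.
constexpr std::size_t kAlignBytes = 16;

// Number of elements per row after padding `dim` up to the alignment.
template <typename T>
constexpr std::size_t toy_required_row_width(std::size_t dim) {
  static_assert(kAlignBytes % sizeof(T) == 0, "element size must divide alignment");
  constexpr std::size_t per = kAlignBytes / sizeof(T);
  return (dim + per - 1) / per * per;
}

// True when an existing stride already satisfies the padding rule.
template <typename T>
constexpr bool toy_row_width_matches(std::size_t stride, std::size_t dim) {
  return stride == toy_required_row_width<T>(dim);
}

// Copies tightly packed rows into a zero-padded buffer with that stride.
template <typename T>
std::vector<T> toy_make_padded(const std::vector<T>& src, std::size_t n_rows, std::size_t dim) {
  const std::size_t stride = toy_required_row_width<T>(dim);
  std::vector<T> out(n_rows * stride, T{0});
  for (std::size_t r = 0; r < n_rows; ++r) {
    std::copy(src.data() + r * dim, src.data() + (r + 1) * dim, out.data() + r * stride);
  }
  return out;
}
```

A caller could check `toy_row_width_matches` first and take a view directly when it holds, falling back to the padding copy only when it does not.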

How to attach a compressed dataset onto an uncompressed index?

construct a new compressed index
copy over the graph and other params from the uncompressed index
delete the old uncompressed index
attach the vpq dataset onto the compressed index
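The four steps above can be sketched with a toy index templated on its view type. The names are hypothetical, and this sketch moves the graph instead of copying it, which is a simplification of the "copy over the graph" step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical view types for uncompressed and compressed rows.
struct toy_padded_view { const float* ptr = nullptr; std::size_t n_rows = 0; };
struct toy_vpq_view { const std::uint8_t* codes = nullptr; std::size_t n_rows = 0; };

// The index owns its graph but only views its dataset.
template <typename ViewT>
struct toy_index {
  std::vector<std::uint32_t> graph;
  ViewT dataset{};
  void update_dataset(ViewT v) { dataset = v; }
};

// Hypothetical conversion helper: builds an index of a different view type
// and transfers the graph into it. The new index starts with an empty view.
template <typename ToView, typename FromView>
toy_index<ToView> toy_convert_index(toy_index<FromView>&& src) {
  toy_index<ToView> dst;
  dst.graph = std::move(src.graph);
  return dst;
}
```

After the conversion the caller attaches the compressed view with `update_dataset` and lets the old index go out of scope.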

How to attach a searchable device dataset onto an index built with host build?

Convert host index to device index with helper function convert_host_to_device_index
a. Utilizes map of host dataset type to device dataset type counterpart

TODOs:

Bring back Host functions [DONE]
Mark any old functions that are no longer used as [DONE]
Use templates wherever possible. Shift towards composition rather than inheritance [DONE]

Recent Updates:

build_ace() and build() functions merged on Public API surface
removed build_result(), ace_build_result(), and merge_result()
deprecated internal vpq dataset creation inside build() when index_params::compression == true --> moved to make_vpq_dataset() factory
deprecated internal merged dataset creation inside merge() --> moved to make_merged_dataset() factory. For backwards compatibility, have index take ownership of deprecated internal merged dataset creation path ONLY.
build() and build_ace() both had an attach_dataset_on_build option, which requires ownership of the dataset. Ownership is given to the index temporarily. This will later be deprecated. Users will be expected to pass a padded dataset on device and call search() directly. attach_dataset_on_build will no longer be supported for host builds.
added host versions of dataset API
templated build() and downstream functions to work on host datasets
added index template type conversion helpers

Future PRs:

PR#2: Add Support for Compressed Datasets

pq_dataset
bbq_dataset
rabitq_dataset
sq_dataset

PR#3: Migrate Rest of Algorithms to use Dataset API

HNSW
IVF
Vamana
Scann
Brute Force


copy-pr-bot · 2026-02-24T23:21:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

aamijar · 2026-02-25T01:51:35Z

/ok to test 5447a4c

aamijar · 2026-02-25T02:12:45Z

/ok to test 17ab09d

achirkin · 2026-02-25T06:06:02Z

NB: I updated the label to breaking, since the description implies removal of a publicly visible class strided_dataset

cjnolet · 2026-02-25T22:22:05Z

Does the dataset(_view) type bring anything on top of mdarray/mdspan in that case?

@achirkin The problem w/ using mdspan/mdarray for this is that it doesn't carry along the proper information to either the algorithms or the user (which is why we created this specialized class in the first place!).

Two immediate reasons why this API is necessary:

The user should not have to know that they need to pad a dataset in order to use cagra without the additional copy. They should not need to know how any of these algorithms work internally. They should, however, need to know that CAGRA expects a padded dataset, and they should have an API to construct one so that they can own the dataset class and not have cagra creating one under the hood.
APIs, especially the graph-based APIs, should be able to accept as inputs data which has been quantized using a method like PQ, which carries additional information. In the case of PQ, the codebooks are needed to compute the distances. This again decouples the quantization from the algorithm (CAGRA-Q does not need to do its own quantization. It should just accept the quantized vectors). We're being asked for the same behavior with Vamana.

This new API solves both of these problems while leaving the control over the memory ownership entirely in the user's hands. We've discussed this for a long time. We've known this is needed for a long time. It's time to prioritize this and get it done. I agree that an abstract class might make more sense, but ultimately we should not be moving any ownership over to the algorithm (the user should maintain ownership over the class and underlying memory the entire time).

…rt for cases where we DO need to own the dataset (in order to keep view alive for index). All cases where we build() from dataset already on device --> we don't need to own. Merge + All cases when data is on host --> we DO need to own the device copy we create. This includes within ACE build and C API build from host and from_args with host dataset

HowardHuang1 · 2026-03-04T23:24:04Z

The doc that outlines some of the API design choices can be found in slack. Let me know if there are any parts of the design that can be altered to better suit our users' needs.

The following files are test case files I've added and can be ignored for now. They will be removed before the final merge with upstream repo:

cagra_build_view_only.cu
cagra_padded_dataset.cu
cagra_vpq_build_result.cu
dataset_compression.cu
dataset_types.cu

cjnolet · 2026-06-30T17:06:11Z

  }

-  nlohmann::json comp_search_conf = collect_conf_with_prefix(conf, "compression_");
-  if (!comp_search_conf.empty()) {


We need to figure out how to update the datasets API without removing this functionality altogether. We can't just remove features without having something to take its place.

cjnolet · 2026-06-30T17:09:31Z

@@ -5,12 +5,14 @@

 #pragma once

-#include "common.hpp"


Is this really no longer needed?

cjnolet · 2026-06-30T17:10:20Z

 #include <cuvs/distance/distance.hpp>
+#include <cuvs/neighbors/cagra_dataset_view_dispatch.hpp>


Why all the extra headers? Headers are super lightweight. Let's please try to consolidate all cagra related headers into this file. It's just too much for users to have to remember all of this.

cjnolet · 2026-06-30T17:10:37Z

 #include <cuvs/neighbors/common.hpp>
+#include <cuvs/neighbors/dataset_view_concepts.hpp>


This is what "common.hpp" is for. Please consolidate.

cjnolet · 2026-06-30T17:13:55Z

-    return *dataset_;
-  }
+  /** Non-owning dataset binding stored by the index. */
+  [[nodiscard]] inline auto data() const noexcept -> DatasetViewT const& { return dataset_; }


Propose we rename this to "dataset" just to be more explicit and concise. "data()" makes me think of a pointer, which is how other APIs treat this.

cjnolet · 2026-06-30T17:15:18Z

+   * then call `update_dataset(res, std::move(*stolen_fd))` on the target device index.
+   * Clears the stored fd (and leaves n_rows_/dim_ in place for the remaining graph).
+   */
+  [[nodiscard]] inline auto steal_dataset_fd() noexcept


This is a little bit awkward. I'm not speaking to whether it's necessary, just that it's a bit odd to see in a public API.

…nd dataset_view should control only owning vs non-owning. New design has 3 layers: (1) storage (2) container (3) dataset/dataset_view. Storage contains the real implementation with the data members. Container layer has owning vs view storage for every container type. Dataset/Dataset_view layer contains the public API names with storage types declared.

…e but internally convert it to a device_padded_dataset because iterative build is the exception which calls search and search kernels need correct padding and search can only happen on device. Remove cagra_dataset_view_dispatch.hpp file

HowardHuang1 added 4 commits February 23, 2026 09:49

get build working

d78f459

add dataset compression test and basic constructor types test

4febf8b

add padded_dataset class along with test cases

b403473

add support for new padded_dataset classes all the way up to the CAGR…

8d6833a

…A level

HowardHuang1 requested review from a team as code owners February 24, 2026 23:21

github-project-automation Bot added this to Unstructured Data Processing Feb 24, 2026

aamijar added non-breaking Introduces a non-breaking change feature request New feature or request labels Feb 25, 2026

aamijar assigned HowardHuang1 Feb 25, 2026

aamijar moved this to In Progress in Unstructured Data Processing Feb 25, 2026

Merge branch 'main' into HH-Dataset-API

5447a4c

fix style

17ab09d

achirkin added breaking Introduces a breaking change and removed non-breaking Introduces a non-breaking change labels Feb 25, 2026

achirkin reviewed Feb 25, 2026

Comment thread cpp/include/cuvs/neighbors/cagra.hpp

aamijar reviewed Feb 25, 2026

Comment thread cpp/CMakeLists.txt Outdated

seunghwak mentioned this pull request Feb 27, 2026

[WIP] Clarify dataset ownership and allocation semantics #1738

Closed

build() now only takes views and not unique ptrs + get rid of distinc…

fb556c9

…tion between make host/device padded dataset in factory

HowardHuang1 requested a review from a team as a code owner February 28, 2026 03:34

HowardHuang1 added 4 commits March 2, 2026 15:27

clean up old overloads of build & index functions that take ownership…

37d28dc

… of dataset + create build_result struct which returns both index and vpq_dataset to prevent automatic out of scope destruction of dataset for vpq case

fix failing mg tests that do build -> serialize -> deserialize -> search

26b46a2

Merge remote-tracking branch 'upstream' into HH-Dataset-API

a38fb18

Merge remote-tracking branch 'upstream' into HH-Dataset-API

df925ec

HowardHuang1 requested review from a team as code owners June 29, 2026 16:54

cjnolet reviewed Jun 30, 2026

Comment thread c/include/cuvs/neighbors/cagra.h

cjnolet reviewed Jun 30, 2026

Comment thread cpp/include/cuvs/neighbors/cagra.hpp

cjnolet reviewed Jun 30, 2026

HowardHuang1 added 6 commits June 30, 2026 10:45

Merge upstream into HH-Dataset-API

f722c2c

Resolve faiss-1.14-cuvs-26.08.diff conflict by combining Dataset API changes with upstream CuvsFlatIndex distance.hpp include fix.

add host_padded_index overloads for serialize to hnsw path

fadf92e

move cagra_dataset_view_dispatch.hpp to detail namespace

6c47d31

add doxygen for search function overloads in cagra.hpp

4dbceca

Merge remote-tracking branch 'upstream' into HH-Dataset-API

f22232e

josephine-wolf-oberholtzer added this to Unstructured Data Processing Jul 1, 2026

josephine-wolf-oberholtzer moved this to In Progress in Unstructured Data Processing Jul 1, 2026

HowardHuang1 added 5 commits July 1, 2026 15:53

Merge remote-tracking branch 'upstream' into HH-Dataset-API

06c4050

add serialize_to_hnswlib and from_cagra overloads for standard dataset

c5c060d

Merge branch 'HH-Dataset-API' of github.com:HowardHuang1/cuvs into HH…

51b19f5

…-Dataset-API

11 participants
