[REVIEW] New Dataset API Clarifying Ownership #1846
/ok to test 5447a4c
/ok to test 17ab09d
NB: I updated the label to |
@achirkin The problem w/ using mdspan/mdarray for this is that it's not carrying along the proper information to either the algorithms or the user (which is why we created this specialized class for this in the first place!). Two immediate reasons why this API is necessary:
This new API solves both of these problems while leaving the control over the memory ownership entirely in the user's hands. We've discussed this for a long time. We've known this is needed for a long time. It's time to prioritize this and get it done. I agree that an abstract class might make more sense, but ultimately we should not be moving any ownership over to the algorithm (the user should maintain ownership over the class and underlying memory the entire time).
…tion between make host/device padded dataset in factory
… of dataset + create build_result struct which returns both index and vpq_dataset to prevent automatic out of scope destruction of dataset for vpq case
…rt for cases where we DO need to own the dataset (in order to keep view alive for index). All cases where we build() from dataset already on device --> we don't need to own. Merge + All cases when data is on host --> we DO need to own the device copy we create. This includes within ACE build and C API build from host and from_args with host dataset
|
The doc that outlines some of the API design choices can be found in slack. Let me know if there are any parts of the design that can be altered to better suit our users' needs. The following files are test case files I've added and can be ignored for now. They will be removed before the final merge with upstream repo:
}

nlohmann::json comp_search_conf = collect_conf_with_prefix(conf, "compression_");
if (!comp_search_conf.empty()) {
We need to figure out how to update the datasets API without removing this functionality altogether. We can't just remove features without having something to take its place.
@@ -5,12 +5,14 @@

#pragma once

#include "common.hpp"
Is this really no longer needed?
#include <cuvs/distance/distance.hpp>
#include <cuvs/neighbors/cagra_dataset_view_dispatch.hpp>
Why all the extra headers? Headers are super lightweight. Let's please try to consolidate all CAGRA-related headers into this file. It's just too much for users to have to remember all of this.
#include <cuvs/neighbors/common.hpp>
#include <cuvs/neighbors/dataset_view_concepts.hpp>
This is what "common.hpp" is for. Please consolidate.
return *dataset_;
}
/** Non-owning dataset binding stored by the index. */
[[nodiscard]] inline auto data() const noexcept -> DatasetViewT const& { return dataset_; }
Propose we rename this to "dataset" just to be more explicit and concise. "data()" makes me think of a pointer, which is how other APIs treat this.
 * then call `update_dataset(res, std::move(*stolen_fd))` on the target device index.
 * Clears the stored fd (and leaves n_rows_/dim_ in place for the remaining graph).
 */
[[nodiscard]] inline auto steal_dataset_fd() noexcept
This is a little bit awkward. I'm not speaking to whether it's necessary, just that it's a bit odd to see in a public API.
Resolve faiss-1.14-cuvs-26.08.diff conflict by combining Dataset API changes with upstream CuvsFlatIndex distance.hpp include fix.
…nd dataset_view should control only owning vs non-owning. New design has 3 layers: (1) storage (2) container (3) dataset/dataset_view. Storage contains the real implementation with the data members. Container layer has owning vs view storage for every container type. Dataset/Dataset_view layer contains the public API names with storage types declared.
…e but internally convert it to a device_padded_dataset because iterative build is the exception which calls search and search kernels need correct padding and search can only happen on device. Remove cagra_dataset_view_dispatch.hpp file
Overview
Addressing #1574 and #1571.
Replaced strided_dataset with padded_dataset class. Added support all the way up to CAGRA code.
Old class structure (Classes + Inheritance):
Intermediate Class Structure (Classes + Inheritance):
dataset and dataset_view are now 2 separate parent classes.
Intermediate Class Structure (ContainerType Tags + Composition + Variants):
New Class Structure (ContainerType Tags + Composition):
Inheritance is removed entirely, so all dataset types now sit at the same level with no parent class.
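As a rough illustration of the tags-plus-composition idea, here is a minimal sketch. Every name in it (`owning_tag`, `view_tag`, `padded_storage_sketch`, `padded_sketch`, the factory helpers) is hypothetical and not taken from the PR; it uses host memory as a stand-in for device memory and only shows how a tag can select owning versus view storage without any base class.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical container tags: they select the storage flavor at compile time.
struct owning_tag {};
struct view_tag {};

// Storage layer: holds the actual data members. No inheritance involved.
template <typename Tag>
struct padded_storage_sketch;

template <>
struct padded_storage_sketch<owning_tag> {
  std::vector<float> buf;  // owns the padded rows
  std::size_t n_rows;
  std::size_t dim;
  std::size_t stride;
  const float* data() const noexcept { return buf.data(); }
};

template <>
struct padded_storage_sketch<view_tag> {
  const float* ptr;  // borrows rows owned by someone else
  std::size_t n_rows;
  std::size_t dim;
  std::size_t stride;
  const float* data() const noexcept { return ptr; }
};

// Public layer: one class template composed over the storage it was tagged with.
template <typename Tag>
class padded_sketch {
 public:
  explicit padded_sketch(padded_storage_sketch<Tag> s) : storage_(std::move(s)) {}
  const float* data() const noexcept { return storage_.data(); }
  std::size_t n_rows() const noexcept { return storage_.n_rows; }
  std::size_t dim() const noexcept { return storage_.dim; }
  std::size_t stride() const noexcept { return storage_.stride; }
  static constexpr bool is_owning() noexcept { return std::is_same<Tag, owning_tag>::value; }

 private:
  padded_storage_sketch<Tag> storage_;
};

using padded_dataset_sketch      = padded_sketch<owning_tag>;
using padded_dataset_view_sketch = padded_sketch<view_tag>;

// Hypothetical factory: rounds each row up to a multiple of `align` elements.
inline padded_dataset_sketch make_owning_sketch(std::size_t n_rows, std::size_t dim, std::size_t align)
{
  std::size_t stride = (dim + align - 1) / align * align;
  padded_storage_sketch<owning_tag> s{std::vector<float>(n_rows * stride, 0.0f), n_rows, dim, stride};
  return padded_dataset_sketch(std::move(s));
}

// Hypothetical helper: builds a non-owning view over an owning dataset.
inline padded_dataset_view_sketch as_view_sketch(const padded_dataset_sketch& d)
{
  return padded_dataset_view_sketch(
    padded_storage_sketch<view_tag>{d.data(), d.n_rows(), d.dim(), d.stride()});
}
```

In this sketch the owning and view types are siblings produced from the same template, which is one way to read "same level, no parent class".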
Ownership
The index and cagra::build / cagra::index do not own raw vector storage; they only take views.
The old code had a type-erased std::unique_ptr<dataset_view<...>>, i.e. non-owning view handles. The new code templates the index on the dataset view type, so the template argument determines which kind of dataset_view the index holds.
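A minimal sketch of what "the index only takes views" could look like, assuming an index templated on its view type. `dataset_view_sketch` and `index_sketch` are hypothetical stand-ins, not the cuVS classes, and the accessor name `data()` simply follows the diff excerpt quoted in the review.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over row-major vectors (not the cuVS type).
struct dataset_view_sketch {
  const float* ptr;
  std::size_t n_rows;
  std::size_t dim;
};

// Hypothetical index templated on the view type it stores.
// It copies the view (a pointer plus extents), never the vectors themselves.
template <typename DatasetViewT>
class index_sketch {
 public:
  explicit index_sketch(DatasetViewT view) : dataset_(view) {}

  // Non-owning dataset binding stored by the index.
  const DatasetViewT& data() const noexcept { return dataset_; }

 private:
  DatasetViewT dataset_;
};

// The caller owns `storage`; the index merely aliases it.
inline bool index_aliases_caller_storage()
{
  std::vector<float> storage(4 * 8, 1.0f);
  index_sketch<dataset_view_sketch> idx(dataset_view_sketch{storage.data(), 4, 8});
  return idx.data().ptr == storage.data();
}
```

The consequence for users is that `storage` must outlive every use of the index, which is the ownership contract the description argues for.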
ACE vs. non-ACE paths on Host
The ACE path copies datasets that can't entirely fit in CPU memory in chunks to GPU memory by calling make_padded_dataset. This uses 1x memory on CPU and 1x memory on GPU.
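The chunked copy into a padded layout can be pictured with the host-only sketch below. It is an assumption-laden illustration: `copy_in_chunks_padded_sketch` is a hypothetical helper, a `std::vector` stands in for the GPU allocation, and the real path goes through make_padded_dataset rather than this loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies `n_rows` rows of width `dim` from `src` into a
// destination whose rows are `stride` elements apart (stride >= dim), moving
// `chunk_rows` rows per step. Padding elements stay zero.
inline std::vector<float> copy_in_chunks_padded_sketch(const std::vector<float>& src,
                                                       std::size_t n_rows,
                                                       std::size_t dim,
                                                       std::size_t stride,
                                                       std::size_t chunk_rows)
{
  assert(stride >= dim);
  assert(chunk_rows > 0);
  assert(src.size() >= n_rows * dim);

  std::vector<float> dst(n_rows * stride, 0.0f);  // stand-in for the device buffer
  for (std::size_t start = 0; start < n_rows; start += chunk_rows) {
    std::size_t end = std::min(n_rows, start + chunk_rows);
    // In a real device path this inner loop would be one strided copy per chunk.
    for (std::size_t r = start; r < end; ++r) {
      std::copy_n(src.data() + r * dim, dim, dst.data() + r * stride);
    }
  }
  return dst;
}
```

Only one padded destination is allocated, which matches the "1x on CPU, 1x on GPU" accounting in the text.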
Return types:
Used mainly to maintain lifetime of dataset.
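The commit notes mention a build_result struct that returns both the index and the vpq_dataset so the dataset is not destroyed when build returns. A hedged sketch of that shape follows; the member names and `build_sketch` are hypothetical, and plain host vectors stand in for the compressed device data.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dataset created inside build (e.g. a compressed one).
struct compressed_dataset_sketch {
  std::vector<float> codes;
};

// Hypothetical index that only holds a non-owning pointer into the dataset.
struct result_index_sketch {
  const float* view = nullptr;
  std::size_t n     = 0;
};

// Returning both pieces together keeps the dataset alive in the caller's scope.
// `dataset` is declared first, so it is destroyed after `index`.
struct build_result_sketch {
  std::unique_ptr<compressed_dataset_sketch> dataset;
  result_index_sketch index;
};

inline build_result_sketch build_sketch(const std::vector<float>& input)
{
  build_result_sketch out;
  out.dataset        = std::make_unique<compressed_dataset_sketch>();
  out.dataset->codes = input;  // pretend this is the compression step
  out.index.view     = out.dataset->codes.data();
  out.index.n        = out.dataset->codes.size();
  return out;  // the heap allocation does not move, so the view stays valid
}
```

Because the dataset lives behind a unique_ptr, moving the result struct around does not invalidate the pointer stored in the index.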
cuvs_cagra_c_api_lifetime_holder
It is a single C++ struct in cagra.cpp that groups the real cagra::index with any extra heap-owned things the C API had to create so the index’s non-owning views stay valid.
cuvs_cagra_c_api_lifetime_holder is a separate heap object from cagra::index. It is heap-allocated in cagra.cpp with new cuvs_cagra_c_api_lifetime_holder<...>. The C API keeps a raw pointer to it in cuvsCagraIndex.cuvs_cagra_c_api_lifetime_holder. It is not embedded in the index, which is why the C layer needs that second field to delete the holder on destroy.
Heap-allocated bundle for the C API: owns cagra::index and any co-owned device storage (VPQ, padded dataset copy, merge/deserialize/extend buffers) when the index is not standalone. cuvsCagraIndex.c_api_lifetime_owner points at this. Used for merge, build, deserialize, from_args, extend.
The holder moves the owning device_padded_dataset (as unique_ptr<dataset<>> in padded_dataset_owner) to the heap, and cuvsCagraIndex.merged_owner points at the holder. Destroying the C index later destroys the holder first, so the dataset outlives the index’s use of the view, or the ordering is set up so the view is not used after free.
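One way to get the "dataset outlives the index" ordering is to rely on C++ destroying members in reverse declaration order. The sketch below assumes that approach; all types are hypothetical stand-ins, and the log only exists to make the order observable.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical owning dataset that records when it is destroyed.
struct owned_dataset_sketch {
  owned_dataset_sketch(std::vector<float> v, std::vector<std::string>* l)
    : values(std::move(v)), log(l) {}
  owned_dataset_sketch(const owned_dataset_sketch&)            = delete;
  owned_dataset_sketch& operator=(const owned_dataset_sketch&) = delete;
  ~owned_dataset_sketch() { log->push_back("dataset"); }

  std::vector<float> values;
  std::vector<std::string>* log;
};

// Hypothetical index holding only a non-owning pointer into the dataset.
struct held_index_sketch {
  held_index_sketch(const float* v, std::vector<std::string>* l) : view(v), log(l) {}
  held_index_sketch(const held_index_sketch&)            = delete;
  held_index_sketch& operator=(const held_index_sketch&) = delete;
  ~held_index_sketch() { log->push_back("index"); }

  const float* view;
  std::vector<std::string>* log;
};

// Hypothetical holder: the owner is declared before the index, so the index
// is destroyed first and never observes a freed dataset.
struct lifetime_holder_sketch {
  lifetime_holder_sketch(std::vector<float> v, std::vector<std::string>* log)
    : padded_dataset_owner(std::make_unique<owned_dataset_sketch>(std::move(v), log)),
      index(padded_dataset_owner->values.data(), log)
  {
  }

  std::unique_ptr<owned_dataset_sketch> padded_dataset_owner;
  held_index_sketch index;
};

// Allocates and deletes a holder on the heap and reports the destruction order.
inline std::vector<std::string> destruction_order_sketch()
{
  std::vector<std::string> log;
  auto* holder = new lifetime_holder_sketch({1.0f, 2.0f, 3.0f}, &log);
  delete holder;
  return log;
}
```

Swapping the two member declarations would reverse the order, which is why the declaration order in such a holder is part of its contract.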
In cuvsCagraIndexFromArgs in cagra.cpp (C API), whose callers include the Python cagra.from_graph (via Cython), the Java CagraIndexImpl, and any C code that uses that function:
The flow is: caller → cuvsCagraIndexFromArgs → _from_args, which writes into the cuvsCagraIndex struct the user passed.
The holder is not returned as a separate C return value. It is allocated on the heap and its address is stored in output_index->merged_owner, and output_index->addr points at the index inside that holder (or at a freestanding index when merged_owner == 0).
So when _from_args returns, the user’s cuvsCagraIndex already holds the pointers that describe where everything lives.
The unique_ptr to the copy of the dataset from make_padded_dataset is not local to _from_args—it is a member of the holder, which is on the heap and stays alive.
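The pointer bookkeeping described above can be sketched as follows. This is not the cuVS C API: `c_index_sketch`, `from_args_sketch`, and `destroy_sketch` are hypothetical, the struct only mirrors the `addr` and `merged_owner` fields named in the text, and a host vector stands in for the device copy made by make_padded_dataset.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical index holding a non-owning pointer.
struct c_held_index_sketch {
  const float* view = nullptr;
};

// Hypothetical holder: owns the dataset copy and embeds the index.
struct c_holder_sketch {
  std::unique_ptr<std::vector<float>> padded_dataset_owner;  // destroyed after `index`
  c_held_index_sketch index;
};

// Hypothetical mirror of the two C struct fields discussed in the text.
struct c_index_sketch {
  std::uintptr_t addr         = 0;  // points at the index
  std::uintptr_t merged_owner = 0;  // points at the holder, or 0 for a freestanding index
};

// Writes into the caller's struct; nothing is returned separately.
inline void from_args_sketch(const std::vector<float>& host_dataset, c_index_sketch* out)
{
  auto* holder                 = new c_holder_sketch{};
  holder->padded_dataset_owner = std::make_unique<std::vector<float>>(host_dataset);
  holder->index.view           = holder->padded_dataset_owner->data();
  out->merged_owner            = reinterpret_cast<std::uintptr_t>(holder);
  out->addr                    = reinterpret_cast<std::uintptr_t>(&holder->index);
}

// Deletes the holder when there is one, otherwise the freestanding index.
inline void destroy_sketch(c_index_sketch* idx)
{
  if (idx->merged_owner != 0) {
    delete reinterpret_cast<c_holder_sketch*>(idx->merged_owner);
  } else if (idx->addr != 0) {
    delete reinterpret_cast<c_held_index_sketch*>(idx->addr);
  }
  idx->addr         = 0;
  idx->merged_owner = 0;
}
```

Since the copy is a member of the heap-allocated holder, it survives the return from `from_args_sketch` and is released only in `destroy_sketch`.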
Miscellaneous: Extend, Serialize, Deserialize
Will fill in later
Factories:
Places where make_padded_dataset/view are called internally (not by user):
Host non-ACE path
Tiered CAGRA
Ownership in Downstream Functions:
Improvements:
Breaking Changes for Dataset API:
The following functions are removed since index no longer owns the dataset, index only takes views:
Removed old functions that took mdspan or derivatives of mdspan.
4 cases where index previously owned dataset [all deprecated paths]:
2 edge case build() paths when attach_dataset_on_build == true and a successful dense attach:
Compression Param:
Merge:
These paths have since been removed.
Attach Dataset
Compressed Dataset
Merged Dataset
Deserialize
Helpers
How to attach a compressed dataset onto an uncompressed index?
How to attach a searchable device dataset onto an index built with host build?
a. Utilizes map of host dataset type to device dataset type counterpart
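A host-to-device type map of this kind is often written as a small trait. The sketch below assumes that form; the trait name and the two empty dataset types are hypothetical placeholders rather than the PR's actual definitions.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical placeholder dataset types.
struct host_padded_dataset_sketch {};
struct device_padded_dataset_sketch {};

// Primary template is left undefined so an unmapped host type fails to compile.
template <typename HostT>
struct device_counterpart_sketch;

// One specialization per host type, naming its device counterpart.
template <>
struct device_counterpart_sketch<host_padded_dataset_sketch> {
  using type = device_padded_dataset_sketch;
};

template <typename HostT>
using device_counterpart_sketch_t = typename device_counterpart_sketch<HostT>::type;

// Compile-time check that the mapping resolves as intended.
static_assert(std::is_same<device_counterpart_sketch_t<host_padded_dataset_sketch>,
                           device_padded_dataset_sketch>::value,
              "host padded dataset should map to device padded dataset");
```

With such a trait, code that attaches a searchable dataset can name the device type from the host type it was built with, without a runtime dispatch.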
TODOs:
Recent Updates:
Future PRs:
PR#2: Add Support for Compressed Datasets
PR#3: Migrate Rest of Algorithms to use Dataset API