Add HNSW Layered Index Support #2148
Draft
julianmi wants to merge 7 commits into
- Add base node IDs for sequential access.
- Scattered writes happen only in the deserialization step, using host memory.
The disk-backed ACE algorithm partitions the dataset while building the CAGRA graph, so the resulting graph uses the reordered index space. Building an HNSW index with `hnsw::from_cagra` uses the reordered dataset and the CAGRA graph. Downstream consumers building an HNSW index would therefore need the reordered dataset, which is typically large whenever the disk-backed ACE algorithm is required. Building only the layers of the HNSW index, without the dataset, and moving that layered index to the search node minimizes network transfers for downstream consumers that have the original dataset available locally. The `hnsw::deserialize` step then combines the layered index with the local dataset to form an hnswlib-compatible search index.
Artifact Layout
Layered HNSW Serialization
The layered serializer creates `hnsw_index.cuvs` from the disk-backed ACE graph:
- Create the `.cuvs` file and write the fixed header and metadata JSON.
- Write `levels` sequentially.
- Read `dataset_mapping.npy` sequentially into `reordered_to_original`.
- Read `cagra_graph.npy` source-sequentially in ACE reordered row order.
- Write `base_nodes[row] = reordered_to_original[ace_reordered_row]`.
- Write `base_links[row]`.
- Write `upper_nodes` and `upper_links` with node IDs and neighbor IDs converted back to original IDs.
This keeps remapping, link padding, and upper-layer KNN work on the build node.
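The base-layer remapping step above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `BaseLayer` struct and `remap_base_layer` function are hypothetical names, the graph is held in memory rather than streamed from `cagra_graph.npy`, and it assumes that base-layer neighbor IDs are converted to original IDs in the same way as the upper layers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory stand-in for the base-layer sections of the artifact.
struct BaseLayer {
  std::vector<uint32_t> base_nodes;  // original ID of each row, in ACE reordered row order
  std::vector<uint32_t> base_links;  // row-major neighbor lists, expressed in original IDs
};

// graph: row-major [n_rows x degree] neighbor lists in the reordered index space.
// reordered_to_original: maps a reordered row to its original dataset row.
BaseLayer remap_base_layer(const std::vector<uint32_t>& graph,
                           const std::vector<uint32_t>& reordered_to_original,
                           std::size_t degree)
{
  const std::size_t n_rows = reordered_to_original.size();
  assert(graph.size() == n_rows * degree);

  BaseLayer out;
  out.base_nodes.reserve(n_rows);
  out.base_links.reserve(n_rows * degree);

  // Rows are visited in reordered order, so source reads and output appends are sequential.
  for (std::size_t row = 0; row < n_rows; ++row) {
    out.base_nodes.push_back(reordered_to_original[row]);
    for (std::size_t j = 0; j < degree; ++j) {
      const uint32_t reordered_neighbor = graph[row * degree + j];
      out.base_links.push_back(reordered_to_original[reordered_neighbor]);
    }
  }
  return out;
}
```

Only the lookup into `reordered_to_original` is random access in this sketch; the graph reads and the output appends are sequential.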
Deserialization
The search node reads:
- `hnsw_index.cuvs`
- the dataset at `index_params.dataset_path`
The loader:
- Reads `levels` sequentially, indexed as `levels[original_id]`.
- Reads `base_nodes` and `base_links` sequentially and writes each row to `get_linklist0(base_node_id)`.
- Reads `upper_nodes` and `upper_links` sequentially by layer and writes each entry to `get_linklist(node_id, level)`.
The search node does no graph remapping, no level generation, no link padding, and no KNN work.
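The loader's base-layer pass can be sketched as a sequential read followed by a scattered write into host memory. This is a simplified illustration under stated assumptions: `LinkStore` and `load_base_layer` are hypothetical names, and the flat slot layout only stands in for the storage behind `get_linklist0`. It is not hnswlib's actual memory layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for base-layer link storage in host memory:
// one fixed-size slot of `degree` neighbor IDs per original node ID.
struct LinkStore {
  std::size_t degree;
  std::vector<uint32_t> slots;

  LinkStore(std::size_t n_nodes, std::size_t degree_)
    : degree(degree_), slots(n_nodes * degree_, 0) {}

  // Plays the role of get_linklist0(node_id) in this sketch.
  uint32_t* linklist0(uint32_t node_id)
  {
    return slots.data() + static_cast<std::size_t>(node_id) * degree;
  }
};

// base_nodes and base_links are consumed front to back (sequential reads);
// the destination slot is chosen by the original node ID (scattered writes).
void load_base_layer(const std::vector<uint32_t>& base_nodes,
                     const std::vector<uint32_t>& base_links,
                     LinkStore& store)
{
  assert(base_links.size() == base_nodes.size() * store.degree);
  for (std::size_t row = 0; row < base_nodes.size(); ++row) {
    const uint32_t* src = base_links.data() + row * store.degree;
    std::copy(src, src + store.degree, store.linklist0(base_nodes[row]));
  }
}
```

Because the links already carry original IDs, the loop copies them verbatim and performs no remapping.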
Disk Access Patterns
Build node:
- `reordered_dataset.npy` and `augmented_dataset.npy`.
- `cagra_graph.npy` when creating the final layered artifact.
- `hnsw_index.cuvs`.
Search node:
- `hnsw_index.cuvs`.
Runtime Requirements
Only `hnsw_index.cuvs` is copied to the search node. ACE temporary files remain build-node-only.
The search node must have the original dataset in original row order and must provide that path through `index_params.dataset_path`.
Misc
Unifies the logging format of the ACE algorithm.