diff --git a/docs/benchmarking.md b/docs/benchmarking.md
new file mode 100644
index 000000000..e27113dc1
--- /dev/null
+++ b/docs/benchmarking.md
@@ -0,0 +1,130 @@

# How to run benchmarks

JVector comes with a built-in benchmarking system in `jvector-examples/.../BenchYAML.java`.

To run a benchmark:
- Decide which dataset(s) you want to benchmark. A dataset consists of
  - The vectors to be indexed, usually called the "base" or "target" vectors.
  - The query vectors.
  - The "ground truth" results, which are used to compute accuracy metrics.
  - The similarity metric which should have been used to compute the ground truth (dot product, cosine similarity, or L2 distance).
- Configure the parameter combinations for which you want to run the benchmark. This includes graph index parameters, quantization parameters, and search parameters.

JVector supports two types of datasets:
- **Fvec/Ivec**: The dataset consists of three files, for example `base.fvec`, `queries.fvec` and `neighbors.ivec`, containing the base vectors, query vectors, and ground truth. (The `fvec` and `ivec` file formats are described [here](http://corpus-texmex.irisa.fr/).)
- **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors, and ground truth.

The general procedure for running benchmarks is outlined below. The following sections describe the process in more detail.
- [Specify the dataset](#specifying-datasets) names to benchmark in `datasets.yml`.
- Certain datasets will be downloaded automatically. If using a different dataset, make sure the dataset files are downloaded and made available (refer to the section on [Custom datasets](#custom-datasets)).
- Adjust the benchmark parameters in `default.yml`. This will affect the parameters for all datasets to be benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<dataset-name>.yml` in the same folder.
+- Decide on the kind of measurements and logging you want and configure them in `run.yml`. + +You can run the configured benchmark with maven: +```sh +mvn clean compile exec:exec@bench -pl jvector-examples -am +``` + +## Specifying dataset(s) + +The datasets you want to benchmark should be specified in `jvector-examples/yaml-configs/datasets.yml`. You'll notice this file already contains some entries; these are datasets that `bench` can automatically download and test with minimal additional configuration. Running `bench` without arguments and without changing this file will cause ALL the datasets to be benchmarked one by one (this is probably not what you want). + +To benchmark a single dataset, comment out the entries corresponding to all other datasets. (Or provide command line arguments as described in [Running `bench` from the command line](#running-bench-from-the-command-line)) + +Datasets are assumed to be Fvec/Ivec based unless the entry in the `datasets.yml` ends with `.hdf5`. In this case, `.hdf5` is not considered part of the "dataset name" referenced in other sections. + +You'll notice that datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system. + +For HDF5 files, the substrings `-angular`, `-euclidean` and `-dot` correspond to cosine similarity, L2 distance, and dot product similarity functions (these substrings ARE considered to be part of the "dataset name"). Currently, Fvec/Ivec datasets are implicitly assumed to use cosine similarity (changing this requires editing `DataSetLoaderMFD.java`). 
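The fvec/ivec layout linked above is simple: each vector is stored as a little-endian 4-byte count followed by that many little-endian 4-byte values (floats for `fvec`, ints for `ivec`). A minimal, stdlib-only Java sketch of reading and writing that layout (an illustration only, not JVector's actual loader):

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class FvecSketch {
    // Write vectors in .fvec layout: per vector, a little-endian int dimension
    // followed by that many little-endian floats.
    static void writeFvec(Path path, List<float[]> vectors) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(path)))) {
            for (float[] v : vectors) {
                ByteBuffer buf = ByteBuffer.allocate(4 + 4 * v.length).order(ByteOrder.LITTLE_ENDIAN);
                buf.putInt(v.length);
                for (float x : v) buf.putFloat(x);
                out.write(buf.array());
            }
        }
    }

    // Read the whole file back: repeat (dimension, floats) until exhausted.
    static List<float[]> readFvec(Path path) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(path)).order(ByteOrder.LITTLE_ENDIAN);
        List<float[]> vectors = new ArrayList<>();
        while (buf.hasRemaining()) {
            float[] v = new float[buf.getInt()];
            for (int i = 0; i < v.length; i++) v[i] = buf.getFloat();
            vectors.add(v);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("toy", ".fvec");
        writeFvec(p, List.of(new float[]{0.1f, 0.2f}, new float[]{0.3f, 0.4f}));
        List<float[]> back = readFvec(p);
        System.out.println(back.size() + " vectors, dim " + back.get(0).length);
        Files.delete(p);
    }
}
```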
Example `datasets.yml`:

```yaml
category0:
  - my-fvec-dataset # fvec/ivec dataset, cosine similarity
  - my-hdf5-dataset-angular.hdf5 # hdf5 dataset, cosine similarity
some-other-category:
  - a-huge-dataset-1024d-euclidean.hdf5 # hdf5 dataset, L2 similarity
  - my-simple-dataset-dot.hdf5 # hdf5 dataset, dot product similarity
  - some-dataset-euclidean # fvec/ivec dataset, cosine similarity (NOT L2 unless you change the code!)
```

## Setting benchmark parameters

### default.yml / \<dataset-name\>.yml

`jvector-examples/yaml-configs/default.yml` specifies the default index construction and search parameters used by `bench` for all datasets.

You can specify a custom set of parameters for any given dataset by creating a file called `<dataset-name>.yml`, with `<dataset-name>` replaced by the actual name of the dataset. This is the same as the identifier used in `datasets.yml`, but without the `.hdf5` suffix for hdf5 datasets. The format of this file is exactly the same as `default.yml`.

Refer to `default.yml` for a list of all options.

Most parameters can be specified as an array. For these parameters, a separate benchmark is run for each value of the parameter. If multiple parameters are specified as arrays, a benchmark is run for each combination (i.e. taking the Cartesian product). For example:
```yaml
construction:
  M: [32, 64]
  ef: [100, 200]
```
will build and benchmark four graphs, one for each combination of M and ef in {(32, 100), (64, 100), (32, 200), (64, 200)}. This is particularly useful when running a grid search to identify the best-performing parameters.

### run.yml

This file contains configurations for:
- Specifying the measurements you want to report, like QPS, latency, and recall
- Specifying where to output these measurements, i.e. to the console, to a file, or both.

The configurations in this file are "run-level", meaning that they are shared across all the datasets being benchmarked.

See `run.yml` for a full list of all options.
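The Cartesian-product expansion described above can be sketched in a few lines of plain Java (a toy illustration, not the actual BenchYAML code):

```java
import java.util.ArrayList;
import java.util.List;

public class GridSketch {
    // Expand two array-valued parameters into one run per (M, ef) combination.
    static List<int[]> combinations(int[] Ms, int[] efs) {
        List<int[]> combos = new ArrayList<>();
        for (int m : Ms)
            for (int ef : efs)
                combos.add(new int[]{m, ef});
        return combos;
    }

    public static void main(String[] args) {
        // Mirrors the yaml example: M: [32, 64], ef: [100, 200]
        for (int[] run : combinations(new int[]{32, 64}, new int[]{100, 200}))
            System.out.println("benchmark run with M=" + run[0] + ", ef=" + run[1]);
        // 2 x 2 parameter values => 4 benchmark runs
    }
}
```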
## Running `bench` from the command line

Once configured to your liking, you can run the benchmark through maven:
```sh
mvn compile exec:exec@bench -pl jvector-examples -am
```

To benchmark a subset of the datasets in `datasets.yml`, you can provide a space-separated list of regexes as arguments.
```sh
# matches `glove-25-angular.hdf5`, `glove-50-angular.hdf5`, `nytimes-256-angular.hdf5` etc
mvn compile exec:exec@bench -pl jvector-examples -am -DbenchArgs="glove nytimes"
```

## Custom Datasets

### Custom Fvec/Ivec datasets

Using fvec/ivec datasets requires them to be configured in `DataSetLoaderMFD.java`. Some datasets are already pre-configured; these will be downloaded and used automatically on running the benchmark.

To use a custom dataset consisting of files `base.fvec`, `queries.fvec` and `neighbors.ivec`, do the following:
- Ensure that you have three files:
  - `base.fvec` containing N D-dimensional float vectors. These are used to build the index.
  - `queries.fvec` containing Q D-dimensional float vectors. These are used for querying the built index.
  - `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors.

  The files can be named however you like.
- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
- Edit `DataSetLoaderMFD.java` to configure a new dataset and its associated files:
  ```java
  put("cust-ds", new MultiFileDatasource("cust-ds",
          "cust-ds/base.fvec",
          "cust-ds/queries.fvec",
          "cust-ds/neighbors.ivec"));
  ```
  The file paths are resolved relative to the `fvec` directory. `cust-ds` is the name of the dataset and can be changed to whatever is appropriate.
- In `jvector-examples/yaml-configs/datasets.yml`, add an entry corresponding to your custom dataset. Comment out other datasets which you do not want to benchmark.
  ```yaml
  custom:
    - cust-ds
  ```

### Custom HDF5 datasets

HDF5 datasets consist of a single file. The Hdf5Loader looks for three HDF5 datasets within the file: `train`, `test` and `neighbors`. These correspond to the base, query and neighbors vectors described above for fvec/ivec files.

To use an HDF5 dataset, edit `jvector-examples/yaml-configs/datasets.yml` to add an entry like the following:
```yaml
category:
  - <dataset-name>.hdf5
```

BenchYAML looks for hdf5 datasets with the name `<dataset-name>.hdf5` in the `hdf5` folder in the root of this repo. If the file doesn't exist, BenchYAML will attempt to automatically download the dataset from ann-benchmarks.com. If your dataset is not from ann-benchmarks.com, simply ensure that the dataset is available in the `hdf5` folder and edit `datasets.yml` accordingly.

diff --git a/docs/tutorials/1-intro-tutorial.md b/docs/tutorials/1-intro-tutorial.md
new file mode 100644
index 000000000..b95e606c2
--- /dev/null
+++ b/docs/tutorials/1-intro-tutorial.md
@@ -0,0 +1,149 @@

# JVector Tutorial Part 1: Introduction

JVector provides a graph index for ANN search which is a hybrid of DiskANN and HNSW. You can think of it as a Vamana index with an HNSW-style hierarchy. The rest of this tutorial assumes you have a basic understanding of vector search, but no prior understanding of HNSW or DiskANN is assumed.

JVector provides a `VectorFloat` datatype for representing vectors, as an abstraction over the physical vector type. Therefore, the first step to using JVector is to understand how to create a `VectorFloat`:

```java
// `VectorizationProvider` is automatically picked based on the system, language version and runtime flags.
// It determines the actual type of the vector data and provides implementations for common operations
// like the inner product.
VectorTypeSupport vts = VectorizationProvider.getInstance().getVectorTypeSupport();

int dimension = 3;

// Create a `VectorFloat` from a `float[]`.
// The types that can be converted to a VectorFloat are technically dependent on which VectorizationProvider is picked,
// but `float[]` is generally a safe bet.
float[] vector0Array = new float[]{0.1f, 0.2f, 0.3f};
VectorFloat<?> vector0 = vts.createFloatVector(vector0Array);
```

> [!TIP]
> For other ways to create vectors, refer to the javadoc for `VectorTypeSupport`.

Before creating the vector index, we will group all of our base vectors into a container which implements the `RandomAccessVectorValues` interface. Many APIs in JVector accept an instance of `RandomAccessVectorValues` as input. In this case, we'll use it to specify the vectors to be used to build the index.

```java
// This toy example uses only three vectors; in practical cases you might have millions or more.
List<VectorFloat<?>> baseVectors = List.of(
        vector0,
        vts.createFloatVector(new float[]{0.01f, 0.15f, -0.3f}),
        vts.createFloatVector(new float[]{-0.2f, 0.1f, 0.35f})
);

// RAVV or `ravv` is convenient shorthand for a RandomAccessVectorValues instance
RandomAccessVectorValues ravv = new ListRandomAccessVectorValues(baseVectors, dimension /* 3 */);
```

> [!TIP]
> In this example, all vectors are loaded in-memory, but RAVVs are quite versatile. For example, you might have a RAVV backed by disk (check out `MMapRandomAccessVectorValues.java`) or write your own custom RAVV that transfers data over a network interface.

> [!NOTE]
> A note on terminology:
> - "Base" vectors are the vectors used to build the index. Each vector becomes a node in the graph.
> - "Query" vectors are vectors used as queries for ANN search after the index has been built. In some cases you may want to use some base vectors as queries.

We're now ready to create a graph-based vector index. We'll do this using a `GraphIndexBuilder` as an intermediate.
Let's take a look at the signature of one of its constructors:

```java
public GraphIndexBuilder(BuildScoreProvider scoreProvider,
                         int dimension,
                         int M,
                         int beamWidth,
                         float neighborOverflow,
                         float alpha,
                         boolean addHierarchy,
                         boolean refineFinalGraph);
```

This constructor asks for something called a `BuildScoreProvider`, the vector dimension, and a set of graph parameters.

The `BuildScoreProvider` is used by the graph builder to compute the similarity scores between any two vectors at build time. We'll use the RAVV we created earlier to generate a BuildScoreProvider:

```java
// The type of similarity score to use. JVector supports EUCLIDEAN (L2 distance), DOT_PRODUCT and COSINE.
VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.EUCLIDEAN;

// A simple score provider which can compute exact similarity scores by holding a reference to all the base vectors.
BuildScoreProvider bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, similarityFunction);
```

Let's also initialize the graph parameters. For now we will not worry about the exact function of the parameters, except to note that these are reasonable defaults. Refer to the DiskANN and HNSW papers for more details.
```java
// Graph construction parameters
int M = 32; // maximum degree of each node
int efConstruction = 100; // search depth during construction
float neighborOverflow = 1.2f;
float alpha = 1.2f; // note: not the best setting for 3D vectors, but good in the general case
boolean addHierarchy = true; // use an HNSW-style hierarchy
boolean refineFinalGraph = true;
```

Now we can create the graph index:

```java
// Build the graph index using a Builder
// Remember to close the builder using builder.close() or a try-with-resources block
GraphIndexBuilder builder = new GraphIndexBuilder(bsp,
        dimension,
        M,
        efConstruction,
        neighborOverflow,
        alpha,
        addHierarchy,
        refineFinalGraph);
ImmutableGraphIndex graph = builder.build(ravv);
```

> [!NOTE]
> You may notice that we supplied the same `ravv` to `builder.build`, even though we'd already passed in the RAVV while creating the `BuildScoreProvider`. This is necessary because, generally speaking, the `BuildScoreProvider` won't keep a reference to the actual base vectors; it just so happens that we're using an "exact" score provider that does.

At this point, you have a completed graph index that resides in-memory.

To perform a search operation, you first need to create a `GraphSearcher`.

> [!IMPORTANT]
> The graph index itself can be shared between threads, but `GraphSearcher`s maintain internal state and are therefore NOT thread-safe. To run concurrent searches across multiple threads, each thread should have its own `GraphSearcher`. The same searcher can be re-used across different queries in the same thread.

```java
// Remember to close the searcher using searcher.close() or a try-with-resources block
var searcher = new GraphSearcher(graph);
```

Generally speaking, you can't pass in a `VectorFloat` directly to the `GraphSearcher`. You need to wrap the query vector with a `SearchScoreProvider`, similar in spirit to the `BuildScoreProvider` we created earlier.
```java
VectorFloat<?> queryVector = vts.createFloatVector(new float[]{0.2f, 0.3f, 0.4f}); // for example
// The in-memory graph index doesn't own the actual vectors used to construct it.
// To compute exact scores at search time, you need to pass in the base RAVV again,
// in addition to the actual query vector
SearchScoreProvider ssp = DefaultSearchScoreProvider.exact(queryVector, similarityFunction, ravv);
```

Now we can run a search:

```java
int topK = 10; // number of approximate nearest neighbors to fetch
// You can provide a filter to the query as a bit mask.
// In this case we want the actual topK neighbors without filtering,
// so we pass in a virtual bit mask representing all ones.
SearchResult result = searcher.search(ssp, topK, Bits.ALL);

for (NodeScore ns : result.getNodes()) {
    int id = ns.node; // you can look up this ID in the RAVV
    float score = ns.score; // the similarity score between this vector and the query vector (higher -> more similar)
    System.out.println("ID: " + id + ", Score: " + score + ", Vector: " + ravv.getVector(id));
}
```

For the full example, refer to `jvector-examples/../VectorIntro.java`. Run it using
```sh
mvn compile exec:exec@tutorial -Dtutorial=intro -pl jvector-examples -am
```

Next tutorial:
- The OnDiskGraphIndex

diff --git a/docs/tutorials/2-disk-tutorial.md b/docs/tutorials/2-disk-tutorial.md
new file mode 100644
index 000000000..4040ff7de
--- /dev/null
+++ b/docs/tutorials/2-disk-tutorial.md
@@ -0,0 +1,110 @@

# JVector Tutorial Part 2: The OnDiskGraphIndex

In-memory indexes are relatively simple to create, but they are not persistent and are fundamentally limited by the amount of memory available on the machine. To solve this, JVector provides an `OnDiskGraphIndex` backed by a file.

Unlike the previous example, this time we'll use a proper ANN dataset based on OpenAI's text-embedding-ada-002 embedding model.
```java
// This is a preconfigured dataset that will be downloaded automatically.
DataSet dataset = DataSets.loadDataSet("ada002-100k").orElseThrow(() ->
        new RuntimeException("Dataset doesn't exist or wasn't configured correctly")
);
```

> [!TIP]
> A `DataSet` provides the base vectors to be indexed, the query vectors, and the expected or "ground truth" results used for computing accuracy metrics.

We'll create the graph in-memory in exactly the same way as we did in the introductory tutorial. You can't create larger-than-memory indexes this way, but we'll cover how to do that in a later tutorial.

```java
// The loaded DataSet provides a RAVV over the base vectors
RandomAccessVectorValues ravv = dataset.getBaseRavv();
VectorSimilarityFunction vsf = dataset.getSimilarityFunction();
int dim = dataset.getDimension();

// reasonable defaults
int M = 32;
int ef = 100;
float overflow = 1.2f;
float alpha = 1.2f;
boolean addHierarchy = true;

BuildScoreProvider bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, vsf);

// nothing new here
ImmutableGraphIndex heapGraph;
try (GraphIndexBuilder builder = new GraphIndexBuilder(bsp, dim, M, ef, overflow, alpha, addHierarchy)) {
    heapGraph = builder.build(ravv);
}
```

We can write this graph to disk using a `GraphIndexWriter`.

```java
Path graphPath = Files.createTempFile("jvector-example-graph", null); // or wherever you want to save the graph

// Create a writer for the on-heap graph we just built.
// Remember to close when done.
GraphIndexWriter writer = GraphIndexWriter.getBuilderFor(GraphIndexWriterTypes.RANDOM_ACCESS_PARALLEL, heapGraph, graphPath)
        // Let the writer know that we'll also be passing in the actual vector data
        // to be saved "inline" with the data for each corresponding graph node.
        .with(new InlineVectors(dim))
        .build();
```
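To make the "inline" idea concrete, here's a toy, stdlib-only sketch of storing a vector in the same record as a node's neighbor list (an illustration only, not JVector's actual on-disk format):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class InlineRecordSketch {
    // Write one toy node record: the neighbor list immediately followed by the vector "inline".
    static void writeRecord(Path p, int[] neighbors, float[] vector) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(p))) {
            out.writeInt(neighbors.length);
            for (int n : neighbors) out.writeInt(n);
            out.writeInt(vector.length);
            for (float x : vector) out.writeFloat(x);
        }
    }

    // Read the record back; one sequential read yields both the graph structure and the vector,
    // so no second file (or second seek) is needed to fetch the vector for a visited node.
    static float[] readVector(Path p) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(p))) {
            int neighborCount = in.readInt();
            for (int i = 0; i < neighborCount; i++) in.readInt(); // skip neighbor ids
            float[] v = new float[in.readInt()];
            for (int i = 0; i < v.length; i++) v[i] = in.readFloat();
            return v;
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("toy-inline", ".bin");
        writeRecord(p, new int[]{1, 2}, new float[]{0.1f, 0.2f, 0.3f});
        System.out.println("v[0] = " + readVector(p)[0]);
        Files.delete(p);
    }
}
```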
A Feature is any kind of additional information about each node in the graph that gets saved to the same index file. Here, we'll be writing "Inline Vectors" as a feature, which means that the vector value will be stored alongside the neighbor lists for each node in the graph. This will be useful later. + +Since we want to include a feature, we need to tell the writer how to obtain the feature's "state" (or value) for each node. For inline vectors, this just means we need to specify the vector associated with each node. + +```java +// Supply one map entry for each feature. +// The key is a FeatureId enum corresponding to the feature +// and the value is a function which generates the feature state for each graph node. +writer.write(Map.of( + FeatureId.INLINE_VECTORS, + // we already have a RAVV, so we'll just use that to supply the writer. + nodeId -> new InlineVectors.State(ravv.getVector(nodeId)))); +// writer.close() if not using try-with-resources +``` + +At this point the graph index has been written to disk and can be used by creating an instance of `OnDiskGraphIndex`. + +To do so, we'll need to specify a `RandomAccessReader` implementation which JVector will use to read parts of the file as needed. This interface isn't thread safe, so when creating an `OnDiskGraphIndex` we pass in a `ReaderSupplier` object which can be used by the `OnDiskGraphIndex` to create `RandomAccessReader`s as needed. + +```java +// ReaderSupplierFactory automatically picks an available RandomAccessReader implementation +ReaderSupplier readerSupplier = ReaderSupplierFactory.open(graphPath); +OnDiskGraphIndex graph = OnDiskGraphIndex.load(readerSupplier); +``` + +Now we can perform searches exactly as with the in-memory index, with one minor difference: for the in-memory index, we had to keep track of the RAVV we used to build the graph in order to create an exact `SearchScoreProvider` for each query. 
In this case, since Inline vectors are available as a feature, we can acquire a RAVV from the index itself. + +```java +GraphSearcher searcher = new GraphSearcher(graph); +// Views of an OnDiskGraphIndex with inline or separated vectors can be used as RAVVs! +// In multi-threaded scenarios you should have one searcher per thread +// and extract a view for each thread from the associated searcher. +RandomAccessVectorValues graphRavv = (RandomAccessVectorValues) searcher.getView(); + +// number of search results we want +int topK = 10; +// `rerankK` controls the number of nodes to fetch from the initial graph search. +// which are then re-ranked to return the actual topK results. +// Increasing rerankK improves accuracy at the cost of latency and throughput. +int rerankK = 20; +VectorFloat query = dataset.getQueryVectors().get(0); +// use the RAVV from the graph instead of the one from the original dataSet +SearchScoreProvider ssp = DefaultSearchScoreProvider.exact(query, vsf, graphRavv); +// A slightly more complex overload of `search` which adds three extra parameters. +// Right now we only care about `rerankK`. +SearchResult sr = searcher.search(ssp, topK, rerankK, 0.0f, 0.0f, Bits.ALL); +``` + +The code from this tutorial is available in `DiskIntro.java`. Run it from the root of this repo using +```sh +mvn compile exec:exec@tutorial -Dtutorial=disk -pl jvector-examples -am +``` +The full example also illustrates the impact that adjusting `rerankK` has on recall. + +Next tutorial: +- Larger than memory indexes with Product Quantization diff --git a/docs/tutorials/3-larger-than-memory-tutorial.md b/docs/tutorials/3-larger-than-memory-tutorial.md new file mode 100644 index 000000000..1374f3c11 --- /dev/null +++ b/docs/tutorials/3-larger-than-memory-tutorial.md @@ -0,0 +1,188 @@ +# JVector Tutorial Part 3: Larger-Than-Memory Indexes with Product Quantization + +In the previous tutorials, we built indexes in-memory then wrote them to disk. 
However, this requires enough memory to hold all the vectors during construction. In this tutorial we'll build larger-than-memory indexes using two key techniques:
- Write vectors to disk as they are streamed in
- Keep compressed versions of the vectors in memory to allow JVector to compute similarities during construction. We'll use Product Quantization (PQ).

> [!TIP]
> Product Quantization is a lossy compression technique that works by:
> 1. Dividing each vector into subspaces (e.g., a 128-dimensional vector into 16 subspaces of 8 dimensions each)
> 2. Learning a codebook of centroids for each subspace
> 3. Representing each subspace by the index of its nearest centroid
>
> For example, consider 128-dimensional vectors. With 16 subspaces and 256 centroids per subspace, each vector is represented by 16 bytes instead of 512 bytes, a 32x compression ratio.
>
> Read the [PQ paper](https://ieeexplore.ieee.org/document/5432202) for more details.

## Loading the Dataset

We'll change things up with a different dataset:

```java
// The DataSet provided by loadDataSet is in-memory,
// but you can apply the same technique even when you don't have
// the base vectors in-memory.
DataSet dataset = DataSets.loadDataSet("e5-small-v2-100k").orElseThrow(() ->
        new RuntimeException("Dataset doesn't exist or wasn't configured correctly")
);

// Remember that RAVVs need not be in-memory in the general case.
// We will sample from this RAVV to compute the PQ codebooks.
// In general you don't need to have a RAVV over all the vectors to
// build PQ codebooks, but you do need a "representative set".
RandomAccessVectorValues ravv = dataset.getBaseRavv();
VectorSimilarityFunction vsf = dataset.getSimilarityFunction();
int dim = dataset.getDimension();
```

## Computing the Product Quantization Codebook

Before we can compress vectors, we need to compute a PQ codebook using a representative set of vectors.
This is just a set of vectors whose distribution matches or approximates the distribution of the entire set of base vectors. In this case we'll select the representative set by randomly sampling from the entire set of vectors.

```java
// PQ parameters
int subspaces = 64; // number of subspaces to divide each vector into
int centroidsPerSubspace = 256; // number of centroids per subspace (256 => 1 byte)
boolean centerDataset = false; // we won't ask to center the dataset before quantization

// This method randomly samples at most MAX_PQ_TRAINING_SET_SIZE vectors
// from the RAVV and considers that a "representative set" used to build the codebooks.
ProductQuantization pq = ProductQuantization.compute(ravv, subspaces, centroidsPerSubspace, centerDataset);
```

## Setting Up for Incremental Construction

Now we'll set up the structures needed for incremental index construction:

```java
// MutablePQVectors is a thread-safe, dynamically growing container for compressed vectors.
// As we add vectors to the index, we'll compress them and store them here.
// These compressed vectors are used during graph construction for approximate distance calculations.
var pqVectors = new MutablePQVectors(pq);

// Provides approximate scores during graph construction using the compressed vectors
BuildScoreProvider bsp = BuildScoreProvider.pqBuildScoreProvider(vsf, pqVectors);
```

Since we have a BuildScoreProvider, we're now ready to create and use a `GraphIndexBuilder`. We will also create the corresponding `GraphIndexWriter` at the same time. Unlike the last tutorial, where we created the writer only after building the complete graph, here we'll write full-resolution vectors to disk in tandem with adding them to the graph.

```java
// Initialize the graph parameters M, ef etc
// ...
Path graphPath = Files.createTempFile("jvector-ltm-graph", null);

// remember to close these manually or use try-with-resources
GraphIndexBuilder builder = new GraphIndexBuilder(bsp, dim, M, ef, overflow, alpha, addHierarchy);
OnDiskGraphIndexWriter writer = new OnDiskGraphIndexWriter.Builder(builder.getGraph(), graphPath)
        .with(new InlineVectors(dim))
        // Since we start with an empty graph, the writer will, by default,
        // assume an ordinal mapping of size 0 (which is obviously incorrect).
        // This is easy to rectify if you know the number of vectors beforehand;
        // if not, you may need to implement OrdinalMapper yourself.
        .withMapper(new OrdinalMapper.IdentityMapper(ravv.size() - 1))
        .build();
```

## Incremental Index Construction

We'll proceed to add vectors to the index one at a time. The general procedure is:
- Encode the vector using the PQ codebooks created earlier, and add it to the collection of PQ vectors
- Write the full-resolution vector to disk
- Add the new vector to the graph

```java
for (int ordinal = 0; ordinal < ravv.size(); ordinal++) {
    VectorFloat<?> v = ravv.getVector(ordinal);

    // Encode and add the vector to the working set of PQ vectors,
    // which allows the graph builder to access it through the BuildScoreProvider.
    pqVectors.encodeAndSet(ordinal, v);

    // Write the feature (full-resolution vector) for a single vector instead of all at once.
    writer.writeFeaturesInline(ordinal, Feature.singleState(FeatureId.INLINE_VECTORS, new InlineVectors.State(v)));

    builder.addGraphNode(ordinal, v);
}
```

> [!TIP]
> When you call `builder.build(ravv)`, vectors from the RAVV are added to the graph in parallel. There's no parallelization if you add vectors one by one in the same thread, but you can parallelize manually by inserting from multiple threads at once; see the full example in LargerThanMemory.java.
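Conceptually, the `encodeAndSet` step above maps each subvector to the index of its nearest centroid. A toy, stdlib-only sketch of that encoding idea (an illustration only, not JVector's `ProductQuantization` implementation):

```java
import java.util.Arrays;

public class ToyPQ {
    // Encode one vector: split it into subspaces and store, per subspace,
    // the index of the nearest centroid (one byte when there are <= 256 centroids).
    // codebooks[s][c] is centroid c of subspace s.
    static byte[] encode(float[] v, float[][][] codebooks) {
        int subspaces = codebooks.length;
        int subDim = v.length / subspaces;
        byte[] codes = new byte[subspaces];
        for (int s = 0; s < subspaces; s++) {
            float[] sub = Arrays.copyOfRange(v, s * subDim, (s + 1) * subDim);
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int c = 0; c < codebooks[s].length; c++) {
                float d = 0;
                for (int i = 0; i < subDim; i++) {
                    float diff = sub[i] - codebooks[s][c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            codes[s] = (byte) best;
        }
        return codes;
    }

    public static void main(String[] args) {
        // 4-d vectors, 2 subspaces of 2 dims, 2 centroids per subspace (toy codebooks).
        float[][][] codebooks = {
            {{0f, 0f}, {1f, 1f}},   // subspace 0 centroids
            {{0f, 1f}, {1f, 0f}}    // subspace 1 centroids
        };
        byte[] codes = encode(new float[]{0.9f, 1.1f, 0.1f, 0.8f}, codebooks);
        System.out.println(Arrays.toString(codes)); // 4 floats compressed to 2 bytes
    }
}
```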
Once we've added all the vectors, we can write the graph structure to disk:

```java
// Must be done manually for incrementally built graphs.
// Enforces the maximum degree constraint, among other things.
builder.cleanup();

// Write the graph structure (neighbor lists) to disk.
// No need to pass in a feature supplier since we wrote the features incrementally
// using writer.writeFeaturesInline
writer.write(Map.of());
```

We should also save the PQ vectors we created, since we'll need them later when it's time to run searches:

```java
// PQ codebooks and vectors also need to be saved somewhere! (we'll not concern ourselves with Fused PQ)
Path pqPath = Files.createTempFile("jvector-ltm-pq", null);
try (var pqOut = new BufferedRandomAccessWriter(pqPath)) {
    pqVectors.write(pqOut);
}
```

> [!NOTE]
> JVector supports a graph feature called "Fused PQ" that lets you embed the PQ vectors and codebooks in the on-disk graph, similar to how full-resolution vectors are embedded. This saves you from having to store them separately, but we won't cover it in this tutorial.

## Searching with Two-Pass Approach

Now that we've built and saved the index and the PQ vectors, we're ready to start searching. The only things that need to be held in memory are the PQ vectors and the upper layers of the graph (if the graph is hierarchical). The full-resolution vectors and the lower layer of the graph remain on disk and are read on demand.

Searching is done in two phases:
- The first phase searches the graph to identify a set of candidates (as many as `rerankK`). In-memory PQ vectors are used to compute scores.
- The second phase reranks the candidates and returns the top `topK` results. Exact scores are computed using full-resolution vectors from disk.
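The two phases can be reduced to a toy, stdlib-only sketch, with coarsely rounded vectors standing in for the PQ-compressed ones (an illustration only, not JVector's implementation):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TwoPhaseSketch {
    static float sqDist(float[] a, float[] b) {
        float d = 0;
        for (int i = 0; i < a.length; i++) { float t = a[i] - b[i]; d += t * t; }
        return d;
    }

    // Phase 1 ranks by approximate (compressed) distance and keeps rerankK candidates;
    // phase 2 reranks those candidates with exact distances and keeps topK.
    static List<Integer> search(float[][] base, float[][] approx, float[] query, int rerankK, int topK) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < base.length; i++) ids.add(i);
        ids.sort(Comparator.comparingDouble(i -> sqDist(query, approx[i])));
        List<Integer> candidates = new ArrayList<>(ids.subList(0, rerankK));
        candidates.sort(Comparator.comparingDouble(i -> sqDist(query, base[i])));
        return candidates.subList(0, topK);
    }

    public static void main(String[] args) {
        float[][] base = {{0f, 0f}, {1f, 1f}, {0.2f, 0.1f}, {5f, 5f}};
        // Stand-in for PQ: rounded copies of the base vectors (lossy, cheap to score).
        float[][] approx = new float[base.length][];
        for (int i = 0; i < base.length; i++)
            approx[i] = new float[]{Math.round(base[i][0]), Math.round(base[i][1])};
        // The rounded vectors can't distinguish node 0 from node 2, but the exact
        // rerank pass over the rerankK=3 candidates recovers the true nearest neighbor.
        System.out.println(search(base, approx, new float[]{0.1f, 0.1f}, 3, 1));
    }
}
```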
```java
// nothing new here
ReaderSupplier graphSupplier = ReaderSupplierFactory.open(graphPath);
OnDiskGraphIndex graph = OnDiskGraphIndex.load(graphSupplier);
// except that we also need a reader for the PQ vectors
ReaderSupplier pqSupplier = ReaderSupplierFactory.open(pqPath);
RandomAccessReader pqReader = pqSupplier.get();
// don't forget to close all of the above! (or just use try-with-resources)

// we need to have the PQ vectors in memory
PQVectors pqVectorsSearch = PQVectors.load(pqReader);

int topK = 10;
float overqueryFactor = 32.0f;
int rerankK = (int) (topK * overqueryFactor);

// a friendly reminder that searchers need closing too
GraphSearcher searcher = new GraphSearcher(graph);
var graphRavv = (RandomAccessVectorValues) searcher.getView();
VectorFloat<?> query = dataset.getQueryVectors().get(0);

// Two-phase search:
// 1. ApproximateScoreFunction (ASF) uses compressed vectors for fast initial search
// 2. Reranker uses full-resolution vectors from disk for accurate final ranking
var asf = pqVectorsSearch.precomputedScoreFunctionFor(query, vsf);
var reranker = graphRavv.rerankerFor(query, vsf);
SearchScoreProvider ssp = new DefaultSearchScoreProvider(asf, reranker);

SearchResult sr = searcher.search(ssp, topK, rerankK, 0.0f, 0.0f, Bits.ALL);
```

## Full Example

The complete code from this tutorial is available in `LargerThanMemory.java`.
Run it from the root of this repo using: + +```sh +mvn compile exec:exec@tutorial -Dtutorial=ltm -pl jvector-examples -am +``` + +## Next Steps + +- Fused PQ +- NVQ diff --git a/jvector-examples/pom.xml b/jvector-examples/pom.xml index 0f79b2986..ae8c77d6d 100644 --- a/jvector-examples/pom.xml +++ b/jvector-examples/pom.xml @@ -166,6 +166,18 @@ false + + tutorial + + + -classpath + + -ea + io.github.jbellis.jvector.example.tutorial.TutorialRunner + ${tutorial} + + + sift @@ -232,6 +244,19 @@ false + + tutorial + + + -classpath + + --add-modules=jdk.incubator.vector + -ea + io.github.jbellis.jvector.example.tutorial.TutorialRunner + ${tutorial} + + + sift @@ -348,6 +373,21 @@ false + + tutorial + + + -classpath + + --enable-native-access=ALL-UNNAMED + --add-modules=jdk.incubator.vector + -ea + -Djvector.experimental.enable_native_vectorization=true + io.github.jbellis.jvector.example.tutorial.TutorialRunner + ${tutorial} + + + sift diff --git a/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/DiskIntro.java b/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/DiskIntro.java new file mode 100644 index 000000000..e38c5c5b8 --- /dev/null +++ b/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/DiskIntro.java @@ -0,0 +1,136 @@ +/* + * Copyright DataStax, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+package io.github.jbellis.jvector.example.tutorial;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import io.github.jbellis.jvector.disk.ReaderSupplier;
+import io.github.jbellis.jvector.disk.ReaderSupplierFactory;
+import io.github.jbellis.jvector.example.benchmarks.datasets.DataSet;
+import io.github.jbellis.jvector.example.benchmarks.datasets.DataSets;
+import io.github.jbellis.jvector.example.util.AccuracyMetrics;
+import io.github.jbellis.jvector.graph.GraphIndexBuilder;
+import io.github.jbellis.jvector.graph.GraphSearcher;
+import io.github.jbellis.jvector.graph.ImmutableGraphIndex;
+import io.github.jbellis.jvector.graph.RandomAccessVectorValues;
+import io.github.jbellis.jvector.graph.SearchResult;
+import io.github.jbellis.jvector.graph.disk.GraphIndexWriter;
+import io.github.jbellis.jvector.graph.disk.GraphIndexWriterTypes;
+import io.github.jbellis.jvector.graph.disk.OnDiskGraphIndex;
+import io.github.jbellis.jvector.graph.disk.feature.FeatureId;
+import io.github.jbellis.jvector.graph.disk.feature.InlineVectors;
+import io.github.jbellis.jvector.graph.similarity.BuildScoreProvider;
+import io.github.jbellis.jvector.graph.similarity.DefaultSearchScoreProvider;
+import io.github.jbellis.jvector.graph.similarity.SearchScoreProvider;
+import io.github.jbellis.jvector.util.Bits;
+import io.github.jbellis.jvector.vector.VectorSimilarityFunction;
+import io.github.jbellis.jvector.vector.types.VectorFloat;
+
+// Provides the code used in disk-tutorial.md.
+// If you edit this file you may also want to edit disk-tutorial.md
+public class DiskIntro {
+    public static void main(String[] args) throws IOException {
+        // This is a preconfigured dataset that will be downloaded automatically.
+ DataSet dataset = DataSets.loadDataSet("ada002-100k").orElseThrow(() -> + new RuntimeException("Dataset doesn't exist or wasn't configured correctly") + ); + + // The loaded DataSet provides a RAVV over the base vectors + RandomAccessVectorValues ravv = dataset.getBaseRavv(); + VectorSimilarityFunction vsf = dataset.getSimilarityFunction(); + int dim = dataset.getDimension(); + + // reasonable defaults + int M = 32; + int ef = 100; + float overflow = 1.2f; + float alpha = 1.2f; + boolean addHierarchy = true; + + BuildScoreProvider bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, vsf); + + System.out.println("Building the graph, may take a few minutes"); + + // nothing new here + ImmutableGraphIndex heapGraph; + try (GraphIndexBuilder builder = new GraphIndexBuilder(bsp, dim, M, ef, overflow, alpha, addHierarchy)) { + heapGraph = builder.build(ravv); + } + + Path graphPath = Files.createTempFile("jvector-example-graph", null); // or wherever you want to save the graph + try ( + // Create a writer for the on-heap graph we just built. + GraphIndexWriter writer = GraphIndexWriter.getBuilderFor(GraphIndexWriterTypes.RANDOM_ACCESS_PARALLEL, heapGraph, graphPath) + // Let the writer know that we'll also be passing in the actual vector data + // to be saved "inline" with the data for each corresponding graph node. + .with(new InlineVectors(dim)) + .build(); + ) { + // Supply one map entry for each feature. + // The key is a FeatureId enum corresponding to the feature + // and the value is a function which generates the feature state for each graph node. + writer.write(Map.of( + FeatureId.INLINE_VECTORS, + // we already have a RAVV, so we'll just use that to supply the writer. 
+            nodeId -> new InlineVectors.State(ravv.getVector(nodeId))));
+        }
+
+        // ReaderSupplierFactory automatically picks an available RandomAccessReader implementation
+        ReaderSupplier readerSupplier = ReaderSupplierFactory.open(graphPath);
+        OnDiskGraphIndex graph = OnDiskGraphIndex.load(readerSupplier);
+
+        System.out.println("Performing searches with increasing overquery factor");
+
+        // number of search results we want
+        int topK = 10;
+        for (float overqueryFactor : new float[]{1.0f, 1.5f, 2.0f, 5.0f, 10.0f}) {
+            // `rerankK` controls the number of nodes to fetch from the initial graph search,
+            // which are then re-ranked to return the actual topK results.
+            // Increasing rerankK improves accuracy at the cost of latency and throughput.
+            int rerankK = (int) (topK * overqueryFactor);
+
+            try (GraphSearcher searcher = new GraphSearcher(graph)) {
+                // Views of an OnDiskGraphIndex with inline or separated vectors can be used as RAVVs!
+                // In multi-threaded scenarios you should have one searcher per thread
+                // and extract a view for each thread from the associated searcher.
+                var graphRavv = (RandomAccessVectorValues) searcher.getView();
+
+                List<SearchResult> results = new ArrayList<>();
+                for (VectorFloat<?> query : dataset.getQueryVectors()) {
+                    // use the RAVV from the graph instead of the one from the original dataSet
+                    SearchScoreProvider ssp = DefaultSearchScoreProvider.exact(query, vsf, graphRavv);
+                    // A slightly more complex overload of `search` which adds three extra parameters.
+                    // Right now we only care about `rerankK`.
+ SearchResult sr = searcher.search(ssp, topK, rerankK, 0.0f, 0.0f, Bits.ALL); + results.add(sr); + } + + double recall = AccuracyMetrics.recallFromSearchResults(dataset.getGroundTruth(), results, topK, topK); + System.out.println(String.format("Recall@%d for overquery by %f = %f", topK, overqueryFactor, recall)); + } + } + + // cleanup + readerSupplier.close(); + Files.deleteIfExists(graphPath); + } +} diff --git a/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/LargerThanMemory.java b/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/LargerThanMemory.java new file mode 100644 index 000000000..4e16cda5d --- /dev/null +++ b/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/LargerThanMemory.java @@ -0,0 +1,206 @@ +/* + * Copyright DataStax, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package io.github.jbellis.jvector.example.tutorial; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.nio.file.Files; +import java.nio.file.Path; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.stream.IntStream; + +import io.github.jbellis.jvector.disk.BufferedRandomAccessWriter; +import io.github.jbellis.jvector.disk.RandomAccessReader; +import io.github.jbellis.jvector.disk.ReaderSupplier; +import io.github.jbellis.jvector.disk.ReaderSupplierFactory; +import io.github.jbellis.jvector.example.benchmarks.datasets.DataSet; +import io.github.jbellis.jvector.example.benchmarks.datasets.DataSets; +import io.github.jbellis.jvector.example.util.AccuracyMetrics; +import io.github.jbellis.jvector.graph.GraphIndexBuilder; +import io.github.jbellis.jvector.graph.GraphSearcher; +import io.github.jbellis.jvector.graph.RandomAccessVectorValues; +import io.github.jbellis.jvector.graph.SearchResult; +import io.github.jbellis.jvector.graph.disk.OnDiskGraphIndex; +import io.github.jbellis.jvector.graph.disk.OnDiskGraphIndexWriter; +import io.github.jbellis.jvector.graph.disk.OrdinalMapper; +import io.github.jbellis.jvector.graph.disk.feature.Feature; +import io.github.jbellis.jvector.graph.disk.feature.FeatureId; +import io.github.jbellis.jvector.graph.disk.feature.InlineVectors; +import io.github.jbellis.jvector.graph.similarity.BuildScoreProvider; +import io.github.jbellis.jvector.graph.similarity.DefaultSearchScoreProvider; +import io.github.jbellis.jvector.graph.similarity.SearchScoreProvider; +import io.github.jbellis.jvector.quantization.MutablePQVectors; +import io.github.jbellis.jvector.quantization.PQVectors; +import io.github.jbellis.jvector.quantization.ProductQuantization; +import io.github.jbellis.jvector.util.Bits; +import io.github.jbellis.jvector.util.PhysicalCoreExecutor; +import io.github.jbellis.jvector.vector.VectorSimilarityFunction; +import 
io.github.jbellis.jvector.vector.types.VectorFloat; + +// Provides the code used in 3-larger-than-memory-tutorial.md. +// If you edit this file you may also want to edit 3-larger-than-memory-tutorial.md +public class LargerThanMemory { + public static void main(String[] args) throws IOException { + // The DataSet provided by loadDataSet is in-memory, + // but you can apply the same technique even when you don't have + // the base vectors in-memory. + DataSet dataset = DataSets.loadDataSet("e5-small-v2-100k").orElseThrow(() -> + new RuntimeException("Dataset doesn't exist or wasn't configured correctly") + ); + + // Remember that RAVVs need not be in-memory in the general case. + // We will sample from this RAVV to compute the PQ codebooks. + // In general you don't need to have a RAVV over all the vectors to + // build PQ codebooks, but you do need a "representative set". + RandomAccessVectorValues ravv = dataset.getBaseRavv(); + VectorSimilarityFunction vsf = dataset.getSimilarityFunction(); + int dim = dataset.getDimension(); + + // PQ parameters + int subspaces = 64; // number of subspaces to divide each vector into + int centroidsPerSubspace = 256; // number of centroids per subspace (256 => 1 byte) + boolean centerDataset = false; // we won't ask to center the dataset before quantization + + System.out.println("Computing PQ codebooks..."); + // This method randomly samples at most MAX_PQ_TRAINING_SET_SIZE vectors + // from the RAVV and considers that a "representative set" used to build the codebooks. + ProductQuantization pq = ProductQuantization.compute(ravv, subspaces, centroidsPerSubspace, centerDataset); + + // MutablePQVectors is a thread-safe, dynamically growing container for compressed vectors. + // As we add vectors to the index, we'll compress them and store them here. + // These compressed vectors are used during graph construction for approximate distance calculations. 
+        var pqVectors = new MutablePQVectors(pq);
+
+        // Provides approximate scores during graph construction using the compressed vectors
+        BuildScoreProvider bsp = BuildScoreProvider.pqBuildScoreProvider(vsf, pqVectors);
+
+        // Graph construction parameters
+        int M = 32;
+        int ef = 100;
+        float overflow = 1.2f;
+        float alpha = 1.2f;
+        boolean addHierarchy = true;
+
+        Path graphPath = Files.createTempFile("jvector-ltm-graph", null);
+
+        System.out.println("Building index incrementally...");
+
+        try (
+            GraphIndexBuilder builder = new GraphIndexBuilder(bsp, dim, M, ef, overflow, alpha, addHierarchy);
+            // In DiskIntro we created the writer after generating the complete graph and closing the builder,
+            // but for incremental construction we will build and write in concert.
+            OnDiskGraphIndexWriter writer = new OnDiskGraphIndexWriter.Builder(builder.getGraph(), graphPath)
+                .with(new InlineVectors(dim))
+                // Since we start with an empty graph, the writer will, by default,
+                // assume an ordinal mapping of size 0 (which is obviously incorrect).
+                // This is easy to rectify if you know the number of vectors beforehand;
+                // if not, you may need to implement OrdinalMapper yourself.
+                .withMapper(new OrdinalMapper.IdentityMapper(ravv.size() - 1))
+                .build();
+        ) {
+            // Graph building is best done with threads = number of physical cores.
+            // PhysicalCoreExecutor assumes hyperthreading by default, i.e. cores = vCPUs / 2.
+            // If this is not correct, set the system property `jvector.physical_core_count`.
+            PhysicalCoreExecutor.pool().submit(() -> {
+                IntStream.range(0, ravv.size()).parallel().forEach(ordinal -> {
+                    VectorFloat<?> v = ravv.getVector(ordinal);
+
+                    // Encode and add the vector to the working set of PQ vectors,
+                    // which allows the graph builder to access it through the BuildScoreProvider.
+                    pqVectors.encodeAndSet(ordinal, v);
+
+                    // Write the feature (full-resolution vector) for a single vector instead of all at once.
+                    try {
+                        writer.writeFeaturesInline(ordinal, Feature.singleState(FeatureId.INLINE_VECTORS, new InlineVectors.State(v)));
+                    } catch (IOException e) {
+                        throw new UncheckedIOException(e);
+                    }
+
+                    builder.addGraphNode(ordinal, v);
+                });
+            }).join();
+
+            // Must be done manually for incrementally built graphs.
+            // Enforces maximum degree constraint among other things.
+            builder.cleanup();
+
+            // No need to pass in a feature supplier since we wrote the features incrementally
+            // using writer.writeFeaturesInline
+            writer.write(Map.of());
+        }
+
+        // PQ codebooks and vectors also need to be saved somewhere! (we'll not concern ourselves with Fused PQ)
+        Path pqPath = Files.createTempFile("jvector-ltm-pq", null);
+        try (var pqOut = new BufferedRandomAccessWriter(pqPath)) {
+            pqVectors.write(pqOut);
+        }
+
+        System.out.println("Index built successfully!");
+
+        // Calculate and display the compression ratio
+        int compressedSize = pq.compressedVectorSize();
+        int originalSize = dim * Float.BYTES;
+        float compressionRatio = (float) originalSize / compressedSize;
+        System.out.println(String.format("Compression ratio: %.1fx (%d bytes -> %d bytes per vector)",
+                compressionRatio, originalSize, compressedSize));
+
+        System.out.println("\nSearching with two-pass approach...");
+
+        try (
+            // nothing new here
+            ReaderSupplier graphSupplier = ReaderSupplierFactory.open(graphPath);
+            OnDiskGraphIndex graph = OnDiskGraphIndex.load(graphSupplier);
+            // except that we also need a reader for the PQ vectors
+            ReaderSupplier pqSupplier = ReaderSupplierFactory.open(pqPath);
+            RandomAccessReader pqReader = pqSupplier.get()
+        ) {
+            // we need to have the PQ vectors in memory
+            PQVectors pqVectorsSearch = PQVectors.load(pqReader);
+
+            int topK = 10;
+            for (float overqueryFactor : new float[]{4.0f, 8.0f, 16.0f, 32.0f, 64.0f}) {
+                int rerankK = (int) (topK * overqueryFactor);
+
+                try (GraphSearcher searcher = new GraphSearcher(graph)) {
+                    var graphRavv = (RandomAccessVectorValues) searcher.getView();
+
+
List<SearchResult> results = new ArrayList<>();
+                    for (VectorFloat<?> query : dataset.getQueryVectors()) {
+                        // Two-phase search:
+                        // 1. ApproximateScoreFunction (ASF) uses compressed vectors for fast initial search
+                        // 2. Reranker uses full-resolution vectors from disk for accurate final ranking
+                        var asf = pqVectorsSearch.precomputedScoreFunctionFor(query, vsf);
+                        var reranker = graphRavv.rerankerFor(query, vsf);
+                        SearchScoreProvider ssp = new DefaultSearchScoreProvider(asf, reranker);
+
+                        SearchResult sr = searcher.search(ssp, topK, rerankK, 0.0f, 0.0f, Bits.ALL);
+                        results.add(sr);
+                    }
+
+                    double recall = AccuracyMetrics.recallFromSearchResults(dataset.getGroundTruth(), results, topK, topK);
+                    System.out.println(String.format("Recall@%d for overquery by %f = %f", topK, overqueryFactor, recall));
+                }
+            }
+        }
+
+        // cleanup after ourselves
+        Files.deleteIfExists(graphPath);
+        Files.deleteIfExists(pqPath);
+    }
+}
diff --git a/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/TutorialRunner.java b/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/TutorialRunner.java
new file mode 100644
index 000000000..9ad4f5b84
--- /dev/null
+++ b/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/TutorialRunner.java
@@ -0,0 +1,45 @@
+/*
+ * Copyright DataStax, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package io.github.jbellis.jvector.example.tutorial;
+
+import java.io.IOException;
+
+public class TutorialRunner {
+    public static void main(String[] args) throws IOException {
+        if (args.length < 1) {
+            throw new IllegalArgumentException("Please pick an example");
+        }
+        String[] forwardArgs = new String[args.length - 1];
+        for (int i = 0; i < forwardArgs.length; i++) {
+            forwardArgs[i] = args[i + 1];
+        }
+
+        switch (args[0]) {
+            case "intro":
+                VectorIntro.main(forwardArgs);
+                break;
+            case "disk":
+                DiskIntro.main(forwardArgs);
+                break;
+            case "ltm":
+                LargerThanMemory.main(forwardArgs);
+                break;
+            default:
+                throw new IllegalArgumentException("Unknown example: " + args[0]);
+        }
+    }
+}
diff --git a/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/VectorIntro.java b/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/VectorIntro.java
new file mode 100644
index 000000000..4bc917e3a
--- /dev/null
+++ b/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial/VectorIntro.java
@@ -0,0 +1,117 @@
+/*
+ * Copyright DataStax, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +package io.github.jbellis.jvector.example.tutorial; + +import java.io.IOException; +import java.util.List; + +import io.github.jbellis.jvector.graph.GraphIndexBuilder; +import io.github.jbellis.jvector.graph.GraphSearcher; +import io.github.jbellis.jvector.graph.ImmutableGraphIndex; +import io.github.jbellis.jvector.graph.ListRandomAccessVectorValues; +import io.github.jbellis.jvector.graph.RandomAccessVectorValues; +import io.github.jbellis.jvector.graph.SearchResult; +import io.github.jbellis.jvector.graph.SearchResult.NodeScore; +import io.github.jbellis.jvector.graph.similarity.BuildScoreProvider; +import io.github.jbellis.jvector.graph.similarity.DefaultSearchScoreProvider; +import io.github.jbellis.jvector.graph.similarity.SearchScoreProvider; +import io.github.jbellis.jvector.util.Bits; +import io.github.jbellis.jvector.vector.VectorSimilarityFunction; +import io.github.jbellis.jvector.vector.VectorizationProvider; +import io.github.jbellis.jvector.vector.types.VectorFloat; +import io.github.jbellis.jvector.vector.types.VectorTypeSupport; + +// This file provides the entirety of the code used in introductory tutorial. +// Changes to this file should be applied to `intro-tutorial.md` as well. +// If you're looking through this file to learn JVector, +// you may want to go through `docs/intro-tutorial.md` as well. +public class VectorIntro { + public static void main(String[] args) throws IOException { + // `VectorizationProvider` is automatically picked based on the system, language version and runtime flags + // and determines the actual type of the vector data, and provides implementations for common operations + // like the inner product. + VectorTypeSupport vts = VectorizationProvider.getInstance().getVectorTypeSupport(); + + int dimension = 3; + + // Create a `VectorFloat` from a `float[]`. 
+    // The types that can be converted to a VectorFloat are technically dependent on which VectorizationProvider is picked,
+    // but `float[]` is generally a safe bet.
+        float[] vector0Array = new float[]{0.1f, 0.2f, 0.3f};
+        VectorFloat<?> vector0 = vts.createFloatVector(vector0Array);
+
+        // This toy example uses only three vectors; in practical cases you might have millions or more.
+        List<VectorFloat<?>> baseVectors = List.of(
+                vector0,
+                vts.createFloatVector(new float[]{0.01f, 0.15f, -0.3f}),
+                vts.createFloatVector(new float[]{-0.2f, 0.1f, 0.35f})
+        );
+
+        // RAVV or `ravv` is convenient shorthand for a RandomAccessVectorValues instance
+        RandomAccessVectorValues ravv = new ListRandomAccessVectorValues(baseVectors, dimension /* 3 */);
+
+        // The type of similarity score to use. JVector supports EUCLIDEAN (L2 distance), DOT_PRODUCT and COSINE.
+        VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
+
+        // A simple score provider which can compute exact similarity scores by holding a reference to all the base vectors.
+        BuildScoreProvider bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, similarityFunction);
+
+        // Graph construction parameters
+        int M = 32; // maximum degree of each node
+        int efConstruction = 100; // search depth during construction
+        float neighborOverflow = 1.2f;
+        float alpha = 1.2f; // note: not the best setting for 3D vectors, but good in the general case
+        boolean addHierarchy = true; // use an HNSW-style hierarchy
+        boolean refineFinalGraph = true;
+
+        // Build the graph index using a Builder
+        ImmutableGraphIndex graph;
+        try (GraphIndexBuilder builder = new GraphIndexBuilder(bsp,
+                dimension,
+                M,
+                efConstruction,
+                neighborOverflow,
+                alpha,
+                addHierarchy,
+                refineFinalGraph)) {
+            graph = builder.build(ravv);
+        }
+
+        VectorFloat<?> queryVector = vts.createFloatVector(new float[]{0.2f, 0.3f, 0.4f}); // for example
+        // The in-memory graph index doesn't own the actual vectors used to construct it.
+ // To compute exact scores at search time, you need to pass in the base RAVV again, + // in addition to the actual query vector + SearchScoreProvider ssp = DefaultSearchScoreProvider.exact(queryVector, similarityFunction, ravv); + + int topK = 10; // number of approximate nearest neighbors to fetch + + // Search the graph via a GraphSearcher + SearchResult result; + try (GraphSearcher searcher = new GraphSearcher(graph)) { + // You can provide a filter to the query as a bit mask. + // In this case we want the actual topK neighbors without filtering, + // so we pass in a virtual bit mask representing all ones. + result = searcher.search(ssp, topK, Bits.ALL); + } + + for (NodeScore ns : result.getNodes()) { + int id = ns.node; // you can look up this ID in the RAVV + float score = ns.score; // the similarity score between this vector and the query vector (higher -> more similar) + System.out.println("ID: " + id + ", Score: " + score + ", Vector: " + ravv.getVector(id)); + } + } +} diff --git a/rat-excludes.txt b/rat-excludes.txt index e9dd9fb37..64aeba7a3 100644 --- a/rat-excludes.txt +++ b/rat-excludes.txt @@ -25,4 +25,5 @@ results.csv scripts/test_node_setup.sh scripts/jmh_results_formatter.py yaml-configs/*.yml -src/main/resources/logback.xml \ No newline at end of file +src/main/resources/logback.xml +docs/**/*.md