119 changes: 119 additions & 0 deletions docs/benchmarking.md
# How to run benchmarks

JVector comes with a built-in benchmarking system in `jvector-examples/.../BenchYAML.java`.

To run a benchmark:
- Decide which dataset(s) you want to benchmark. A dataset consists of:
  - The vectors to be indexed, usually called the "base" or "target" vectors.
  - The query vectors.
  - The "ground truth" results, which are used to compute accuracy metrics.
  - The similarity metric used to compute the ground truth (dot product, cosine similarity, or L2 distance).
- Configure the parameter combinations for which you want to run the benchmark. This includes graph index parameters, quantization parameters, and search parameters.

JVector supports two types of datasets:
- **Fvec/Ivec**: The dataset consists of three files, for example `base.fvec`, `queries.fvec` and `neighbors.ivec` containing the base vectors, query vectors, and ground truth. (`fvec` and `ivec` file formats are described [here](http://corpus-texmex.irisa.fr/))
- **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors and the ground truth.

The general procedure for running benchmarks is outlined below. The following sections describe each step in more detail.
- [Specify the dataset](#specifying-datasets) names to benchmark in `datasets.yml`.
  - Certain datasets will be downloaded automatically. If using a different dataset, make sure the dataset files are downloaded and made available (refer to the section on [Custom datasets](#custom-datasets)).
- Adjust the benchmark parameters in `default.yml`. This affects the parameters for all datasets to be benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<your-dataset-name>.yml` in the same folder.

You can run the configured benchmark with maven:
```sh
mvn clean compile exec:exec@bench -pl jvector-examples -am
```

## Specifying dataset(s)

The datasets you want to benchmark should be specified in `jvector-examples/yaml-configs/datasets.yml`. You'll notice this file already contains some entries; these are datasets that `bench` can automatically download and test with minimal additional configuration. Running `bench` without arguments and without changing this file will cause ALL the datasets to be benchmarked one by one (this is probably not what you want).

To benchmark a single dataset, comment out the entries corresponding to all other datasets. (Or provide command line arguments as described in [Running `bench` from the command line](#running-bench-from-the-command-line))

Datasets are assumed to be Fvec/Ivec based unless the entry in `datasets.yml` ends with `.hdf5`. In this case, `.hdf5` is not considered part of the "dataset name" referenced in other sections.

You'll notice that datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.

For HDF5 files, the substrings `-angular`, `-euclidean` and `-dot` correspond to cosine similarity, L2 distance, and dot product similarity functions (these substrings ARE considered to be part of the "dataset name"). Currently, Fvec/Ivec datasets are implicitly assumed to use cosine similarity (changing this requires editing `MultiFileDataSource.java`).

Example `datasets.yml`:

```yaml
category0:
- my-fvec-dataset # fvec/ivec dataset, cosine similarity
- my-hdf5-dataset-angular.hdf5 # hdf5 dataset, cosine similarity
some-other-category:
- a-huge-dataset-1024d-euclidean.hdf5 # hdf5 dataset, L2 similarity
- my-simple-dataset-dot.hdf5 # hdf5 dataset, dot product similarity
- some-dataset-euclidean # fvec/ivec dataset, cosine similarity (NOT L2 unless you change the code!)
```

## Setting benchmark parameters

### default.yml / \<dataset-name\>.yml

`jvector-examples/yaml-configs/default.yml` specifies the default index construction and search parameters to be used by `bench` for all datasets.

You can specify a custom set of parameters for any given dataset by creating a file called `<dataset-name>.yml`, with `<dataset-name>` replaced by the actual name of the dataset. This is the same identifier used in `datasets.yml`, but without the `.hdf5` suffix for HDF5 datasets. The format of this file is exactly the same as `default.yml`.

Refer to `default.yml` for a list of all options.
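As an illustration, suppose `datasets.yml` lists `my-fvec-dataset` and you want different construction parameters for just that dataset. You could create `jvector-examples/yaml-configs/my-fvec-dataset.yml` along these lines (a hypothetical sketch; the keys shown are the ones used elsewhere in this document, and `default.yml` is the authoritative list):

```yaml
construction:
  M: [16, 32]
  ef: [100]
```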

Most parameters can be specified as an array. For these parameters, a separate benchmark is run for each value of the parameter. If multiple parameters are specified as arrays, a benchmark is run for each combination (i.e. taking the Cartesian product). For example:
```yaml
construction:
M: [32, 64]
ef: [100, 200]
```
will build and benchmark four graphs, one for each combination of M and ef in {(32, 100), (64, 100), (32, 200), (64, 200)}. This is particularly useful when running a grid search to identify the best-performing parameters.

## Running `bench` from the command line

Once configured to your liking, you can run the benchmark through maven:
```sh
mvn compile exec:exec@bench -pl jvector-examples -am
```

To benchmark a subset of the datasets in `datasets.yml`, you can provide a space-separated list of regexes as arguments.
```sh
# matches `glove-25-angular.hdf5`, `glove-50-angular.hdf5`, `nytimes-256-angular.hdf5` etc
mvn compile exec:exec@bench -pl jvector-examples -am -DbenchArgs="glove nytimes"
```

## Custom Datasets

### Custom Fvec/Ivec datasets

Using fvec/ivec datasets requires them to be configured in `MultiFileDatasource.java`. Some datasets are already pre-configured; these will be downloaded and used automatically when you run the benchmark.
To use a custom dataset consisting of files `base.fvec`, `queries.fvec` and `neighbors.ivec`, do the following:
- Ensure that you have three files:
  - `base.fvec` containing N D-dimensional float vectors. These are used to build the index.
  - `queries.fvec` containing Q D-dimensional float vectors. These are used for querying the built index.
  - `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K nearest neighbors of that query among the base vectors.

  The files can be named however you like.
- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
- Edit `MultiFileDatasource.java` to configure a new dataset and its associated files:
```java
put("cust-ds", new MultiFileDatasource("cust-ds",
"/cust-ds/base.fvec",
"/cust-ds/query.fvec",
"/cust-ds/neighbors.ivec"));
```
The file paths are resolved relative to the `fvec` directory. `cust-ds` is the name of the dataset and can be changed to whatever is appropriate.
- In `jvector-examples/yaml-configs/datasets.yml`, add an entry corresponding to your custom dataset. Comment out other datasets which you do not want to benchmark.
```yaml
custom:
- cust-ds
```
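If you need to produce the three files yourself, the fvec/ivec layout (per the TEXMEX description linked above) stores each vector as a little-endian int32 dimension followed by that many little-endian float32 (or int32) components. A minimal sketch in Python; the file names and values are purely illustrative, and in a real dataset `neighbors.ivec` must contain the exact top-K neighbor ids for each query:

```python
import struct

def write_fvecs(path, vectors):
    # fvec layout: for each vector, a little-endian int32 dimension
    # followed by that many little-endian float32 components
    with open(path, "wb") as f:
        for v in vectors:
            f.write(struct.pack("<i", len(v)))
            f.write(struct.pack(f"<{len(v)}f", *v))

def write_ivecs(path, vectors):
    # ivec layout: same as fvec, but with int32 components
    with open(path, "wb") as f:
        for v in vectors:
            f.write(struct.pack("<i", len(v)))
            f.write(struct.pack(f"<{len(v)}i", *v))

write_fvecs("base.fvec", [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
write_fvecs("queries.fvec", [[0.15, 0.25, 0.35]])
write_ivecs("neighbors.ivec", [[0, 1]])  # exact top-K ids for each query
```

The resulting files would then go under `fvec/cust-ds/` as described above.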

### Custom HDF5 datasets

HDF5 datasets consist of a single file. The `Hdf5Loader` looks for three HDF5 datasets within the file: `train`, `test` and `neighbors`. These correspond to the base vectors, query vectors, and ground truth described above for fvec/ivec files.

To use an HDF5 dataset, edit `jvector-examples/yaml-configs/datasets.yml` to add an entry like the following:
```yaml
category:
- <dataset-name>.hdf5
```

BenchYAML looks for HDF5 datasets named `<dataset-name>.hdf5` in the `hdf5` folder in the root of this repo. If the file doesn't exist, BenchYAML will attempt to download the dataset from ann-benchmarks.com automatically. If your dataset is not from ann-benchmarks.com, simply ensure that the file is available in the `hdf5` folder and edit `datasets.yml` accordingly.
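If you are generating your own HDF5 file, one option is Python's `h5py` (an assumption of this sketch; any tool that writes the three expected datasets works). The shapes and values below are illustrative, and a real dataset would fill `neighbors` using an exact nearest-neighbor search:

```python
import numpy as np
import h5py

rng = np.random.default_rng(0)
base = rng.random((1000, 64), dtype=np.float32)    # "train": base vectors
queries = rng.random((10, 64), dtype=np.float32)   # "test": query vectors
neighbors = np.zeros((10, 100), dtype=np.int32)    # "neighbors": ground-truth ids
# (for a real dataset, compute `neighbors` with an exact nearest-neighbor search)

with h5py.File("my-simple-dataset-dot.hdf5", "w") as f:
    f.create_dataset("train", data=base)
    f.create_dataset("test", data=queries)
    f.create_dataset("neighbors", data=neighbors)
```

The resulting file would then be placed in the `hdf5` folder and referenced from `datasets.yml`.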
153 changes: 153 additions & 0 deletions docs/tutorials/1-intro-tutorial.md
# JVector Tutorial

JVector provides a graph index for ANN search which is a hybrid of DiskANN and HNSW. You can think of it as a Vamana index with an HNSW-style hierarchy. The rest of this tutorial assumes a basic understanding of vector search; no prior knowledge of HNSW or DiskANN is required.

JVector provides a `VectorFloat` datatype for representing vectors, as an abstraction over the physical vector type. Therefore, the first step to using JVector is to understand how to create a `VectorFloat`:

```java
// `VectorizationProvider` is automatically picked based on the system, language version and runtime flags
// and determines the actual type of the vector data, and provides implementations for common operations
// like the inner product.
VectorTypeSupport vts = VectorizationProvider.getInstance().getVectorTypeSupport();

int dimension = 3;

// Create a `VectorFloat` from a `float[]`.
// The types that can be converted to a VectorFloat are technically dependent on which VectorizationProvider is picked,
// but `float[]` is generally a safe bet.
float[] vector0Array = new float[]{0.1f, 0.2f, 0.3f};
VectorFloat<?> vector0 = vts.createFloatVector(vector0Array);
```

> [!TIP]
> For other ways to create vectors, refer to the javadoc for `VectorTypeSupport`.

Before creating the vector index, we will group all of our base vectors into a container which implements the `RandomAccessVectorValues` interface. Many APIs in JVector accept an instance of `RandomAccessVectorValues` as input. In this case, we'll use it to specify the vectors to be used to build the index.

```java
// This toy example uses only three vectors; in practice you might have millions or more.
List<VectorFloat<?>> baseVectors = List.of(
vector0,
vts.createFloatVector(new float[]{0.01f, 0.15f, -0.3f}),
vts.createFloatVector(new float[]{-0.2f, 0.1f, 0.35f})
);

// RAVV or `ravv` is convenient shorthand for a RandomAccessVectorValues instance
RandomAccessVectorValues ravv = new ListRandomAccessVectorValues(baseVectors, dimension /* 3 */);
```

> [!TIP]
> In this example, all vectors are loaded in-memory, but RAVVs are quite versatile. For example, you might have a RAVV backed by disk (check out `MMapRandomAccessVectorValues.java`) or write your own custom RAVV that transfers data over a network interface.

> [!NOTE]
> A note on terminology:
> - "Base" vectors are the vectors used to build the index. Each vector becomes a node in the graph. May also be referred to as the "train" set.
> - "Query" vectors are vectors used as queries for ANN search after the index has been built. In some cases you may want to use some base vectors as queries. Also referred to as the "test" set.

We're now ready to create a graph-based vector index. We'll do this using a `GraphIndexBuilder` as an intermediate. Let's take a look at the signature of one of its constructors:

```java
public GraphIndexBuilder(BuildScoreProvider scoreProvider,
int dimension,
int M,
int beamWidth,
float neighborOverflow,
float alpha,
boolean addHierarchy,
boolean refineFinalGraph);
```

This constructor asks for something called a `BuildScoreProvider`, the vector dimension, and a set of graph parameters.

The `BuildScoreProvider` is used by the graph builder to compute the similarity scores between any two vectors at build time. We'll use the RAVV we created earlier to generate a BuildScoreProvider:

```java
// The type of similarity score to use. JVector supports EUCLIDEAN (L2 distance), DOT_PRODUCT and COSINE.
VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.EUCLIDEAN;

// A simple score provider which can compute exact similarity scores by holding a reference to all the base vectors.
BuildScoreProvider bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, similarityFunction);
```

Let's also initialize the graph parameters. For now we won't worry about the exact function of each parameter, except to note that these are reasonable defaults. Refer to the DiskANN and HNSW papers for more details.

<!-- TODO describe graph parameters in a separate doc -->

```java
// Graph construction parameters
int M = 32; // maximum degree of each node
int efConstruction = 100; // search depth during construction
float neighborOverflow = 1.2f;
float alpha = 1.2f; // note: not the best setting for 3D vectors, but good in the general case
boolean addHierarchy = true; // use an HNSW-style hierarchy
boolean refineFinalGraph = true;
```

Now we can create the graph index:

```java
// Build the graph index using a Builder
// Remember to close the builder using builder.close() or a try-with-resources block
GraphIndexBuilder builder = new GraphIndexBuilder(bsp,
dimension,
M,
efConstruction,
neighborOverflow,
alpha,
addHierarchy,
refineFinalGraph);
ImmutableGraphIndex graph = builder.build(ravv);
```

> [!NOTE]
> You may notice that we supplied the same `ravv` to `builder.build`, even though we'd already passed it in while creating the `BuildScoreProvider`. This is necessary because, generally speaking, a `BuildScoreProvider` won't keep a reference to the actual base vectors; it just so happens that the "exact" score provider we're using does.

At this point, you have a complete graph index that resides in memory.

To perform a search operation, you need to first create a `GraphSearcher`.

> [!IMPORTANT]
> The graph index itself can be shared between threads, but `GraphSearcher`s maintain internal state and are therefore NOT thread-safe. To run concurrent searches across multiple threads, each thread should have its own `GraphSearcher`. The same searcher can be re-used across different queries in the same thread.

```java
// Remember to close the searcher using searcher.close() or a try-with-resources block
var searcher = new GraphSearcher(graph);
```

Generally speaking, you can't pass in a `VectorFloat<?>` directly to the `GraphSearcher`. You need to wrap the query vector with a `SearchScoreProvider`, similar in spirit to the `BuildScoreProvider` we created earlier.

```java
VectorFloat<?> queryVector = vts.createFloatVector(new float[]{0.2f, 0.3f, 0.4f}); // for example
// The in-memory graph index doesn't own the actual vectors used to construct it.
// To compute exact scores at search time, you need to pass in the base RAVV again,
// in addition to the actual query vector
SearchScoreProvider ssp = DefaultSearchScoreProvider.exact(queryVector, similarityFunction, ravv);
```

Now we can run a search:

```java
int topK = 10; // number of approximate nearest neighbors to fetch
// You can provide a filter to the query as a bit mask.
// In this case we want the actual topK neighbors without filtering,
// so we pass in a virtual bit mask representing all ones.
SearchResult result = searcher.search(ssp, topK, Bits.ALL);

for (NodeScore ns : result.getNodes()) {
int id = ns.node; // you can look up this ID in the RAVV
float score = ns.score; // the similarity score between this vector and the query vector (higher -> more similar)
System.out.println("ID: " + id + ", Score: " + score + ", Vector: " + ravv.getVector(id));
}
```

For the full example, refer to `jvector-examples/../VectorIntro.java`. Run it using:
```sh
mvn compile exec:exec@example -Dexample=intro -pl jvector-examples -am
```

Next steps:
- Understand index construction parameters
- Overquerying to improve search accuracy
- Quantization for space efficiency
- Building indexes for larger-than-memory datasets on disk
- VectorizationProviders