Vector index workload #17795

@azevaykin

Description

Workload application for vector index

Inspired by https://ydb.tech/docs/en/recipes/ydb-cli/benchmarks

Commands and subcommands

ydb [global options...] workload vector [options...] <subcommand>

vector                    YDB Vector workload
├─ init                     Initializes a new data table.
├─ import                   Fill vectors and build a vector index.
├─ run                      Run workload operations
└─ clean                    Drop the table created in the initialization phase

import                    Import/generate some vectors and build a vector index.
├─ generator                Generate random vectors and build a vector index.
└─ files                    Import vectors from files and build a vector index.

run                       Run workload operations
├─ upsert                   Insert or update vector rows in the table
└─ select                   Retrieve top-K vectors

Internal table

CREATE TABLE `vector_index_workload` (
    id uint64 not null,
    prefix uint64 not null,
    embedding string,
    PRIMARY KEY (id)
);

Predefined search query table (--query-table)

CREATE TABLE `vector_index_sample` (
    id uint64 not null,
    prefix uint64 not null,
    embedding string not null,
    PRIMARY KEY (id)
);

Filling samples from a large table:

INSERT INTO vector_index_sample
SELECT id, prefix, embedding FROM large_table WHERE RandomNumber(id) < 878416384462360;

Where 878416384462360 should be replaced with 0xFFFFFFFFFFFFFFFF / <estimated row count> * <desired sample row count>.
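The threshold arithmetic above can be sketched in Python. This is only an illustration, assuming RandomNumber is uniform over the Uint64 range, as the formula implies. The function name sample_threshold is hypothetical and not part of the workload. The multiplication is done before the division so that no precision is lost, and Python integers do not overflow.

```python
UINT64_MAX = 0xFFFFFFFFFFFFFFFF


def sample_threshold(estimated_rows: int, sample_rows: int) -> int:
    """Return the value to compare RandomNumber(id) against.

    Hypothetical helper: computes
    0xFFFFFFFFFFFFFFFF / <estimated row count> * <desired sample row count>
    with integer arithmetic, rounded to the nearest integer.
    """
    if estimated_rows <= 0:
        raise ValueError("estimated_rows must be positive")
    if not 0 <= sample_rows <= estimated_rows:
        raise ValueError("sample_rows must be between 0 and estimated_rows")
    # Multiply first, then divide; adding half the divisor rounds to nearest.
    return (UINT64_MAX * sample_rows + estimated_rows // 2) // estimated_rows


if __name__ == "__main__":
    # A 1:21000 sampling ratio reproduces the constant used in the query above.
    print(sample_threshold(21000, 1))
```

Since each row is kept independently with probability sample_rows / estimated_rows, the actual sample size will only be approximately the desired one.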

init parameters:

--min-partitions           Table min partitions count (AUTO_PARTITIONING_MIN_PARTITIONS_COUNT). Default: 40.
--partition-size           Table max partition size (AUTO_PARTITIONING_PARTITION_SIZE_MB). Default: 2000.
--auto-partition           Table automatic partitioning by load (AUTO_PARTITIONING_BY_LOAD). Default: 1.

import common parameters:

--index                    Type of index. Possible values: 'None', 'KmeansTree'. Default: 'KmeansTree'.
--distance                 Distance/similarity function. Default: 'inner-product'.
--vector-type              Type of vectors. Default: 'float'.
--dimension                Vector dimension. Default: 1024.

--kmeans-tree-covering     Index is covering. Default: 0.
--kmeans-tree-prefixed     Index is prefixed. Default: 0.
--kmeans-tree-levels       Number of tree levels. Kmeans-tree specific. Default: 1.
--kmeans-tree-clusters     Number of clusters during init. Kmeans-tree specific. Default: 10.

--upload-threads           Number of threads to generate table content. Default: number of cores on the machine.
--bulk-size                Number of rows in an import batch. Default: 100.
--max-in-flight            The maximum number of data chunks that can be processed simultaneously. Default: 128.
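One way to picture how --upload-threads, --bulk-size and --max-in-flight could interact is the sketch below. It is an assumption about the general pattern (fixed-size batches, a worker pool, and a cap on outstanding batches), not the workload's actual implementation. The names upload_all and send_batch are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def batches(rows, bulk_size):
    """Yield consecutive slices of at most bulk_size rows."""
    for start in range(0, len(rows), bulk_size):
        yield rows[start:start + bulk_size]


def upload_all(rows, send_batch, bulk_size=100, max_in_flight=128, upload_threads=4):
    """Send rows in batches, never keeping more than max_in_flight batches pending.

    send_batch is a caller-supplied function (hypothetical) that uploads one batch.
    Results are returned in submission order.
    """
    slots = threading.BoundedSemaphore(max_in_flight)

    def task(batch):
        try:
            return send_batch(batch)
        finally:
            slots.release()  # free a slot whether the upload succeeded or failed

    futures = []
    with ThreadPoolExecutor(max_workers=upload_threads) as pool:
        for batch in batches(rows, bulk_size):
            slots.acquire()  # blocks the producer while the in-flight limit is reached
            futures.append(pool.submit(task, batch))
    return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in for a real upload: just report the size of each batch.
    print(upload_all(list(range(250)), len, bulk_size=100))
```

The semaphore bounds memory use by stopping batch production when too many batches are queued or running, independently of how many threads do the sending.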

import generator parameters:

--rows                     Number of rows added during init. Default: 10000.
--prefix-count             Number of prefixes for a prefixed index. Default: 100.

import files parameters:

--input                    Path to the source data files. Required.
--format                   Source files format. One of 'csv|json|parquet|tsv'. Both unpacked and packed .gz files are supported. Required.
--embedding-column         Name of a column with an embedding vector. Default: 'embedding'.
--state                    Path to the import state file. If the import is interrupted, it will resume from the same point when restarted. Default: ''.  
--clear-state              Clears the import state file and restarts the import from the beginning. Default: 1.

run parameters:

--seconds                  Seconds to run workload. Default: 10.
--threads                  Number of parallel threads in workload. Default: 10.
--quiet                    Quiet mode. Doesn't print statistics each second.
--print-timestamp          Print timestamp each second with statistics.
--client-timeout           Client timeout in ms. Default: 1000.
--operation-timeout        Operation timeout in ms. Default: 800.
--cancel-after             Cancel after timeout in ms. Default: 800.
--window                   Window duration in seconds. Default: 1.
--executer                 Query executer type ('data' or 'generic'). Default: 'generic'.

upsert parameters:

--bulk-size                Number of rows in an upsert batch. Default: 100.
--prefixed                 Index is prefixed.
--prefix-count             Number of pregenerated prefixes. Used only with a prefixed index. Default: 1000.

select parameters:

--table                    Table name. Default: "vector_index_workload".
--index                    Index name. Default: "index".
--query-table              Name of the table with predefined search vectors. Default: "".
--targets                  Number of vectors to search as targets. Default: 100.
--limit                    Maximum number of vectors to return. Default: 5.
--kmeans-tree-clusters     Maximum number of clusters to use during search. Default: 1.
--recall-threads           Number of threads for concurrent queries during recall measurement. Default: 10.
--recall                   Measure recall metrics. It trains on the 'targets' vectors using brute-force search. Default: 0.
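The idea behind a recall measurement can be sketched as follows: for each target vector, an exact top-K is computed by brute force and compared with what the index returned. This is a minimal illustration, assuming the 'inner-product' similarity and the common definition of recall as the share of exact neighbours found. The formula the workload actually uses is not stated here, and all function names below are hypothetical.

```python
def inner_product(a, b):
    """Inner-product similarity of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(rows, target, limit):
    """Return the ids of the `limit` rows most similar to `target`.

    rows maps a row id to its embedding (a list of floats).
    """
    ranked = sorted(
        rows,
        key=lambda row_id: inner_product(rows[row_id], target),
        reverse=True,
    )
    return ranked[:limit]


def recall(index_ids, exact_ids):
    """Fraction of the exact top-K ids that the index search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(index_ids)) / len(exact)


if __name__ == "__main__":
    rows = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.5, 0.5]}
    target = [1.0, 0.0]
    exact = brute_force_top_k(rows, target, limit=2)
    approx = [1, 4]  # made-up result of an approximate index search
    print(exact, recall(approx, exact))
```

Averaging this value over all 'targets' vectors gives a single recall figure for a given --limit and --kmeans-tree-clusters setting.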
