Description
Workload application for vector index
Inspired by https://ydb.tech/docs/en/recipes/ydb-cli/benchmarks
Commands and subcommands
ydb [global options...] workload vector [options...] <subcommand>
vector YDB Vector workload
├─ init Initializes a new data table.
├─ import Fill vectors and build a vector index.
├─ run Run workload operations
└─ clean Drop the table created in the initialization phase
import Import/generate some vectors and build a vector index.
├─ generator Generate random vectors and build a vector index.
└─ files Import vectors from files and build a vector index.
run Run workload operations
├─ upsert Insert or update vector rows in the table
└─ select Retrieve top-K vectors
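The command layout above can be sketched as a small Python helper that assembles the argument list for a script or test harness. This is only an illustration: the function name `build_vector_workload_command` is hypothetical, and it assumes that subcommand-specific parameters are placed after the subcommand, as in other `ydb workload` commands. Global options are passed through untouched because they are not listed here.

```python
from typing import Mapping, Optional, Sequence


def build_vector_workload_command(
    subcommands: Sequence[str],
    options: Optional[Mapping[str, object]] = None,
    global_options: Sequence[str] = (),
) -> list:
    """Assemble argv for `ydb [global options...] workload vector <subcommand> ...`.

    Assumption: subcommand parameters follow the subcommand path.
    A value of None marks a flag without an argument (for example --quiet).
    """
    argv = ["ydb", *global_options, "workload", "vector", *subcommands]
    for flag, value in (options or {}).items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    return argv


# Example: generate random vectors, then run the select workload quietly.
import_cmd = build_vector_workload_command(
    ["import", "generator"], {"--rows": 10000, "--dimension": 1024}
)
select_cmd = build_vector_workload_command(
    ["run", "select"], {"--limit": 5, "--quiet": None}
)
```

The resulting lists can be handed to `subprocess.run` as is, which avoids shell quoting issues.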
Internal table
CREATE TABLE `vector_index_workload` (
id uint64 not null,
prefix uint64 not null,
embedding string,
PRIMARY KEY (id)
);
Predefined search query table (--query-table)
CREATE TABLE `vector_index_sample` (
id uint64 not null,
prefix uint64 not null,
embedding string not null,
PRIMARY KEY (id)
);
Filling samples from a large table:
INSERT INTO vector_index_sample
SELECT id, prefix, embedding FROM large_table WHERE RandomNumber(id) < 878416384462360;
Where 878416384462360 should be replaced with 0xFFFFFFFFFFFFFFFF / <estimated row count> * <desired sample row count>.
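The threshold arithmetic can be sketched in Python as follows. The function name `sample_threshold` is hypothetical. The sketch multiplies before dividing and uses integer arithmetic, which gives the same value as the formula above up to rounding while avoiding floating-point precision loss on 64-bit values.

```python
UINT64_MAX = 0xFFFFFFFFFFFFFFFF  # largest value RandomNumber() is assumed to return


def sample_threshold(estimated_rows: int, sample_rows: int) -> int:
    """Return the constant to compare RandomNumber(id) against.

    A row is selected when its random 64-bit value is below the threshold,
    so the expected fraction of selected rows is sample_rows / estimated_rows.
    """
    if estimated_rows <= 0:
        raise ValueError("estimated_rows must be positive")
    if not 0 <= sample_rows <= estimated_rows:
        raise ValueError("sample_rows must be between 0 and estimated_rows")
    # Python integers are arbitrary precision, so the product cannot overflow.
    return UINT64_MAX * sample_rows // estimated_rows


threshold = sample_threshold(estimated_rows=21_000_000, sample_rows=1_000)
query = (
    "INSERT INTO vector_index_sample\n"
    "SELECT id, prefix, embedding FROM large_table "
    f"WHERE RandomNumber(id) < {threshold};"
)
```

Because the sampling is random, the actual number of selected rows will only be approximately the desired sample size.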
init parameters:
--min-partitions Table min partitions count (AUTO_PARTITIONING_MIN_PARTITIONS_COUNT). Default: 40.
--partition-size Table max partition size (AUTO_PARTITIONING_PARTITION_SIZE_MB). Default: 2000.
--auto-partition Table automatic partitioning by load (AUTO_PARTITIONING_BY_LOAD). Default: 1.
import common parameters:
--index Type of index. Possible values: 'None', 'KmeansTree'. Default: 'KmeansTree'.
--distance Distance/similarity function. Default: 'inner-product'.
--vector-type Type of vectors. Default: 'float'.
--dimension Vector dimension. Default: 1024.
--kmeans-tree-covering Index is covering. Default: 0.
--kmeans-tree-prefixed Index is prefixed. Default: 0.
--kmeans-tree-levels Number of tree levels. Kmeans-tree specific. Default: 1.
--kmeans-tree-clusters Number of clusters during init. Kmeans-tree specific. Default: 10.
--upload-threads Number of threads to generate tables content. Default: number of cores on the machine.
--bulk-size Number of rows in an import batch. Default: 100.
--max-in-flight The maximum number of data chunks that can be processed simultaneously. Default: 128.
import generator parameters:
--rows Number of rows added during init. Default: 10000.
--prefix-count Number of prefixes for prefix index. Default: 100.
import files parameters:
--input Path to the source data files. Required.
--format Source files format. One of 'csv|json|parquet|tsv'. Both unpacked and packed .gz files are supported. Required.
--embedding-column Name of a column with an embedding vector. Default: 'embedding'.
--state Path to the import state file. If the import is interrupted, it will resume from the same point when restarted. Default: ''.
--clear-state Clears the import state file and restarts the import from the beginning. Default: 1.
run parameters:
--seconds Seconds to run workload. Default: 10.
--threads Number of parallel threads in workload. Default: 10.
--quiet Quiet mode. Doesn't print statistics each second.
--print-timestamp Print timestamp each second with statistics.
--client-timeout Client timeout in ms. Default: 1000.
--operation-timeout Operation timeout in ms. Default: 800.
--cancel-after Cancel after timeout in ms. Default: 800.
--window Window duration in seconds. Default: 1.
--executer Query executer type (data or generic). Default: generic.
upsert parameters:
--bulk-size Number of rows in an upsert batch. Default: 100.
--prefixed Index is prefixed.
--prefix-count Number of pregenerated prefixes. Used only with a prefixed index. Default: 1000.
select parameters:
--table Table name. Default: "vector_index_workload".
--index Index name. Default: "index".
--query-table Name of the table with predefined search vectors. Default: "".
--targets Number of vectors to search as targets. Default: 100.
--limit Maximum number of vectors to return. Default: 5.
--kmeans-tree-clusters Maximum number of clusters to use during search. Default: 1.
--recall-threads Number of threads for concurrent queries during recall measurement. Default: 10.
--recall Measure recall metrics. It trains on the 'targets' vectors using brute-force search. Default: 0.
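For reference, recall against a brute-force baseline is commonly computed as in the sketch below. This is not the workload's actual implementation. The function names are hypothetical, the inner product is used as the similarity to match the default --distance, and `approx_ids` stands in for whatever the index query returns.

```python
from typing import Dict, List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(
    vectors: Dict[int, Sequence[float]], target: Sequence[float], k: int
) -> List[int]:
    """Exact top-k ids by inner product, found by scanning every vector."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: inner_product(item[1], target),
        reverse=True,  # a larger inner product means a closer match
    )
    return [vector_id for vector_id, _ in ranked[:k]]


def recall(exact_ids: Sequence[int], approx_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k ids that the index query also returned."""
    if not exact_ids:
        return 1.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


# Toy data: three 2-dimensional vectors and one target.
vectors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.9, 0.1]}
exact = brute_force_top_k(vectors, target=[1.0, 0.0], k=2)
approx_ids = [1, 2]  # pretend the index returned these ids
score = recall(exact, approx_ids)
```

Averaging this value over all target vectors gives a single recall figure for a given --limit and --kmeans-tree-clusters setting.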