Unified C++17 generators for TPC-H, TPC-DS, and SSB. The C++ API streams Arrow
RecordBatch data, while the benchgen CLI emits the canonical pipe-delimited
.tbl/.dat outputs. Select the suite with --benchmark.
- CMake 3.16+
- C++17 compiler
- Apache Arrow C++ 19.0.1+
- Python 3 (required to embed distribution files during the build)
- GoogleTest 1.10.0+ (only if tests are enabled)
cmake -S . -B build
cmake --build build

cmake -S . -B build -DBENCHGEN_ENABLE_TESTS=ON
cmake --build build
ctest --test-dir build

If Arrow or GTest is found under the given prefix, it is used; otherwise the build falls back to fetching it.
cmake -S . -B build \
-DBENCHGEN_ARROW_PREFIX=/path/to/arrow \
-DBENCHGEN_ENABLE_TESTS=ON \
-DBENCHGEN_GTEST_PREFIX=/path/to/gtest
cmake --build build

- -DBENCHGEN_STATIC_STDLIB=ON: statically link the C++ standard library.
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=/path/to/install
cmake --build build
cmake --build build --target install

- --benchmark, -b: tpch, tpcds, or ssb
- --table, -t: table name (see suite-specific lists in include/benchgen/table.h)
- --scale, --scale-factor, -s: scale factor (default: 1)
- --chunk-size: rows per RecordBatch (default: 10000)
- --start-row: 0-based row offset (default: 0)
- --row-count: number of rows to emit (default: -1 = to end)
- --output, -o: output path (required for TPC-DS; optional for others)
- --dbgen-seed-mode: TPCH/SSB seed init (per-table default; all-tables matches dbgen -T a)
- --parallel: worker thread count (default: 1; requires --output when parallel generation applies, emits --output-prefixed parts, and falls back to serial if total rows unknown)
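The interaction between --start-row and --row-count (with -1 meaning "to end") can be illustrated with a small standalone sketch. This is not benchgen's implementation: RowRange and ResolveRowRange are hypothetical names, and both the clamping of an oversized --row-count and the rejection of an out-of-range --start-row are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper type; not part of the benchgen API.
struct RowRange {
  int64_t start_row;  // 0-based offset of the first row to emit
  int64_t row_count;  // number of rows to emit (>= 0 once resolved)
};

// Resolves a (start_row, row_count) request against a known table size.
// A negative row_count means "to end", mirroring the documented default of -1.
// Assumption: a row_count that runs past the end is clamped, and an
// out-of-range start_row yields std::nullopt.
std::optional<RowRange> ResolveRowRange(int64_t total_rows, int64_t start_row,
                                        int64_t row_count) {
  if (total_rows < 0 || start_row < 0 || start_row > total_rows) {
    return std::nullopt;
  }
  const int64_t remaining = total_rows - start_row;
  const int64_t resolved =
      (row_count < 0 || row_count > remaining) ? remaining : row_count;
  // Invariant: the resolved range lies entirely inside the table.
  assert(resolved >= 0 && start_row + resolved <= total_rows);
  return RowRange{start_row, resolved};
}
```

Under these assumptions, a request with a start row of 100000 and a row count of -1 against a 150000-row table resolves to the last 50000 rows.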
./build/src/benchgen --benchmark tpch \
--table customer \
--scale 1 \
--output build/customer.tbl \
--chunk-size 10000

./build/src/benchgen --benchmark tpcds \
--table customer \
--scale 1 \
--output build/customer.dat \
--chunk-size 10000

./build/src/benchgen --benchmark ssb \
--table customer \
--scale-factor 1 \
--output build/customer.tbl

Use --dbgen-seed-mode all-tables to match dbgen -T a output.
./build/src/benchgen --benchmark tpch \
--table lineitem \
--scale 10 \
--start-row 500000 \
--row-count 200000 \
--output build/lineitem_500k.tbl

./build/src/gen_schema --benchmark tpcds \
--output build/tpcds_schema.json

Use --benchmark tpch or --benchmark ssb to generate schemas for other suites.
gen_schema also accepts --scale/--scale-factor and --dbgen-seed-mode.
Public headers live in include/benchgen. The core entry points are
MakeBenchmarkSuite and MakeRecordBatchIterator, driven by
GeneratorOptions (scale, row ranges, chunk size, column projection, and SSB
seed mode).
#include <memory>  // std::unique_ptr

#include "benchgen/generator_options.h"
#include "benchgen/record_batch_iterator_factory.h"
benchgen::GeneratorOptions options;
options.scale_factor = 1;
options.start_row = 0;
options.row_count = 100000;
options.column_names = {"c_custkey", "c_name"};
std::unique_ptr<benchgen::RecordBatchIterator> iter;
auto status = benchgen::MakeRecordBatchIterator(
benchgen::SuiteId::kTpch, "customer", options, &iter);
if (!status.ok()) {
// Handle error.
}

- include/benchgen/: public API headers (suite interfaces, generator options)
- src/tpch/, src/tpcds/, src/ssb/: benchmark implementations
- src/common/: shared CLI entrypoints
- src/util/: shared utilities
- resources/<benchmark>/distribution/: distribution files
- scripts/: helper scripts (scripts/<benchmark>/ for benchmark-specific helpers)
- tests/<benchmark>/: gtest-based verification
- Distributions are embedded into binaries at build time from
  resources/<benchmark>/distribution.
- Deterministic output allows chunked workflows via disjoint
  --start-row/--row-count ranges.
- --parallel splits the resolved row range across worker threads when total
  row counts are known, writing -0, -1, ... part files based on the --output
  prefix. TPC-H lineitem and SSB lineorder infer totals from dbgen scale
  1/5/10 anchors so --parallel applies to them as well.
- Column projection is supported in the C++ API via
  GeneratorOptions::column_names.
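
The chunked and parallel workflows above both rest on cutting one row range into disjoint, contiguous pieces. The sketch below shows one plausible way to do that. RowSlice and SplitRowRange are hypothetical names, and the even-split policy is an assumption; how benchgen actually partitions rows across workers is not documented here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type; not part of the benchgen API.
struct RowSlice {
  int64_t start_row;  // 0-based offset of the slice's first row
  int64_t row_count;  // number of rows in the slice
};

// Splits [start_row, start_row + row_count) into at most `workers` disjoint,
// contiguous slices whose sizes differ by at most one row.
// Returns an empty vector for an empty range or a non-positive worker count.
std::vector<RowSlice> SplitRowRange(int64_t start_row, int64_t row_count,
                                    int workers) {
  std::vector<RowSlice> slices;
  if (row_count <= 0 || workers <= 0) {
    return slices;
  }
  const int64_t base = row_count / workers;   // rows every slice receives
  const int64_t extra = row_count % workers;  // first `extra` slices get one more
  int64_t next = start_row;
  for (int i = 0; i < workers; ++i) {
    const int64_t len = base + (i < extra ? 1 : 0);
    if (len == 0) {
      continue;  // fewer rows than workers: skip empty slices
    }
    slices.push_back({next, len});
    next += len;
  }
  // Invariant: the slices cover the requested range exactly once.
  assert(next == start_row + row_count);
  return slices;
}
```

Each resulting slice corresponds to one --start-row/--row-count pair, so the same split could drive separate benchgen invocations; because output is deterministic, concatenating the pieces in order should reproduce the single-run output for that range.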
See NOTICE.