Gengo is a high-performance command-line tool written in Go for generating TPC-DS benchmark datasets - the industry standard for testing data warehousing and analytics systems. It generates complete star schemas with realistic business data at scale, perfect for benchmarking database performance, testing BI tools, and validating data pipelines.
TPC-DS is one of the most widely used benchmarks for data warehousing systems, and Gengo implements it with production-quality data modeling and optimizations for enterprise-scale datasets.
- Ultra-Fast: Leverages Go's concurrency and low-level optimizations for fast data generation, achieving up to 17x speedups over previous versions for relational models.
- Relational Model: Generates predefined 3NF data models for:
  - E-commerce TPC-DS: Complete TPC-DS benchmark with 17 dimension and 7 fact tables (Store/Web/Catalog sales, returns, inventory)
  - Financial: `dim_companies`, `dim_exchanges`, `fact_daily_stock_prices`
  - Medical: `dim_patients`, `dim_doctors`, `dim_clinics`, `fact_appointments`
- Multiple Formats: Output data as CSV, JSON Lines (one JSON object per line), or efficient Apache Parquet.
- Realistic Facts: Uses weighted sampling for selecting customers and products when generating orders, simulating more realistic purchasing patterns (e.g., some customers/products appear more frequently).
- Compressed Parquet: Generates compressed Parquet files (Snappy by default) for smaller disk usage (one file per table).
- Size-Based Input: Tell Gengo the approximate target size in GB for the dataset, and it estimates the required row counts for dimensions and facts.
- Simple Usage: Interactive command-line prompts guide you through the setup.
- Customizable Code: Easily tweak the data generation logic, schema structs, or data realism features within the Go code (uses `brianvoe/gofakeit` and other standard libraries).
- Clone the repo:
  `git clone https://github.com/peekknuf/Gengo.git # Or your repo URL`
- Navigate into the directory:
  `cd Gengo`
- Ensure dependencies are downloaded:
  `go mod tidy`
- Build the binary:
  `go build`
Simply run the compiled binary with the `gen` command:

`./Gengo gen`

Gengo will then prompt you interactively:
- Enter the data model to generate: Type `ecommerce-ds` (for the TPC-DS benchmark), `ecommerce`, `financial`, or `medical`. Gengo can handle common misspellings and abbreviations (e.g., `eds`, `ecom`, `fin`, `med`).
- Enter the approximate target size in GB: (e.g., 0.5, 10, 50). Gengo will display the estimated row counts for each table based on this.
- Enter the desired output format: Type `csv`, `json`, or `parquet`.
- Enter the output directory name: This directory will be created if it doesn't exist, and all generated table files (e.g., `dim_customers.parquet`, `fact_orders.parquet`) will be saved inside it.
Gengo will then get to work, showing progress and timing information when complete.
The TPC-DS (Transaction Processing Performance Council Decision Support) benchmark is the industry standard for data warehousing performance testing. Gengo implements a complete TPC-DS schema with realistic business data modeling.
```shell
./Gengo gen
# Enter: eds (or ecommerce-ds)
# Enter: 10 (for 10GB)
# Enter: csv
# Enter: output_directory
```

This generates a comprehensive dataset with:
📊 Fact Tables (57.7M total rows):
- Store Sales: 31.7M rows (55% of sales)
- Web Sales: 17.3M rows (30% of sales)
- Catalog Sales: 8.6M rows (15% of sales)
- Store Returns: 2.5M rows (8% return rate)
- Web Returns: 2.1M rows (12% return rate)
- Catalog Returns: 865K rows (10% return rate)
- Inventory: 356M rows (weekly snapshots)
🏢 Dimension Tables (17 tables):
- 2.3M Customers with 5.1M Addresses
- 481K Items with 80K Promotions
- 72 Stores, 57 Warehouses, 38 Call Centers
- Rich demographics and geographic data
Performance Metrics:
- 25 orders per customer annually
- 9.5% overall return rate
- Realistic sales channel distribution
- Complete foreign key relationships
Gengo can generate terabyte-scale TPC-DS datasets efficiently:
- 1TB dataset: ~73 minutes on 4 cores, 6-9 minutes on 64 cores
- 10TB dataset: ~12 hours on 4 cores, 1-1.5 hours on 64 cores
- Optimized for multi-core scaling with high-performance NVMe storage
Want different fake data or schema modifications?
- Schema: Modify the Go structs in `internal/models/ecommerce/ecommerce.go`, `internal/models/financial/financial.go`, and `internal/models/medical/medical.go`. Remember to update struct tags (`json`, `parquet`) accordingly.
- Dimension Data: Change the `gofakeit` functions or logic used within the `Generate*` functions in `internal/simulation/ecommerce/simulate_dims.go`, `internal/simulation/financial/simulate_financial_dims.go`, and `internal/simulation/medical/simulate_medical_dims.go`.
- Fact Data & Realism: Adjust the generation logic (e.g., distributions, static lists), foreign key selection (including weighted sampling), or calculation logic within the `Generate*ModelData` functions in `internal/simulation/ecommerce/simulate_facts.go`, `internal/simulation/financial/simulate_financial_facts.go`, and `internal/simulation/medical/simulate_medical_facts.go`.
- Sizing Ratios: Modify the constants in `internal/core/sizing.go` to change the relative sizes of the generated tables.
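To illustrate how size-based row estimation can work, here is a minimal sketch; the constants and function name are illustrative assumptions, not the actual contents of `internal/core/sizing.go`:

```go
package main

import "fmt"

// Illustrative ratios only -- the real values live in internal/core/sizing.go.
const (
	bytesPerFactRow = 100 // assumed average encoded size of a fact row
	factRowsPerDim  = 25  // assumed fact-to-dimension row ratio
)

// estimateRows derives approximate row counts from a target dataset size.
func estimateRows(targetGB float64) (factRows, dimRows int64) {
	targetBytes := targetGB * 1e9
	factRows = int64(targetBytes / bytesPerFactRow)
	dimRows = factRows / factRowsPerDim // dimensions scale off the facts
	return
}

func main() {
	f, d := estimateRows(10)
	fmt.Println(f, d) // 100000000 4000000
}
```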
For those interested in the technical underpinnings, Gengo's performance and design are rooted in the following key implementation choices:
Gengo implements several sophisticated optimizations to achieve ultra-fast data generation, particularly for large fact tables:
- Modified `weightedSampler.Sample()` to accept an RNG parameter instead of using the global `rand.Float64()`
- Each worker goroutine gets its own `*rand.Rand` instance with unique seed generation
- Eliminates contention on the global random number generator lock
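A minimal sketch of the per-worker RNG pattern (the function and seeding scheme here are illustrative, not Gengo's actual code):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// parallelSums gives each worker its own *rand.Rand with a unique seed,
// so no goroutine contends on the global generator's internal lock.
func parallelSums(workers, samples int) []float64 {
	results := make([]float64, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Unique seed per worker: wall clock mixed with the worker id.
			rng := rand.New(rand.NewSource(time.Now().UnixNano() ^ int64(id)<<32))
			for i := 0; i < samples; i++ {
				results[id] += rng.Float64() // goroutine-local, lock-free
			}
		}(w)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(parallelSums(4, 1000))
}
```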
- Replaced struct-based channels with byte chunk channels for fact tables
- Worker goroutines format CSV rows directly into byte slices using `strconv.AppendInt`/`AppendFloat`
- A single dedicated writer goroutine performs simple `Write()` operations
- Eliminated `encoding/csv` usage for fact tables (kept for dimensions with string fields)
- Buffer size increased from 64KB to 16MB for better throughput
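The direct byte formatting can be sketched like this (`appendRow` and its columns are hypothetical, not Gengo's actual row layout):

```go
package main

import (
	"fmt"
	"strconv"
)

// appendRow formats one fact row straight into a byte slice, avoiding
// encoding/csv and intermediate string allocations.
func appendRow(buf []byte, orderID, qty int64, price float64) []byte {
	buf = strconv.AppendInt(buf, orderID, 10)
	buf = append(buf, ',')
	buf = strconv.AppendInt(buf, qty, 10)
	buf = append(buf, ',')
	buf = strconv.AppendFloat(buf, price, 'f', 2, 64)
	buf = append(buf, '\n')
	return buf
}

func main() {
	buf := make([]byte, 0, 64)
	buf = appendRow(buf, 42, 3, 19.99)
	fmt.Print(string(buf)) // 42,3,19.99
}
```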
- Replaced expensive `time.Format(time.RFC3339)` calls with raw epoch seconds
- Updated the header from `order_timestamp` to `order_timestamp_unix`
- Eliminates costly timestamp formatting in the hot path (called 1.45M times)
- Implemented an `idBlock` type with a `nextID()` method for managing ID blocks
- Block size of 100,000 IDs to reduce atomic contention
- Replaced per-item atomic operations with per-block atomic operations
- Contends on the atomic counter only once per 100k items instead of once per item
- Replaced `map[int]ProductDetails` with a `[]ProductDetails` slice
- Eliminated hash lookups in the inner loop with direct array access
- Product details are accessed via `productDetails[productID]` with zero allocations
- Increased worker buffer sizes from 1MB to 16MB
- Reduced number of channel operations and buffer reallocations
- Better matching of buffer size to file writer buffer size (16MB)
- Concurrent Data Generation: Leveraging Go's lightweight goroutines and channels, data generation for large fact tables (e.g., `fact_daily_stock_prices`, `fact_order_items`) is parallelized across available CPU cores. This employs a producer-consumer pattern where worker goroutines generate data chunks, which are then aggregated and written.
- Atomic ID Management: Unique primary keys (e.g., `OrderItemID`, `AppointmentID`) are managed across concurrent generation streams using `sync/atomic` operations, ensuring correctness without performance bottlenecks from locks.
- In-Memory Aggregation: For optimal write performance, data for each table is generated and aggregated in memory before being written to disk in a single operation. Go's efficient garbage collector handles the memory management for these large in-memory structures.
- Dynamic Sizing Heuristics: The `internal/core/sizing.go` package dynamically estimates row counts for all tables based on a target GB size, using empirically derived ratios and logical dependencies between dimensions and facts.
- Efficient File Formats: Integration with Apache Arrow and Parquet (via `apache/arrow/go`) enables highly efficient, columnar storage for generated data, reducing disk footprint and improving read performance for downstream systems.
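The producer-consumer pattern behind the concurrent fact generation can be sketched as follows (function names and chunk contents are illustrative, not Gengo's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// generate runs worker goroutines that each produce a chunk of "rows" onto
// a channel, while a single consumer drains the channel and aggregates --
// in Gengo's case, the consumer would write chunks to disk.
func generate(workers, rowsPerWorker int) int {
	chunks := make(chan []int, workers)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			chunk := make([]int, 0, rowsPerWorker)
			for i := 0; i < rowsPerWorker; i++ {
				chunk = append(chunk, id*rowsPerWorker+i) // stand-in "row"
			}
			chunks <- chunk // hand the whole chunk to the single consumer
		}(w)
	}
	go func() { wg.Wait(); close(chunks) }()

	total := 0
	for chunk := range chunks { // single consumer: aggregate and "write"
		total += len(chunk)
	}
	return total
}

func main() {
	fmt.Println(generate(4, 1000)) // 4000
}
```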
Based on actual generation runs with Gengo:
| Dataset | Size | Generation Time | Throughput | Row Rate |
|---|---|---|---|---|
| TPC-DS | 10GB | 42 seconds | 238 MB/s | 10M rows/sec |
| TPC-DS | 1GB | ~4 seconds | ~250 MB/s | ~10M rows/sec |
Throughput Analysis:
- 10GB TPC-DS dataset: 10GB ÷ 42s = 238 MB/s sustained write speed
- 422M total rows: 422,000,000 ÷ 42s = 10 million rows/second
- 24 files generated concurrently: Multi-file parallel output
| Tool | Dataset Size | Throughput | Notes |
|---|---|---|---|
| Gengo | 10GB TPC-DS | 238 MB/s | Go, optimized for relational data |
| Python Faker | 1GB simple CSV | ~10-50 MB/s | Single-threaded, interpreted overhead |
| Mockaroo | 1GB generated | ~20-80 MB/s | Web service, network limited |
| SQL Data Generator | 1GB relational | ~50-100 MB/s | Database overhead, logging |
Key Advantages:
- 10-20x faster than Python Faker for relational datasets
- 3-5x faster than web-based generators (no network latency)
- 2-4x faster than SQL-based generators (no database overhead)
- True relational integrity: Foreign keys, realistic distributions, 3NF normalized
Apple Silicon (M5 Mac) Performance:
| Dataset | Size | Generation Time | Throughput | Row Rate |
|---|---|---|---|---|
| TPC-DS | 10GB | 8 seconds | 1.25 GB/s | 52M rows/sec |
| TPC-DS | 100GB | 3m 24s | 0.53 GB/s | 22M rows/sec |
Scalability Notes:
- Excellent parallel scaling with consistent worker completion times
- Intelligent skipping of prohibitively large tables (11B+ row Inventory table)
- Minor bottleneck in sequential dimension generation for largest tables
- 100GB dataset generates ~109GB actual output across 42 files
Test Environment:
- CPU: Multi-core processor (4+ cores)
- Storage: NVMe SSD (recommended for optimal performance)
- Memory: 16GB+ RAM
- OS: Linux/Windows/macOS
Scaling Characteristics:
- Performance scales linearly with CPU cores
- NVMe storage recommended for sustained 200+ MB/s write speeds
- Memory usage: ~2GB per concurrent worker
Happy generating and playing around with the data!