A comprehensive benchmark suite for collecting GPU performance data across multiple NVIDIA GPU architectures. This repository contains CUDA kernels, calibration tools, and data collection pipelines to generate detailed performance datasets.
This project systematically collects GPU performance metrics by running 16 different CUDA kernel benchmarks across multiple problem sizes on various NVIDIA GPUs. The generated datasets include execution times, hardware specifications, and performance characteristics.
gpu-perf/
├── bin/ # Compiled CUDA executables
├── calibration/ # GPU calibration source code
│ ├── props.cu # Device properties extraction
│ ├── stream_like.cu # Memory bandwidth calibration
│ └── gemm_cublas.cu # Compute throughput calibration
├── data/ # Generated datasets and calibration outputs
├── kernels/ # CUDA kernel implementations (16 kernels)
├── runner/ # Main benchmark orchestrator
│ └── main.cu # Kernel dispatcher and timing harness
└── scripts/ # Data collection pipeline scripts
├── generate_dataset_full.sh # Main pipeline orchestrator
├── gen_trials_2080ti.sh # RTX 2080 Ti trial generation
├── gen_trials_4070.sh # RTX 4070 trial generation
├── gen_trials_titanv.sh # Titan V trial generation
├── gen_trials_titanx.sh # Titan X trial generation
├── run_trials.sh # Individual kernel execution
└── build_final_dataset.py # Data aggregation and processing
Performance data is collected on four NVIDIA GPUs:

- RTX 2080 Ti (Turing, sm_75)
- RTX 4070 (Ada Lovelace, sm_89)
- Titan V (Volta, sm_70)
- Titan X (Pascal, sm_61)
The 16 benchmark kernels are listed below; CUDA sketches of the two simplest follow the list.

- vector_add: Simple element-wise addition (coalesced memory access)
- saxpy: Scaled vector addition (2 FLOPs per element)
- strided_copy_8: Strided memory access pattern (stride=8)
- random_access: Random gather operations
- naive_transpose: Matrix transpose with uncoalesced access
- shared_transpose: Optimized transpose using shared memory
- reduce_sum: Parallel reduction with shared memory
- dot_product: Vector dot product with reduction
- matmul_naive: Basic matrix multiplication
- matmul_tiled: Tiled matrix multiplication with shared memory
- conv2d_3x3: 2D convolution with 3x3 kernel
- conv2d_7x7: 2D convolution with 7x7 kernel
- histogram: Atomic operations for histogram computation
- atomic_hotspot: Extreme atomic contention benchmark
- vector_add_divergent: Branch divergence test
- shared_bank_conflict: Shared memory bank conflict test
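For orientation, here are minimal sketches of the two simplest kernels (illustrative only; the actual implementations in kernels/ may differ in naming, bounds handling, and launch configuration):

```cuda
// Illustrative sketches of the two simplest kernels (not the exact code in kernels/).
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) c[i] = a[i] + b[i];                  // coalesced loads and store
}

__global__ void saxpy(float alpha, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];          // 2 FLOPs (mul + add) per element
}
```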
Before collecting data, calibrate the GPU to determine hardware limits:
cd gpu-perf/calibration
nvcc -O3 -o ../bin/props props.cu && ../bin/props > ../data/props_2080ti.out
nvcc -O3 -o ../bin/stream_like stream_like.cu && ../bin/stream_like > ../data/stream_like_2080ti.out
nvcc -O3 -lcublas -o ../bin/gemm_cublas gemm_cublas.cu && ../bin/gemm_cublas > ../data/gemm_cublas_2080ti.out

Calibration outputs:
- props_*.out: Device specifications (SM count, memory, cache sizes)
- stream_like_*.out: Sustained memory bandwidth (GB/s)
- gemm_cublas_*.out: Sustained compute throughput (GFLOPS)
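props.cu boils down to a cudaGetDeviceProperties dump; a minimal sketch is shown below (the field selection and output format here are assumptions, and the real tool may report more):

```cuda
// Minimal device-properties dump in the spirit of props.cu (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);                           // query device 0
    printf("name=%s\n", p.name);
    printf("compute_capability=%d.%d\n", p.major, p.minor);
    printf("sms=%d\n", p.multiProcessorCount);
    printf("max_threads_per_sm=%d\n", p.maxThreadsPerMultiProcessor);
    printf("regs_per_sm=%d\n", p.regsPerMultiprocessor);
    printf("shared_mem_per_sm=%zu\n", p.sharedMemPerMultiprocessor);
    printf("l2_bytes=%d\n", p.l2CacheSize);
    printf("warp_size=%d\n", p.warpSize);
    return 0;
}
```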
For any GPU (RTX 2080 Ti, RTX 4070, Titan V, or Titan X), data collection is a two-step process: generate the per-kernel trial CSVs with the GPU-specific script, then aggregate them with build_final_dataset.py.
For RTX 2080 Ti:
cd gpu-perf
./scripts/gen_trials_2080ti.sh
python3 scripts/build_final_dataset.py \
"data/trials_*__2080ti.csv" \
data/props_2080ti.out \
data/stream_like_2080ti.out \
data/gemm_cublas_2080ti.out \
data/runs_2080ti_final.csv

For RTX 4070:
cd gpu-perf
./scripts/gen_trials_4070.sh
python3 scripts/build_final_dataset.py \
"data/trials_*__4070.csv" \
data/props_4070.out \
data/stream_like_4070.out \
data/gemm_cublas_4070.out \
data/runs_4070_final.csv

For Titan V:
cd gpu-perf
./scripts/gen_trials_titanv.sh
python3 scripts/build_final_dataset.py \
"data/trials_*__titanv.csv" \
data/props_titanv.out \
data/stream_like_titanv.out \
data/gemm_cublas_titanv.out \
data/runs_titanv_final.csv

For Titan X:
cd gpu-perf
./scripts/gen_trials_titanx.sh
python3 scripts/build_final_dataset.py \
"data/trials_*__titanx.csv" \
data/props_titanx.out \
data/stream_like_titanx.out \
data/gemm_cublas_titanx.out \
data/runs_titanx_final.csv

Each pipeline will:
- Build the CUDA runner binary for the specific GPU architecture
- Execute 16 kernels across roughly 4 problem sizes each (65+ configurations in total)
- Run 10 trials per configuration with warmup (see the timing sketch below)
- Aggregate results and calculate all metrics
- Output: data/runs_<gpu>_final.csv
Important: You can only generate data for one GPU at a time since the pipeline uses the GPU currently available in your system.
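Timing follows the usual CUDA-event pattern of warmup launches followed by repeated timed launches. The sketch below is a simplification (the function name and default warmup/trial counts are illustrative; runner/main.cu is the authoritative harness and also records per-trial statistics):

```cuda
// Simplified warmup-plus-trials timing pattern (illustrative; the real harness
// in runner/main.cu dispatches all 16 kernels and writes CSV rows).
float time_kernel_ms(void (*launch)(), int warmup = 3, int trials = 10) {
    for (int i = 0; i < warmup; ++i) launch();      // warmup to reach stable clocks
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float total_ms = 0.0f;
    for (int t = 0; t < trials; ++t) {
        cudaEventRecord(start);
        launch();                                   // one timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / trials;                       // mean_ms over the trials
}
```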
generate_dataset_full.sh orchestrates:
gen_trials_2080ti.sh
├── Builds CUDA runner (nvcc with sm_75)
├── Runs run_trials.sh 65+ times (different kernels/sizes)
└── Outputs: data/trials_*__2080ti.csv
build_final_dataset.py
├── Aggregates all trial files
├── Calculates kernel metrics
├── Adds GPU hardware specs
└── Outputs: data/runs_2080ti_final.csv
The final dataset contains 40 columns organized into 9 categories:
- kernel: Kernel name
- regs: Registers per thread
- shmem: Shared memory per block (bytes)

- block: Threads per block
- grid_blocks: Total blocks in grid

- mean_ms: Mean execution time (milliseconds)
- std_ms: Standard deviation of execution time

- N: Number of elements (1D kernels)
- rows: Matrix rows (2D kernels)
- cols: Matrix columns (2D kernels)
- H: Height (convolution kernels)
- W: Width (convolution kernels)
- matN: Matrix dimension (matmul kernels)
- size_kind: Size representation type ("N" or "rows_cols")

- FLOPs: Floating-point operations executed
- BYTES: Total bytes accessed (read + write)
- shared_bytes: Shared memory allocated
- working_set_bytes: Active memory footprint
- arithmetic_intensity: FLOPs/BYTES ratio
- mem_pattern: Memory access pattern (coalesced, strided, random, etc.)
- has_branch_divergence: Branch divergence flag (0 or 1)
- atomic_ops_count: Number of atomic operations

- gpu_device_name: Full device name
- gpu_architecture: Architecture (Turing, Ampere, Ada Lovelace, etc.)
- gpu_compute_capability: Compute capability (e.g., "7.5")
- gpu_sms: Number of streaming multiprocessors
- gpu_max_threads_per_sm: Maximum threads per SM
- gpu_max_blocks_per_sm: Maximum blocks per SM
- gpu_regs_per_sm: Registers per SM
- gpu_shared_mem_per_sm: Shared memory per SM (bytes)
- gpu_l2_bytes: L2 cache size (bytes)
- gpu_warp_size: Warp size (typically 32)

- peak_theoretical_gflops: Theoretical peak compute (GFLOPS)
- peak_theoretical_bandwidth_gbps: Theoretical peak bandwidth (GB/s)
- calibrated_mem_bandwidth_gbps: Measured sustained bandwidth (GB/s)
- calibrated_compute_gflops: Measured sustained compute (GFLOPS)

- achieved_bandwidth_gbps: Actual bandwidth achieved (GB/s)
- achieved_compute_gflops: Actual compute throughput (GFLOPS)

- T1_model_ms: Single-thread execution time estimate
- speedup_model: Parallel speedup (T1 / actual_time)
Arithmetic intensity (AI) is the ratio of computation to memory traffic: FLOPs / BYTES.
- Low AI (< 0.5): Memory-bound kernels
- High AI (> 5): Compute-bound kernels
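For example, in the vector_add row shown in the output-format example below, each element performs 1 FLOP while moving 12 bytes (two 4-byte loads and one 4-byte store), giving AI = 1/12 ≈ 0.083, well into the memory-bound regime.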
mem_pattern values:
- coalesced: Sequential aligned memory access
- stride_N: Strided access with stride N
- transpose_naive: Uncoalesced transpose
- transpose_tiled: Shared memory optimized transpose
- random_gather: Random memory access
- shared_reduction: Reduction using shared memory
- matmul_naive: Basic matrix multiply access pattern
- matmul_tiled: Tiled shared memory matrix multiply
- stencil_NxN: Stencil operations (convolution)
- atomics_global_256: Global atomic operations (256 bins)
- atomics_hotspot: Single-location atomic contention
- smem_bank_conflict: Shared memory bank conflicts
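The labels above correspond roughly to index expressions like the following (a condensed, illustrative fragment collapsed into one kernel, not the actual benchmark code):

```cuda
// Condensed illustration of a few mem_pattern index expressions
// (simplified; the actual kernels in kernels/ may differ).
__global__ void access_patterns(const float* in, float* out,
                                const int* idx, int* bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = in[i];                 // coalesced: neighbouring threads read neighbouring words
    out[i] = in[(i * 8) % n];       // stride_8: neighbouring threads are 8 elements apart
    out[i] = in[idx[i]];            // random_gather: indirect load through an index array
    atomicAdd(&bins[i % 256], 1);   // atomics_global_256: contended adds into 256 bins
}
```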
The dataset reports performance at three levels:
- Theoretical: Maximum hardware capability from specifications
- Calibrated: Sustained performance measured on actual hardware
- Achieved: Performance of specific kernel workload
Calibrated values are typically 70-90% of theoretical peak due to real-world memory controller limits, instruction scheduling overhead, and power/thermal constraints.
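The calibrated bandwidth comes from a STREAM-style measurement. The sketch below shows the idea (array size, repetition count, and the use of a plain copy kernel are assumptions; stream_like.cu may use a triad and multiple sizes):

```cuda
// Sketch of a STREAM-style copy-bandwidth measurement in the spirit of stream_like.cu.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1 << 26;                             // 64M floats = 256 MB per array
    float *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    const int threads = 256;
    const int blocks  = (int)((n + threads - 1) / threads);

    copy_kernel<<<blocks, threads>>>(in, out, n);         // warmup launch
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int reps = 20;
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        copy_kernel<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each repetition reads and writes n floats: 2 * n * 4 bytes of DRAM traffic.
    double gbps = (2.0 * n * sizeof(float) * reps) / (ms * 1e6);
    printf("sustained_bandwidth_gbps=%.2f\n", gbps);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```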
Estimated parallel speedup using roofline model:
- T1 = max(compute_time_single_thread, memory_time_single_thread)
- Speedup = T1 / measured_parallel_time
- Typical values: 2-30× depending on kernel characteristics
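In code form the model reduces to something like the sketch below (the single-thread rate parameters are placeholders; the exact constants and formula live in build_final_dataset.py and are not reproduced here):

```cuda
// Sketch of the roofline-style single-thread model (illustrative constants;
// build_final_dataset.py owns the real calculation).
double speedup_model(double flops, double bytes, double measured_parallel_ms,
                     double single_thread_gflops,   // assumed single-thread compute rate
                     double single_thread_gbps) {   // assumed single-thread memory rate
    double compute_ms  = flops / (single_thread_gflops * 1e6);  // GFLOP/s -> FLOPs per ms
    double memory_ms   = bytes / (single_thread_gbps   * 1e6);  // GB/s    -> bytes per ms
    double t1_model_ms = compute_ms > memory_ms ? compute_ms : memory_ms;
    return t1_model_ms / measured_parallel_ms;                  // T1 / measured time
}
```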
Generate data for RTX 2080 Ti:
cd gpu-perf
# Calibrate GPU (one-time setup)
cd calibration
nvcc -O3 -o ../bin/props props.cu && ../bin/props > ../data/props_2080ti.out
nvcc -O3 -o ../bin/stream_like stream_like.cu && ../bin/stream_like > ../data/stream_like_2080ti.out
nvcc -O3 -lcublas -o ../bin/gemm_cublas gemm_cublas.cu && ../bin/gemm_cublas > ../data/gemm_cublas_2080ti.out
cd ..
# Run full pipeline
./scripts/generate_dataset_full.sh
# Output: data/runs_2080ti_final.csv (65+ rows × 40 columns)

The resulting CSV file can be used for performance analysis, modeling, or machine learning applications.
All output CSV files use standard format:
- Header row with column names
- One row per kernel configuration
- Numeric values with appropriate precision (6 decimals for times, 2 for throughput)
- Empty strings for inapplicable fields (e.g., N for 2D kernels)
Example row (abbreviated):
kernel,block,mean_ms,N,FLOPs,BYTES,arithmetic_intensity,achieved_bandwidth_gbps
vector_add,256,0.123456,1048576,1048576,12582912,0.083333,97.45
Requirements:
- NVIDIA GPU with CUDA support
- CUDA Toolkit (nvcc compiler)
- cuBLAS library
- Python 3.6+
- Bash shell
Methodology notes:
- Each trial includes warmup iterations to ensure the GPU is at stable clock speeds
- Multiple trials (10) per configuration provide statistical reliability
- Problem sizes chosen to span different cache/memory hierarchies
- Register and shared memory usage are extracted from compiler output (e.g., the resource-usage report nvcc prints with -Xptxas -v)