CUDA C++ implementations of core acceleration patterns used in machine learning and image-processing workloads: matrix multiplication, shared-memory tiling, cuBLAS benchmarking, and Python-callable GPU convolution.
This repository is positioned as an AI/MLE systems portfolio project. The goal is to show that I can reason about GPU kernels, benchmark boundaries, and Python/CUDA integration rather than only call high-level ML libraries.
- Implemented CPU, naive CUDA, shared-memory tiled CUDA, and cuBLAS matrix multiplication benchmarks.
- Built a Python-callable CUDA shared library with
ctypesfor 2D convolution and Sobel edge detection. - Compared custom kernels against CPU baselines and NVIDIA library primitives on an NVIDIA Tesla T4 environment.
- Added reproducible scripts for collecting benchmark CSVs and regenerating plots.
Modern ML systems often bottleneck on tensor operations, preprocessing, feature extraction, and data movement. This project demonstrates lower-level skills that are useful when profiling inference pipelines, writing custom CUDA extensions, evaluating GPU utilization, or deciding when to use vendor libraries such as cuBLAS instead of maintaining a custom kernel.
.
|-- collect_results.py # Compiles/runs matrix benchmarks and writes results/final_comparison.csv
|-- plot_matrix_performance.py # Generates benchmark plots from results/final_comparison.csv
|-- CUDA_Acceleration_Primitives.ipynb # Colab verification notebook
|-- matrix_gpu.cu # Naive CUDA matrix multiplication
|-- matrix_gpu_optimized.cu # Shared-memory tiled CUDA matrix multiplication
|-- matrix_cublas.cu # cuBLAS SGEMM benchmark
|-- matrix_lib.cu # Python-callable CUDA matrix multiplication library
|-- test_lib.py # ctypes smoke test for libmatrix.so
|-- cpu/
| |-- matrix_cpu.c # CPU O(N^3) baseline
| `-- result.csv # CPU-only timing reference
|-- convolution/
| |-- convolution_lib.cu # CUDA convolution kernel + C ABI wrapper
| |-- convolution_standalone.cu # Standalone CUDA convolution benchmark
| |-- test_convolution.py # Python ctypes benchmark and visualization
| `-- results/ # Convolution output images
`-- results/ # Matrix benchmark CSV and plots
Benchmarked on Google Colab with an NVIDIA Tesla T4 GPU. GPU timings are measured with CUDA events around the kernel or cuBLAS call only; host-device transfers and allocation are intentionally excluded for this compute-kernel comparison. CPU timings measure the baseline CPU compute loop.
| Implementation | N=256 | N=512 | N=1024 | N=2048 | N=4096 |
|---|---|---|---|---|---|
| CPU | 0.019891s | 0.224031s | 3.280722s | 88.443179s | Timeout |
| Naive GPU | 0.008071s | 0.007691s | 0.007581s | 0.007705s | 0.010922s |
| Optimized Tiled CUDA | 0.011163s | 0.011361s | 0.010567s | 0.010849s | 0.007325s |
| cuBLAS | 0.014966s | 0.005545s | 0.006221s | 0.011343s | 0.052835s |
The convolution module exposes C ABI functions from CUDA C++ and calls them from Python with ctypes. The benchmark uses a Sobel 3x3 edge filter while varying image size, then uses 5x5 and 7x7 box filters to test larger stencil workloads.
- NVIDIA GPU with CUDA toolkit and
nvcc - Python 3.9+
- Python packages:
numpy,pandas,matplotlib - Optional for the original environment: Google Colab T4 GPU
For a notebook-based run, open CUDA_Acceleration_Primitives.ipynb in Colab with a GPU runtime. The notebook follows the same commands below and is meant as a reproducible project demo rather than a course lab handout.
The scripts default to sm_75 for Tesla T4. To target another GPU, set CUDA_ARCH:
CUDA_ARCH=sm_86 python collect_results.pypython collect_results.py
python plot_matrix_performance.pyOutputs:
results/final_comparison.csvresults/plot_log_scale.pngresults/plot_gpu_only.png
nvcc -arch=sm_75 -Xcompiler -fPIC -shared convolution/convolution_lib.cu -o convolution/libconvolution.so
python convolution/test_convolution.pyOutputs:
convolution/results/edge_detection_result.pngconvolution/results/convolution_output_demo.pngconvolution/results/comparison.png
Optional standalone benchmark:
nvcc -arch=sm_75 convolution/convolution_standalone.cu -o convolution/convolution_standalone
./convolution/convolution_standalone- The optimized matrix kernel uses 16x16 shared-memory tiles to reduce repeated global memory reads.
- The cuBLAS benchmark is included as a vendor-library reference point, which is usually the right production choice for dense linear algebra.
- The convolution wrapper currently allocates/copies/frees GPU memory inside each Python call, so its timing is closer to an end-to-end Python integration benchmark than the matrix kernel-only timing.
- For interview discussion, the next natural optimizations would be CUDA error-checking macros, pinned host memory, persistent device buffers for repeated convolution calls, correctness checks against NumPy, and Tensor Core or WMMA experiments for matrix multiplication.


