Fast, reproducible, and portable software development environments
Updated Dec 8, 2021 - Dockerfile
Remote development on HPC clusters with VSCode
Accelerate and optimize existing C/C++ CPU-only applications using the most essential CUDA tools and techniques.
Matrix multiplication example performed with OpenMP, OpenACC, BLAS, cuBLAS, and CUDA
High-performance Sobel edge detection using CUDA with CPU vs GPU benchmarking, roofline analysis, and Nsight profiling.
CUDA Samples and Nsight Guided Profiling Samples
A simple and understandable CUDA kernel for the batch-matmul operation
Repository for the Architecture of Computers and Parallel Systems course at VŠB
Custom PyTorch CUDA kernel implementing an optimized ReLU activation with vectorization, performance profiling, and memory analysis on a Tesla T4 GPU, achieving 75% bandwidth efficiency.
The MNIST classification problem is a fundamental machine learning task that involves recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.
High-Performance Computing (HPC) & Optimization studies using CUDA C++. Includes Grid-Stride Loops, Shared Memory tiling, and Nsight Compute profiling analysis.
CUDA-accelerated kNN regression for rent estimation with CPU baseline, shared-memory optimization, and profiling
University Project for "Computer Architecture" course (MSc Computer Engineering @ University of Pisa). Implementation of a Parallelized Nearest Neighbor Upscaler using CUDA.
🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.
16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 — Ampere architecture
A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling—featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU’s inner workings frame by frame.
Quantum workload planning and profiler-backed architecture analysis for exact tensor-network execution.