🤖FFPA: Extend FlashAttention-2 w/ Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA.
Updated Apr 18, 2026 · CUDA
⚡️Write HGEMM from scratch on Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.
General Matrix Multiplication using NVIDIA Tensor Cores
CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.
Vulkan & GLSL implementation of FlashAttention-2
CUDA 12-first backend inference for Unsloth on Kaggle — Optimized for small GGUF models (1B-5B) on dual Tesla T4 GPUs (15GB each, SM 7.5)
A benchmarking framework for correlators of FX telescope arrays
Neural Network C is an advanced neural network implementation in pure C, optimized for high performance on CPUs and NVIDIA GPUs.
INT8 Sparse Tensor Core GEMM for PyTorch — built for Windows
High-performance CUDA kernels with step-by-step optimization, profiling, and analysis. A growing collection of GPU solutions demonstrating warp-level tuning, memory optimization, and Tensor Core acceleration.
The MNIST classification problem is a fundamental machine learning task: recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.
🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
GNN inference acceleration with TVM compiler
CUDA matrix library for GEMM, GEMV, TRSM with naive, tiled, register-blocked, and tensor-core kernels. Includes FP16/BF16 mixed precision, sparse ops, cuSOLVER wrappers, and Python bindings.
TsuruTune is a comprehensive deep learning model optimization tool designed specifically for NVIDIA Jetson platforms and edge devices. It leverages Tensor Core acceleration and memory bandwidth alignment to achieve optimal inference performance on those platforms.
10,000-image LeNet-5 forward pass in ~28 ms on a single A40 via fused convolution and Tensor Cores (TF32).
Accelerate INT8 sparse inference in PyTorch on Windows with minimal setup. Achieve high performance using Sparse Tensor Cores without Linux dependencies.
CUDA Kernel Optimization Academy: SGEMM tutorial, TensorCraft ops library, advanced HPC, and an inference engine — from beginner level to Tensor Cores.
Mini deep learning inference engine (CUDA + C++17): 7-level GEMM optimization, FP16/INT8 quantization, and an auto-tuner.
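Most of the projects above build their GEMM kernels on the WMMA API that CUDA exposes for Tensor Cores. As a minimal orientation sketch (not taken from any listed repository; assumes sm_70 or newer, half inputs with float accumulation, and a single-tile problem), one warp can compute one 16x16x16 matrix-multiply-accumulate like this:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: D = A * B.
// A is row-major, B is column-major; M = N = K = 16.
__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);        // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

The tiled, register-blocked, and Split-D kernels in the repositories above layer shared-memory staging, software pipelining, and multi-warp tiling on top of this basic per-warp primitive.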