tensor-cores

Cre4T3Tiv3 / jetson-orin-matmul-analysis

CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.

Updated Apr 2, 2026
Python

etasnadi / VulkanCooperativeMatrixAttention

Vulkan & GLSL implementation of FlashAttention-2

vulkan glsl artificial-intelligence gpu-acceleration attention gpu-computing deel-learning tensor-cores large-language-models llm flash-attention flash-attention-2

Updated Jan 19, 2025
C++

llcuda / llcuda

CUDA 12-first backend inference for Unsloth on Kaggle — Optimized for small GGUF models (1B-5B) on dual Tesla T4 GPUs (15GB each, SM 7.5)

python machine-learning ai deep-learning jupyter gpu cuda inference pytorch nvidia cuda-kernels google-colab tensor-cores tesla-t4 llm gguf unsloth flashattention

Updated Feb 1, 2026
Jupyter Notebook

LDRyan0 / Correlator-Bench

A benchmarking framework for correlators of FX telescope arrays

cpp cuda radio-astronomy astronomy-instrumentation tensor-cores

Updated Oct 20, 2023
Cuda

LessUp / sgemm-optimization

Progressive CUDA SGEMM tutorial and reference code: five kernels from naive GEMM to Tensor Core WMMA, with cuBLAS verification and benchmarks.

tutorial cuda matrix-multiplication high-performance-computing cuda-kernels shared-memory gemm sgemm gpu-optimization bank-conflict tensor-cores wmma

Updated Apr 22, 2026
Cuda

NeuralAditya / Neural_Network_C

Neural Network C is an advanced neural network implementation in pure C, optimized for high performance on CPUs and NVIDIA GPUs.

Updated Mar 29, 2025
C

LessUp / hpc-ai-optimization-lab

🎓 CUDA HPC Kernel Optimization Lab: Progressive GEMM, FlashAttention, Tensor Core & CUDA 13 Features | A CUDA high-performance kernel optimization lab, from naive to Tensor Core

Updated Apr 22, 2026
Cuda

keneoneth / leet_gpu_solution

High-performance CUDA kernels with step-by-step optimization, profiling, and analysis. A growing collection of GPU solutions demonstrating warp-level tuning, memory optimization, and Tensor Core acceleration.

gpu-acceleration cuda-programming tensor-cores leetgpu warp-reduction

Updated Nov 12, 2025
Cuda

WizardsForgeIo / sparsemma

INT8 Sparse Tensor Core GEMM for PyTorch — built for Windows

windows gpu cuda inference pytorch nvidia sparse quantization gemm int8 ptx structured-sparsity tensor-cores vram-optimization

Updated Feb 16, 2026
Cuda

Umer-Farooq-CS / MNIST-Classification

The MNIST classification problem is a fundamental machine learning task that involves recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.

benchmarking deep-learning parallel-computing cuda mnist neural-networks high-performance-computing gpu-acceleration profiling shared-memory openacc performance-optimization c-cpp nsight tensor-cores cuda-streams pinned-memory

Updated Sep 12, 2025
Cuda

ZrobMiloudaa / jetson-orin-matmul-analysis

🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.

machine-learning robotics cuda cublas matrix-multiplication high-performance-computing gpu-computing performance-optimization autonomous-systems edge-computing nvidia-jetson embeded-systems tensor-cores ml-deployment jetson-orin-nano gpu-benchmarking power-efficiency-benchmark cuda-optimization

Updated Apr 23, 2026
Python

fsudjatmiko / tsurutune-app

TsuruTune is a deep learning model optimization tool designed specifically for NVIDIA Jetson platforms and edge devices. It leverages Tensor Core acceleration and memory bandwidth alignment to improve deep learning inference performance on those devices.

deep-learning optimization-methods tensor-cores

Updated Jan 8, 2026
Python

berlin0308 / NeedleGX-TVM

GNN inference acceleration with TVM compiler

cuda inference avx-instructions tvm gnns tensor-cores

Updated Dec 17, 2025
Python

Yoonkyu-Lee / batched-lenet-cuda

10,000-image LeNet-5 forward pass in ~28 ms on a single A40 via fused convolution and Tensor Cores (TF32).

parallel-computing cuda inference cnn matrix-multiplication lenet convolution gpu-computing ampere gpu-programming lenet-5 im2col cuda-programming tensor-cores kernel-optimization tf32 wmma

Updated Apr 16, 2026
Cuda

aye-shadow / neural-network-acceleration

cuda gpu-acceleration tensor-cores

Updated Apr 20, 2025
Cuda

Olajide-Badejo / CUDA-Matrix-Library

CUDA matrix library for GEMM, GEMV, TRSM with naive, tiled, register-blocked, and tensor-core kernels. Includes FP16/BF16 mixed precision, sparse ops, cuSOLVER wrappers, and Python bindings.

cpp gpu cuda blas gemm mixed-precision tensor-cores

Updated Apr 15, 2026
C++

LessUp / mini-inference-engine

CUDA GEMM Optimization Learning Project: 7-Level Progressive Optimization from Naive to ~89% cuBLAS Performance

deep-learning cpp hpc cuda inference nvidia matrix-multiplication high-performance-computing cuda-kernels gpu-computing gemm performance-optimization nvidia-cuda inference-engine gpu-programming cuda-programming tensor-cores

Updated Apr 22, 2026
C++

Here are 24 public repositories matching this topic...

xlite-dev / ffpa-attn

xlite-dev / HGEMM

tgautam03 / tGeMM

Cre4T3Tiv3 / jetson-orin-matmul-analysis

etasnadi / VulkanCooperativeMatrixAttention

llcuda / llcuda

LDRyan0 / Correlator-Bench

LessUp / sgemm-optimization

NeuralAditya / Neural_Network_C

LessUp / hpc-ai-optimization-lab

keneoneth / leet_gpu_solution

WizardsForgeIo / sparsemma

Umer-Farooq-CS / MNIST-Classification

ZrobMiloudaa / jetson-orin-matmul-analysis

fsudjatmiko / tsurutune-app

berlin0308 / NeedleGX-TVM

Yoonkyu-Lee / batched-lenet-cuda

aye-shadow / neural-network-acceleration

Olajide-Badejo / CUDA-Matrix-Library

LessUp / mini-inference-engine
