Pinned
- gpt2-inference (Public): A GPT-2 inference engine written from scratch in CUDA and C++. Implements custom CUDA kernels for tiled matrix multiplication, LayerNorm, fused attention, transformer blocks, KV cache management, a…
- Memory-Allocator (Public): Custom memory allocator in C++ built from scratch using mmap. Allocates a 1MB memory pool upfront and carves blocks from it to keep all allocations contiguous. Implements malloc, free, block reuse …
- Adaptive-ViT (Public): An adaptive Vision Transformer inference system that avoids unnecessary high-resolution computation, achieving ~3× faster inference than static high-res ViT by selectively escalating only when needed.
Python 5
- KV-Compression (Public): Implements and benchmarks KV cache compression methods for LLM inference in Triton, featuring optimized kernels for KIVI and TurboQuant.
- research-papers (Public): Research implementations focused on inference efficiency and model optimization. Includes custom Triton kernels, LoRA, knowledge distillation pipelines, and more.
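
The pool-based approach described for Memory-Allocator (one 1MB region obtained from mmap, with blocks carved out of it contiguously) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: the names `Pool`, `pool_init`, `pool_alloc`, and `pool_destroy` are hypothetical, the 16-byte alignment is an assumed choice, and free and block reuse are left out because the description is truncated at that point. It assumes a POSIX system that provides `MAP_ANONYMOUS`.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; not taken from the repository.
constexpr std::size_t kPoolSize = 1 << 20;  // 1 MB pool, as in the description

struct Pool {
    std::uint8_t* base = nullptr;  // start of the mmap'd region
    std::size_t offset = 0;        // bytes already handed out
};

// Reserve the whole pool up front with a single anonymous mmap call.
inline bool pool_init(Pool& p) {
    void* mem = mmap(nullptr, kPoolSize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return false;
    p.base = static_cast<std::uint8_t*>(mem);
    p.offset = 0;
    return true;
}

// Carve the next block from the pool, rounding the size up to 16 bytes
// (an assumed alignment). Returns nullptr when the pool cannot fit it.
inline void* pool_alloc(Pool& p, std::size_t size) {
    if (size > kPoolSize) return nullptr;
    std::size_t aligned = (size + 15) & ~static_cast<std::size_t>(15);
    if (aligned > kPoolSize - p.offset) return nullptr;
    void* block = p.base + p.offset;
    p.offset += aligned;
    return block;
}

// Return the whole pool to the operating system.
inline void pool_destroy(Pool& p) {
    if (p.base != nullptr) munmap(p.base, kPoolSize);
    p.base = nullptr;
    p.offset = 0;
}
```

Because every block comes from the same mapping, consecutive allocations sit next to each other in memory, which is the contiguity property the description mentions.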

