Mog9

Pinned

  1. gpt2-inference Public

    A GPT-2 inference engine written from scratch in CUDA and C++. Implements custom CUDA kernels for tiled matrix multiplication, LayerNorm, fused attention, transformer blocks, KV cache management, a…

    CUDA · 39 stars · 1 fork

  2. Memory-Allocator Public

    Custom memory allocator in C++ built from scratch using mmap. Allocates a 1MB memory pool upfront and carves blocks from it to keep all allocations contiguous. Implements malloc, free, block reuse …

    C++ · 36 stars · 2 forks

  3. tri-sds Public

    Triton-based EAGLE speculative decoding engine for Qwen3-4B to Qwen3-32B on AMD MI300X. Matches SGLang's acceptance speedup ratios (1.56–2.49×) with fully custom Triton kernels (prefill, GQA decode…

    Python · 4 stars

  4. Adaptive-ViT Public

    An adaptive Vision Transformer inference system that avoids unnecessary high-resolution computation, achieving ~3× faster inference than static high-res ViT by selectively escalating only when needed.

    Python · 5 stars

  5. KV-Compression Public

    Implements and benchmarks KV cache compression methods for LLM inference in Triton, with optimized kernels for KIVI and TurboQuant.

    Python · 5 stars · 1 fork

  6. research-papers Public

    Research implementations focused on inference efficiency and model optimization. Includes custom Triton kernels, LoRA, knowledge distillation pipelines, and more.

    Python · 12 stars · 2 forks