A comprehensive, hands-on guide to building a high-performance LLM inference engine in Mojo and Python.
- GPU: NVIDIA GPU with CUDA 12.0+ (A100, H100, or RTX 4090 recommended)
- Mojo SDK: Latest version from Modular
- Python: 3.10 or higher
- CUDA Toolkit: 12.0+
# Clone and setup
git clone https://github.com/Ammar-Alnagar/MILI.git mili_qwen3
cd mili_qwen3
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Build Mojo kernels
cd mojo_kernels
bash build.sh
cd ..

# Start server
python server.py
# In another terminal, test it:
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "What is machine learning?",
"max_tokens": 128,
"temperature": 0.7
}'

This project is organized as a progressive learning guide:
1. Project Overview - START HERE
- High-level architecture
- System design principles
- Project organization
- Prerequisites and setup
2. Mojo Kernel Guide - GPU Kernels (Legacy)
- Foundation & setup
- RoPE implementation
- RMSNorm kernels
- SwiGLU activation
- FlashAttention prefill
- Decode-phase attention
- Memory management
3. Python Integration Guide - Python Layer
- Model architecture & config
- Weight loading
- Tokenization (tiktoken)
- Request scheduler (continuous batching)
- Sampling strategies
- Model class integration
4. Attention Mechanisms - Deep Dive
- Scaled dot-product attention
- Grouped Query Attention (GQA)
- FlashAttention optimization
- Decode-phase optimization
- Multi-request attention
- Performance benchmarks
5. KV Cache Management - Memory Efficiency
- Paged KV cache
- RadixAttention for prefix sharing
- Reference counting
- Allocation strategies
- Eviction policies
- Integration with inference loop
6. Deployment Guide - Production Ready
- FastAPI server setup
- Docker containerization
- Kubernetes deployment
- GPU optimization
- Monitoring & metrics
- Load testing
7. Advanced Optimization - Performance Tuning
- Kernel optimization techniques
- Memory bandwidth optimization
- Parallel processing strategies
8. Troubleshooting and Debugging - Common Issues
- Environment setup problems
- Kernel compilation issues
- Python integration bugs
- Server deployment issues
- Performance bottlenecks
9. API Reference - Complete API Docs
- Server endpoints
- Request/response formats
- Configuration options
- Error handling
10. Best Practices and Patterns - Production Patterns
- Code organization
- Testing strategies
- Deployment best practices
- Performance monitoring
mili_qwen3/
├── config/ # Configuration files
│ ├── inference_config.json # Inference settings
│ └── model_config.json # Model configuration
│
├── docs/ # Documentation
│ ├── 01_PROJECT_OVERVIEW.md
│ ├── 02_MOJO_KERNEL_GUIDE.md
│ ├── 03_PYTHON_INTEGRATION.md
│ ├── 04_ATTENTION_MECHANISMS.md
│ ├── 05_KV_CACHE_MANAGEMENT.md
│ ├── 06_DEPLOYMENT.md
│ ├── 07_ADVANCED_OPTIMIZATION.md
│ ├── 08_TROUBLESHOOTING_AND_DEBUGGING.md
│ ├── 09_API_REFERENCE.md
│ └── 10_BEST_PRACTICES_AND_PATTERNS.md
│
├── examples/ # Example scripts
│ └── basic_inference.py # Basic inference example
│
├── mojo_kernels/ # Mojo kernel implementations (legacy)
│ ├── core/
│ │ ├── activations.🔥
│ │ ├── attention.🔥
│ │ ├── normalization.🔥
│ │ └── rope.🔥
│ ├── memory/
│ │ └── kv_cache.🔥
│ ├── utils/
│ │ └── types.🔥
│ ├── build.sh
│ └── test_simple.mojo
│
├── python_layer/ # Python components
│ ├── inference/
│ │ ├── __init__.py
│ │ └── inference_engine.py
│ ├── memory/
│ │ ├── __init__.py
│ │ └── kv_cache_manager.py
│ ├── model/
│ │ ├── __init__.py
│ │ ├── qwen_model.py
│ │ └── weight_loader.py
│ ├── server/
│ │ ├── __init__.py
│ │ └── api.py
│ ├── tokenizer/
│ │ ├── __init__.py
│ │ └── qwen_tokenizer.py
│ └── utils/
│ └── __init__.py
│
├── tests/ # Test suites
│ ├── integration/
│ │ ├── __init__.py
│ │ └── test_inference.py
│ ├── performance/
│ │ └── __init__.py
│ ├── unit/
│ │ ├── __init__.py
│ │ └── test_tokenizer.py
│ └── __init__.py
│
├── DELIVERABLES.txt # Project deliverables
├── IMPLEMENTATION_GUIDE.md # Implementation guide
├── INDEX.md # Project index
├── pyproject.toml # Python project config
├── requirements.txt # Python dependencies
├── server.py # Main inference server
├── STRUCTURE.md # Project structure
├── test_project.py # Project test script
├── test_qwen3_local.py # Local Qwen3 test
├── test_real_weights.py # Weight loading test
├── verify_implementation.py # Implementation verification
└── verify_simple.py # Simple verification
- Read Project Overview
- Set up development environment
- Review Mojo SDK basics
- Understand transformer architecture
- Study Mojo Kernel Guide
- Implement RoPE kernel
- Implement RMSNorm kernel
- Implement SwiGLU activation
- Write unit tests (e.g., against NumPy references like the sketch below)
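One low-friction way to unit-test the kernels is to compare their outputs against small NumPy references. Below is a minimal sketch for RMSNorm and SwiGLU; the shapes, dtype, and eps value are illustrative assumptions, not values taken from the project's model config.

```python
# Hypothetical NumPy references for validating Mojo kernel outputs.
# Shapes, dtype, and eps are illustrative, not the project's actual config values.
import numpy as np

def rmsnorm_ref(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: scale x by the reciprocal root-mean-square over the last axis."""
    rms = np.sqrt(np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def swiglu_ref(x: np.ndarray, w_gate: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    """SwiGLU: SiLU(x @ W_gate) multiplied elementwise by (x @ W_up)."""
    gate = x @ w_gate
    return (gate * (1.0 / (1.0 + np.exp(-gate)))) * (x @ w_up)

# Quick shape checks with random inputs (swap in kernel outputs in a real test).
x = np.random.randn(4, 64).astype(np.float32)
w = np.ones(64, dtype=np.float32)
assert rmsnorm_ref(x, w).shape == (4, 64)
wg = np.random.randn(64, 128).astype(np.float32)
wu = np.random.randn(64, 128).astype(np.float32)
assert swiglu_ref(x, wg, wu).shape == (4, 128)
```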
- Study Attention Mechanisms guide
- Understand the FlashAttention algorithm (see the online-softmax sketch after this list)
- Implement prefill attention
- Implement decode attention
- Benchmark implementations
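The core idea behind FlashAttention's prefill path is processing keys and values in tiles while maintaining a running softmax maximum and denominator. The NumPy sketch below illustrates that online-softmax rescaling for a single query vector; it is an illustration of the algorithm, not the project's kernel, and the tile size is arbitrary.

```python
# Minimal sketch of the online-softmax trick behind FlashAttention (NumPy).
# Single query vector; keys/values are consumed tile by tile.
import numpy as np

def attention_online(q, K, V, tile=32):
    scale = 1.0 / np.sqrt(q.shape[-1])
    m, denom = -np.inf, 0.0                  # running max and softmax denominator
    acc = np.zeros(V.shape[-1])              # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        k_t, v_t = K[start:start + tile], V[start:start + tile]
        scores = (k_t @ q) * scale
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)       # rescale previously accumulated results
        p = np.exp(scores - m_new)
        denom = denom * correction + p.sum()
        acc = acc * correction + p @ v_t
        m = m_new
    return acc / denom

# Matches the naive softmax(q.K^T) V up to floating-point error.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
scores = (K @ q) / np.sqrt(64)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
assert np.allclose(attention_online(q, K, V), weights @ V)
```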
- Study Python Integration guide
- Implement model config
- Implement weight loader
- Implement tokenizer wrapper
- Implement the request scheduler with continuous batching (see the sketch after this list)
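To make the continuous-batching idea concrete, here is a toy scheduler loop: requests join the running batch as soon as slots free up rather than waiting for the whole batch to drain. The `Request`, `step_batch`, and `MAX_BATCH` names are hypothetical and do not correspond to the project's scheduler API.

```python
# Toy continuous-batching loop: new requests are admitted every decode step,
# finished requests are retired immediately, keeping the GPU batch full.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 4

@dataclass
class Request:
    prompt: str
    max_tokens: int
    generated: list[int] = field(default_factory=list)

def step_batch(batch: list[Request]) -> None:
    """Stand-in for one decode step: append a dummy token to every request."""
    for req in batch:
        req.generated.append(0)

def run(waiting: deque) -> None:
    running: list[Request] = []
    while waiting or running:
        # Admit new requests while there is room in the batch.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        step_batch(running)
        # Retire finished requests immediately, freeing slots for the next step.
        running = [r for r in running if len(r.generated) < r.max_tokens]

run(deque(Request(f"prompt {i}", max_tokens=2 + i) for i in range(8)))
```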
- Study KV Cache Management guide
- Implement the paged KV cache (see the sketch after this list)
- Implement RadixAttention
- Implement allocation strategies
- Integration testing
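A paged KV cache boils down to a page table per sequence plus a shared free-page pool. The sketch below shows the bookkeeping for 16-token pages; class and method names are made up for illustration and are not the project's kv_cache_manager API.

```python
# Toy paged KV-cache bookkeeping: logical token positions map to fixed-size
# pages drawn from a shared free pool. Names are hypothetical.
PAGE_SIZE = 16

class PagedAllocator:
    def __init__(self, num_pages: int) -> None:
        self.free_pages = list(range(num_pages))
        self.page_tables: dict[int, list[int]] = {}  # seq_id -> allocated page ids
        self.lengths: dict[int, int] = {}            # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for the next token; returns (page_id, offset_in_page)."""
        pos = self.lengths.get(seq_id, 0)
        table = self.page_tables.setdefault(seq_id, [])
        if pos % PAGE_SIZE == 0:                     # current page is full (or none yet)
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = pos + 1
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedAllocator(num_pages=8)
slots = [alloc.append_token(seq_id=0) for _ in range(20)]  # 20 tokens -> 2 pages
assert len(alloc.page_tables[0]) == 2
alloc.free(0)
```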
- Study Deployment guide
- Build FastAPI server
- Docker containerization
- Kubernetes deployment
- Load testing & optimization
# Build all Mojo kernels
cd mojo_kernels && bash build.sh && cd ..
# Build specific kernel with optimization
mojo build -O3 -o lib/core/attention.so core/attention.🔥

# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/unit/test_kernels.py -v
# Run with coverage
pytest tests/ --cov=python_layer --cov-report=html

# Development server with auto-reload
python server.py
# Production server (single process for now)
python server.py
# Simple inference script
python examples/basic_inference.py

# Benchmark kernels
python tests/performance/benchmark_kernels.py
# End-to-end benchmark
python tests/performance/benchmark_e2e.py
# Load testing
python tests/performance/load_test.py --num-requests 1000 --concurrent 50

| Metric | Target | Status |
|---|---|---|
| Prefill Throughput | > 100K tokens/sec | |
| Decode Throughput | > 50 tokens/sec (1 req) | |
| Batch Decode | > 5K tokens/sec (batch 64) | |
| E2E Latency | < 1s (512 + 128 tokens) | |
| Memory Efficiency | < 90% VRAM (batch 64) | |
# Build image
docker build -t mili:latest -f deployment/docker/Dockerfile .
# Run container
docker run --gpus all -p 8000:8000 mili:latest
# Run with docker-compose
docker-compose -f deployment/docker/docker-compose.yml up

# Deploy
kubectl apply -f deployment/kubernetes/
# Check status
kubectl get pods
kubectl logs -f deployment/mili-inference
# Port forward
kubectl port-forward svc/mili-inference 8000:80

# Using the API server
import requests
response = requests.post("http://localhost:8000/generate", json={
    "prompt": "What is the meaning of life?",
    "max_tokens": 100,
    "temperature": 0.7
})
result = response.json()
print(result["generated_text"])
# Or direct Python usage
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
inputs = tokenizer("What is the meaning of life?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain quantum computing in simple terms.",
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0.95,
"top_k": 50,
"stream": false
}'

- **Continuous Batching**: Dynamically add and remove requests as they complete, maximizing GPU utilization.
- **Paged KV Cache**: Divide the KV cache into fixed-size pages (16 tokens) for efficient memory management.
- **RadixAttention**: Share the KV cache across requests with common prefixes using a radix tree.
- **FlashAttention**: Optimize attention with tiled computation and online softmax for a 3-4x speedup.
- **Grouped Query Attention (GQA)**: Reduce KV cache size by using fewer KV heads than query heads (see the sketch below).
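As a quick illustration of how GQA shrinks the cache, the NumPy sketch below expands a small number of KV heads across their query-head groups before running standard attention; the head counts are arbitrary and not Qwen3's actual configuration.

```python
# Sketch of Grouped Query Attention: n_kv_heads < n_q_heads, so each KV head is
# shared by a group of query heads and only n_kv_heads are cached.
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 16, 32
group = n_q_heads // n_kv_heads                   # 4 query heads share each KV head

q = np.random.randn(n_q_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)    # only this much goes in the KV cache
v = np.random.randn(n_kv_heads, seq, head_dim)

# Expand KV heads across their query groups, then run standard attention per head.
k_exp = np.repeat(k, group, axis=0)
v_exp = np.repeat(v, group, axis=0)
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_exp                             # (n_q_heads, seq, head_dim)

# The cache stores n_kv_heads instead of n_q_heads: a 4x reduction here.
assert out.shape == (n_q_heads, seq, head_dim)
```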
Contributions are welcome! Areas where help is needed:
- Implement additional kernels (quantization, fused ops)
- Optimize existing kernels
- Add more sampling strategies
- Improve documentation
- Add more tests
- Performance optimizations
- Attention Is All You Need - Transformer architecture
- FlashAttention - Efficient attention
- Grouped Query Attention - KV cache reduction
- PagedAttention - Paged KV cache
- Mojo Documentation
- MAX Framework
- Hugging Face Transformers
- vLLM - Reference implementation
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: Report bugs or feature requests
- Discussions: Ask questions and share ideas
- Documentation: Check docs/ for detailed guides
This project builds upon research from:
- Modular AI team (Mojo language)
- DeepSpeed team (optimization techniques)
- vLLM team (inference system design)
- HuggingFace community (models and tools)
**Happy inferencing!**
For a detailed walkthrough, start with 01_PROJECT_OVERVIEW.md.