Reliable Inference Runtime for Autonomous Systems
Quick Start • Usage • Contributing
Zipy is a constraint-aware LLM inference runtime built in Rust and wgpu for low-latency decision-making in autonomous systems. It is designed to run directly on-device under strict memory, compute, and power constraints, such as on rovers, drones, and edge-based agents, by enforcing bounded execution, fallback modes, and structured outputs.
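The "bounded execution with fallback" idea above can be sketched in plain Rust. This is a hypothetical illustration, not Zipy's actual API: the function name `decode_with_budget`, the closure-based "model", and the fallback token are all invented here for clarity.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch: run an iterative decode loop under a hard wall-clock
/// budget. If the budget is exhausted mid-generation, the partial output is
/// discarded and a precomputed fallback action is returned instead, so the
/// control loop always gets an answer in bounded time.
/// `step` yields Some(token) while decoding and None when generation is done.
fn decode_with_budget<F>(mut step: F, budget: Duration, fallback: Vec<u32>) -> Vec<u32>
where
    F: FnMut() -> Option<u32>,
{
    let start = Instant::now();
    let mut out = Vec::new();
    while let Some(tok) = step() {
        out.push(tok);
        if start.elapsed() >= budget {
            // Deadline missed: hand back the safe fallback instead of a
            // partial, possibly ill-formed plan.
            return fallback;
        }
    }
    out
}

fn main() {
    // A toy "model" that emits three tokens and stops.
    let mut remaining = vec![3u32, 2, 1];
    let result = decode_with_budget(
        || remaining.pop(),
        Duration::from_secs(5),
        vec![0], // fallback token meaning e.g. "hold position"
    );
    println!("{:?}", result);
}
```

A real runtime would also bound memory and enforce an output schema, but the same shape applies: every decision path terminates with either a completed result or an explicit fallback.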
- Native `safetensors` loading with efficient GPU buffer management via `wgpu`
- Custom WGSL kernels for Transformer operations (MatMul, RoPE, RMSNorm)
- Quantized inference for reduced memory footprint on edge devices
- PagedAttention for efficient KV-cache management under fragmented VRAM
- Low-latency inference optimized for real-time decision loops
- Multi-modal input support (camera frames, IMU data, sensor streams)
- Persistent KV-cache offloading to NVMe to reduce recomputation overhead
- Continuous batching for fine-grained request scheduling
- Model distillation support for deployment in constrained environments
- FP16 / BF16 precision support
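To make the quantized-inference bullet concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, one common scheme for cutting weight memory roughly 4x versus FP32. The function names and the per-tensor granularity are assumptions for illustration, not Zipy's actual quantization scheme.

```rust
/// Symmetric per-tensor int8 quantization sketch: scale is chosen so the
/// largest-magnitude weight maps to +/-127.
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Dequantize back to f32 for accumulation in higher precision.
fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [0.5f32, -1.0, 0.25, 1.0];
    let (q, s) = quantize_i8(&w);
    let back = dequantize_i8(&q, s);
    // Round-trip error is bounded by roughly half the scale per element.
    for (a, b) in w.iter().zip(&back) {
        assert!((a - b).abs() <= s / 2.0 + 1e-6);
    }
    println!("scale = {s}, quantized = {:?}", q);
}
```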
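The PagedAttention bullet rests on one piece of bookkeeping: the KV-cache is carved into fixed-size blocks, and each sequence holds a block table mapping logical token positions to physical block slots, so cache memory need not be contiguous in VRAM. The sketch below shows that mapping in plain Rust; the block size, type names, and free-list allocator are assumptions for illustration, not Zipy's implementation.

```rust
const BLOCK_SIZE: usize = 16; // tokens per KV block (assumed value)

/// Free-list allocator over a fixed pool of physical KV blocks.
struct BlockAllocator {
    free: Vec<usize>,
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, blocks: &[usize]) {
        // Blocks freed by finished sequences become reusable immediately.
        self.free.extend_from_slice(blocks);
    }
}

/// Per-sequence block table: grows one block at a time as tokens arrive.
struct Sequence {
    len: usize,
    block_table: Vec<usize>,
}

impl Sequence {
    /// Returns the (physical block, offset) slot where the next token's
    /// KV entries should be written, allocating a fresh block on demand.
    fn append_token(&mut self, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        if self.len % BLOCK_SIZE == 0 {
            self.block_table.push(alloc.alloc()?);
        }
        let slot = (self.block_table[self.len / BLOCK_SIZE], self.len % BLOCK_SIZE);
        self.len += 1;
        Some(slot)
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(4);
    let mut seq = Sequence { len: 0, block_table: Vec::new() };
    for _ in 0..BLOCK_SIZE + 1 {
        seq.append_token(&mut alloc).expect("out of KV blocks");
    }
    // 17 tokens span two blocks that need not be adjacent in VRAM.
    println!("block table: {:?}", seq.block_table);
    alloc.release(&seq.block_table);
}
```

Because allocation happens one block at a time, a fragmented pool of free blocks is as usable as a contiguous one, which is the property that helps under fragmented VRAM.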
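Continuous batching, the scheduling idea in the list above, can also be sketched: rather than waiting for an entire batch to finish, the scheduler re-forms the batch at every decode step, admitting queued requests the moment running ones complete. Everything below (the `Request` struct, token counts standing in for decode work) is a simplified illustration, not Zipy's scheduler.

```rust
use std::collections::VecDeque;

/// A queued generation request; `tokens_left` stands in for remaining
/// decode steps in this toy model.
struct Request {
    id: u32,
    tokens_left: u32,
}

/// Step-level scheduler sketch: returns request ids in completion order.
fn run_scheduler(mut queue: VecDeque<Request>, max_batch: usize) -> Vec<u32> {
    let mut running: Vec<Request> = Vec::new();
    let mut finished = Vec::new();
    while !queue.is_empty() || !running.is_empty() {
        // Admit waiting requests into any free batch slots.
        while running.len() < max_batch {
            match queue.pop_front() {
                Some(r) => running.push(r),
                None => break,
            }
        }
        // One decode step for the whole batch.
        for r in running.iter_mut() {
            r.tokens_left -= 1;
        }
        // Finished requests leave immediately, freeing slots for the queue.
        running.retain(|r| {
            if r.tokens_left == 0 {
                finished.push(r.id);
                false
            } else {
                true
            }
        });
    }
    finished
}

fn main() {
    let queue = VecDeque::from(vec![
        Request { id: 1, tokens_left: 2 },
        Request { id: 2, tokens_left: 1 },
        Request { id: 3, tokens_left: 3 },
    ]);
    // With a batch size of 2, request 3 is admitted as soon as 2 finishes,
    // rather than waiting for the whole initial batch to drain.
    println!("completion order: {:?}", run_scheduler(queue, 2));
}
```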
Quick Start

TBD.

Usage

TBD.

Contributing

TBD.
Built by @ashworks1706 for real-time autonomous systems operating in constrained environments.
