Agent.cpp is a high-performance CPU-only C++ inference engine specifically designed for the Tiny-MoA project.
While it is based on (cloned from) the famous llama.cpp, it is not a simple copy. We have redesigned the core architecture to maximize performance for Mixture of Agents (MoA) environments, where iterative inference and frequent context switching are critical.
llama.cpp is an excellent general-purpose engine, but it can be inefficient in MoA environments where multiple models converse and Context Switching occurs dozens of times. Agent.cpp was born to solve this bottleneck.
| Feature | Standard llama.cpp | Agent.cpp (Tiny-MoA Engine) |
|---|---|---|
| Target Use | General LLM Inference | Tiny-MoA Multi-Agent Orchestration |
| Cache Management | Linear Cache (Optimized for single chat) | RadixCache (Tree-based, Instant restore for multi-branch chats) |
| Memory Tech | Standard KV Cache | PagedAttention & RadixAttention (0% Memory Fragmentation) |
| Orchestration | External Python script required | Native C++ Built-in (Brain/Specialist Auto-Routing) |
| Performance (TTFT) | Prompt re-computation every turn | Instant Generation on Cache Hit (~1.8x to 10x Faster) |
This engine is not just a wrapper. We have implemented state-of-the-art LLM serving technologies directly in C++.
- RadixAttention (Tree-based Caching): When System Prompts and Few-shot examples are shared across multiple agents, they are managed in a tree structure, completely eliminating redundant computations.
- PagedAttention: Borrowing paging techniques from OS, KV Cache memory is managed in pages, achieving nearly 100% memory efficiency without defragmentation.
- CPU Optimization: actively utilizes AVX2/AVX-512 instructions to ensure Tiny-MoA (1B~3B models) runs near real-time even without a GPU.
- Windows (Visual Studio 2022), CMake 3.20+
- Reference:
Tiny-MoAmodel file (.gguf)
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release --target agentUnlike llama.cpp, you don't need complex configurations. Just provide -m (model) and -p (prompt), and the optimized MoA pipeline runs internally.
./build/bin/Release/agent.exe -m path/to/LFM2.5-1.2B.gguf -p "Human: Explain quantum mechanics in 50 words. Assistant:"Currently, Agent.cpp serves as the core kernel for Tiny-MoA. The following features are planned:
- Python Bindings (pybind11): Support for importing the C++ core as a library in Python.
- Tools & Function Calling: Native C++ implementation of external tools (Search, Calculator) to minimize overhead.
- Multimodal (Vision) Support: Integration of 'Vision Specialist' for image recognition in the MoA pipeline.
- Runtime Quantization: Adaptive quantization levels for better support on low-end hardware.
Agent.cpp Project If you wish to contribute or have questions, please open an Issue.