This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
RCLI is a high-performance, low-latency on-device voice AI system built in C++17 for Apple Silicon. It combines STT (Zipformer/Whisper), LLM (Qwen3 via llama.cpp), TTS (Piper/Kokoro via sherpa-onnx), VAD (Silero), tool calling (43 macOS actions), and RAG into a single unified pipeline with an interactive TUI.
Platform: macOS (Apple Silicon with Metal GPU). Language: C++17, Apple Clang.
# First-time setup (clones llama.cpp + sherpa-onnx into deps/)
bash scripts/setup.sh
# Download AI models (~1 GB)
bash scripts/download_models.sh
# Build
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(sysctl -n hw.ncpu)

./build/rcli # Interactive TUI mode
./build/rcli listen # Continuous voice mode
./build/rcli ask "open Safari" # One-shot text command
./build/rcli ask "create a note called Ideas" # Triggers macOS action
./build/rcli actions # List all actions
./build/rcli bench # Run benchmarks (STT, LLM, TTS, E2E)
./build/rcli models # Manage LLM/STT/TTS models
./build/rcli rag ingest ~/Documents # Index docs for RAG
./build/rcli info # Show engine info
./build/rcli cleanup # Remove unused models

cd build && ./rcli_test

Testing is primarily done via CLI modes, the interactive TUI, and the built-in benchmark suite.
Mic → VAD → STT (Zipformer) → [RAG Retrieval] → LLM (Qwen3) → TTS (Piper) → Speaker
↓
Tool Calling → macOS Actions
src/
├── engines/ ML engine wrappers (stt, llm, tts, vad, embedding, model_profile)
├── pipeline/ Orchestrator, sentence detector, text sanitizer
├── rag/ Vector index (USearch HNSW), BM25, hybrid retriever, doc processor
├── core/ types.h, ring_buffer.h, memory_pool.h, hardware_profile.h, log.h
├── audio/ CoreAudio mic/speaker I/O, WAV file I/O
├── tools/ Tool calling engine with JSON schema definitions
├── bench/ Benchmark harness (STT, LLM, TTS, E2E, memory)
├── actions/ 43 macOS action implementations (AppleScript + shell)
├── api/ C API (rcli_api.h/.cpp) — the engine's public interface
├── cli/ main.cpp, TUI dashboard (FTXUI), model pickers, help, setup commands
├── models/ Model registries (LLM, TTS, STT) with on-demand download
└── test/ Pipeline test harness
- Orchestrator pattern: Central class owns all engines and coordinates data flow.
- Lock-free ring buffers: Zero-copy audio passing between STT/LLM/TTS threads.
- Pre-allocated memory pool: 64 MB arena allocated at init; no runtime malloc.
- Hardware profiling at startup: `hardware_profile.h` detects P-cores, E-cores, Metal GPU, RAM.
- Atomic pipeline state: `std::atomic<PipelineState>` (IDLE → LISTENING → PROCESSING → SPEAKING → INTERRUPTED).
- Sentence-level TTS scheduling: `SentenceDetector` accumulates LLM tokens and flushes complete sentences.
- Double-buffered TTS: TTS synthesizes the next sentence while the current one plays.
- System prompt KV caching: Reuses llama.cpp KV cache state across queries.
- Hybrid tool calling: Tier 1 keyword match + Tier 2 LLM-based extraction.
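
As an illustration of the sentence-level TTS scheduling pattern listed above, here is a minimal sketch of a token accumulator that flushes on sentence-ending punctuation. It is not the repository's `SentenceDetector`; the class name `SentenceBuffer` and its methods are hypothetical, and the boundary rule is deliberately naive (it would split on a token ending in "3." for example).

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical sketch, not RCLI's SentenceDetector.
class SentenceBuffer {
public:
    // Append one LLM token. Returns the accumulated sentence when the
    // buffer ends with '.', '!' or '?', otherwise std::nullopt.
    std::optional<std::string> push(std::string_view token) {
        buffer_.append(token);
        if (buffer_.empty()) return std::nullopt;
        const char last = buffer_.back();
        if (last != '.' && last != '!' && last != '?') return std::nullopt;
        return take();
    }

    // Return whatever is left when generation ends (may be empty).
    std::string flush() { return take(); }

private:
    // Hand out the trimmed buffer contents and reset the buffer.
    std::string take() {
        std::string out = trim(buffer_);
        buffer_.clear();
        return out;
    }

    // Strip leading and trailing whitespace.
    static std::string trim(const std::string& s) {
        const auto first = s.find_first_not_of(" \t\n");
        if (first == std::string::npos) return {};
        const auto last = s.find_last_not_of(" \t\n");
        return s.substr(first, last - first + 1);
    }

    std::string buffer_;
};
```

Each sentence returned by `push` could be handed to TTS immediately, so synthesis can start before the LLM has finished the full reply; `flush` covers a trailing fragment without closing punctuation.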
- STT thread: feeds mic audio, detects speech endpoints
- LLM thread: waits for STT output, generates tokens
- TTS thread: queues sentences from LLM, double-buffered playback
- Synchronization via `std::condition_variable` + mutex on pending text
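
The handoff between threads described above can be sketched as a small blocking queue guarded by a mutex and a condition variable. This is an assumption about the general shape, not the actual orchestrator code; `PendingText`, `push`, `close`, and `pop` are hypothetical names.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <utility>

// Hypothetical sketch; names are illustrative, not RCLI's.
class PendingText {
public:
    // Producer side (for example the LLM thread): publish one sentence.
    void push(std::string text) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(text));
        }
        ready_.notify_one();  // notify after releasing the lock
    }

    // Producer signals that no more text will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Consumer side (for example the TTS thread): block until text is
    // available. Returns std::nullopt once closed and fully drained.
    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate form guards against spurious wakeups.
        ready_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;
        std::string text = std::move(queue_.front());
        queue_.pop();
        return text;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    bool closed_ = false;
};
```

A consumer thread would typically loop with `while (auto sentence = pending.pop()) { ... }` and exit cleanly when the producer calls `close()`.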
Hybrid retrieval combines vector search (USearch HNSW) with BM25 full-text search, fused via Reciprocal Rank Fusion (RRF). The LRU embedding cache achieves a 99.9% hit rate. Total retrieval latency is ~4 ms over 5K+ chunks.
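
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over ranked lists of chunk IDs. The function name `rrf_fuse`, the use of string IDs, and the constant `k = 60` (a commonly used default) are assumptions for illustration; the repository's retriever may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch of Reciprocal Rank Fusion:
//   score(d) = sum over lists of 1 / (k + rank(d)), with rank starting at 1.
// Each inner vector is one retriever's result list, best match first.
std::vector<std::pair<std::string, double>> rrf_fuse(
    const std::vector<std::vector<std::string>>& rankings, double k = 60.0) {
    std::unordered_map<std::string, double> scores;
    for (const auto& ranking : rankings) {
        for (std::size_t i = 0; i < ranking.size(); ++i) {
            scores[ranking[i]] += 1.0 / (k + static_cast<double>(i + 1));
        }
    }

    // Sort by fused score, highest first; break ties by ID for determinism.
    std::vector<std::pair<std::string, double>> fused(scores.begin(), scores.end());
    std::sort(fused.begin(), fused.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    });
    return fused;
}
```

Because RRF uses only ranks, the vector-similarity and BM25 scores never need to be normalized against each other; a chunk that appears in both lists accumulates two contributions and tends to rise above chunks found by only one retriever.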
All vendored or CMake-fetched (no external package manager):
- llama.cpp: LLM + embedding inference with Metal GPU (`deps/llama.cpp`, `add_subdirectory`)
- sherpa-onnx: STT/TTS/VAD via ONNX Runtime (`deps/sherpa-onnx`, `add_subdirectory`)
- USearch v2.16.5: HNSW vector index (header-only, `FetchContent`)
- FTXUI v5.0.0: Terminal UI library (`FetchContent`)
- CoreAudio/Metal/Accelerate/IOKit: macOS system frameworks
| Model | Size | Purpose |
|---|---|---|
| Qwen3 0.6B Q4_K_M | 456 MB | Default LLM |
| Qwen3.5 0.8B / 2B / 4B | 600 MB–2.7 GB | Upgrade LLMs |
| Liquid LFM2 1.2B | 731 MB | Tool-calling LLM |
| Snowflake Arctic Embed S (Q8_0) | 34 MB | Text embeddings (RAG) |
| Zipformer (streaming) | ~50 MB | Live STT |
| Whisper base.en | ~140 MB | Offline STT |
| Piper en_US-amy-medium | ~60 MB | Default TTS |
| KittenTTS / Kokoro | 90 MB–500 MB | Upgrade TTS voices |
| Silero VAD | 0.6 MB | Voice Activity Detection |
Models are gitignored. Download via scripts/download_models.sh or rcli setup.