RapidSpeech.cpp 🎙️

RapidSpeech.cpp is a high-performance, edge-native speech intelligence framework built on top of ggml.
It aims to provide pure C++, zero-dependency, on-device inference for large-scale ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) models.

🌟 Key Differentiators

While the open-source ecosystem already offers powerful cloud-side frameworks such as vLLM-omni, as well as mature on-device solutions like sherpa-onnx, RapidSpeech.cpp introduces a new generation of design choices focused on edge deployment.

1. vs. vLLM: Edge-first, not cloud-throughput-first

vLLM
- Designed for data centers and cloud environments
- Strongly coupled with Python and CUDA
- Maximizes GPU throughput via techniques such as PagedAttention

RapidSpeech.cpp
- Designed specifically for edge and on-device inference
- Optimized for low latency, low memory footprint, and lightweight deployment
- Runs on embedded devices, mobile platforms, laptops, and even NPU-only systems
- No Python runtime required

2. vs. sherpa-onnx: Deeper control over the inference stack

| Aspect | sherpa-onnx (ONNX Runtime) | RapidSpeech.cpp (ggml) |
| --- | --- | --- |
| Memory Management | Managed internally by ORT, relatively opaque | Zero runtime allocation: memory is fully planned during graph construction to avoid edge-side OOM |
| Quantization | Primarily INT8, limited support for ultra-low bit-width | Full K-Quants family (Q4_K / Q5_K / Q6_K), significantly reducing bandwidth and memory usage while preserving accuracy |
| GPU Performance | Relies on execution providers with operator mapping overhead | Native backends (`ggml-cuda`, `ggml-metal`) with speech-specific optimizations, outperforming generic `onnxruntime-gpu` |
| Deployment | Requires shared libraries and external config files | Single binary deployment: model weights and configs are fully encapsulated in GGUF |
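
To make the "zero runtime allocation" idea concrete, here is a minimal sketch of a planned bump arena: tensor sizes are summed in a planning pass, one buffer is allocated up front, and every later request is pointer arithmetic only. This is an illustration, not ggml's or RapidSpeech.cpp's actual allocator; `Arena`, `plan_size`, `align_up`, and the 32-byte rounding are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical rounding granularity; real backends choose their own.
constexpr std::size_t kAlign = 32;

// Round n up to the next multiple of kAlign.
inline std::size_t align_up(std::size_t n) {
    return (n + kAlign - 1) / kAlign * kAlign;
}

// Planning pass: total bytes needed for all tensors, computed before inference.
inline std::size_t plan_size(const std::vector<std::size_t>& tensor_bytes) {
    std::size_t total = 0;
    for (std::size_t b : tensor_bytes) total += align_up(b);
    return total;
}

// Bump arena: a single allocation in the constructor, none afterwards.
class Arena {
public:
    explicit Arena(std::size_t capacity) : buf_(capacity), used_(0) {}

    // Hands out the next slice, or nullptr if the plan would be exceeded.
    // The arena never grows, so peak memory is fixed at construction time.
    void* take(std::size_t bytes) {
        const std::size_t need = align_up(bytes);
        if (need > buf_.size() - used_) return nullptr;
        void* p = buf_.data() + used_;
        used_ += need;
        return p;
    }

    std::size_t used() const { return used_; }
    std::size_t capacity() const { return buf_.size(); }

private:
    std::vector<std::uint8_t> buf_;
    std::size_t used_;
};
```

Because the capacity comes from the planning pass, an out-of-memory condition shows up once, at load time, rather than in the middle of an inference call.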

📦 Model Support

Automatic Speech Recognition (ASR)

SenseVoice-small
FunASR-nano
Qwen3-ASR

Text-to-Speech (TTS)

CosyVoice3
Qwen3-TTS

🏗️ Architecture Overview

RapidSpeech.cpp is not just an inference wrapper — it is a full-featured speech application framework:

Core Engine: A ggml-based computation backend supporting mixed-precision inference from INT4 to FP32.
Architecture Layer: A plugin-style model construction and loading system, with planned support for FunASR-nano, CosyVoice, Qwen3-TTS, and more.
Business Logic Layer: Built-in ring buffers, VAD (voice activity detection), text frontend processing (e.g., phonemization), and multi-session management.
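
As an illustration of the ring buffers mentioned in the Business Logic Layer, the following is a minimal sketch of a fixed-capacity buffer for streaming audio samples. It is not the project's implementation; the class name `AudioRing`, its methods, and the overwrite-oldest policy are assumptions chosen for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fixed-capacity ring buffer for mono float samples (hypothetical class).
class AudioRing {
public:
    explicit AudioRing(std::size_t capacity)
        : data_(capacity), head_(0), size_(0) {}

    // Append n samples. When the buffer is full, the oldest samples are overwritten.
    void push(const float* src, std::size_t n) {
        if (data_.empty()) return;
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t tail = (head_ + size_) % data_.size();
            data_[tail] = src[i];
            if (size_ < data_.size()) {
                ++size_;
            } else {
                head_ = (head_ + 1) % data_.size();  // drop the oldest sample
            }
        }
    }

    // Copy up to n of the oldest samples into dst and remove them.
    // Returns the number of samples actually copied.
    std::size_t pop(float* dst, std::size_t n) {
        const std::size_t count = std::min(n, size_);
        for (std::size_t i = 0; i < count; ++i) {
            dst[i] = data_[(head_ + i) % data_.size()];
        }
        if (count > 0) head_ = (head_ + count) % data_.size();
        size_ -= count;
        return count;
    }

    std::size_t size() const { return size_; }

private:
    std::vector<float> data_;
    std::size_t head_;  // index of the oldest sample
    std::size_t size_;  // number of valid samples
};
```

The storage is allocated once in the constructor, so pushing audio from a capture callback and popping frames for VAD or feature extraction does not allocate. A real multi-threaded pipeline would additionally need synchronization, which this sketch omits.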

🚀 Core Features

Extreme Quantization: Native support for 4-bit, 5-bit, and 6-bit quantization schemes to match diverse hardware constraints.
Zero Dependencies: Implemented entirely in C/C++, producing a single lightweight binary.
GPU / NPU Acceleration: Customized CUDA and Metal backends optimized for speech models.
Unified Model Format: Both ASR and TTS models use an extended GGUF format.
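
To give a rough sense of what the 4-bit, 5-bit, and 6-bit schemes save, the sketch below estimates weight storage from ggml's K-quant super-block sizes (256 weights per block: 144 bytes for Q4_K, 176 for Q5_K, 210 for Q6_K). The helper names are hypothetical, and the estimate assumes every tensor is quantized with the same type, which real GGUF files usually do not do, so treat the results as a lower bound rather than an actual file size.

```cpp
#include <cstddef>
#include <cstdint>

// ggml K-quants pack weights into super-blocks of 256 values.
constexpr std::size_t kSuperBlock = 256;
constexpr std::size_t kQ4KBlockBytes = 144;  // 4.5 bits per weight
constexpr std::size_t kQ5KBlockBytes = 176;  // 5.5 bits per weight
constexpr std::size_t kQ6KBlockBytes = 210;  // 6.5625 bits per weight

// Bytes needed to store n_weights values, rounded up to whole super-blocks.
constexpr std::uint64_t quantized_bytes(std::uint64_t n_weights,
                                        std::size_t block_bytes) {
    return (n_weights + kSuperBlock - 1) / kSuperBlock * block_bytes;
}

// Baseline: 4 bytes per weight for FP32.
constexpr std::uint64_t fp32_bytes(std::uint64_t n_weights) {
    return n_weights * 4;
}
```

For a hypothetical model with 200,000,000 weights (a round number picked for the example, not a figure for any model listed here), `fp32_bytes` gives 800,000,000 bytes while `quantized_bytes(n, kQ4KBlockBytes)` gives 112,500,000 bytes, roughly a 7x reduction.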

🛠️ Quick Start (WIP)

Download Models

Models are available on:

🤗 Hugging Face: https://huggingface.co/lovemefan/RapidSpeech
ModelScope: https://www.modelscope.cn/models/lovemefan/RapidSpeech

Build & Run

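    # Clone with submodules: ggml is included as a git submodule
    git clone --recursive https://github.com/RapidAI/RapidSpeech.cpp
    cd RapidSpeech.cpp

    # Configure and build in Release mode
    # (CMAKE_BUILD_TYPE covers single-config generators, --config covers multi-config ones)
    cmake -B build -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release

    # Run offline ASR on a WAV file
    ./build/rs-asr-offline \
      -m /path/to/SenseVoice/sense-voice-small-fp32.gguf \
      -w /path/to/test_sample_rate_16k.wav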
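
The sample file name in the command above suggests 16 kHz input. Assuming the model does expect that rate (this section does not state it explicitly), a quick check of the WAV header before running inference can catch mismatched files early. The sketch below parses the `fmt ` chunk of a RIFF/WAVE byte buffer; `WavFormat`, `parse_wav_format`, and `is_16k` are hypothetical helpers and not part of RapidSpeech.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fields read from the "fmt " chunk (hypothetical struct for this example).
struct WavFormat {
    std::uint16_t channels = 0;
    std::uint32_t sample_rate = 0;
    std::uint16_t bits_per_sample = 0;
    bool valid = false;
};

// Read an n-byte little-endian unsigned integer (n <= 4).
inline std::uint32_t read_le(const std::uint8_t* p, int n) {
    std::uint32_t v = 0;
    for (int i = 0; i < n; ++i) v |= static_cast<std::uint32_t>(p[i]) << (8 * i);
    return v;
}

// Walk the RIFF chunks until "fmt " is found, then read its fields.
// Returns a WavFormat with valid == false if the buffer is not a usable WAVE header.
inline WavFormat parse_wav_format(const std::vector<std::uint8_t>& bytes) {
    WavFormat fmt;
    if (bytes.size() < 12) return fmt;
    if (std::memcmp(bytes.data(), "RIFF", 4) != 0) return fmt;
    if (std::memcmp(bytes.data() + 8, "WAVE", 4) != 0) return fmt;

    std::size_t pos = 12;
    while (pos + 8 <= bytes.size()) {
        const std::uint32_t chunk_size = read_le(&bytes[pos + 4], 4);
        if (std::memcmp(&bytes[pos], "fmt ", 4) == 0) {
            if (chunk_size < 16 || bytes.size() - (pos + 8) < 16) return fmt;
            const std::uint8_t* f = &bytes[pos + 8];
            fmt.channels = static_cast<std::uint16_t>(read_le(f + 2, 2));
            fmt.sample_rate = read_le(f + 4, 4);
            fmt.bits_per_sample = static_cast<std::uint16_t>(read_le(f + 14, 2));
            fmt.valid = true;
            return fmt;
        }
        // Skip this chunk; chunk data is padded to an even number of bytes.
        pos += 8 + static_cast<std::size_t>(chunk_size) + (chunk_size & 1u);
    }
    return fmt;
}

// True if the header was parsed and declares a 16 kHz sample rate.
inline bool is_16k(const WavFormat& f) {
    return f.valid && f.sample_rate == 16000;
}
```

In practice the bytes would come from reading the first part of the file given to `-w`. If the check fails, resampling the audio to 16 kHz with an external tool before running the command is the likely fix.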

🤝 Contributing

If you are interested in the following areas, we welcome your PRs or participation in discussions:

Adapting more models to the framework.
Refining and optimizing the project architecture.
