On-device LLM inference for Android using llama.cpp. Runs GGUF models locally with CPU, Vulkan, or OpenCL backends. Built for benchmarking inference performance across backends and quantizations on Snapdragon hardware.
While we have put a lot of work into it, this is still a demo and proof-of-concept project. Please be cautious about using it in production scenarios. The authors take no responsibility for any problems or issues caused by the app; however, you are welcome to report issues or submit PRs on GitHub.
- Language: Kotlin (app + JNI wrapper), C++ (native inference)
- Inference: llama.cpp (vendored in lib/src/main/cpp/llama-source/)
- Build: Gradle + CMake, Android NDK 29
- Min SDK: 35
- GPU backends: OpenCL, Vulkan (see Backends)
- Model format: GGUF
app/ Android app (UI, MainActivity)
lib/ JNI wrapper library
src/main/cpp/
ai_chat.cpp C++ JNI bridge into llama.cpp
CMakeLists.txt Native build config (backend flags)
src/main/java/com/arm/aichat/
InferenceEngine.kt Public Kotlin interface
internal/
InferenceEngineImpl.kt Singleton JNI wrapper
gguf/
GgufMetadataReader.kt Pure-Kotlin GGUF metadata parser
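To illustrate what a pure-Kotlin GGUF parser has to do, here is a minimal sketch that reads just the fixed-size GGUF header (magic, version, tensor count, metadata count). The function name is hypothetical; the real GgufMetadataReader.kt also walks the metadata key/value pairs that follow the header.

```kotlin
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Sketch: read the 24-byte GGUF header. All multi-byte fields in GGUF
// are little-endian.
fun readGgufHeader(file: File): Triple<Int, Long, Long> {
    file.inputStream().use { ins ->
        val buf = ByteArray(24)
        require(ins.read(buf) == 24) { "file too small to be GGUF" }
        val bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN)
        require(bb.int == 0x46554747) { "not a GGUF file" } // "GGUF" magic
        val version = bb.int        // format version (3 for current files)
        val tensorCount = bb.long
        val metadataKvCount = bb.long
        return Triple(version, tensorCount, metadataKvCount)
    }
}
```

Reading the header this way lets the app reject non-GGUF files and surface model info before handing the path to native code.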
Open in Android Studio (Hedgehog or later) or build from CLI:
./gradlew assembleDebug

Requires NDK 29 installed. Set the NDK path in local.properties if not auto-detected:
ndk.dir=/path/to/ndk/29.x.x
Configured via gradle.properties:
ENABLE_VULKAN=false
ENABLE_OPENCL=true

Only one backend should be enabled at a time. For a CPU-only build, disable both.
Status on Snapdragon 8 Elite (Adreno 830):
| Backend | Status |
|---|---|
| CPU | Working |
| OpenCL | Working — slower than CPU for token generation (bandwidth-limited at batch_size=1), faster for prefill |
| Vulkan | Compiles but produces incorrect output — known driver issue with GL_KHR_cooperative_matrix on Adreno 830 |
For OpenCL, libOpenCL.so must be present at app/src/main/jniLibs/arm64-v8a/libOpenCL.so. Pull it from your device:
adb pull /vendor/lib64/libOpenCL.so app/src/main/jniLibs/arm64-v8a/

- Build and install the APK on your device
- Download or copy a .gguf model onto your device; make sure its size and quantization suit your device's memory, otherwise the app may crash from running out of memory
- Launch PocketLlama; you can attach a debugger or listen with adb logcat for debugging (see below)
- Tap Select GGUF File and pick a local .gguf model (or download one from HuggingFace)
- Configure inference parameters such as offloaded GPU layers (0 = CPU only, max = all layers on GPU)
- Tap Load Model, then start chatting
To speed up reloads when changing configuration in the same run, models are copied to app-internal storage on first load and the cached copy is reused on subsequent loads.
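A minimal sketch of that copy-once pattern (function and directory names are hypothetical, not the app's actual code):

```kotlin
import java.io.File

// Copy a user-picked GGUF into app-internal storage once; reuse the
// cached copy on later loads if it exists and matches the source size.
fun cachedModelFile(source: File, cacheDir: File): File {
    val cached = File(cacheDir, source.name)
    if (cached.exists() && cached.length() == source.length()) return cached
    source.copyTo(cached, overwrite = true)
    return cached
}
```

The size check is a cheap heuristic for "same file"; a hash would be more robust but slower on multi-gigabyte models.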
Qwen3-4B quants from bartowski/Qwen_Qwen3-4B-GGUF:
- Q8_0 — highest quality, ~4 GB
- Q4_K_M — good balance, ~2.5 GB
- Q2_K — smallest, ~1.4 GB
| Parameter | Default | Description |
|---|---|---|
| GPU layers | 0 | Number of transformer layers offloaded to GPU |
| Temperature | 0.3 | Sampling temperature |
| Max reply tokens | 1024 | Maximum generated tokens per response |
| Batch size | 128 | Prompt processing batch size (n_ubatch) |
Changing GPU layers while a model is loaded requires a reload (you will be prompted to do so).
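The four parameters in the table map naturally onto a settings holder like the following (a hypothetical sketch; the real InferenceEngine API may differ). GPU layers requiring a reload matches llama.cpp's behavior, where n_gpu_layers is fixed when the model is loaded.

```kotlin
// Hypothetical parameter holder mirroring the settings screen defaults.
data class InferenceParams(
    val gpuLayers: Int = 0,        // 0 = CPU only; applied at model load
    val temperature: Float = 0.3f, // sampling temperature
    val maxReplyTokens: Int = 1024,
    val batchSize: Int = 128,      // passed to llama.cpp as n_ubatch
)
```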
After each response, tap the details link on any assistant message to see:
- Prefill time + tok/s (prompt processing speed)
- Token count + generation tok/s
- Total time + overall tok/s
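The throughput numbers above derive directly from token counts and wall-clock timings. An illustrative sketch (field names are assumptions, not the app's actual code):

```kotlin
// Per-message stats: tokens/second = tokens * 1000 / milliseconds.
data class GenStats(
    val prefillMs: Long, val promptTokens: Int,
    val genMs: Long, val genTokens: Int,
) {
    val prefillTokS get() = promptTokens * 1000.0 / prefillMs
    val genTokS get() = genTokens * 1000.0 / genMs
    val totalTokS get() =
        (promptTokens + genTokens) * 1000.0 / (prefillMs + genMs)
}
```

For example, a 100-token prompt prefilled in 500 ms runs at 200 tok/s, while 50 tokens generated in 2000 ms is 25 tok/s; this gap is why OpenCL can win on prefill yet lose on generation.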
All llama.cpp log output is tagged ai-chat:
adb logcat -s "ai-chat"

To confirm GPU offload is active:

adb logcat -s "ai-chat" | grep -E "n_gpu_layers|offload"

To check GPU memory allocation:
adb shell dumpsys gpu | grep Proc