
PocketLlama

On-device LLM inference for Android using llama.cpp. Runs GGUF models locally with CPU, Vulkan, or OpenCL backends. Built for benchmarking inference performance across backends and quantizations on Snapdragon hardware.

Group project for Mobile Application Software Development, Tsinghua University, Spring 2026.

While we have put a lot of work into it, this is still a demo and proof-of-concept project. Please be cautious about using it in production scenarios. The authors take no responsibility for any problems or issues caused by the app; however, you are welcome to report issues or submit PRs on GitHub.


Stack

  • Language: Kotlin (app + JNI wrapper), C++ (native inference)
  • Inference: llama.cpp (vendored in lib/src/main/cpp/llama-source/)
  • Build: Gradle + CMake, Android NDK 29
  • Min SDK: 35
  • GPU backends: OpenCL, Vulkan (see Backends)
  • Model format: GGUF

Project Structure

app/                        Android app (UI, MainActivity)
lib/                        JNI wrapper library
  src/main/cpp/
    ai_chat.cpp             C++ JNI bridge into llama.cpp
    CMakeLists.txt          Native build config (backend flags)
  src/main/java/com/arm/aichat/
    InferenceEngine.kt      Public Kotlin interface
    internal/
      InferenceEngineImpl.kt  Singleton JNI wrapper
    gguf/
      GgufMetadataReader.kt   Pure-Kotlin GGUF metadata parser
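As a rough illustration of what a pure-Kotlin GGUF parser starts with, the sketch below checks the fixed 8-byte GGUF header: the ASCII magic "GGUF" followed by a little-endian u32 version. The function name is hypothetical; the actual `GgufMetadataReader.kt` goes on to parse the full key/value metadata section.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Minimal sketch (illustrative name, not the repo's API): validate the GGUF
// magic bytes and read the format version from the start of a file.
fun readGgufHeader(bytes: ByteArray): Pair<String, Int> {
    require(bytes.size >= 8) { "not enough bytes for a GGUF header" }
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val magic = ByteArray(4).also { buf.get(it) }.toString(Charsets.US_ASCII)
    require(magic == "GGUF") { "not a GGUF file (magic=$magic)" }
    val version = buf.int // GGUF stores the version as a little-endian u32
    return magic to version
}
```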

Building

Open in Android Studio (Hedgehog or later) or build from CLI:

./gradlew assembleDebug

Requires NDK 29 installed. Set the NDK path in local.properties if not auto-detected:

ndk.dir=/path/to/ndk/29.x.x

Backends

Configured via gradle.properties:

ENABLE_VULKAN=false
ENABLE_OPENCL=true

Only one should be enabled at a time. To use CPU-only, disable both.
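One plausible way such `gradle.properties` flags reach the native build is by forwarding them as CMake definitions; the fragment below is a sketch, not this repo's actual build script, and assumes llama.cpp's `GGML_VULKAN`/`GGML_OPENCL` CMake options.

```kotlin
// lib/build.gradle.kts (illustrative sketch only)
android {
    defaultConfig {
        externalNativeBuild {
            cmake {
                // Forward gradle.properties values to CMake as -D definitions
                arguments += listOf(
                    "-DGGML_VULKAN=${if (project.property("ENABLE_VULKAN") == "true") "ON" else "OFF"}",
                    "-DGGML_OPENCL=${if (project.property("ENABLE_OPENCL") == "true") "ON" else "OFF"}"
                )
            }
        }
    }
}
```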

Status on Snapdragon 8 Elite (Adreno 830):

| Backend | Status |
| ------- | ------ |
| CPU     | Working |
| OpenCL  | Working — slower than CPU for token generation (bandwidth-limited at batch_size=1), faster for prefill |
| Vulkan  | Compiles but produces incorrect output — known driver issue with GL_KHR_cooperative_matrix on Adreno 830 |

For OpenCL, libOpenCL.so must be present at app/src/main/jniLibs/arm64-v8a/libOpenCL.so. Pull it from your device:

adb pull /vendor/lib64/libOpenCL.so app/src/main/jniLibs/arm64-v8a/

Running

  1. Build and install the APK on your device
  2. Download or copy a .gguf model onto your device; make sure its size and quantization suit your device's memory, otherwise the app may crash from running out of RAM
  3. Launch PocketLlama; you can attach a debugger or watch adb logcat if needed for debugging (see below)
  4. Tap Select GGUF File and pick a local .gguf model (or download one from HuggingFace)
  5. Configure inference parameters such as offloaded GPU layers (0 = CPU only, max = all layers on GPU).
  6. Tap Load Model, then start chatting

To speed up reloads when changing configuration in the same run, models are copied to app-internal storage on first load and reused on subsequent loads.
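The copy-on-first-load caching described above might look roughly like this (a sketch with assumed names, not the app's actual code): copy the picked model into app-internal storage once, then return the cached copy on later loads.

```kotlin
import java.io.File

// Illustrative sketch: cache a model file in app-internal storage and
// reuse it when the same file (by name and size) is loaded again.
fun cacheModel(src: File, cacheDir: File): File {
    val cached = File(cacheDir, src.name)
    // Re-copy only if missing or the size differs (cheap staleness check)
    if (!cached.exists() || cached.length() != src.length()) {
        src.copyTo(cached, overwrite = true)
    }
    return cached
}
```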

Recommended models

Qwen3-4B quants from bartowski/Qwen_Qwen3-4B-GGUF:

  • Q8_0 — highest quality, ~4 GB
  • Q4_K_M — good balance, ~2.5 GB
  • Q2_K — smallest, ~1.4 GB

Inference Parameters

| Parameter        | Default | Description |
| ---------------- | ------- | ----------- |
| GPU layers       | 0       | Number of transformer layers offloaded to GPU |
| Temperature      | 0.3     | Sampling temperature |
| Max reply tokens | 1024    | Maximum generated tokens per response |
| Batch size       | 128     | Prompt processing batch size (n_ubatch) |

Changing GPU layers while a model is loaded requires a reload (you will be prompted to do so).
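The table above maps naturally onto a small settings type; the sketch below uses illustrative names and the documented defaults, not the app's actual API.

```kotlin
// Illustrative sketch of the UI-exposed inference parameters and their
// documented defaults (names are hypothetical).
data class InferenceParams(
    val gpuLayers: Int = 0,         // layers offloaded to GPU (0 = CPU only)
    val temperature: Float = 0.3f,  // sampling temperature
    val maxReplyTokens: Int = 1024, // cap on generated tokens per response
    val batchSize: Int = 128        // prompt processing batch size (n_ubatch)
)
```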


Generation Stats

After each response, tap the details link on any assistant message to see:

  • Prefill time + tok/s (prompt processing speed)
  • Token count + generation tok/s
  • Total time + overall tok/s
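The throughput figures above can be derived from raw token counts and wall-clock timings; a minimal sketch (illustrative names, not the app's code):

```kotlin
// Tokens per second from a token count and an elapsed time in milliseconds;
// guards against division by zero for instantaneous measurements.
fun tokensPerSecond(tokens: Int, millis: Long): Double =
    if (millis <= 0) 0.0 else tokens * 1000.0 / millis

// e.g. prefill tok/s:    tokensPerSecond(promptTokens, prefillMillis)
//      generation tok/s: tokensPerSecond(replyTokens, decodeMillis)
```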

Debugging

All llama.cpp log output is tagged ai-chat:

adb logcat -s "ai-chat"

To confirm GPU offload is active:

adb logcat -s "ai-chat" | grep -E "n_gpu_layers|offload"

To check GPU memory allocation:

adb shell dumpsys gpu | grep Proc