This repository was archived by the owner on Sep 28, 2025. It is now read-only.

2x speed-up via 1/2 hardware concurrency #1

@jpohhhh

TL;DR: use navigator.hardwareConcurrency / 2 in main-worker.js

I maintain two open-source Flutter libraries for cross-platform ML (all platforms: macOS, iOS, Android, Windows, Linux):
FONNX wraps the ONNX runtime.
FLLAMA wraps llama.cpp - except on web.

I saw your post in /r/localllama a few days ago (I'm refulgentis).
Today, I looked at the code: it is the first to run llama.cpp on WASM in many months, excellent work.

Also this week, I updated FLLAMA's llama.cpp version, and it had a really interesting issue on Android: it took 3 minutes to load a 3B model, where it used to take 15 seconds. The issue turned out to be setting the number of threads equal to the number of CPU cores. Simply changing it from 4 to 2 fixed everything and made it much faster during inference too.

After playing around with this project for an hour trying to speed it up, I realized the same trick worked.

It may seem hacky, but I recommend changing the use of navigator.hardwareConcurrency to navigator.hardwareConcurrency / 2:

  • I don't 100% understand why it helps so much, other than the general reasons (threads can get starved for data, etc.)
  • In my experience, it is also best practice for ML on web generally. Approximately all the ONNX web implementations I've seen do the same thing.
  • Part of me thinks it has something to do with a change in llama.cpp, because my slow Android load appeared sometime between llama.cpp commits ceebbb5b21b971941b2533210b74bf359981006c and 7930a8a6e89a04c77c51e3ae5dc1cd8e845b6b8f. But that is unlikely: the Android problem was an extremely slow model load, with inference speed that stayed the same.
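
To make the suggestion concrete, here is a minimal sketch of how the thread count could be derived. The helper name `pickThreadCount` and the variable `nThreads` are hypothetical and not taken from main-worker.js; the only real API used is the `navigator.hardwareConcurrency` property, which reports the number of logical cores.

```javascript
// Hypothetical helper (not from the project): halve the reported logical
// core count, but never go below one thread.
function pickThreadCount(hardwareConcurrency) {
  // The property may be missing in some environments, so fall back to 1.
  const cores =
    Number.isFinite(hardwareConcurrency) && hardwareConcurrency > 0
      ? hardwareConcurrency
      : 1;
  // Floor handles odd core counts; Math.max keeps single-core devices working.
  return Math.max(1, Math.floor(cores / 2));
}

// Read the property defensively so the sketch also runs where navigator is undefined.
const reportedCores =
  typeof navigator !== "undefined" ? navigator.hardwareConcurrency : undefined;
const nThreads = pickThreadCount(reportedCores);
console.log("threads to use:", nThreads);
```

On the 12-core machine benchmarked in this issue this would yield 6 threads. How `nThreads` is passed on to llama.cpp depends on the project's worker code, so that part is left out.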

Benchmarks (M2 Max/Ultra/whatever MacBook Pro), listed as number of threads: tokens per second

Phi2:
1: 1.2
6: 6.5 (1/2 hardware concurrency)
8: 7.4
12: 3.6 (hardware concurrency)

Mistral:
1: 13
6: 62 (1/2 hardware concurrency)
8: 65
12: 26 (hardware concurrency)
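
As a quick check on the "2x" in the title, the sketch below computes each thread count's speed-up relative to the 12-thread (full hardware concurrency) run, using only the numbers listed above. The names `benchmarks` and `speedupVsFull` are made up for this illustration.

```javascript
// Tokens per second by thread count, copied from the benchmark lists above.
const benchmarks = {
  Phi2: { 1: 1.2, 6: 6.5, 8: 7.4, 12: 3.6 },
  Mistral: { 1: 13, 6: 62, 8: 65, 12: 26 },
};

// Hypothetical helper: ratio of each run to the run at `fullThreads`.
function speedupVsFull(results, fullThreads) {
  const baseline = results[fullThreads];
  const ratios = {};
  for (const [threads, tokensPerSecond] of Object.entries(results)) {
    ratios[threads] = tokensPerSecond / baseline;
  }
  return ratios;
}

for (const [model, results] of Object.entries(benchmarks)) {
  const ratios = speedupVsFull(results, 12);
  console.log(
    `${model}: 6 threads ${ratios[6].toFixed(2)}x, 8 threads ${ratios[8].toFixed(2)}x`
  );
}
```

With these figures, 6 threads come out at roughly 1.8x (Phi2) and 2.4x (Mistral) over 12 threads, and 8 threads are slightly faster still in both cases.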
