Discussion: do we still need the LZ4 fork? Assembly-store compression + fast deployment — variables, benchmarks, open questions

## Background

dotnet/android carries a fork of LZ4, [dotnet/lz4](https://github.com/dotnet/lz4) (a fork of [lz4/lz4](https://github.com/lz4/lz4)), and also uses the `K4os.Compression.LZ4` NuGet package on the build (host) side. LZ4 is used in **two independent** places:

1. **Assembly store compression** — assemblies packaged inside the app are LZ4-compressed and decompressed by the runtime at app startup.
2. **Fast deployment ("fastdev")** — during `dotnet build -t:Install` (Debug), changed assemblies are LZ4-compressed on the host, streamed to the device, and decompressed there.

This issue collects what we measured about both, lays out the variables that matter, and lists possible directions. **It is intentionally open-ended — there is no proposed decision here.** The two areas have different constraints and different possible paths, so they are split into two blocks below.

Where LZ4 lives today (for reference):

- Assembly store, runtime decompress: `src/native/clr/host/assembly-store.cc` (CoreCLR) and `src/native/mono/monodroid/embedded-assemblies.cc` (Mono), via `LZ4_decompress_safe` from `external/lz4`, guarded by `#if defined(HAVE_LZ4) && defined(RELEASE)`.
- Assembly store, build-side compress: `src/Xamarin.Android.Build.Tasks/Utilities/AssemblyCompression.cs` (`K4os.Compression.LZ4`), gated by the `AndroidEnableAssemblyCompression` MSBuild property (default **true**).
- Fastdev, device-side decompress: `tools/fastdev/xamarin.sync/main.cc` (`external/lz4`).
- Fastdev, host-side compress: `src/Xamarin.Android.Build.Debugging.Tasks/Tasks/FastDeploy.cs` (`K4os.Compression.LZ4`, level `L03_HC`).

So both `external/lz4` and `K4os.Compression.LZ4` have two consumers each. Removing the dependency would require addressing **both** areas.

> **Measurement environment for everything below:** a stock `dotnet new maui` app, .NET 11 preview 5, `net11.0-android`, CoreCLR runtime, `android-arm64`, no AOT/R2R. On-device numbers are from a low-end **Samsung Galaxy A16 (SM-A165F)**. Single device, single app — directional, not definitive.

---

## Block 1 — Assembly store compression

### What it is

When `AndroidEnableAssemblyCompression` is **true** (the default), each managed assembly is LZ4-compressed (the `XALZ` block format) and packed into `lib/<abi>/libassembly-store.so`. At app startup the runtime decompresses each assembly into memory (`LZ4_decompress_safe`). With compression **false**, the assemblies are stored raw and can be used/mapped without a decompression step.

### Variables that matter

- **`android:extractNativeLibs`** — the single most important variable, and the source of a real annoyance (below).
- **Where the size is paid:** inside the APK/AAB (download), and on-disk after install.
- **Startup CPU** spent decompressing assemblies on every cold start.
- **RAM** — LZ4'd assemblies are decompressed into memory; uncompressed assemblies can be demand-paged/mapped.
- **Decompression speed of the algorithm** (directly affects startup).

### The `extractNativeLibs` / Play Store annoyance

`libassembly-store.so` is a native library, so it is subject to `android:extractNativeLibs`:

- With **`extractNativeLibs=true`** (dotnet/android's current default), the APK zip itself **DEFLATE-compresses** the store. The APK's own compression therefore already shrinks the assemblies, and pre-compressing them with LZ4 first is largely redundant (DEFLATE ends up working on already-LZ4'd bytes, which it compresses slightly worse than the raw assemblies).
- With **`extractNativeLibs=false`**, native libs (including the assembly store) are stored **uncompressed and page-aligned** in the APK and mapped directly — no extraction, no second on-disk copy. **Google Play sets this automatically for app bundles on API 26+.** So in what users actually download/install, the APK's zip compression does **not** help the assembly store at all, and you also can't trust a locally-built `.apk`/`.aab` size because Play re-packages it.

This is the annoying part: in the real distribution config we **cannot** lean on the APK's own compression for the assembly store, so if we don't compress it ourselves, it ships uncompressed.

### Benchmarks

**Size — `extractNativeLibs=true` (current default; APK zip DEFLATEs the store), android-arm64:**

| store | raw `.so` | in the APK (DEFLATE) |
|---|---|---|
| LZ4 on | 7.64 MB | 6.10 MB |
| LZ4 off | 18.43 MB | 5.99 MB |

With the zip re-compressing, LZ4 is ~neutral (actually ~100 KB/ABI *worse*, because DEFLATE compresses raw assemblies better than already-LZ4'd bytes).

**Size — `extractNativeLibs=false` (what Play delivers; store stored uncompressed), android-arm64:**

| store | in the APK (Stored) | final signed APK |
|---|---|---|
| LZ4 on | 7.64 MB | 28.4 MB |
| LZ4 off | 18.43 MB | 38.9 MB |

Here LZ4 saves **~10.5 MB/ABI** of download + on-disk size. Opposite conclusion from the `true` config — which is exactly why the variable matters.

**Startup — A16, CoreCLR, 20x interleaved cold starts (`am start -W` TotalTime, force-stop between launches):**

| | mean | median | stdev |
|---|---|---|---|
| compression on | 1109 ms | 1090 | 44 |
| compression off | 1044 ms | 1031 | 32 |

Compression makes cold startup **~65 ms / ~6% slower** (Mann-Whitney z = 4.33, p < 0.001). The per-launch decompression CPU isn't offset by smaller I/O when the page cache is warm.
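For anyone who wants to repeat this on other hardware, here is a rough sketch of an interleaved cold-start loop. It is not the harness used for the numbers above. It assumes the two variants are installed side by side under different package IDs (hypothetical), and all function names are illustrative; only `am force-stop` and `am start -W` are taken from the method described in the table heading.

```python
import re
import statistics
import subprocess

TOTAL_TIME = re.compile(r"^\s*TotalTime:\s*(\d+)", re.MULTILINE)


def parse_total_time(am_output):
    """Extract TotalTime (ms) from the text printed by `am start -W`."""
    match = TOTAL_TIME.search(am_output)
    if match is None:
        raise ValueError("no TotalTime line in am output")
    return int(match.group(1))


def cold_start_ms(package, activity):
    """Force-stop the app, launch it again, and return TotalTime in ms."""
    subprocess.run(["adb", "shell", "am", "force-stop", package], check=True)
    out = subprocess.run(
        ["adb", "shell", "am", "start", "-W", "-n", f"{package}/{activity}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_total_time(out)


def interleaved_runs(variants, rounds=20):
    """variants: {label: (package, activity)}. Alternates variants every round
    so that thermal or background drift affects both roughly equally."""
    samples = {label: [] for label in variants}
    for _ in range(rounds):
        for label, (package, activity) in variants.items():
            samples[label].append(cold_start_ms(package, activity))
    return samples


def summarize(samples):
    """Return {label: (mean, median, sample stdev)} in ms."""
    return {
        label: (statistics.mean(v), statistics.median(v), statistics.stdev(v))
        for label, v in samples.items()
    }
```

Note that force-stop launches like these keep the page cache warm, so they measure the same kind of "cold start" as the table, not a post-reboot start.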

**Decompression speed of candidate algorithms (A16, 139 MB of real assemblies, decompress + write):**

| codec | time | throughput |
|---|---|---|
| LZ4 | 234 ms | ~593 MB/s |
| zlib/DEFLATE | 654 ms | ~212 MB/s |

LZ4 decompresses ~2.8x faster than zlib. This matters because the assembly store is decompressed at **startup**.
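The arithmetic behind the table, plus a back-of-the-envelope projection onto the 18.43 MB raw store, can be sketched as follows. The projection is an assumption, not a measurement: it treats throughput as linear in size, and the benchmark above also included writing the output, which startup decompression into memory does not.

```python
CORPUS_MB = 139                               # size of the benchmark corpus
TIMES_MS = {"LZ4": 234, "zlib/DEFLATE": 654}  # measured decompress + write
STORE_RAW_MB = 18.43                          # raw store size from the size tables


def throughput_mb_s(size_mb, elapsed_ms):
    """Throughput in MB/s for size_mb processed in elapsed_ms."""
    return size_mb / (elapsed_ms / 1000.0)


def projected_ms(size_mb, mb_per_s):
    """Estimated time in ms, assuming throughput scales linearly with size."""
    return size_mb / mb_per_s * 1000.0


if __name__ == "__main__":
    for codec, ms in TIMES_MS.items():
        rate = throughput_mb_s(CORPUS_MB, ms)
        est = projected_ms(STORE_RAW_MB, rate)
        print(f"{codec}: {rate:.0f} MB/s, roughly {est:.0f} ms for the store")
    ratio = TIMES_MS["zlib/DEFLATE"] / TIMES_MS["LZ4"]
    print(f"zlib takes {ratio:.1f}x as long as LZ4")
```

Under those assumptions, swapping LZ4 for zlib would add on the order of tens of milliseconds per cold start for this app, which is the same order of magnitude as the measured cost of compression itself.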

### Possible directions (open)

- **Keep LZ4 store compression.** Real-config size win (~10.5 MB/ABI) at a ~6% startup cost and the `external/lz4` + `K4os` dependency.
- **Drop store compression.** Faster startup (~65 ms here) and potentially less RAM (not measured in this issue), but ~10.5 MB/ABI larger downloads in the real (`extractNativeLibs=false`) config.
- **Switch to a different algorithm that doesn't need our LZ4 fork — but only if it's similarly fast to decompress, so startup doesn't regress.** System `libz` (zlib) is dependency-free on Android but ~2.8x slower to decompress (would worsen startup). Is there a fast-decompress option that's either system-provided or small enough to vendor without the fork (e.g. upstream lz4 directly, a minimal vendored decompressor, zstd, ...)?
- **Make the default depend on `extractNativeLibs`/packaging** so we don't pay startup cost in configs where the size win doesn't materialize.

Open questions: what decompression throughput is "fast enough" to keep startup flat? Do we actually need the *fork* (`dotnet/lz4`), or just the algorithm? How much of the ~10.5 MB/ABI matters once split per-ABI in an AAB?

---

## Block 2 — Fast deployment (fastdev)

### What it is

On an incremental `dotnet build -t:Install` (Debug), only the assemblies the build determined changed are redeployed. For each such file, fastdev currently: LZ4-compresses it on the host (`K4os`, level `L03_HC`), then runs `run-as <pkg> xamarin.sync <args>` and streams the compressed bytes over the shell's stdin; the device-side `xamarin.sync` tool decompresses (`external/lz4`) and writes the file into the app's private `files/.__override__/<abi>/` directory. Confirmed on net11 CoreCLR via live diagnostics (`deploy.tool: xamarin.sync`, `deploy.supports.fastdev: True`).

Importantly, this is **one process spawn per file**: `FastDeploy.cs` loops over the changed files and invokes `run-as ... xamarin.sync ...` once per file (`FastDeploy.cs` ~lines 730 -> 763 -> 792).

### Variables that matter

- **File count and file size** of the changed set (per-file spawn overhead vs per-byte transfer).
- **Per-file `run-as` spawn overhead** (measured ~40 ms each on the A16).
- **Transfer channel throughput** (the shell-stdin stream fastdev uses vs `adb push`).
- **Whether adb's own transfer compression (`adb push -z`) is used** (algorithms: any/none/brotli/lz4/zstd; assume available on modern adb).
- **Host compression algorithm and level.**
- **Per-file vs batched** invocation.

### Benchmarks

**Per-file spawn overhead (A16, on-device, differential - excludes adb roundtrip):**

| | per spawn |
|---|---|
| shell builtin (`:`) | ~0 ms (no fork) |
| `/system/bin/true` (fork+exec floor) | ~24 ms |
| `run-as <pkg> true` | ~43 ms |
| `run-as <pkg> <tool>` | ~38 ms |

fastdev pays ~40 ms of `run-as` overhead **per file**, regardless of codec or payload.

**Transfer throughput (A16, 139 MB):**

| channel | throughput |
|---|---|
| shell-stdin stream (what fastdev uses) | ~22 MB/s |
| `adb push -Z` (compression off) | ~20 MB/s |
| `adb push -z` (compression on) | ~60 MB/s |

The shell-stdin channel is uncompressed (~22 MB/s, about the same as raw `adb push -Z`). `adb push -z` gives ~3x throughput on real assemblies, with no compression code of our own on the host.

**Host-side compress cost (40 MB of real assemblies):**

| | time |
|---|---|
| LZ4 fast (`L00`) | 80 ms |
| LZ4 `L03_HC` (current fastdev) | ~400 ms |
| DEFLATE level 1 | 365 ms |
| DEFLATE level 6 | 1538 ms |

Compression level matters a lot to end-to-end time; the currently-used `L03_HC` is on the slow side, and DEFLATE-6 would be slower than not compressing at all on a fast link.

**Device tool sizes (stripped):**

| tool | size | dependency |
|---|---|---|
| C, LZ4 (bundles `lz4.c`) | ~65 KB | `external/lz4` |
| C, zlib (links system `libz.so`) | ~6.6 KB | system libz (no bundle) |
| NativeAOT (C#, `System.IO.Compression`) | ~1.19 MB | none (self-contained) |

Binary size is negligible in absolute terms; we don't think it's a deciding factor. (For completeness: a NativeAOT reimplementation is ~180x the C/zlib binary and the managed-runtime floor alone (~786 KB) exceeds the entire stripped C tool, so NativeAOT doesn't look attractive on size; it also adds process-startup cost.)

**End-to-end deploy benchmark (A16, best of 3, md5-verified).** This is the one that captures everything we care about: *host compress + upload + decompress + move the file into the app's filesystem.* Four strategies, framed by external dependency:

- **S1** - host LZ4 + `run-as` stdin stream + `xamarin.sync` decompress->app fs (dep: `external/lz4`) - i.e. ~today's design.
- **S2** - host DEFLATE + `run-as` stdin stream + tool inflate via system `libz`->app fs (dep: system libz).
- **S3** - no host compression + `adb push -z` to a tmp location + one batched `run-as cp` into the app fs (dep: none).
- **S4** - no host compression + `adb push -Z` (raw) to tmp + batched `run-as cp` (dep: none).

| changed file set | S1 lz4-stream | S2 zlib-stream | S3 `push -z` | S4 `push -Z` |
|---|---|---|---|---|
| 1 x 256 KB | 171 ms | 182 ms | 370 ms | 253 ms |
| 20 x 128 KB (2.5 MB) | 2487 ms | 2118 ms | 832 ms | 875 ms |
| 5 x 4 MB (20 MB) | 829 ms | 1276 ms | 669 ms | 1386 ms |
| 20 x 2 MB (40 MB) | 2107 ms | 2832 ms | 1452 ms | 2839 ms |
| 50 x 512 KB (25 MB) | 4525 ms | 4616 ms | 2028 ms | 2848 ms |

Observations from the matrix:

- The streaming strategies (S1/S2) scale poorly with **file count** because of the per-file `run-as` spawn (~40 ms each); `adb push` transfers the whole set in one invocation.
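- `adb push -z` (S3) vs raw `adb push -Z` (S4) is ~2x on the larger payloads (5 x 4 MB, 20 x 2 MB), ~1.4x on 50 x 512 KB, and roughly neutral on 20 x 128 KB. adb's built-in transfer compression roughly halves transfer time when bytes dominate, with no compression step of our own and no device-side tool.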
- For a single tiny file, the batch `adb push` setup overhead makes S3/S4 slower than streaming - but that's a small absolute difference.
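- LZ4 vs zlib in the streaming model (S1 vs S2) is close when file count is high, where both are dominated by per-file overhead rather than the codec. With a few large files LZ4 is clearly ahead (829 vs 1276 ms for 5 x 4 MB, 2107 vs 2832 ms for 20 x 2 MB).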
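To make the file-count effect concrete, the matrix can be reduced to a per-file cost and an on-device spawn share. This is a small sketch over the numbers already in the table. The ~40 ms figure covers only the on-device `run-as` spawn, so the share computed here is a lower bound on per-file fixed cost, because adb round-trips are excluded from that measurement.

```python
SPAWN_MS = 40  # measured on-device run-as overhead per file (excludes adb round-trip)

# label -> (file count, end-to-end ms per strategy), copied from the matrix
MATRIX = {
    "1 x 256 KB":  (1,  {"S1": 171,  "S2": 182,  "S3": 370,  "S4": 253}),
    "20 x 128 KB": (20, {"S1": 2487, "S2": 2118, "S3": 832,  "S4": 875}),
    "5 x 4 MB":    (5,  {"S1": 829,  "S2": 1276, "S3": 669,  "S4": 1386}),
    "20 x 2 MB":   (20, {"S1": 2107, "S2": 2832, "S3": 1452, "S4": 2839}),
    "50 x 512 KB": (50, {"S1": 4525, "S2": 4616, "S3": 2028, "S4": 2848}),
}


def per_file_ms(total_ms, n_files):
    """Average end-to-end time per deployed file."""
    return total_ms / n_files


def spawn_share(total_ms, n_files, spawn_ms=SPAWN_MS):
    """Fraction of the total explained by on-device run-as spawns alone."""
    return n_files * spawn_ms / total_ms


if __name__ == "__main__":
    for label, (n, times) in MATRIX.items():
        s1, s3 = times["S1"], times["S3"]
        print(f"{label}: S1 {per_file_ms(s1, n):.1f} ms/file "
              f"(spawn share {spawn_share(s1, n):.0%}), "
              f"S3 {per_file_ms(s3, n):.1f} ms/file")
```

For the small-file rows the S1 per-file average stays well above the 40 ms spawn cost even though the payload per file is tiny, which suggests the total per-file fixed cost (spawn plus adb round-trip) is what batching would actually remove.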

### Possible directions (open)

- **Rebase fastdev transport on `adb push` (+ a batched `run-as cp/mv`)** and let `adb push -z` provide compression. This would remove the custom tool and the `external/lz4`/`K4os` fastdev dependency. The matrix suggests it's also faster for multi-file deploys (batching beats per-file spawn), though a single-tiny-file case is slightly slower.
- **Keep a streaming tool but batch it** (one invocation handling all files) to remove the per-file spawn cost, and/or **use system `libz`** instead of bundling lz4.
- **Tune the current path cheaply:** switch the host LZ4 level from `L03_HC` to a fast level, independent of any larger change.
- **NativeAOT reimplementation** of the tool: measured as a size/startup regression here; doesn't look worth it.
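As an illustration of the first direction, the batched `adb push` flow could look roughly like the command builder below. This is a sketch, not a proposal for how `FastDeploy` should be written. The staging directory under `/data/local/tmp` and the function name are hypothetical, and it assumes the target `files/.__override__/<abi>/` directory already exists and that `run-as` can read the staging location on the devices we care about.

```python
import posixpath
from pathlib import PurePath


def batched_deploy_commands(local_files, package, abi,
                            staging="/data/local/tmp/fastdev-staging",
                            compress=True):
    """Build the adb command lines for one batched deploy (nothing is executed).

    staging is a hypothetical device-side scratch directory.
    """
    names = [PurePath(f).name for f in local_files]
    staged = [posixpath.join(staging, name) for name in names]
    target = f"files/.__override__/{abi}/"

    mkdir = ["adb", "shell", "mkdir", "-p", staging]
    # -z ALGORITHM enables adb transfer compression; -Z disables it.
    flags = ["-z", "any"] if compress else ["-Z"]
    push = ["adb", "push", *flags, *[str(f) for f in local_files], staging + "/"]
    # One run-as invocation for the whole set instead of one per file.
    copy = ["adb", "shell", "run-as", package, "cp", *staged, target]
    return [mkdir, push, copy]


if __name__ == "__main__":
    for cmd in batched_deploy_commands(["bin/App.dll", "bin/Lib.dll"],
                                       "com.example.app", "arm64-v8a"):
        print(" ".join(cmd))
```

The point of the shape is that the number of process spawns is constant (three here) regardless of how many files changed, which is where the multi-file rows of the matrix gain their advantage.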

Open questions: is `adb push -z` reliably available across the device/adb versions we must support? How do these numbers look on faster hardware and over USB-3 vs the slower channel here? What does a realistic *mixed* incremental change (one large dll + a few small ones) look like end-to-end?

---

## Cross-cutting

Both areas pull in `external/lz4` (the fork) and `K4os.Compression.LZ4`. A recurring idea worth exploring for both: **is there a compression choice that removes the fork/external-lz4 dependency while staying fast enough not to regress startup (store) or deploy time (fastdev)?** System `libz`/zlib removes the dependency but is ~2.8x slower to decompress; that's likely fine for fastdev (transfer-bound) but a concern for the startup-sensitive assembly store. Other angles: using upstream lz4 directly rather than the fork, vendoring a minimal decompressor, or a different fast algorithm.

### Caveats

Low-end single device (A16), CoreCLR only, a trivial app, single-ABI measurements, force-stop (warm-cache) cold starts rather than post-reboot cold-disk, and the fastdev numbers are from a synthetic harness that reproduces the real pipeline (host compress + `run-as` stream / `adb push` + decompress + move into the app fs) rather than from instrumenting `FastDeploy` itself. Numbers are directional. Happy to share the harness and raw data.

