Discussion: do we still need the LZ4 fork? Assembly-store compression + fast deployment — variables, benchmarks, open questions #11668

@simonrozsival

Background

dotnet/android carries a fork of LZ4, dotnet/lz4 (a fork of lz4/lz4), and also uses the K4os.Compression.LZ4 NuGet package on the build (host) side. LZ4 is used in two independent places:

  1. Assembly store compression — assemblies packaged inside the app are LZ4-compressed and decompressed by the runtime at app startup.
  2. Fast deployment ("fastdev") — during dotnet build -t:Install (Debug), changed assemblies are LZ4-compressed on the host, streamed to the device, and decompressed there.

This issue collects what we measured about both, lays out the variables that matter, and lists possible directions. It is intentionally open-ended — there is no proposed decision here. The two areas have different constraints and different possible paths, so they are split into two blocks below.

Where LZ4 lives today (for reference):

  • Assembly store, runtime decompress: src/native/clr/host/assembly-store.cc (CoreCLR) and src/native/mono/monodroid/embedded-assemblies.cc (Mono), via LZ4_decompress_safe from external/lz4, guarded by #if defined(HAVE_LZ4) && defined(RELEASE).
  • Assembly store, build-side compress: src/Xamarin.Android.Build.Tasks/Utilities/AssemblyCompression.cs (K4os.Compression.LZ4), gated by the AndroidEnableAssemblyCompression MSBuild property (default true).
  • Fastdev, device-side decompress: tools/fastdev/xamarin.sync/main.cc (external/lz4).
  • Fastdev, host-side compress: src/Xamarin.Android.Build.Debugging.Tasks/Tasks/FastDeploy.cs (K4os.Compression.LZ4, level L03_HC).

So both external/lz4 and K4os.Compression.LZ4 have two consumers each. Removing the dependency would require addressing both areas.

Measurement environment for everything below: a stock dotnet new maui app, .NET 11 preview 5, net11.0-android, CoreCLR runtime, android-arm64, no AOT/R2R. On-device numbers are from a low-end Samsung Galaxy A16 (SM-A165F). Single device, single app — directional, not definitive.


Block 1 — Assembly store compression

What it is

When AndroidEnableAssemblyCompression is true (the default), each managed assembly is LZ4-compressed (the XALZ block format) and packed into lib/<abi>/libassembly-store.so. At app startup the runtime decompresses each assembly into memory (LZ4_decompress_safe). With compression false, the assemblies are stored raw and can be used/mapped without a decompression step.

Variables that matter

  • android:extractNativeLibs — the single most important variable, and the source of a real annoyance (below).
  • Where the size is paid: inside the APK/AAB (download), and on-disk after install.
  • Startup CPU spent decompressing assemblies on every cold start.
  • RAM — LZ4'd assemblies are decompressed into memory; uncompressed assemblies can be demand-paged/mapped.
  • Decompression speed of the algorithm (directly affects startup).

The extractNativeLibs / Play Store annoyance

libassembly-store.so is a native library, so it is subject to android:extractNativeLibs:

  • With extractNativeLibs=true (dotnet/android's current default), the APK zip itself DEFLATE-compresses the store. So the APK's own compression already shrinks the assemblies — and pre-compressing them with LZ4 first is largely redundant (you are compressing already-compressed data).
  • With extractNativeLibs=false, native libs (including the assembly store) are stored uncompressed and page-aligned in the APK and mapped directly — no extraction, no second on-disk copy. Google Play sets this automatically for app bundles on API 26+. So in what users actually download/install, the APK's zip compression does not help the assembly store at all, and you also can't trust a locally-built .apk/.aab size because Play re-packages it.

This is the annoying part: in the real distribution config we cannot lean on the APK's own compression for the assembly store, so if we don't compress it ourselves, it ships uncompressed.
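One way to check which of the two configurations a given package actually ended up in is to look at how the store entry is recorded in the zip. The following is a minimal sketch using only Python's standard zipfile module; the function name and the suffix-based matching are illustrative assumptions, not part of any dotnet/android tooling.

```python
import zipfile

def native_lib_storage(apk, suffix="libassembly-store.so"):
    """Map each matching zip entry to (method, compressed_size, uncompressed_size).

    `apk` may be a path or a file-like object. Matching by suffix is an
    assumption so that both lib/<abi>/... and bundle-style paths are found.
    """
    methods = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflated"}
    result = {}
    with zipfile.ZipFile(apk) as zf:
        for info in zf.infolist():
            if info.filename.endswith(suffix):
                method = methods.get(info.compress_type, f"other({info.compress_type})")
                result[info.filename] = (method, info.compress_size, info.file_size)
    return result
```

Calling `native_lib_storage("app.apk")` on a local build shows "deflated" when the zip is compressing the store and "stored" when it is not. As noted above, a locally built package is not necessarily what Play delivers, so this only describes the file you hand it.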

Benchmarks

Size — extractNativeLibs=true (current default; APK zip DEFLATEs the store), android-arm64:

| | store raw .so | in the APK (DEFLATE) |
|---|---|---|
| LZ4 on | 7.64 MB | 6.10 MB |
| LZ4 off | 18.43 MB | 5.99 MB |

With the zip re-compressing, LZ4 is ~neutral (actually ~100 KB/ABI worse, because DEFLATE compresses raw assemblies better than already-LZ4'd bytes).

Size — extractNativeLibs=false (what Play delivers; store stored uncompressed), android-arm64:

| | store in the APK (Stored) | final signed APK |
|---|---|---|
| LZ4 on | 7.64 MB | 28.4 MB |
| LZ4 off | 18.43 MB | 38.9 MB |

Here LZ4 saves ~10.5 MB/ABI of download + on-disk size. That is the opposite conclusion from the extractNativeLibs=true config, which is exactly why the variable matters.

Startup — A16, CoreCLR, 20x interleaved cold starts (am start -W TotalTime, force-stop between launches):

| | mean | median | stdev |
|---|---|---|---|
| compression on | 1109 ms | 1090 ms | 44 ms |
| compression off | 1044 ms | 1031 ms | 32 ms |

Compression makes cold startup ~65 ms / ~6% slower (Mann-Whitney z = 4.33, p < 0.001). The per-launch decompression CPU isn't offset by smaller I/O when the page cache is warm.
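For anyone who wants to rerun the significance check on their own launch times, here is a minimal sketch of a Mann-Whitney z-score using the plain normal approximation. It is an assumption that this matches how the z = 4.33 above was computed; the sketch applies no tie or continuity correction, and the raw per-launch times are not included in this issue.

```python
import math

def mann_whitney_z(a, b):
    """z-score of the Mann-Whitney U statistic for sample `a` versus sample `b`.

    Normal approximation only; no tie correction of the variance and no
    continuity correction (both are simplifying assumptions of this sketch).
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of equal values and give each the average 1-based rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u_a - mean_u) / sd_u
```

With 20 launches per configuration, two completely separated samples give z of about 5.41, which is the ceiling for this sample size. A value of 4.33 therefore indicates strong but not total separation between the two sets of launch times.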

Decompression speed of candidate algorithms (A16, 139 MB of real assemblies, decompress + write):

| codec | time | throughput |
|---|---|---|
| LZ4 | 234 ms | ~593 MB/s |
| zlib/DEFLATE | 654 ms | ~212 MB/s |

LZ4 decompresses ~2.8x faster than zlib. This matters because the assembly store is decompressed at startup.
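The throughput figures follow directly from the size and time columns, and the same arithmetic gives a rough per-launch decompression budget. The sketch below assumes throughput is counted against uncompressed bytes and that decompression time scales linearly with store size; both are assumptions, and the 18.43 MB figure is the uncompressed store size from the size tables above.

```python
def throughput_mb_s(size_mb, time_ms):
    """Throughput in MB/s for `size_mb` processed in `time_ms`."""
    return size_mb / (time_ms / 1000.0)

def decompress_ms(size_mb, mb_per_s):
    """Estimated time in ms to decompress `size_mb` at `mb_per_s` (linear model)."""
    return size_mb / mb_per_s * 1000.0

if __name__ == "__main__":
    store_mb = 18.43  # uncompressed assembly store, android-arm64
    for codec, time_ms in (("LZ4", 234), ("zlib/DEFLATE", 654)):
        rate = throughput_mb_s(139, time_ms)
        print(f"{codec}: {rate:.0f} MB/s, ~{decompress_ms(store_mb, rate):.0f} ms for the store")
```

On these inputs the estimate is roughly 31 ms at the LZ4 rate and 87 ms at the zlib rate. The measured cold-start difference is ~65 ms, so the linear estimate does not account for everything and should be read as a back-of-envelope bound only.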

Possible directions (open)

  • Keep LZ4 store compression. Real-config size win (~10.5 MB/ABI) at a ~6% startup cost and the external/lz4 + K4os dependency.
  • Drop store compression. Faster startup, less RAM, but ~10.5 MB/ABI larger downloads in the real (extractNativeLibs=false) config.
  • Switch to a different algorithm that doesn't need our LZ4 fork — but only if it's similarly fast to decompress, so startup doesn't regress. System libz (zlib) is dependency-free on Android but ~2.8x slower to decompress (would worsen startup). Is there a fast-decompress option that's either system-provided or small enough to vendor without the fork (e.g. upstream lz4 directly, a minimal vendored decompressor, zstd, ...)?
  • Make the default depend on extractNativeLibs/packaging so we don't pay startup cost in configs where the size win doesn't materialize.

Open questions: what decompression throughput is "fast enough" to keep startup flat? Do we actually need the fork (dotnet/lz4), or just the algorithm? How much of the ~10.5 MB/ABI matters once split per-ABI in an AAB?


Block 2 — Fast deployment (fastdev)

What it is

On an incremental dotnet build -t:Install (Debug), only the assemblies the build determined changed are redeployed. For each such file, fastdev currently: LZ4-compresses it on the host (K4os, level L03_HC), then runs run-as <pkg> xamarin.sync <args> and streams the compressed bytes over the shell's stdin; the device-side xamarin.sync tool decompresses (external/lz4) and writes the file into the app's private files/.__override__/<abi>/ directory. Confirmed on net11 CoreCLR via live diagnostics (deploy.tool: xamarin.sync, deploy.supports.fastdev: True).

Importantly, this is one process spawn per file: FastDeploy.cs loops over the changed files and invokes run-as ... xamarin.sync ... once per file (FastDeploy.cs ~lines 730 -> 763 -> 792).

Variables that matter

  • File count and file size of the changed set (per-file spawn overhead vs per-byte transfer).
  • Per-file run-as spawn overhead (measured ~40 ms each on the A16).
  • Transfer channel throughput (the shell-stdin stream fastdev uses vs adb push).
  • Whether adb's own transfer compression (adb push -z) is used (algorithms: any/none/brotli/lz4/zstd; assume available on modern adb).
  • Host compression algorithm and level.
  • Per-file vs batched invocation.

Benchmarks

Per-file spawn overhead (A16, on-device, differential - excludes adb roundtrip):

| | per spawn |
|---|---|
| shell builtin (:) | ~0 ms (no fork) |
| /system/bin/true (fork+exec floor) | ~24 ms |
| run-as <pkg> true | ~43 ms |
| run-as <pkg> <tool> | ~38 ms |

fastdev pays ~40 ms per file just to spawn the tool through run-as (of which ~24 ms is the plain fork+exec floor), regardless of codec or payload.

Transfer throughput (A16, 139 MB):

| channel | throughput |
|---|---|
| shell-stdin stream (what fastdev uses) | ~22 MB/s |
| adb push -Z (compression off) | ~20 MB/s |
| adb push -z (compression on) | ~60 MB/s |

The shell-stdin channel is uncompressed and runs at roughly the same rate as a raw adb push (~20-22 MB/s). adb push -z gives ~3x throughput on real assemblies, for free, with no host CPU.
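The spawn and throughput numbers above can be combined into a very rough model of where streaming time goes. This is a sketch under simplifying assumptions: it ignores host compression, the adb roundtrip, and device-side decompression, and it treats the ~40 ms spawn and ~22 MB/s stream rate as constants. The function names are illustrative.

```python
def breakeven_file_mb(spawn_ms=40.0, mb_per_s=22.0):
    """File size (MB) at which per-file spawn time equals transfer time."""
    return spawn_ms / 1000.0 * mb_per_s

def spawn_share(n_files, total_mb, spawn_ms=40.0, mb_per_s=22.0):
    """Fraction of modelled streaming time spent on run-as spawns."""
    spawn_total = n_files * spawn_ms
    transfer_total = total_mb / mb_per_s * 1000.0
    return spawn_total / (spawn_total + transfer_total)

if __name__ == "__main__":
    print(f"break-even file size: {breakeven_file_mb():.2f} MB")
    for n_files, total_mb in ((50, 25.0), (5, 20.0)):
        print(f"{n_files} files / {total_mb} MB: spawn share {spawn_share(n_files, total_mb):.0%}")
```

Under this model a file smaller than about 0.88 MB costs more to spawn for than to transfer. For 50 x 512 KB the spawns alone are about 2000 ms against about 1136 ms of transfer, which is why batching matters more than the codec for many-small-file deploys.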

Host-side compress cost (40 MB of real assemblies):

| | time |
|---|---|
| LZ4 fast (L00) | 80 ms |
| LZ4 L03_HC (current fastdev) | ~400 ms |
| DEFLATE level 1 | 365 ms |
| DEFLATE level 6 | 1538 ms |

Compression level matters a lot to end-to-end time; the currently-used L03_HC is on the slow side, and DEFLATE-6 would be slower than not compressing at all on a fast link.
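The DEFLATE rows are easy to reproduce on any host with Python's standard zlib module. The sketch below is not the harness used for the table; the sample payload is a hypothetical stand-in, and real numbers should come from actual assemblies. LZ4 is not in the Python standard library, so only the DEFLATE levels are covered.

```python
import time
import zlib

def time_deflate(data, level):
    """Return (elapsed_ms, compressed_size) for zlib-compressing `data` at `level`."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Round-trip check, done outside the timed region.
    assert zlib.decompress(packed) == data
    return elapsed_ms, len(packed)

if __name__ == "__main__":
    # Hypothetical payload; replace with the bytes of real assemblies.
    sample = (b"hypothetical assembly bytes " * 64 + bytes(range(256))) * 2000
    for level in (1, 6):
        ms, size = time_deflate(sample, level)
        print(f"DEFLATE level {level}: {ms:.1f} ms, {size} of {len(sample)} bytes")
```

Timing a single call is noisy; taking the best of several runs per level, as the end-to-end benchmark below does, gives more stable comparisons.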

Device tool sizes (stripped):

| tool | size | dependency |
|---|---|---|
| C, LZ4 (bundles lz4.c) | ~65 KB | external/lz4 |
| C, zlib (links system libz.so) | ~6.6 KB | system libz (no bundle) |
| NativeAOT (C#, System.IO.Compression) | ~1.19 MB | none (self-contained) |

Binary size is negligible in absolute terms; we don't think it's a deciding factor. (For completeness: a NativeAOT reimplementation is ~180x the C/zlib binary and the managed-runtime floor alone (~786 KB) exceeds the entire stripped C tool, so NativeAOT doesn't look attractive on size; it also adds process-startup cost.)

End-to-end deploy benchmark (A16, best of 3, md5-verified). This is the one that captures everything we care about: host compress + upload + decompress + move the file into the app's filesystem. Four strategies, framed by external dependency:

  • S1 - host LZ4 + run-as stdin stream + xamarin.sync decompress->app fs (dep: external/lz4) - i.e. ~today's design.
  • S2 - host DEFLATE + run-as stdin stream + tool inflate via system libz->app fs (dep: system libz).
  • S3 - no host compression + adb push -z to a tmp location + one batched run-as cp into the app fs (dep: none).
  • S4 - no host compression + adb push -Z (raw) to tmp + batched run-as cp (dep: none).

| changed file set | S1 lz4-stream | S2 zlib-stream | S3 push -z | S4 push -Z |
|---|---|---|---|---|
| 1 x 256 KB | 171 ms | 182 ms | 370 ms | 253 ms |
| 20 x 128 KB (2.5 MB) | 2487 ms | 2118 ms | 832 ms | 875 ms |
| 5 x 4 MB (20 MB) | 829 ms | 1276 ms | 669 ms | 1386 ms |
| 20 x 2 MB (40 MB) | 2107 ms | 2832 ms | 1452 ms | 2839 ms |
| 50 x 512 KB (25 MB) | 4525 ms | 4616 ms | 2028 ms | 2848 ms |

Observations from the matrix:

  • The streaming strategies (S1/S2) scale poorly with file count because of the per-file run-as spawn (~40 ms each); adb push transfers the whole set in one invocation.
  • adb push -z (S3) vs raw adb push -Z (S4) is ~2x on larger payloads - adb's built-in transfer compression roughly halves transfer time, with no host CPU and no tool.
  • For a single tiny file, the batch adb push setup overhead makes S3/S4 slower than streaming - but that's a small absolute difference.
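  • LZ4 vs zlib in the streaming model (S1 vs S2) is close for small files, and when file count is high both are dominated by spawn overhead, not the codec. On the larger payloads S2 is ~35-55% slower (1276 vs 829 ms for 5 x 4 MB, 2832 vs 2107 ms for 20 x 2 MB).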
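The ratios behind these observations can be recomputed from the matrix. The snippet below only restates the table values and divides them; the dictionary layout and function name are illustrative.

```python
# End-to-end times in ms, copied from the matrix: (S1, S2, S3, S4).
MATRIX_MS = {
    "1 x 256 KB": (171, 182, 370, 253),
    "20 x 128 KB (2.5 MB)": (2487, 2118, 832, 875),
    "5 x 4 MB (20 MB)": (829, 1276, 669, 1386),
    "20 x 2 MB (40 MB)": (2107, 2832, 1452, 2839),
    "50 x 512 KB (25 MB)": (4525, 4616, 2028, 2848),
}

def ratios_vs_s3(matrix=MATRIX_MS):
    """Per row, return (S1/S3, S4/S3); values above 1 mean S3 is faster."""
    return {row: (s1 / s3, s4 / s3) for row, (s1, _s2, s3, s4) in matrix.items()}

if __name__ == "__main__":
    for row, (s1_s3, s4_s3) in ratios_vs_s3().items():
        print(f"{row:22s} S1/S3 = {s1_s3:.2f}x   S4/S3 = {s4_s3:.2f}x")
```

S1/S3 comes out around 3.0x for 20 x 128 KB and 0.46x for the single 256 KB file, and S4/S3 is about 2.1x for 5 x 4 MB, which matches the spawn-overhead and transfer-compression observations above.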

Possible directions (open)

  • Rebase fastdev transport on adb push (+ a batched run-as cp/mv) and let adb -z provide compression. This would remove the custom tool and the external/lz4/K4os fastdev dependency. The matrix suggests it's also faster for multi-file deploys (batching beats per-file spawn), though a single-tiny-file case is slightly slower.
  • Keep a streaming tool but batch it (one invocation handling all files) to remove the per-file spawn cost, and/or use system libz instead of bundling lz4.
  • Tune the current path cheaply: switch the host LZ4 level from L03_HC to a fast level, independent of any larger change.
  • NativeAOT reimplementation of the tool: measured as a size/startup regression here; doesn't look worth it.
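To make the first direction concrete, here is a minimal sketch of the command sequence an adb push based deploy might issue. It only builds argument lists and does not run anything. The staging directory, the function name, and the use of a target path relative to run-as's working directory are assumptions for illustration; this is not how FastDeploy.cs works today, and file names are assumed not to need shell quoting.

```python
import posixpath

def build_push_deploy(package, abi, local_files,
                      tmp_dir="/data/local/tmp/fastdev-staging"):  # hypothetical path
    """Return argv lists: create staging dir, one compressed push, one batched copy."""
    mkdir = ["adb", "shell", "mkdir", "-p", tmp_dir]
    # One adb invocation for the whole changed set; "-z any" lets adb pick the algorithm.
    push = ["adb", "push", "-z", "any", *local_files, tmp_dir + "/"]
    staged = [
        posixpath.join(tmp_dir, posixpath.basename(f.replace("\\", "/")))
        for f in local_files
    ]
    # Assumption: run-as starts in the app's data directory, so a relative
    # path reaches the override directory described above.
    target = f"files/.__override__/{abi}/"
    copy = ["adb", "shell", "run-as", package, "cp", *staged, target]
    return [mkdir, push, copy]
```

For example, `build_push_deploy("com.example.app", "arm64-v8a", ["bin/A.dll", "bin/B.dll"])` yields three commands regardless of how many files changed, which is the property that removes the per-file spawn cost.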

Open questions: is adb push -z reliably available across the device/adb versions we must support? How do these numbers look on faster hardware and over USB-3 vs the slower channel here? What does a realistic mixed incremental change (one large dll + a few small ones) look like end-to-end?


Cross-cutting

Both areas pull in external/lz4 (the fork) and K4os.Compression.LZ4. A recurring idea worth exploring for both: is there a compression choice that removes the fork/external-lz4 dependency while staying fast enough not to regress startup (store) or deploy time (fastdev)? System libz/zlib removes the dependency but is ~2.8x slower to decompress; that's likely fine for fastdev (transfer-bound) but a concern for the startup-sensitive assembly store. Other angles: using upstream lz4 directly rather than the fork, vendoring a minimal decompressor, or a different fast algorithm.

Caveats

All measurements come from a single low-end device (A16), CoreCLR only, a trivial app, and a single ABI. Cold starts were force-stop (warm page cache) launches rather than post-reboot cold-disk launches. The fastdev numbers come from a synthetic harness that reproduces the real pipeline (host compress + run-as stream / adb push + decompress + move into the app fs) rather than from instrumenting FastDeploy itself. Numbers are directional. Happy to share the harness and raw data.
