Reduce peak memory of large half-precision random.uniform/normal #3439
dogukanveziroglu wants to merge 5 commits into ml-explore:main
Conversation
A new primitive that runs the entire uniform RNG pipeline (threefry
hash → fp32 normalize → clip → cast → affine) per-thread in registers
for half-precision GPU outputs. Avoids materializing the fp32
intermediate buffer that the standard bits()/divide()/astype() chain
requires; peak memory drops 3x → 1x of target.
Activation conditions (all required): half-precision dtype (bf16 or
fp16), even total output size, scalar low/high, single key (shape
{2}), GPU stream. Bit-exact with vanilla on the same seed; matches
the rbitsc kernel's interleaved counter layout.
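For concreteness, a minimal Python sketch of that activation predicate (the real gate is C++ inside mlx/random.cpp; the function name and argument types here are illustrative only):

```python
def fused_uniform_eligible(dtype, size, low, high, key_shape, on_gpu):
    # All five conditions from the commit message must hold.
    is_half = dtype in ("bfloat16", "float16")
    scalar_bounds = isinstance(low, (int, float)) and isinstance(high, (int, float))
    single_key = key_shape == (2,)
    return is_half and size % 2 == 0 and scalar_bounds and single_key and on_gpu
```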
Performance: 15.4x faster on (16384, 16384) bf16 (108 ms → 7 ms)
because the fp32 intermediate no longer transits L2/HBM. Small
shapes (<1 MB) pay a slight kernel-launch overhead — chunked path
threshold ensures the fast path only activates when the win
dominates.
CUDA mirror added in a follow-up commit (untested; algorithmic
transcription of the validated Metal kernel).
Splits large GPU random calls (output ≥ ~512 MB fp32-equivalent)
along axis 0 into K independent sub-key chunks, computes each via
the existing fp32-then-cast pipeline, and writes into a
pre-allocated output via slice_update with eval per chunk.
Per-chunk fp32 transients are freed between iterations; peak drops
from 3x to ~1+2/K of target (1.09x at K=33 on the canary shape).
Heuristic: K = ceil(fp32_bytes / 256 MB), clamped to [4, 256].
Profiled in path-c/19-K-isolation.md: theory matches measurement
within 5% at K ≥ 32; allocator overhead at small K (2-16) adds
17-30% but amortizes away.
Sub-key derivation via random::split is cryptographically
independent and seed-deterministic. The same seed produces the
same chunked output across runs, but the bit pattern differs from
vanilla (which uses one key for the whole shape). Same trade-off
class as PR ml-explore#904; statistical quality preserved
per-chunk (chunked unique-value count ≥ vanilla baseline).
Activation rule (all required): GPU stream, scalar lo/hi, single
key, fp32-equivalent output size ≥ 512 MB, axis-0 dim ≥ 4. Falls
back to the vanilla path for everything else (small shapes,
multi-key, broadcast bounds, CPU). normal() uses the same chunked
pipeline when the target dtype is bf16/fp16/fp32.
Resolves OOM on (46341, 46341) bf16 normal: vanilla aborts at
12.88 GB peak, chunked completes at 4.69 GB. Tolerates up to
~11 GB of concurrent allocations on M4 16 GB before swap kicks in
(path-c/21-active-ballast.md).
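A worked check of the heuristic and peak formula on the canary shape (Python sketch; pick_chunk_count is reconstructed here from the description above, not copied from the diff):

```python
import math

def pick_chunk_count(fp32_bytes, chunk_target=256 * 2**20):
    # K = ceil(fp32_bytes / 256 MB), clamped to [4, 256]
    return max(4, min(256, math.ceil(fp32_bytes / chunk_target)))

n = 46341 * 46341            # canary: (46341, 46341) bf16 normal
k = pick_chunk_count(4 * n)  # fp32 intermediate is 4 bytes/element -> K = 33
peak_ratio = 1 + 2 / k       # ~1.06x of target predicted; 1.09x measured
```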
CUDA mirror of the Metal RandomUniform kernel (same threefry
counter mapping, same per-thread fp32-then-cast in registers, same
output dtype templating). Marked untested in code: no NVIDIA
hardware on this branch's CI; algorithmic equivalence to the
validated Metal kernel verified by inspection.
TestRandomChunked: 8 tests targeting the chunked path (shapes
≥ 1 GB so chunking activates). Each test uses a 5σ/√N statistical
tolerance for distribution stats (not hand-tuned); a seed
reproducibility test confirms deterministic output; an
odd-first-dim test exercises chunk-remainder handling; a unique-bit
test asserts ≥ 2000 distinct bf16 values per million samples (the
PR ml-explore#2361 quality floor). Brings test_random.py coverage
from 14 to 22 tests; full pytest remains 696 passed / 4 skipped /
9283 subtests on M4.
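A minimal sketch of the 5σ/√N style of tolerance those tests use (the assumed shape of the assertion, not the literal test code):

```python
import math
import mlx.core as mx

def assert_uniform_mean(shape, seed=0, sigma=5.0):
    n = math.prod(shape)
    x = mx.random.uniform(shape=shape, dtype=mx.bfloat16, key=mx.random.key(seed))
    mean = x.astype(mx.float32).mean().item()
    # U(0, 1) has mean 1/2 and variance 1/12; the sample mean's std shrinks
    # by sqrt(n), so the tolerance is derived rather than hand-tuned.
    tol = sigma * math.sqrt(1 / 12) / math.sqrt(n)
    assert abs(mean - 0.5) < tol
```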
The chunked dispatch in mlx/random.cpp had two correctness gaps
discovered by an adversarial drawback sweep against vanilla:
1. fp32 chunking is strictly worse than vanilla. Vanilla fp32
uniform/normal already operate at ~1x output peak (the
intermediate IS the target dtype), so chunking adds K-fold
sub-key derivation + slice_update overhead with zero memory
benefit. Measured ~25% latency regression and ~25% higher
peak memory at 12K^2+ shapes. Restrict chunkable_dtype to
{bfloat16, float16}.
2. Both the fused RandomUniform primitive and the chunked path
are illegal inside mx.compile / mx.vmap / mx.grad: the fused
primitive throws on RandomUniform::vmap, and the chunked
path's per-chunk eval() is rejected by the tracer. Gate
both dispatches on !detail::in_tracing() so any transform
falls back to the vanilla pipeline (which uses RandomBits,
DEFINE_VMAP()-supported).
Headline canary unchanged: (46341, 46341) bf16 normal still
peaks at 4.7 GB on M4-16GB.
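The gate's visible effect, sketched at the Python level (illustrative; whether the fused path actually fires also depends on the dispatch conditions, and the outputs match because the fused path is bit-exact with vanilla):

```python
import mlx.core as mx

def sample(key):
    # 8192^2 bf16: eligible for the fused path, too small for chunking.
    return mx.random.uniform(shape=(8192, 8192), dtype=mx.bfloat16, key=key)

key = mx.random.key(0)
eager = sample(key)               # may take the fused GPU path
traced = mx.compile(sample)(key)  # tracer active -> vanilla RandomBits path
assert mx.array_equal(eager, traced).item()
```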
Pre-PR cleanup pass: remove internal investigation references
("Variant D1/D4", "Phase X", *.md filenames, drawback Phase
references) from comments, compress the chunked-path docstrings,
and tighten throw messages to drop implementation detail leakage.
Also:
- mlx/random.cpp: replace if-cascade clamps in pick_chunk_count
with std::clamp; mark chunked_fp32_then_cast and pick_chunk_count
static; drop the redundant inner key-shape check (single_key
already guarantees Shape{2}); inline single-use bool 'even'.
- mlx/backend/metal/kernels/random.metal: collapse the two-output
per-thread block to one expression each; drop the "Step 4"
reference.
- mlx/backend/metal/primitives.cpp: drop the 7-line debugging
postmortem about constant-buffer packing.
- .gitignore: drop the path-c-only .venv / python/mlx/lib entries.
No behavior change. 22/22 random tests + 708 full pytest pass;
canary (46341, 46341) bf16 normal still peaks at 4.69 GB.
Net diff: -69 lines.
@angeloskath could you give some thoughts please? Thank you.
zcbenz left a comment
Wouldn't mx.compile fuse the ops?
That was actually the first thing that I tried, but it gave me the same memory peak with 3x more latency. The problem is the two compile fusion regressions for
I think we should try to make
Ok, I will try to make
Hi guys, so I've been messing around with mlx.random for the last few days, initially just trying to figure out why mx.random.normal((46341, 46341), dtype=mx.bfloat16) was crashing on my M4 16GB. I realised it uses more memory than it's supposed to, so I tried making some changes to the calculation. Opening as a draft because I want your gut check before I polish anything further. I might be trying to climb a vertical flat wall, but I'm not sure :D. If what I made is dumb, please tell me and guide me; I would love to get some criticism. Heads up: I haven't tried it on an NVIDIA GPU yet, I will soon...
What
mx.random.normal((46341, 46341), dtype=mx.bfloat16) aborts with
"Insufficient Memory" on a 16 GB Apple-Silicon device. The standard
bits → divide → cast → minimum → mul → add chain keeps the fp32
intermediate alive alongside the half-precision output: about
12.88 GB of peak (~3x) for a 4.3 GB output.
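The arithmetic behind those numbers (decimal GB):

```python
n = 46341 * 46341           # 2,147,488,281 elements
out_bf16 = 2 * n / 1e9      # ~4.29 GB half-precision output
fp32_tmp = 4 * n / 1e9      # ~8.59 GB fp32 intermediate (2x the output)
peak = out_bf16 + fp32_tmp  # ~12.88 GB total, i.e. ~3x the output
```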
This PR adds two narrow GPU paths to fix it. The headline canary
goes from abort (12.88 GB) to success at 4.69 GB peak on
M4-16GB.
How
Two new dispatch paths in mlx/random.cpp:
1. Fused RandomUniform Metal kernel for half-precision uniform
when bounds are scalar, total size is even, and a single key is in
flight. Computes bits → divide → clip → cast → affine per thread
in registers, so no fp32 intermediate ever lands in global memory.
The output is bit-identical to vanilla, peak drops from ~3x to
1x, and it's 5–10× faster on large shapes.
2. Chunked path that splits the existing fp32 pipeline along
axis 0 into K independent sub-keys (K = ⌈bits_bytes / 256 MB⌉,
clamped to [4, 256]); see the sketch after this list. Triggers
when the fp32-equivalent size is ≥ 512 MB. Peak drops to about
(1 + 2/K) × output. Bytes differ from vanilla because sub-keys
differ; each chunk still does fp32-then-cast.
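A hypothetical Python-level sketch of path 2 (the real implementation is C++ in mlx/random.cpp; chunked_uniform, the even-split assumption, and the per-chunk eval placement are illustrative):

```python
import mlx.core as mx

def chunked_uniform(shape, dtype, key, k):
    out = mx.zeros(shape, dtype=dtype)
    keys = mx.random.split(key, k)
    rows = shape[0] // k  # remainder handling omitted for brevity
    for i in range(k):
        # Each chunk runs the vanilla fp32-then-cast pipeline on its own sub-key.
        chunk = mx.random.uniform(shape=(rows, *shape[1:]), key=keys[i])
        out[i * rows:(i + 1) * rows] = chunk.astype(dtype)
        mx.eval(out)  # frees the chunk's fp32 transient before the next iteration
    return out
```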
Both paths are skipped when detail::in_tracing() is true, so
anything inside vmap/compile/vjp/jvp falls back to the vanilla
pipeline. fp32 is excluded from the chunked path on purpose:
vanilla fp32 already runs at ~1x output peak (the intermediate IS
the target dtype), so chunking only adds overhead.
Dispatch summary
Numbers (M4-16GB, 30 reps, fresh subprocess)
Half-precision uniform (fused kernel):
Half-precision normal (chunked):
fp32 (deliberately unchanged):
Quality
- Fused uniform: bit-identical to vanilla (mean/var/min/max identical for 10 tested seeds at 16384²).
- Chunked: bits differ from vanilla but distribution stats match; inter-chunk Pearson correlation < 0.011 at all chunk boundaries; unique bf16 value count 2315 at 100K samples.
- Identical seeds, same peak memory, ~2% better steady-state tok/s.
Trade-offs
Real but small:
- Small shapes are slower because the fused kernel's launch overhead dominates. Crossover above ~1K² gives the 5–90× speedups. Easy to add a min-size guard if you'd prefer.
- The chunked path is +9–11% slower in exchange for the 42–60% peak cut.
- Odd total sizes fall back to the vanilla uniform path (the kernel emits two outputs per thread).
Tests
python/tests/test_random.py (8 new tests in TestRandomChunked with mathematically derived 5σ/√N tolerances)
Files
- mlx/random.cpp — dispatch + chunking helper
- mlx/primitives.{h,cpp} — RandomUniform class
- mlx/backend/metal/kernels/random.metal — runiformc<T> kernel
- mlx/backend/metal/primitives.cpp — RandomUniform::eval_gpu
- mlx/backend/cpu/primitives.cpp — RandomUniform::eval_cpu stub
- mlx/backend/cuda/random.cu — CUDA mirror
- python/tests/test_random.py — TestRandomChunked
Repro the canary
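The repro snippet itself (assuming a current MLX build where the peak-memory query is mx.get_peak_memory; older versions expose it as mx.metal.get_peak_memory):

```python
import mlx.core as mx

mx.random.seed(0)
x = mx.random.normal((46341, 46341), dtype=mx.bfloat16)
mx.eval(x)
# Vanilla: aborts with Insufficient Memory at ~12.88 GB peak on a 16 GB M4.
# With this PR: completes at ~4.69 GB peak.
print(f"peak: {mx.get_peak_memory() / 1e9:.2f} GB")
```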
Checklist
Put an x in the boxes that apply.
- [ ] I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes