Performance & Memory

projectMM's per-step performance contracts live in the scenario JSONs — each test/scenarios/*.json step carries a per-target contract block (tick_us ceiling + free_heap floor) and an observed block (the latest reading per target). The scenarios are the source of truth and the assertion surface: every PR runs against them. See testing.md § Performance contracts for the contract semantics and renegotiation workflow. The headline numbers users care about are in README.md § Performance.

This document holds what scenarios can't carry: structural sizes (sizeof), build-variant deltas, and the WiFi/Ethernet physics that explain why a contract comes out where it does.

Desktop (64-bit)

Desktop ArtNet sends to a non-existent IP so packets complete instantly; freeHeap returns 0 (unlimited). Per-step tick budgets live in per-host contract.pc-<os> blocks across the scenarios — pc-macos for macOS arm64, pc-windows for Windows x64, pc-linux for Linux. The sizeof and dynamic-memory numbers below apply to all 64-bit desktop targets; tick numbers differ by host CPU and live in the scenario contracts.

sizeof (desktop, 64-bit)

Class	sizeof (bytes)
MoonModule	120
Layer	208
Drivers	408
GridLayout	128
SystemModule	368
NetworkModule	336
HttpServerModule	168

Drivers grew from 120 → 408 with the per-driver Correction stage (256-entry brightness LUT + channel-order table). Other classes grew ~16-32 bytes each as MoonModule itself grew (rolling-range observed slot + wired-by-code flag + per-child tickChildren accounting fields).

Binary sizes:

Target	Size	Build
macOS arm64	358 KB	debug-arm64 (release-strip is smaller)
Windows x64	432 KB	MSVC Release, static CRT

Memory at 128×128 with mirror

Module	dynamicBytes	Breakdown
Layer	92 KB	12 KB buffer + 80 KB LUT (uint32_t indices on 64-bit)
Drivers	48 KB	output buffer (128×128×3)

ESP32 — Olimex Gateway Rev G (no PSRAM, 320 KB internal)

Per-step tick/heap live in contract.esp32-eth-wifi and contract.esp32-eth across the scenarios; see the README perf table for the headline grid×board matrix. The notes below cover what those rows don't.

Run-to-run variance

Individual measurements vary ~5–10% on the Olimex board with no configuration change — inherent ESP32/Ethernet timing jitter (lwIP tcpip_thread scheduling, EMAC DMA, Ethernet ACK pacing). Scenarios use 10% default ESP32 tolerance to absorb this; when a step trips, re-run before treating it as a real regression. The collect_kpi.py --commit gate parses a single tick: line from esp32/monitor.log and can flag an unlucky sample — same rule applies.
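As a rough illustration of how a tolerance absorbs this jitter, the sketch below checks one observed reading against a contract. The field names tick_us and free_heap come from the contract description above; the function name, the dictionary layout, and the way the tolerance is applied (widening the tick ceiling and lowering the heap floor by the same fraction) are assumptions, not the scenario runner's actual logic.

```python
# Hypothetical contract check; the real scenario runner may apply tolerance differently.
def within_contract(observed, contract, tolerance=0.10):
    """Return True when an observed reading satisfies the contract.

    Assumption: the tolerance widens the tick_us ceiling and lowers the
    free_heap floor by the same fraction.
    """
    tick_ok = observed["tick_us"] <= contract["tick_us"] * (1 + tolerance)
    heap_ok = observed["free_heap"] >= contract["free_heap"] * (1 - tolerance)
    return tick_ok and heap_ok


# Illustrative numbers only, not taken from a real scenario file.
contract = {"tick_us": 50_000, "free_heap": 100_000}
print(within_contract({"tick_us": 53_000, "free_heap": 104_000}, contract))  # 6% over the ceiling
print(within_contract({"tick_us": 57_000, "free_heap": 104_000}, contract))  # 14% over the ceiling
```

Under these assumptions a 6% overshoot sits inside the jitter band, while a 14% overshoot trips the step and is worth a re-run before it is called a regression.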

All-effects sweep (every effect, no modifier, Ethernet + ArtNet)

This Olimex sweep ran each effect alone over a Layer (no modifier) at four square grids, through the real ArtNet + Preview drivers — the per-effect cost of the same pipeline the README's single-effect headline row measures. Numbers are from a live run; apply the ~5–10% variance above. (The per-effect sweep merged into the light/heavy bracket of scenario_perf_full; this table is the archived Olimex run.)

FPS (= 1,000,000 / tick µs):

Effect	16²	32²	64²	128²
Lines	12658	7633	2304	23
Rainbow	3831	968	143	22
Noise	1117	324	71	17
Plasma	3194	829	135	18
PlasmaPalette	6024	1733	267	21
Metaballs	2016	521	102	18
Fire	2762	784	159	21
Particles	4716	1848	424	30
GlowParticles	1706	586	128	14
Checkerboard	8474	2617	397	21
Spiral	2403	571	87	15
Rings	1118	284	45	12
LavaLamp	3030	756	113	18
GameOfLife	6802	1519	226	13

At 128² nearly every effect converges to ~12–23 FPS: the board is ArtNet-output-bound there (the ~38 ms synchronous send dominates the tick), so effect-compute differences wash out — the same physics the README narrates for the S3 over WiFi. Effect cost is visible at 64² and below, where Rings / Noise / Spiral are the heaviest and Lines / Checkerboard the lightest.
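Because FPS is just the reciprocal of the tick, the 128² column can be read back as tick times and set next to the ~38 ms send. A minimal sketch; the function names are illustrative and not part of the project.

```python
def tick_us_to_fps(tick_us):
    """Frames per second for a tick duration given in microseconds."""
    return 1_000_000 / tick_us


def fps_to_tick_ms(fps):
    """Tick duration in milliseconds for a given frame rate."""
    return 1_000 / fps


# FPS values at 128² copied from the table above.
for effect, fps in [("Lines", 23), ("Rings", 12)]:
    print(f"{effect}: {fps} FPS -> {fps_to_tick_ms(fps):.1f} ms per tick")
```

Lines at 23 FPS works out to about 43.5 ms per tick, so the send accounts for most of it; Rings at 12 FPS is about 83.3 ms, where render cost is still adding on top.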

Free internal heap (KB) — the scarce resource on a no-PSRAM board; drops as the grid grows because the Layer buffer + LUT and the driver output buffer live in internal RAM:

Effect	16²	32²	64²	128²
Lines	220	214	195	126
Rainbow	173	167	158	126
Noise	171	168	159	126
Plasma	173	171	162	126
PlasmaPalette	170	168	160	126
Metaballs	173	171	162	126
Fire	173	170	157	110
Particles	172	167	149	77
GlowParticles	173	171	162	126
Checkerboard	173	168	159	123
Spiral	170	169	160	123
Rings	170	168	160	124
LavaLamp	170	169	160	124
GameOfLife	171	165	150	90

Largest free internal block (KB) — the memory-pressure signal that matters: free heap can be ample while fragmentation leaves no single block big enough for the next allocation:

Effect	16²	32²	64²	128²
Lines	108	108	108	62
Rainbow	92	88	76	62
Noise	92	88	76	62
Plasma	92	92	84	62
PlasmaPalette	92	88	80	62
Metaballs	92	92	84	62
Fire	96	92	76	62
Particles	80	80	68	34
GlowParticles	84	84	80	62
Checkerboard	96	88	72	62
Spiral	88	88	76	62
Rings	92	88	80	62
LavaLamp	92	88	72	62
GameOfLife	88	84	68	46

Most effects hold the same ~126 KB free / 62 KB block at 128² — their per-cell state is negligible next to the buffers. The exceptions carry real per-cell state: Particles (77 KB / 34 KB) and GameOfLife (90 KB / 46 KB) allocate a parallel grid-sized array, and Fire (110 KB) a heat map. Those three are the ones to watch for fragmentation headroom on a no-PSRAM board at large grids.
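To make the free-heap versus largest-block distinction concrete, the sketch below asks whether one extra contiguous allocation would fit. The readings are copied from the two tables above; the 48 KB request is a hypothetical example (the size of one 128×128×3 buffer), and the helper name is illustrative.

```python
# (free_kb, largest_block_kb) at 128², copied from the two tables above.
readings = {
    "Rainbow": (126, 62),
    "Fire": (110, 62),
    "GameOfLife": (90, 46),
    "Particles": (77, 34),
}


def fits(request_kb, largest_block_kb):
    """A single allocation succeeds only if one contiguous block can hold it."""
    return request_kb <= largest_block_kb


request_kb = 48  # hypothetical extra allocation, not a measured one
for effect, (free_kb, block_kb) in readings.items():
    print(f"{effect}: free {free_kb} KB, largest block {block_kb} KB, "
          f"{request_kb} KB request fits: {fits(request_kb, block_kb)}")
```

Particles still reports 77 KB free, yet a 48 KB request would fail because no single block exceeds 34 KB. That is the case the largest-block column is there to expose.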

ArtNet over WiFi vs Ethernet

	Ethernet	WiFi STA
ArtNet (97 UDP packets)	~27,000 µs	~110,000 µs
Total tick	~50,000 µs / 20 FPS	~130,000 µs / 7 FPS

WiFi sendto() is ~1,140 µs/packet vs Ethernet's ~280 µs — CSMA/CA backoff, rate adaptation, link-layer retries. Not a code regression; WiFi physics. For ArtNet at 16K lights, use Ethernet.
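The table rows follow directly from the per-packet cost. A quick arithmetic check, using only the figures quoted above (the function name is illustrative):

```python
PACKETS = 97  # ArtNet UDP packets per frame, per the table above


def send_time_us(per_packet_us, packets=PACKETS):
    """Total synchronous send time when each sendto() costs per_packet_us."""
    return per_packet_us * packets


ethernet_us = send_time_us(280)   # 27160
wifi_us = send_time_us(1_140)     # 110580
print(ethernet_us, wifi_us, round(wifi_us / ethernet_us, 1))
```

The two products land on the ~27,000 µs and ~110,000 µs rows, and the ratio is roughly 4×.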

Build-variant note: WiFi-only `esp32` is slow on Olimex

Same source tree, same MCU (ESP32 classic, 160 MHz):

Board / firmware	128×128 tick	ArtNet send
Olimex Gateway, old WiFi-only `esp32` (pre-collapse)	220 ms (4 FPS)	155 ms
Olimex Gateway, default `esp32` (WiFi + Ethernet)	85–95 ms (10–12 FPS)	38 ms

The default esp32 build carries both the WiFi and Ethernet stacks, and sdkconfig.defaults.eth enlarges the shared lwIP/WiFi buffer pool via CONFIG_ETH_DMA_* — those buffers roughly quadruple ArtNet throughput versus a WiFi-only buffer pool. Generic ESP32 boards (no PCB-trace antenna, less stable 3V3) vary wildly in WiFi TX quality vs the Olimex.

Memory at 128×128 with mirror

Module	dynamicBytes	Breakdown
Layer	52 KB	12 KB buffer + 40 KB LUT (uint16_t indices on ESP32)
Drivers	48 KB	output buffer (128×128×3)
System + Network	0	char buffers in class, no heap

LUT is half desktop size (uint16_t vs uint32_t per entry). The 1:1 (no-modifier) path skips the LUT entirely; see scenario_Layer_memory_1to1 vs scenario_MultiplyModifier_memory_lut.
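The buffer and LUT sizes can be reproduced with a little arithmetic. The sketch below assumes the LUT is an offsets array with one entry per logical light plus one, followed by a destinations array with one entry per physical light. That layout is inferred because it reproduces the 40,962-byte Mapping LUT figure in the heap breakdown below; it is not confirmed by the source. Function and parameter names are illustrative.

```python
def layer_bytes(width, height, fold, index_bytes):
    """Return (buffer_bytes, lut_bytes) for a mirrored Layer.

    fold: physical lights per logical light (4 for mirror on both axes).
    index_bytes: 2 for uint16_t indices, 4 for uint32_t.
    Assumed LUT layout: (logical + 1) offsets + one destination per physical light.
    """
    physical = width * height
    logical = physical // fold
    buffer_bytes = logical * 3                        # RGB per logical light
    lut_bytes = (logical + 1 + physical) * index_bytes
    return buffer_bytes, lut_bytes


print(layer_bytes(128, 128, fold=4, index_bytes=2))  # ESP32, uint16_t indices
print(layer_bytes(128, 128, fold=4, index_bytes=4))  # desktop, uint32_t indices
```

With 2-byte indices this gives 12,288 + 40,962 bytes (about 12 KB + 40 KB); with 4-byte indices the LUT doubles to 81,924 bytes, matching the 80 KB desktop figure.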

Heap breakdown (128×128, mirror, RainbowEffect, Ethernet + mDNS)

Component	Bytes	Notes
Boot heap	290,240	Before any init
After Ethernet + mDNS init	~240,000	lwIP + Ethernet + mDNS driver
Layer buffer	12,288	64×64×3 (logical, halved by mirror)
Mapping LUT	40,962	offsets + destinations (uint16_t)
Driver output buffer	49,152	128×128×3 (physical)
Preview frame	0	Zero-copy: pointer to output buffer
HTTP + WebSocket	~8,000	server + kernel buffers
MoonModule instances	~3,000	all modules combined
Free heap (running)	~104,000	stable, no leaks

ESP32-S3 — ESP32-S3 N16R8 Dev (16 MB flash, 8 MB octal PSRAM)

esp32s3-n16r8 firmware on the ESP32-S3 N16R8 Dev at Network.txPowerSetting=8 dBm (the brown-out cap injected by deviceModels.json). 128×128 grid, Mirror XY, ArtNet over WiFi STA — the grid sweep (now part of scenario_perf_full) against the live device. Per-step tick/heap live in observed.esp32s3-n16r8 across the scenarios. Numbers below are the 128×128 step.

Metric	Value	Notes
Total tick	~164 ms / 6 FPS	Dominated by ArtNet at the 8 dBm cap
ArtNetSend	~93 ms (97 UDP packets)	~960 µs/packet — slower than full-power WiFi (cf. Olimex `esp32-eth-wifi` at 38 ms) because the cap cuts radio TX margin, association-rate adaptation lands at a lower MCS rate, and packets retry more
Free internal RAM	~240 KB	The comparable, scarce resource. Stays flat (~238–240 KB) across all grid sizes — the Layer buffer + LUT live in PSRAM, so growing the grid doesn't touch internal RAM. This is the number the README perf table shows for the S3, so devices compare on the same axis.
Free heap (incl. PSRAM)	~8,163 KB	The PSRAM-merged total (`totalHeap` reports 8 MB combined). Looks huge but isn't the constraint — assume PSRAM is ample for now.
maxBlock (internal)	~164 KB	Internal-RAM largest contiguous block — the scarce-resource KPI. `maxAllocBlock` (any-memory) reports ~8 MB on PSRAM boards and is meaningless as a pressure signal; SystemModule + scenario_runner use `maxInternalAllocBlock` instead.
Layer buffer	92 KB	In PSRAM (auto by heap_caps preference)
Image	1,307 KB	~3% larger than the ~1.27 MB `esp32-eth-wifi` image due to USB-Serial-JTAG driver + Improv-dual-transport listener

Per-grid-size FPS from the same sweep: 16×16 → 1672, 32×32 → 287, 64×64 → 25, 128×128 → 6. The steep drop is ArtNet-bound: packet count scales with the pixel count, and at 8 dBm each packet is ~3× slower on-air than the Olimex Ethernet path.
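How the packet count grows with the grid can be sketched as follows. The 170-lights-per-universe figure is an assumption (170 RGB lights fill 510 of a universe's 512 DMX channels, with one packet per universe); it is chosen because it reproduces the 97 packets quoted for 128×128, not because the source states it.

```python
import math

LIGHTS_PER_UNIVERSE = 170  # assumption: 170 RGB lights = 510 of 512 DMX channels


def artnet_packets(width, height):
    """Packets per frame, assuming one packet per (partially) filled universe."""
    return math.ceil(width * height / LIGHTS_PER_UNIVERSE)


for side in (16, 32, 64, 128):
    print(f"{side}x{side}: {artnet_packets(side, side)} packets")
```

Under that assumption the sweep goes 2, 7, 25, 97 packets per frame, so the send cost grows roughly 4× per grid step.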

Why ArtNet is slower at 8 dBm

The brown-out cap drops TX power 12 dB below default (8 dBm vs ~20 dBm). At lower TX power, the WiFi PHY rate-adaptation algorithm picks a slower MCS rate to maintain link margin; for a frame burst this means more time on-air per packet. ~960 µs/packet × 97 packets gives the ~93 ms ArtNet budget. The cap is the price of a stable association on this hardware: without it the radio browns out and ArtNet doesn't get sent at all.
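To put the 12 dB figure in linear terms, the standard decibel conversions can be applied to the two power levels above. The function names are illustrative.

```python
def db_to_power_ratio(db):
    """Linear power ratio for a difference expressed in dB."""
    return 10 ** (db / 10)


def dbm_to_mw(dbm):
    """Absolute power in milliwatts for a level expressed in dBm."""
    return 10 ** (dbm / 10)


cap_db = 20 - 8  # default ~20 dBm vs the 8 dBm cap
print(round(db_to_power_ratio(cap_db), 1))            # 15.8
print(round(dbm_to_mw(20)), round(dbm_to_mw(8), 1))   # 100 6.3
```

So the cap leaves the radio with roughly 1/16 of its default transmit power (about 6.3 mW instead of 100 mW), which is why rate adaptation has so little margin to work with.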

Use Ethernet-capable boards for high-FPS ArtNet workloads. The ESP32-S3 N16R8 Dev fits the "lots of PSRAM, accept WiFi compromise" niche — large pixel buffers or feature-heavy effects that wouldn't fit in 320 KB internal RAM.

Memory at 128×128 with mirror

Module	dynamicBytes	Notes
Layer	92 KB	Buffer lives in PSRAM (vs 12 KB on Olimex internal) — full uint32_t LUT instead of halved uint16_t
Drivers	48 KB	Output buffer (128×128×3)
Free internal	~240 KB free, ~164 KB largest block	Plenty of headroom for WiFi + lwIP + Improv-on-both-transports

The PSRAM-merged heap (totalHeap() > totalInternalHeap()) is auto-detected — SystemModule binds the psram progress control only when this comparison is true. See docs/moonmodules/core/SystemModule.md.

All-effects sweep — render-only (no output driver, audio + discovery disabled)

A render-only per-effect sweep on the S3 (observed.esp32s3-n16r8, build Jun 17 2026; this curve is what scenario_perf_full's light/heavy bracket now measures on-device). Unlike the Olimex sweep above (which runs through the ArtNet driver and is output-bound at 128²), this one measures raw render cost: audio (I2S sampling) and the Devices module (the blocking HTTP discovery sweep) are disabled and no output driver is attached, so the tick is Layout→Layer→effect only. On the S3 the Layer buffer lives in PSRAM, so effect-compute is visible all the way to 16K pixels (it never converges to an output-bound floor the way the no-PSRAM Olimex does).

Tick (µs) — render only, ~5–10% run-to-run variance applies:

Effect	16² (256)	32² (1K)	64² (4K)	128² (16K)
Lines	88	96	179	6,425
Rainbow	285	849	3,228	16,207
Noise	913	2,951	11,661	51,230
Plasma	352	1,020	3,744	20,020
PlasmaPalette	146	423	1,765	10,085
Metaballs	462	1,757	6,108	28,576
Fire	382	1,138	4,505	22,745
Particles	229	535	1,945	15,792
GlowParticles	580	1,874	6,959	31,479
Checkerboard	121	345	1,098	8,500
Spiral	465	1,379	6,712	24,666
Rings	852	2,455	9,383	41,403
LavaLamp	309	974	3,612	21,243
GameOfLife	138	413	1,870	16,127

The cheapest (Lines, Checkerboard, PlasmaPalette) clear ~100 FPS even at 16K; the heaviest is Noise (51 ms = ~19 FPS at 16K — simplex noise per pixel), then Rings and GlowParticles. Effect-compute differences stay visible across the whole range because nothing is output-bound here.

Free heap (the PSRAM-merged total) holds ~8.54 MB at small grids and ~8.46–8.49 MB at 16K. The ~50–100 KB delta is just the grid-sized render buffer (the model array), and it returns to ~8.54 MB whenever the grid shrinks: no leak, no fragmentation creep across the sweep. Largest free internal block stays ~90–110 KB throughout. (Internal RAM is not the constraint on this PSRAM board; the Layer buffer is in PSRAM.)

Incremental cost analysis (`scenario_perf_light` / `scenario_perf_full`)

These two scenarios start from a clean canvas and add one subsystem at a time, measuring the tick/heap delta per step, so each module's cost is isolated. Measured live (2026-06-17, render-only, audio + discovery disabled) on all three boards:

classic — Olimex Gateway, ESP32 @240MHz, no PSRAM (320KB internal), nrOfLightsType=uint16
S3 — ESP32-S3 N16R8 @240MHz, 8MB PSRAM, uint32
P4 — Waveshare P4-NANO, ESP32-P4 @400MHz dual-core, 32MB PSRAM, uint32

All figures tick µs at 16² unless a grid is named; ~5–10% run-to-run variance, so small (<~30µs) deltas are near the noise floor.

Per-subsystem cost (added one at a time, 16² grid)

Absolute tick at each step (the diff vs the prior row is that subsystem's cost):

Step	classic	S3	P4	Reading
Render floor (Grid+Layer+Checkerboard)	129	133	67	the baseline; P4 ~2× faster
− AudioModule disabled	116	111	n/a	audio ≈ +13–22µs/tick (fixed I2S block-read; no mic on the P4)
− Devices discovery disabled	116	112	56	idle discovery is free (boot sweep is one-shot)
+ MultiplyModifier (2×2)	315	292	96	+180–200µs on classic/S3, +40µs on P4; the per-frame blend+map over the LUT
+ PreviewDriver	115	118	56	apparatus; free
+ NetworkSendDriver	139	141	67	ArtNet/DDP build+send; cheap at this size
+ RmtLedDriver (64 LEDs)	152	120	56	per-frame encode+transmit at a fixed 64-LED output
+ LcdLedDriver (64 LEDs)	n/a¹	142	57²	S3 LCD_CAM i80
+ ParlioLedDriver (64 LEDs)	n/a¹	n/a	58	P4 Parlio

¹ classic has only RMT; the LCD/Parlio rows there are not real measurements (the driver isn't compiled/registered on classic, the optional add is skipped, so the row just re-measures the prior pipeline). Gating these drivers out per chip is tracked in the backlog.

² P4 has LCD_CAM too, but Parlio is its scale path.

"n/a" = driver absent on that chip.

Expected, and confirmed everywhere: audio is a small fixed per-tick cost; idle discovery is free; output drivers are cheap at a capped 64-LED output (none dominates the render path). The modifier's +~190µs at 16² is the one notable per-frame add — explained below (it's the blend+map, and it pays for itself at large grids).
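The per-subsystem costs quoted in the Reading column are row differences. A small sketch of that subtraction, with tick values copied from the table; treating the "discovery disabled" row as the modifier's baseline is an assumption, and the helper name is illustrative.

```python
# Tick µs at 16², copied from the table: (classic, S3, P4); None = not measured.
render_floor = (129, 133, 67)
audio_off = (116, 111, None)
discovery_off = (116, 112, 56)
with_modifier = (315, 292, 96)


def delta(after, before):
    """Per-board difference, skipping boards with a missing measurement."""
    return tuple(
        None if a is None or b is None else a - b
        for a, b in zip(after, before)
    )


audio_cost = delta(render_floor, audio_off)          # (13, 22, None)
modifier_cost = delta(with_modifier, discovery_off)  # (199, 180, 40)
print(audio_cost, modifier_cost)
```

The audio cost comes out at 13 and 22 µs, and the modifier at 199 and 180 µs on classic and S3 but only 40 µs on the P4.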

Effect compute — light vs heavy bracket, across grid sizes (render-only)

Tick µs; FPS in parens for the 16K row:

Grid (pixels)	classic	S3	P4
Checkerboard (light)
16² (256)	149	119	61
32² (1K)	357	328	133
64² (4K)	1,147	1,090	452
128² (16K)	4,360	7,949	1,940
Noise (heavy)
16² (256)	1,010	799	313
32² (1K)	3,203	2,831	1,120
64² (4K)	13,547	11,235	4,358
128² (16K)	62,316 (16 FPS)	50,555 (20 FPS)	17,433 (57 FPS)

All curves scale ~linear in pixel count (no superlinear blowup → no realloc/fragmentation pathology). The heavy effect is the 16K bottleneck on every board, and the board ranking is P4 ≫ S3 > classic on heavy compute (the P4's 400MHz dual-core is ~3× the S3). Surprise worth noting: at light-16K the classic (4,360µs) beats the S3 (7,949µs) — the S3's PSRAM-resident buffer has higher access latency than the classic's internal RAM for the cheap Checkerboard inner loop, and classic's uint16 LUT is half the size; on the heavy effect the compute dominates and the S3 pulls ahead again. See the NoiseEffect cost backlog item for the fixed-point / strided-sampling ideas.
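One way to check the "roughly linear" claim is to divide each tick by its pixel count: if scaling is linear, the cost per pixel stays in a narrow band. The sketch below does that for the Noise rows, using the table values; the names are illustrative.

```python
# Noise tick µs by pixel count, copied from the table above.
noise_tick_us = {
    "classic": {256: 1_010, 1_024: 3_203, 4_096: 13_547, 16_384: 62_316},
    "S3": {256: 799, 1_024: 2_831, 4_096: 11_235, 16_384: 50_555},
    "P4": {256: 313, 1_024: 1_120, 4_096: 4_358, 16_384: 17_433},
}


def us_per_pixel(ticks):
    """Cost per pixel at each grid size."""
    return {pixels: tick / pixels for pixels, tick in ticks.items()}


for board, ticks in noise_tick_us.items():
    per_pixel = us_per_pixel(ticks)
    spread = max(per_pixel.values()) / min(per_pixel.values())
    print(board, {px: round(v, 2) for px, v in per_pixel.items()}, f"spread {spread:.2f}x")
```

Each board stays within about 1.3× across a 64× range of pixel counts (roughly 3–4 µs per pixel on classic, about 3 on the S3, about 1 on the P4), which is what linear scaling with a small fixed overhead looks like.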

MultiplyModifier — compute down, memory up (Noise effect)

Often misread (I misread it first): with the default 2×2 kaleidoscope the modifier makes the logical grid ¼-size, so the effect computes on fewer pixels — the modifier reduces tick at large grids, and its real cost is the 1:N mapping-LUT memory.

Tick µs, Noise alone vs Noise+Multiply:

Grid (physical)	classic alone	classic +Mult	S3 alone	S3 +Mult	P4 alone	P4 +Mult
16²	1,010	456	799	385	313	163
32²	3,203	1,808	2,831	1,573	1,120	533
64²	13,547	6,958	11,235	6,552	4,358	2,058
128² (16K)	62,316	28,466 (35 FPS)	50,555	29,647	17,433	9,964 (100 FPS)

So the modifier roughly halves the heavy tick at every grid (¼ logical area, but the 1:N map adds back some cost). The memory price is the LUT-destinations array — on the S3 it cost +1.7KB(16²)→+93KB(16K); on the classic at 16K it ran with ~36KB free heap / ~26KB largest block — tight but working, no crash, no degrade. This confirms the no-PSRAM viability: 16K Noise+Multiply runs on the classic at 35 FPS render-only (and has historically run at 10–20 FPS when also sending out over ArtNet — that send, not the render, was the limiter). Not a no-PSRAM blocker.
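The "roughly halves" reading can be checked by dividing the two columns for each board. A short sketch over the table values (pairs are Noise alone, Noise+Multiply); the names are illustrative.

```python
# (alone_us, with_multiply_us) per grid, copied from the table above.
noise_vs_multiply = {
    "classic": [(1_010, 456), (3_203, 1_808), (13_547, 6_958), (62_316, 28_466)],
    "S3": [(799, 385), (2_831, 1_573), (11_235, 6_552), (50_555, 29_647)],
    "P4": [(313, 163), (1_120, 533), (4_358, 2_058), (17_433, 9_964)],
}


def speedups(pairs):
    """Tick ratio alone / with modifier for each grid size."""
    return [alone / with_mult for alone, with_mult in pairs]


for board, pairs in noise_vs_multiply.items():
    print(board, [round(s, 2) for s in speedups(pairs)])
```

Every ratio falls between about 1.7× and 2.2×, short of the 4× that the ¼ logical area alone would suggest; the gap is the cost the 1:N map adds back.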

ESP32 firmware size

Board: the default esp32 (WiFi + Ethernet — the largest classic variant, measured pre-collapse as esp32-eth-wifi). Partition layout: app0/app1 = 1.75 MB each, LittleFS = 384 KB, coredump = 64 KB.

	Size
Firmware image	~1.27 MB
App partition	1.75 MB (~72% used, ~28% headroom)
DRAM used	38 KB
DRAM free	142 KB
`sizeof(MoonModule)` ESP32	56 bytes

Component breakdown (default `esp32`)

Run from project root after a clean build:

uv run scripts/build/build_esp32.py --firmware esp32
idf.py -B build/esp32-esp32 \
       -DSDKCONFIG=build/esp32-esp32/sdkconfig \
       size-components | head -40

These numbers shift with IDF version + sdkconfig — treat as rough proportions.

Category	Approx	What
WiFi stack	~400 KB	`esp_wifi` + `wpa_supplicant` + `esp_phy`. ~1/3 of the binary; `esp32-eth` drops it entirely (image → ~602 KB).
lwIP networking	~180 KB	TCP/IP stack, DHCP, DNS, ARP, mDNS, SNTP.
TLS + cert bundle	~170 KB	`mbedtls` + Mozilla root bundle (~50 KB). Used by `esp_https_ota`; reused by any future HTTPS client.
FreeRTOS + IDF core	~150 KB	Kernel, esp_event, esp_timer, heap, logging, partition ops. Always present.
projectMM code	~120 KB	`src/core/` + `src/light/` + `src/platform/esp32/` + `src/main.cpp`. ~10% of the binary.
HTTP server + WS	~60 KB	`esp_http_server` + `HttpServerModule` routing.
Embedded UI assets	~50 KB	`index.html`, `app.js`, `style.css`, `preview3d.js`, `install-picker.js`, logo PNG — packed as `constexpr uint8_t[]`.
`esp_https_ota` + HTTP client	~40 KB	OTA-from-URL machinery.
LittleFS	~30 KB	`joltwallet/esp_littlefs` component.
Ethernet stack	~30 KB	`esp_eth` + LAN8720 PHY. Present in every classic variant since the collapse (RMII driver is always compiled in).
Misc (alignment, .rodata)	~40 KB	Format strings, error tables, version metadata.
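As a sanity check, the approximate categories can be summed and turned into shares of the image. The values are the rounded table figures, so the result is only as precise as they are.

```python
# Approximate sizes in KB, copied from the table above.
components_kb = {
    "WiFi stack": 400,
    "lwIP networking": 180,
    "TLS + cert bundle": 170,
    "FreeRTOS + IDF core": 150,
    "projectMM code": 120,
    "HTTP server + WS": 60,
    "Embedded UI assets": 50,
    "esp_https_ota + HTTP client": 40,
    "LittleFS": 30,
    "Ethernet stack": 30,
    "Misc": 40,
}

total_kb = sum(components_kb.values())
print(total_kb)  # 1270
for name in ("WiFi stack", "projectMM code"):
    print(f"{name}: {100 * components_kb[name] / total_kb:.1f}%")
```

The categories add up to 1,270 KB, in line with the ~1.27 MB image, with the WiFi stack at about 31.5% and projectMM's own code at about 9.4%.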

Variant size deltas

Variant	Image	Delta	Difference
`esp32` (default, WiFi + RMII Eth)	1.27 MB	—	Everything compiled in
`esp32-eth`	0.60 MB	−670 KB	WiFi stack excluded (`EXCLUDE_COMPONENTS`)
`esp32s3-n16r8`	~1.27 MB	similar	Xtensa LX7, 16 MB flash, different partition table; W5500 SPI Eth instead of RMII

The default esp32 carries both the WiFi and Ethernet stacks (1.27 MB); esp32-eth is the Ethernet-only build that drops the WiFi stack for ~670 KB less image.

Size budget for upcoming features

Feature	Est.	Rationale
Mozilla cert bundle trimmed	−40 KB	`CONFIG_MBEDTLS_CERTIFICATE_BUNDLE_DEFAULT_CMN` keeps common roots only. `_NONE` saves ~50 KB but breaks TLS.
Static IPv6	+20 KB	lwIP IPv6 component (off by default). Only if a deployment needs it.
WebSocket TLS (`wss://`)	~0 KB	Reuses linked mbedTLS; certificate handling adds <5 KB.
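A rough projection of what these estimates would do to the app partition, if all three were applied to the default esp32 image. It assumes 1 MB = 1,000 KB (as the variant table's 1.27 MB − 670 KB = 0.60 MB arithmetic implies) and treats the estimates as exact, which they are not.

```python
IMAGE_KB = 1_270      # ~1.27 MB firmware image
PARTITION_KB = 1_750  # 1.75 MB app partition, assuming 1 MB = 1,000 KB

# Estimated deltas in KB, copied from the table above.
feature_delta_kb = {
    "Mozilla cert bundle trimmed": -40,
    "Static IPv6": +20,
    "WebSocket TLS": 0,
}


def usage_percent(image_kb, partition_kb=PARTITION_KB):
    """Share of the app partition the image occupies."""
    return 100 * image_kb / partition_kb


projected_kb = IMAGE_KB + sum(feature_delta_kb.values())
print(f"now: {usage_percent(IMAGE_KB):.1f}%")
print(f"projected: {projected_kb} KB, {usage_percent(projected_kb):.1f}%")
```

On these numbers the image would move from about 72.6% to about 71.4% of the partition, so the listed features fit comfortably in the existing headroom.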
