AudioModule

Acquires an audio source and publishes an AudioFrame: an overall sound level, a 16-band frequency spectrum, and the dominant peak. The frame is available to consumers every render tick, but its analysed values are recomputed only when a full sample block has accumulated. A 512-sample block at 22050 Hz takes ~23 ms, longer than one tick, so a tick that does not complete a block re-publishes the previous AudioFrame unchanged rather than re-analysing. AudioModule is the producer half of the audio-reactive pipeline; AudioVolumeEffect and AudioSpectrumEffect are the consumers.

It is named for what it does, audio acquisition plus analysis, not for one source: today the source is a digital I²S MEMS microphone (the only one wired), and the same analysis pipeline is built to serve other sources (line-in, USB audio) behind the platform read seam as they are added. The candidate source types — I²S with an MCLK for line-in, PDM mics, analog line-in, and I²C-configured codecs — are surveyed in the Troy (troyhacks) prior-art notes below. Most of the module is the analysis (DC-blocker, RMS level, windowed FFT, band mapping), which is source-independent.

AudioModule is a SystemModule Peripheral, added by the user rather than auto-wired on a default flash. It is a microphone peripheral, useful only on a board that actually has an I²S mic, so it follows the same model as the effects: registered in the factory, added through the UI when wanted, not boot-wired. (Auto-wiring it forced an I²S init on every board, which on the classic ESP32 hung setup() and boot-looped a mic-less device.) When added, its pins default to unset (−1, the standard Pin control sentinel, so GPIO 0 stays a usable mic pin), and the module stays idle, with a status note, until the user enters the real GPIOs, so adding it never grabs arbitrary pins. The audio effects reach the live frame through the static AudioModule::latestFrame(), which returns a permanently silent frame when no mic exists, so they simply stay dark.

Hardware: INMP441-class digital mic

Built and tested against an INMP441 (a self-clocked I²S MEMS microphone): standard/Philips framing, 24-bit data left-justified in a 32-bit slot, mono. It needs three wires plus power. Each pin defaults to −1 (unset); the values below are the bench INMP441 wiring, not code defaults:

Control	Bench pin	Role
`wsPin`	4	word-select / LRCLK
`sdPin`	5	serial data out of the mic
`sckPin`	6	bit clock

The part is self-clocked from the bit clock; there is no master-clock (MCLK) pin. It drives only the one slot its L/R select pin chooses (tie L/R to GND for the left slot, VDD for the right). If the level stays at the floor with sound present, the mic is filling the other slot, and the fix is one wire, not firmware.

How the AudioFrame is produced

Each loop(): read a block of samples → DC-blocker high-pass → compute the level → window + FFT → map to bands. The high-pass conditions the raw block once, up front, so both the level and the spectrum see the same cleaned signal.

DC-blocker high-pass (AudioLevel.h::DcBlocker, host-tested): a one-pole/one-zero IIR high-pass (y[n] = x[n] − x[n−1] + R·y[n−1], R = 0.99, ≈ 40 Hz corner at 22 kHz) applied to the whole block before any analysis. It removes the MEMS mic's large constant DC bias and sub-bass rumble below ~40 Hz (handling/wind/structural) that would otherwise leak into the lowest band. Its state carries across blocks (it's a continuous filter, not per-block), and it resets when the channel re-inits. This is distinct from, and runs before, the level path's own block-mean subtraction below.
Level (AudioLevel.h, host-tested): subtract the block's DC mean (belt-and-braces after the high-pass), take the RMS, and map it through the log/dB window (floor / gain). It is the overall loudness, independent of the FFT: the VU value. (It uses a gentler floor than the bands so the meter keeps moving with volume rather than gating hard.)
Spectrum (AudioBands.h, host-tested): apply a Hann window (the standard general-purpose FFT window, tapers the block edges so a tone doesn't smear across bins), run the FFT (platform::audioFft), then group the magnitude bins into 16 log-spaced bands (a plain geometric / equal-ratio bin split) and pick the loudest bin as the dominant peak (argmax). The peak is held when no real signal is present so it doesn't wander in silence.
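The first two stages above can be sketched in a few lines. This is a minimal illustration, not the code in AudioLevel.h: the names DcBlockerSketch and blockRmsSketch are hypothetical, the samples are assumed to be already converted to float, and the final log/dB mapping through floor and gain is left out because its exact form is not described here.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sketch of a one-pole/one-zero DC blocker:
//   y[n] = x[n] - x[n-1] + R * y[n-1]
// The state persists across blocks; reset() stands in for a channel re-init.
struct DcBlockerSketch {
    float r = 0.99f;       // pole radius, as quoted in the text
    float prevIn = 0.0f;   // x[n-1]
    float prevOut = 0.0f;  // y[n-1]

    void reset() {
        prevIn = 0.0f;
        prevOut = 0.0f;
    }

    // Filters the block in place.
    void process(float* samples, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            const float x = samples[i];
            const float y = x - prevIn + r * prevOut;
            prevIn = x;
            prevOut = y;
            samples[i] = y;
        }
    }
};

// Hypothetical sketch of the level path: subtract the block mean, then take the RMS.
inline float blockRmsSketch(const float* samples, std::size_t count) {
    if (count == 0) return 0.0f;
    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i) sum += samples[i];
    const double mean = sum / static_cast<double>(count);
    double energy = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        const double d = samples[i] - mean;
        energy += d * d;
    }
    return static_cast<float>(std::sqrt(energy / static_cast<double>(count)));
}
```

A constant input decays geometrically to zero through the blocker, and a constant block has an RMS of zero after mean subtraction, which is why silence and pure DC both read as 0.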

Only the I²S read and the FFT kernel are platform code (platform_esp32_i2s.cpp: IDF's i2s_std driver + esp-dsp's float dsps_fft2r_fc32); everything else is plain domain math that runs in CI on the desktop's reference DFT.

The DSP choices are the textbook defaults on purpose: a Hann window, RMS for level, a geometric band split, argmax for the peak. There is deliberately no per-frequency correction table; the INMP441 is flat ±3 dB across the range that matters (datasheet), so there is no mic-response error to compensate, and a hand-tuned correction curve would add complexity for nothing. The level is overall RMS loudness computed independently of the FFT, not derived from the bands; deriving it from the bands would stop it tracking volume.
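The textbook pieces named above are small enough to sketch. The following is an illustration under assumptions, not the contents of AudioBands.h: all function names are hypothetical, the symmetric Hann form is assumed, the band split assumes there are at least as many bins as bands, and the peak hold during silence is not shown.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Symmetric Hann window coefficient for sample n of an N-sample block (N > 1).
inline float hannSketch(std::size_t n, std::size_t N) {
    const double pi = 3.14159265358979323846;
    return static_cast<float>(0.5 - 0.5 * std::cos(2.0 * pi * static_cast<double>(n) /
                                                   static_cast<double>(N - 1)));
}

// Geometric (equal-ratio) split of bins [firstBin, endBin) into `bands` bands.
// Returns bands + 1 edges; band k covers bins [edges[k], edges[k + 1]).
// Each band is forced to hold at least one bin.
inline std::vector<std::size_t> bandEdgesSketch(std::size_t firstBin,
                                                std::size_t endBin,
                                                std::size_t bands) {
    std::vector<std::size_t> edges(bands + 1);
    const double ratio = static_cast<double>(endBin) / static_cast<double>(firstBin);
    edges[0] = firstBin;
    for (std::size_t k = 1; k <= bands; ++k) {
        const double ideal = static_cast<double>(firstBin) *
                             std::pow(ratio, static_cast<double>(k) / static_cast<double>(bands));
        const std::size_t rounded = static_cast<std::size_t>(std::lround(ideal));
        edges[k] = rounded > edges[k - 1] ? rounded : edges[k - 1] + 1;
    }
    return edges;
}

// Argmax over the magnitude bins, skipping everything below firstBin (the DC bin, typically).
inline std::size_t peakBinSketch(const std::vector<float>& mags, std::size_t firstBin) {
    std::size_t best = firstBin;
    for (std::size_t i = firstBin; i < mags.size(); ++i) {
        if (mags[i] > mags[best]) best = i;
    }
    return best;
}

// Centre frequency of an FFT bin.
inline float binToHzSketch(std::size_t bin, float sampleRate, std::size_t fftSize) {
    return static_cast<float>(bin) * sampleRate / static_cast<float>(fftSize);
}
```

With a 512-sample block at 22050 Hz the bin spacing is about 43 Hz, so bin 6 sits near 258 Hz.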

Controls

wsPin / sdPin / sckPin: the three I²S GPIOs (see table above). Changing any re-creates the I²S channel live — no reboot (§ Live reconfiguration); notable for an audio peripheral, where most firmware (WLED's audioreactive usermod included) bakes the mic pins in and needs a restart to change them.
sampleRate: a dropdown over the standard rates (8000 / 16000 / 22050 / 44100 Hz), default 22050 (~11 kHz Nyquist covers the range that matters for light). Changing it re-creates the channel live.
floor, the noise floor: bands and level below this read as silence, so an ambient room stays dark. Raise it for a noisy room, lower it for a quiet one. Default 100.
gain, sensitivity: higher = more (a narrower dB window, so a given sound fills more of the bar). Default 222.
level RMS: read-only RMS sound level. The display shows the PEAK level over each 1-second window (the live value the LEDs use is recomputed ~43×/s; sampling it once a second would read 0 between beats even while the meter LEDs move).
peakHz: read-only dominant frequency (updates each second).
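The level readout's "peak over each 1-second window" behaviour can be sketched as a small peak-hold. This is an assumption about how such a display value could be tracked, not the module's actual code; PeakWindowSketch and its members are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical sketch: hold the highest live level seen in each window and
// publish it when the window closes, so a slow UI refresh never samples a
// momentary 0 between beats.
struct PeakWindowSketch {
    std::uint32_t windowMs = 1000;    // display window length
    std::uint32_t windowStartMs = 0;  // when the current window began
    float runningPeak = 0.0f;         // highest level seen in the current window
    float displayed = 0.0f;           // value shown until the next window closes

    // Feed every freshly computed level; returns the value the UI should show.
    float update(float liveLevel, std::uint32_t nowMs) {
        if (liveLevel > runningPeak) runningPeak = liveLevel;
        if (nowMs - windowStartMs >= windowMs) {
            displayed = runningPeak;
            runningPeak = 0.0f;
            windowStartMs = nowMs;
        }
        return displayed;
    }
};
```

The unsigned subtraction keeps the window test correct across a millisecond-counter wrap.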

Cross-domain wiring

AudioModule produces an AudioFrame (src/core/AudioFrame.h); the consuming effects reach the live frame through the static AudioModule::latestFrame(), not a boot-time setter, so an effect added through the UI at any time still finds the one mic, and with no mic it gets a static silent frame. The active module registers itself in setup() and clears that pointer in teardown(), so adding or removing the mic (or an effect) in any order always leaves a coherent answer.
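The register-in-setup, clear-in-teardown pattern with a silent fallback might look roughly like this. It is a sketch under assumptions: AudioFrameSketch and AudioSourceSketch are hypothetical stand-ins (the real AudioFrame lives in src/core/AudioFrame.h and its fields are not reproduced here), and the rule that teardown() only clears the pointer when the module is the active one is this sketch's own choice.

```cpp
#include <array>
#include <cstdint>

// Hypothetical frame layout, for illustration only.
struct AudioFrameSketch {
    std::uint8_t level = 0;
    std::array<std::uint8_t, 16> bands{};
    float peakHz = 0.0f;
};

class AudioSourceSketch {
public:
    // The last module to run setup() becomes the active producer.
    void setup() { active_ = this; }

    // Clear the registration only if this module is the active one,
    // so removing a stale module never silences a live one.
    void teardown() {
        if (active_ == this) active_ = nullptr;
    }

    // Consumers call this every tick; with no producer they get a silent frame.
    static const AudioFrameSketch& latestFrame() {
        static const AudioFrameSketch silent{};
        return active_ ? active_->frame : silent;
    }

    AudioFrameSketch frame;  // written by the producer's analysis step

private:
    static AudioSourceSketch* active_;
};

AudioSourceSketch* AudioSourceSketch::active_ = nullptr;
```

Because consumers never cache the pointer and always go through latestFrame(), removing the producer while an effect is live leaves the effect reading the silent frame instead of a dangling one.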

The first ~250 ms after the I²S clock starts are power-on settling garbage. The read is non-blocking (the hot-path rule), so those samples flow through the first few loop() reads, and the level/bands self-correct within that quarter-second. The frame stays valid (zeroed) until then.
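A non-blocking read returns however many samples are ready, so a block fills over several loop() calls and analysis runs only when it is complete. The sketch below illustrates that accumulation; BlockAccumulatorSketch and blockDurationMsSketch are hypothetical names, and the real module's buffering may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: collect partial reads until one analysis block is full.
class BlockAccumulatorSketch {
public:
    explicit BlockAccumulatorSketch(std::size_t blockSize)
        : block_(blockSize), filled_(0) {}

    // Copies up to the remaining capacity; returns how many samples were taken.
    std::size_t push(const float* samples, std::size_t count) {
        const std::size_t room = block_.size() - filled_;
        const std::size_t take = count < room ? count : room;
        for (std::size_t i = 0; i < take; ++i) block_[filled_ + i] = samples[i];
        filled_ += take;
        return take;
    }

    bool full() const { return filled_ == block_.size(); }
    const std::vector<float>& block() const { return block_; }
    void clear() { filled_ = 0; }  // call after the block has been analysed

private:
    std::vector<float> block_;
    std::size_t filled_;
};

// Time needed to fill one block, in milliseconds.
inline double blockDurationMsSketch(std::size_t blockSize, double sampleRateHz) {
    return 1000.0 * static_cast<double>(blockSize) / sampleRateHz;
}
```

On a tick where full() is still false, the caller would simply leave the previously published frame in place; 512 samples at 22050 Hz come to roughly 23.2 ms per block.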

Prior art

Audio-reactive lighting is a long-standing idea in the LED-controller world (WLED-MM and MoonLight are the closest lineage). This is projectMM's own implementation, designed from the INMP441 datasheet and standard DSP rather than traced from any one project's code or band tables; the rationale for the specific DSP choices is in How the AudioFrame is produced above. The history of what was tried and removed (notably a self-calibrating auto-gain / noise-floor conditioner, deferred as its own increment) lives in decisions.md.

Frank (softhack007). Frank is the main author of the WLED-MM audioreactive usermod, the most-used open-source audio-reactive LED implementation, and a direct ancestor of the ideas this module learns from. projectMM's product owner worked alongside Frank for years on WLED-SR / WLED-MM before starting MoonLight and then projectMM, so the collaboration goes back a long way. We don't trace his code (per the Industry standards, our own code principle), but we study his thinking with real respect and credit it by name; the Adaptive noise gate section below is the first worked example: his concept, our analysis, written fresh against our own architecture.

Troy (troyhacks). Troy is, like Frank, part of the MoonModules team. He keeps his own fork of WLED-MM at troyhacks/WLED, where (on the P4_experimental / Pure_IDFv5_Port branches) he reworked the audioreactive usermod's DSP to run on Espressif's esp-dsp library — a combination he reports as "stupid fast compared to ArduinoFFT … stupid fast even without on-chip acceleration," with very low latency on the S3 and P4. The relevant file is usermods/audioreactive/audio_reactive.h (guarded by UM_AUDIOREACTIVE_USE_ESPDSP_FFT). His contribution has two parts, and our assessment of each follows.

The esp-dsp FFT. Troy uses esp-dsp's radix-4 real FFT (dsps_fft4r_fc32 → dsps_bit_rev4r_fc32 → dsps_cplx2real_fc32) with a Blackman-Harris window. This is the right family, and it validates the path projectMM is already on: we use esp-dsp too — dsps_fft2r_fc32, the radix-2 float real FFT — in platform_esp32_i2s.cpp (see How the AudioFrame is produced). So Troy's "winning combination of speed and acceleration" and ours are the same library; the one open optimisation is radix-4 vs radix-2. For a power-of-two real FFT, radix-4 does fewer butterfly stages (log₄ N vs log₂ N) and is the textbook faster choice on a float FPU — a measured, low-risk follow-up for this module if FFT time ever shows up in the tick budget (today it doesn't: the float FFT on the S3/P4 FPU is well inside one tick). Worth noting two adjacent, not yet adopted options so the trade-offs are on record: (a) esp-dsp also exposes an int16 / fixed-point path that uses the built-in FFT instructions on the S3 and P4 — that is the "even faster if the hardware has the functions baked in" Troy refers to; we deliberately run float today because our targets have an FPU and float keeps the band math simple (the Industry standards, our own code call), but the hardware-accelerated int16 path is the lever for low-power FPU-less chips (C3 / S2); and (b) Espressif's standalone dl_fft component does only FFT (float or hardware-accelerated int16) without esp-dsp's shared-global twiddle tables — the "new FFT lib that doesn't drag in all of ESP-DSP" — which we do not use (we take the whole esp-dsp dependency because we also want its DSP primitives), but it is the right pick if a future build wants the FFT without the rest of esp-dsp.

The biquad pre-filters. Before the FFT, Troy runs the time-domain samples through biquad high-pass, low-pass, and a peaking ("notch to boost the mids") filter using esp-dsp's optimised dsps_biquad_f32 (not hand-rolled), with 5-coefficient direct-form sections designed in EarLevel Engineering's Biquad Calculator v2 (the "web-based visual biquad tool that spits out the 5 values"); he also bundled an offline copy into the WLED web UI as biquad.htm. This is squarely industry-standard: a biquad / second-order section is the canonical building block for audio EQ and pre-emphasis, and the Audio EQ Cookbook coefficients EarLevel emits are the recognised reference. Our current pipeline does one fixed DC-blocker high-pass (~40 Hz) to strip the offset before analysis; Troy's contribution shows the natural next step: making that filter stage a configurable biquad chain (HP to kill rumble, LP to tame aliasing, optional peaking to lift the mids the FFT under-reports).

Priority/assessment: the FFT is already shared ground (radix-4 is a measure-then-maybe tune-up, not a gap); the biquad pre-filter chain is the higher-value idea to adopt, because it improves spectral accuracy (Troy: "the HP and LP filters improved the FFT output accuracy") with off-the-shelf primitives and a known design tool, and it composes cleanly with the forward-looking Adaptive noise gate below: a learned gate keyed to a cleanly-filtered signal is better than one keyed to a raw one. Both remain analysis here, written fresh against our architecture, not traced from Troy's code.
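For readers unfamiliar with biquads, one second-order section with its five normalised values can be sketched as follows. This is plain C++ for illustration: it does not call esp-dsp, it is not Troy's code, and BiquadSketch and makeHighPassSketch are hypothetical names. The coefficients use the widely published Audio EQ Cookbook high-pass formulas, with the corner frequency and Q chosen by the caller.

```cpp
#include <cmath>

// Hypothetical sketch of one biquad section in transposed direct form II.
// Coefficients are normalised so that a0 == 1.
struct BiquadSketch {
    float b0 = 0.0f, b1 = 0.0f, b2 = 0.0f;  // feed-forward
    float a1 = 0.0f, a2 = 0.0f;             // feedback
    float z1 = 0.0f, z2 = 0.0f;             // filter state, carried across blocks

    float step(float x) {
        const float y = b0 * x + z1;
        z1 = b1 * x - a1 * y + z2;
        z2 = b2 * x - a2 * y;
        return y;
    }
};

// Audio EQ Cookbook high-pass design.
inline BiquadSketch makeHighPassSketch(float sampleRateHz, float cornerHz, float q) {
    const double pi = 3.14159265358979323846;
    const double w0 = 2.0 * pi * cornerHz / sampleRateHz;
    const double cosw = std::cos(w0);
    const double alpha = std::sin(w0) / (2.0 * q);
    const double a0 = 1.0 + alpha;

    BiquadSketch f;
    f.b0 = static_cast<float>((1.0 + cosw) / 2.0 / a0);
    f.b1 = static_cast<float>(-(1.0 + cosw) / a0);
    f.b2 = f.b0;
    f.a1 = static_cast<float>(-2.0 * cosw / a0);
    f.a2 = static_cast<float>((1.0 - alpha) / a0);
    return f;
}
```

A chain is then just several sections applied in sequence to each sample, each with its own state; a low-pass or peaking section differs only in how the five values are computed.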

Fixed-point, and adjacent WLED work. Frank's note that "esp-dsp is the way to go for AR 2.0" while "Damian is tinkering with fixed-point … the low-power C3 and S2 boards" maps onto a real WLED feature: Damian Schneider — DedeHai — is the same person as "Dedehai" (one contributor, not two), a WLED core developer, and WLED's audioreactive usermod already carries an integer / fixed-point FFT path (UM_AUDIOREACTIVE_USE_INTEGER_FFT, ~1.5 ms on a C3, "over 10× faster than ArduinoFFT" on FPU-less chips). Troy's and Frank's read is that with esp-dsp's FFT + biquads, fixed-point is not necessary on FPU chips (S3 / P4) — which is exactly projectMM's position: float on FPU targets, and the int16 / dl_fft hardware path noted above is the lever reserved for the low-power chips if we ever target them. Will's note that "Dedehai has already built something similar" in WLED is also accurate — DedeHai's current audio experiment is a PoC MSGEQ7-based AudioReactive (offloading the spectrum to a dedicated hardware analyser chip rather than running FFT on the MCU at all) — a different point in the same design space, recorded here for completeness.

Source types beyond the I²S MEMS mic. Troy's read on the current single-source spec: it's a good base, and the natural extension is to broaden the source seam rather than the analysis. Four source types are worth supporting, in roughly increasing hardware complexity:

I²S with an MCLK, for line-in. The INMP441 is self-clocked and has no MCLK; a line-in codec generally needs a master clock the ESP32 drives. Adding an optional mclkPin to the I²S read seam covers the line-in case without changing the analysis.
PDM mics, for boards that ship with one. A different I²S sub-mode (the IDF i2s_pdm driver) — another variant behind the same platform read, not a new pipeline.
Analog line-in. Long held to "only the original ESP32," but the field has moved: DedeHai got analog input working on the S3, and Troy got it working in his ParrotRadio project. Troy flags a testing-confidence nuance worth recording: he considers his own ParrotRadio analog path better exercised — he was actually recording and playing audio back through it and chasing down real issues — whereas an unlistened-to analog path elsewhere may not be as accurate as it looks, "if nobody's ever listened to it." So if projectMM adopts analog line-in, validate by listening, not just by watching the level meter move.
I²C-configured codecs (e.g. the ES8311): the right move is explicitly not to hand-roll each codec's register config (which is what Troy did in WLED). Espressif ships an esp_codec_dev "codecs" component for IDF that already carries the option tables for many codecs; pulling it in would support "a bunch more codecs for free" and let users configure them for their own hardware. If something Troy hand-rolled turns out to be missing from the component, the codec class is extensible — but he doubts anything is. This is the Industry standards, our own code call applied to codec bring-up: take Espressif's component rather than a bespoke per-codec config.

Troy also has DSP boards on his desk — essentially I²S front-ends "waaaaay beyond the regular codecs" — a class of source recorded here so the line-in / codec work leaves room for it rather than only the simple cases. All of the above is source-seam work: it widens what feeds the pipeline, leaving the DC-blocker / RMS / FFT / band analysis untouched. Tracked under backlog § sensor input.

Adaptive noise gate: forward-looking

Present-tense exception (justified). Module specs are otherwise present-tense (CLAUDE.md); this section is forward-looking by deliberate choice, so the design analysis stays with the module it extends. It describes a concept and our judgement of it, not shipped behaviour. The shipped audio path is everything above.

This concept comes from softhack007 (see Prior art), who granted permission to analyse it here. The proposal: replace the borrowed squelch/noiseFloor knob, described as "a WLED-SR workaround, not a real gate," with a proper adaptive noise gate. The rest of this section is our own assessment.

The concept

A standard noise gate: below a threshold the signal is silenced (gate closed), above it the signal passes (gate open).
Asymmetric, bang-bang timing: open fast, close slow. A bang-bang (hysteresis) controller avoids chatter at the threshold.
A new "detect silence" function drives the gate. This is the explicitly unfinished part of the idea.
Leave the GEQ / FFT channels untouched. The gate acts on the time-domain signal, not the bands. (A per-band noise threshold is noted as possibly also worth having.)
The closing pre-condition should be relative, not an absolute sample count: a "percentage of average signal," not a fixed number.
Optionally feed the gate compressed samples (sqrt or log) so the threshold behaves perceptually rather than linearly.

Five design constraints come with it, and they are the load-bearing part: (1) samples are signed, of arbitrary magnitude, and scaling to an effect range is AGC's job, not the gate's; (2) every abs() must be justified (a rectify discards sign/phase); (3) prefer relative factors to absolute thresholds, the one allowed absolute being that changes < 2 counts are sampling noise; (4) smooth before thresholding; (5) every filter adds delay, and total audio delay must stay < 30 ms.

Is this a good idea? Our verdict

Yes, directionally, and it is squarely industry-standard. A hysteresis noise gate with a fast-attack/slow-release envelope is the textbook design for exactly this problem (it is how studio gates, two-way-radio squelch, and voice-activity detectors all work), so adopting it moves us toward the recognisable solution and away from the borrowed squelch constant, which is the right direction under the Industry standards, our own code principle. The relative-threshold insight (constraint 3) is the genuinely valuable core: a gate keyed to a learned floor self-calibrates to whatever mic or line source is connected, where an absolute squelch only ever suits one setup. So the idea is sound and worth doing.

Two cautions keep it from being a drop-in. First, timing is tight and must be proven, not assumed. A 512-sample block at 22050 Hz is already ~23 ms of buffering before analysis begins; that leaves under ~7 ms of the 30 ms budget for everything the gate adds. The block size, not the gate, is the dominant cost, so any smoothing the gate introduces must be cheap (one-pole) and the open path especially must not lengthen it. This is measurable on hardware and a hard gate on the design. Second, it overlaps work we have already scoped (the per-band floor, below), so the risk is building a parallel mechanism instead of one coherent one. Both push the same way: decompose and adopt in steps, do not overhaul.

Does our per-band floor already cover part of this?

Partly, and that overlap is the key to sequencing. The backlogged per-band noise-floor learns each band's idle baseline and subtracts it, so a steady single-frequency tone (our bench's ~258 Hz mains hum) gates to dark while the other bands stay live. The proposed time-domain gate answers a different question, "is there any sound at all," across the whole signal. They are complementary halves, not competitors: the per-band floor is the frequency-domain noise floor, the gate is the time-domain one. The per-band floor is also the smaller, already-planned step, so it is the natural first increment, and it is genuinely "part of this idea," not a thing the gate replaces.

How to decompose it: cherry-pick, step by step

The whole proposal is more than one increment. Taken apart, most of its value lands early and cheaply, and the riskier parts can wait or be dropped:

1. Per-band noise floor (already backlogged). Ship this first. It is the frequency-domain half, the smallest change, and it kills the concrete hum we actually see. Independent of everything below.
2. Relative thresholds, reusing the RMS we already compute. The single most valuable idea here is "threshold against a learned floor, not an absolute number." computeLevel already produces a per-block RMS, which is an envelope estimate (and RMS is the one justified abs() under constraint 2: it is the energy measure, not a naive rectify). So a learned-floor follower over that RMS, with open/close as factors of it, is a small, host-testable addition that needs no new DSP stage and no extra delay (the RMS is on the critical path already). This is the cherry to pick.
3. Hysteresis + asymmetric timing. The fast-open/slow-close behaviour falls out of two time-constants on that follower plus a close-hold, not a separate state machine. Cheap to add once step 2 exists; this is where the < 30 ms budget gets measured for real.
4. Optional, defer until proven needed: log/dB-domain thresholds (our magToByte already does perceptual compression downstream, so the detector can stay linear at first and move to dB only if the linear factors prove twitchy), and a true soft gate (0..1 gain vs a hard 0/1).

Each step is its own commit, host-tested red-first, and leaves the system working; none requires touching AudioBands.h or the effect consumers. Steps 1–2 deliver most of the benefit (a self-calibrating floor in both domains) with almost no timing cost; 3–4 are polish to layer on only if the bench says they earn their place.
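To make steps 2 and 3 concrete, here is one way a learned-floor follower with relative thresholds and a close-hold could be shaped. It is a design sketch only, since nothing like it is shipped: AdaptiveGateSketch, its factors, its smoothing constants and its hold length are all hypothetical values picked for illustration, and it is fed the per-block RMS that the level path already computes.

```cpp
// Hypothetical sketch of an adaptive gate driven by per-block RMS.
struct AdaptiveGateSketch {
    float openFactor = 3.0f;   // open when rms > floor * openFactor
    float closeFactor = 1.5f;  // count toward closing when rms < floor * closeFactor
    float riseAlpha = 0.001f;  // floor creeps up slowly under sustained sound
    float fallAlpha = 0.2f;    // floor drops quickly when the room gets quieter
    int holdBlocks = 10;       // consecutive quiet blocks required before closing

    float floorRms = 0.0f;     // learned idle baseline
    bool seeded = false;       // false until the first block seeds the floor
    bool open = false;
    int quietBlocks = 0;

    // Call once per analysed block; returns true while the gate is open.
    bool update(float rms) {
        if (!seeded) {
            floorRms = rms;
            seeded = true;
        }
        // One-pole follower with asymmetric time constants.
        const float alpha = rms < floorRms ? fallAlpha : riseAlpha;
        floorRms += alpha * (rms - floorRms);

        if (!open) {
            if (rms > floorRms * openFactor) {
                open = true;  // fast open: same block, no added delay
                quietBlocks = 0;
            }
        } else if (rms < floorRms * closeFactor) {
            if (++quietBlocks >= holdBlocks) open = false;  // slow close
        } else {
            quietBlocks = 0;  // sound returned, restart the hold
        }
        return open;
    }
};
```

Both thresholds are factors of the learned floor rather than absolute numbers, and the gap between openFactor and closeFactor is the hysteresis that prevents chatter. The open path adds no buffering beyond the block itself, which is the property the 30 ms budget depends on.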

What it eventually retires: the floor knob's role as a hard squelch. floor would become the display noise-floor only (the dB-window bottom in magToByte), while the learned gate decides "is there sound." That is a clean subtraction, but it is the end of the path, not the first step. Tracked under backlog § audio follow-ups.

Tests

Full case lists are in the generated inventories — unit tests § AudioModule and scenario tests § AudioModule (both regenerated from the test files, so they never drift). What each layer covers:

Level + Spectrum (CI, host): the signal math runs end-to-end on synthesized blocks through the desktop reference DFT — silence/DC read 0, a louder sine reads higher, the floor/gain knobs gate and scale, a tone lands in the right band and peakHz tracks it, energy concentrates rather than smears, and degenerate input never crashes.
Module lifecycle (CI, host): the part the classic-ESP32 boot-loop showed was risky — a fresh module is idle with pins unset (never inits a mic by merely existing), setup/teardown is repeatable with no residue, teardown() clears the active mic so latestFrame() falls back to silence (no dangling pointer), and last-setup-wins under any add/remove order (the robustness rule).
Mutation scenario (CI, host): add / configure / remove the mic and a consumer effect while the pipeline renders — the hard case is removing the producer while a consumer is still live, which must keep rendering on silent audio. The boot-loop robustness, proven end-to-end through the Scheduler.
Hardware: on the S3 with an INMP441, level fluctuates with how loud the room is, the spectrum bars track played tones, peakHz follows the dominant frequency, and raising floor keeps an ambient room dark.

Source

AudioModule.h · AudioFrame.h · AudioLevel.h · AudioBands.h · platform_esp32_i2s.cpp
