[CMSIS-NN] Fix stateful execution and batch-major striding for CMSIS-NN LSTM #3564

Open
veblush wants to merge 1 commit into tensorflow:main from veblush:cm-lstm

@veblush (Collaborator) commented May 21, 2026

Problem

The current CMSIS-NN LSTM wrapper uses arm_lstm_unidirectional_s8 and arm_lstm_unidirectional_s16. These CMSIS-NN functions are designed for stateless sequence evaluation: they explicitly wipe the cell state at t=0 and ignore any initial hidden state, returning only the sequence outputs.

This breaks TFLM's streaming and embedded ML workloads, which rely on stateful LSTMs whose CellStateTensor and HiddenStateTensor persist as variable tensors across Invoke() calls.

Furthermore, CMSIS-NN's internal implementation for batch-major tensors (time_major=false with batch_size > 1) incorrectly jumps memory by time_steps, causing an out-of-bounds read on the contiguous hidden_state buffer.
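To illustrate the layout mismatch described above, here is a minimal sketch of how element offsets differ between a sequence tensor and the hidden-state buffer. `StepOffset` and `HiddenStateOffset` are hypothetical helpers written for this explanation; they are not taken from TFLM or CMSIS-NN, and the exact faulty expression inside CMSIS-NN is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a TFLM or CMSIS-NN function).
// Element offset of the step (batch b, time t) in a sequence tensor whose
// per-step width is `width`.
//   time_major == true  -> layout [time_steps, batch_size, width]
//   time_major == false -> layout [batch_size, time_steps, width]
inline std::size_t StepOffset(bool time_major, std::size_t batch_size,
                              std::size_t time_steps, std::size_t b,
                              std::size_t t, std::size_t width) {
  assert(b < batch_size && t < time_steps);
  return time_major ? (t * batch_size + b) * width
                    : (b * time_steps + t) * width;
}

// Hypothetical helper. The hidden state has no time axis: it is a contiguous
// [batch_size, n_hidden] buffer, so batch b starts at b * n_hidden.
inline std::size_t HiddenStateOffset(std::size_t b, std::size_t n_hidden) {
  return b * n_hidden;
}
```

With batch_size = 2, time_steps = 3 and a width of 4, batch 1 of a batch-major sequence tensor starts at element 12, while the hidden-state buffer holds only 8 elements and batch 1 starts at element 4. Applying a stride that includes time_steps to the hidden-state buffer therefore lands past its end, which is consistent with the out-of-bounds read reported above.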

Solution

  1. Fall back to explicit looping: CMSIS_NN_EvalInteger8x8_16Lstm and CMSIS_NN_EvalInteger16x8_16Lstm now run a manual time/batch loop that bypasses the stateless sequence evaluator and instead calls the single-step CMSIS-NN kernels (arm_nn_lstm_step_s8 and arm_nn_lstm_step_s16) once per timestep.
  2. State persistence: The fallback loop preserves the CellStateTensor and HiddenStateTensor across timesteps and across invocations.
  3. Stride bug bypass: For time_major=false, the loop evaluates one batch at a time (batch_size=1 passed to the kernel), which guarantees cache-friendly contiguous memory reads and avoids CMSIS-NN's batch striding bug entirely.
  4. Future-proofing: Introduced an #ifdef CMSIS_NN_STATEFUL_LSTM guard. Once Arm merges an upstream fix that supports the optional hidden_state context pointer, this flag will switch the wrapper back to the native CMSIS-NN sequence evaluator.
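The loop structure described in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `ToyStep` is a hypothetical stand-in that only accumulates its input (it does not match the signature or gate math of arm_nn_lstm_step_s8), `LstmState` and `EvalSequence` are invented names, input and hidden widths are assumed equal, and the CMSIS_NN_STATEFUL_LSTM guard is omitted. For brevity the sketch processes one batch at a time in both layouts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// State that outlives a single call, playing the role of variable tensors.
struct LstmState {
  std::vector<int16_t> cell;   // [batch_size * n]
  std::vector<int8_t> hidden;  // [batch_size * n]
};

// Hypothetical toy step kernel: adds the input to the cell state, saturates
// to the int8 range, and mirrors the result into hidden state and output.
void ToyStep(const int8_t* input, int16_t* cell, int8_t* hidden,
             int8_t* output, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    int32_t c = static_cast<int32_t>(cell[i]) + input[i];
    if (c > 127) c = 127;
    if (c < -128) c = -128;
    cell[i] = static_cast<int16_t>(c);
    hidden[i] = static_cast<int8_t>(c);
    output[i] = hidden[i];
  }
}

// Manual batch/time loop. State is read and written in place and is never
// reset at t = 0, so it carries over to the next call.
void EvalSequence(const int8_t* input, int8_t* output, LstmState& state,
                  bool time_major, std::size_t batch_size,
                  std::size_t time_steps, std::size_t n) {
  assert(state.cell.size() == batch_size * n);
  assert(state.hidden.size() == batch_size * n);
  for (std::size_t b = 0; b < batch_size; ++b) {
    // Per-batch state slice: contiguous, with no time axis in the stride.
    int16_t* cell = state.cell.data() + b * n;
    int8_t* hidden = state.hidden.data() + b * n;
    for (std::size_t t = 0; t < time_steps; ++t) {
      // Index of this step in the sequence tensors for the given layout.
      const std::size_t step =
          time_major ? t * batch_size + b : b * time_steps + t;
      ToyStep(input + step * n, cell, hidden, output + step * n, n);
    }
  }
}
```

Calling `EvalSequence` twice with the same `LstmState` continues from where the first call stopped, which is the behavior streaming models depend on. In the batch-major case the inner loop walks consecutive steps of a single batch, so reads stay contiguous.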

This resolves the output mismatch between the TFLM reference kernel and the CMSIS-NN implementation.

BUG=N/A

@veblush requested a review from a team as a code owner May 21, 2026 18:24
@veblush added the ci:full label (Triggers the comprehensive cross-platform test suite.) May 21, 2026