Commit 46612e1

Merge pull request #2661 from ArmDeveloperEcosystem/main
Prod update with minor fixes for SME2 LiteRT LP
2 parents 4999e98 + 9c75a4b commit 46612e1

5 files changed

Lines changed: 18 additions & 17 deletions

content/learning-paths/mobile-graphics-and-gaming/litert-sme/1-litert-kleidiai-sme2.md

Lines changed: 5 additions & 5 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

## Inside the LiteRT software stack

-LiteRT (Lightweight Runtime, formerly TensorFlow Lite) is a runtime for on-device AI on Arm platforms. The default CPU acceleration library used by LiteRT is XNNPACK.
+LiteRT (Lite Runtime, formerly TensorFlow Lite) is a runtime for on-device AI. The default CPU acceleration library used by LiteRT is XNNPACK.

XNNPACK is an open-source library that provides highly optimized implementations of neural-network operators. It continuously integrates the KleidiAI library to use new CPU features such as Scalable Matrix Extension 2 (SME2).

@@ -25,13 +25,13 @@ To understand how KleidiAI SME2 micro-kernels work in LiteRT, think about a Lite
### LiteRT → XNNPACK workflow

![Diagram showing the workflow for a fully connected operator in LiteRT using XNNPACK. The diagram depicts the flow from LiteRT to XNNPACK, highlighting the use of NEON instructions for matrix multiplication and weight packing on Arm platforms. The technical environment emphasizes operator traversal, hardware detection, and parallel computation. alt-text #center](./litert-xnnpack-workflow.png "LiteRT, XNNPACK workflow")
-A fully connected operator multiplies two matrices: the input activations (LHS) and the weights (RHS).
+For batch sizes greater than 1, a fully connected operator performs a matrix multiplication between the input activations (LHS) and the weights (RHS).

When LiteRT loads a model, it reads the operators and builds a computation graph. If you select the CPU as the accelerator, LiteRT uses XNNPACK by default.

-XNNPACK scans the computation graph and looks for operators it can optimize. It packs the weight matrix to prepare for efficient computation. On Arm platforms, XNNPACK uses NEON instructions to speed up this packing and the matrix multiplication.
+XNNPACK scans the computation graph and looks for operators it can optimize. XNNPACK also checks the hardware compatibility and chooses the best available micro-kernel. Then, it packs the weight matrix to prepare for efficient computation. On Arm platforms, XNNPACK uses NEON instructions to speed up this packing.

-At runtime, XNNPACK checks the hardware and chooses the best available micro-kernel. During inference, it splits the matrices into smaller tiles and runs the multiplications in parallel across multiple threads, using NEON instructions for faster processing.
+During model inference, it splits the matrices into smaller tiles and runs the multiplications in parallel across multiple threads, using NEON instructions for faster processing.

### LiteRT → XNNPACK → KleidiAI workflow

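To make the operation in the hunk above concrete, here is a minimal NumPy sketch of the matrix product a fully connected operator performs, with hypothetical shapes; it illustrates the math and the tiling idea only, not XNNPACK's actual kernels:

```python
import numpy as np

# Hypothetical shapes: a batch of 4 inputs, 128 input features,
# 256 output features.
activations = np.random.rand(4, 128).astype(np.float32)  # LHS
weights = np.random.rand(128, 256).astype(np.float32)    # RHS

# With batch size > 1, the fully connected operator is this product:
output = activations @ weights  # shape (4, 256)

# The tiling described above: split the output into column tiles and
# compute each independently (XNNPACK distributes tiles to threads).
tile = 64
tiled = np.empty_like(output)
for j in range(0, weights.shape[1], tile):
    tiled[:, j:j + tile] = activations @ weights[:, j:j + tile]

assert np.allclose(output, tiled)
```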
@@ -41,7 +41,7 @@ When KleidiAI and SME2 are enabled at build time, the KleidiAI SME2 micro-kernel

During the model loading stage, when XNNPACK optimizes the subgraph, it checks the operator’s data type to determine whether a KleidiAI implementation is available. If KleidiAI supports it, XNNPACK bypasses its own default implementation. As a result, RHS packing is performed using the KleidiAI SME packing micro-kernel. Because KleidiAI typically requires packing of the LHS, a flag is also set during this stage.

-During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix product.
+During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix.

## What you've accomplished and what's next

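The load-time versus inference-time split in the hunk above can be summarized in a short sketch. This is illustrative pseudocode in Python, not the real XNNPACK or KleidiAI API; the packing functions are placeholders:

```python
import numpy as np

def pack_rhs(weights):
    # Placeholder for the KleidiAI SME RHS-packing micro-kernel,
    # which reorders weights into a kernel-friendly layout.
    return np.ascontiguousarray(weights)

def pack_lhs(activations):
    # Placeholder for the LHS-packing micro-kernel invoked at
    # inference time on the KleidiAI path.
    return np.ascontiguousarray(activations)

class FullyConnected:
    def __init__(self, weights, kleidiai_supported):
        # Model-load time: the RHS is packed once, and a flag records
        # that the LHS must also be packed later.
        self.needs_lhs_packing = kleidiai_supported
        self.packed_rhs = pack_rhs(weights) if kleidiai_supported else weights

    def __call__(self, activations):
        # Inference time: pack the LHS, then run the matrix multiply
        # (the step the KleidiAI SME micro-kernel computes).
        lhs = pack_lhs(activations) if self.needs_lhs_packing else activations
        return lhs @ self.packed_rhs

op = FullyConnected(np.random.rand(128, 256).astype(np.float32), True)
y = op(np.random.rand(4, 128).astype(np.float32))  # shape (4, 256)
```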
content/learning-paths/mobile-graphics-and-gaming/litert-sme/3-build-model.md renamed to content/learning-paths/mobile-graphics-and-gaming/litert-sme/2-build-model.md

Lines changed: 3 additions & 5 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

## KleidiAI SME2 support in LiteRT

-LiteRT uses XNNPACK as its default CPU backend. KleidiAI micro-kernels are integrated through XNNPACK in LiteRT. Only a subset of KleidiAI Scalable Matrix Extension (SME and SME2) micro-kernels has been integrated into XNNPACK. These micro-kernels support operators using the following data types and quantization configurations in the LiteRT model. Other operators use XNNPACK's default implementation during inference.
+LiteRT uses XNNPACK as its default CPU backend. KleidiAI micro-kernels are integrated through XNNPACK in LiteRT. Only a subset of KleidiAI SME2 micro-kernels has been integrated into XNNPACK. These micro-kernels support operators using the following data types and quantization configurations in the LiteRT model. Other operators use XNNPACK's default implementation during inference.

### Supported operator configurations

@@ -34,8 +34,8 @@ LiteRT uses XNNPACK as its default CPU backend. KleidiAI micro-kernels are integ

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
-| FP32 | FP32, pointwise (kernel size is 1) | FP32 |
-| FP32 | FP16, pointwise (kernel size is 1) | FP32 |
+| FP32 | FP32 | FP32 |
+| FP32 | FP16 | FP32 |
| FP32 | Per-channel or per-tensor symmetric INT8 quantization | FP32 |
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |

@@ -101,8 +101,6 @@ adb shell chmod +x /data/local/tmp/fc_fp32.tflite

You can also optimize this Keras model using post-training quantization to create a LiteRT model that suits your requirements.

----
-
## Post-training quantization options

**Post-training FP16 quantization**

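For the FP16 option named above, the standard TensorFlow Lite converter flow looks like this; the toy single-layer model and output file name are illustrative stand-ins, not the exact model built in this Learning Path:

```python
import tensorflow as tf

# Illustrative stand-in for the Keras model built in this section:
# a single fully connected (Dense) layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128,)),
    tf.keras.layers.Dense(256),
])

# Post-training FP16 quantization: weights are stored as FP16 while
# activations stay FP32, matching the FP32 / FP16 / FP32 table row.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

with open("fc_fp16.tflite", "wb") as f:
    f.write(tflite_model)
```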
content/learning-paths/mobile-graphics-and-gaming/litert-sme/2-build-tool.md renamed to content/learning-paths/mobile-graphics-and-gaming/litert-sme/3-build-tool.md

Lines changed: 4 additions & 4 deletions
@@ -12,7 +12,7 @@ LiteRT provides a standalone performance measurement utility called `benchmark_m

In this section, you will build two versions of the benchmark tool:
- With KleidiAI and Scalable Matrix Extension version 2 (SME2) enabled, which uses Arm-optimized micro-kernels
-- Without KleidiAI and SME2, which provides baseline performance using NEON or SVE2 fallback
+- Without KleidiAI and SME2, which provides baseline performance using NEON micro-kernels

This comparison demonstrates the performance gains provided by SME2 acceleration.

@@ -23,9 +23,9 @@ cd $WORKSPACE
git clone https://github.com/google-ai-edge/LiteRT.git
```

-Because LiteRT integrates KleidiAI through XNNPACK (an open-source library providing highly optimized neural-network operators), you must build LiteRT from source to enable SME2 micro-kernels.
+Because LiteRT integrates KleidiAI through XNNPACK, you must build LiteRT from source to enable SME2 micro-kernels.

-Next, set up your Android build environment using Docker on your Linux development machine. Google provides a Dockerfile that installs the toolchain needed for TensorFlow Lite (TFLite)/LiteRT Android builds.
+Next, set up your Android build environment using Docker on your Linux development machine. Google provides a Dockerfile that installs the toolchain needed for LiteRT Android builds.

Download the Dockerfile:

@@ -129,7 +129,7 @@ ${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
--repo_env=HERMETIC_PYTHON_VERSION=3.12
```

-This build of the `benchmark_model` disables all SME2 micro-kernels and forces fallback to XNNPACK's NEON or SVE2 kernels.
+This build of the `benchmark_model` disables all SME2 micro-kernels and forces fallback to XNNPACK's NEON micro-kernels.

You can then use Android Debug Bridge (ADB) to push the benchmark tool to your Android device:

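Once both builds are on the device, the comparison can be scripted. A hedged sketch: the binary names and on-device paths are hypothetical, while `--graph` and `--num_threads` are standard `benchmark_model` flags:

```python
import subprocess

# Hypothetical on-device paths for the two builds described above.
MODEL = "/data/local/tmp/fc_fp32.tflite"
BINARIES = {
    "KleidiAI + SME2": "/data/local/tmp/benchmark_model_sme2",
    "NEON baseline": "/data/local/tmp/benchmark_model_neon",
}

for label, binary in BINARIES.items():
    cmd = ["adb", "shell", binary, f"--graph={MODEL}", "--num_threads=1"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"== {label} ==")
    print(result.stdout)
```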
content/learning-paths/mobile-graphics-and-gaming/litert-sme/4-benchmark.md

Lines changed: 5 additions & 2 deletions
@@ -1,3 +1,4 @@
+---
title: Benchmark the LiteRT model
weight: 5
### FIXED, DO NOT MODIFY
@@ -210,11 +211,13 @@ For other operators supported by KleidiAI, the per-operator profiling node types
| Fully Connected | Fully Connected (NC, QP8, F32, QC4W) | Fully Connected (NC, QD8, F32, QC4W) |
| Fully Connected / Conv2D (Pointwise) | Fully Connected (NC, QP8, F32, QC8W) | Fully Connected (NC, QD8, F32, QC8W) |
| Fully Connected / Conv2D (Pointwise) | Fully Connected (NC, PQS8, QC8W) | Fully Connected (NC, QS8, QC8W) |
+| Conv2D | Convolution (NHWC, PF32) | Convolution (NHWC, F32) |
+| Conv2D | Convolution (NHWC, PF16) | Convolution (NHWC, F16) |
+| Conv2D | Convolution (NHWC, PQS8, QS8, QC8W) | Convolution (NHWC, QC8) |
+| TransposeConv | Deconvolution (NHWC, PQS8, QS8, QC8W) | Deconvolution (NC, QS8, QC8W) |
| Batch Matrix Multiply | Batch Matrix Multiply (NC, PF32) | Batch Matrix Multiply (NC, F32) |
| Batch Matrix Multiply | Batch Matrix Multiply (NC, PF16) | Batch Matrix Multiply (NC, F16) |
| Batch Matrix Multiply | Batch Matrix Multiply (NC, QP8, F32, QC8W) | Batch Matrix Multiply (NC, QD8, F32, QC8W) |
-| Conv2D | Convolution (NHWC, PQS8, QS8, QC8W) | Convolution (NHWC, QC8) |
-| TransposeConv | Deconvolution (NHWC, PQS8, QS8, QC8W) | Deconvolution (NC, QS8, QC8W) |

The letter “P” in the node type indicates a KleidiAI implementation.

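Because the KleidiAI node types differ from the defaults only by that “P” tag, profiling output can be scanned mechanically. A small sketch grounded in the table above (the marker strings are taken from its rows):

```python
# KleidiAI node types in the table carry a "P" in the data-type tag:
# PF32, PF16, PQS8, or QP8.
KLEIDIAI_MARKERS = ("PF32", "PF16", "PQS8", "QP8")

def is_kleidiai_node(node_type: str) -> bool:
    """Return True if a profiling node type indicates a KleidiAI kernel."""
    return any(marker in node_type for marker in KLEIDIAI_MARKERS)

# Examples taken from the table:
assert is_kleidiai_node("Fully Connected (NC, QP8, F32, QC4W)")
assert is_kleidiai_node("Convolution (NHWC, PF32)")
assert not is_kleidiai_node("Batch Matrix Multiply (NC, F32)")
```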
content/learning-paths/mobile-graphics-and-gaming/litert-sme/_index.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
---
title: Accelerate LiteRT Models on Android with KleidiAI and SME2

-minutes_to_complete: 30
+minutes_to_complete: 45

who_is_this_for: This is an advanced topic for developers looking to leverage Arm's Scalable Matrix Extension 2 (SME2) instructions to accelerate LiteRT model inference on Android.
