update copyright, add README

bashbaug · bashbaug · commit af28503d6ea8 · 2026-02-22T21:11:32.000-08:00
diff --git a/samples/20_matrixexperiments-bf16/CMakeLists.txt b/samples/20_matrixexperiments-bf16/CMakeLists.txt
@@ -1,4 +1,4 @@
-# Copyright (c) 2024-2025 Ben Ashbaugh
+# Copyright (c) 2024-2026 Ben Ashbaugh
 #
 # SPDX-License-Identifier: MIT
 
diff --git a/samples/20_matrixexperiments-bf16/README.md b/samples/20_matrixexperiments-bf16/README.md
@@ -0,0 +1,60 @@
+# matrixexperiments-bf16
+
+## Sample Purpose
+
+This sample demonstrates various techniques to perform a large matrix multiplication where the matrix elements contain 16-bit `bfloat16` data.
+The sample includes many different implementations:
+
+1. The "naive" implementation is a very simple implementation.
+It is not very fast, but it is easy to understand, and it has no extension dependencies so it will run on many devices.
+2. The "dpas" kernels use sub-group extensions to improve performance.
+On some devices, they will also use specialized matrix multiplication extensions to further improve performance.
+Because these kernels require certain extensions or a specific sub-group size, they may not run on all devices.
+3. The "dpas blockread" kernels use additional sub-group extensions to further improve performance.
+
+Most of the optimized kernels operate on fixed size tiles of matrix data.
+For some of these kernels, parameters such as the number of matrix tiles per sub-group or the number of sub-groups per work-group may be modified via program build options.
+Experiment with different options to see what performs the best!
+
+A good place to start for some devices is:
+
+```sh
+./matrixexperiments-bf16 -m4096 --options="-DSGS_PER_WG_X=4 -DSGS_PER_WG_Y=8 -DKK=2 -cl-intel-256-GRF-per-thread" --zero
+```
+
+## Key APIs and Concepts
+
+This sample will optionally use the following OpenCL extensions:
+
+* cl_intel_bfloat16_conversions
+* cl_intel_required_subgroup_size
+* cl_intel_split_work_group_barrier
+* cl_intel_subgroup_2d_block_io
+* cl_intel_subgroup_matrix_multiply_accumulate
+* cl_intel_subgroups
+* cl_intel_subgroups_short
+
+## Command Line Options
+
+| Option | Default Value | Description |
+|:--|:-:|:--|
+| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
+| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
+| `--file <string>` | `matrix_kernels_bf16.cl` | Specify the name of the file with the OpenCL kernel source.
+| `--options <string>` | None | Specify optional program build options.
+| `--matrixsize <int>` | 512 | Specify the dimensions of the matrix.
+| `--iterations <int>` | 16 | Specify the number of iterations for performance testing.
+| `--validate` | n/a | Validate results for correctness.
+| `--zero` | n/a | Initialize all matrices to zero.
+| `--identity` | n/a | Initialize all matrices to one.
+| `--fixed` | n/a | Initialize all matrices to values computed from the matrix row and column.
+| `--emulate` | n/a | Do not use specialized matrix multiplication extensions.
+| `--wallclock` | n/a | Measure performance using wallclock time instead of event profiling.
+| `--skipinit` | n/a | Skip initialization of source matrices.
+| `--roundrobin` | n/a | Use round robin thread scheduling.
+| `--threshold <float>` | 0.01 | Set the threshold used when validating results.
+| `--mask <int>` | ~0 | Set a mask to only run a subset of tests.
+
+By default, the source matrices are populated with random data.
+When validating results, it is recommended to use either "fixed" or "identity" data.
+For best performance, use "zero" data.
diff --git a/samples/20_matrixexperiments-bf16/main.cpp b/samples/20_matrixexperiments-bf16/main.cpp
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-bf16/matrix_helpers_bf16.cl b/samples/20_matrixexperiments-bf16/matrix_helpers_bf16.cl
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-bf16/matrix_kernel_tiled_bf16.cl b/samples/20_matrixexperiments-bf16/matrix_kernel_tiled_bf16.cl
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-bf16/matrix_kernels_bf16.cl b/samples/20_matrixexperiments-bf16/matrix_kernels_bf16.cl
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-i8/CMakeLists.txt b/samples/20_matrixexperiments-i8/CMakeLists.txt
@@ -1,4 +1,4 @@
-# Copyright (c) 2024-2025 Ben Ashbaugh
+# Copyright (c) 2024-2026 Ben Ashbaugh
 #
 # SPDX-License-Identifier: MIT
 
diff --git a/samples/20_matrixexperiments-i8/README.md b/samples/20_matrixexperiments-i8/README.md
@@ -0,0 +1,60 @@
+# matrixexperiments-i8
+
+## Sample Purpose
+
+This sample demonstrates various techniques to perform a large matrix multiplication where the matrix elements contain 8-bit integer data.
+The sample includes many different implementations:
+
+1. The "naive" implementation is a very simple implementation.
+It is not very fast, but it is easy to understand, and it has no extension dependencies so it will run on many devices.
+2. The "dpas" kernels use sub-group extensions to improve performance.
+On some devices, they will also use specialized matrix multiplication extensions to further improve performance.
+Because these kernels require certain extensions or a specific sub-group size, they may not run on all devices.
+3. The "dpas blockread" kernels use additional sub-group extensions to further improve performance.
+
+Most of the optimized kernels operate on fixed size tiles of matrix data.
+For some of these kernels, parameters such as the number of matrix tiles per sub-group or the number of sub-groups per work-group may be modified via program build options.
+Experiment with different options to see what performs the best!
+
+Note, these kernels are not as highly tuned as the kernels for `bfloat16` and `tf32`!
+A good place to start for some devices is:
+
+```sh
+./matrixexperiments-i8 -m4096 --zero
+```
+
+## Key APIs and Concepts
+
+This sample will optionally use the following OpenCL extensions:
+
+* cl_intel_required_subgroup_size
+* cl_intel_split_work_group_barrier
+* cl_intel_subgroup_2d_block_io
+* cl_intel_subgroup_matrix_multiply_accumulate
+* cl_intel_subgroups
+* cl_intel_subgroups_char
+
+## Command Line Options
+
+| Option | Default Value | Description |
+|:--|:-:|:--|
+| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
+| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
+| `--file <string>` | `matrix_kernels_i8.cl` | Specify the name of the file with the OpenCL kernel source.
+| `--options <string>` | None | Specify optional program build options.
+| `--matrixsize <int>` | 512 | Specify the dimensions of the matrix.
+| `--iterations <int>` | 16 | Specify the number of iterations for performance testing.
+| `--validate` | n/a | Validate results for correctness.
+| `--zero` | n/a | Initialize all matrices to zero.
+| `--identity` | n/a | Initialize all matrices to one.
+| `--fixed` | n/a | Initialize all matrices to values computed from the matrix row and column.
+| `--emulate` | n/a | Do not use specialized matrix multiplication extensions.
+| `--wallclock` | n/a | Measure performance using wallclock time instead of event profiling.
+| `--skipinit` | n/a | Skip initialization of source matrices.
+| `--roundrobin` | n/a | Use round robin thread scheduling.
+| `--threshold <float>` | 0.01 | Set the threshold used when validating results.
+| `--mask <int>` | ~0 | Set a mask to only run a subset of tests.
+
+By default, the source matrices are populated with random data.
+When validating results, it is recommended to use either "fixed" or "identity" data.
+For best performance, use "zero" data.
diff --git a/samples/20_matrixexperiments-i8/main.cpp b/samples/20_matrixexperiments-i8/main.cpp
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-i8/matrix_helpers_i8.cl b/samples/20_matrixexperiments-i8/matrix_helpers_i8.cl
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
@@ -634,4 +634,4 @@ void store_c_rowmajor_int32_m8_nx(global int* C, int8 v, int rowStart, int colSt
     intel_sub_group_block_write(C_ui + offset, v_ui.s7); offset += stride;
 }
 
-#endif // defined(cl_intel_subgroups) && defined(cl_intel_subgroups_short)
+#endif // defined(cl_intel_subgroups) && defined(cl_intel_subgroups_char)
diff --git a/samples/20_matrixexperiments-i8/matrix_kernels_i8.cl b/samples/20_matrixexperiments-i8/matrix_kernels_i8.cl
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-tf32/CMakeLists.txt b/samples/20_matrixexperiments-tf32/CMakeLists.txt
@@ -1,4 +1,4 @@
-# Copyright (c) 2024-2025 Ben Ashbaugh
+# Copyright (c) 2024-2026 Ben Ashbaugh
 #
 # SPDX-License-Identifier: MIT
 
diff --git a/samples/20_matrixexperiments-tf32/README.md b/samples/20_matrixexperiments-tf32/README.md
@@ -0,0 +1,58 @@
+# matrixexperiments-tf32
+
+## Sample Purpose
+
+This sample demonstrates various techniques to perform a large matrix multiplication where the matrix elements contain 32-bit `tf32` data.
+The sample includes many different implementations:
+
+1. The "naive" implementation is a very simple implementation.
+It is not very fast, but it is easy to understand, and it has no extension dependencies so it will run on many devices.
+2. The "dpas" kernels use sub-group extensions to improve performance.
+On some devices, they will also use specialized matrix multiplication extensions to further improve performance.
+Because these kernels require certain extensions or a specific sub-group size, they may not run on all devices.
+3. The "dpas blockread" kernels use additional sub-group extensions to further improve performance.
+
+Most of the optimized kernels operate on fixed size tiles of matrix data.
+For some of these kernels, parameters such as the number of matrix tiles per sub-group or the number of sub-groups per work-group may be modified via program build options.
+Experiment with different options to see what performs the best!
+
+A good place to start for some devices is:
+
+```sh
+./matrixexperiments-tf32 -m4096 --options="-DSGS_PER_WG_X=4 -DSGS_PER_WG_Y=8 -DKK=2 -cl-intel-256-GRF-per-thread" --zero
+```
+
+## Key APIs and Concepts
+
+This sample will optionally use the following OpenCL extensions:
+
+* cl_intel_required_subgroup_size
+* cl_intel_split_work_group_barrier
+* cl_intel_subgroup_2d_block_io
+* cl_intel_subgroup_matrix_multiply_accumulate_tf32
+* cl_intel_subgroups
+
+## Command Line Options
+
+| Option | Default Value | Description |
+|:--|:-:|:--|
+| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
+| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
+| `--file <string>` | `matrix_kernels_tf32.cl` | Specify the name of the file with the OpenCL kernel source.
+| `--options <string>` | None | Specify optional program build options.
+| `--matrixsize <int>` | 512 | Specify the dimensions of the matrix.
+| `--iterations <int>` | 16 | Specify the number of iterations for performance testing.
+| `--validate` | n/a | Validate results for correctness.
+| `--zero` | n/a | Initialize all matrices to zero.
+| `--identity` | n/a | Initialize all matrices to one.
+| `--fixed` | n/a | Initialize all matrices to values computed from the matrix row and column.
+| `--emulate` | n/a | Do not use specialized matrix multiplication extensions.
+| `--wallclock` | n/a | Measure performance using wallclock time instead of event profiling.
+| `--skipinit` | n/a | Skip initialization of source matrices.
+| `--roundrobin` | n/a | Use round robin thread scheduling.
+| `--threshold <float>` | 0.01 | Set the threshold used when validating results.
+| `--mask <int>` | ~0 | Set a mask to only run a subset of tests.
+
+By default, the source matrices are populated with random data.
+When validating results, it is recommended to use either "fixed" or "identity" data.
+For best performance, use "zero" data.
diff --git a/samples/20_matrixexperiments-tf32/main.cpp b/samples/20_matrixexperiments-tf32/main.cpp
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2019-2024 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-tf32/matrix_helpers_tf32.cl b/samples/20_matrixexperiments-tf32/matrix_helpers_tf32.cl
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-tf32/matrix_kernel_tiled_tf32.cl b/samples/20_matrixexperiments-tf32/matrix_kernel_tiled_tf32.cl
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
diff --git a/samples/20_matrixexperiments-tf32/matrix_kernels_tf32.cl b/samples/20_matrixexperiments-tf32/matrix_kernels_tf32.cl
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
