Commit af28503

update copyright, add README
1 parent a348dea commit af28503

17 files changed, +193 -15 lines changed

samples/20_matrixexperiments-bf16/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# Copyright (c) 2024-2025 Ben Ashbaugh
+# Copyright (c) 2024-2026 Ben Ashbaugh
 #
 # SPDX-License-Identifier: MIT

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# matrixexperiments-bf16
+
+## Sample Purpose
+
+This sample demonstrates various techniques to perform a large matrix multiplication where the matrix elements contain 16-bit `bfloat16` data.
+The sample includes many different implementations:
+
+1. The "naive" implementation is a very simple implementation.
+It is not very fast, but it is easy to understand, and it has no extension dependencies so it will run on many devices.
+2. The "dpas" kernels use sub-group extensions to improve performance.
+On some devices, they will also use specialized matrix multiplication extensions to further improve performance.
+Because these kernels require certain extensions or a specific sub-group size, they may not run on all devices.
+3. The "dpas blockread" kernels use additional sub-group extensions to further improve performance.
+
+Most of the optimized kernels operate on fixed-size tiles of matrix data.
+For some of these kernels, parameters such as the number of matrix tiles per sub-group or the number of sub-groups per work-group may be modified via program build options.
+Experiment with different options to see what performs the best!
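
For readers who want a host-side picture of what the "naive" item describes, the following is a minimal C++ sketch, not the sample's OpenCL kernel. It assumes row-major storage, inputs held as raw `bfloat16` bit patterns, and accumulation in `float`; the names `widen_bf16` and `naive_matmul_bf16` are hypothetical and do not appear in the commit.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Widen a bfloat16 bit pattern to float: the 16 bits become the upper half
// of an IEEE-754 binary32 value, the lower 16 bits are zero.
inline float widen_bf16(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Hypothetical reference: C (M x N) = A (M x K) * B (K x N), all row-major.
// One output element per (m, n) pair, one multiply-add per k.
inline std::vector<float> naive_matmul_bf16(const std::vector<std::uint16_t>& A,
                                            const std::vector<std::uint16_t>& B,
                                            std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float sum = 0.0f;  // accumulate in float, not in bfloat16
            for (std::size_t k = 0; k < K; ++k) {
                sum += widen_bf16(A[m * K + k]) * widen_bf16(B[k * N + n]);
            }
            C[m * N + n] = sum;
        }
    }
    return C;
}
```

A loop nest like this is mainly useful as a reference to compare optimized results against, which is presumably the role the naive kernel also plays in the sample.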
+
+A good place to start for some devices is:
+
+```sh
+./matrixexperiments-bf16 -m4096 --options="-DSGS_PER_WG_X=4 -DSGS_PER_WG_Y=8 -DKK=2 -cl-intel-256-GRF-per-thread" --zero
+```
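
When sweeping several configurations, it can help to generate the `--options` string rather than typing it by hand. The sketch below is an assumption-laden helper, not part of the sample: the macro names and the GRF flag are copied from the command above, while the function name `make_build_options` and its parameters are hypothetical.

```cpp
#include <string>

// Hypothetical helper: assemble the text passed to the sample via --options.
// Only the macro names and the GRF flag come from the README example.
inline std::string make_build_options(int sgsPerWgX, int sgsPerWgY, int kk, bool largeGrf) {
    std::string opts = "-DSGS_PER_WG_X=" + std::to_string(sgsPerWgX) +
                       " -DSGS_PER_WG_Y=" + std::to_string(sgsPerWgY) +
                       " -DKK=" + std::to_string(kk);
    if (largeGrf) {
        // Intel-specific compiler flag taken verbatim from the example command.
        opts += " -cl-intel-256-GRF-per-thread";
    }
    return opts;
}
```

Calling `make_build_options(4, 8, 2, true)` reproduces the option string from the example command, so a script could loop over other values and launch the sample once per combination.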
+
+## Key APIs and Concepts
+
+This sample will optionally use the following OpenCL extensions:
+
+* cl_intel_bfloat16_conversions
+* cl_intel_required_subgroup_size
+* cl_intel_split_work_group_barrier
+* cl_intel_subgroup_2d_block_io
+* cl_intel_subgroup_matrix_multiply_accumulate
+* cl_intel_subgroups
+* cl_intel_subgroups_short
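
The first extension in the list concerns conversions between `float` and `bfloat16`. As background on the format itself, here is a host-side C++ sketch of such a conversion. It is not the extension's API: the function names are hypothetical, and round-to-nearest-even is an assumption about the rounding mode rather than something this commit states.

```cpp
#include <cstdint>
#include <cstring>

// bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa
// bits of an IEEE-754 binary32 value, i.e. the upper 16 bits of the float.

// Hypothetical helper: float -> bfloat16 bits with round-to-nearest-even.
inline std::uint16_t float_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: force a mantissa bit so the truncated value is still a NaN.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    // Add 0x7fff plus the lowest kept bit, so exact ties round to even.
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: bfloat16 bits -> float (exact, no rounding needed).
inline float bf16_to_float(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

Because only 7 mantissa bits survive, values such as `1.00390625f` (1 + 2^-8) are not representable and round back to `1.0f`, which is one reason the sample exposes a validation threshold.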
+
+## Command Line Options
+
+| Option | Default Value | Description |
+|:--|:-:|:--|
+| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
+| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
+| `--file <string>` | `matrix_kernels_bf16.cl` | Specify the name of the file with the OpenCL kernel source.
+| `--options <string>` | None | Specify optional program build options.
+| `--matrixsize <int>` | 512 | Specify the dimensions of the matrix.
+| `--iterations <int>` | 16 | Specify the number of iterations for performance testing.
+| `--validate` | n/a | Validate results for correctness.
+| `--zero` | n/a | Initialize all matrices to zero.
+| `--identity` | n/a | Initialize all matrices to one.
+| `--fixed` | n/a | Initialize all matrices to values computed from the matrix row and column.
+| `--emulate` | n/a | Do not use specialized matrix multiplication extensions.
+| `--wallclock` | n/a | Measure performance using wallclock time instead of event profiling.
+| `--skipinit` | n/a | Skip initialization of source matrices.
+| `--roundrobin` | n/a | Use round robin thread scheduling.
+| `--threshold <float>` | 0.01 | Set the threshold used when validating results.
+| `--mask <int>` | ~0 | Set a mask to only run a subset of tests.
+
+By default, the source matrices are populated with random data.
+When validating results, it is recommended to use either "fixed" or "identity" data.
+For best performance, use "zero" data.
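
The `--validate` and `--threshold` options imply a tolerance-based comparison against reference results. How the sample actually computes the error is not shown in this diff, so the following C++ sketch is only one plausible reading: a relative-error check that falls back to absolute error near zero. The function name `count_mismatches` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: count elements whose error exceeds the threshold.
// The error is scaled by the larger magnitude of the two values, floored
// at 1 so that values near zero are compared by absolute error instead.
inline std::size_t count_mismatches(const std::vector<float>& expected,
                                    const std::vector<float>& actual,
                                    float threshold) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < expected.size() && i < actual.size(); ++i) {
        const float diff = std::fabs(expected[i] - actual[i]);
        const float scale =
            std::fmax(1.0f, std::fmax(std::fabs(expected[i]), std::fabs(actual[i])));
        if (diff > threshold * scale) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With the default threshold of 0.01, a result of 100.5 against an expected 100 would pass under this scheme, while 2.1 against an expected 2 would be flagged.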

samples/20_matrixexperiments-bf16/main.cpp

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */

samples/20_matrixexperiments-bf16/matrix_helpers_bf16.cl

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */

samples/20_matrixexperiments-bf16/matrix_kernel_tiled_bf16.cl

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */

samples/20_matrixexperiments-bf16/matrix_kernels_bf16.cl

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */

samples/20_matrixexperiments-i8/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# Copyright (c) 2024-2025 Ben Ashbaugh
+# Copyright (c) 2024-2026 Ben Ashbaugh
 #
 # SPDX-License-Identifier: MIT

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# matrixexperiments-i8
+
+## Sample Purpose
+
+This sample demonstrates various techniques to perform a large matrix multiplication where the matrix elements contain 8-bit integer data.
+The sample includes many different implementations:
+
+1. The "naive" implementation is a very simple implementation.
+It is not very fast, but it is easy to understand, and it has no extension dependencies so it will run on many devices.
+2. The "dpas" kernels use sub-group extensions to improve performance.
+On some devices, they will also use specialized matrix multiplication extensions to further improve performance.
+Because these kernels require certain extensions or a specific sub-group size, they may not run on all devices.
+3. The "dpas blockread" kernels use additional sub-group extensions to further improve performance.
+
+Most of the optimized kernels operate on fixed-size tiles of matrix data.
+For some of these kernels, parameters such as the number of matrix tiles per sub-group or the number of sub-groups per work-group may be modified via program build options.
+Experiment with different options to see what performs the best!
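
As a host-side illustration of the 8-bit integer case, the sketch below multiplies `int8` matrices and accumulates into 32-bit integers, matching the `int` result type visible in the `store_c_rowmajor_int32_m8_nx` helper touched by this commit. It is a plain C++ reference under the assumption of row-major storage, not the sample's kernel, and the name `naive_matmul_i8` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference: C (M x N) = A (M x K) * B (K x N), all row-major.
// Inputs are signed 8-bit values; each product is widened to 32 bits before
// it is added, so the sum does not wrap at the 8-bit range.
inline std::vector<std::int32_t> naive_matmul_i8(const std::vector<std::int8_t>& A,
                                                 const std::vector<std::int8_t>& B,
                                                 std::size_t M, std::size_t N, std::size_t K) {
    std::vector<std::int32_t> C(M * N, 0);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            std::int32_t sum = 0;
            for (std::size_t k = 0; k < K; ++k) {
                sum += static_cast<std::int32_t>(A[m * K + k]) *
                       static_cast<std::int32_t>(B[k * N + n]);
            }
            C[m * N + n] = sum;
        }
    }
    return C;
}
```

Integer results are exact as long as the 32-bit accumulator does not overflow, so a reference like this can be compared element by element without a tolerance.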
+
+Note, these kernels are not as highly tuned as the kernels for `bfloat16` and `tf32`!
+A good place to start for some devices is:
+
+```sh
+./matrixexperiments-i8 -m4096 --zero
+```
+
+## Key APIs and Concepts
+
+This sample will optionally use the following OpenCL extensions:
+
+* cl_intel_required_subgroup_size
+* cl_intel_split_work_group_barrier
+* cl_intel_subgroup_2d_block_io
+* cl_intel_subgroup_matrix_multiply_accumulate
+* cl_intel_subgroups
+* cl_intel_subgroups_char
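
To give a feel for the kind of arithmetic a matrix multiply-accumulate path performs on 8-bit data, and for what a software fallback in the spirit of `--emulate` might do per element, here is a scalar C++ sketch of a 4-way `int8` dot product with accumulation. The packing of four values per 32-bit word with the first value in the low byte is an assumption for illustration, not the layout defined by the extension, and both function names are hypothetical.

```cpp
#include <cstdint>

// Pack four signed 8-bit values into one 32-bit word, first value in the
// low byte (an assumed layout, chosen only for this illustration).
inline std::uint32_t pack4_i8(std::int8_t a0, std::int8_t a1, std::int8_t a2, std::int8_t a3) {
    return  static_cast<std::uint32_t>(static_cast<std::uint8_t>(a0)) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(a1)) << 8) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(a2)) << 16) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(a3)) << 24);
}

// Return acc plus the sum of the four byte-wise products of a and b.
inline std::int32_t dot4_i8_accumulate(std::uint32_t a, std::uint32_t b, std::int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        // Extract byte i and reinterpret it as a signed 8-bit value.
        const auto ai = static_cast<std::int8_t>(static_cast<std::uint8_t>((a >> (8 * i)) & 0xffu));
        const auto bi = static_cast<std::int8_t>(static_cast<std::uint8_t>((b >> (8 * i)) & 0xffu));
        acc += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return acc;
}
```

Hardware matrix instructions perform many such dot products at once across a sub-group; the scalar loop only shows the per-element math.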
+
+## Command Line Options
+
+| Option | Default Value | Description |
+|:--|:-:|:--|
+| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
+| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
+| `--file <string>` | `matrix_kernels_i8.cl` | Specify the name of the file with the OpenCL kernel source.
+| `--options <string>` | None | Specify optional program build options.
+| `--matrixsize <int>` | 512 | Specify the dimensions of the matrix.
+| `--iterations <int>` | 16 | Specify the number of iterations for performance testing.
+| `--validate` | n/a | Validate results for correctness.
+| `--zero` | n/a | Initialize all matrices to zero.
+| `--identity` | n/a | Initialize all matrices to one.
+| `--fixed` | n/a | Initialize all matrices to values computed from the matrix row and column.
+| `--emulate` | n/a | Do not use specialized matrix multiplication extensions.
+| `--wallclock` | n/a | Measure performance using wallclock time instead of event profiling.
+| `--skipinit` | n/a | Skip initialization of source matrices.
+| `--roundrobin` | n/a | Use round robin thread scheduling.
+| `--threshold <float>` | 0.01 | Set the threshold used when validating results.
+| `--mask <int>` | ~0 | Set a mask to only run a subset of tests.
+
+By default, the source matrices are populated with random data.
+When validating results, it is recommended to use either "fixed" or "identity" data.
+For best performance, use "zero" data.
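
The `--mask` option with its default of `~0` suggests a bitmask in which every test is enabled unless its bit is cleared. The exact assignment of bits to kernels is not shown in this diff, so the C++ sketch below simply assumes one bit per test in execution order; `test_enabled` and `enabled_tests` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical rule: test number `index` runs when bit `index` of the mask
// is set. Indices beyond the 32-bit mask width are treated as disabled.
inline bool test_enabled(std::uint32_t mask, unsigned index) {
    return index < 32 && ((mask >> index) & 1u) != 0;
}

// Collect the indices of all enabled tests among the first `testCount` tests.
inline std::vector<unsigned> enabled_tests(std::uint32_t mask, unsigned testCount) {
    std::vector<unsigned> result;
    for (unsigned i = 0; i < testCount; ++i) {
        if (test_enabled(mask, i)) {
            result.push_back(i);
        }
    }
    return result;
}
```

Under this assumption, the default mask runs everything, while a mask of `5` would run only the first and third tests.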

samples/20_matrixexperiments-i8/main.cpp

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */

samples/20_matrixexperiments-i8/matrix_helpers_i8.cl

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 /*
-// Copyright (c) 2024-2025 Ben Ashbaugh
+// Copyright (c) 2024-2026 Ben Ashbaugh
 //
 // SPDX-License-Identifier: MIT
 */
@@ -634,4 +634,4 @@ void store_c_rowmajor_int32_m8_nx(global int* C, int8 v, int rowStart, int colSt
 intel_sub_group_block_write(C_ui + offset, v_ui.s7); offset += stride;
 }

-#endif // defined(cl_intel_subgroups) && defined(cl_intel_subgroups_short)
+#endif // defined(cl_intel_subgroups) && defined(cl_intel_subgroups_char)
