|
| 1 | +# matrixexperiments-bf16 |
| 2 | + |
| 3 | +## Sample Purpose |
| 4 | + |
| 5 | +This sample demonstrates various techniques to perform a large matrix multiplication where the matrix elements contain 16-bit `bfloat16` data. |
| 6 | +The sample includes many different implementations: |
| 7 | + |
| 8 | +1. The "naive" implementation is a very simple implementation. |
| 9 | +It is not very fast, but it is easy to understand, and it has no extension dependencies so it will run on many devices. |
| 10 | +2. The "dpas" kernels use sub-group extensions to improve performance. |
| 11 | +On some devices, they will also use specialized matrix multiplication extensions to further improve performance. |
| 12 | +Because these kernels require certain extensions or a specific sub-group size, they may not run on all devices. |
| 13 | +3. The "dpas blockread" kernels use additional sub-group extensions to further improve performance. |
| 14 | + |
| 15 | +Most of the optimized kernels operate on fixed size tiles of matrix data. |
| 16 | +For some of these kernels, parameters such as the number of matrix tiles per sub-group or the number of sub-groups per work-group may be modified via program build options. |
| 17 | +Experiment with different options to see what performs the best! |
| 18 | + |
| 19 | +A good place to start for some devices is: |
| 20 | + |
| 21 | +```sh |
| 22 | +./matrixexperiments-bf16 -m4096 --options="-DSGS_PER_WG_X=4 -DSGS_PER_WG_Y=8 -DKK=2 -cl-intel-256-GRF-per-thread" --zero |
| 23 | +``` |
| 24 | + |
| 25 | +## Key APIs and Concepts |
| 26 | + |
| 27 | +This sample will optionally use the following OpenCL extensions: |
| 28 | + |
| 29 | +* cl_intel_bfloat16_conversions |
| 30 | +* cl_intel_required_subgroup_size |
| 31 | +* cl_intel_split_work_group_barrier |
| 32 | +* cl_intel_subgroup_2d_block_io |
| 33 | +* cl_intel_subgroup_matrix_multiply_accumulate |
| 34 | +* cl_intel_subgroups |
| 35 | +* cl_intel_subgroups_short |
| 36 | + |
| 37 | +## Command Line Options |
| 38 | + |
| 39 | +| Option | Default Value | Description | |
| 40 | +|:--|:-:|:--| |
| 41 | +| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on. |
| 42 | +| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on. |
| 43 | +| `--file <string>` | `matrix_kernels_bf16.cl` | Specify the name of the file with the OpenCL kernel source. |
| 44 | +| `--options <string>` | None | Specify optional program build options. |
| 45 | +| `--matrixsize <int>` | 512 | Specify the dimensions of the matrix. |
| 46 | +| `--iterations <int>` | 16 | Specify the number of iterations for performance testing. |
| 47 | +| `--validate` | n/a | Validate results for correctness. |
| 48 | +| `--zero` | n/a | Initialize all matrices to zero. |
| 49 | +| `--identity` | n/a | Initialize all matrices to one. |
| 50 | +| `--fixed` | n/a | Initialize all matrices to values computed from the matrix row and column. |
| 51 | +| `--emulate` | n/a | Do not use specialized matrix multiplication extensions. |
| 52 | +| `--wallclock` | n/a | Measure performance using wallclock time instead of event profiling. |
| 53 | +| `--skipinit` | n/a | Skip initialization of source matrices. |
| 54 | +| `--roundrobin` | n/a | Use round robin thread scheduling. |
| 55 | +| `--threshold <float>` | 0.01 | Set the threshold used when validating results. |
| 56 | +| `--mask <int>` | ~0 | Set a mask to only run a subset of tests. |
| 57 | + |
| 58 | +By default, the source matrices are populated with random data. |
| 59 | +When validating results, it is recommended to use either "fixed" or "identity" data. |
| 60 | +For best performance, use "zero" data. |
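
To make the "naive" implementation and the validation threshold more concrete, here is a minimal host-side sketch in plain C. It is not taken from the sample's kernels. The names `bf16_t`, `float_to_bf16`, `bf16_to_float`, `naive_matmul_bf16`, and `within_threshold` are hypothetical, and the row-major layout, `float` accumulation, and relative-error interpretation of the threshold are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type: a bfloat16 value is the upper 16 bits
 * of an IEEE-754 single-precision float. */
typedef uint16_t bf16_t;

/* Convert float to bfloat16 using round-to-nearest-even.
 * NaN inputs are not special-cased in this sketch. */
static bf16_t float_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    /* Add 0x7FFF plus the lowest kept bit so that ties round to even. */
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (bf16_t)(bits >> 16);
}

/* Convert bfloat16 back to float by zero-filling the low 16 bits. */
static float bf16_to_float(bf16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Naive C = A * B with A (M x K) and B (K x N) stored row-major as
 * bfloat16 and C (M x N) accumulated in float. The layout and the
 * accumulation type are assumptions, not the sample's confirmed design. */
static void naive_matmul_bf16(float *C, const bf16_t *A, const bf16_t *B,
                              int M, int N, int K)
{
    assert(C && A && B && M > 0 && N > 0 && K > 0);
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++) {
                sum += bf16_to_float(A[m * K + k]) *
                       bf16_to_float(B[k * N + n]);
            }
            C[m * N + n] = sum;
        }
    }
}

/* Assumed validation rule: accept when the error relative to the
 * expected value is at most `threshold` (absolute error if the
 * expected value is zero). */
static int within_threshold(float got, float want, float threshold)
{
    float diff = got - want;
    if (diff < 0.0f) diff = -diff;
    float scale = want < 0.0f ? -want : want;
    if (scale == 0.0f) return diff <= threshold;
    return diff <= threshold * scale;
}
```

Under these assumptions, filling both source matrices with ones makes every output element equal to K, which is easy to check by hand. That predictability is one plausible reason why fixed-pattern inputs are simpler to validate than random data.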