The aim to realise the performance gain of a matrix multiplication code when we go from a CPU to TPU. running the code
python3 plot.py
The GPU under test is RTX 3050 mobile.
The CPU under test is AMD Ryzen 7.
===== CPU Info =====
CPU Model: AMD Ryzen 7 5800H with Radeon Graphics
Vendor: AuthenticAMD
Physical Cores: 8
Logical Cores: 16
===== Cache Sizes =====
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 4 MiB
L3 cache: 16 MiB
===== Prefetchers =====
Prefetcher check not supported for vendor: AuthenticAMD (likely AMD)
Hint: Check BIOS/UEFI settings for prefetcher controls on AMD CPUs.
Metrics to record:
- Execution time
- Number of instruction
- CPU cache accesses
- L1D :
Cache hit,Cache miss,MPKI(cal) - L2 :
Cache hit,Cache miss,MPKI(cal) - LLC :
Cache hit,Cache miss,MPKI(cal)
- L1D :
Code version:
CPU: Optimisation strategies:
vanilla
Data layout and locality:
loop_reordering
loop_unrolling
Cache optimisation:
blocking/tiling
prefetching
SIMD:
simd
Compiler and instruction-level optimisation:
Multi-threading and parallelism
Algorithmic Improvements
The code is very simple. It consists of three loops:
- The first loop iterates over the number of rows in matrix A.
- The second loop iterates over the number of columns in matrix B.
- The third loop performs the operation to generate each element in matrix C using matrices A and B. ![[figures/mat_mul.svg]]
for(int i = 0; i<size; i++){ // select a row in A
for(int j = 0; j<size; j++){ // select a col in B
for(int k = 0; k<size; k++){ // no. of operation for ele in C
C[i*size+j] += A[i*size+k]*B[k*size+j];
}
}
}The matrix is stored in a row-major format, meaning the elements are organized in the order A0, A1, A2, A3, A4, and so on. This organization provides locality for these indexes. To take advantage of this, we will access arrays B and C in a similar manner. While the access pattern for array A will remain unchanged, instead of accessing each column to compute an element of C, we will iterate through matrix B row by row and calculate the partial sums of C during each row iteration.
To implement this, simply switch the order of the for loops for j and k.
for(int i = 0; i<size; i++){ // select a row in A
for(int j = 0; j<size; j++){ // select a col in B
for(int k = 0; k<size; k++){ // no. of operation for ele in C
C[i*size+j] += A[i*size+k]*B[k*size+j];
}
}
}![[figures/execution_time_plot.png]]
The idea is rather than working on the whole matrix, work on a smaller sub-matrix. ![[figures/blocking.png]]