perfCtoGPU

The aim is to quantify the performance gain of a matrix multiplication code when moving from a CPU to a TPU. To run the code:

python3 plot.py

The GPU under test is an RTX 3050 Mobile.

The CPU under test is an AMD Ryzen 7 5800H.

===== CPU Info =====
CPU Model: AMD Ryzen 7 5800H with Radeon Graphics
Vendor: AuthenticAMD
Physical Cores: 8
Logical Cores: 16

===== Cache Sizes =====
L1d cache:                            256 KiB
L1i cache:                            256 KiB
L2 cache:                             4 MiB
L3 cache:                             16 MiB

===== Prefetchers =====
Prefetcher check not supported for vendor: AuthenticAMD (likely AMD)
Hint: Check BIOS/UEFI settings for prefetcher controls on AMD CPUs.

Metrics to record:

- Execution time
- Number of instructions
- CPU cache accesses
  - L1D: cache hits, cache misses, MPKI (calculated)
  - L2: cache hits, cache misses, MPKI (calculated)
  - LLC: cache hits, cache misses, MPKI (calculated)

Code version:

CPU optimisation strategies:

- Baseline: vanilla
- Data layout and locality: loop_reordering, loop_unrolling
- Cache optimisation: blocking/tiling, prefetching
- SIMD: simd
- Compiler and instruction-level optimisation
- Multi-threading and parallelism
- Algorithmic improvements

Vanilla Matrix multiplication code: `vanilla`

The code consists of three nested loops:

- The outer loop iterates over the rows of matrix A.
- The middle loop iterates over the columns of matrix B.
- The inner loop accumulates the products of a row of A and a column of B to produce one element of matrix C.

![[figures/mat_mul.svg]]

	for(int i = 0; i<size; i++){						// select a row of A
		for(int j = 0; j<size; j++){					// select a column of B
			for(int k = 0; k<size; k++){				// accumulate the dot product for C[i][j]
				C[i*size+j] += A[i*size+k]*B[k*size+j];
			}
		}
	}

Loop reordering: `loop_reordering`

The matrices are stored in row-major format, so consecutive elements of a row (A0, A1, A2, A3, A4, and so on) sit next to each other in memory, and accessing them in that order gives good spatial locality. In the vanilla version, the inner loop walks down a column of B, so successive accesses to B are `size` elements apart. To take advantage of the layout, we access B and C row by row as well. A is still read along its rows, but instead of walking a column of B to compute one element of C, we iterate through B row by row and accumulate partial sums into a row of C.

To implement this, swap the order of the `j` and `k` loops.

	for(int i = 0; i<size; i++){						// select a row of A
		for(int k = 0; k<size; k++){					// select an element in that row of A and a row of B
			for(int j = 0; j<size; j++){				// add the partial sums to row i of C
				C[i*size+j] += A[i*size+k]*B[k*size+j];
			}
		}
	}

![[figures/execution_time_plot.png]]

Blocking/Tiling: `blocking`

The idea is to work on smaller sub-matrices (blocks, or tiles) rather than on the whole matrix at once, so that the data in use can stay in cache while it is reused.

![[figures/blocking.png]]

