Skip to content

sroy729/perfCtoGPU

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

perfCtoGPU

The aim to realise the performance gain of a matrix multiplication code when we go from a CPU to TPU. running the code

python3 plot.py 

The GPU under test is RTX 3050 mobile.

The CPU under test is AMD Ryzen 7.

===== CPU Info =====
CPU Model: AMD Ryzen 7 5800H with Radeon Graphics
Vendor: AuthenticAMD
Physical Cores: 8
Logical Cores: 16

===== Cache Sizes =====
L1d cache:                            256 KiB
L1i cache:                            256 KiB
L2 cache:                             4 MiB
L3 cache:                             16 MiB

===== Prefetchers =====
Prefetcher check not supported for vendor: AuthenticAMD (likely AMD)
Hint: Check BIOS/UEFI settings for prefetcher controls on AMD CPUs.

Metrics to record:

  • Execution time
  • Number of instruction
  • CPU cache accesses
    • L1D : Cache hit, Cache miss, MPKI(cal)
    • L2 : Cache hit, Cache miss, MPKI(cal)
    • LLC : Cache hit, Cache miss, MPKI(cal)

Code version:

CPU: Optimisation strategies: vanilla Data layout and locality: loop_reordering loop_unrolling Cache optimisation: blocking/tiling prefetching SIMD: simd Compiler and instruction-level optimisation: Multi-threading and parallelism Algorithmic Improvements

Vanilla Matrix multiplication code: vanilla

The code is very simple. It consists of three loops:

  • The first loop iterates over the number of rows in matrix A.
  • The second loop iterates over the number of columns in matrix B.
  • The third loop performs the operation to generate each element in matrix C using matrices A and B. ![[figures/mat_mul.svg]]
	for(int i = 0; i<size; i++){						// select a row in A
		for(int j = 0; j<size; j++){					// select a col in B
			for(int k = 0; k<size; k++){				// no. of operation for ele in C
				C[i*size+j] += A[i*size+k]*B[k*size+j];
			}
		}
	}

Loop reordering: loop_reordering

The matrix is stored in a row-major format, meaning the elements are organized in the order A0, A1, A2, A3, A4, and so on. This organization provides locality for these indexes. To take advantage of this, we will access arrays B and C in a similar manner. While the access pattern for array A will remain unchanged, instead of accessing each column to compute an element of C, we will iterate through matrix B row by row and calculate the partial sums of C during each row iteration.

To implement this, simply switch the order of the for loops for j and k.

for(int i = 0; i<size; i++){						// select a row in A
		for(int j = 0; j<size; j++){					// select a col in B
			for(int k = 0; k<size; k++){				// no. of operation for ele in C
				C[i*size+j] += A[i*size+k]*B[k*size+j];
			}
		}
	}

![[figures/execution_time_plot.png]]

Blocking/Tiling: blocking

The idea is rather than working on the whole matrix, work on a smaller sub-matrix. ![[figures/blocking.png]]

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published