This repository is the starter code for the code optimization assignment. The main job for students is to replace the placeholder stu_* implementations in src/kernel/*.cpp with faster versions, keep the results correct, and use the benchmark harness to measure the speedup.
CSC3060_Project5_Code_Optimization/
├── CMakeLists.txt # Build rules for the kernels library and executables
├── README.md # This guide
├── perfUsage.md # How to use Linux perf for this project
├── include/ # Shared headers and benchmark interfaces
│ ├── bench.h # Common benchmark harness and timing/check helpers
│ ├── blackscholes.h # Benchmark-specific args, baseline, wrappers, checker
│ ├── bitwise.h
│ ├── filter_gradient.h
│ ├── graph.h
│ ├── grff.h
│ ├── image_proc.h
│ ├── matmul.h
│ ├── relu.h
│ ├── sparse_spmm.h
│ └── trace_replay.h
├── src/
│ ├── kernel/ # Naive reference code and student code live here
│ │ ├── blackscholes.cpp
│ │ ├── bitwise.cpp
│ │ ├── filter_gradient.cpp
│ │ ├── graph.cpp
│ │ ├── grff.cpp
│ │ ├── image_proc.cpp
│ │ ├── matmul.cpp
│ │ ├── relu.cpp
│ │ ├── sparse_spmm.cpp
│ │ └── trace_replay.cpp
│ ├── main/
│ │ ├── single_bench.cpp # Run a small benchmark set while tuning/debugging
│ │ └── run_all.cpp # Run the full suite and report speedups
│ └── res/ # Images used by perfUsage.md
├── .clang-format # Optional formatting config
└── .clang-tidy # Optional lint config
What each directory is for:
include/: One header per benchmark. Each header usually defines the argument struct, baseline time, initialization helper, naive implementation declaration, student implementation declaration, wrapper functions, and correctness checker.src/kernel/: The actual benchmark implementations. This is the directory students will edit most often.src/main/: The programs that run the benchmarks.src/res/: Screenshot assets used byperfUsage.md. These are not benchmark inputs.
Notes:
single_bench.cppis the easier place to debug one kernel at a time.run_all.cppregisters the full benchmark suite and is the better entry point forperf.
Most benchmarks follow the same pattern:
include/<name>.h: declares the benchmark argument struct,BASELINE_*,initialize_*,naive_*,stu_*, wrapper functions, and the checker.src/kernel/<name>.cpp: defines the naive implementation, the placeholder student implementation, wrappers, initialization logic, and result checking.
In other words, students normally edit src/kernel/<name>.cpp, and sometimes add fields to the matching include/<name>.h argument struct if they need extra precomputed data.
The benchmark headers now carry two kinds of reference data:
BASELINE_*: a reference runtime used when we report speedup over the baseline.NAIVE_SPEEDUP_LOWER_BOUND_*: a reference lower bound for the rationaive / stu_*.
For the overall summary printed by run_all.cpp, we now use these fixed
BASELINE_* values from the header files to compute the geometric mean.
In the formulas below:
-
$t_{\mathrm{naive}}$ means the runtime of the naive implementation. -
$t_{\mathrm{stu}}$ means the runtime of the correspondingstu_*implementation. -
$LB$ means the lower-bound value for the current kernel.
Because the server workload can be unstable, the main per-kernel speedup that students should pay attention to is now the speedup over the naive version:
The lower-bound values are stored in the corresponding header files, and they mean that ideally:
Important note:
- These lower bounds are references for performance evaluation and tuning.
- They do not mean “below this value = automatic zero”.
- We will run benchmarks multiple times and look at the best, most stable result instead of judging from one noisy run on a busy server.
The table below is sorted by header file name in ascending order. The actual constant names in the headers are still NAIVE_SPEEDUP_LOWER_BOUND_*; the table only lists their numeric values.
| Header file | Baseline constant | Baseline time (ns) | Lower bound value | Default input size in run_all.cpp |
|---|---|---|---|---|
bitwise.h |
BASELINE_BITWISE |
250,000 |
8.00 |
vector length 1,024,000 |
blackscholes.h |
BASELINE_BLACKSCHOLES |
4,800,000 |
1.35 |
81,920 options |
filter_gradient.h |
BASELINE_FILTER_GRADIENT |
25,000,000 |
1.45 |
height x width 1024 x 1024 |
graph.h |
BASELINE_GRAPH |
5,000,000 |
2.50 |
1,024,000 nodes, average degree 8 |
grff.h |
BASELINE_GRFF |
8,500,000 |
3.15 |
feature size 1,024,000 |
image_proc.h |
BASELINE_IMAGE_PROC |
43,000,000 |
1.73 |
image size 1024 x 1000 |
matmul.h |
BASELINE_MATMUL |
88,000,000 |
2.45 |
matrix size 512 x 512 |
relu.h |
BASELINE_RELU |
550,000 |
2.50 |
vector length 1,024,000 |
sparse_spmm.h |
BASELINE_SPARSE_SPMM |
116,000,000 |
1.40 |
LHS (left hand side) CSR matrix of 2048 x 2048, the default dense RHS (right hand side) is also 2048 x 2048 |
trace_replay.h |
BASELINE_TRACE_REPLAY |
3,400,000 |
1.75 |
65,536 records and trace length 1,048,576 |
include/bench.h is the shared benchmark harness used by both single_bench and run_all.
It provides:
bench_t: one record describing a benchmark run.tfunc: the function wrapper that should call the student implementation.naiveFunc: the wrapper for the reference implementation.checkFunc: the validator used after timing.argsandref_args: the benchmark contexts passed to the student and reference wrappers.baseline_time: the baseline runtime instd::chrono::nanoseconds.naive_speedup_lower_bound: the reference lower bound used when comparingstu_*against the naive implementation.measure_time(...): helper for timing one benchmark run.flush_cache(): touches a large buffer before each measurement to reduce cache-warm effects between runs.debug_log(...): prints debug messages only whenDEBUGis enabled.calculate_speedup(...)andcalculate_geometric_mean_speedup(...): helpers used byrun_all.cpp.
The most important speedup numbers now are:
The final geometric mean reported by run_all.cpp is also based on these
fixed baseline speedups, not on naive / stu_*.
Important detail:
argsandref_argsdo not have to point to the same object.- If a kernel modifies its input in place, keep separate student and reference contexts.
reluis the clearest example because the input vector is also the output.
For example, when you want to measure a student version, the benchmark entry should look like this idea:
{"ReLU",
stu_relu_wrapper,
naive_relu_wrapper,
relu_check,
&relu_args_student,
&relu_args_reference,
BASELINE_RELU,
NAIVE_SPEEDUP_LOWER_BOUND_RELU}Right now, some student benchmark entries in run_all.cpp are intentionally left as TODO comments because not every stu_* path may be available at the same time. Students can uncomment the corresponding block when that kernel is ready to benchmark.
Run all commands from the repository root.
Build the project:
cmake -S . -B build
cmake --build build -jRun the executables:
./build/single_bench
./build/run_allBuild only one target if you want:
cmake --build build --target single_bench -j
cmake --build build --target run_all -jIf you only changed source or header files such as:
src/kernel/*.cppsrc/main/*.cppinclude/*.h
then a normal rebuild is enough:
cmake --build build -jIf you changed build settings such as:
CMakeLists.txt- the compiler
- compile flags
- the build directory you want to use
then re-run CMake configure first:
cmake -S . -B build
cmake --build build -jIf you want a separate debug build, use a different build directory such as build-debug instead of overwriting build.
There are several useful ways to debug this project.
bench.h defines debug_log(...). It only prints when the DEBUG macro is enabled.
Use a separate debug build:
cmake -S . -B build-debug -DCMAKE_CXX_FLAGS="-DDEBUG"
cmake --build build-debug -j
./build-debug/single_benchWith DEBUG enabled, many checkers print useful mismatch information such as:
- first failing index
- reference value vs student value
- absolute or relative error
- per-run timing inside the benchmark loop
single_bench.cpp is the fastest place to test one benchmark at a time. You can:
- register only the kernel you are currently working on
- reduce problem sizes temporarily
- confirm correctness before profiling the whole suite
CMakeLists.txt already adds:
-g-fno-omit-frame-pointer
This is useful for:
- source-level debugging
- better call stacks in
perf record -g - more useful output from
perf reportandperf annotate
src/main/run_all.cpp defines:
#define GEOMETRIC_MEAN 1This macro controls whether the program prints one overall performance summary at the end.
- When
GEOMETRIC_MEANis1,run_allcollects the speedup over the fixed baseline for every benchmark and prints the geometric mean after all benchmarks finish. - When
GEOMETRIC_MEANis0, that overall summary is disabled. - If any benchmark fails correctness checking, the geometric mean is reported as
N/A.
Why this matters:
- The geometric mean is a better overall score than averaging raw runtimes or averaging speedup numbers directly.
- It gives one single summary number for the whole benchmark suite.
- For the assignment,
run_allis the place to look when you want the overall performance result across all kernels, not just one kernel.
For one kernel, the individual multiplier used by the geometric mean is:
If we write the baseline speedup of kernel
and the final geometric mean over
So the overall summary now uses the fixed BASELINE_* values defined in the
header files. The lower-bound target naive / stu_* for one kernel, but it is not part of the geometric-mean
calculation.
Please read perfUsage.md. It already contains a practical workflow and example screenshots for:
perf statperf recordperf reportperf annotate
A good workflow for this assignment is:
- Run
run_allunderperfto find the hottest benchmark or function. - Optimize one kernel.
- Rebuild.
- Re-run
run_all. - Use
single_benchwhen you want focused measurements for one kernel.
Typical commands:
perf stat ./build/run_all
perf record -g ./build/run_all
perf report
perf annotate --stdio --source -l -s <function_name>- Pick one kernel in
src/kernel/. - Implement or improve the matching
stu_*function. - Make sure the benchmark entry uses
stu_*_wrapperastfuncandnaive_*_wrapperasnaiveFunc. - Keep separate
argsandref_argsif the kernel changes its inputs in place. - Rebuild with
cmake --build build -j. - Check correctness with
single_bench. - Measure end-to-end impact with
run_allandperf.