Summary
Break down benchmark timings to separately report PCIe transfer time (host↔device) versus kernel execution time, giving clear visibility into where time is actually spent.
Motivation
The current benchmark reports total wall-clock time per iteration, but this conflates two very different costs: data transfer over the PCIe bus and actual GPU kernel execution. For bandwidth-bound kernels (e.g. VectorAdd), transfer time may dominate. For compute-bound kernels (e.g. SHA-256), kernel time dominates. Without this breakdown, it's impossible to identify the real bottleneck or measure the benefit of optimisations like double-buffering (#6).
Acceptance Criteria
Technical Notes
- ILGPU provides stream synchronisation primitives that can be used to isolate transfer and compute phases
- `BenchmarkResult` will need new fields for transfer and kernel timings
- `BenchmarkRunner` will need to wrap each phase with timing instrumentation
- Consider adding a `--verbose` or `--breakdown` flag to show the detailed timing (keeping default output clean)
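
A minimal sketch of what the breakdown could look like. The type and helper names below are illustrative, not taken from the existing code; the only ILGPU assumption is that `Accelerator.Synchronize()` (or `AcceleratorStream.Synchronize()`) is used to fence each phase so transfers and kernel launches don't overlap into the wrong bucket:

```csharp
using System;
using System.Diagnostics;

// Hypothetical shape for the per-phase timings this issue proposes.
public record PhaseTimings(
    TimeSpan HostToDevice,  // PCIe upload
    TimeSpan Kernel,        // GPU execution
    TimeSpan DeviceToHost); // PCIe download

public static class PhaseTimer
{
    // Times a single phase. The caller must synchronise the accelerator
    // inside `phase` (e.g. accelerator.Synchronize()) so the GPU work is
    // actually finished before the stopwatch stops -- otherwise the
    // asynchronous launch returns immediately and the timing is meaningless.
    public static TimeSpan Time(Action phase)
    {
        var sw = Stopwatch.StartNew();
        phase();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

In `BenchmarkRunner` each iteration would then become three timed calls, e.g. `PhaseTimer.Time(() => { buffer.CopyFromCPU(input); accelerator.Synchronize(); })` for the upload phase, with the three results stored on the (extended) `BenchmarkResult`.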