This work has been accepted to WACCPD 2025, the Twelfth Workshop on Accelerator Programming and Directives.
The paper: https://dl.acm.org/doi/10.1145/3731599.3767570
This work is tested on AMD MI210, AMD MI300X, Intel Max 1550, Nvidia A100, and Nvidia GH200 GPUs.
We provide three load-balancing approaches, each of which works best in different scenarios:

- **Local Load Balancing (LLB)** distributes work within each work-group, ensuring that individual work-items share the load evenly.
- **Global Load Balancing (GLB)** extends load balancing across the entire device by redistributing work between work-groups.
- **Strided Local Load Balancing (SLB)** is similar to LLB but assigns work-items using a strided mapping based on the number of work-groups.
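The difference between a contiguous (LLB-like) and a strided (SLB-like) assignment of work to work-items can be sketched as follows. This is an illustrative Python sketch, not the actual SYCL kernels; the function names `contiguous_map` and `strided_map` are hypothetical:

```python
# Illustrative sketch of two ways to assign a pool of edges to work-items.
# Not the repository's SYCL implementation; all names are hypothetical.

def contiguous_map(num_items, num_edges):
    """LLB-like: each work-item takes a contiguous chunk of edges."""
    chunk = (num_edges + num_items - 1) // num_items  # ceiling division
    return {i: list(range(i * chunk, min((i + 1) * chunk, num_edges)))
            for i in range(num_items)}

def strided_map(num_items, num_edges, stride):
    """SLB-like: each work-item takes every `stride`-th edge starting at
    its own offset; in SLB the stride would be derived from the number
    of work-groups."""
    return {i: list(range(i, num_edges, stride))
            for i in range(num_items)}

if __name__ == "__main__":
    print(contiguous_map(2, 8))     # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
    print(strided_map(2, 8, 2))     # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```

The strided variant interleaves neighboring indices across work-items, which can even out per-item load when adjacent edges have correlated cost.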
Authors: Kaan Olgu & Tobias Kenter
Build instructions vary depending on your system environment (Ubuntu/Debian with prebuilt oneAPI, or RHEL/HPC systems requiring a source build).
See BUILD.md for full instructions covering:
- Intel oneAPI prebuilt (AMD, NVIDIA, Intel GPUs)
- Building Intel LLVM from source (HPC / RHEL / module-based systems)
- Runtime environment setup
- Known issues
source setvars.sh --force --include-intel-llvm
cmake -Bbuild_local -H. -DENABLE_NVIDIA_BACKEND=ON -DCUDA_ARCH=80 \
-DGPU_TARGETS=all -DUSE_GLOBAL_LOAD_BALANCE=OFF -DSM_FACTOR=48
cmake --build build_local
./build_local/bfs_1.gpu --dataset=$dataset --root=$root \
--num_runs=20 --output=output.json
For AMD, Intel GPU, HPC clusters, or multi-GPU setups, see BUILD.md.
The dataset rmat-19-16 is provided with files for up to 4 GPUs. The best approach is to generate your own RMAT dataset via the scripts in the scripts folder, or to convert a dataset you already have to the binary format. The Python scripts may require missing packages, which can be installed via pip install xxx
$python --version
Python 3.12.5
python genGraph.py rmat ${scale} ${factor}
python generator.py rmat-${scale}-${factor} nnz
# Example:
python generator.py rmat-19-16 nnz $((2**19))
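For reference, a dataset named rmat-${scale}-${factor} follows the usual Graph500-style RMAT convention: 2^scale vertices and roughly factor × 2^scale generated edges (the exact edge count after removing duplicates and self-loops depends on the generator). A quick sanity check for rmat-19-16:

```python
# Graph500-style RMAT sizing convention (assumption; exact post-cleanup
# edge counts depend on the generator).
scale, factor = 19, 16
num_vertices = 2 ** scale           # 524288 vertices for rmat-19-16
num_edges = factor * num_vertices   # ~8388608 generated edges
print(num_vertices, num_edges)
```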
Here is a table of the throughput values we measured, in GTEPS:

The authors gratefully acknowledge the computing time provided to them on the high-performance computer Noctua2 at the NHR Center PC2. This is funded by the Federal Ministry of Education and Research and the state governments participating on the basis of the resolutions of the GWK for national high-performance computing at universities (www.nhr-verein.de/unsere-partner).
[Intel Tiber AI Cloud](https://www.intel.com/content/www/us/en/developer/tools/tiber/ai-cloud.html)
This work used the DiRAC@Durham facility managed by the Institute for Computational Cosmology on behalf of the STFC DiRAC HPC Facility (www.dirac.ac.uk). The equipment was funded by BEIS capital funding via STFC capital grants ST/P002293/1, ST/R002371/1 and ST/S002502/1, Durham University and STFC operations grant ST/R000832/1. DiRAC is part of the National e-Infrastructure.