AdePT can be profiled with the NVIDIA tools Nsight Systems (nsys) and
Nsight Compute (ncu). Note that high level counters are available in the
output when running with /adept/verbose 4.
nsys is the right tool to inspect whole-application timing, CUDA stream
overlap, kernel launch rates, and which particle type is dominating.
As profiling full production workflows can be difficult,
AdePT provides optional CUDA profiler API hooks so nsys can skip GPU geometry
initialization and capture only a transport-loop window. Enable them at build
time with:
cmake -S . -B ./adept-build \
-DADEPT_COMPUTE_BACKEND=CUDA \
-DADEPT_ENABLE_NSYS_PROFILING=ON \
<otherargs>ADEPT_ENABLE_NSYS_PROFILING requires the CUDA backend. CMake fails during
configuration if it is requested with a non-CUDA backend.
At runtime, enable the transport capture hook with:
export ADEPT_NSYS_CAPTURE_TRANSPORT=1Optional runtime controls:
| Variable | Default | Meaning |
|---|---|---|
ADEPT_NSYS_CAPTURE_START_AFTER_ITERATIONS |
0 |
Start capture before this transport-loop iteration. |
ADEPT_NSYS_CAPTURE_STOP_AFTER_ITERATIONS |
0 |
Stop capture before this transport-loop iteration. 0 means stop during transport teardown. |
The iteration range is half-open: [start, stop). For example,
START_AFTER_ITERATIONS=50 and STOP_AFTER_ITERATIONS=150 captures transport
iterations 50 through 149. Invalid or negative values are ignored and treated as
the default 0.
Run nsys with the CUDA profiler API capture range:
nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop \
--trace=cuda,nvtx --sample=none --cpuctxsw=none \
--stats=true --export=sqlite --force-overwrite=true \
--output adept_transport_profile \
<application command>Open the report with:
nsys-ui adept_transport_profile.nsys-repThe exported SQLite file can be summarized with the AdePT profile plotting script:
python3 scripts/plot_adept_nsys_profile.py \
--sqlite adept_transport_profile.sqlite \
--output-prefix adept_transport_profile \
--title "AdePT"This writes:
adept_transport_profile_kernel_profile.png: total CUDA kernel times, transport shares by particle type, the waited kernel category that most often reaches the end of an AdePT transport iteration, and the corresponding critical-path margin in milliseconds.adept_transport_profile_species_pies.png: per-species pie charts showing which split kernels dominate electron, positron, and gamma transport.adept_transport_profile.txtplus CSV files with the numeric summaries.
The percentages labelled as "all kernels" are normalized to all CUDA kernels in the captured range, including injection, population statistics, and bookkeeping. The percentages labelled as "transport" are normalized only to electron, positron, and gamma transport kernels.
The limiter plots use the latest-ending waited non-FinishIteration CUDA
kernel before FinishIteration as the limiting category. InitTracks kernels
are excluded from this limiter view because they run on a separate injection
stream and FinishIteration does not directly wait for them; they remain visible
in the total CUDA kernel-time view. The frequency view shows what percentage of transport iterations each
waited category limits. The critical-margin view credits only the time by which
that latest kernel extends the iteration beyond the runner-up latest kernel, capped
by the latest kernel's own duration.
:name: fig-nsys-kernel-profile-example
:alt: Example AdePT Nsight Systems kernel profile summary plot.
:align: center
:width: 95%
Example kernel summary from an AdePT split-kernel `nsys` profile.
The example above was produced from a CMSSW ttbar simulation. It illustrates why
both the total kernel-time view and the limiter view are useful: gamma transport
accounts for about 25% of all CUDA kernel time and about 27% of transport kernel
time, but gamma kernels almost never determine the end of the transport
iteration in this profile. Electrons and positrons dominate the waited
last-kernel counts, and electrons alone account for most of the critical-path
margin. This means that reducing the gamma workload, for example with Russian
roulette, is not expected to improve the iteration wall time unless the gamma
kernels become the latest waited path. Conversely, InitTracks is visible as a
non-negligible total kernel-time bucket, but it is intentionally excluded from
the limiter view because it runs asynchronously on the injection stream.
:name: fig-nsys-species-pies-example
:alt: Example AdePT Nsight Systems per-species transport kernel pie charts.
:align: center
:width: 95%
Example per-species transport-kernel summary from the same kind of AdePT
split-kernel `nsys` profile.
ncu is the tool for detailed kernel analysis: register count, achieved
occupancy, memory coalescing, cache behavior, stall reasons, and source-level
metrics. Use kernel-name filters plus launch skipping/counting to select
representative transport kernels:
ncu --target-processes all \
--kernel-name-base demangled \
--kernel-name 'regex:.*Electron(HowFar|Propagation|MSC|Relocation).*' \
--launch-skip <n> \
--launch-count <m> \
--set detailed \
--force-overwrite \
--export ncu_electron_geometry \
<application command>Choose --launch-skip from an nsys profile so the selected launches are in
the dominant transport region rather than startup or tail iterations. For
source-level attribution, use a build configuration that emits CUDA line
information, such as RelWithDebInfo.