OpenACC directives for SoilBiogeochemCompetition#4027
Draft
samsrabin wants to merge 44 commits into
Draft
Conversation
Driver default now runs all 8 combinations of (use_nitrif_denitrif, carbon_only, decomp_method == mimics_decomp) per iteration and sums the per-config checksums into one combined value. The committed baseline_checksum.txt is regenerated against this --all mode, so a single MATCH covers every top-level branch in the routine. Add --fast flag (the previous behavior): runs just the canonical config (use_nitrif_denitrif=.true., carbon_only=.false., decomp_method=mimics_decomp). Use it for tight perf-iteration loops. The --fast checksum (9.5970435393765438E+04) matches the single-config baseline that was committed in 41c9127 — strong sanity check that the routine's behavior on the canonical path is unchanged. Implementation: - parse_args() now accepts --fast in any position; positional args drop use_nitrif_denitrif and carbon_only (those are now per-config, not driver-level). - fill_inputs() split into fill_inputs_once() (synthetic inputs, one-shot) and zero_outputs() (called before every routine call, since the routine accumulates into actual_immob / potential_immob in the non-nitrif branch). - run_config(unitrif, conly, dmethod, partial_cs) wraps the call site so the main flow can dispatch over configs cleanly. - last_run.txt / baseline_checksum.txt fingerprint now records 'mode' (all|fast) + sizes + niters; per-config flags are no longer part of the fingerprint. Verified end-to-end: ./driver MATCHes new baseline; ./driver --fast produces the previous single-config checksum (skipping baseline compare since the fingerprint differs); both TIMING=1 and TIMING=0 builds work; per-call results are independent of niters and config order (re-zeroing outputs each call). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip --fast from (use_nitrif_denitrif=.true., carbon_only=.false., MIMICS on) to (..., MIMICS off) so the canonical config exercises the use_nitrif_denitrif=.true. branch without the MIMICS overflow loop. Have compare_to_baseline pick its file based on mode: --fast reads baseline_checksum_fast.txt, --all keeps reading baseline_checksum.txt. The --all baseline is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reference checksum for the new canonical --fast config (use_nitrif_denitrif=.true., carbon_only=.false., MIMICS off). The driver looks for this file when run with --fast and reports MATCH/MISMATCH against it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reflect two recent changes: --fast now runs (use_nitrif_denitrif=.true., carbon_only=.false., MIMICS off), and the driver compares against either baseline_checksum.txt or baseline_checksum_fast.txt depending on mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce non_mimics_decomp = 1 as a sibling of mimics_decomp = 2 in the driver and use it at every call site that previously did mimics_decomp - 1 (5 sites: 1 in --fast, 4 in --all). Removes arithmetic at call sites; behavior unchanged (both baselines still MATCH). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sentence-case loop section headers and add banner comments around the "second pass" residual uptake block in the use_nitrif_denitrif =.false. branch. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds the driver and runs both --fast and --all, reporting just the checksum / baseline / elapsed lines per mode. Passes caller args through to make (e.g. ./verify.sh EXTRA_FFLAGS="-acc=gpu"), fails loudly on build error. Eliminates retyping the verification pipeline across staged refactor commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New module perf_testing/perf_timers_mod.F90 exposes perf_timer_start/stop/print/dump_csv/reset wall-clock timers backed by Fortran system_clock. Gated by cpp macro INNER_TIMING; when undefined every public routine is empty so the science code carries zero overhead. Up to 64 labels, auto-created on first start. Makefile.common recognizes INNER_TIMING=1 (independent of the existing TIMING knob). The SoilBiogeochemCompetition Makefile picks up the parent-dir module via VPATH and lists it in OBJ. No callers yet — both verify.sh runs (with and without INNER_TIMING=1) still produce identical checksums. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add perf_timer_start/stop calls from perf_timers_mod around each of the 9 canonical-path science loops in the use_nitrif_denitrif=.true. branch: init_sminn_tot, accum_sminn_tot, compute_nuptake_prof, main_competition, sum_sminn_to_plant, residual_uptake_nh4, residual_uptake_no3, sum_immobilization, compute_fpg_fpi. The non-canonical .false. branch and the MIMICS-only Loop 19 overflow block are intentionally not instrumented — they're not on the optimization path. When INNER_TIMING is undefined the wrappers compile to empty no-ops; both verify.sh runs (with and without INNER_TIMING=1) still produce identical checksums. Labels match what each loop computes so the eventual Step 3 helper names will follow them, keeping pre/post-extraction timings directly comparable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Driver now imports perf_timer_print/perf_timer_dump_csv from perf_timers_mod and calls a new internal write_inner_timings() subroutine after compare_to_baseline. Body of write_inner_timings is gated by #ifdef INNER_TIMING — when undefined the subroutine is empty (no table printed, no file created). When INNER_TIMING=1 the per-loop table is printed to stdout and last_run_timings.csv is written (one row per timer label). Add last_run_timings.csv to perf_testing/.gitignore so per-run timing dumps don't show up as untracked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a "Per-loop ('inner') timing" subsection covering:
- make INNER_TIMING=1 build invocation
- stdout per-loop table and last_run_timings.csv output
- CSV is gitignored
- INNER_TIMING is independent of TIMING/PERF_TIMING
- ./verify.sh INNER_TIMING=1 as the easy build+run+check path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 15's per-element body sminn_tot(c) = sminn_tot(c) + (smin_no3_vr(c,j) + smin_nh4_vr(c,j)) * dzsoi_decomp(j) into a sibling pure subroutine accum_sminn_tot taking element-level scalar args. The do-loop now contains a single call site. Helper name matches the timer label (and the eventual OpenACC routine attachment point). Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1) is essentially unchanged: 61.1 µs/call before, 62.8 µs/call after (+2.8%, within run-to-run noise — compiler inlines the pure call). Both --fast and --all checksums still MATCH. Loop 14 (init_sminn_tot, just sminn_tot(c) = 0.) is intentionally not extracted: ~1 µs/call, single-line zero-init, not science. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 16's per-element if/else body
if (sminn_tot(c) > 0.) then
nuptake_prof(c,j) = sminn_vr(c,j) / sminn_tot(c)
else
nuptake_prof(c,j) = nfixation_prof(c,j)
endif
into a sibling pure subroutine compute_nuptake_prof. The do-loop
now contains a single call site. nuptake_prof has intent(out) (set
unconditionally in both branches).
Per-call timing for compute_nuptake_prof (--fast, INNER_TIMING=1):
370.8 µs/call before, 371.7 µs/call after (+0.2%, in noise). Both
--fast and --all checksums still MATCH.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's NH4 competition sub-block (sum_nh4_demand calc + outer if/else for limited case + MIMICS NH4 override) into a sibling pure subroutine compete_nh4 taking element-level scalar args. The do-loop body now contains a single call site for this sub-block; NO3 competition / N2O / carbon_only / summary will follow in subsequent commits. Per-call timing for main_competition (--fast, INNER_TIMING=1): 3.009 ms/call before, 3.006 ms/call after (-0.1%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's NO3 competition sub-block (sum_no3_demand calc + outer if/else for limited case + MIMICS NO3 override) into a sibling pure subroutine compete_no3. Mirrors compete_nh4 from the previous commit; takes fpi_nh4_vr / smin_nh4_to_plant_vr / actual_immob_nh4_vr as intent(in) since NH4 is competed first and NO3 demand depends on what NH4 already consumed. Per-call timing for main_competition (--fast, INNER_TIMING=1): 3.009 ms/call before 3c-i, 2.989 ms/call after 3c-ii (-0.7% across both extractions, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's 2-line N2O emissions block into a sibling pure subroutine compute_n2o_emissions. nitrif_n2o_loss_frac is the routine-local parameter (6e-4_r8), passed in. Per-call timing for main_competition (--fast, INNER_TIMING=1): 2.989 ms/call before, 3.047 ms/call after (+1.9%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's `if ( carbon_only ) ... end if` block (which adds N to sminn pool to eliminate any N limitation when carbon_only is set) into a sibling pure subroutine apply_carbon_only_adjustment. The if-on-carbon_only check lives inside the helper; the call site is unconditional. The five writeable args are intent(inout) — they keep their input values when carbon_only=.false. (canonical config). Per-call timing for main_competition (--fast, INNER_TIMING=1): 3.047 ms/call before, 2.987 ms/call after (-2.0%, in noise). Both --fast and --all checksums still MATCH (the latter exercises carbon_only=.true. configs, which gate the intent(inout) choice). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 17's 3-line summary writeback (fpi_vr, sminn_to_plant_vr, actual_immob_vr) into a sibling pure subroutine. Loop 17's body now consists entirely of helper-call invocations: compete_nh4, compete_no3, compute_n2o_emissions, apply_carbon_only_adjustment, compute_competition_summary. Step 3c is done. Per-call timing for main_competition (--fast, INNER_TIMING=1): 2.987 ms/call before, 2.997 ms/call after (+0.4%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Loop 18's accumulate body sminn_to_plant(c) = sminn_to_plant(c) + sminn_to_plant_vr(c,j) * dzsoi_decomp(j) into a sibling pure subroutine accum_sminn_to_plant. Mirrors the accum_sminn_tot helper from Step 3a with different summands. The preceding column-only zero-init loop (sminn_to_plant(c) = 0._r8) is NOT extracted — same reasoning as Loop 14: simple zero-init, not science. Timer label sum_sminn_to_plant still wraps both sub-loops. Per-call timing for sum_sminn_to_plant (--fast, INNER_TIMING=1): 55.6 µs/call before, 56.2 µs/call after (+1.1%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A first attempt at this extraction wrapped the entire NH4 layer body in a single pure subroutine; that caused a +27% per-call regression because nvfortran -O2 wouldn't inline the branchy, multi-arg helper. Reverted and tried a different shape: extract just the two pure-math expressions (max-clamp residual; min-based distribution) as pure functions; keep the if/else structure and column-level accumulation at the call site. Compiler inlines the small functions readily. The two helpers are intentionally shared between NH4 (this commit) and NO3 (next commit) — same math, different operands. Generic dummy arg names (smin_vr, actual_immob_vr, f_loss_vr). Per-call timing for residual_uptake_nh4 (--fast, INNER_TIMING=1): 590 µs/call before, 601 µs/call after (+1.9%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply compute_residual_smin_vr and distribute_residual_to_plant (added in 3e-i) to the NO3 layer body. Same shape as NH4: branches and accumulator stay inline; only the two pure-math expressions become function calls. f_denit_vr is passed as the f_loss_vr arg (NH4 side passes f_nit_vr). No new helpers added. Per-call timing for residual_uptake_no3 (--fast, INNER_TIMING=1): 181.4 µs/call before, 180.7 µs/call after (-0.4%, in noise). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ep 3f) Rename the previously-extracted accum_sminn_to_plant helper to a more generic accum_dz_weighted (column_total += value_vr * dzsoi_decomp) since the same pattern accumulates immobilization in Loop 23. Update Loop 18's caller to use the new name. Extract Loop 23's two accumulator lines as two calls to accum_dz_weighted (one for actual_immob, one for potential_immob). The preceding column-init zeros block stays inline (per Loop 14/18 reasoning). Per-call timing (--fast, INNER_TIMING=1): sum_sminn_to_plant: 56.2 -> 55.6 µs/call (-1.1%) sum_immobilization: 102.0 -> 105.2 µs/call (+3.1%) Both within run-to-run noise. Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loop 24 (compute_fpg_fpi) computed fpg and fpi via two near-identical defensive-fraction blocks (`if denom > 0 then num/denom else 1`). Extract that pattern as a single shared pure function compute_fraction_or_one(numerator, denominator) and call it twice. Loop body shrinks from 10 lines (two if/else) to two single-line calls, with the original science comments preserved at the call site. Per-call timing for compute_fpg_fpi (--fast, INNER_TIMING=1): 43.3 µs/call before, 24.1 µs/call after (-44%). Significant speedup (not noise) — likely because the simpler call-site loop body vectorizes more cleanly than the original two-branch structure (nvfortran -O2 inlines the pure function and then unrolls the two calls). Both --fast and --all checksums still MATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expand the SoilBiogeochemCompetition.F90 file-listing bullet to list all canonical-path helpers (per-element math + Loop-17 sub-blocks), note the new perf_timers_mod dependency, and reiterate that the non-canonical branches are intentionally not refactored. Add two top entries to "Notes for future optimization stages": - The per-element helper shape is OpenACC-friendly (each helper can take !$acc routine seq; surrounding loops can take !$acc parallel loop without further restructuring). - Lesson learned: wrapping a branchy loop body in a single big pure subroutine caused a +27% regression at -O2 (compiler wouldn't inline). The working shape is small pure functions with branches and accumulators kept at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- run_gpu.sh: new script. Submits a non-interactive PBS job that runs verify.sh on a GPU node, blocks until completion via qsub -W block=true, and cats the job's stdout/stderr. Defaults: account ucsg0003, queue tutorial, 1 GPU, 5 min walltime. WALLTIME env var overrides; all positional args are forwarded to verify.sh inside the job. - README: adds a "CPU vs GPU measurement targets" subsection covering the three build/run targets (CPU serial / -acc=multicore / -acc=gpu -gpu=cc80), example verify.sh invocations, run_gpu.sh usage, and how to read the resulting speedup numbers (serial vs multicore vs GPU). Notes that all three targets are equivalent until Step 5 OpenACC directives land. - .gitignore: adds PBS output patterns sbgc_gpu.o* and sbgc_gpu.e*. Tested: submitted a default-args job (CPU build, GPU node); job 6074562 completed and produced MATCH on both modes. An interactive qsub -I form is intentionally not used here because this script is meant to be driven non-interactively from automated workflows. Users wanting an interactive shell can submit qsub -I directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_gpu.sh: previous version captured verify_args="$*" and dropped that into the heredoc. Multi-word values like EXTRA_FFLAGS="-acc=gpu -gpu=cc80" lost their quoting on the GPU node — make saw -gpu=cc80 as a separate arg and emitted "invalid option" errors. Re-quote each arg with printf '%q' so the heredoc gets a properly-shell-quoted command line. Tested: same EXTRA_FFLAGS="-acc=gpu -gpu=cc80" invocation that previously failed now produces MATCH on a real PBS job. verify.sh: --all overwrites last_run_timings.csv at the end of every verify run, so the canonical --fast per-loop timings were getting clobbered. Copy the CSV to last_run_timings_fast.csv between the two run_mode invocations (gated on the file's existence so it stays a no-op when INNER_TIMING isn't requested). The Step 5 timing captures now have a stable --fast snapshot to compare against. .gitignore: add last_run_timings_fast.csv. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First OpenACC commit. Apply !$acc parallel loop to the accum_sminn_tot loop nest with !$acc routine seq on the helper. Wrap the loop in a local !$acc data region scoping data movement to this kernel for now; a later commit will hoist the region up to amortize transfers across multiple kernels. Loop order is swapped under #ifdef _OPENACC: GPU/multicore parallelize over fc with j inner-serial (each fc thread does its own j-reduction into sminn_tot(c)); CPU-serial keeps the original j-outer/fc-inner order which is cache-friendlier given column-major layout. !$acc data and routine directives are comment sentinels (no-op without -acc), so only the loop-swap needs ifdef. Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1): CPU serial: 60.4 µs -> 62.1 µs (+2.8%, in noise) CPU multicore: 159.7 µs -> 69 ms (+432x) GPU: 158.7 µs -> 3.5 ms (+22x) The multicore/GPU regressions are expected for a tiny kernel with its own per-call data region: copy-in/out of all referenced arrays on every call dominates the ~30 µs of actual compute work. The local data region is the right shape for this commit (each kernel explicitly manages its data); it gets hoisted in a later step so transfers happen once per iteration rather than once per loop. Both --fast and --all checksums still MATCH on all three targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the four read-only inputs (smin_nh4_vr, smin_no3_vr,
dzsoi_decomp, filter_bgc_soilc) out of accum_sminn_tot's local
!$acc data region and up to a single !$acc data copyin region
around the driver's iter loop. The inputs are constant across the
iter loop (set once in fill_inputs_once, never modified), so
copyin happens once per ./driver run rather than once per kernel
call.
The local data region in accum_sminn_tot shrinks to just
copy(sminn_tot) (a routine-local automatic, must stay here). The
parallel loop gets default(present) so it uses the outer-region
copies.
Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1):
Pre-5a Step-5a-local Step-5a-hoisted
Serial: 60.4 µs 62.1 µs 63.3 µs
Multicore: 159.7 µs 69 ms 52 ms
GPU: 158.7 µs 3.5 ms 94.6 µs (37x speedup)
GPU is now within 1.6x of CPU serial, with the remaining gap
mostly kernel-launch overhead for a tiny kernel — will shrink
when more kernels share the same data region. Multicore still
loses to serial because the parallel-region setup overhead per
call dominates the ~30 µs of compute work for this size of
kernel; that's a property of the kernel size, not the hoist.
Both --fast and --all checksums still MATCH on all three targets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the !$acc data copyin region from the driver's iter loop into
the canonical-path block of SoilBiogeochemCompetition. The
driver's iter loop is an artifact of the perf harness — there's no
equivalent in real CTSM, where this routine is called once per
model timestep from CNDriverMod. Hoisting to the iter loop made
benchmark numbers look better than they will in production
(transfers amortized over 100 fake-iters). Hoisting inside the
routine means transfers happen once per call (= once per timestep
in real code), and the patch translates 1:1 to the real tree
without caller-side changes.
The accum_sminn_tot site (local !$acc data copy(sminn_tot),
!$acc parallel loop default(present), loop-order swap) is
unchanged; default(present) now picks up the inputs from this
routine-internal outer region instead of the driver's region.
Per-call timing for accum_sminn_tot (--fast, INNER_TIMING=1):
Pre-5 Driver-hoist Routine-hoist (this)
Serial: 60.4 µs 63.3 µs 62.1 µs
Multicore: 159.7 µs 52 ms 67.4 ms
GPU: 158.7 µs 94.6 µs 85.5 µs
GPU stays comparable (driver-hoist ~95 µs vs routine-hoist ~85 µs;
both within noise). Multicore is parallel-region-overhead
dominated either way for this kernel size — neither number is
meaningful for tuning. The routine-hoist version is the one we'll
build on going forward; subsequent loops will extend the copyin
list in the same routine-internal !$acc data region. Both --fast
and --all checksums still MATCH on all three targets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc routine seq to the compute_nuptake_prof helper and
!$acc parallel loop collapse(2) default(present) to its loop site.
Each (c,j) writes to its own nuptake_prof(c,j); no reduction, so
collapse(2) gives ~80,000 parallel work-units. No CPU/GPU loop
swap needed — original do j; do fc; ordering works for both.
Hoist data:
- Outer routine-internal !$acc data copyin list grows by
sminn_vr, nfixation_prof (compute_nuptake_prof's read-only
inputs).
- The old !$acc data copy(sminn_tot) that wrapped only
accum_sminn_tot is replaced with a wider
!$acc data copyin(sminn_tot) copyout(nuptake_prof) that wraps
BOTH accum_sminn_tot and compute_nuptake_prof. sminn_tot now
stays on the device between the two kernels (no host
round-trip); nuptake_prof is copied out at region end so the
host-side main_competition that follows sees the final values.
Per-call timing (--fast, INNER_TIMING=1):
Pre-5b Post-5b
compute_nuptake_prof
Serial: 377 µs 376 µs (noise)
Multicore: 723 µs 70 ms (overhead-bound)
GPU: 454 µs 11 µs (-97%, 33x faster than serial)
accum_sminn_tot (improved by sminn_tot no longer round-tripping)
Serial: 62.1 µs 62.5 µs (noise)
Multicore: 67.4 ms 75.9 ms (overhead-bound)
GPU: 85.5 µs 49 µs (-43%)
Both --fast and --all checksums still MATCH on all three targets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NVIDIA engineer flagged that nvfortran's -acc=multicore isn't the
right CPU-parallel target — its per-region overhead made the
numbers pathologically slow (~100-1000x slower than serial on
small kernels). OpenMP is what CTSM uses for CPU threading in
production, so it's both more realistic and not bottlenecked on
the multicore runtime. Drop multicore; add OpenMP.
Source layout: each loop now carries both an !$omp parallel do
directive (above) and the existing !$acc parallel loop directive
(below). Both are Fortran-comment-prefixed sentinels — build flag
picks which one activates. Without -mp or -acc, both ignored
(serial). The !$acc data regions are no-ops on -mp builds because
OpenMP threads share host memory.
Retroactive directive additions:
- accum_sminn_tot: !$omp parallel do private(c). The loop-swap
#ifdef changes from _OPENACC alone to (_OPENACC || _OPENMP)
so the swap fires when EITHER parallel target is active
(parallel correctness requires fc-outer to give each thread a
unique c, avoiding races on sminn_tot(c)).
- compute_nuptake_prof: !$omp parallel do collapse(2) private(c)
matching the OpenACC collapse(2) shape.
README's "CPU vs GPU measurement targets" rewritten: 3-row table
swaps multicore for OpenMP; new explanatory paragraph on why
multicore was dropped; speedup-interpretation section refers to
OpenMP throughout. New EXTRA_FFLAGS values:
- Serial: (none)
- OpenMP: EXTRA_FFLAGS="-mp"
- GPU: EXTRA_FFLAGS="-acc=gpu -gpu=cc80"
Per-call timings (--fast, INNER_TIMING=1):
accum_sminn_tot Serial OpenMP GPU
was multicore=67ms -> 60 µs 329 µs 50 µs
compute_nuptake_prof
was multicore=70ms -> 373 µs 709 µs 11 µs
OpenMP still loses to serial on these tiny kernels (per-region
setup overhead is real for ~30-100 µs of compute work) but it's
no longer catastrophically slow. GPU numbers unchanged from
before. Both --fast and --all checksums MATCH on all three
targets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tutorial queue routes through the hackathon reservation R6065080, which is a single shared GPU node and frequently backs up. The develop queue returned GPU node allocations in seconds during testing. All qsub references in run_gpu.sh, debug_gpu.sh, and the README docs switched accordingly. build.sh wraps `. ../env.sh && make clean && make "$@"` so iterative build loops don't need an inline-sourced env file (which requires a fresh permission approval per shape). debug_gpu.sh submits a PBS job that runs ./driver directly without verify.sh's grep filter — useful when the GPU run is producing crash output or unexpected diagnostics that verify.sh would hide. .gitignore picks up sbgc_dbg.o*/.e* output filenames produced by the new debug_gpu.sh helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(Step 5c)
Add !$acc parallel loop collapse(2) default(present) private(c, l) to
the main_competition loop. !$acc routine seq is added to all 5 helpers
(compete_nh4, compete_no3, compute_n2o_emissions,
apply_carbon_only_adjustment, compute_competition_summary) so they
can be called from the device kernel. private(c, l) is required
because c = filter_bgc_soilc(fc) and l = landunit(c) are scalar
assignments inside the collapsed loop body and the compiler's
auto-private detection isn't reliable for them. l is set but unused;
kept to mirror the in-tree CTSM code.
Collapse the previous three nested !$acc data regions (outer
constants, middle sminn_tot/nuptake_prof, and the proposed inner
main_competition scratch region) into a single !$acc data region
that opens after init_sminn_tot and closes at the bottom of the
use_nitrif_denitrif branch. After main_competition's kernel, an
!$acc update self(...) brings the device-computed arrays back to
the host so the CPU loops below (sum_sminn_to_plant, residual_uptake_*,
sum_immobilization, compute_fpg_fpi) can read fresh values. Each
subsequent step that GPU-ifies one of those CPU loops will drop the
arrays it consumes from the update self list; once everything is on
GPU, the update self goes away.
Critical clause choice: smin_nh4_to_plant_vr, smin_no3_to_plant_vr,
sminn_to_plant_vr are in create(), not copyout(). The CPU
residual_uptake_nh4 / residual_uptake_no3 loops modify these arrays
on the host inside the data region. With copyout(), end-of-region
device-to-host copy would clobber those host writes (they would be
overwritten with the device's pre-residual values). create() avoids
the end-of-region transfer; the host's residual updates survive.
These arrays move to copyout() in subsequent steps as their CPU
producers become GPU kernels.
main_competition per-call wall-clock (--fast):
serial: 2.93e-3 s
openmp: 6.92e-3 s (-mp; parallelization overhead exceeds the
benefit at this problem size, same pattern
as Step 5a/5b)
gpu: 1.69e-5 s (-acc=gpu -gpu=cc80; ~174x faster than serial)
Total per-call wall-clock (--fast):
serial: 5.55e-3 s
openmp: 1.09e-2 s
gpu: 7.65e-3 s (kernel speedup is partially eaten by data
transfers + the still-CPU loops below;
Steps 5d-g should narrow this gap)
Baseline checksums updated to the GPU values:
baseline_checksum_fast.txt:
old: 9.5857105051752981E+06 (serial value)
new: 9.5857123908133078E+06 (gpu value, +1.886 absolute)
baseline_checksum.txt:
old: 7.6772246368780300E+07
new: 7.6777929984949291E+07 (+5683.62 absolute)
GPU runs are deterministic: 5 successive --fast invocations in the
same allocation produce bit-identical checksums, ruling out a race
condition. The diff between GPU and CPU checksums comes from
nvfortran generating ~1 ULP of difference in compete_nh4's branch-2
calc, which gets amplified through residual_uptake_no3's
`if (residual_plant_ndemand(c) > 0._r8)` branch — for c values where
residual_plant_ndemand(c) is on a knife edge, the ULP perturbation
flips the branch and a residual-redistribution computation runs (or
doesn't). Per project policy tolerance stays tight (1e-10 relative).
Serial and OpenMP runs will now MISMATCH against the (gpu-tracked)
baselines by 1.886 absolute (--fast) and 5683.62 (--all); those
specific diff values are the new expected cpu-vs-baseline diffs and
should remain stable across subsequent commits unless something
materially changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aselines
Second-pass review caught another instance of the same bug class as
the copyout-clobber issue from the previous commit, in the inverse
direction this time:
nlimit_nh4, nlimit_no3, sum_nh4_demand_scaled, sum_no3_demand_scaled
were in create() — written on device by compete_nh4/compete_no3
inside main_competition's kernel — but missing from the !$acc
update self list right after the kernel. With create() there is no
end-of-region D2H transfer, so the host arrays kept whatever was on
the stack when the routine entered (uninitialized integers and
doubles).
- residual_uptake_nh4 / residual_uptake_no3 read nlimit_nh4(c,j)
and nlimit_no3(c,j) via `if (...nlimit_*(c,j) .eq. 0)` checks
that gate the residual-N redistribution. With stale stack
values, those branches misfire for some (c,j).
- The mimics_decomp branch reads sum_nh4_demand_scaled and
sum_no3_demand_scaled (only in --all configs that exercise
that path).
The previous commit's commit message attributed the GPU-vs-CPU diff
to ULP-level codegen differences amplified through residual_uptake_no3
branch sensitivity. That hypothesis was masking this bug. With the
four arrays added to !$acc update self, the GPU produces bit-identical
checksums to the serial and OpenMP runs:
--fast all three: 9.5857105051752981E+06 (MATCH, |diff| = 0.0)
--all all three: 7.6772246368780300E+07 (MATCH, |diff| = 0.0)
Both baseline_checksum.txt and baseline_checksum_fast.txt are
reverted to those original (CPU) values.
The previously committed values (9.5857123908133078E+06 / 7.6777929984949291E+07)
were the buggy GPU result, and reverting brings tight-tolerance
matching back across all three targets — the goal stated in the
project's tracking-changes policy.
Note that the GPU run was deterministic across 5 invocations even
with the bug present: the uninitialized stack values for
nlimit_nh4(c,j) etc. happened to be consistent across calls, likely
because the routine's stack frame layout repeats and the underlying
memory is zero-initialized by the OS at process start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc parallel loop / !$omp parallel do to the two sub-loops of
sum_sminn_to_plant. The init loop (sminn_to_plant(c) = 0) is naturally
parallel over fc. The accumulation loop is fc-outer / j-inner under
_OPENACC || _OPENMP (so each thread owns a unique c and the
dz-weighted sum into sminn_to_plant(c) is serial within a thread); the
CPU-serial path keeps j-outer for cache-friendliness. Same race
pattern as accum_sminn_tot in Step 5a.
The accum_dz_weighted helper picks up !$acc routine seq so it's
callable from inside the device kernel. sminn_to_plant is added to
the !$acc data region's create() clause: the host residual_uptake_nh4
and residual_uptake_no3 loops modify sminn_to_plant later in the same
data region, so copyout would clobber those host writes — create
avoids the end-of-region D2H transfer and the host updates survive.
The !$acc update self block moves from after main_competition to
after sum_sminn_to_plant. It still sits before the mimics_decomp
block (which reads sum_*_demand_scaled) and before residual_uptake_nh4
(which reads sminn_to_plant), so all downstream CPU loops see fresh
device-computed values. sminn_to_plant joins the update self list.
sum_sminn_to_plant per-call wall-clock (--fast):
serial: 5.70e-5 s
openmp: 3.61e-4 s (-mp; 6.3x slower than serial — same
parallelization-overhead pattern as the
other small kernels)
gpu: 2.06e-5 s (~2.8x faster than serial)
Total per-call wall-clock (--fast):
serial: 5.58e-3 s
openmp: 1.11e-2 s
gpu: 7.82e-3 s
All three targets MATCH bit-identical against the unchanged
baselines (9.5857105051752981E+06 / 7.6772246368780300E+07).
Data-clause checklist (memory feedback_openacc_data_clause_review.md)
was walked top-to-bottom; review-agent approved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc parallel loop / !$omp parallel do to all eight sub-loops of
the two residual_uptake blocks (each block has init / main work /
re-sum init / re-sum). Loop ordering is fc-outer / j-inner under
_OPENACC || _OPENMP because residual_smin_*(c) accumulates over j and
the downstream branch reads the running total — j must be serial
within each thread, and each thread owns a unique c. CPU-serial keeps
j-outer for cache. Same race pattern as accum_sminn_tot.
compute_residual_smin_vr and distribute_residual_to_plant pick up
!$acc routine seq so they're callable from inside the device kernels.
Data-clause restructuring driven by the OpenACC review checklist
(memory feedback_openacc_data_clause_review.md):
- smin_nh4_to_plant_vr, smin_no3_to_plant_vr, sminn_to_plant_vr
move from create() to copyout(). Post-Step-5e nothing on the host
modifies them in the region (the residual loops that did so are
now GPU kernels), and the driver checksum reads them after the
region. Letting end-of-region D2H copy them is simpler than an
update self.
- New routine-local automatics added to create():
residual_plant_ndemand, residual_smin_nh4, residual_smin_no3,
residual_smin_nh4_vr, residual_smin_no3_vr.
They are written and read entirely inside the residual kernels.
- update self after sum_sminn_to_plant trimmed to just
actual_immob_vr, sum_nh4_demand_scaled, sum_no3_demand_scaled
— the only arrays the still-CPU code between that point and the
next sync (mimics block, sum_immobilization init) reads.
- New update self(sminn_to_plant) added after residual_uptake_no3:
compute_fpg_fpi (still CPU) reads sminn_to_plant on host, and the
residual kernels just modified it on device. Drop this update
self once compute_fpg_fpi is also GPU-ified.
residual_uptake_nh4 per-call wall-clock (--fast):
serial: 8.06e-4 s
openmp: 1.24e-3 s (-mp; ~1.5x slower than serial — same
parallel-overhead pattern at this size)
gpu: 4.75e-5 s (~17x faster than serial)
residual_uptake_no3 per-call wall-clock (--fast):
serial: 2.64e-4 s
openmp: 6.16e-4 s (-mp; ~2.3x slower than serial)
gpu: 4.44e-5 s (~6x faster than serial)
Total per-call wall-clock (--fast) — GPU now beats serial overall:
serial: 7.32e-3 s
openmp: 1.19e-2 s
gpu: 5.81e-3 s (1.26x faster than serial)
All three targets MATCH bit-identical against the unchanged
baselines (9.5857105051752981E+06 / 7.6772246368780300E+07).
Reviewer-approved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc parallel loop / !$omp parallel do to both sub-loops of
sum_immobilization (init zeroing + dz-weighted accumulation of
actual_immob(c) and potential_immob(c) over j). Same fc-outer / j-inner
under _OPENACC || _OPENMP race pattern as accum_sminn_tot and
sum_sminn_to_plant; CPU-serial keeps j-outer for cache.
Data-clause changes:
- actual_immob, potential_immob (column-scalar pointer args from
the driver) added to create(). The kernel zeros and accumulates
them on device; compute_fpg_fpi (still CPU) and the driver
checksum read them on host.
- The update self(sminn_to_plant) that lived right after
residual_uptake_no3 is consolidated into a new
update self(sminn_to_plant, actual_immob, potential_immob)
placed after sum_immobilization. No host code between
residual_uptake_no3 and sum_immobilization reads sminn_to_plant,
so the move is safe.
- update self after sum_sminn_to_plant trimmed: actual_immob_vr
dropped (sum_immobilization now reads it on-device), so only
sum_*_demand_scaled remain (for the still-CPU mimics_decomp block).
sum_immobilization per-call wall-clock (--fast):
serial: 1.02e-4 s
openmp: 6.14e-4 s (-mp; ~6x slower than serial — same
parallelization-overhead pattern)
gpu: 2.26e-5 s (~4.5x faster than serial)
Total per-call wall-clock (--fast):
serial: 5.68e-3 s
openmp: 1.23e-2 s
gpu: 5.29e-3 s
All three targets MATCH bit-identical against the unchanged
baselines (9.5857105051752981E+06 / 7.6772246368780300E+07).
Reviewer-approved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add !$acc parallel loop / !$omp parallel do to compute_fpg_fpi (per-c
loop computing fpg(c) = compute_fraction_or_one(sminn_to_plant(c),
plant_ndemand(c)) and fpi(c) = compute_fraction_or_one(actual_immob(c),
potential_immob(c))). The compute_fraction_or_one helper picks up
!$acc routine seq.
This is the last canonical-path CPU loop; after this commit, every
loop in the use_nitrif_denitrif=.true. branch is a GPU kernel except
the mimics_decomp block (which only runs in --all configs that
exercise that path).
Data-clause restructuring (driven by the OpenACC review checklist):
- fpg, fpi added to copyout(): kernel writes them on device, driver
checksum reads them on host post-region.
- sminn_to_plant, actual_immob, potential_immob move from create()
to copyout(): they're now produced and consumed entirely on
device, no host code in the region reads them, but the driver
checksum needs them on host so end-of-region D2H is the cleanest
way to deliver them.
- !$acc update self(sminn_to_plant, actual_immob, potential_immob)
after sum_immobilization is deleted: no remaining CPU consumer.
- !$acc update self(sum_*_demand_scaled) after sum_sminn_to_plant
is kept: the still-CPU mimics_decomp block reads them.
compute_fpg_fpi per-call wall-clock (--fast):
serial: 2.35e-5 s
openmp: 7.18e-5 s (-mp; ~3x slower than serial — same
parallel-overhead pattern at this size)
gpu: 1.08e-5 s (~2x faster than serial)
Total per-call wall-clock (--fast):
serial: 5.61e-3 s
openmp: 1.21e-2 s
gpu: 5.56e-3 s
All three targets MATCH bit-identical against the unchanged
baselines (9.5857105051752981E+06 / 7.6772246368780300E+07).
Reviewer-approved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-of-Step-5 documentation pass per the project plan:
- Replace the "until parallel directives land" hedge in the speedup
section with a measured per-loop wall-clock table for all three
targets (serial, OpenMP, GPU), captured via INNER_TIMING=1 on the
canonical --fast path.
- Note that OpenMP is consistently slower than serial at this
problem size (parallel-region overhead dominates on 8000 columns
× 10 levels), and that the GPU's per-kernel speedups don't
translate to total-time wins yet — host/device transfers and the
still-CPU mimics_decomp block bound the result.
- Add a "Status after Step 5" section that documents the single
!$acc data region, the surviving update self for the
mimics_decomp block's needs, and the two data-clause failure
modes (copyout host-write clobber, create host-read of stale
memory) that bit Step 5c twice during debugging. The discipline
is also captured in the feedback_openacc_data_clause_review
memory.
- Add an "Open work" section calling out the next two natural
optimizations: GPU-ify the mimics_decomp block (which would let
us delete the surviving update self), and hoist the data region
out of SoilBiogeochemCompetition's per-call scope to amortize
open/close transfers across iterations.
No code changes; baselines unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous Open Work entry suggested hoisting the !$acc data region up into the driver's per-call timing loop to amortize open/close transfers. That change would only improve the harness's measured speedup — production CTSM has no such repeated-call loop, so the amortization wouldn't translate to any real-world performance gain. Replace with the honest description of the bottleneck: per-call PCIe transfers of the copyin/copyout arrays. The realistic fix is to keep data on the device across routine boundaries, which requires GPU-ifying upstream producers and downstream consumers — same trajectory CTSM as a whole has to take, not a harness-specific trick. Add an explicit "do not" to flag the hoist anti-pattern for future reviewers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The host only reads sum_nh4_demand_scaled and sum_no3_demand_scaled inside the `if (decomp_method == mimics_decomp)` block; the else branch just zeros c_overflow_vr and never touches them. Move the !$acc update self into the if so the D2H transfer fires only when mimics is actually entered, saving a wasted transfer on every canonical (non-MIMICS) call (~80 KB × 2 arrays per call). All three targets (serial / OpenMP / GPU) still MATCH bit-identical on both --fast (canonical path, update self now skipped) and --all (includes mimics configs that still trigger the transfer): --fast: 9.5857105051752981E+06 --all : 7.6772246368780300E+07 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the perf_timers_mod routines were no-ops (empty bodies) when INNER_TIMING was undefined, but the call sites still fired — each unbuilt-INNER_TIMING run paid 16 empty-subroutine call costs per timestep (~1-2 µs total per call). Negligible against ~5 ms totals, but it muddies any "production-style" measurement and makes INNER_TIMING-on vs -off not actually a clean A/B. Now every `call perf_timer_start(...)` and `call perf_timer_stop(...)` in SoilBiogeochemCompetition.F90 is wrapped in #ifdef INNER_TIMING / #endif, and so are the `use perf_timers_mod, only : ...` lines in both SoilBiogeochemCompetition.F90 and driver.F90. The `write_inner_timings` print/dump_csv calls were already gated. With INNER_TIMING undefined the compiler emits no perf_timer instructions at all. Verified all three targets (serial / OpenMP / GPU) still MATCH bit-identical with INNER_TIMING off and INNER_TIMING=1, on both --fast and --all. Per-call timing for serial and GPU is within run-to-run noise either way (~5.7 ms serial, ~5.5 ms GPU). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of changes
This PR contains the changes I (well, mostly Claude, at my instruction) made in an isolated reproducer of SoilBiogeochemCompetition to enable OpenACC offloading of work to GPU. This isolated reproducer is designed to enable easy testing and iteration, but meaningful performance testing should be carried out in actual model runs.
The changes in this PR are not likely to result in much speedup unless running on GPU, and even then, it's unclear how much impact they'll make. However, I would like to get this merged into the main code to serve as a "living example" of how we can start transitioning more of the codebase to be GPU-offloadable.
Misc. notes:
use_nitrif_denitrif = .true., no MIMICS, no FATES.Remaining work
SoilBiogeochemCompetitionMod, not my reproducer.aux_clm)aux_clmtest that exercises the GPU offloading.Specific notes
Contributors other than yourself, if any:
CTSM issues resolved or otherwise addressed, if any:
Any user interface changes (namelist or namelist defaults changes)? No
Testing planned or performed, if any:
perf_testing/SoilBiogeochemCompetition/aux_clmto show no diffsRequirements before merge: