Improve GPU performance of TLSPH #1084

Open
efaulhaber wants to merge 8 commits into trixi-framework:main from efaulhaber:drift-source-terms

Conversation

@efaulhaber
Member

This PR

  • pulls integrate_tlsph checks out of the hot loops in drift! and source terms (relevant for small problems on GPUs),
  • adds an optimized version of add_velocity! and update_tlsph_positions! for GPU arrays,
  • removes set_zero!(du) and renames add_velocity! to set_velocity! (relevant for small problems on GPUs),
  • skips add_source_terms! when no source terms are used and system.acceleration is zero.
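
The first and last bullets can be illustrated with a minimal sketch in plain Julia. Everything here is a hypothetical stand-in, not the TrixiParticles.jl code: `ToySystem`, `drift_before!`, `drift_after!`, and `add_source_terms_sketch!` are made-up names, the `integrate` field only plays the role of the `integrate_tlsph` check, and `v` is assumed to have the same shape as `du`.

```julia
# Hypothetical toy system; field names are assumptions for illustration only.
struct ToySystem{ST}
    integrate::Bool                    # stands in for the `integrate_tlsph` check
    acceleration::NTuple{2, Float64}   # gravity-like constant acceleration
    source_terms::ST                   # `nothing` when no source terms are used
end

# Before: zero `du`, then test the flag once per particle inside the hot loop.
function drift_before!(du, v, system)
    fill!(du, 0)                       # plays the role of `set_zero!(du)`
    for particle in axes(du, 2)
        system.integrate || continue   # per-particle check
        for dim in axes(du, 1)
            du[dim, particle] += v[dim, particle]
        end
    end
    return du
end

# After: test the flag once per system and set (not add) the velocity in bulk.
function drift_after!(du, v, system)
    if system.integrate
        copyto!(du, v)                 # one bulk copy instead of a particle loop
    else
        fill!(du, 0)
    end
    return du
end

# Skip the whole loop when there are no source terms and the acceleration is zero.
function add_source_terms_sketch!(dv, system)
    no_sources = system.source_terms === nothing
    no_gravity = all(iszero, system.acceleration)
    no_sources && no_gravity && return dv
    for particle in axes(dv, 2), dim in axes(dv, 1)
        dv[dim, particle] += system.acceleration[dim]
        if !no_sources
            dv[dim, particle] += system.source_terms(dim, particle)
        end
    end
    return dv
end
```

For a small system, the "after" variants do the same work with one branch per call instead of one per particle, which is the case the bullets above describe as relevant for small problems on GPUs.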

This matters most for split integration on GPUs, where a small TLSPH-only simulation is run for many sub-steps: in the benchmark below, split integration is 1.4x faster and the total runtime drops from 146s to 114s.

Here are the relevant changes in the timer output of my large fin simulation (#844) on an H100. Note that this is before I moved the "source terms" timer behind the dispatch. Now, the "source terms" timer should actually completely disappear when neither source terms nor gravity is used.
main:

Total                                          146s
split integration                    3.67k    96.6s   66.8%  26.3ms 
    update TLSPH positions            122k    4.73s    3.3%  38.9μs
  drift!                              120k    10.8s    7.4%  89.8μs
  source terms                        120k    9.78s    6.8%  81.5μs

fluid RHS
  source terms                       1.67k    989ms    0.7%   592μs 
drift!                               1.67k    2.30s    1.6%  1.38ms

this PR:

Total                                          114s                   (1.28x speedup)
split integration                    3.67k    68.0s   59.9%  18.5ms   (1.4x speedup)
    update TLSPH positions            122k    1.24s    1.1%  10.2μs   (3.8x speedup)
  drift!                              120k    1.54s    1.4%  12.8μs   (7.0x speedup)
  source terms                        120k    130ms    0.1%  1.08μs   (now negligible)

fluid RHS
  source terms                       1.67k   12.3ms    0.0%  7.36μs   (now negligible)
drift!                               1.67k    394ms    0.3%   236μs   (5.8x speedup)
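
The speedup factors quoted in the excerpt are simply the ratio of the `main` time to the time with this PR. A short sketch of that arithmetic, using the timings copied from the two excerpts above (the `speedup` helper and the section labels are only illustrative):

```julia
# Timings in seconds as (main, this PR), copied from the excerpts above.
timings = [
    "Total"                  => (146.0, 114.0),
    "split integration"      => (96.6, 68.0),
    "update TLSPH positions" => (4.73, 1.24),
    "drift! (split)"         => (10.8, 1.54),
    "drift! (top level)"     => (2.30, 0.394),
]

# Speedup factor: how many times faster the PR is than main.
speedup(before, after) = before / after

for (section, (before, after)) in timings
    println(rpad(section, 24), round(speedup(before, after); digits = 2), "x")
end
```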

Here are the full timer outputs for reference.
main:

──────────────────────────────────────────────────────────────────────────────────────────────
             TrixiParticles.jl                       Time                    Allocations      
                                            ───────────────────────   ────────────────────────
             Tot / % measured:                    146s /  99.3%           69.0GiB / 100.0%    

Section                             ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────────────
split integration                    3.67k    96.6s   66.8%  26.3ms   28.3GiB   41.0%  7.88MiB
  update systems and nhs              122k    38.2s   26.4%   314μs   10.8GiB   15.6%  92.8KiB
    stress tensor                     122k    20.4s   14.1%   168μs   5.05GiB    7.3%  43.5KiB
    apply prescribed motion           122k    7.84s    5.4%  64.5μs   3.05GiB    4.4%  26.3KiB
    update TLSPH positions            122k    4.73s    3.3%  38.9μs   1.58GiB    2.3%  13.6KiB
    update nhs                       1.67k    2.60s    1.8%  1.56ms    199MiB    0.3%   122KiB
    ~update systems and nhs~          122k    1.63s    1.1%  13.4μs    276MiB    0.4%  2.32KiB
    compute boundary pressure        3.34k    767ms    0.5%   230μs    599MiB    0.8%   184KiB
    inverse state equation           3.34k    201ms    0.1%  60.1μs   48.9MiB    0.1%  15.0KiB
    update density diffusion         1.67k    251μs    0.0%   150ns     0.00B    0.0%    0.00B
  ~split integration~                3.67k    20.4s   14.1%  5.56ms   2.39GiB    3.5%   683KiB
  system interaction                  122k    13.9s    9.6%   115μs   3.33GiB    4.8%  28.7KiB
    structure4-structure4             120k    12.9s    8.9%   108μs   3.13GiB    4.5%  27.3KiB
    ~system interaction~              122k    758ms    0.5%  6.23μs   91.2MiB    0.1%     786B
    structure4-fluid1                1.67k    243ms    0.2%   145μs    111MiB    0.2%  67.9KiB
    structure4-open_boundary3        1.67k    165μs    0.0%  98.7ns     0.00B    0.0%    0.00B
    structure4-boundary2             1.67k    147μs    0.0%  88.0ns     0.00B    0.0%    0.00B
  drift!                              120k    10.8s    7.4%  89.8μs   5.60GiB    8.1%  48.9KiB
  source terms                        120k    9.78s    6.8%  81.5μs   5.38GiB    7.8%  47.0KiB
  reset ∂v/∂t                         120k    1.93s    1.3%  16.1μs    205MiB    0.3%  1.75KiB
  compute averaged velocity          24.0k    1.49s    1.0%  62.3μs    564MiB    0.8%  24.0KiB
  copy back                          3.67k    124ms    0.1%  33.8μs   39.7MiB    0.1%  11.1KiB
  init                                   1   1.84ms    0.0%  1.84ms    310KiB    0.0%   310KiB
kick!                                1.67k    28.4s   19.6%  17.0ms   3.86GiB    5.6%  2.37MiB
  system interaction                 1.67k    22.7s   15.7%  13.6ms   1.64GiB    2.4%  1.01MiB
    fluid1-fluid1                    1.67k    16.6s   11.5%  9.95ms    109MiB    0.2%  67.0KiB
    fluid1-open_boundary3            1.67k    1.83s    1.3%  1.09ms    225MiB    0.3%   138KiB
    fluid1-structure4                1.67k    854ms    0.6%   511μs    160MiB    0.2%  97.8KiB
    open_boundary3-open_boundary3    1.67k    822ms    0.6%   492μs    286MiB    0.4%   175KiB
    open_boundary3-structure4        1.67k    736ms    0.5%   440μs    241MiB    0.3%   148KiB
    fluid1-boundary2                 1.67k    699ms    0.5%   419μs    106MiB    0.2%  65.0KiB
    open_boundary3-fluid1            1.67k    465ms    0.3%   278μs    204MiB    0.3%   125KiB
    open_boundary3-boundary2         1.67k    405ms    0.3%   242μs    202MiB    0.3%   124KiB
    ~system interaction~             1.67k    218ms    0.2%   131μs    147MiB    0.2%  90.3KiB
    structure4-fluid1                1.67k    372μs    0.0%   223ns     0.00B    0.0%    0.00B
    boundary2-fluid1                 1.67k    228μs    0.0%   136ns     0.00B    0.0%    0.00B
    boundary2-structure4             1.67k    171μs    0.0%   103ns     0.00B    0.0%    0.00B
    boundary2-boundary2              1.67k    148μs    0.0%  88.7ns     0.00B    0.0%    0.00B
    structure4-structure4            1.67k    140μs    0.0%  83.9ns     0.00B    0.0%    0.00B
    boundary2-open_boundary3         1.67k    135μs    0.0%  80.8ns     0.00B    0.0%    0.00B
    structure4-boundary2             1.67k    129μs    0.0%  77.1ns     0.00B    0.0%    0.00B
    structure4-open_boundary3        1.67k    101μs    0.0%  60.7ns     0.00B    0.0%    0.00B
  update systems and nhs             1.67k    4.69s    3.2%  2.81ms   1.16GiB    1.7%   727KiB
    update nhs                       1.67k    2.62s    1.8%  1.57ms    199MiB    0.3%   122KiB
    compute boundary pressure        3.34k    787ms    0.5%   235μs    599MiB    0.8%   184KiB
    ~update systems and nhs~         1.67k    635ms    0.4%   380μs    203MiB    0.3%   124KiB
    stress tensor                    1.67k    282ms    0.2%   169μs   71.0MiB    0.1%  43.5KiB
    inverse state equation           3.34k    205ms    0.1%  61.4μs   48.9MiB    0.1%  15.0KiB
    apply prescribed motion          1.67k   96.4ms    0.1%  57.7μs   42.9MiB    0.1%  26.3KiB
    update TLSPH positions           1.67k   66.0ms    0.0%  39.5μs   22.2MiB    0.0%  13.6KiB
    update density diffusion         1.67k    227μs    0.0%   136ns     0.00B    0.0%    0.00B
  source terms                       1.67k    989ms    0.7%   592μs   1.06GiB    1.5%   664KiB
  reset ∂v/∂t                        1.67k   74.6ms    0.1%  44.6μs   2.88MiB    0.0%  1.77KiB
  ~kick!~                            1.67k   15.9ms    0.0%  9.50μs   1.55KiB    0.0%    0.95B
save solution                            1    10.7s    7.4%   10.7s   18.4GiB   26.6%  18.4GiB
  write to vtk                           4    9.23s    6.4%   2.31s   2.32GiB    3.4%   595MiB
  ~save solution~                        1    1.44s    1.0%   1.44s   16.0GiB   23.3%  16.0GiB
  update dvdu                            1   21.2ms    0.0%  21.2ms   3.09MiB    0.0%  3.09MiB
  update systems                         1   3.62ms    0.0%  3.62ms    727KiB    0.0%   727KiB
update callback                        334    5.01s    3.5%  15.0ms    673MiB    1.0%  2.01MiB
  ~update callback~                    334    2.06s    1.4%  6.17ms    272MiB    0.4%   835KiB
  update open boundary                 334    2.01s    1.4%  6.01ms    155MiB    0.2%   476KiB
    check domain                       334    1.97s    1.4%  5.90ms    137MiB    0.2%   420KiB
    update boundary quantities         334   36.0ms    0.0%   108μs   18.3MiB    0.0%  56.1KiB
    ~update open boundary~             334   3.47ms    0.0%  10.4μs    105KiB    0.0%     323B
  update systems and nhs               334    925ms    0.6%  2.77ms    237MiB    0.3%   727KiB
  compute averaged velocity            334   16.7ms    0.0%  49.9μs   7.85MiB    0.0%  24.1KiB
  apply particle shifting              668    139μs    0.0%   208ns     0.00B    0.0%    0.00B
drift!                               1.67k    2.30s    1.6%  1.38ms   1.06GiB    1.5%   666KiB
  velocity                           1.67k    2.22s    1.5%  1.33ms   1.06GiB    1.5%   664KiB
  reset ∂u/∂t                        1.67k   66.0ms    0.0%  39.5μs   2.88MiB    0.0%  1.77KiB
  ~drift!~                           1.67k   10.2ms    0.0%  6.08μs      976B    0.0%    0.58B
apply postprocess cb                     4    1.59s    1.1%   397ms   16.8GiB   24.3%  4.19GiB
  ~apply postprocess cb~                 4    1.51s    1.0%   377ms   16.7GiB   24.3%  4.18GiB
  update dvdu                            4   71.3ms    0.0%  17.8ms   12.1MiB    0.0%  3.03MiB
  update systems and nhs                 4   11.5ms    0.0%  2.87ms   2.84MiB    0.0%   727KiB
calculate dt                             1   2.88μs    0.0%  2.88μs     0.00B    0.0%    0.00B
──────────────────────────────────────────────────────────────────────────────────────────────

this PR:

──────────────────────────────────────────────────────────────────────────────────────────────
             TrixiParticles.jl                       Time                    Allocations      
                                            ───────────────────────   ────────────────────────
             Tot / % measured:                    114s /  99.6%           53.9GiB / 100.0%    

Section                             ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────────────
split integration                    3.67k    68.0s   59.9%  18.5ms   15.4GiB   28.6%  4.30MiB
  update systems and nhs              122k    36.5s   32.2%   300μs   9.44GiB   17.5%  81.3KiB
    stress tensor                     122k    21.0s   18.5%   172μs   5.05GiB    9.4%  43.5KiB
    apply prescribed motion           122k    8.51s    7.5%  69.9μs   3.08GiB    5.7%  26.6KiB
    update nhs                       1.67k    2.58s    2.3%  1.54ms    199MiB    0.4%   122KiB
    ~update systems and nhs~          122k    1.82s    1.6%  14.9μs    275MiB    0.5%  2.31KiB
    update TLSPH positions            122k    1.24s    1.1%  10.2μs    225MiB    0.4%  1.89KiB
    compute boundary pressure        3.34k    991ms    0.9%   297μs    594MiB    1.1%   182KiB
    inverse state equation           3.34k    414ms    0.4%   124μs   48.0MiB    0.1%  14.7KiB
    update density diffusion         1.67k    212μs    0.0%   127ns     0.00B    0.0%    0.00B
  system interaction                  122k    13.4s   11.8%   110μs   3.29GiB    6.1%  28.3KiB
    structure4-structure4             120k    12.3s   10.8%   102μs   3.09GiB    5.7%  27.0KiB
    ~system interaction~              122k    714ms    0.6%  5.87μs   90.3MiB    0.2%     778B
    structure4-fluid1                1.67k    458ms    0.4%   274μs    109MiB    0.2%  66.9KiB
    structure4-boundary2             1.67k    140μs    0.0%  83.7ns     0.00B    0.0%    0.00B
    structure4-open_boundary3        1.67k    124μs    0.0%  74.2ns     0.00B    0.0%    0.00B
  ~split integration~                3.67k    13.2s   11.6%  3.59ms   1.63GiB    3.0%   467KiB
  drift!                              120k    1.54s    1.4%  12.8μs    282MiB    0.5%  2.41KiB
  compute averaged velocity          24.0k    1.32s    1.2%  55.1μs    564MiB    1.0%  24.0KiB
  reset ∂v/∂t                         120k    1.12s    1.0%  9.33μs    143MiB    0.3%  1.22KiB
  copy back                          3.67k    721ms    0.6%   196μs   37.3MiB    0.1%  10.4KiB
  source terms                        120k    130ms    0.1%  1.08μs   54.9MiB    0.1%     480B
  init                                   1   1.51ms    0.0%  1.51ms    172KiB    0.0%   172KiB
kick!                                1.67k    27.8s   24.5%  16.7ms   2.75GiB    5.1%  1.68MiB
  system interaction                 1.67k    22.5s   19.9%  13.5ms   1.61GiB    3.0%  0.99MiB
    fluid1-fluid1                    1.67k    16.6s   14.7%  10.0ms    105MiB    0.2%  64.4KiB
    fluid1-open_boundary3            1.67k    1.65s    1.5%   986μs    222MiB    0.4%   136KiB
    fluid1-structure4                1.67k    1.07s    0.9%   639μs    155MiB    0.3%  94.9KiB
    open_boundary3-open_boundary3    1.67k    877ms    0.8%   525μs    283MiB    0.5%   173KiB
    fluid1-boundary2                 1.67k    689ms    0.6%   412μs    104MiB    0.2%  63.7KiB
    open_boundary3-structure4        1.67k    577ms    0.5%   345μs    236MiB    0.4%   145KiB
    open_boundary3-fluid1            1.67k    461ms    0.4%   276μs    201MiB    0.4%   123KiB
    open_boundary3-boundary2         1.67k    398ms    0.4%   238μs    199MiB    0.4%   122KiB
    ~system interaction~             1.67k    198ms    0.2%   118μs    144MiB    0.3%  88.3KiB
    boundary2-boundary2              1.67k    269μs    0.0%   161ns     0.00B    0.0%    0.00B
    structure4-fluid1                1.67k    267μs    0.0%   160ns     0.00B    0.0%    0.00B
    boundary2-fluid1                 1.67k    259μs    0.0%   155ns     0.00B    0.0%    0.00B
    boundary2-open_boundary3         1.67k    244μs    0.0%   146ns     0.00B    0.0%    0.00B
    structure4-boundary2             1.67k    211μs    0.0%   126ns     0.00B    0.0%    0.00B
    boundary2-structure4             1.67k    162μs    0.0%  97.1ns     0.00B    0.0%    0.00B
    structure4-structure4            1.67k    148μs    0.0%  88.4ns     0.00B    0.0%    0.00B
    structure4-open_boundary3        1.67k   88.3μs    0.0%  52.8ns     0.00B    0.0%    0.00B
  update systems and nhs             1.67k    5.25s    4.6%  3.14ms   1.13GiB    2.1%   711KiB
    update nhs                       1.67k    2.59s    2.3%  1.55ms    199MiB    0.4%   122KiB
    compute boundary pressure        3.34k    1.23s    1.1%   367μs    595MiB    1.1%   182KiB
    ~update systems and nhs~         1.67k    835ms    0.7%   500μs    202MiB    0.4%   124KiB
    stress tensor                    1.67k    282ms    0.2%   169μs   71.0MiB    0.1%  43.5KiB
    inverse state equation           3.34k    199ms    0.2%  59.5μs   48.1MiB    0.1%  14.7KiB
    apply prescribed motion          1.67k    103ms    0.1%  61.5μs   43.3MiB    0.1%  26.6KiB
    update TLSPH positions           1.67k   20.0ms    0.0%  12.0μs   3.09MiB    0.0%  1.89KiB
    update density diffusion         1.67k    245μs    0.0%   147ns     0.00B    0.0%    0.00B
  reset ∂v/∂t                        1.67k   17.5ms    0.0%  10.5μs   2.01MiB    0.0%  1.23KiB
  source terms                       1.67k   12.3ms    0.0%  7.36μs   3.06MiB    0.0%  1.88KiB
  ~kick!~                            1.67k   11.7ms    0.0%  6.98μs   1.55KiB    0.0%    0.95B
save solution                            1    10.3s    9.1%   10.3s   18.2GiB   33.8%  18.2GiB
  write to vtk                           4    9.05s    8.0%   2.26s   2.19GiB    4.1%   561MiB
  ~save solution~                        1    1.27s    1.1%   1.27s   16.0GiB   29.8%  16.0GiB
  update dvdu                            1   19.3ms    0.0%  19.3ms   1.82MiB    0.0%  1.82MiB
  update systems                         1   3.20ms    0.0%  3.20ms    712KiB    0.0%   712KiB
update callback                        334    5.00s    4.4%  15.0ms    664MiB    1.2%  1.99MiB
  update open boundary                 334    1.98s    1.7%  5.92ms    155MiB    0.3%   475KiB
    check domain                       334    1.94s    1.7%  5.80ms    137MiB    0.2%   419KiB
    update boundary quantities         334   34.9ms    0.0%   104μs   18.1MiB    0.0%  55.6KiB
    ~update open boundary~             334   3.19ms    0.0%  9.54μs    105KiB    0.0%     323B
  ~update callback~                    334    1.88s    1.7%  5.63ms    269MiB    0.5%   826KiB
  update systems and nhs               334    1.13s    1.0%  3.37ms    232MiB    0.4%   712KiB
  compute averaged velocity            334   16.6ms    0.0%  49.8μs   7.85MiB    0.0%  24.1KiB
  apply particle shifting              668   76.0μs    0.0%   114ns     0.00B    0.0%    0.00B
apply postprocess cb                     4    1.94s    1.7%   484ms   16.7GiB   31.1%  4.19GiB
  ~apply postprocess cb~                 4    1.86s    1.6%   465ms   16.7GiB   31.1%  4.18GiB
  update dvdu                            4   65.7ms    0.1%  16.4ms   7.06MiB    0.0%  1.76MiB
  update systems and nhs                 4   10.9ms    0.0%  2.73ms   2.78MiB    0.0%   712KiB
drift!                               1.67k    394ms    0.3%   236μs    105MiB    0.2%  64.6KiB
calculate dt                             1   2.72μs    0.0%  2.72μs     0.00B    0.0%    0.00B
──────────────────────────────────────────────────────────────────────────────────────────────

Copilot AI left a comment

Pull request overview

This PR optimizes the TLSPH time-integration path for GPU workloads by reducing per-particle dispatch/branching in hot loops, adding GPU-optimized bulk copies for velocity/positions, and short-circuiting work when source terms/gravity are inactive.

Changes:

  • Refactors drift! velocity update into per-system set_velocity! and adds a GPU fast-path (copyto!) for velocity assignment.
  • Adds a GPU fast-path for TLSPH position updates (update_tlsph_positions!).
  • Refactors add_source_terms! to dispatch per system and skip work when acceleration and source terms are inactive; updates special-case handling for ParticlePackingSystem.
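
The per-system dispatch with a no-op specialization described in the last bullet can be sketched with Julia's multiple dispatch. The types and functions below (`AbstractToySystem`, `ToyFluid`, `ToyPacking`, `apply_sources!`, `apply_all_sources!`) are hypothetical and only illustrate the pattern; they are not the methods added in this PR.

```julia
# Hypothetical system types for illustration only.
abstract type AbstractToySystem end

struct ToyFluid <: AbstractToySystem
    gravity::Float64
end

struct ToyPacking <: AbstractToySystem end   # stands in for a system without source terms

# Generic method: add a constant acceleration to every entry.
function apply_sources!(dv, system::AbstractToySystem)
    dv .+= system.gravity
    return dv
end

# No-op specialization: selected by dispatch, so no per-particle branch is needed.
apply_sources!(dv, ::ToyPacking) = dv

# The decision is made once per system, outside any particle loop.
function apply_all_sources!(dvs, systems)
    foreach(apply_sources!, dvs, systems)
    return dvs
end
```

Because the method is chosen per system type, a system that never needs source terms pays only for a function call that returns immediately.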

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File                                                  Description
src/general/semidiscretization.jl                     Refactors drift! velocity-setting and reworks source-term application/dispatch + skipping logic.
src/schemes/structure/total_lagrangian_sph/system.jl  Adds GPU-optimized TLSPH position copy.
src/schemes/boundary/open_boundary/system.jl          Updates open-boundary velocity-setting to new set_velocity! API and threaded loop.
src/preprocessing/particle_packing/system.jl          Adds ParticlePackingSystem source-term no-op specialization and adapts to set_velocity!.
test/schemes/structure/total_lagrangian_sph/rhs.jl    Removes now-unneeded mock add_acceleration! definition.

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 82.60870% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.27%. Comparing base (71da8bc) to head (2435602).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                               Patch %  Lines
src/general/semidiscretization.jl                      88.88%   6 Missing ⚠️
...c/schemes/structure/total_lagrangian_sph/system.jl  0.00%    5 Missing ⚠️
test/general/semidiscretization.jl                     69.23%   4 Missing ⚠️
src/preprocessing/particle_packing/system.jl           87.50%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1084      +/-   ##
==========================================
- Coverage   89.38%   89.27%   -0.12%     
==========================================
  Files         122      122              
  Lines        8958     8987      +29     
==========================================
+ Hits         8007     8023      +16     
- Misses        951      964      +13     
Flag Coverage Δ
total 89.27% <82.60%> (-0.12%) ⬇️
unit 65.32% <47.82%> (+0.06%) ⬆️

@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber efaulhaber marked this pull request as ready for review March 9, 2026 12:03
@efaulhaber efaulhaber requested review from LasNikas and svchb March 9, 2026 12:03
add_velocity!(du, v, u, particle, system, semi, t)
end
end
# Set velocity and add acceleration for each system
Collaborator

Isn't acceleration set in kick!()?

Member Author

Indeed. I just copied the wrong comment without thinking about it. But especially with #1055, the comment is redundant anyway because it's clear from the code what is happening.

3 participants