From 393262c9bf5be3b3bf5f2545bb5002691ed8dbb8 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:10:39 +0200
Subject: [PATCH 1/7] Update 13-examples.rst

---
 content/13-examples.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index 88dbb8d6..60670a48 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -1,6 +1,6 @@
 .. _example-heat:

-GPU programming example: stencil computation
+Example: putting it all together
 ============================================

 .. questions::

From d8eadb71e1da17e8a613cc9f2ea666d35d2b829d Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:19:21 +0200
Subject: [PATCH 2/7] Update 13-examples.rst: remove framework poll

---
 content/13-examples.rst | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index 60670a48..61dc1b6c 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -95,14 +95,6 @@ In `an earlier episode

Date: Wed, 26 Nov 2025 21:29:23 +0200
Subject: [PATCH 3/7] Update 13-examples.rst: thread-parallel tabs

---
 content/13-examples.rst | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index 61dc1b6c..b38f4ac5 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -126,21 +126,23 @@ Sequential and thread-parallel program in C++

 If we assume the grid point values to be truly independent *for a single time step*, stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below in tab "Stencil update". (General structure of the program and the default parameter values for the problem model are also provided for reference.) CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``:

+`stencil/base/ `_
+
 .. tabs::

-   .. tab:: Stencil update
+   .. tab:: Stencil update (core.cpp)

       .. literalinclude:: examples/stencil/base/core.cpp
          :language: cpp
          :emphasize-lines: 25

-   .. tab:: Main function
+   .. tab:: Main function (main.cpp)

       .. literalinclude:: examples/stencil/base/main.cpp
          :language: cpp
          :emphasize-lines: 37

-   .. tab:: Default params
+   .. tab:: Default params (heat.h)

       .. literalinclude:: examples/stencil/base/heat.h
          :language: cpp

From aeaa16dce4429251becdbe535a4270398f4a089b Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:38:08 +0200
Subject: [PATCH 4/7] Update 13-examples.rst: trying out multiline tab titles

---
 content/13-examples.rst | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index b38f4ac5..fcfefbb1 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -126,23 +126,26 @@ Sequential and thread-parallel program in C++

 If we assume the grid point values to be truly independent *for a single time step*, stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below in tab "Stencil update". (General structure of the program and the default parameter values for the problem model are also provided for reference.) CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``:

-`stencil/base/ `_
+**`stencil/base/ `_**

 .. tabs::

-   .. tab:: Stencil update (core.cpp)
+   .. tab:: | Stencil update
+            | (core.cpp)

       .. literalinclude:: examples/stencil/base/core.cpp
          :language: cpp
          :emphasize-lines: 25

-   .. tab:: Main function (main.cpp)
+   .. tab:: | Main function
+            | (main.cpp)

       .. literalinclude:: examples/stencil/base/main.cpp
          :language: cpp
          :emphasize-lines: 37

-   .. tab:: Default params (heat.h)
+   .. tab:: | Default params
+            | (heat.h)

       .. literalinclude:: examples/stencil/base/heat.h
          :language: cpp

From c97d264eaa9395f4834f0c8079a88f09553f1b78 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:46:10 +0200
Subject: [PATCH 5/7] Update 13-examples.rst: revert formatting

---
 content/13-examples.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index fcfefbb1..d6f31909 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -126,26 +126,26 @@ Sequential and thread-parallel program in C++

 If we assume the grid point values to be truly independent *for a single time step*, stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below in tab "Stencil update". (General structure of the program and the default parameter values for the problem model are also provided for reference.) CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``:

-**`stencil/base/ `_**
+`stencil/base/ `_

 .. tabs::

-   .. tab:: | Stencil update
-            | (core.cpp)
+   .. tab:: Stencil update
+      **core.cpp**

       .. literalinclude:: examples/stencil/base/core.cpp
          :language: cpp
          :emphasize-lines: 25

-   .. tab:: | Main function
-            | (main.cpp)
+   .. tab:: Main function
+      **main.cpp**

       .. literalinclude:: examples/stencil/base/main.cpp
          :language: cpp
          :emphasize-lines: 37

-   .. tab:: | Default params
-            | (heat.h)
+   .. tab:: Default params
+      **heat.h**

       .. literalinclude:: examples/stencil/base/heat.h
          :language: cpp

From 7566d53a3393522eb99e48aba32eeec567c6fcb2 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 22:11:41 +0200
Subject: [PATCH 6/7] Update 13-examples.rst: SYCL part formatting

---
 content/13-examples.rst | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index d6f31909..16da9adf 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -275,15 +275,19 @@ Similarly, SYCL programming model offers convenient ways to define execution ker

 Changes of stencil update code for OpenMP and SYCL are shown in the tabs below:

+`stencil/ `_
+
 .. tabs::

    .. tab:: OpenMP (naive)
+      **base/core-off.cpp**

       .. literalinclude:: examples/stencil/base/core-off.cpp
          :language: cpp
          :emphasize-lines: 25-26

    .. tab:: SYCL (naive)
+      **sycl/core-naive.cpp**

       .. literalinclude:: examples/stencil/sycl/core-naive.cpp
          :language: cpp
@@ -313,10 +317,10 @@ Changes of stencil update code for OpenMP and SYCL are shown in the tabs below:
       $ cd ../sycl/
       (give the following lines some time, probably a couple of min)
       $ acpp -O2 -o stencil_naive core-naive.cpp io.cpp main-naive.cpp pngwriter.c setup.cpp utilities.cpp
-      $ acpp -O2 -o stencil core.cpp io.cpp main.cpp pngwriter.c setup.cpp utilities.cpp
+      $ acpp -O2 -o stencil_data core.cpp io.cpp main.cpp pngwriter.c setup.cpp utilities.cpp

       $ srun stencil_naive
-      $ srun stencil
+      $ srun stencil_data

    If everything works well, the output should look similar to this:

@@ -327,7 +331,7 @@ Changes of stencil update code for OpenMP and SYCL are shown in the tabs below:
       Average temperature at end: 59.281239
       Control temperature at end: 59.281239
       Iterations took 2.086 seconds.
-      $ srun stencil
+      $ srun stencil_data
       Average temperature, start: 59.763305
       Average temperature at end: 59.281239
       Control temperature at end: 59.281239
@@ -386,39 +390,38 @@ But overhead can be reduced by taking care to minimize data transfers between *h
 - only copy the data from GPU to CPU when we need it,
 - swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this automatically.)

-Changes of stencil update code as well as the main program are shown in tabs below.
+Changes of stencil update code as well as the main program are shown in tabs below:
+
+`stencil/ `__

 .. tabs::

    .. tab:: OpenMP
+      **base/core-data.cpp**

       .. literalinclude:: examples/stencil/base/core-data.cpp
          :language: cpp
          :emphasize-lines: 25,40-75

    .. tab:: SYCL
+      **sycl/core.cpp**

       .. literalinclude:: examples/stencil/sycl/core.cpp
          :language: cpp
          :emphasize-lines: 13-14,25,40-50

    .. tab:: Python
+      **python-numba/core_cuda.py**

       .. literalinclude:: examples/stencil/python-numba/core_cuda.py
          :language: py
          :lines: 6-34
          :emphasize-lines: 14-16,18

-   .. tab:: main() (SYCL)
-
-      .. literalinclude:: examples/stencil/sycl/main.cpp
-         :language: cpp
-         :emphasize-lines: 38-39,44-45,51,56,59,75-77
-

 .. challenge:: Exercise: updated GPU ports

-   Test your compiled executables ``base/stencil_data`` and ``sycl/stencil``. Try changing problem size parameters:
+   Test your compiled executables ``base/stencil_data`` and ``sycl/stencil_data``. Try changing problem size parameters:

    - ``srun stencil 2000 2000 5000``


From 78bf0432f9f02666500ea8d2179f0b93a0abe625 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 22:31:17 +0200
Subject: [PATCH 7/7] Update 13-examples.rst: Python section formatting

---
 content/13-examples.rst | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index 16da9adf..b2f7c82c 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -390,7 +390,7 @@ But overhead can be reduced by taking care to minimize data transfers between *h
 - only copy the data from GPU to CPU when we need it,
 - swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this automatically.)

-Changes of stencil update code as well as the main program are shown in tabs below:
+Changes of stencil update code are shown in tabs below (also check out the respective main() functions for calls to persistent GPU buffer creation, access, and deletion):

 `stencil/ `__

@@ -410,14 +410,6 @@ Changes of stencil update code as well as the main program are shown in tabs bel
          :language: cpp
          :emphasize-lines: 13-14,25,40-50

-   .. tab:: Python
-      **python-numba/core_cuda.py**
-
-      .. literalinclude:: examples/stencil/python-numba/core_cuda.py
-         :language: py
-         :lines: 6-34
-         :emphasize-lines: 14-16,18
-

 .. challenge:: Exercise: updated GPU ports

@@ -458,9 +450,12 @@ Python: JIT and GPU acceleration

 As mentioned `previously `_, Numba package allows developers to just-in-time (JIT) compile Python code to run fast on CPUs, but can also be used for JIT compiling for (NVIDIA) GPUs. JIT seems to work well on loop-based, computationally heavy functions, so trying it out is a nice choice for initial source version:

+`stencil/python-numba `_
+
 .. tabs::

    .. tab:: Stencil update
+      **core.py**

       .. literalinclude:: examples/stencil/python-numba/core.py
          :language: py
@@ -468,12 +463,21 @@ As mentioned `previously `_, tab "Python". In this case, data transfer functions ``devdata = cuda.to_device(data)`` and ``devdata.copy_to_host(data)`` (see ``main_cuda.py``) are already provided by Numba package.
+Numba also offers direct CUDA-based kernel programming, which can be the best choice for those already familiar with CUDA. Example for stencil update written in Numba CUDA is shown in the above section, tab "Stencil update in GPU". In this case, data transfer functions ``devdata = cuda.to_device(data)`` and ``devdata.copy_to_host(data)`` (see ``main_cuda.py``) are already provided by Numba package.

 .. challenge:: Exercise: CUDA acceleration in Python