From 393262c9bf5be3b3bf5f2545bb5002691ed8dbb8 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:10:39 +0200
Subject: [PATCH 1/7] Update 13-examples.rst: rename episode title

---
 content/13-examples.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index 88dbb8d6..60670a48 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -1,6 +1,6 @@
 .. _example-heat:
 
-GPU programming example: stencil computation
+Example: putting it all together
 ============================================
 
 .. questions::

From d8eadb71e1da17e8a613cc9f2ea666d35d2b829d Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:19:21 +0200
Subject: [PATCH 2/7] Update 13-examples.rst: remove framework poll

---
 content/13-examples.rst | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index 60670a48..61dc1b6c 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -95,14 +95,6 @@ In `an earlier episode <https://enccs.github.io/gpu-programming/7-non-portable-k
 Another point to note is that even if the solution is propagated in small time steps, not every step might actually be needed for output. Once some *local* region of the field is updated, mathematically nothing prevents it from being updated for the second time step -- even if the rest of the field is still being recalculated -- as long as :math:`t = m-1` values for the region boundary are there when needed. (Of course, this is more complicated to implement and would only give benefits in certain cases.)
 
 
-.. challenge:: Poll: which programming model/ framework are you most interested in today?
-
-   - OpenMP offloading (C++)
-   - SYCL (C++)
-   - *Python* (``numba``/CUDA)
-   - Julia
-
-
 The following table will aid you in navigating the rest of this section:
 
 .. admonition:: Episode guide

From b49d4d7c2bde47f2cae1758bdc1493530415cac4 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:29:23 +0200
Subject: [PATCH 3/7] Update 13-examples.rst: thread-parallel tabs

---
 content/13-examples.rst | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index 61dc1b6c..b38f4ac5 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -126,21 +126,23 @@ Sequential and thread-parallel program in C++
 
 If we assume the grid point values to be truly independent *for a single time step*, stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below in tab "Stencil update". (General structure of the program and the default parameter values for the problem model are also provided for reference.) CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``:
 
+`stencil/base/ <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/base/>`_
+
 .. tabs::
 
-   .. tab:: Stencil update
+   .. tab:: Stencil update <br>(core.cpp)
 
          .. literalinclude:: examples/stencil/base/core.cpp 
                         :language: cpp
                         :emphasize-lines: 25
 
-   .. tab:: Main function
+   .. tab:: Main function <br>(main.cpp)
 
          .. literalinclude:: examples/stencil/base/main.cpp 
                         :language: cpp
                         :emphasize-lines: 37
  
-   .. tab:: Default params
+   .. tab:: Default params <br>(heat.h)
 
          .. literalinclude:: examples/stencil/base/heat.h 
                         :language: cpp

From aeaa16dce4429251becdbe535a4270398f4a089b Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:38:08 +0200
Subject: [PATCH 4/7] Update 13-examples.rst: trying out multiline tab titles

---
 content/13-examples.rst | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index b38f4ac5..fcfefbb1 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -126,23 +126,26 @@ Sequential and thread-parallel program in C++
 
 If we assume the grid point values to be truly independent *for a single time step*, stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below in tab "Stencil update". (General structure of the program and the default parameter values for the problem model are also provided for reference.) CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``:
 
-`stencil/base/ <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/base/>`_
+**`stencil/base/ <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/base/>`_**
 
 .. tabs::
 
-   .. tab:: Stencil update <br>(core.cpp)
+   .. tab:: | Stencil update 
+            | (core.cpp)
 
          .. literalinclude:: examples/stencil/base/core.cpp 
                         :language: cpp
                         :emphasize-lines: 25
 
-   .. tab:: Main function <br>(main.cpp)
+   .. tab:: | Main function 
+            | (main.cpp)
 
          .. literalinclude:: examples/stencil/base/main.cpp 
                         :language: cpp
                         :emphasize-lines: 37
  
-   .. tab:: Default params <br>(heat.h)
+   .. tab:: | Default params 
+            | (heat.h)
 
          .. literalinclude:: examples/stencil/base/heat.h 
                         :language: cpp

From c97d264eaa9395f4834f0c8079a88f09553f1b78 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:46:10 +0200
Subject: [PATCH 5/7] Update 13-examples.rst: revert formatting

---
 content/13-examples.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index fcfefbb1..d6f31909 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -126,26 +126,26 @@ Sequential and thread-parallel program in C++
 
 If we assume the grid point values to be truly independent *for a single time step*, stencil application procedure may be straightforwardly written as a loop over the grid points, as shown below in tab "Stencil update". (General structure of the program and the default parameter values for the problem model are also provided for reference.) CPU-thread parallelism can then be enabled by a single OpenMP ``#pragma``:
 
-**`stencil/base/ <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/base/>`_**
+`stencil/base/ <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/base/>`_
 
 .. tabs::
 
-   .. tab:: | Stencil update 
-            | (core.cpp)
+   .. tab:: Stencil update 
+            **core.cpp**
 
          .. literalinclude:: examples/stencil/base/core.cpp 
                         :language: cpp
                         :emphasize-lines: 25
 
-   .. tab:: | Main function 
-            | (main.cpp)
+   .. tab:: Main function 
+            **main.cpp**
 
          .. literalinclude:: examples/stencil/base/main.cpp 
                         :language: cpp
                         :emphasize-lines: 37
  
-   .. tab:: | Default params 
-            | (heat.h)
+   .. tab:: Default params 
+            **heat.h**
 
          .. literalinclude:: examples/stencil/base/heat.h 
                         :language: cpp

From 7566d53a3393522eb99e48aba32eeec567c6fcb2 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 22:11:41 +0200
Subject: [PATCH 6/7] Update 13-examples.rst: SYCL part formatting

---
 content/13-examples.rst | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index d6f31909..16da9adf 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -275,15 +275,19 @@ Similarly, SYCL programming model offers convenient ways to define execution ker
 
 Changes of stencil update code for OpenMP and SYCL are shown in the tabs below:
 
+`stencil/ <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/>`_
+
 .. tabs::
 
    .. tab:: OpenMP (naive)
+            **base/core-off.cpp**
 
          .. literalinclude:: examples/stencil/base/core-off.cpp 
                         :language: cpp
                         :emphasize-lines: 25-26
          
    .. tab:: SYCL (naive)
+            **sycl/core-naive.cpp**
 
          .. literalinclude:: examples/stencil/sycl/core-naive.cpp 
                         :language: cpp
@@ -313,10 +317,10 @@ Changes of stencil update code for OpenMP and SYCL are shown in the tabs below:
       $ cd ../sycl/
       (give the following lines some time, probably a couple of min)
       $ acpp -O2 -o stencil_naive core-naive.cpp io.cpp main-naive.cpp pngwriter.c setup.cpp utilities.cpp
-      $ acpp -O2 -o stencil core.cpp io.cpp main.cpp pngwriter.c setup.cpp utilities.cpp
+      $ acpp -O2 -o stencil_data core.cpp io.cpp main.cpp pngwriter.c setup.cpp utilities.cpp
       
       $ srun stencil_naive
-      $ srun stencil
+      $ srun stencil_data
 
    If everything works well, the output should look similar to this:
    
@@ -327,7 +331,7 @@ Changes of stencil update code for OpenMP and SYCL are shown in the tabs below:
       Average temperature at end: 59.281239
       Control temperature at end: 59.281239
       Iterations took 2.086 seconds.
-      $ srun stencil
+      $ srun stencil_data
       Average temperature, start: 59.763305
       Average temperature at end: 59.281239
       Control temperature at end: 59.281239
@@ -386,39 +390,38 @@ But overhead can be reduced by taking care to minimize data transfers between *h
 - only copy the data from GPU to CPU when we need it,
 - swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this automatically.)
 
-Changes of stencil update code as well as the main program are shown in tabs below. 
+Changes of stencil update code as well as the main program are shown in tabs below: 
+
+`stencil/ <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/base/>`__
 
 .. tabs::
 
    .. tab:: OpenMP
+            **base/core-data.cpp**
 
          .. literalinclude:: examples/stencil/base/core-data.cpp
                         :language: cpp
                         :emphasize-lines: 25,40-75
    
    .. tab:: SYCL
+            **sycl/core.cpp**
 
          .. literalinclude:: examples/stencil/sycl/core.cpp
                         :language: cpp
                         :emphasize-lines: 13-14,25,40-50
 
    .. tab:: Python
+            **python-numba/core_cuda.py**
 
          .. literalinclude:: examples/stencil/python-numba/core_cuda.py
                         :language: py
                         :lines: 6-34
                         :emphasize-lines: 14-16,18
 
-   .. tab:: main() (SYCL)
-
-         .. literalinclude:: examples/stencil/sycl/main.cpp 
-                        :language: cpp
-                        :emphasize-lines: 38-39,44-45,51,56,59,75-77
-
 
 .. challenge:: Exercise: updated GPU ports
 
-   Test your compiled executables ``base/stencil_data`` and ``sycl/stencil``. Try changing problem size parameters:
+   Test your compiled executables ``base/stencil_data`` and ``sycl/stencil_data``. Try changing problem size parameters:
    
    - ``srun stencil 2000 2000 5000``
    

From 78bf0432f9f02666500ea8d2179f0b93a0abe625 Mon Sep 17 00:00:00 2001
From: Stepas Toliautas <59330245+stepas-toliautas@users.noreply.github.com>
Date: Wed, 26 Nov 2025 22:31:17 +0200
Subject: [PATCH 7/7] Update 13-examples.rst: Python section formatting

---
 content/13-examples.rst | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/content/13-examples.rst b/content/13-examples.rst
index 16da9adf..b2f7c82c 100644
--- a/content/13-examples.rst
+++ b/content/13-examples.rst
@@ -390,7 +390,7 @@ But overhead can be reduced by taking care to minimize data transfers between *h
 - only copy the data from GPU to CPU when we need it,
 - swap the GPU buffers between timesteps, like we do with CPU buffers. (OpenMP does this automatically.)
 
-Changes of stencil update code as well as the main program are shown in tabs below: 
+Changes to the stencil update code are shown in the tabs below (also check the respective ``main()`` functions for the calls that create, access, and delete persistent GPU buffers):
 
 `stencil/ <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/base/>`__
 
@@ -410,14 +410,6 @@ Changes of stencil update code as well as the main program are shown in tabs bel
                         :language: cpp
                         :emphasize-lines: 13-14,25,40-50
 
-   .. tab:: Python
-            **python-numba/core_cuda.py**
-
-         .. literalinclude:: examples/stencil/python-numba/core_cuda.py
-                        :language: py
-                        :lines: 6-34
-                        :emphasize-lines: 14-16,18
-
 
 .. challenge:: Exercise: updated GPU ports
 
@@ -458,9 +450,12 @@ Python: JIT and GPU acceleration
 
 As mentioned `previously <https://enccs.github.io/gpu-programming/9-language-support/#numba>`_, Numba package allows developers to just-in-time (JIT) compile Python code to run fast on CPUs, but can also be used for JIT compiling for (NVIDIA) GPUs. JIT seems to work well on loop-based, computationally heavy functions, so trying it out is a nice choice for initial source version:
 
+`stencil/python-numba <https://github.com/ENCCS/gpu-programming/tree/main/content/examples/stencil/python-numba/>`_
+
 .. tabs::
 
    .. tab:: Stencil update
+            **core.py**
 
          .. literalinclude:: examples/stencil/python-numba/core.py
                         :language: py
@@ -468,12 +463,21 @@ As mentioned `previously <https://enccs.github.io/gpu-programming/9-language-sup
                         :emphasize-lines: 17
    
    .. tab:: Data generation
+            **heat.py**
 
          .. literalinclude:: examples/stencil/python-numba/heat.py
                         :language: py
                         :lines: 57-78
                         :emphasize-lines: 1
 
+   .. tab:: Stencil update in GPU
+            **core_cuda.py**
+
+         .. literalinclude:: examples/stencil/python-numba/core_cuda.py
+                        :language: py
+                        :lines: 6-34
+                        :emphasize-lines: 14-16,18
+
 
 The alternative approach would be to rewrite stencil update code in NumPy style, exploiting loop vectorization.
 
@@ -536,7 +540,7 @@ Short summary of a typical Colab run is provided below:
 
 Numba's ``@vectorize`` and ``@guvectorize`` decorators offer an interface to create CPU- (or GPU-) accelerated *Python* functions without explicit implementation details. However, such functions become increasingly complicated to write (and optimize by the compiler) with increasing complexity of the computations within.
 
-Numba also offers direct CUDA-based kernel programming, which can be the best choice for those already familiar with CUDA. Example for stencil update written in Numba CUDA is shown in the `data movement section <https://enccs.github.io/gpu-programming/13-examples/#gpu-parallelization-data-movement>`_, tab "Python". In this case, data transfer functions ``devdata = cuda.to_device(data)`` and ``devdata.copy_to_host(data)`` (see ``main_cuda.py``) are already provided by Numba package.
+Numba also offers direct CUDA-based kernel programming, which can be the best choice for those already familiar with CUDA. An example of the stencil update written in Numba CUDA is shown in the section above, in the tab "Stencil update in GPU". In this case, the data transfer functions ``devdata = cuda.to_device(data)`` and ``devdata.copy_to_host(data)`` (see ``main_cuda.py``) are already provided by the Numba package.
 
 
 .. challenge:: Exercise: CUDA acceleration in Python