From f15b95fe1fc8710ae3b2d5426add67555078c79c Mon Sep 17 00:00:00 2001 From: jadgt Date: Tue, 11 Nov 2025 14:06:46 +0100 Subject: [PATCH 1/6] Add introductory content on High-Performance Computing (HPC) --- content/intohpc.md | 443 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 443 insertions(+) create mode 100644 content/intohpc.md diff --git a/content/intohpc.md b/content/intohpc.md new file mode 100644 index 0000000..a142648 --- /dev/null +++ b/content/intohpc.md @@ -0,0 +1,443 @@ +# Introduction to HPC + +:::{questions} +- What is High-Performance Computing (HPC)? +- Why do we use HPC systems? +- How does parallel computing make programs faster? +::: + +:::{objectives} +- Define what High-Performance Computing (HPC) is. +- Identify the main components of an HPC system. +- Describe the difference between serial and parallel computing. +- Run a simple command on a cluster using the terminal. +::: + +High-Performance Computing (HPC) refers to using many computers working +together to solve complex problems faster than a single machine could. +HPC is widely used in fields such as climate science, molecular +simulation, astrophysics, and artificial intelligence. + +This lesson introduces what HPC is, why it matters, and how researchers +use clusters to perform large-scale computations. + +--- + +## What is HPC? + +HPC systems, often called *supercomputers* or *clusters*, are made up of +many computers (called **nodes**) connected by a fast network. Each node +can have multiple **CPUs** (and sometimes **GPUs**) that run tasks in +parallel. + +### Typical HPC Components + +| Component | Description | +|------------|--------------| +| **Login node** | Where you connect and submit jobs | +| **Compute nodes** | Machines where your program actually runs | +| **Scheduler** | Manages job submissions and allocates resources (e.g. SLURM) | +| **Storage** | Shared file system accessible to all nodes | + +--- + +## Connecting to a Cluster + +You usually access an HPC system through a **login node** using SSH. + +```bash +ssh username@hpc-login.example.org +``` +Once connected, you can see what resources are available using the +scheduler’s commands (for example, SLURM): + +```bash +sinfo +``` + +:::{exercise} Exploring the Cluster + 1. Log in to your training cluster using ssh. + 2. Run sinfo to see available partitions and nodes. + 3. Try squeue to view currently running jobs. +::: + +--- + +## Parallel Computing + +High-Performance Computing relies on **parallel computing**, splitting a problem into smaller parts that can be executed *simultaneously* on multiple processors. + +Instead of running one instruction at a time on one CPU core, parallel computing allows you to run many instructions on many cores or even multiple machines at once. + +Parallelism can occur at different levels: +- **Within a single CPU** (multiple cores) +- **Across multiple CPUs** (distributed nodes) +- **On specialized accelerators** (GPUs) + +--- + +### Shared-Memory Parallelism + +In **shared-memory** systems, multiple processor cores share the same memory space. +Each core can directly read and write to the same variables in memory. + +This is the model used in: +- Multicore laptops and workstations +- *Single compute nodes* on a cluster + +Programs use **threads** to execute in parallel (e.g., with OpenMP in C/C++/Fortran or multiprocessing in Python). 
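Before the OpenMP-style version below, a minimal Python sketch of the same idea may help: several threads update different slices of the *same* NumPy arrays, because they all live in one shared address space. This is only an illustration — for pure-Python arithmetic, CPython's global interpreter lock limits how much real speedup threads give, while compiled regions (OpenMP, NumPy kernels) do not have that restriction.

```python
# shared_threads.py -- illustrative sketch of shared-memory threading in Python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N = 1_000_000
a = np.empty(N)                      # result array, visible to every thread
b = np.random.rand(N)
c = np.random.rand(N)

def add_chunk(start, stop):
    # Each thread writes its own slice of the shared array `a`
    a[start:stop] = b[start:stop] + c[start:stop]

n_threads = 4
edges = np.linspace(0, N, n_threads + 1, dtype=int)

with ThreadPoolExecutor(max_workers=n_threads) as pool:
    for start, stop in zip(edges[:-1], edges[1:]):
        pool.submit(add_chunk, start, stop)   # work is split across threads

print(np.allclose(a, b + c))         # True: all threads wrote into one shared array
```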
+ +Example (OpenMP-style pseudocode): + +```c +#pragma omp parallel for +for (int i = 0; i < N; i++) { + a[i] = b[i] + c[i]; +} +``` +Advantages: + • Easy communication between threads (shared variables) + • Low latency data access + +Limitations: + • Limited by the number of cores on one machine + • Risk of race conditions if data access is not synchronized + +:::{exercise} Shared-Memory Example +Try using Python’s multiprocessing module to parallelize a simple loop: +from multiprocessing import Pool + +def square(x): + return x * x + +if __name__ == "__main__": + with Pool(4) as p: + result = p.map(square, range(8)) + print(result) + +How many CPU cores does your machine have? Try changing the number of workers. +::: + +### Distributed-Memory Parallelism + +In distributed-memory systems, each processor (or node) has its own local memory. +Processors communicate by passing messages over a network. + +This is the model used when a computation spans multiple nodes in an HPC cluster. + +Programs written with MPI (Message Passing Interface) use explicit communication. +Below is an example using the Python library `mpi4py` that implements MPI functions +in Python + +```python +# hello_mpi.py +from mpi4py import MPI + +# Initialize the MPI communicator +comm = MPI.COMM_WORLD + +# Get the total number of processes +size = comm.Get_size() + +# Get the rank (ID) of this process +rank = comm.Get_rank() + +print(f"Hello from process {rank} of {size}") + +# MPI is automatically finalized when the program exits, +# but you can call MPI.Finalize() explicitly if you prefer +``` +For now, do not worry about understanding this code, we will see +`mpi4py` in detail later. + +Advantages: + • Scales to thousands of nodes + • Each process works independently, avoiding memory contention + +Limitations: + • Requires explicit communication (send/receive) + • More complex programming model + • More latency, requires minimizing movement of data. + + + +### Hybrid Architectures: CPU, GPU, and TPU + +Modern High-Performance Computing (HPC) systems rarely rely on CPUs alone. +They are **hybrid architectures**, combining different types of processors, typically **CPUs**, **GPUs**, and increasingly **TPUs**, to achieve both flexibility and high performance. + +--- + +#### CPU: The General-Purpose Processor + +**Central Processing Units (CPUs)** are versatile processors capable of handling a wide range of tasks. +They consist of a small number of powerful cores optimized for complex, sequential operations and control flow. + +CPUs are responsible for: +- Managing input/output operations +- Coordinating data movement and workflow +- Executing serial portions of applications + +They excel in **task parallelism**, where different cores perform distinct tasks concurrently. + +--- + +#### GPU: The Parallel Workhorse + +**Graphics Processing Units (GPUs)** contain thousands of lightweight cores that can execute the same instruction on many data elements simultaneously. +This makes them ideal for **data-parallel** workloads, such as numerical simulations, molecular dynamics, and deep learning. + +GPUs are optimized for: +- Large-scale mathematical computations +- Highly parallel tasks such as matrix and vector operations + +Common GPU computing frameworks include CUDA, HIP, OpenACC, and SYCL. + +GPUs provide massive computational throughput but require explicit management of data transfers between CPU and GPU memory. +They are now a standard component of most modern supercomputers. 
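As a concrete illustration of that host–device data movement, here is a small sketch using the CuPy library (discussed later in this lesson). It assumes CuPy and a compatible GPU are available; the point is that the copies between CPU and GPU memory are explicit steps in the program.

```python
# gpu_transfer.py -- sketch of explicit CPU<->GPU data transfers (assumes CuPy + a GPU)
import numpy as np
import cupy as cp

x_host = np.random.rand(10_000_000)   # array in CPU (host) memory

x_dev = cp.asarray(x_host)            # explicit copy: host -> device
y_dev = 2.0 * cp.sqrt(x_dev)          # executed by many GPU cores in parallel

y_host = cp.asnumpy(y_dev)            # explicit copy: device -> host
print(y_host[:3])
```

Keeping intermediate results on the device between operations, instead of copying them back to the host after every step, is usually the first optimization to make on GPU nodes.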
+ +--- + +#### TPU: Specialized Processor for Tensor Operations + +**Tensor Processing Units (TPUs)** are specialized hardware accelerators designed for tensor and matrix operations, the building blocks of deep learning and AI. +Originally developed by Google, TPUs are now used in both cloud and research HPC environments. + +TPUs focus on **tensor computations** and achieve very high performance and energy efficiency for machine learning workloads. +They are less flexible than CPUs or GPUs but excel in neural network training and inference. + +--- + +#### Putting It All Together + +Modern HPC clusters combine these processor types to create **heterogeneous nodes**. +A single node may contain multiple CPUs and several GPUs or TPUs, interconnected through high-speed buses. + +| Processor Type | Main Role | Best Suited For | +|-----------------|------------|-----------------| +| **CPU** | General-purpose control, coordination, and I/O | Serial and logic-heavy tasks | +| **GPU** | Massive parallel numerical computation | Data-parallel workloads and simulations | +| **TPU** | Specialized tensor acceleration | Machine learning and AI workloads | + +This combination allows each processor type to handle the tasks for which it is best optimized, enabling both high performance and efficient energy use. + +--- + +#### Hybrid Programming Models + +To take full advantage of hybrid systems, programmers often combine multiple parallel models within one application: + +- **MPI + OpenMP:** Distributed memory between nodes and shared memory within each node. +- **MPI + CUDA or OpenACC:** Distributed communication across nodes, GPU acceleration within nodes. +- **MPI + XLA or TPU runtime:** MPI coordination with tensor operations handled on TPUs. + +This layered approach leverages: +- **Distributed memory** across cluster nodes +- **Shared memory** and **accelerators** within each node + +Such **hybrid parallelism** is now the dominant model in large-scale scientific and industrial computing. + +--- + +:::{keypoints} +- Modern HPC systems combine CPUs, GPUs, and TPUs in hybrid architectures. +- CPUs handle coordination and control, while GPUs and TPUs execute highly parallel workloads. +- GPUs excel at general numerical parallelism; TPUs specialize in tensor operations for AI. +- Hybrid programming models (MPI with OpenMP, CUDA, or TPU backends) enable efficient use of all resources. +::: + +--- + +### Hybrid Programming Models + +Modern HPC applications are rarely limited to a single parallel programming model. +Instead, they often combine multiple layers of parallelism to make the best use of available hardware — an approach known as **hybrid programming**. + +Hybrid programming allows developers to exploit **distributed memory parallelism** across nodes and **shared memory or accelerator parallelism** within nodes. +This enables applications to scale efficiently from a few cores to thousands of nodes on large supercomputers. + +--- + +#### MPI + OpenMP: Hybrid Parallelism on CPU Nodes + +When an HPC cluster consists of CPU-only nodes, a common hybrid model combines **MPI (Message Passing Interface)** and **OpenMP (Open Multi-Processing)**. + +- **MPI** handles communication between nodes in the cluster. + Each MPI process runs on a separate node or subset of nodes and uses message passing to exchange data. + +- **OpenMP** provides shared-memory parallelism within each node. + Multiple threads inside a single MPI process can execute concurrently on the cores of that node. 
+ +This structure creates a **two-level hierarchy** of parallelism: +1. **Inter-node parallelism** — achieved through MPI (distributed memory). +2. **Intra-node parallelism** — achieved through OpenMP (shared memory). + +**Benefits of MPI + OpenMP hybridization:** +- Reduces the total number of MPI processes, decreasing communication overhead. +- Exploits the full capacity of multi-core CPUs within each node. +- Balances scalability (via MPI) and efficiency (via OpenMP threads). + +**Challenges:** +- Requires careful balancing between MPI processes and OpenMP threads per node. +- Data locality and thread affinity must be managed to avoid performance loss. + +This model is widely used in simulation codes such as computational fluid dynamics, finite element methods, and electronic structure calculations. + +--- + +#### MPI + GPU: Hybrid Parallelism on Heterogeneous Nodes + +When HPC nodes include GPUs or other accelerators, the hybrid model extends naturally to **MPI + GPU programming**. +In this approach, **MPI** is still responsible for communication between nodes, while **GPU programming frameworks** (such as CUDA, HIP, or OpenACC) handle massive data-parallel computations within each node. + +The workflow typically looks like this: +- Each node runs one or more MPI processes. +- Each MPI process controls one or more GPUs. +- Within a node, the heavy numerical work is offloaded from the CPU to the GPU. + +This model provides three complementary layers of parallelism: +1. **MPI level:** communication and coordination between nodes. +2. **Node level:** management and scheduling by the CPU host. +3. **Device level:** massive parallel computation executed by the GPU. + +**Benefits of MPI + GPU hybridization:** +- Enables scaling across thousands of GPUs distributed among cluster nodes. +- Delivers enormous speedups for highly parallel workloads (e.g., molecular dynamics, deep learning, weather modeling). +- Frees CPU resources for orchestration and non-parallel tasks. + +**Challenges:** +- Requires explicit management of data transfer between CPU and GPU memory. +- Increased complexity in software development and debugging. +- Performance depends on matching problem size and data layout to GPU architecture. + +Hybrid MPI + GPU programming has become the dominant model for modern supercomputers, including those in the **Top500** list, as nearly all of them are now GPU-accelerated. + +--- + +#### Choosing the Right Hybrid Model + +The choice between MPI + OpenMP and MPI + GPU depends on: +- **Hardware configuration:** CPU-only clusters vs. GPU-accelerated clusters. +- **Application characteristics:** memory usage, data dependencies, and computational intensity. +- **Scalability goals:** how far the problem needs to scale across nodes and cores. + +| Hybrid Model | Hardware | Parallelism Within Node | Inter-node Parallelism | Typical Use | +|---------------|-----------|--------------------------|------------------------|--------------| +| **MPI + OpenMP** | Multi-core CPU nodes | Shared-memory (threads) | MPI processes | General-purpose scientific computing | +| **MPI + GPU** | CPU + GPU nodes | GPU data parallelism | MPI processes | HPC with heavy numerical or ML workloads | + +Both models represent essential building blocks of high-performance computing, and many large-scale codes support both configurations for flexibility across systems. + +--- + +:::{keypoints} +- Hybrid programming combines multiple parallel models to fully utilize modern hardware. 
+- **MPI + OpenMP** is the standard approach for CPU-only systems. +- **MPI + GPU** extends hybrid parallelism to heterogeneous CPU–GPU architectures. +- Choosing the right model depends on hardware, scalability, and the computational nature of the application. +::: + +### Python in High-Performance Computing + +Python is one of the most widely used languages in scientific research due to its simplicity, readability, and rich ecosystem of numerical libraries. +While Python itself is interpreted and not as fast as compiled languages like C or Fortran, it has developed a strong foundation for **high-performance computing (HPC)** through specialized libraries and interfaces. + +These tools allow scientists and engineers to write code in Python while still achieving near-native performance on CPUs, GPUs, and distributed systems. + +--- + +#### JAX: High-Performance Array Computing with Accelerators + +**JAX** is a library developed by Google for high-performance numerical computing and automatic differentiation. +It combines the familiar NumPy-like interface with **just-in-time (JIT)** compilation using **XLA (Accelerated Linear Algebra)**. +This allows Python functions to be compiled and executed efficiently on both **CPUs and GPUs**. + +Key features: +- Transparent acceleration on GPUs and TPUs +- Automatic differentiation for machine learning and optimization +- Vectorization and parallel execution across devices + +JAX is widely used in scientific machine learning, physics-informed models, and numerical optimization on large-scale clusters. + +--- + +#### PyTrAn: Python–Fortran Integration for Legacy HPC Codes + +**PyTrAn** (Python–Fortran Translator or Interface) is designed to connect Python with legacy Fortran routines, allowing developers to reuse existing HPC codebases efficiently. +Many scientific applications — such as weather modeling, quantum chemistry, and fluid dynamics — rely on decades of optimized Fortran code. + +Through interfaces like **f2py** or PyTrAn, Python can: +- Call high-performance Fortran routines directly +- Combine new Python modules with legacy HPC libraries +- Simplify workflows while maintaining computational efficiency + +This integration makes Python a flexible *glue language* in hybrid HPC applications. + +--- + +#### Numba: Just-In-Time Compilation for CPUs and GPUs + +**Numba** is a Just-In-Time (JIT) compiler that translates numerical Python code into fast machine code at runtime using LLVM. +It requires minimal code changes and provides performance close to C or Fortran. + +Main advantages: +- Speeds up array-oriented computations on CPUs +- Enables parallel loops and multi-threading +- Supports GPU acceleration through CUDA targets + +Numba is ideal for users who want to optimize existing Python scripts for parallel CPU or GPU execution without rewriting them in another language. + +--- + +#### CuPy: GPU-Accelerated Array Library + +**CuPy** provides a drop-in replacement for NumPy that runs on **NVIDIA GPUs**. +It mimics the NumPy API, allowing users to accelerate numerical code simply by changing the import statement, while taking full advantage of GPU parallelism. + +Highlights: +- NumPy-compatible syntax for easy transition +- GPU-backed arrays and linear algebra operations +- Integration with deep learning and simulation workflows + +CuPy is particularly useful for applications with large data arrays or intensive numerical workloads, such as molecular modeling, image processing, and machine learning. 
+ +--- + +#### mpi4py: Parallel and Distributed Computing with MPI + +**mpi4py** provides Python bindings to the standard **Message Passing Interface (MPI)**, allowing distributed-memory parallelism across multiple processes or nodes. +It enables Python programs to run efficiently on supercomputers, leveraging the same communication model used in large-scale C or Fortran HPC applications. + +Capabilities: +- Point-to-point and collective communication +- Parallel I/O and reduction operations +- Compatibility with SLURM and other HPC schedulers + +mpi4py bridges Python’s high-level usability with the scalability of traditional HPC systems, making it possible to prototype and deploy distributed applications quickly. + +--- + +#### Summary: Python’s Role in Modern HPC + +Python now plays an essential role in high-performance and scientific computing. +Thanks to tools such as JAX, PyTrAn, Numba, CuPy, and mpi4py, Python can serve as both a **high-level interface** for productivity and a **high-performance backend** through compilation and parallelization. + +| Library | Focus Area | Target Hardware | Key Strength | +|----------|-------------|-----------------|---------------| +| **JAX** | Automatic differentiation, numerical computing | CPU, GPU, TPU | JIT compilation and accelerator support | +| **PyTrAn** | Interfacing Python with Fortran codes | CPU | Integration with legacy HPC software | +| **Numba** | Just-in-time compilation | CPU, GPU | Native performance with minimal code changes | +| **CuPy** | GPU-accelerated array computing | GPU | Drop-in NumPy replacement for GPUs | +| **mpi4py** | Distributed parallel computing | CPU clusters | MPI-based scaling across nodes | + +These libraries make Python a truly viable language for HPC workflows — from prototyping to full-scale production on modern supercomputers. + +--- + +:::{keypoints} +- Python’s modern ecosystem enables real HPC performance through JIT compilation and hardware acceleration. +- JAX, Numba, and CuPy provide acceleration on CPUs, GPUs, and TPUs. +- PyTrAn bridges Python with legacy Fortran codes used in scientific computing. +- mpi4py enables distributed-memory parallelism across clusters using MPI. +- Together, these tools make Python a powerful and accessible option for modern HPC. +::: \ No newline at end of file From 57d011e30ccd5b989806929ccd1ab87d031b2025 Mon Sep 17 00:00:00 2001 From: jadgt Date: Tue, 11 Nov 2025 15:38:25 +0100 Subject: [PATCH 2/6] Add comprehensive introduction to MPI with Python (mpi4py) including key concepts, communication types, and integration with NumPy. --- content/mpi4py.md | 285 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 285 insertions(+) create mode 100644 content/mpi4py.md diff --git a/content/mpi4py.md b/content/mpi4py.md new file mode 100644 index 0000000..453f41b --- /dev/null +++ b/content/mpi4py.md @@ -0,0 +1,285 @@ +# Introduction to MPI with Python (mpi4py) + +:::{questions} +- What is MPI, and how does it enable parallel programs to communicate? +- How does Python implement MPI through the `mpi4py` library? +- What are point-to-point and collective communications? +- How does `mpi4py` integrate with NumPy for efficient data exchange? +::: + +:::{objectives} +- Understand the conceptual model of MPI: processes, ranks, and communication. +- Distinguish between point-to-point and collective operations. +- Recognize how NumPy arrays act as communication buffers in `mpi4py`. +- See how `mpi4py` bridges Python and traditional HPC concepts. 
+::: + +--- + +## What Is MPI? + +**MPI (Message Passing Interface)** is a standard for communication among processes that run on distributed-memory systems, such as HPC clusters. + +When you run an MPI program, the system **creates multiple independent processes**, each running its *own copy* of the same program. +Each process is assigned a unique identifier called a **rank**, and all ranks together form a **communicator** (most commonly `MPI.COMM_WORLD`). + +Key ideas: +- Every process runs the same code, but can behave differently depending on its rank. +- Processes do not share memory, they communicate by **sending and receiving messages**. +- MPI provides a consistent way to exchange data across nodes and processors. + +Conceptually, this is sometimes called the **SPMD (Single Program, Multiple Data)** model. + +In order to see how this works let us try this simple code snippet: +```python +# mpi_hello.py +from mpi4py import MPI + +# Initialize communicator +comm = MPI.COMM_WORLD + +# Get the number of processes +size = comm.Get_size() + +# Get the rank (ID) of this process +rank = comm.Get_rank() + +# Print a message from each process +print(f"Hello from process {rank} out of {size}") +``` + +--- + +## Point-to-Point Communication + +The most basic form of communication in MPI is **point-to-point**, meaning data is sent from one process directly to another. + +Each message involves: +- A **sender** and a **receiver** +- A **tag** identifying the message type +- A **data buffer** that holds the information being transmitted + +Typical operations: +- **Send:** one process transmits data. +- **Receive:** another process waits for that data. + +In `mpi4py`, each of these operations maps directly to MPI’s underlying mechanisms but with a simple Python interface. +Conceptually, this allows one process to hand off a message to another in a fully parallel environment. + +Examples of conceptual use cases: +- Distributing different chunks of data to multiple workers. +- Passing boundary conditions between neighboring domains in a simulation. + +To illustrate this let us construct a minimal example: + +```python +# mpi_send_recv.py +from mpi4py import MPI + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() + +if rank == 0: + # Process 0 sends a message + data = "Hello from process 0" + comm.send(data, dest=1) # send to process 1 + print(f"Process {rank} sent data: {data}") + +elif rank == 1: + # Process 1 receives a message + data = comm.recv(source=0) # receive from process 0 + print(f"Process {rank} received data: {data}") + +else: + # Other ranks do nothing + print(f"Process {rank} is idle") +``` + +--- + +## Collective Communication + +While point-to-point operations handle pairs of processes, **collective operations** involve all processes in a communicator. +They provide coordinated data exchange and synchronization patterns that are efficient and scalable. + +Common collectives include: +- **Broadcast:** One process sends data to all others. +- **Scatter:** One process distributes distinct pieces of data to each process. +- **Gather:** Each process sends data back to a root process. +- **Reduce:** All processes combine results using an operation (e.g., sum, max). + +Collectives are conceptually similar to group conversations, where every participant either contributes, receives, or both. +They are essential for algorithms that require sharing intermediate results or aggregating outputs. 
+ +Example broadcast + gather +```python +# mpi_collectives.py +from mpi4py import MPI + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() +size = comm.Get_size() + +# --- Broadcast example --- +if rank == 0: + data = "Hello from the root process" +else: + data = None + +# Broadcast data from process 0 to all others +data = comm.bcast(data, root=0) +print(f"Process {rank} received: {data}") + +# --- Gather example --- +# Each process creates its own message +local_msg = f"Message from process {rank}" + +# Gather all messages at the root process (rank 0) +all_msgs = comm.gather(local_msg, root=0) + +if rank == 0: + print("\nGathered messages at root:") + for msg in all_msgs: + print(msg) +``` + +--- + +## Integration with NumPy: Buffer-Like Objects + +One of the strengths of `mpi4py` is its **tight integration with NumPy**. +MPI operations in `mpi4py` can directly send and receive **buffer-like objects**, such as NumPy arrays, without copying data back and forth between Python and C memory. + +Conceptually: +- Each NumPy array exposes its underlying memory buffer. +- MPI can access this memory region directly. +- This avoids serialization overhead (no need for `pickle`) and achieves near-native C performance. + +This makes it possible to: +- Efficiently distribute large numerical datasets across processes. +- Perform collective operations directly on arrays. +- Integrate `mpi4py` seamlessly into scientific Python workflows. + +For example, collective reductions like sums or means across large NumPy arrays are routine building blocks in parallel simulations and machine learning workloads. + +Example with collectives: +```python +# mpi_numpy_collectives.py +from mpi4py import MPI +import numpy as np + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() +size = comm.Get_size() + +# Total number of elements in the big array (must be divisible by size) +N = 10_000_000 + +# Only rank 0 creates the full array +if rank == 0: + big_array = np.ones(N, dtype="float64") # for simplicity, all ones +else: + big_array = None + +# Each process will receive a chunk of this size +local_N = N // size + +# Allocate local buffer on each process +local_array = np.empty(local_N, dtype="float64") + +# Scatter the big array from root to all processes +comm.Scatter( + [big_array, MPI.DOUBLE], # send buffer (only meaningful on root) + [local_array, MPI.DOUBLE], # receive buffer on every rank + root=0, +) + +# Each process computes a local sum +local_sum = np.sum(local_array) + +# Reduce all local sums to a global sum on root +global_sum = comm.reduce(local_sum, op=MPI.SUM, root=0) + +if rank == 0: + print(f"Global sum = {global_sum}") + print(f"Expected = {float(N)}") +``` + +To really understand the power of collectives, we can have a look at the same code using p2p communication: +```python +# mpi_numpy_point_to_point.py +from mpi4py import MPI +import numpy as np + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() +size = comm.Get_size() + +# Total number of elements in the big array (must be divisible by size) +N = 10_000_000 +local_N = N // size + +if rank == 0: + # Root creates the full big array + big_array = np.ones(N, dtype="float64") # all ones for simplicity + + # --- Send chunks to other ranks --- + for dest in range(1, size): + start = dest * local_N + end = (dest + 1) * local_N + chunk = big_array[start:end] + comm.Send([chunk, MPI.DOUBLE], dest=dest, tag=0) + + # Root keeps its own chunk (rank 0) + local_array = big_array[0:local_N] + +else: + # Other ranks allocate a buffer and receive their chunk + local_array 
= np.empty(local_N, dtype="float64") + comm.Recv([local_array, MPI.DOUBLE], source=0, tag=0) + +# Each process computes its local sum +local_sum = np.sum(local_array) + +if rank == 0: + # Root starts with its own local sum + global_sum = local_sum + + # --- Receive partial sums from all other ranks --- + for source in range(1, size): + recv_sum = comm.recv(source=source, tag=1) + global_sum += recv_sum + + print(f"Global sum (point-to-point) = {global_sum}") + print(f"Expected = {float(N)}") + +else: + # Other ranks send their local sum back to root + comm.send(local_sum, dest=0, tag=1) +``` +--- + +## Summary + +**mpi4py** provides a simple yet powerful bridge between Python and the Message Passing Interface used in traditional HPC applications. +Conceptually, it introduces the same communication paradigms used in compiled MPI programs but with Python’s expressiveness and interoperability. + +| Concept | Description | +|----------|-------------| +| **Process** | Independent copy of the program with its own memory space | +| **Rank** | Identifier for each process within a communicator | +| **Point-to-Point** | Direct communication between pairs of processes | +| **Collective** | Group communication involving all processes | +| **NumPy Buffers** | Efficient memory sharing for large numerical data | + +mpi4py allows Python users to write distributed parallel programs that scale from laptops to supercomputers, making it an invaluable tool for modern scientific computing. + +--- + +:::{keypoints} +- MPI creates multiple independent processes running the same program. +- Point-to-point communication exchanges data directly between two processes. +- Collective communication coordinates data exchange across many processes. +- mpi4py integrates tightly with NumPy for efficient, zero-copy data transfers. +- These concepts allow Python programs to scale effectively on HPC systems. +::: \ No newline at end of file From 70b868e6d70de4992e85f88b1d1b52817d7bc1b0 Mon Sep 17 00:00:00 2001 From: jadgt Date: Tue, 11 Nov 2025 16:10:38 +0100 Subject: [PATCH 3/6] Trimming and update parallel computing examples in HPC introduction --- content/intohpc.md | 226 +++++---------------------------------------- 1 file changed, 23 insertions(+), 203 deletions(-) diff --git a/content/intohpc.md b/content/intohpc.md index a142648..f9a427b 100644 --- a/content/intohpc.md +++ b/content/intohpc.md @@ -41,28 +41,6 @@ parallel. --- -## Connecting to a Cluster - -You usually access an HPC system through a **login node** using SSH. - -```bash -ssh username@hpc-login.example.org -``` -Once connected, you can see what resources are available using the -scheduler’s commands (for example, SLURM): - -```bash -sinfo -``` - -:::{exercise} Exploring the Cluster - 1. Log in to your training cluster using ssh. - 2. Run sinfo to see available partitions and nodes. - 3. Try squeue to view currently running jobs. -::: - ---- - ## Parallel Computing High-Performance Computing relies on **parallel computing**, splitting a problem into smaller parts that can be executed *simultaneously* on multiple processors. 
@@ -72,7 +50,7 @@ Instead of running one instruction at a time on one CPU core, parallel computing Parallelism can occur at different levels: - **Within a single CPU** (multiple cores) - **Across multiple CPUs** (distributed nodes) -- **On specialized accelerators** (GPUs) +- **On specialized accelerators** (GPUs or TPUs) --- @@ -85,16 +63,8 @@ This is the model used in: - Multicore laptops and workstations - *Single compute nodes* on a cluster -Programs use **threads** to execute in parallel (e.g., with OpenMP in C/C++/Fortran or multiprocessing in Python). +Programs use **threads** to execute in parallel (e.g., with OpenMP in C/C++/Fortran or **multiprocessing in Python**). -Example (OpenMP-style pseudocode): - -```c -#pragma omp parallel for -for (int i = 0; i < N; i++) { - a[i] = b[i] + c[i]; -} -``` Advantages: • Easy communication between threads (shared variables) • Low latency data access @@ -103,8 +73,8 @@ Limitations: • Limited by the number of cores on one machine • Risk of race conditions if data access is not synchronized -:::{exercise} Shared-Memory Example -Try using Python’s multiprocessing module to parallelize a simple loop: +Example: +```python from multiprocessing import Pool def square(x): @@ -114,6 +84,7 @@ if __name__ == "__main__": with Pool(4) as p: result = p.map(square, range(8)) print(result) +``` How many CPU cores does your machine have? Try changing the number of workers. ::: @@ -160,7 +131,6 @@ Limitations: • More latency, requires minimizing movement of data. - ### Hybrid Architectures: CPU, GPU, and TPU Modern High-Performance Computing (HPC) systems rarely rely on CPUs alone. @@ -206,138 +176,7 @@ Originally developed by Google, TPUs are now used in both cloud and research HPC TPUs focus on **tensor computations** and achieve very high performance and energy efficiency for machine learning workloads. They are less flexible than CPUs or GPUs but excel in neural network training and inference. ---- - -#### Putting It All Together - -Modern HPC clusters combine these processor types to create **heterogeneous nodes**. -A single node may contain multiple CPUs and several GPUs or TPUs, interconnected through high-speed buses. - -| Processor Type | Main Role | Best Suited For | -|-----------------|------------|-----------------| -| **CPU** | General-purpose control, coordination, and I/O | Serial and logic-heavy tasks | -| **GPU** | Massive parallel numerical computation | Data-parallel workloads and simulations | -| **TPU** | Specialized tensor acceleration | Machine learning and AI workloads | - -This combination allows each processor type to handle the tasks for which it is best optimized, enabling both high performance and efficient energy use. - ---- - -#### Hybrid Programming Models - -To take full advantage of hybrid systems, programmers often combine multiple parallel models within one application: - -- **MPI + OpenMP:** Distributed memory between nodes and shared memory within each node. -- **MPI + CUDA or OpenACC:** Distributed communication across nodes, GPU acceleration within nodes. -- **MPI + XLA or TPU runtime:** MPI coordination with tensor operations handled on TPUs. - -This layered approach leverages: -- **Distributed memory** across cluster nodes -- **Shared memory** and **accelerators** within each node - -Such **hybrid parallelism** is now the dominant model in large-scale scientific and industrial computing. - ---- - -:::{keypoints} -- Modern HPC systems combine CPUs, GPUs, and TPUs in hybrid architectures. 
-- CPUs handle coordination and control, while GPUs and TPUs execute highly parallel workloads. -- GPUs excel at general numerical parallelism; TPUs specialize in tensor operations for AI. -- Hybrid programming models (MPI with OpenMP, CUDA, or TPU backends) enable efficient use of all resources. -::: - ---- - -### Hybrid Programming Models - -Modern HPC applications are rarely limited to a single parallel programming model. -Instead, they often combine multiple layers of parallelism to make the best use of available hardware — an approach known as **hybrid programming**. - -Hybrid programming allows developers to exploit **distributed memory parallelism** across nodes and **shared memory or accelerator parallelism** within nodes. -This enables applications to scale efficiently from a few cores to thousands of nodes on large supercomputers. - ---- - -#### MPI + OpenMP: Hybrid Parallelism on CPU Nodes - -When an HPC cluster consists of CPU-only nodes, a common hybrid model combines **MPI (Message Passing Interface)** and **OpenMP (Open Multi-Processing)**. - -- **MPI** handles communication between nodes in the cluster. - Each MPI process runs on a separate node or subset of nodes and uses message passing to exchange data. - -- **OpenMP** provides shared-memory parallelism within each node. - Multiple threads inside a single MPI process can execute concurrently on the cores of that node. - -This structure creates a **two-level hierarchy** of parallelism: -1. **Inter-node parallelism** — achieved through MPI (distributed memory). -2. **Intra-node parallelism** — achieved through OpenMP (shared memory). - -**Benefits of MPI + OpenMP hybridization:** -- Reduces the total number of MPI processes, decreasing communication overhead. -- Exploits the full capacity of multi-core CPUs within each node. -- Balances scalability (via MPI) and efficiency (via OpenMP threads). - -**Challenges:** -- Requires careful balancing between MPI processes and OpenMP threads per node. -- Data locality and thread affinity must be managed to avoid performance loss. - -This model is widely used in simulation codes such as computational fluid dynamics, finite element methods, and electronic structure calculations. - ---- - -#### MPI + GPU: Hybrid Parallelism on Heterogeneous Nodes - -When HPC nodes include GPUs or other accelerators, the hybrid model extends naturally to **MPI + GPU programming**. -In this approach, **MPI** is still responsible for communication between nodes, while **GPU programming frameworks** (such as CUDA, HIP, or OpenACC) handle massive data-parallel computations within each node. - -The workflow typically looks like this: -- Each node runs one or more MPI processes. -- Each MPI process controls one or more GPUs. -- Within a node, the heavy numerical work is offloaded from the CPU to the GPU. - -This model provides three complementary layers of parallelism: -1. **MPI level:** communication and coordination between nodes. -2. **Node level:** management and scheduling by the CPU host. -3. **Device level:** massive parallel computation executed by the GPU. - -**Benefits of MPI + GPU hybridization:** -- Enables scaling across thousands of GPUs distributed among cluster nodes. -- Delivers enormous speedups for highly parallel workloads (e.g., molecular dynamics, deep learning, weather modeling). -- Frees CPU resources for orchestration and non-parallel tasks. - -**Challenges:** -- Requires explicit management of data transfer between CPU and GPU memory. 
-- Increased complexity in software development and debugging. -- Performance depends on matching problem size and data layout to GPU architecture. - -Hybrid MPI + GPU programming has become the dominant model for modern supercomputers, including those in the **Top500** list, as nearly all of them are now GPU-accelerated. - ---- - -#### Choosing the Right Hybrid Model - -The choice between MPI + OpenMP and MPI + GPU depends on: -- **Hardware configuration:** CPU-only clusters vs. GPU-accelerated clusters. -- **Application characteristics:** memory usage, data dependencies, and computational intensity. -- **Scalability goals:** how far the problem needs to scale across nodes and cores. - -| Hybrid Model | Hardware | Parallelism Within Node | Inter-node Parallelism | Typical Use | -|---------------|-----------|--------------------------|------------------------|--------------| -| **MPI + OpenMP** | Multi-core CPU nodes | Shared-memory (threads) | MPI processes | General-purpose scientific computing | -| **MPI + GPU** | CPU + GPU nodes | GPU data parallelism | MPI processes | HPC with heavy numerical or ML workloads | - -Both models represent essential building blocks of high-performance computing, and many large-scale codes support both configurations for flexibility across systems. - ---- - -:::{keypoints} -- Hybrid programming combines multiple parallel models to fully utilize modern hardware. -- **MPI + OpenMP** is the standard approach for CPU-only systems. -- **MPI + GPU** extends hybrid parallelism to heterogeneous CPU–GPU architectures. -- Choosing the right model depends on hardware, scalability, and the computational nature of the application. -::: - -### Python in High-Performance Computing +## Python in High-Performance Computing Python is one of the most widely used languages in scientific research due to its simplicity, readability, and rich ecosystem of numerical libraries. While Python itself is interpreted and not as fast as compiled languages like C or Fortran, it has developed a strong foundation for **high-performance computing (HPC)** through specialized libraries and interfaces. @@ -361,19 +200,26 @@ JAX is widely used in scientific machine learning, physics-informed models, and --- -#### PyTrAn: Python–Fortran Integration for Legacy HPC Codes +#### Pythran: Static Compilation of Numerical Python Code -**PyTrAn** (Python–Fortran Translator or Interface) is designed to connect Python with legacy Fortran routines, allowing developers to reuse existing HPC codebases efficiently. -Many scientific applications — such as weather modeling, quantum chemistry, and fluid dynamics — rely on decades of optimized Fortran code. +**Pythran** is a Python-to-C++ compiler designed to speed up numerical Python code, especially code that uses **NumPy**. +It allows scientists to write high-level Python functions and then compile them into efficient native extensions that run close to C or Fortran performance. -Through interfaces like **f2py** or PyTrAn, Python can: -- Call high-performance Fortran routines directly -- Combine new Python modules with legacy HPC libraries -- Simplify workflows while maintaining computational efficiency +Unlike tools that interface with existing Fortran code, Pythran works by **analyzing and translating Python source code itself** into optimized C++ code under the hood. +This makes it especially useful for accelerating array-oriented computations without leaving Python syntax. -This integration makes Python a flexible *glue language* in hybrid HPC applications. 
+Key features: +- Compiles numerical Python code (especially with NumPy) to efficient machine code +- Supports automatic parallelization via OpenMP +- Produces portable C++ extensions that can be imported directly into Python +- Requires only minimal code annotations to guide type inference ---- +Typical use cases include: +- Scientific simulations and numerical kernels +- Loops or array operations that are too slow in pure Python +- HPC applications that need native performance but prefer to stay within the Python ecosystem + +Pythran complements other tools such as **Numba** and **Cython**, giving scientists a flexible pathway to accelerate Python code without rewriting it in C or Fortran. #### Numba: Just-In-Time Compilation for CPUs and GPUs @@ -391,7 +237,7 @@ Numba is ideal for users who want to optimize existing Python scripts for parall #### CuPy: GPU-Accelerated Array Library -**CuPy** provides a drop-in replacement for NumPy that runs on **NVIDIA GPUs**. +**CuPy** provides a drop-in replacement for NumPy that runs on **NVIDIA GPUs** and also **AMD GPUs**. It mimics the NumPy API, allowing users to accelerate numerical code simply by changing the import statement, while taking full advantage of GPU parallelism. Highlights: @@ -415,29 +261,3 @@ Capabilities: mpi4py bridges Python’s high-level usability with the scalability of traditional HPC systems, making it possible to prototype and deploy distributed applications quickly. ---- - -#### Summary: Python’s Role in Modern HPC - -Python now plays an essential role in high-performance and scientific computing. -Thanks to tools such as JAX, PyTrAn, Numba, CuPy, and mpi4py, Python can serve as both a **high-level interface** for productivity and a **high-performance backend** through compilation and parallelization. - -| Library | Focus Area | Target Hardware | Key Strength | -|----------|-------------|-----------------|---------------| -| **JAX** | Automatic differentiation, numerical computing | CPU, GPU, TPU | JIT compilation and accelerator support | -| **PyTrAn** | Interfacing Python with Fortran codes | CPU | Integration with legacy HPC software | -| **Numba** | Just-in-time compilation | CPU, GPU | Native performance with minimal code changes | -| **CuPy** | GPU-accelerated array computing | GPU | Drop-in NumPy replacement for GPUs | -| **mpi4py** | Distributed parallel computing | CPU clusters | MPI-based scaling across nodes | - -These libraries make Python a truly viable language for HPC workflows — from prototyping to full-scale production on modern supercomputers. - ---- - -:::{keypoints} -- Python’s modern ecosystem enables real HPC performance through JIT compilation and hardware acceleration. -- JAX, Numba, and CuPy provide acceleration on CPUs, GPUs, and TPUs. -- PyTrAn bridges Python with legacy Fortran codes used in scientific computing. -- mpi4py enables distributed-memory parallelism across clusters using MPI. -- Together, these tools make Python a powerful and accessible option for modern HPC. 
-::: \ No newline at end of file From b9a9d8903394a565135bb13908e98c6849bce26b Mon Sep 17 00:00:00 2001 From: Juan de Gracia Date: Tue, 11 Nov 2025 16:17:02 +0100 Subject: [PATCH 4/6] update index --- content/index.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/content/index.md b/content/index.md index 2e8332c..1f0a993 100644 --- a/content/index.md +++ b/content/index.md @@ -20,6 +20,8 @@ Intro :caption: The lesson :maxdepth: 1 +intohpc.md +mpi4py.md episode.md ``` From 5b918cde60d38d297bede3585005942401f608cc Mon Sep 17 00:00:00 2001 From: Juan de Gracia Date: Wed, 12 Nov 2025 13:39:25 +0100 Subject: [PATCH 5/6] Enhance HPC and MPI documentation with detailed explanations on performance optimization, independent processes, and collective communication. Add practical exercises for users to practice single-core speed and MPI concepts. --- content/intohpc.md | 267 ++++++++++++++++++++++++-------- content/mpi4py.md | 378 +++++++++++++++++++++++++++++++++++---------- 2 files changed, 501 insertions(+), 144 deletions(-) diff --git a/content/intohpc.md b/content/intohpc.md index f9a427b..a27206d 100644 --- a/content/intohpc.md +++ b/content/intohpc.md @@ -27,8 +27,8 @@ use clusters to perform large-scale computations. HPC systems, often called *supercomputers* or *clusters*, are made up of many computers (called **nodes**) connected by a fast network. Each node -can have multiple **CPUs** (and sometimes **GPUs**) that run tasks in -parallel. +can have multiple cores which are **CPUs** (and sometimes **GPUs**) that +run tasks in parallel. ### Typical HPC Components @@ -40,6 +40,98 @@ parallel. | **Storage** | Shared file system accessible to all nodes | --- +## Single core performance optimization + +Pure-Python loops are slow because each iteration runs in the Python interpreter. NumPy pushes work into optimized native code (C/C++/BLAS), drastically reducing overhead. Below we compare a Python for loop with NumPy vectorized operations and discuss tips for fair, single-core measurements. + +:::{exercise} Practice making Python faster on a single CPU. 
+Copy and paste this code +```python +import os +# (Optional safety if you run this inside Python, must be set BEFORE importing numpy) +os.environ.setdefault("OMP_NUM_THREADS", "1") +os.environ.setdefault("OPENBLAS_NUM_THREADS", "1") +os.environ.setdefault("MKL_NUM_THREADS", "1") +os.environ.setdefault("NUMEXPR_NUM_THREADS", "1") + +import numpy as np +from time import perf_counter + +def timeit(fn, *args, repeats=5, warmup=1, **kwargs): + # warmup + for _ in range(warmup): + fn(*args, **kwargs) + # timed runs + tmin = float("inf") + for _ in range(repeats): + t0 = perf_counter() + fn(*args, **kwargs) + dt = perf_counter() - t0 + tmin = min(tmin, dt) + return tmin + +# Problem size +N = 10_000_000 # 10 million elements + +# Test data (contiguous, fixed dtype) +a = np.random.rand(N).astype(np.float64) +b = np.random.rand(N).astype(np.float64) + +# --- 1) Pure-Python loop sum --- +def py_sum(x): + s = 0.0 + for v in x: # per-element Python overhead + s += v + return s + +# --- 2) NumPy vectorized sum --- +def np_sum(x): + return x.sum() # dispatches to optimized C/BLAS + +# --- 3) Elementwise add then sum (Python loop) --- +def py_add_sum(x, y): + s = 0.0 + for i in range(len(x)): + s += x[i] + y[i] + return s + +# --- 4) Elementwise add then sum (NumPy, no temporaries) --- +def np_add_sum_no_temp(x, y): + # np.add.reduce avoids allocating x+y temporary + return np.add.reduce([x, y]) # equivalent to sum stacks; see alt below + +# Alternative that’s typically fastest and clearer: +def np_add_sum_fast(x, y): + return (x + y).sum() # may allocate a temporary; fast on many BLAS builds + +# Time them +print("Timing on single core (best of 5 runs):") +t_py_sum = timeit(py_sum, a) +t_np_sum = timeit(np_sum, a) +t_py_add = timeit(py_add_sum, a, b) +t_np_add = timeit(np_add_sum_fast, a, b) + +print(f"Python for-loop sum: {t_py_sum:8.4f} s") +print(f"NumPy vectorized sum: {t_np_sum:8.4f} s") +print(f"Python loop add+sum: {t_py_add:8.4f} s") +print(f"NumPy vectorized add+sum: {t_np_add:8.4f} s") +``` +Execute it and following let us verify the effect on the following modifications: +1. Run the timing script with N = 1_000_000, 5_000_000, 20_000_000. +2. Try float32 vs float64. +3. Switch (a + b).sum() to np.add(a, b, out=a); a.sum() and compare. +::: + + +#### Practical tips for single-core speed +- Prefer vectorization: Use array ops (+, *, .sum(), .dot(), np.mean, np.linalg.*) rather than per-element Python loops. +- Control temporaries: Expressions like (a + b + c).sum() may create temporaries. When memory is tight, consider in-place ops (a += b) or reductions (np.add(a, b, out=a); np.add.reduce([...])). +- Use the right dtype: float64 is standard for numerics; float32 halves memory traffic and can be faster on some CPUs/GPUs (but mind precision). +- Preallocate: Avoid growing Python lists or repeatedly allocating arrays inside loops. +- Minimize Python in hot paths: Move heavy math into NumPy calls; keep Python for orchestration only. +- Benchmark correctly: Use large N, pin threads to 1 for fair single-core tests, and report the best of multiple runs after a warmup. + +-- ## Parallel Computing @@ -65,14 +157,17 @@ This is the model used in: Programs use **threads** to execute in parallel (e.g., with OpenMP in C/C++/Fortran or **multiprocessing in Python**). 
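As a sketch of what thread-level parallelism can look like from Python, the loop below is compiled with Numba (introduced later in this lesson) and its iterations are spread over the CPU cores of one node. This assumes Numba is installed; the function and file names are only illustrative.

```python
# parallel_dot.py -- thread-parallel loop compiled with Numba (illustrative sketch)
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def scaled_dot(a, b):
    total = 0.0
    for i in prange(a.size):       # iterations are distributed over CPU threads
        total += a[i] * b[i]       # recognized by Numba as a parallel reduction
    return total

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
print(scaled_dot(a, b), np.dot(a, b))   # the two results should agree closely
```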
+:::{keypoints} Advantages: - • Easy communication between threads (shared variables) - • Low latency data access +- Easy communication between threads (shared variables) +- Low latency data access Limitations: - • Limited by the number of cores on one machine - • Risk of race conditions if data access is not synchronized +- Limited by the number of cores on one machine +- Risk of race conditions if data access is not synchronized +::: +:::{exercise} Practice with threaded parallelism in Python Example: ```python from multiprocessing import Pool @@ -85,8 +180,6 @@ if __name__ == "__main__": result = p.map(square, range(8)) print(result) ``` - -How many CPU cores does your machine have? Try changing the number of workers. ::: ### Distributed-Memory Parallelism @@ -99,7 +192,7 @@ This is the model used when a computation spans multiple nodes in an HPC cluster Programs written with MPI (Message Passing Interface) use explicit communication. Below is an example using the Python library `mpi4py` that implements MPI functions in Python - +:::{exercise} Practice with a simple MPI program ```python # hello_mpi.py from mpi4py import MPI @@ -118,18 +211,21 @@ print(f"Hello from process {rank} of {size}") # MPI is automatically finalized when the program exits, # but you can call MPI.Finalize() explicitly if you prefer ``` +::: + For now, do not worry about understanding this code, we will see `mpi4py` in detail later. +:::{keypoints} Advantages: - • Scales to thousands of nodes - • Each process works independently, avoiding memory contention +- Scales to thousands of nodes +- Each process works independently, avoiding memory contention Limitations: - • Requires explicit communication (send/receive) - • More complex programming model - • More latency, requires minimizing movement of data. - +- Requires explicit communication (send/receive) +- More complex programming model +- More latency, requires minimizing movement of data. +::: ### Hybrid Architectures: CPU, GPU, and TPU @@ -178,86 +274,125 @@ They are less flexible than CPUs or GPUs but excel in neural network training an ## Python in High-Performance Computing -Python is one of the most widely used languages in scientific research due to its simplicity, readability, and rich ecosystem of numerical libraries. -While Python itself is interpreted and not as fast as compiled languages like C or Fortran, it has developed a strong foundation for **high-performance computing (HPC)** through specialized libraries and interfaces. +Python has become one of the most widely used languages in scientific computing due to its simplicity, readability, and extensive ecosystem of numerical libraries. +Although Python itself is interpreted and slower than compiled languages such as C or Fortran, it now provides a mature set of tools that allow code to **run efficiently on modern HPC architectures**. -These tools allow scientists and engineers to write code in Python while still achieving near-native performance on CPUs, GPUs, and distributed systems. 
+These tools map directly to the three fundamental forms of parallelism introduced earlier: ---- +| HPC Parallelism Type | Hardware Context | Python Solutions | +|----------------------|------------------|------------------| +| **Shared-memory parallelism** | Multicore CPUs within a node | NumPy, Numba, Pythran | +| **Distributed-memory parallelism** | Multiple nodes across a cluster | mpi4py | +| **Accelerator parallelism** | GPUs and TPUs | CuPy, JAX, Numba (CUDA) | -#### JAX: High-Performance Array Computing with Accelerators +In practice, these technologies allow Python programs to scale from a single core to thousands of nodes on hybrid CPU–GPU systems. -**JAX** is a library developed by Google for high-performance numerical computing and automatic differentiation. -It combines the familiar NumPy-like interface with **just-in-time (JIT)** compilation using **XLA (Accelerated Linear Algebra)**. -This allows Python functions to be compiled and executed efficiently on both **CPUs and GPUs**. +--- -Key features: -- Transparent acceleration on GPUs and TPUs -- Automatic differentiation for machine learning and optimization -- Vectorization and parallel execution across devices +### Shared-Memory Parallelism (Multicore CPUs) -JAX is widely used in scientific machine learning, physics-informed models, and numerical optimization on large-scale clusters. +Shared-memory parallelism occurs within a single compute node, where all CPU cores access the same physical memory. +Python supports this level of performance primarily through **compiled numerical libraries** and **JIT (Just-In-Time) compilation**, which transform slow Python loops into efficient native machine code. ---- +#### NumPy: Foundation of Scientific Computing + +**NumPy** provides fast array operations implemented in C and Fortran. +Its vectorized operations and BLAS/LAPACK backends **automatically** exploit shared-memory parallelism through optimized linear algebra kernels. +Although users write Python, most computations occur in compiled native code. #### Pythran: Static Compilation of Numerical Python Code -**Pythran** is a Python-to-C++ compiler designed to speed up numerical Python code, especially code that uses **NumPy**. -It allows scientists to write high-level Python functions and then compile them into efficient native extensions that run close to C or Fortran performance. +**Pythran** compiles numerical Python code — particularly code using NumPy — into optimized C++ extensions. +It can automatically parallelize loops using **OpenMP**, enabling true multicore utilization without manual thread management. -Unlike tools that interface with existing Fortran code, Pythran works by **analyzing and translating Python source code itself** into optimized C++ code under the hood. -This makes it especially useful for accelerating array-oriented computations without leaving Python syntax. +Key strengths: +- Converts array-oriented Python functions into C++ for near-native speed +- Supports automatic OpenMP parallelization for CPU cores +- Integrates easily into existing Python workflows -Key features: -- Compiles numerical Python code (especially with NumPy) to efficient machine code -- Supports automatic parallelization via OpenMP -- Produces portable C++ extensions that can be imported directly into Python -- Requires only minimal code annotations to guide type inference +Pythran is well-suited for simulations or kernels that need to exploit multiple cores on a node. 
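A minimal sketch of the Pythran workflow is shown below. The file is ordinary Python plus one export comment; the module name, function name, and array type are made up for illustration, and the build step assumes Pythran is installed.

```python
# kernel.py -- ordinary NumPy code that Pythran can compile to a C++ extension
# Build (assuming Pythran is installed):   pythran kernel.py
# Use from Python afterwards:              import kernel; kernel.pairwise_mean(x)

#pythran export pairwise_mean(float64[])
import numpy as np

def pairwise_mean(x):
    # averages of neighbouring elements, then their overall mean
    y = 0.5 * (x[1:] + x[:-1])
    return y.sum() / y.size
```

Because the file remains valid Python, the same module can also be imported uncompiled, which keeps a slow but portable fallback available.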
-Typical use cases include: -- Scientific simulations and numerical kernels -- Loops or array operations that are too slow in pure Python -- HPC applications that need native performance but prefer to stay within the Python ecosystem +#### Numba: JIT Compilation for Shared and Accelerator Architectures -Pythran complements other tools such as **Numba** and **Cython**, giving scientists a flexible pathway to accelerate Python code without rewriting it in C or Fortran. +**Numba** uses LLVM to JIT-compile Python functions into efficient machine code at runtime. +On multicore CPUs, Numba can parallelize loops using OpenMP-like constructs; on GPUs, it can emit CUDA kernels (see below). -#### Numba: Just-In-Time Compilation for CPUs and GPUs +Main advantages: +- Minimal syntax changes required +- Explicit parallel decorators for CPU threading +- Compatible with NumPy arrays and ufuncs -**Numba** is a Just-In-Time (JIT) compiler that translates numerical Python code into fast machine code at runtime using LLVM. -It requires minimal code changes and provides performance close to C or Fortran. +Together, NumPy, Pythran, and Numba enable Python to fully exploit shared-memory parallelism. -Main advantages: -- Speeds up array-oriented computations on CPUs -- Enables parallel loops and multi-threading -- Supports GPU acceleration through CUDA targets +--- + +### Distributed-Memory Parallelism (Clusters and Supercomputers) + +At large scale, HPC systems use **distributed memory**, where each node has its own local memory and must communicate explicitly. +Python provides access to this level of parallelism through **mpi4py**, a direct interface to the standard MPI library. + +#### mpi4py: Scalable Distributed Computing with MPI + +**mpi4py** enables Python programs to exchange data between processes running on different nodes using MPI. +It provides both point-to-point and collective communication primitives, identical in concept to those used in C or Fortran MPI applications. + +Key features: +- Works seamlessly with NumPy arrays (zero-copy data transfer) +- Supports all MPI operations (send, receive, broadcast, scatter, gather, reduce) +- Compatible with job schedulers such as SLURM or PBS -Numba is ideal for users who want to optimize existing Python scripts for parallel CPU or GPU execution without rewriting them in another language. +With `mpi4py`, Python can participate in large-scale distributed-memory simulations or data-parallel tasks across thousands of cores. --- -#### CuPy: GPU-Accelerated Array Library +### Accelerator-Specific Parallelism (GPUs and TPUs) -**CuPy** provides a drop-in replacement for NumPy that runs on **NVIDIA GPUs** and also **AMD GPUs**. -It mimics the NumPy API, allowing users to accelerate numerical code simply by changing the import statement, while taking full advantage of GPU parallelism. +Modern HPC nodes increasingly include **GPUs** or **TPUs** to accelerate numerical workloads. +Python offers several mature libraries that interface directly with these accelerators, providing high-level syntax while executing low-level parallel kernels. + +#### CuPy: GPU-Accelerated NumPy Replacement + +**CuPy** mirrors the NumPy API but executes array operations on GPUs using CUDA (NVIDIA) or ROCm (AMD). +Users can port existing NumPy code to GPUs with minimal changes, gaining massive speedups for large, data-parallel computations. 
Highlights: -- NumPy-compatible syntax for easy transition -- GPU-backed arrays and linear algebra operations -- Integration with deep learning and simulation workflows +- NumPy-compatible array and linear algebra operations +- Native support for multi-GPU and CUDA streams +- Tight integration with deep learning and simulation frameworks + +#### JAX: Unified Array Computing for CPUs, GPUs, and TPUs + +**JAX** combines automatic differentiation and XLA-based compilation to execute Python functions efficiently on CPUs, GPUs, and TPUs. +It is particularly well-suited for scientific machine learning and differentiable simulations. + +Key strengths: +- Just-In-Time (JIT) compilation via XLA +- Transparent execution on accelerators (GPU, TPU) +- Built-in vectorization and automatic differentiation -CuPy is particularly useful for applications with large data arrays or intensive numerical workloads, such as molecular modeling, image processing, and machine learning. +JAX provides a single high-level API for heterogeneous HPC nodes, seamlessly handling hybrid CPU–GPU–TPU workflows. --- -#### mpi4py: Parallel and Distributed Computing with MPI +### Summary: Python Across HPC Architectures -**mpi4py** provides Python bindings to the standard **Message Passing Interface (MPI)**, allowing distributed-memory parallelism across multiple processes or nodes. -It enables Python programs to run efficiently on supercomputers, leveraging the same communication model used in large-scale C or Fortran HPC applications. +Python can now leverage **all layers of hybrid HPC architectures** through specialized libraries: -Capabilities: -- Point-to-point and collective communication -- Parallel I/O and reduction operations -- Compatibility with SLURM and other HPC schedulers +| Architecture | Parallelism Type | Typical Python Tools | Example Use Cases | +|---------------|------------------|----------------------|-------------------| +| **Multicore CPUs** | Shared memory | NumPy, Pythran, Numba | Numerical kernels, vectorized math | +| **Clusters** | Distributed memory | mpi4py | Large-scale simulations, domain decomposition | +| **GPUs / TPUs** | Accelerator parallelism | CuPy, JAX, Numba (CUDA) | Machine learning, dense linear algebra | -mpi4py bridges Python’s high-level usability with the scalability of traditional HPC systems, making it possible to prototype and deploy distributed applications quickly. +Together, these tools allow Python to serve as a *high-level orchestration language* that transparently scales from a single laptop core to full supercomputing environments — integrating shared-memory, distributed-memory, and accelerator-based parallelism in one ecosystem. + +--- + +:::{keypoints} +- Python’s ecosystem maps naturally onto hybrid HPC architectures. +- **NumPy, Numba, and Pythran** exploit shared-memory parallelism on multicore CPUs. +- **mpi4py** extends Python to distributed-memory clusters. +- **CuPy and JAX** enable acceleration on GPUs and TPUs. +- These libraries allow researchers to combine high productivity with near-native performance across all layers of HPC systems. +::: diff --git a/content/mpi4py.md b/content/mpi4py.md index 453f41b..f4b8c6e 100644 --- a/content/mpi4py.md +++ b/content/mpi4py.md @@ -18,19 +18,83 @@ ## What Is MPI? -**MPI (Message Passing Interface)** is a standard for communication among processes that run on distributed-memory systems, such as HPC clusters. 
+**MPI (Message Passing Interface)** is a standardized programming model for communication among processes that run on **distributed-memory systems**, such as HPC clusters. -When you run an MPI program, the system **creates multiple independent processes**, each running its *own copy* of the same program. -Each process is assigned a unique identifier called a **rank**, and all ranks together form a **communicator** (most commonly `MPI.COMM_WORLD`). +In a distributed-memory system, each compute node (or process) has its **own local memory**. +Unlike shared-memory systems, where threads can directly read and write to a common address space, distributed processes **cannot directly access each other’s memory**. +To collaborate, they must explicitly **send and receive messages** containing the data they need to share. -Key ideas: -- Every process runs the same code, but can behave differently depending on its rank. -- Processes do not share memory, they communicate by **sending and receiving messages**. -- MPI provides a consistent way to exchange data across nodes and processors. +### Independent Processes and the SPMD Model -Conceptually, this is sometimes called the **SPMD (Single Program, Multiple Data)** model. +When you run an MPI program, the system launches **multiple independent processes**, each running its **own copy** of the same program. +This design is fundamental: because each process owns its own memory space, it must contain its own copy of the code to execute its portion of the computation. + +Each process: +- Runs the same code but operates on a different subset of the data. +- Is identified by a unique number called its **rank**. +- Belongs to a **communicator**, a group of processes that can exchange messages (most commonly `MPI.COMM_WORLD`). + +This model is known as **SPMD: Single Program, Multiple Data**: +a single source program runs simultaneously on many processes, each working on different data. + +### Why Copies of the Program Are Needed? + +Because processes in distributed memory do not share variables or memory addresses, each process must have: +- Its **own copy of the executable code**, and +- Its **own private workspace (variables, arrays, etc.)**. + +This independence is crucial for scalability: +- Each process can execute independently without memory contention. +- The program can scale to thousands of nodes, since no shared memory bottleneck exists. +- Data movement becomes explicit and controllable, ensuring predictable performance on large clusters. + +### Sharing Data Between Processes + +Although memory is not shared, processes can **cooperate** by exchanging information through **message passing**. +MPI defines two main communication mechanisms: + +1. **Point-to-point communication**: Data moves **directly** between two processes. +2. **Collective communication**: Data is exchanged among **all processes** in a communicator in a coordinated way. + +:::{keypoints} +- **Process:** Each MPI program runs as multiple independent processes, not threads. +- **Rank:** Every process has a unique identifier (its *rank*) within a communicator, used to distinguish and coordinate them. +- **Communication:** Processes exchange data explicitly through message passing, either **point-to-point** (between pairs) or **collective** (among groups). + +Together, these three ideas form the foundation of MPI’s model for parallel computing. 
+:::
+
+---
+
+## mpi4py
+
+`mpi4py` is the standard Python interface to the **Message Passing Interface (MPI)**, the same API used by C, C++, and Fortran codes for distributed-memory parallelism.
+It allows Python programs to run on many processes, each with its own memory space, communicating through explicit messages.
+
+### Communicators and Initialization
+
+In MPI, all communication occurs through a **communicator**, an object that defines which processes can talk to each other.
+When a program starts, each process automatically becomes part of a predefined communicator called **`MPI.COMM_WORLD`**.
+
+This object represents *all processes* that were launched together by `mpirun` or `srun`.
+
+A typical initialization pattern looks like this:
+
+```python
+from mpi4py import MPI
+
+comm = MPI.COMM_WORLD  # Initialize communicator
+size = comm.Get_size() # Total number of processes
+rank = comm.Get_rank() # Rank (ID) of this process
+
+print(f"I am rank {rank} of {size}")
+```
+Every process executes the same code, but `rank` and `size` allow them to behave differently.
+
+:::{exercise} Hello world MPI
+Copy and paste this code and execute it using `mpirun -n N python mpi_hello.py`, where `N` is the number of tasks. \
+**Note:** Do not request more tasks than the number of cores your computer has.
-In order to see how this works let us try this simple code snippet:
 ```python
 # mpi_hello.py
 from mpi4py import MPI
@@ -45,8 +109,72 @@
 size = comm.Get_size()
 rank = comm.Get_rank()
 
 # Print a message from each process
-print(f"Hello from process {rank} out of {size}")
+print("Hello world")
+```
+This code snippet illustrates how independent processes run copies of the program. \
+To practice further, try the following:
+1. Use the `rank` variable to print the square of `rank` on each rank.
+2. Make the program print only on rank 0 (hint: `if rank == 0:`).
+:::
+:::{solution}
+
+*Solution 1:* Print the square of each rank
+```python
+# mpi_hello_square.py
+from mpi4py import MPI

+comm = MPI.COMM_WORLD
+size = comm.Get_size()
+rank = comm.Get_rank()
+
+# Each process prints its rank and the square of its rank
+print(f"Process {rank} of {size} has value {rank**2}")
+```
+*Solution 2:* Print from only one process (rank 0)
+```python
+# mpi_hello_rank0.py
+from mpi4py import MPI
+
+comm = MPI.COMM_WORLD
+size = comm.Get_size()
+rank = comm.Get_rank()
+
+if rank == 0:
+    print(f"Hello world from the root process (rank {rank}) out of {size} total processes")
 ```
+:::
+
+### Method Naming Convention
+
+In `mpi4py`, most MPI operations exist in **two versions**, a *lowercase* and an *uppercase* form, which differ in how they handle data.
+
+| Convention | Example | Description |
+|-------------|----------|-------------|
+| **Lowercase methods** | `send()`, `recv()`, `bcast()`, `gather()` | High-level, Pythonic methods that can send and receive arbitrary Python objects. Data is automatically serialized (pickled). Simpler to use but slower for large data. |
+| **Uppercase methods** | `Send()`, `Recv()`, `Bcast()`, `Gather()` | Low-level, performance-oriented methods that operate on **buffer-like objects** such as NumPy arrays. Data is transferred directly from memory without serialization, achieving near-C speed. |
+
+**Rule of thumb:**
+Use *lowercase* methods for small control messages or Python objects,
+and *uppercase* methods for numerical data stored in arrays when performance matters.
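+As a minimal two-process sketch of this convention (the file name and message contents are illustrative; the individual call signatures are broken down in the *Syntax differences* subsection below), run with, for example, `mpirun -n 2 python naming_demo.py`:
+
+```python
+# naming_demo.py -- illustrative sketch of lowercase vs. uppercase methods
+from mpi4py import MPI
+import numpy as np
+
+comm = MPI.COMM_WORLD
+rank = comm.Get_rank()
+
+if rank == 0:
+    # Lowercase: any Python object, pickled behind the scenes
+    comm.send({"step": 1, "label": "control message"}, dest=1, tag=0)
+
+    # Uppercase: buffer-based, no serialization; suited to large arrays
+    payload = np.arange(10, dtype="float64")
+    comm.Send([payload, MPI.DOUBLE], dest=1, tag=1)
+
+elif rank == 1:
+    meta = comm.recv(source=0, tag=0)      # small Python object
+    buf = np.empty(10, dtype="float64")    # preallocated receive buffer
+    comm.Recv([buf, MPI.DOUBLE], source=0, tag=1)
+    print(f"Rank 1 received {meta} and an array with sum {buf.sum()}")
+```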
+ +#### Syntax differences +**Lowercase (Python objects):** +```python +comm.send(obj, dest=1) +data = comm.recv(source=0) +``` +- The message (obj) can be any Python object. +- MPI automatically serializes and deserializes it internally. +- Fewer arguments: simple but less efficient for large data. + +**Uppercase (buffer-like objects, e.g., NumPy arrays):** +```python +comm.Send([array, MPI.DOUBLE], dest=1) +comm.Recv([array, MPI.DOUBLE], source=0) +``` +- Requires explicit definition of the data buffer and its MPI datatype. (same syntax as C++) +- Works directly with the memory address of the array (no serialization). +- Achieves maximum throughput for numerical and scientific workloads. --- @@ -59,8 +187,10 @@ Each message involves: - A **tag** identifying the message type - A **data buffer** that holds the information being transmitted +These operations are methods of the class `MPI.COMM_WORLD`. This means that one needs to initialize it + Typical operations: -- **Send:** one process transmits data. +- **Send:** one process transmits data. - **Receive:** another process waits for that data. In `mpi4py`, each of these operations maps directly to MPI’s underlying mechanisms but with a simple Python interface. @@ -70,8 +200,8 @@ Examples of conceptual use cases: - Distributing different chunks of data to multiple workers. - Passing boundary conditions between neighboring domains in a simulation. -To illustrate this let us construct a minimal example: - +:::{exercise} Point-to-Point Communication +Copy and paste the code below into a file called `mpi_send_recv.py`. ```python # mpi_send_recv.py from mpi4py import MPI @@ -94,7 +224,85 @@ else: # Other ranks do nothing print(f"Process {rank} is idle") ``` +Run the program using: +`mpirun -n 3 python mpi_send_recv.py` +You should see output indicating that process 0 sent data and process 1 received it, while all others remained idle. +Now try: +1. Change the roles: +Make process 1 send a reply back to process 0 after receiving the message. +Use `comm.send()` and `comm.recv()` in both directions. +2. Blocking communication: +Notice that `comm.send()` and `comm.recv()` are blocking operations. +- Add a short delay using `time.sleep(rank)` before sending or receiving. +- Observe how process 0 must wait until process 1 calls `recv()` before it can continue, and vice versa. +- Try swapping the order of the calls (e.g., both processes call `send()` first), what happens? +- You may notice the program hangs or deadlocks, because both processes are waiting for a `recv()` that never starts. +::: +:::{solution} +*Solution 1:* Change the roles (reply back): +```python +# mpi_send_recv_reply.py +from mpi4py import MPI + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() + +if rank == 0: + data_out = "Hello from process 0" + comm.send(data_out, dest=1) + print(f"Process {rank} sent: {data_out}") + + data_in = comm.recv(source=1) + print(f"Process {rank} received: {data_in}") + +elif rank == 1: + data_in = comm.recv(source=0) + print(f"Process {rank} received: {data_in}") + + data_out = "Reply from process 1" + comm.send(data_out, dest=0) + print(f"Process {rank} sent: {data_out}") + +else: + print(f"Process {rank} is idle") +``` +*Solution 2:* Blocking communication behavior: +1. Add a delay (e.g., time.sleep(rank)) before send/recv and observe that each blocking call waits for its partner. 
Example: +```python +# mpi_blocking_delay.py +from mpi4py import MPI +import time + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() + +time.sleep(rank) # stagger arrival + +if rank == 0: + comm.send("msg", dest=1) + print("0 -> sent") + print("0 -> got:", comm.recv(source=1)) + +elif rank == 1: + print("1 -> got:", comm.recv(source=0)) + comm.send("ack", dest=0) + print("1 -> sent") +``` +2. Deadlock demonstration: +```python +# mpi_deadlock.py +from mpi4py import MPI +comm = MPI.COMM_WORLD +rank = comm.Get_rank() + +if rank in (0, 1): + # Both ranks call send() first -> potential deadlock + comm.send(f"from {rank}", dest=1-rank) + # This recv may never be reached if partner is also stuck in send() + print("received:", comm.recv(source=1-rank)) +``` +::: --- ## Collective Communication @@ -111,7 +319,9 @@ Common collectives include: Collectives are conceptually similar to group conversations, where every participant either contributes, receives, or both. They are essential for algorithms that require sharing intermediate results or aggregating outputs. -Example broadcast + gather +:::{exercise} Collectives +Let us run this code to see the collectives `bcast` and `gather` in action: + ```python # mpi_collectives.py from mpi4py import MPI @@ -142,27 +352,71 @@ if rank == 0: for msg in all_msgs: print(msg) ``` +Now try the following: +1. Change the root process: In the broadcast section, change the root process from `rank 0` to `rank 1`. +2. How would be the same done with point to point communication? +::: +:::{solution} +*Solution 1:* Change the root process: +The root process is the one that handles the behaviour of the collectives. So we just need to change the root +of the collective **broadcast**. +```python +# --- Broadcast example --- +if rank == 1: + data = "Hello from process 1 (new root)" +else: + data = None + +# Broadcast data from process 1 to all others +data = comm.bcast(data, root=1) +print(f"Process {rank} received: {data}") +``` +*Solution 2:* Manual broadcasting the previous code: +To reproduce a broadcast manually using only send() and recv(), one could write: +```python +# Manual broadcast using point-to-point +if rank == 1: + data = "Hello from process 1 (manual broadcast)" + # Send to all other processes + for dest in range(size): + if dest != rank: + comm.send(data, dest=dest) +else: + data = comm.recv(source=1) +print(f"Process {rank} received: {data}") +``` +::: --- ## Integration with NumPy: Buffer-Like Objects -One of the strengths of `mpi4py` is its **tight integration with NumPy**. -MPI operations in `mpi4py` can directly send and receive **buffer-like objects**, such as NumPy arrays, without copying data back and forth between Python and C memory. +A major strength of `mpi4py` is its **direct integration with NumPy arrays**. +MPI operations can send and receive **buffer-like objects**, such as NumPy arrays, without copying data between Python and C memory. + +:::{keypoints} Important +Remember that **buffer-like objects** can be used with the **uppercase methods** for avoiding serialization and its time overhead. +::: +Because NumPy arrays expose their internal memory buffer, MPI can access this data directly. +This eliminates the need for serialization (no `pickle` step) and allows **near-native C performance** for communication and collective operations. Conceptually: -- Each NumPy array exposes its underlying memory buffer. -- MPI can access this memory region directly. 
-- This avoids serialization overhead (no need for `pickle`) and achieves near-native C performance. +- Each NumPy array acts as a **contiguous memory buffer**. +- MPI transfers data directly from this buffer to another process’s memory. +- This mechanism is ideal for large numerical datasets, enabling efficient data movement in parallel programs. -This makes it possible to: -- Efficiently distribute large numerical datasets across processes. -- Perform collective operations directly on arrays. -- Integrate `mpi4py` seamlessly into scientific Python workflows. +This integration makes it possible to: +- Distribute large datasets across processes using **collectives** like `Scatter` and `Gather`. +- Combine results efficiently with operations like `Reduce` or `Allreduce`. +- Seamlessly integrate parallelism into scientific Python workflows. -For example, collective reductions like sums or means across large NumPy arrays are routine building blocks in parallel simulations and machine learning workloads. +--- + +:::{exercise} Collective Operations on NumPy Arrays +In this example, you will see how collective MPI operations distribute and combine large arrays across multiple processes using **buffer-based communication**. + +Save the following code as `mpi_numpy_collectives.py` and run it with multiple processes: -Example with collectives: ```python # mpi_numpy_collectives.py from mpi4py import MPI @@ -189,74 +443,42 @@ local_array = np.empty(local_N, dtype="float64") # Scatter the big array from root to all processes comm.Scatter( - [big_array, MPI.DOUBLE], # send buffer (only meaningful on root) - [local_array, MPI.DOUBLE], # receive buffer on every rank + [big_array, MPI.DOUBLE], # send buffer (only valid on root) + [local_array, MPI.DOUBLE], # receive buffer on all processes root=0, ) # Each process computes a local sum local_sum = np.sum(local_array) -# Reduce all local sums to a global sum on root +# Reduce all local sums to a global sum on the root process global_sum = comm.reduce(local_sum, op=MPI.SUM, root=0) if rank == 0: print(f"Global sum = {global_sum}") print(f"Expected = {float(N)}") ``` +Run the program using +```bash +mpirun -n 4 python mpi_numpy_collectives.py +``` +Questions: +1. Which MPI calls distribute and collect data in this program? +2. Why is it necessary to preallocate local_array on every process? +3. What would happen if you used lowercase methods (scatter, reduce) instead of Scatter, Reduce? 
+::: +:::{solution} -To really understand the power of collectives, we can have a look at the same code using p2p communication: -```python -# mpi_numpy_point_to_point.py -from mpi4py import MPI -import numpy as np - -comm = MPI.COMM_WORLD -rank = comm.Get_rank() -size = comm.Get_size() - -# Total number of elements in the big array (must be divisible by size) -N = 10_000_000 -local_N = N // size - -if rank == 0: - # Root creates the full big array - big_array = np.ones(N, dtype="float64") # all ones for simplicity - - # --- Send chunks to other ranks --- - for dest in range(1, size): - start = dest * local_N - end = (dest + 1) * local_N - chunk = big_array[start:end] - comm.Send([chunk, MPI.DOUBLE], dest=dest, tag=0) - - # Root keeps its own chunk (rank 0) - local_array = big_array[0:local_N] - -else: - # Other ranks allocate a buffer and receive their chunk - local_array = np.empty(local_N, dtype="float64") - comm.Recv([local_array, MPI.DOUBLE], source=0, tag=0) - -# Each process computes its local sum -local_sum = np.sum(local_array) - -if rank == 0: - # Root starts with its own local sum - global_sum = local_sum - - # --- Receive partial sums from all other ranks --- - for source in range(1, size): - recv_sum = comm.recv(source=source, tag=1) - global_sum += recv_sum +*Solution 1:* The MPI calls that distribute and collect data in this program are comm.Scatter() and comm.reduce(). +Scatter divides the large NumPy array on the root process and sends chunks to all ranks, while Reduce collects the locally computed results and combines them (using MPI.SUM) into a single global result on the root process. - print(f"Global sum (point-to-point) = {global_sum}") - print(f"Expected = {float(N)}") +*Solution 2:* It is necessary to preallocate local_array on every process because the uppercase MPI methods (Scatter, Gather, Reduce, etc.) work directly with memory buffers. +Each process must provide a fixed, correctly sized buffer so that MPI can write received data directly into it without additional memory allocation or copying. -else: - # Other ranks send their local sum back to root - comm.send(local_sum, dest=0, tag=1) -``` +*Solution 3:* If lowercase methods (scatter, reduce) were used instead, MPI would serialize and deserialize the Python objects being communicated (using pickle). +This would make the program simpler but significantly slower for large numerical arrays, since it adds extra copying and memory overhead. +Using the uppercase buffer-based methods avoids this cost and achieves near-native C performance. 
+::: --- ## Summary From 550efd9a57e3220fc6f144bb7ebb69319b115d50 Mon Sep 17 00:00:00 2001 From: Juan de Gracia Date: Tue, 16 Dec 2025 11:34:50 +0100 Subject: [PATCH 6/6] Adressed comments --- content/conf.py | 4 ++-- content/intohpc.md | 53 ++++++++++++++++++++++++++++++++++------------ content/mpi4py.md | 4 ++-- 3 files changed, 44 insertions(+), 17 deletions(-) diff --git a/content/conf.py b/content/conf.py index 7c33d39..45ee739 100644 --- a/content/conf.py +++ b/content/conf.py @@ -14,9 +14,9 @@ # -- Project information ----------------------------------------------------- # FIXME: choose title -project = "Your lesson name" +project = "Python for HPC" # FIXME: insert correct author -author = "The contributors" +author = "Ashwin Mohanan, Francesco Fiusco, Qiang Li, Juan de Gracia" copyright = f"2025, ENCCS, {author}" # FIXME: github organization / user that the repository belongs to diff --git a/content/intohpc.md b/content/intohpc.md index a27206d..dd0fa30 100644 --- a/content/intohpc.md +++ b/content/intohpc.md @@ -28,7 +28,7 @@ use clusters to perform large-scale computations. HPC systems, often called *supercomputers* or *clusters*, are made up of many computers (called **nodes**) connected by a fast network. Each node can have multiple cores which are **CPUs** (and sometimes **GPUs**) that -run tasks in parallel. +run tasks (called **jobs**) in parallel. ### Typical HPC Components @@ -131,7 +131,7 @@ Execute it and following let us verify the effect on the following modifications - Minimize Python in hot paths: Move heavy math into NumPy calls; keep Python for orchestration only. - Benchmark correctly: Use large N, pin threads to 1 for fair single-core tests, and report the best of multiple runs after a warmup. --- +--- ## Parallel Computing @@ -155,7 +155,7 @@ This is the model used in: - Multicore laptops and workstations - *Single compute nodes* on a cluster -Programs use **threads** to execute in parallel (e.g., with OpenMP in C/C++/Fortran or **multiprocessing in Python**). +Programs use **threads** to execute in parallel (e.g., with OpenMP in C/C++/Fortran or **threading in Python**). :::{keypoints} Advantages: @@ -168,17 +168,44 @@ Limitations: ::: :::{exercise} Practice with threaded parallelism in Python -Example: +This is a textbook example of I/O-bound concurrency with shared memory. It efficiently handles tasks that spend most of their time waiting (simulated by time.sleep) by allowing other threads to run during those pauses, maximizing efficiency despite the GIL. It also perfectly demonstrates the convenience of Python threading: because all threads live in the same process, they can instantly write to a single global data structure (database), avoiding the complexity of inter-process communication, while using a Lock to safely manage the one major risk of this approach (race conditions). ```python -from multiprocessing import Pool - -def square(x): - return x * x - -if __name__ == "__main__": - with Pool(4) as p: - result = p.map(square, range(8)) - print(result) +import threading +import time + +# 1. SHARED MEMORY +# This list lives in the global process memory. +# All threads can see and modify it. +database = [] + +# We create a lock to synchronize the THREADS. +lock = threading.Lock() + +def save_data(data_id): + print(f"Thread-{data_id}: Processing...") + time.sleep(0.1) # Simulate I/O delay (network/disk) + + # 2. 
MODIFYING SHARED MEMORY
+    # We use the Lock to ensure two threads don't append at the exact same time
+    with lock:
+        database.append(f"Record {data_id}")
+
+# 3. CREATING THREADS
+# Defining the list to hold thread objects
+threads = []
+for i in range(5):
+    t = threading.Thread(target=save_data, args=(i,))
+    # Adding thread objects to the list
+    threads.append(t)
+    t.start()
+
+# 4. JOINING (Waiting for completion)
+for t in threads:
+    # This tells the main thread to not continue until
+    # this specific THREAD is finished.
+    t.join()
+
+print("\nFinal Shared Database:", database)
 ```
 :::
diff --git a/content/mpi4py.md b/content/mpi4py.md
index f4b8c6e..c24d3e6 100644
--- a/content/mpi4py.md
+++ b/content/mpi4py.md
@@ -236,7 +236,7 @@
 Notice that `comm.send()` and `comm.recv()` are blocking operations.
 - Add a short delay using `time.sleep(rank)` before sending or receiving.
 - Observe how process 0 must wait until process 1 calls `recv()` before it can continue, and vice versa.
 - Try swapping the order of the calls (e.g., both processes call `send()` first), what happens?
-- You may notice the program hangs or deadlocks, because both processes are waiting for a `recv()` that never starts.
+- You may notice the program hangs or deadlocks, because both processes are waiting for a `recv()` that never starts. Non-blocking versions of `send` and `recv`, called [`isend`](https://mpi4py.readthedocs.io/en/stable/reference/mpi4py.MPI.Comm.html#mpi4py.MPI.Comm.isend) and [`irecv`](https://mpi4py.readthedocs.io/en/stable/reference/mpi4py.MPI.Comm.html#mpi4py.MPI.Comm.irecv), avoid this problem; they are not covered in this course, but an example is available in the [mpi4py tutorial](https://mpi4py.readthedocs.io/en/stable/tutorial.html) under the section *Python objects with non-blocking communication*.
 :::
 :::{solution}
@@ -504,4 +504,4 @@ mpi4py allows Python users to write distributed parallel programs that scale fro
 - Collective communication coordinates data exchange across many processes.
 - mpi4py integrates tightly with NumPy for efficient, zero-copy data transfers.
 - These concepts allow Python programs to scale effectively on HPC systems.
-:::
\ No newline at end of file
+:::