
Conversation

@jadgt (Contributor) commented Dec 15, 2025

These are the lectures for the Python for HPC course.

…key concepts, communication types, and integration with NumPy.
…rmance optimization, independent processes, and collective communication. Add practical exercises for users to practice single-core speed and MPI concepts.
- Minimize Python in hot paths: Move heavy math into NumPy calls; keep Python for orchestration only.
- Benchmark correctly: Use large N, pin threads to 1 for fair single-core tests, and report the best of multiple runs after a warmup.
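
A minimal sketch of the thread-pinning point above, assuming a NumPy build backed by OpenBLAS or MKL (which environment variables matter depends on the installation); the variables must be set before NumPy is imported:

```python
import os

# Pin the math libraries to a single thread *before* importing NumPy,
# so single-core timings are not inflated by hidden parallelism.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np
from time import perf_counter

x = np.random.rand(10_000_000)          # large N so the measurement is meaningful
np.dot(x, x)                            # warmup run, not timed
t0 = perf_counter()
np.dot(x, x)
print("elapsed:", perf_counter() - t0)
```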

--

Contributor

@jadgt I guess you want to use "---" instead of "--" here

Contributor Author

Thanks!

@ffrancesco94 left a comment

Gorgeous, it's really good <3 I only wrote a couple of minor things. I enjoyed it!

HPC systems, often called *supercomputers* or *clusters*, are made up of
many computers (called **nodes**) connected by a fast network. Each node
can have multiple **CPU** cores (and sometimes **GPUs**) that run
tasks in parallel.


maybe have tasks (called **jobs**) since you use the word 'job' later

Contributor Author

Good idea, I have added it!

- Multicore laptops and workstations
- *Single compute nodes* on a cluster

Programs use **threads** to execute in parallel (e.g., with OpenMP in C/C++/Fortran or **multiprocessing in Python**).

@ffrancesco94 Dec 15, 2025

In Python multiprocessing uses processes, not threads, and the processes do not share the memory space (they actually clone the interpreter and everything else). Python hides it well, so maybe it's not that important, but I think it's better to use threading rather than multiprocessing here. Especially in the new free-threaded versions, this can play a role.

Contributor Author

Very nicely spotted; I have changed it from the multiprocessing module to the threading module.
I have also updated the example to reflect the typical I/O operations that benefit from threading. Thanks.
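
For reference, a minimal sketch of the kind of I/O-bound pattern that benefits from `threading` (illustrative only, not the lesson's actual example): the threads share the process memory, and the simulated I/O wait releases the GIL so the other threads can run.

```python
import threading
import time

results = []                 # shared, in-process data structure
lock = threading.Lock()      # guards concurrent appends (avoids race conditions)

def fetch(item):
    time.sleep(0.5)          # stand-in for a blocking I/O call (network, disk, ...)
    with lock:
        results.append(f"processed {item}")

threads = [threading.Thread(target=fetch, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)               # four items, total wall time ~0.5 s instead of ~2 s
```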

- Add a short delay using `time.sleep(rank)` before sending or receiving.
- Observe how process 0 must wait until process 1 calls `recv()` before it can continue, and vice versa.
- Try swapping the order of the calls (e.g., both processes call `send()` first); what happens?
- You may notice the program hangs or deadlocks, because both processes are waiting for a `recv()` that never starts.


Maybe just add a line saying that there are non-blocking versions of these primitives? No need to explain anything, just a link to the docs so that interested people can read about it.

Contributor Author

Fair enough, I didn't want to go into this because it requires explaining the wait and so on, but I have added links to the docs.
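
For the curious, a minimal sketch of the non-blocking counterparts (`isend`/`irecv` in mpi4py, which return request objects to wait on); this is a generic illustration rather than the lesson's example, run with two processes (`mpiexec -n 2 python script.py`):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Both ranks post their sends first; with blocking send()/recv() this
# ordering could deadlock, but isend() returns immediately.
if rank == 0:
    req_send = comm.isend({"from": 0}, dest=1, tag=11)
    req_recv = comm.irecv(source=1, tag=22)
elif rank == 1:
    req_send = comm.isend({"from": 1}, dest=0, tag=22)
    req_recv = comm.irecv(source=0, tag=11)

if rank < 2:
    req_send.wait()          # complete the send
    data = req_recv.wait()   # complete the receive and obtain the message
    print(f"rank {rank} received {data}")
```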

Comment on lines +64 to +71
```python
# timed runs
tmin = float("inf")
for _ in range(repeats):
    t0 = perf_counter()
    fn(*args, **kwargs)
    dt = perf_counter() - t0
    tmin = min(tmin, dt)
return tmin
```

Contributor

1. While making benchmarks, it makes sense to use either the median or the arithmetic mean. More on this here:
   https://pyperf.readthedocs.io/en/latest/analyze.html#minimum-vs-average
   The general advice I see is that if the benchmark is unstable, which is usually the case, the median is a good measure. Of course we don't go to this depth here, but let's show the right way to do things.
   For either of those operations you can use np.median or np.mean.

2. Kudos for using perf_counter!

3. Collect observations into a list and compute the right stats outside the loop.

4. If it makes sense, we can use timeit.timeit directly. If not, it can be mentioned as a :::{tip} ... :::

Contributor

I meant timeit.timeit, from the std library.
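
A minimal sketch of what the suggested change could look like (collect every timing into a list, then compute the median outside the loop); the function name and signature here are assumptions, not the actual lesson code:

```python
from time import perf_counter
import numpy as np

def bench(fn, *args, repeats=5, **kwargs):
    """Time fn(*args, **kwargs) `repeats` times and return the median runtime."""
    fn(*args, **kwargs)                # warmup run, not timed
    times = []
    for _ in range(repeats):
        t0 = perf_counter()
        fn(*args, **kwargs)
        times.append(perf_counter() - t0)
    return float(np.median(times))     # robust against occasional slow runs
```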

Comment on lines +160 to +168
:::{keypoints}
Advantages:
- Easy communication between threads (shared variables)
- Low latency data access

Limitations:
- Limited by the number of cores on one machine
- Risk of race conditions if data access is not synchronized
:::

Contributor

We reserve keypoints for the end of the episode.

Suggested change
:::{keypoints}
Advantages:
- Easy communication between threads (shared variables)
- Low latency data access
Limitations:
- Limited by the number of cores on one machine
- Risk of race conditions if data access is not synchronized
:::
:::{note}
Advantages:
- Easy communication between threads (shared variables)
- Low latency data access
Limitations:
- Limited by the number of cores on one machine
- Risk of race conditions if data access is not synchronized
:::

:::

:::{exercise} Practice with threaded parallelism in Python
This is a textbook example of I/O-bound concurrency with shared memory. It efficiently handles tasks that spend most of their time waiting (simulated by time.sleep) by allowing other threads to run during those pauses, maximizing efficiency despite the GIL. It also perfectly demonstrates the convenience of Python threading: because all threads live in the same process, they can instantly write to a single global data structure (database), avoiding the complexity of inter-process communication, while using a Lock to safely manage the one major risk of this approach (race conditions).

Contributor

Suggested change
This is a textbook example of I/O-bound concurrency with shared memory. It efficiently handles tasks that spend most of their time waiting (simulated by time.sleep) by allowing other threads to run during those pauses, maximizing efficiency despite the GIL. It also perfectly demonstrates the convenience of Python threading: because all threads live in the same process, they can instantly write to a single global data structure (database), avoiding the complexity of inter-process communication, while using a Lock to safely manage the one major risk of this approach (race conditions).
This is a textbook example of I/O-bound concurrency with shared memory. It efficiently handles tasks that spend most of their time waiting (simulated by `time.sleep`) by allowing other threads to run during those pauses, maximizing efficiency despite the {abbr}`GIL (Global Interpreter Lock: a built-in internal thread lock to prevent race conditions that could corrupt data)`. It also perfectly demonstrates the convenience of Python threading: because all threads live in the same process, they can instantly write to a single global data structure (database), avoiding the complexity of inter-process communication, while using a Lock to safely manage the one major risk of this approach (race conditions).

Contributor

I am using https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#role-abbr here, if you are wondering what {abbr} is.

## Python in High-Performance Computing

Python has become one of the most widely used languages in scientific computing due to its simplicity, readability, and extensive ecosystem of numerical libraries.
Although Python itself is interpreted and slower than compiled languages such as C or Fortran, it now provides a mature set of tools that allow code to **run efficiently on modern HPC architectures**.

Contributor

Suggested change
Although Python itself is interpreted and slower than compiled languages such as C or Fortran, it now provides a mature set of tools that allow code to **run efficiently on modern HPC architectures**.
Although Python itself is interpreted and often slower than compiled languages such as C or Fortran, it now provides a mature set of tools that allow code to **run efficiently on modern HPC architectures**.

Comment on lines +246 to +255
:::{keypoints}
Advantages:
- Scales to thousands of nodes
- Each process works independently, avoiding memory contention

Limitations:
- Requires explicit communication (send/receive)
- More complex programming model
- More latency, requires minimizing movement of data.
:::

Contributor

Suggested change
:::{keypoints}
Advantages:
- Scales to thousands of nodes
- Each process works independently, avoiding memory contention
Limitations:
- Requires explicit communication (send/receive)
- More complex programming model
- More latency, requires minimizing movement of data.
:::
:::{note}
Advantages:
- Scales to thousands of nodes
- Each process works independently, avoiding memory contention
Limitations:
- Requires explicit communication (send/receive)
- More complex programming model
- More latency, requires minimizing movement of data.
:::

- MPI creates multiple independent processes running the same program.
- Point-to-point communication exchanges data directly between two processes.
- Collective communication coordinates data exchange across many processes.
- mpi4py integrates tightly with NumPy for efficient, zero-copy data transfers.

Contributor

Checking zero-copy again.
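
For context on the zero-copy claim: with the uppercase, buffer-based API, mpi4py hands the NumPy array's memory directly to the MPI library instead of pickling the object. A minimal sketch (generic illustration, not the lesson code), run with two processes:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.zeros(5, dtype="d")          # preallocated buffer on every rank

if rank == 0:
    data[:] = np.arange(5, dtype="d")
    comm.Send([data, MPI.DOUBLE], dest=1, tag=7)    # ships the buffer, no pickling
elif rank == 1:
    comm.Recv([data, MPI.DOUBLE], source=0, tag=7)  # fills `data` in place
    print("rank 1 received", data)
```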

1. **Point-to-point communication**: Data moves **directly** between two processes.
2. **Collective communication**: Data is exchanged among **all processes** in a communicator in a coordinated way.

:::{keypoints}

Contributor

Suggested change
:::{keypoints}
:::{note}

A major strength of `mpi4py` is its **direct integration with NumPy arrays**.
MPI operations can send and receive **buffer-like objects**, such as NumPy arrays, without copying data between Python and C memory.

:::{keypoints} Important

Contributor

Suggested change
:::{keypoints} Important
:::{important}

Comment on lines +160 to +166
#### Syntax differences
**Lowercase (Python objects):**
```python
comm.send(obj, dest=1)
data = comm.recv(source=0)
```
- The message (obj) can be any Python object.

@ashwinvis (Contributor) Dec 16, 2025

Suggested change
#### Syntax differences
**Lowercase (Python objects):**
```python
comm.send(obj, dest=1)
data = comm.recv(source=0)
```
- The message (obj) can be any Python object.
#### Syntax differences
**Lowercase (Python objects):**
```python
comm.send(obj, dest=1)
data = comm.recv(source=0)
```
- The message (`obj`) can be any Python object.

The most basic form of communication in MPI is **point-to-point**, meaning data is sent from one process directly to another.

Each message involves:
- A **sender** and a **receiver**

Contributor

Suggested change
- A **sender** and a **receiver**
- A **sender** (`source`) and a **receiver** (`dest`)

Contributor

Of course you can avoid this since this terminology assumes point-to-point and not collectives.
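
Since collectives came up: for contrast, a minimal sketch of collective communication in mpi4py (a generic broadcast-plus-reduce illustration, not taken from the lesson). Note that there is no `source`/`dest` pair here; every rank in the communicator takes part:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# The root rank defines the parameters; bcast copies them to every rank.
params = {"n": 1000} if rank == 0 else None
params = comm.bcast(params, root=0)

# Each rank computes a partial result; reduce combines them on the root.
partial = rank * params["n"]
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print("total =", total)
```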
