Juan lectures #3
base: main
Conversation
…key concepts, communication types, and integration with NumPy.
…rmance optimization, independent processes, and collective communication. Add practical exercises for users to practice single-core speed and MPI concepts.
content/intohpc.md
Outdated
- Minimize Python in hot paths: Move heavy math into NumPy calls; keep Python for orchestration only.
- Benchmark correctly: Use large N, pin threads to 1 for fair single-core tests, and report the best of multiple runs after a warmup.

--
@jadgt I guess you want to use "---" instead of "--" here
Thanks!
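As a side note on the quoted advice about hot paths, here is a minimal sketch of what "move heavy math into NumPy calls" means in practice (illustrative names, not code from the lesson):

```python
import numpy as np

def row_norms_python(rows):
    # Hot path in pure Python: the loop and arithmetic run in the interpreter.
    return [sum(x * x for x in row) ** 0.5 for row in rows]

def row_norms_numpy(a):
    # The same computation pushed into vectorized NumPy calls.
    return np.sqrt((a * a).sum(axis=1))

a = np.random.default_rng(0).random((100_000, 3))
assert np.allclose(row_norms_python(a.tolist()), row_norms_numpy(a))
```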
ffrancesco94 left a comment
Gorgeous, it's really good <3 I only wrote a couple of minor things. I enjoyed it!
content/intohpc.md
Outdated
HPC systems, often called *supercomputers* or *clusters*, are made up of
many computers (called **nodes**) connected by a fast network. Each node
can have multiple cores which are **CPUs** (and sometimes **GPUs**) that
run tasks in parallel.
maybe have tasks (called **jobs**) since you use the word 'job' later
Good idea, I have added it!
content/intohpc.md
Outdated
- Multicore laptops and workstations
- *Single compute nodes* on a cluster

Programs use **threads** to execute in parallel (e.g., with OpenMP in C/C++/Fortran or **multiprocessing in Python**).
In Python, `multiprocessing` uses processes, not threads, and the processes do not share the memory space (they actually clone the interpreter and everything else). Python hides it well, so maybe it's not that important, but I think it's better to use `threading` rather than `multiprocessing` here. Especially in the new free-threaded versions, this can play a role.
Very nicely spotted; I have changed it from the `multiprocessing` module to the `threading` module.
Also, I have updated the example to reflect the typical I/O operations that benefit from using threading. Thanks.
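For readers following along, a minimal sketch of the pattern being discussed (I/O-bound tasks writing to a shared structure guarded by a `Lock`); the names `fetch` and `database` are illustrative, not necessarily the ones used in the updated example:

```python
import threading
import time

database = {}                 # shared in-process data structure
lock = threading.Lock()

def fetch(item):
    time.sleep(1)             # stand-in for an I/O wait (network, disk, ...)
    with lock:                # guard the shared dict against race conditions
        database[item] = f"result for {item}"

threads = [threading.Thread(target=fetch, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(database)               # all four results, collected in about one second
```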
content/mpi4py.md
Outdated
- Add a short delay using `time.sleep(rank)` before sending or receiving.
- Observe how process 0 must wait until process 1 calls `recv()` before it can continue, and vice versa.
- Try swapping the order of the calls (e.g., both processes call `send()` first), what happens?
- You may notice the program hangs or deadlocks, because both processes are waiting for a `recv()` that never starts.
Maybe just add a line saying that there are non-blocking versions of these primitives? No need to explain anything, just a link to the docs so that interested people can read about it.
Fair enough, I didn't want to go into this because it requires explaining `wait` and so on, but I have added links to the docs.
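For the curious, the non-blocking counterparts look roughly like this (a sketch only; see the mpi4py documentation linked in the lesson for the full story):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Non-blocking send/receive: both calls return immediately with a request,
# so the "both sides send first" ordering no longer deadlocks.
if rank == 0:
    req = comm.isend({"from": 0}, dest=1, tag=0)
    data = comm.irecv(source=1, tag=0).wait()
    req.wait()
elif rank == 1:
    req = comm.isend({"from": 1}, dest=0, tag=0)
    data = comm.irecv(source=0, tag=0).wait()
    req.wait()
```

Run with something like `mpirun -n 2 python example.py`.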
# timed runs
tmin = float("inf")
for _ in range(repeats):
    t0 = perf_counter()
    fn(*args, **kwargs)
    dt = perf_counter() - t0
    tmin = min(tmin, dt)
return tmin
- While making benchmarks, it makes sense to either do median / arithmetic mean. More on this here: https://pyperf.readthedocs.io/en/latest/analyze.html#minimum-vs-average
  The general advice that I see is that if the benchmark is unstable, which is usually the case, median is a good measure. Of course we don't go to this depth here, but let's show the right way to do things.
  For either of those operations you can use `np.median` or `np.mean`.
- Kudos for using `perf_counter`!
- Collect observations into a list and compute the right stats out of the loop.
- If it makes sense, we can use `timeit.timeit` directly. If not, it can be mentioned as a `:::{tip} ... :::`.
I meant `timeit.timeit`, from the std library.
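Putting those suggestions together, the timing helper could look roughly like this (a sketch only; the lesson's final code may differ, and `np.median` works equally well):

```python
from statistics import median
from time import perf_counter
import timeit

def bench(fn, *args, repeats=7, **kwargs):
    fn(*args, **kwargs)                     # warm-up run
    times = []                              # collect every observation
    for _ in range(repeats):
        t0 = perf_counter()
        fn(*args, **kwargs)
        times.append(perf_counter() - t0)
    return median(times)                    # compute the statistic outside the loop

# The standard library can also drive the timing loop for you:
t = timeit.timeit("sum(range(1000))", number=10_000)
```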
:::{keypoints}
Advantages:
- Easy communication between threads (shared variables)
- Low latency data access

Limitations:
- Limited by the number of cores on one machine
- Risk of race conditions if data access is not synchronized
:::
We reserve keypoints for the end of the episode.
Suggested change: use a `:::{note}` admonition instead of `:::{keypoints}` for this block (the list itself is unchanged).
:::

:::{exercise} Practice with threaded parallelism in Python
This is a textbook example of I/O-bound concurrency with shared memory. It efficiently handles tasks that spend most of their time waiting (simulated by time.sleep) by allowing other threads to run during those pauses, maximizing efficiency despite the GIL. It also perfectly demonstrates the convenience of Python threading: because all threads live in the same process, they can instantly write to a single global data structure (database), avoiding the complexity of inter-process communication, while using a Lock to safely manage the one major risk of this approach (race conditions).
Suggested change:
This is a textbook example of I/O-bound concurrency with shared memory. It efficiently handles tasks that spend most of their time waiting (simulated by `time.sleep`) by allowing other threads to run during those pauses, maximizing efficiency despite the {abbr}`GIL (Global Interpreter Lock: a built-in internal thread lock to prevent race conditions that could corrupt data)`. It also perfectly demonstrates the convenience of Python threading: because all threads live in the same process, they can instantly write to a single global data structure (database), avoiding the complexity of inter-process communication, while using a Lock to safely manage the one major risk of this approach (race conditions).
I am using https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#role-abbr here, if you are wondering what {abbr} is.
## Python in High-Performance Computing

Python has become one of the most widely used languages in scientific computing due to its simplicity, readability, and extensive ecosystem of numerical libraries.
Although Python itself is interpreted and slower than compiled languages such as C or Fortran, it now provides a mature set of tools that allow code to **run efficiently on modern HPC architectures**.
Suggested change:
Although Python itself is interpreted and often slower than compiled languages such as C or Fortran, it now provides a mature set of tools that allow code to **run efficiently on modern HPC architectures**.
:::{keypoints}
Advantages:
- Scales to thousands of nodes
- Each process works independently, avoiding memory contention

Limitations:
- Requires explicit communication (send/receive)
- More complex programming model
- More latency, requires minimizing movement of data.
:::
Suggested change: use a `:::{note}` admonition instead of `:::{keypoints}` for this block (the list itself is unchanged).
- MPI creates multiple independent processes running the same program.
- Point-to-point communication exchanges data directly between two processes.
- Collective communication coordinates data exchange across many processes.
- mpi4py integrates tightly with NumPy for efficient, zero-copy data transfers.
Checking zero-copy again.
1. **Point-to-point communication**: Data moves **directly** between two processes.
2. **Collective communication**: Data is exchanged among **all processes** in a communicator in a coordinated way.

:::{keypoints}
Suggested change: use `:::{note}` instead of `:::{keypoints}`.
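As a quick illustration of the collective style being contrasted here, a sketch (not lesson code) of a broadcast followed by a gather:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Collective communication: every rank participates in the same call.
params = comm.bcast({"n": 100} if rank == 0 else None, root=0)  # one-to-all
partial = rank * params["n"]
results = comm.gather(partial, root=0)                          # all-to-one

if rank == 0:
    print(results)   # a list with one entry per rank
```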
A major strength of `mpi4py` is its **direct integration with NumPy arrays**.
MPI operations can send and receive **buffer-like objects**, such as NumPy arrays, without copying data between Python and C memory.

:::{keypoints} Important
Suggested change: use `:::{important}` instead of `:::{keypoints} Important`.
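For context, buffer-based transfers of NumPy arrays look roughly like this (a sketch; the exact snippet in the lesson may differ):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(10, dtype="d")
    comm.Send([data, MPI.DOUBLE], dest=1)    # sends straight from the NumPy buffer
elif rank == 1:
    data = np.empty(10, dtype="d")
    comm.Recv([data, MPI.DOUBLE], source=0)  # receives directly into the preallocated buffer
```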
#### Syntax differences
**Lowercase (Python objects):**
```python
comm.send(obj, dest=1)
data = comm.recv(source=0)
```
- The message (obj) can be any Python object.
Suggested change:
- The message (`obj`) can be any Python object.
The most basic form of communication in MPI is **point-to-point**, meaning data is sent from one process directly to another.

Each message involves:
- A **sender** and a **receiver**
Suggested change:
- A **sender** (`source`) and a **receiver** (`dest`)
Of course you can avoid this since this terminology assumes point-to-point and not collectives.
These are the lectures for the Python for HPC course.