Merged
6 changes: 3 additions & 3 deletions about/checkpoint.md
Original file line number Diff line number Diff line change
@@ -8,11 +8,11 @@ nav_order: 4
# Checkpointing and Requeing Jobs
Have a really long job that you want to run? Here's how you do it:
1. Submit the job to the queue
-2. Run for almost the full max wall time
+2. Run for almost the full max wall time
3. Send a kill signal to your code using [`timeout`](https://manpages.org/timeout)
4. Your code saves a checkpoint
5. Requeue the job with [scontrol](https://slurm.schedmd.com/scontrol.html)
-6. Repeat 2-5 until your job finishes
+6. Repeat 2-5 times until your job finishes

```bash
#!/bin/bash
@@ -49,7 +49,7 @@ if [[ $? == 124 ]]; then
fi
```

-> Typically, a non-zero exit code in Linux means "something went wrong". Because we don't want to requeue a job that failed indefinetly, we need to be able to distighish between "Something went wrong" and "I need more time".
+> Typically, a non-zero exit code in Linux means "something went wrong". Because we don't want to requeue a job that failed indefinitely, we need to be able to distinguish between "Something went wrong" and "I need more time".
>
> Here we're checking if the exit code is 124 (`timeout` uses 124 to indicate the command timed out), but any non-zero exit code could work. Check your code's docs to see what's normal, what's an error, and how to throw a different signal
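
The exit-code check described in the note can be sketched on its own, outside Slurm. This is a minimal illustration, not the script from the changed file: `classify_exit` is a hypothetical helper name, and `sleep 5` stands in for the real workload. It relies only on the documented behavior that `timeout` exits with 124 when the command times out.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original script): map an exit code
# to the action the job script should take.
classify_exit() {
    local code=$1
    if [[ $code -eq 0 ]]; then
        echo "done"        # the program finished normally
    elif [[ $code -eq 124 ]]; then
        echo "requeue"     # timeout expired: the program needs more time
    else
        echo "failed"      # any other non-zero code is treated as a real error
    fi
}

# Stand-in workload: `sleep 5` cannot finish within 1 second,
# so `timeout` stops it and exits with 124.
code=0
timeout 1 sleep 5 || code=$?
action=$(classify_exit "$code")
echo "exit code $code -> $action"

# Inside a real Slurm job the requeue step would then be:
# [[ $action == requeue ]] && scontrol requeue "$SLURM_JOB_ID"
```

Treating only 124 as "requeue" keeps a crashing job from being resubmitted forever. If your program reports "out of time" with a different exit code, that value would replace 124 in the helper.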

2 changes: 1 addition & 1 deletion about/hardware.md
@@ -12,7 +12,7 @@ math: mathjax2
- Bigger Nodes: Higher Core counts, More Memory, TBs of NVMe scratch
- Faster GPUs: Between 2.5x and 36x faster
- More Storage: Up to 100TB of Archival Storage
-- No pre-built modules will need to use [spack](https://spack.io )
+- No pre-built modules will need to use [Spack](https://spack.io)
- 4 Tier Storage System: Node, Scratch, Turbo, and DataDen
- Short queues, limited wall times

8 changes: 4 additions & 4 deletions about/miscellaneous.md
@@ -10,9 +10,9 @@ nav_order: 5
## Getting Help
- The [Artemis slack channel](https://eeg-group.slack.com/archives/C070HCDCY9F)
- [UM CoderSpaces Slack](https://umich.enterprise.slack.com/archives/C02T1M5QNH3) ([join](https://documentation.its.umich.edu/node/352#JoinResign))
-- [UM Lighthouse User Guide](https://arc.umich.edu/lighthouse/user-guide/)
-- [UM Great Lakes User Guide](https://arc.umich.edu/greatlakes/user-guide/)
-- [UM Cheat Sheet](https://arc.umich.edu/wp-content/uploads/sites/4/2020/05/Great-Lakes-Cheat-Sheet.pdf)
+- [UM Lighthouse User Guide](https://documentation.its.umich.edu/arc-hpc/lighthouse/user-guide)
+- [UM Great Lakes User Guide](https://documentation.its.umich.edu/arc-hpc/greatlakes/user-guide)
+- [UM Great Lakes Cheat Sheet](https://docs.google.com/document/d/1wsr3yzkkojUMBCCneCz-l413xBzU-SZFAqcFrAAjttk/edit?usp=sharing)

## Tmux
Lighthouse and GreatLakes use multiple login nodes for load balancing/redundancy. To persist a session across login nodes, change where tmux creates its sockets:
@@ -40,7 +40,7 @@ If you're moving data between clusters, use [Globus](https://www.globus.org):
- It's way faster than scp/rclone/rsync
- On Arjuna use [Globus Connect Personal](https://www.globus.org/globus-connect-personal)

-[UM ARC Endpoints](https://arc.umich.edu/globus/#document-4) (don't go using some rando endpoint)
+[UM ARC Endpoints](https://coerc.engin.umich.edu/globus/) (don't go using some random endpoint)
- [DataDen](https://app.globus.org/file-manager?origin_id=ab65757f-00f5-4e5b-aa21-133187732a01)
- [Turbo](https://app.globus.org/file-manager?origin_id=8c185a84-5c61-4bbc-b12b-11430e20010f&origin_path=%2F)
- [/home on Lighthouse](https://app.globus.org/file-manager?origin_id=3242c149-a2b9-4dba-9406-ae3717981621)
2 changes: 1 addition & 1 deletion getting_started/jupyter_notebooks.md
@@ -17,7 +17,7 @@ You must run notebooks on the worker nodes, as described, in this tutorial.
For using Jupyter Notebooks you will need to have:

1. Visual Studio Code installed on your local machine with [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python), [Jupyter](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) and [Remote SSH](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) extensions enabled.
-2. Installed Jupyter notebook on Artemis (i.e. via [conda](https://docs.conda.io/en/latest/) or [spack](https://spack.readthedocs.io/en/latest/))
+2. Installed Jupyter notebook on Artemis (i.e. via [uv](https://docs.astral.sh/uv/) or [spack](https://spack.readthedocs.io/en/latest/))

### Instructions
1. Allocate an interactive worker node with the resources you need, for example: