
Update "Install Slurm" documentation to leverage cloud-init#82

Open
lunamorrow wants to merge 28 commits into OpenCHAMI:main from lunamorrow:lunamorrow/cloud-init-compute-node-slurm-config

Conversation

@lunamorrow
Contributor

@lunamorrow lunamorrow commented Mar 16, 2026

Pull Request Template

Thank you for your contribution! Please ensure the following before submitting:

Checklist

  • My code follows the style guidelines of this project
  • I have added/updated comments where needed
  • I have added tests that prove my fix is effective or my feature works
  • I have run make test (or equivalent) locally and all tests pass
  • DCO Sign-off: All commits are signed off (git commit -s) with my real name and email
  • REUSE Compliance:
    • Each new/modified source file has SPDX copyright and license headers
    • Any non-commentable files include a <filename>.license sidecar
    • All referenced licenses are present in the LICENSES/ directory

Description

Updating/extending the "Install Slurm" documentation guide to leverage OpenCHAMI's cloud-init to make compute node configuration persistent across nodes and on reboot. See discussion/comments on PR #72.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

For more info, see Contributing Guidelines.

…n will need some further updates to align better with the Tutorial (e.g. changing IP addresses, adjusting comments to support bare-metal and cloud setups, etc.) and to ensure the documented approach is sufficiently broad for general purpose.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… for creating some files from cat to copy-paste to prevent issues with bash command/variable processing

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… - this should make this guide easy to follow on with after the tutorial

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Next step will be expanding comments/explanations to provide more context to users, as well as providing more code blocks to show expected output of commands that produce output.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…id. Changes include making it clearer when the pwgen password is used, correcting the file creation step for slurm.conf to prevent errors, removing instructions for aliasing the build command (and instead redirecting to the appropriate tutorial section), updating instructions in line with a recent PR to replace MinIO with Versity S3, and some minor typo fixes

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ck from David.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Some reviews are still pending as I figure out the source of the problem and a solution, and I will address these in a later commit.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… to VM head nodes.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…certain commands should behave and/or the output they should produce.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ecurity vulnerabilities with versions 0.5-0.5.17

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ompute node. Additionally made some tweaks to the documentation to make the workflow more robust after repeating it on a fresh node.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…in a few places

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…erence to the 'Install Slurm' guide

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…t and the image config to reduce the number of commands needing to be run on the compute node. We are waiting on feedback from David and Alex before potentially implementing a more persistent Slurm configuration on the compute node/s.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…evon

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… in the working directory '/opt/workdir' (as desired) and not the user's home directory

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…r' in the slurm-local.repo file

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…f slurm RPMs in '/opt/workdir'

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ommand

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… explanation that the SlurmctldHost must be 'head' instead of 'demo' when the head node is a VM

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…rrow/cloud-init-compute-node-slurm-config

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…t so that compute node Slurm configuration is persistent across nodes and on reboot

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Contributor Author

I have made some changes to the documentation to use cloud-init instead of manually configuring the compute node. This process also sets up NFS to mount shared files (e.g. Slurm configuration files) used by both the compute node and head node. The current commit only adds a basic compute node configuration (similar to what was already there, only now driven by cloud-init), but I am able to push up a more complex configuration that sets up LDAP and gives the compute node more memory for a more "realistic" Slurm setup. That way, anyone who follows the guide will finish with a more production-ready Slurm configuration. Let me know what you think @synackd @davidallendj @alexlovelltroy
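For readers skimming this thread, the kind of cloud-init user-data being described might look roughly like the sketch below. This is a hedged illustration only: the NFS server address, export path, and mount point are placeholders, not values from this PR (the actual config lives in ci-group-compute.yaml).

```yaml
#cloud-config
# Illustrative sketch only -- the server IP, export path, and mount point
# are placeholders, not the values used in this PR's ci-group-compute.yaml.
packages:
  - nfs-utils

# cloud-init's "mounts" module takes fstab-style entries:
# [ fs_spec, fs_file, fs_vfstype, fs_mntops, fs_freq, fs_passno ]
mounts:
  - [ "172.16.0.254:/opt/slurm-shared", "/etc/slurm", "nfs", "defaults,_netdev", "0", "0" ]
```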

The merge I performed on this branch pulled in quite a lot of old commits, which have clogged up this PR a bit; sorry about that!

…hown' command

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Contributor Author

lunamorrow commented Mar 24, 2026

As an aside, has someone updated the documentation formatting? All of the in-line code and code block headings are black in the Tutorial, which makes it impossible to read some of the documentation. It still appears the same as usual when I render it locally, but it has changed on https://openchami.org/docs/tutorial/

Locally rendered:
Screenshot from 2026-03-24 11-58-18

From OpenCHAMI website:
Screenshot from 2026-03-24 11-58-28

…nly fixing the name of the ACCESS and SECRET tokens for S3 and making a comment into a note to improve visibility

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@synackd
Contributor

synackd commented Mar 24, 2026

I wonder if the rendering issues were caused by the updates in #88. @alexlovelltroy?

@synackd
Contributor

synackd commented Mar 24, 2026

It might more likely be #81.

@davidallendj
Contributor

Just a small nit-pick: line 1303 says `"short-name": "nid"` but it should be `"short-name": "de"` here.

…tput of ci-defaults

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ode, to ensure that slurmdbd is up before slurmctld restarts

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…e munge.key between head node and compute node

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…compute node instead of short hostname

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
Contributor

@davidallendj davidallendj left a comment


I tested this pretty extensively with a fresh JetStream 2 instance from start to finish, so I feel confident that the cloud-init additions work as expected, and I'm going to go ahead and approve.

Contributor

@synackd synackd left a comment


Thank you, @lunamorrow ! This is great.

I've asked for a few small changes based on my runthrough.

Also, this patch edits the cloud-init group for the compute SMD group. Idiomatic OpenCHAMI practice would warrant creating a separate slurm SMD group and setting the cloud-init group config for that group. Since this PR works, I'm inclined to keep the change and edit that later, but I will leave it up to you.

Also, long term, it might be good to refactor this to support newer Slurm/Munge versions so it's not a big task to update these docs, but that is a task for a different time.

Contributor


(Commenting on file since this isn't in the diff)

When running this, it took a few minutes without any output and I anticipate readers suspecting something might be hanging. Can we add a note in the callout here stating that it could take a few minutes to complete without output?

Contributor


I agree and I thought something went wrong the first time I ran it.

Comment on lines +1128 to +1132
```yaml
- bind-utils
- openldap-clients
- sssd
- sssd-ldap
- oddjob-mkhomedir
```
Contributor


This is more of a nitpick, but I think it applies when having a large list of packages. Can we make this list alphabetical? That way it's straightforward to know where to add new packages, and it's easy to visually check whether a certain package is present.
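As an illustration of this suggestion (not from the PR itself), the five packages in the hunk above sort like this:

```shell
# Alphabetize the package names from the diff hunk above (illustration only):
printf '%s\n' bind-utils openldap-clients sssd sssd-ldap oddjob-mkhomedir | sort
# bind-utils
# oddjob-mkhomedir
# openldap-clients
# sssd
# sssd-ldap
```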

The output should be:

```
1615M s3://boot-images/compute/slurm/rocky9.7-compute-slurm-rocky9
```
Contributor


These are the versions displayed when I ran through this:

```
1660M  s3://boot-images/compute/slurm/rocky9.7-compute-slurm-rocky9
  85M  s3://boot-images/efi-images/compute/slurm/initramfs-5.14.0-611.36.1.el9_7.x86_64.img
  14M  s3://boot-images/efi-images/compute/slurm/vmlinuz-5.14.0-611.36.1.el9_7.x86_64
```
Contributor


This will also need to be updated in the BSS output below.

Now, set this configuration for the compute group:

```bash
ochami cloud-init group set -f yaml -d @/etc/openchami/data/cloud-init/ci-group-compute.yaml
```
Contributor


This works, but ideally we have a separate slurm group in SMD and set the cloud-init config for this group. That way we keep the compute group's config separate for general-purpose things. I think that, since this works, I'm inclined to accept this change and save this improvement for a different PR.
