Compute node hostname missing cluster prefix (install.py skips prefix for execute mode) #490

@SubaruArai

Description

CCWS compute (execute) nodes get hostname gpu-test-1 but the Slurm autoscaler generates ccw-gpu-test-1 (with cluster prefix) in azure.conf. slurmd crash-loops because the names don't match. Login nodes are not affected.

Root cause

In /opt/cycle/jetpack/system/bootstrap/azure-slurm-install/install.py, line ~1028:

def set_hostname(s: InstallSettings) -> None:
    new_hostname = s.node_name.lower()
    if s.mode != "execute" and not new_hostname.startswith(s.node_name_prefix):
        new_hostname = f"{s.node_name_prefix}{new_hostname}"

Execute nodes skip the prefix prepend (if s.mode != "execute"). But the autoscaler always generates names WITH the prefix (ccw-gpu-test-1).

Similarly, CCWS 01-rename_host.sh only prepends the cluster prefix for login nodes (if is_login), not compute nodes.

The result: login → ccw-login-1 (correct), compute → gpu-test-1 (wrong).
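The mismatch can be reproduced in isolation. The following is a minimal sketch, not the real install.py: `Settings` is a hypothetical stand-in for `InstallSettings` carrying only the three fields the quoted snippet reads, `resolve_hostname` is a hypothetical name, and the prefix value `ccw-` and the node name `login-1` are assumed from the hostnames reported above.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical stand-in for InstallSettings (only the fields used above)."""
    node_name: str
    node_name_prefix: str
    mode: str


def resolve_hostname(s: Settings) -> str:
    # Same condition as the quoted set_hostname(): the prefix is only
    # prepended when the node is NOT in execute mode.
    new_hostname = s.node_name.lower()
    if s.mode != "execute" and not new_hostname.startswith(s.node_name_prefix):
        new_hostname = f"{s.node_name_prefix}{new_hostname}"
    return new_hostname


# Assumed values: prefix "ccw-" and node names taken from the report.
login = Settings(node_name="login-1", node_name_prefix="ccw-", mode="login")
compute = Settings(node_name="gpu-test-1", node_name_prefix="ccw-", mode="execute")

print(resolve_hostname(login))    # ccw-login-1
print(resolve_hostname(compute))  # gpu-test-1 (no prefix, unlike azure.conf)
```

With these inputs the execute node keeps the bare name, which is the name slurmd then fails to find in azure.conf.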

Steps to reproduce

  1. Deploy CCWS with NodeNameIsHostname=true and NodeNamePrefix="Cluster Prefix"
  2. Add a compute node via the autoscaler (submit a job)
  3. Check hostname on the compute node: hostname → gpu-test-1
  4. Check Slurm expects: grep gpu /etc/slurm/azure.conf → ccw-gpu-test-1
  5. slurmd crash-loops with "lookup failure for node"

Environment

  • CycleCloud 8.8.3-3667
  • Slurm 23.11.7
  • Ubuntu 22.04

Workaround

A cluster-init script that extends the 01-rename_host.sh pattern to compute nodes: it checks whether cyclecloud.node.name already has the prefix (autoscaler-provisioned) and prepends it if not (manually added).
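The decision the workaround makes can be sketched as a small idempotent helper. This is an illustration of the logic only, not the actual cluster-init script: `with_cluster_prefix` is a hypothetical name, the prefix `ccw-` is assumed from the hostnames above, and actually applying the result to the node is left out.

```python
def with_cluster_prefix(node_name: str, prefix: str) -> str:
    """Return node_name with prefix prepended unless it is already present."""
    # Lowercase both sides so the comparison matches the lowercased
    # hostname that set_hostname() produces (assumption: the prefix is
    # meant to be compared case-insensitively).
    name = node_name.lower()
    prefix = prefix.lower()
    if name.startswith(prefix):
        # Autoscaler-provisioned node: the name already carries the prefix.
        return name
    # Manually added node: prepend the prefix so it matches azure.conf.
    return f"{prefix}{name}"


print(with_cluster_prefix("gpu-test-1", "ccw-"))      # ccw-gpu-test-1
print(with_cluster_prefix("ccw-gpu-test-1", "ccw-"))  # ccw-gpu-test-1
```

Because the helper leaves an already-prefixed name unchanged, running it on every boot would not stack prefixes.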
