fix: only check modified queues during cluster readiness validation#3097

Open

almightychang wants to merge 1 commit into aws:develop from almightychang:fix/queue-add-readiness-check
Conversation

@almightychang

Summary

Fixes HeadNodeWaitCondition timeout when adding new SLURM queues to clusters with running jobs.

When adding a new queue via pcluster update-cluster, the HeadNode readiness check currently validates cluster_config_version for all compute nodes, including those in unmodified queues. This causes a timeout when existing nodes have running jobs, since those nodes retain the old config version.

This PR narrows the check to validate only the nodes in modified queues, determined by reading change-set.json.

Changes

  • check_cluster_ready.py: Add _read_change_set() to parse modified queues and filter readiness checks
  • ec2_utils.py: Add queue_names parameter to list_cluster_instance_ids_iterator()
  • constants.py: Add QUEUE_NAME_TAG constant
  • test_check_cluster_ready.py: Add 6 new test cases covering queue filtering scenarios
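For illustration, the queue_names filtering added to ec2_utils.py could behave like the pure-Python sketch below. The tag key value and the shape of the instance descriptions (modeled on an EC2 DescribeInstances response) are assumptions; the PR only states that a QUEUE_NAME_TAG constant is added to constants.py.

```python
# Hypothetical sketch of tag-based queue filtering; the tag key value
# below is an assumption, not taken from the PR.
QUEUE_NAME_TAG = "parallelcluster:queue-name"


def filter_instance_ids_by_queue(instances, queue_names):
    """Keep the IDs of instances whose queue tag is in queue_names.

    `instances` is assumed to be a list of dicts shaped like entries of
    an EC2 DescribeInstances response, with Tags as Key/Value pairs.
    """
    wanted = set(queue_names)
    ids = []
    for instance in instances:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        if tags.get(QUEUE_NAME_TAG) in wanted:
            ids.append(instance["InstanceId"])
    return ids
```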

Implementation Details

Queue Filtering Logic

  1. Read change-set.json (already generated by ParallelCluster)
  2. Extract modified queue names via regex: Scheduling.SlurmQueues[queue_name].*
  3. Check only compute nodes in modified queues + LoginNodes
  4. Fallback: If change-set unavailable/invalid → check all nodes (backward compatible)

Edge Cases Handled

  • Missing change-set.json → fallback to all nodes
  • Malformed JSON → log warning, fallback
  • No queue changes → check all nodes
  • LoginNodes always checked separately (no queue tags)
  • New queue with no nodes yet → empty result, skip
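Taken together, these fallback rules amount to a small decision: which queues should the readiness check cover? A sketch, with illustrative names that are not the PR's actual code:

```python
def queues_to_check(modified_queues, all_queue_names):
    """Return the queue names the readiness check should cover.

    `modified_queues` is the parsed result of change-set.json: None when
    the file is missing or malformed, otherwise a (possibly empty) set of
    names. LoginNodes are checked separately and are not decided here.
    """
    # Missing or malformed change set: check every queue (safe fallback).
    if modified_queues is None:
        return set(all_queue_names)
    # A change set with no queue-level changes: also check every queue.
    if not modified_queues:
        return set(all_queue_names)
    # Otherwise only the modified queues; a newly added queue may have no
    # running nodes yet, which simply yields an empty instance list.
    return set(modified_queues)
```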

Testing

All 12 test cases pass (6 existing + 6 new):

  • ✅ Queue filtering for single modified queue
  • ✅ Adding new queue with no nodes
  • ✅ Fallback when change-set missing
  • ✅ Fallback when change-set malformed
  • ✅ LoginNode always checked separately
  • ✅ Multiple queues modified simultaneously

Backward Compatibility

Fully backward compatible:

  • No CLI interface changes
  • No new dependencies
  • Graceful fallback if change-set unavailable
  • Uses existing ParallelCluster infrastructure

Related Issue

Fixes aws/aws-parallelcluster#7203

Checklist

  • Tests added and passing (12/12)
  • Backward compatibility maintained
  • Logging enhanced (queue filtering details)
  • Regex follows AWS queue naming convention: ^[a-z][a-z0-9-]*$
  • Code follows existing patterns (Ruby cookbook already uses change-set.json)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

When adding a new SLURM queue, readiness check now only validates
nodes in modified queues by reading change-set.json. Falls back to
checking all nodes if change-set unavailable (backward compatible).

Fixes: aws/aws-parallelcluster#7203

Changes:
- check_cluster_ready.py: Add change-set parsing and queue filtering
- ec2_utils.py: Add queue_names filtering parameter
- constants.py: Add QUEUE_NAME_TAG constant
- test_check_cluster_ready.py: Add 6 new test cases

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@hanwen-cluster
Contributor

Hi almightychang,

Thank you for this PR. We are still reviewing it and thinking about a way to reuse the existing get_queues_with_changes function.

We will keep you posted.

Thank you,
Hanwen

Development

Successfully merging this pull request may close these issues.

[Bug] HeadNode checks all compute nodes when adding queue, causing timeout with running jobs