Skip to content

Make slurm IDM tools compatible with NYU's Cluster#2670

Merged
ZDu-IDM merged 5 commits intoInstituteforDiseaseModeling:mainfrom
kaftand:slurm_srun_compat
Apr 13, 2026
Merged

Make slurm IDM tools compatible with NYU's Cluster#2670
ZDu-IDM merged 5 commits intoInstituteforDiseaseModeling:mainfrom
kaftand:slurm_srun_compat

Conversation

@kaftand
Copy link
Copy Markdown

@kaftand kaftand commented Feb 20, 2026

Our (NYU) HPC policy does not allow you to run python on the head node. Therefore, we commonly will pull down a node into an interactive terminal using srun. The default behavior in slurm is to inherit environment variables from parent environments. This becomes a problem when we create a job via emodpy. There are variables that should not be inherited, and we get a weird cpu bind error (see bottom, A).

I added an option in slurm_platform.py and the batch.sh.jinja2 file to unset the parent env variables. The default behavior is unchanged.

Additionally, I was unable to build idmtools_platform_slurm due to a dependency issue. There was a dependency for "idmtools_platform_general~=3.0.0", which I was unable to build due to a idmtools_cli~=3.00 dependency. I went out on a limb and saw the comps platform required idmtools~=3.0.0, and updated the dependencies in slurm to reflect that. I am not super confident in my change, but it worked. I would love some insight into whether or not my change was kosher.

Error message from slurm issue:
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008000000.
srun: error: Task launch for StepId=18072825.0 failed on node cl40s-8001: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

Copilot AI review requested due to automatic review settings February 20, 2026 21:45
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request aims to make the SLURM platform compatible with NYU's HPC cluster by addressing environment variable inheritance issues that occur when idmtools is run from an interactive SLURM session. When running from an allocated partition using srun, child jobs were inheriting parent SLURM environment variables causing CPU bind errors. The PR also includes a dependency update and a minor dev script improvement.

Changes:

  • Adds a new configuration option propogate_slurm_env_var (default True) to control whether parent SLURM environment variables are propagated to child job scripts
  • Updates the batch.sh.jinja2 template to unset key SLURM environment variables when propagation is disabled
  • Changes dependency from idmtools_platform_general~=3.0.0 to idmtools~=3.0.0 in pyproject.toml
  • Adds comprehensive tests for the new configuration option
  • Improves path quoting in dev_scripts/run_pymake_on_all.py

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
idmtools_platform_slurm/pyproject.toml Updates dependency specification (contains critical error)
idmtools_platform_slurm/idmtools_platform_slurm/slurm_platform.py Adds propogate_slurm_env_var field with documentation
idmtools_platform_slurm/idmtools_platform_slurm/assets/batch.sh.jinja2 Conditionally unsets SLURM environment variables in batch submission script
idmtools_platform_slurm/idmtools_platform_slurm/assets/init.py Passes propogate_slurm_env_var to batch template
idmtools_platform_slurm/tests/test_slurm_operations.py Adds tests for the new configuration option
dev_scripts/run_pymake_on_all.py Adds quotes around paths to handle spaces

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +240 to +247
"propogate_slurm_env_var = False"
])
self.assertIn("unset SLURM_JOB_ID", contents)
self.assertIn("unset SLURM_MEM_PER_NODE", contents)

def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self):
scenarios = {
"explicit_true": ["propogate_slurm_env_var = True"],
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "propogate" should be "propagate". The configuration key in the test should use the correct spelling to match the corrected field name.

Suggested change
"propogate_slurm_env_var = False"
])
self.assertIn("unset SLURM_JOB_ID", contents)
self.assertIn("unset SLURM_MEM_PER_NODE", contents)
def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self):
scenarios = {
"explicit_true": ["propogate_slurm_env_var = True"],
"propagate_slurm_env_var = False"
])
self.assertIn("unset SLURM_JOB_ID", contents)
self.assertIn("unset SLURM_MEM_PER_NODE", contents)
def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self):
scenarios = {
"explicit_true": ["propagate_slurm_env_var = True"],

Copilot uses AI. Check for mistakes.
Comment thread idmtools_platform_slurm/pyproject.toml Outdated
]
dependencies = [
"idmtools_platform_general~=3.0.0",
"idmtools~=3.0.0",
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency change is incorrect. The SlurmPlatform class inherits from FilePlatform which is part of idmtools_platform_general package (located in idmtools_platform_general/idmtools_platform_file/file_platform.py). Removing the dependency on idmtools_platform_general will cause import errors when trying to use SlurmPlatform.

The original dependency on "idmtools_platform_general~=3.0.0" should be kept. If there were build issues with idmtools_cli~=3.0.0, those should be resolved separately, but this dependency change breaks the platform functionality.

Suggested change
"idmtools~=3.0.0",
"idmtools~=3.0.0",
"idmtools_platform_general~=3.0.0",

Copilot uses AI. Check for mistakes.
Comment on lines +102 to +103
propogate_slurm_env_var: bool = field(default=True, metadata=dict(sbatch=False,
help="Keep SLURM env vars available to child scripts"))
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "propogate" should be "propagate". This appears in the field name, comments, and throughout the codebase. The field should be renamed to "propagate_slurm_env_var" for correct spelling. This affects the configuration file parsing as well, so users would need to use the correct spelling in their configuration files.

Suggested change
propogate_slurm_env_var: bool = field(default=True, metadata=dict(sbatch=False,
help="Keep SLURM env vars available to child scripts"))
propagate_slurm_env_var: bool = field(default=True, metadata=dict(sbatch=False,
help="Keep SLURM env vars available to child scripts"))

Copilot uses AI. Check for mistakes.
@@ -1,5 +1,18 @@
#!/bin/bash

{% if not propogate_slurm_env_var %}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "propogate" should be "propagate". This template variable name should match the corrected field name from slurm_platform.py.

Suggested change
{% if not propogate_slurm_env_var %}
{% if not propagate_slurm_env_var %}

Copilot uses AI. Check for mistakes.
template_vars['dependency'] = dependency

# Control propagation of inherited SLURM env vars into downstream scripts
template_vars['propogate_slurm_env_var'] = getattr(platform, 'propogate_slurm_env_var', True)
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "propogate" should be "propagate". This template variable assignment should use the correct spelling to match the corrected field name.

Suggested change
template_vars['propogate_slurm_env_var'] = getattr(platform, 'propogate_slurm_env_var', True)
template_vars['propagate_slurm_env_var'] = getattr(platform, 'propagate_slurm_env_var', True)

Copilot uses AI. Check for mistakes.
Comment on lines +238 to +247
def test_simtools_ini_propogate_slurm_env_var_updates_batch_script(self):
contents = self._generate_batch_from_simtools_config([
"propogate_slurm_env_var = False"
])
self.assertIn("unset SLURM_JOB_ID", contents)
self.assertIn("unset SLURM_MEM_PER_NODE", contents)

def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self):
scenarios = {
"explicit_true": ["propogate_slurm_env_var = True"],
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "propogate" should be "propagate". The configuration key in the test should use the correct spelling to match the corrected field name.

Suggested change
def test_simtools_ini_propogate_slurm_env_var_updates_batch_script(self):
contents = self._generate_batch_from_simtools_config([
"propogate_slurm_env_var = False"
])
self.assertIn("unset SLURM_JOB_ID", contents)
self.assertIn("unset SLURM_MEM_PER_NODE", contents)
def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self):
scenarios = {
"explicit_true": ["propogate_slurm_env_var = True"],
def test_simtools_ini_propagate_slurm_env_var_updates_batch_script(self):
contents = self._generate_batch_from_simtools_config([
"propagate_slurm_env_var = False"
])
self.assertIn("unset SLURM_JOB_ID", contents)
self.assertIn("unset SLURM_MEM_PER_NODE", contents)
def test_simtools_ini_propagate_true_or_default_keeps_slurm_env_vars(self):
scenarios = {
"explicit_true": ["propagate_slurm_env_var = True"],

Copilot uses AI. Check for mistakes.
@shchen-idmod
Copy link
Copy Markdown
Collaborator

shchen-idmod commented Feb 20, 2026

@kaftand Thanks for submitting this issue. I can build packages very quick for you with your changes. and publish to pypi. You can download from there and test it. If it works, we can officially put in our release. I will let you know once I have packages.

Update: please install the latest version (3.0.5) with your changes:
pip install idmtools[slurm] --upgrade --no-cache-dir --force-reinstall

BTW, we are in middle of moving everything from jfrog to pypi. So ignore version and merge branch thing for now.
This version contains your code from https://github.com/InstituteforDiseaseModeling/idmtools/tree/slurm_srun_compat
Let me know if this is working for you.
Thanks again!

@kaftand
Copy link
Copy Markdown
Author

kaftand commented Mar 2, 2026

Hi @shchen-idmod, I'm Currently waiting on EMOD-Hub/emodpy-hiv#38 to go through. There are a few dependency conflict, where emodpy-workflow requires emodpy-hiv 2.0, which requires emodpy 2.x, which requires idmtools 2.x. Quite the chain of dependencies! Once that gets merged, I think the path is clear to make my changes to emodpy-workflow (upgrade to use emodpy-hiv 3.0). I'm not satisfied with my ability to test the idmtools behavior until I can use it with emodpy-workflow.

@shchen-idmod
Copy link
Copy Markdown
Collaborator

shchen-idmod commented Mar 4, 2026

Hi @kaftand I have merged your fix and update some of your tests in 'pypi' branch (easier to build package with GHA). Published version is 3.0.6. Please test it whenever you have chance. And let us know if there is any issue. Thanks.

@kaftand
Copy link
Copy Markdown
Author

kaftand commented Mar 4, 2026

Thanks @YeChen-IDM , I have tested and it works on NYU's cluster. Thanks for your help!

@ZDu-IDM ZDu-IDM merged commit cf1d69c into InstituteforDiseaseModeling:main Apr 13, 2026
1 check was pending
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants