Make slurm IDM tools compatible with NYU's Cluster#2670
Make slurm IDM tools compatible with NYU's Cluster#2670ZDu-IDM merged 5 commits intoInstituteforDiseaseModeling:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request aims to make the SLURM platform compatible with NYU's HPC cluster by addressing environment variable inheritance issues that occur when idmtools is run from an interactive SLURM session. When running from an allocated partition using srun, child jobs were inheriting parent SLURM environment variables causing CPU bind errors. The PR also includes a dependency update and a minor dev script improvement.
Changes:
- Adds a new configuration option
propogate_slurm_env_var(default True) to control whether parent SLURM environment variables are propagated to child job scripts - Updates the batch.sh.jinja2 template to unset key SLURM environment variables when propagation is disabled
- Changes dependency from
idmtools_platform_general~=3.0.0toidmtools~=3.0.0in pyproject.toml - Adds comprehensive tests for the new configuration option
- Improves path quoting in dev_scripts/run_pymake_on_all.py
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| idmtools_platform_slurm/pyproject.toml | Updates dependency specification (contains critical error) |
| idmtools_platform_slurm/idmtools_platform_slurm/slurm_platform.py | Adds propogate_slurm_env_var field with documentation |
| idmtools_platform_slurm/idmtools_platform_slurm/assets/batch.sh.jinja2 | Conditionally unsets SLURM environment variables in batch submission script |
| idmtools_platform_slurm/idmtools_platform_slurm/assets/init.py | Passes propogate_slurm_env_var to batch template |
| idmtools_platform_slurm/tests/test_slurm_operations.py | Adds tests for the new configuration option |
| dev_scripts/run_pymake_on_all.py | Adds quotes around paths to handle spaces |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "propogate_slurm_env_var = False" | ||
| ]) | ||
| self.assertIn("unset SLURM_JOB_ID", contents) | ||
| self.assertIn("unset SLURM_MEM_PER_NODE", contents) | ||
|
|
||
| def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self): | ||
| scenarios = { | ||
| "explicit_true": ["propogate_slurm_env_var = True"], |
There was a problem hiding this comment.
Spelling error: "propogate" should be "propagate". The configuration key in the test should use the correct spelling to match the corrected field name.
| "propogate_slurm_env_var = False" | |
| ]) | |
| self.assertIn("unset SLURM_JOB_ID", contents) | |
| self.assertIn("unset SLURM_MEM_PER_NODE", contents) | |
| def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self): | |
| scenarios = { | |
| "explicit_true": ["propogate_slurm_env_var = True"], | |
| "propagate_slurm_env_var = False" | |
| ]) | |
| self.assertIn("unset SLURM_JOB_ID", contents) | |
| self.assertIn("unset SLURM_MEM_PER_NODE", contents) | |
| def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self): | |
| scenarios = { | |
| "explicit_true": ["propagate_slurm_env_var = True"], |
| ] | ||
| dependencies = [ | ||
| "idmtools_platform_general~=3.0.0", | ||
| "idmtools~=3.0.0", |
There was a problem hiding this comment.
This dependency change is incorrect. The SlurmPlatform class inherits from FilePlatform which is part of idmtools_platform_general package (located in idmtools_platform_general/idmtools_platform_file/file_platform.py). Removing the dependency on idmtools_platform_general will cause import errors when trying to use SlurmPlatform.
The original dependency on "idmtools_platform_general~=3.0.0" should be kept. If there were build issues with idmtools_cli~=3.0.0, those should be resolved separately, but this dependency change breaks the platform functionality.
| "idmtools~=3.0.0", | |
| "idmtools~=3.0.0", | |
| "idmtools_platform_general~=3.0.0", |
| propogate_slurm_env_var: bool = field(default=True, metadata=dict(sbatch=False, | ||
| help="Keep SLURM env vars available to child scripts")) |
There was a problem hiding this comment.
Spelling error: "propogate" should be "propagate". This appears in the field name, comments, and throughout the codebase. The field should be renamed to "propagate_slurm_env_var" for correct spelling. This affects the configuration file parsing as well, so users would need to use the correct spelling in their configuration files.
| propogate_slurm_env_var: bool = field(default=True, metadata=dict(sbatch=False, | |
| help="Keep SLURM env vars available to child scripts")) | |
| propagate_slurm_env_var: bool = field(default=True, metadata=dict(sbatch=False, | |
| help="Keep SLURM env vars available to child scripts")) |
| @@ -1,5 +1,18 @@ | |||
| #!/bin/bash | |||
|
|
|||
| {% if not propogate_slurm_env_var %} | |||
There was a problem hiding this comment.
Spelling error: "propogate" should be "propagate". This template variable name should match the corrected field name from slurm_platform.py.
| {% if not propogate_slurm_env_var %} | |
| {% if not propagate_slurm_env_var %} |
| template_vars['dependency'] = dependency | ||
|
|
||
| # Control propagation of inherited SLURM env vars into downstream scripts | ||
| template_vars['propogate_slurm_env_var'] = getattr(platform, 'propogate_slurm_env_var', True) |
There was a problem hiding this comment.
Spelling error: "propogate" should be "propagate". This template variable assignment should use the correct spelling to match the corrected field name.
| template_vars['propogate_slurm_env_var'] = getattr(platform, 'propogate_slurm_env_var', True) | |
| template_vars['propagate_slurm_env_var'] = getattr(platform, 'propagate_slurm_env_var', True) |
| def test_simtools_ini_propogate_slurm_env_var_updates_batch_script(self): | ||
| contents = self._generate_batch_from_simtools_config([ | ||
| "propogate_slurm_env_var = False" | ||
| ]) | ||
| self.assertIn("unset SLURM_JOB_ID", contents) | ||
| self.assertIn("unset SLURM_MEM_PER_NODE", contents) | ||
|
|
||
| def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self): | ||
| scenarios = { | ||
| "explicit_true": ["propogate_slurm_env_var = True"], |
There was a problem hiding this comment.
Spelling error: "propogate" should be "propagate". The configuration key in the test should use the correct spelling to match the corrected field name.
| def test_simtools_ini_propogate_slurm_env_var_updates_batch_script(self): | |
| contents = self._generate_batch_from_simtools_config([ | |
| "propogate_slurm_env_var = False" | |
| ]) | |
| self.assertIn("unset SLURM_JOB_ID", contents) | |
| self.assertIn("unset SLURM_MEM_PER_NODE", contents) | |
| def test_simtools_ini_propogate_true_or_default_keeps_slurm_env_vars(self): | |
| scenarios = { | |
| "explicit_true": ["propogate_slurm_env_var = True"], | |
| def test_simtools_ini_propagate_slurm_env_var_updates_batch_script(self): | |
| contents = self._generate_batch_from_simtools_config([ | |
| "propagate_slurm_env_var = False" | |
| ]) | |
| self.assertIn("unset SLURM_JOB_ID", contents) | |
| self.assertIn("unset SLURM_MEM_PER_NODE", contents) | |
| def test_simtools_ini_propagate_true_or_default_keeps_slurm_env_vars(self): | |
| scenarios = { | |
| "explicit_true": ["propagate_slurm_env_var = True"], |
|
@kaftand Thanks for submitting this issue. I can build packages very quick for you with your changes. and publish to pypi. You can download from there and test it. If it works, we can officially put in our release. I will let you know once I have packages. Update: please install the latest version (3.0.5) with your changes: BTW, we are in middle of moving everything from jfrog to pypi. So ignore version and merge branch thing for now. |
|
Hi @shchen-idmod, I'm Currently waiting on EMOD-Hub/emodpy-hiv#38 to go through. There are a few dependency conflict, where emodpy-workflow requires emodpy-hiv 2.0, which requires emodpy 2.x, which requires idmtools 2.x. Quite the chain of dependencies! Once that gets merged, I think the path is clear to make my changes to emodpy-workflow (upgrade to use emodpy-hiv 3.0). I'm not satisfied with my ability to test the idmtools behavior until I can use it with emodpy-workflow. |
|
Hi @kaftand I have merged your fix and update some of your tests in 'pypi' branch (easier to build package with GHA). Published version is 3.0.6. Please test it whenever you have chance. And let us know if there is any issue. Thanks. |
|
Thanks @YeChen-IDM , I have tested and it works on NYU's cluster. Thanks for your help! |
cf1d69c
into
InstituteforDiseaseModeling:main
Our (NYU) HPC policy does not allow you to run python on the head node. Therefore, we commonly will pull down a node into an interactive terminal using srun. The default behavior in slurm is to inherit environment variables from parent environments. This becomes a problem when we create a job via emodpy. There are variables that should not be inherited, and we get a weird cpu bind error (see bottom, A).
I added an option in slurm_platform.py and the batch.sh.jinja2 file to unset the parent env variables. The default behavior is unchanged.
Additionally, I was unable to build idmtools_platform_slurm due to a dependency issue. There was a dependency for "idmtools_platform_general~=3.0.0", which I was unable to build due to a idmtools_cli~=3.00 dependency. I went out on a limb and saw the comps platform required idmtools~=3.0.0, and updated the dependencies in slurm to reflect that. I am not super confident in my change, but it worked. I would love some insight into whether or not my change was kosher.
Error message from slurm issue:
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008000000.
srun: error: Task launch for StepId=18072825.0 failed on node cl40s-8001: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted