Description
Moving issue spinalcordtoolbox/template-dog#18 to this repository, since it is more relevant here.
The generate_template script is dependent on, as described here:
We have been using commit cadc7219e79d6edb90742e1e340f8eee76332006 of nist_mni_pipelines, which used the scoop package for parallelization. Newer versions of nist_mni_pipelines (I'm using commit 608acff75601bf80f79334abc0434bbc0734af0d) use the ray package instead. When I install ray with pip install ray and run the jobs, they crash with the following error:
error stack
[2024-04-04 07:13:48,381] launcher INFO SCOOP 0.7 2.0 on linux using Python 3.8.10 (default, Jun 16 2021, 14:19:02) [GCC 9.3.0], API: 1013
[2024-04-04 07:13:48,382] launcher INFO Detected SLURM environment.
[2024-04-04 07:13:48,382] launcher INFO Deploying 1 worker(s) over 1 host(s).
[2024-04-04 07:13:48,382] launcher DEBUG Using hostname/ip: "bc11259" as external broker reference.
[2024-04-04 07:13:48,382] launcher DEBUG The python executable to execute the program with is: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python.
[2024-04-04 07:13:48,382] launcher INFO Worker distribution:
[2024-04-04 07:13:48,382] launcher INFO bc11259: 0 + origin
[2024-04-04 07:13:48,816] brokerLaunch (127.0.0.1:36071) DEBUG Local broker launched on ports 36071, 33491.
[2024-04-04 07:13:48,816] launcher (127.0.0.1:36071) DEBUG Initialising local origin worker 1 [bc11259].
[2024-04-04 07:13:48,816] launcher (127.0.0.1:36071) DEBUG bc11259: Launching 'env PYTHONPATH=/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_
mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python -m scoop.launch.__main__ 1 3 --size 1 --workingDirectory /lustre04/scratch/rohanb1/dog_template/template --brokerHostname 127.0.0.1 --externalBrokerHostname bc11259 --taskPort 36071 --metaPort 33491 --origin --backend=ZMQ -vvv generate_template_pediatric.py'
Launching 1 worker(s) using /bin/bash.
Executing '['/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python', '-m', 'scoop.bootstrap.__main__', '--size', '1', '--workingDirectory', '/lustre04/scratch/rohanb1/dog_template/template', '--brokerHostname', '127.0.0.1', '--externalBrokerHostname', 'bc11259', '--taskPort', '36071', '--metaPort', '33491', '--origin', '--backend=ZMQ', '-vvv', 'generate_template_pediatric.py']'...
2024-04-04 07:14:35,671 INFO worker.py:1553 -- Started a local Ray instance.
[2024-04-04 07:15:06,066 E 449836 449836] core_worker.cc:191: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
[2024-04-04 07:15:06,132] launcher (127.0.0.1:36071) INFO Root process is done.
[2024-04-04 07:15:06,132] workerLaunch (127.0.0.1:36071) DEBUG Closing workers on bc11259 (1 workers).
[2024-04-04 07:15:06,132] brokerLaunch (127.0.0.1:36071) DEBUG Closing local broker.
[2024-04-04 07:15:06,132] launcher (127.0.0.1:36071) INFO Finished cleaning spawned subprocesses.
I searched around and found a temporary fix for this issue here: https://stackoverflow.com/a/72492737. It did resolve the above error, but the job still crashes. The crash output is attached:
slurm-46365536.out.zip
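For anyone trying to reproduce this under SLURM, one thing that may be worth checking is how many CPUs Ray is started with, since the log above shows a local Ray instance starting inside a SLURM job. The sketch below is not the fix from the linked answer and is not part of nist_mni_pipelines. It only shows one way to build explicit arguments for ray.init() from the SLURM allocation. The helper names ray_init_kwargs and start_ray are hypothetical, and the fallback to a single CPU is an assumption. SLURM_CPUS_PER_TASK is the variable SLURM sets when --cpus-per-task is requested, and num_cpus and include_dashboard are existing ray.init() parameters.

```python
import os


def ray_init_kwargs(env=None):
    """Build explicit ray.init() arguments from the SLURM allocation.

    Hypothetical helper, not part of nist_mni_pipelines.
    """
    env = os.environ if env is None else env
    # SLURM sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested;
    # falling back to a single CPU is an assumption, adjust as needed.
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    return {"num_cpus": cpus, "include_dashboard": False}


def start_ray():
    """Start a local Ray instance limited to the allocated CPUs."""
    import ray  # imported lazily so ray_init_kwargs works without ray installed

    return ray.init(**ray_init_kwargs())


# Show the arguments that would be passed for a 4-CPU allocation.
print(ray_init_kwargs({"SLURM_CPUS_PER_TASK": "4"}))
```

Whether the pipeline already calls ray.init() itself has not been checked here, so start_ray() may need to be adapted or dropped in favour of passing the same arguments wherever Ray is initialised.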
Steps to reproduce this issue:
- Download the following data: https://drive.google.com/file/d/13yE3sS-GpawC-JcP-uDCJ-FTT1Jzmca9/view?usp=sharing
- For step 2a here, drag and drop it to the scratch folder on Compute Canada and unzip the file
- Open bids_data_final/derivatives/template/subjects.csv and update the paths
- Follow the rest of the steps mentioned here: https://github.com/neuropoly/template?tab=readme-ov-file#step-2-template-creation
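Updating the paths in subjects.csv by hand is error-prone, so here is a minimal sketch of doing it with a script. It assumes the file is a plain comma-separated file whose cells contain absolute paths sharing a common prefix; the actual column layout is not checked. The function name rewrite_prefix and the example prefixes are hypothetical.

```python
import csv


def rewrite_prefix(csv_path, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix at the start of every cell, in place."""
    if not old_prefix:
        raise ValueError("old_prefix must not be empty")
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    updated = [
        [
            new_prefix + cell[len(old_prefix):] if cell.startswith(old_prefix) else cell
            for cell in row
        ]
        for row in rows
    ]
    # Write back with plain "\n" line endings (the csv default is "\r\n").
    with open(csv_path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(updated)
    return updated


# Example call with hypothetical prefixes:
# rewrite_prefix(
#     "bids_data_final/derivatives/template/subjects.csv",
#     "/old/location",
#     "/new/location",
# )
```

The file is overwritten in place, so keeping a copy of the original subjects.csv before running this is a good idea.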
I'm trying to solve this issue on my side but if anyone has any insights, pls share!
(tagging @namgo if you have any information on this)