Template generation pipeline crashing on the Compute Canada cluster #86

@rohanbanerjee

Moving the issue spinalcordtoolbox/template-dog#18 to this repository, since it is more relevant here.

The generate_template script depends on the following, as described here:

  1. minc-toolkit-v2
  2. minc2-simple
  3. nist_mni_pipelines

We have been using commit cadc7219e79d6edb90742e1e340f8eee76332006 of nist_mni_pipelines, which used the scoop package for parallelization. Newer versions of nist_mni_pipelines (I'm using commit 608acff75601bf80f79334abc0434bbc0734af0d) use the ray package instead. Now, after installing ray with pip install ray, the jobs crash with the following error:

error stack
[2024-04-04 07:13:48,381] launcher  INFO    SCOOP 0.7 2.0 on linux using Python 3.8.10 (default, Jun 16 2021, 14:19:02) [GCC 9.3.0], API: 1013
[2024-04-04 07:13:48,382] launcher  INFO    Detected SLURM environment.
[2024-04-04 07:13:48,382] launcher  INFO    Deploying 1 worker(s) over 1 host(s).
[2024-04-04 07:13:48,382] launcher  DEBUG   Using hostname/ip: "bc11259" as external broker reference.
[2024-04-04 07:13:48,382] launcher  DEBUG   The python executable to execute the program with is: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python.
[2024-04-04 07:13:48,382] launcher  INFO    Worker distribution: 
[2024-04-04 07:13:48,382] launcher  INFO       bc11259:	0 + origin
[2024-04-04 07:13:48,816] brokerLaunch (127.0.0.1:36071) DEBUG   Local broker launched on ports 36071, 33491.
[2024-04-04 07:13:48,816] launcher  (127.0.0.1:36071) DEBUG   Initialising local origin worker 1 [bc11259].
[2024-04-04 07:13:48,816] launcher  (127.0.0.1:36071) DEBUG   bc11259: Launching 'env PYTHONPATH=/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/ni
st_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python -m scoop.launch.__main__ 1 3 --size 1 --workingDirectory /lustre04/scratch/rohanb1/dog_template/template --brokerHostname 127.0.0.1 --externalBrokerHostname bc11259 --taskPort 36071 --metaPort 33491 --origin --backend=ZMQ -vvv generate_template_pediatric.py'
Launching 1 worker(s) using /bin/bash.
Executing '['/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python', '-m', 'scoop.bootstrap.__main__', '--size', '1', '--workingDirectory', '/lustre04/scratch/rohanb1/dog_template/template', '--brokerHostname', '127.0.0.1', '--externalBrokerHostname', 'bc11259', '--taskPort', '36071', '--metaPort', '33491', '--origin', '--backend=ZMQ', '-vvv', 'generate_template_pediatric.py']'...
2024-04-04 07:14:35,671	INFO worker.py:1553 -- Started a local Ray instance.
[2024-04-04 07:15:06,066 E 449836 449836] core_worker.cc:191: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
[2024-04-04 07:15:06,132] launcher  (127.0.0.1:36071) INFO    Root process is done.
[2024-04-04 07:15:06,132] workerLaunch (127.0.0.1:36071) DEBUG   Closing workers on bc11259 (1 workers).
[2024-04-04 07:15:06,132] brokerLaunch (127.0.0.1:36071) DEBUG   Closing local broker.
[2024-04-04 07:15:06,132] launcher  (127.0.0.1:36071) INFO    Finished cleaning spawned subprocesses.

I searched around and found a temporary fix for this issue here: https://stackoverflow.com/a/72492737. It did resolve the above error, but the job still crashes. The crash output is attached:
slurm-46365536.out.zip
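
For anyone trying to reproduce this, here is a minimal sketch of one way to give Ray explicit settings inside a SLURM job instead of letting it auto-detect the node. It is not taken from the linked answer, and it is not confirmed to fix this crash. The helper name ray_init_kwargs is hypothetical. It assumes the job exports SLURM_CPUS_PER_TASK and SLURM_TMPDIR, and that the resulting dictionary is passed to ray.init(), which accepts num_cpus and _temp_dir.

```python
import os


def ray_init_kwargs(env=None):
    """Build keyword arguments for ray.init() from SLURM variables.

    Hypothetical helper: the variable names and the "ray" subdirectory
    are assumptions, not something nist_mni_pipelines does itself.
    """
    env = os.environ if env is None else env
    kwargs = {}

    # Limit Ray to the CPUs the job was actually allocated.
    cpus = env.get("SLURM_CPUS_PER_TASK", "")
    if cpus.isdigit() and int(cpus) > 0:
        kwargs["num_cpus"] = int(cpus)

    # Keep Ray's session files and sockets on node-local storage.
    tmpdir = env.get("SLURM_TMPDIR")
    if tmpdir:
        kwargs["_temp_dir"] = os.path.join(tmpdir, "ray")

    return kwargs


if __name__ == "__main__":
    # Intended use (requires ray to be installed):
    #     import ray
    #     ray.init(**ray_init_kwargs())
    print(ray_init_kwargs())
```

If neither variable is set, the helper returns an empty dictionary, so ray.init() falls back to its defaults.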

Steps to reproduce this issue:

  1. Download the following data: https://drive.google.com/file/d/13yE3sS-GpawC-JcP-uDCJ-FTT1Jzmca9/view?usp=sharing
  2. For step 2a here, drag and drop the file to the scratch folder on Compute Canada and unzip it
  3. Open bids_data_final/derivatives/template/subjects.csv and update the paths
  4. Follow the rest of the steps mentioned here https://github.com/neuropoly/template?tab=readme-ov-file#step-2-template-creation
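
The path update in step 3 can also be scripted rather than done by hand. The sketch below rewrites a leading path prefix in every cell of a CSV. The function name rewrite_prefix and the example prefixes are hypothetical, and the actual column layout of subjects.csv is not assumed: any cell that starts with the old prefix is rewritten, and everything else is left alone.

```python
import csv
import io


def rewrite_prefix(csv_text, old_prefix, new_prefix):
    """Return csv_text with old_prefix replaced by new_prefix at the
    start of any cell that begins with old_prefix."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([
            new_prefix + cell[len(old_prefix):]
            if cell.startswith(old_prefix) else cell
            for cell in row
        ])
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    sample = "/old/root/sub-01_T1w.nii.gz,/old/root/sub-01_mask.nii.gz\n"
    print(rewrite_prefix(sample, "/old/root", "/new/root"), end="")
```

To apply it to the real file, read subjects.csv into a string, call rewrite_prefix with the prefix currently in the file and your own scratch path, and write the result back.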

I'm trying to solve this issue on my side, but if anyone has any insights, please share!
(tagging @namgo if you have any information on this)
