Description
Dear team,
I'm getting the following error when I run Score-P with a module for tracing Python scripts:
2020-10-20 09:24:14.149317: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.00M (1048576 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149357: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 921.8K (943872 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149366: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 829.8K (849664 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149373: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 747.0K (764928 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149380: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 672.5K (688640 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
The error file grows very quickly, and I end up killing the job.
I use a custom Score-P build. The details of the environment setup are in the attached job script, and the error output is attached too.
Without Score-P, the application runs as expected, even without setting LD_PRELOAD for MPI.
When I run Score-P with the LD_PRELOAD set, I get the following error instead:
[Score-P] src/adapters/mpi/SCOREP_Mpi_Env.c:230: Warning: MPI environment initialization request and provided level exceed MPI_THREAD_FUNNELED!
2020-10-19 10:56:13.384533: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494285000 Hz
[rc0003:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: rc0003: task 0: Segmentation fault
Would appreciate any feedback on this issue.
Thanks in advance!