This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Debugging on AzureML

Anton Schwaighofer edited this page Sep 29, 2020 · 4 revisions

Necessary setup

Create the AzureML cluster with SSH enabled

When creating the AzureML cluster, tick the "Enable ssh" option and pick your authentication method. Without SSH access to the nodes, the steps below cannot be carried out.

Instrumenting your Python code

import logging

import rpdb

# Port on which rpdb will listen once the process receives SIGTRAP.
rpdb_port = 4444
rpdb.handle_trap(port=rpdb_port)
logging.info(f"rpdb is handling traps. To debug: identify the main runner.py process, then as root: "
             f"kill -TRAP <process_id>; nc 127.0.0.1 {rpdb_port}")

The InnerEye toolbox already does this; the snippet is included here only for completeness.
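To illustrate the mechanism that the instrumentation relies on, here is a minimal standard-library sketch. It is not rpdb's implementation: it only shows that a process with a SIGTRAP handler installed reacts to `kill -TRAP` instead of being terminated. The names `trap_received` and `on_trap` are hypothetical, and the sketch assumes a Unix system.

```python
import os
import signal
import time

# Hypothetical record of received traps, for illustration only.
# rpdb would instead open a debugger session on a TCP port at this point.
trap_received = []

def on_trap(signum, frame):
    """Record the signal number and the line that was interrupted."""
    trap_received.append((signum, frame.f_lineno if frame else None))

# Without a handler, the default action for SIGTRAP terminates the process.
signal.signal(signal.SIGTRAP, on_trap)

# Simulate `kill -TRAP <process_id>` by signalling our own process.
os.kill(os.getpid(), signal.SIGTRAP)

# Python runs signal handlers between bytecode instructions,
# so wait briefly until the handler has had a chance to run.
deadline = time.monotonic() + 1.0
while not trap_received and time.monotonic() < deadline:
    time.sleep(0.01)

print("trap handled:", bool(trap_received))
```

Because the handler is only triggered by the signal, the instrumented job runs at full speed until someone explicitly asks for a debugger.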

Identifying the AzureML node that runs your job  

  • From the "Details" tab in the run's page, note the Run ID, then click on the target name under "Compute target".
  • Click on the "Nodes" tab, and identify the node whose "Current run ID" is that of your run.
  • Copy the contents of the "Connection string" column for that node to the clipboard (ssh user@...) and execute it in a shell. This requires an ssh client on your machine.
  • Type "bash" for a nicer command shell (optional).
  • Run sudo docker ps to check that Docker is running. The output should list exactly one Docker container ID.
  • Identify the main python process with a command such as
ps aux | grep 'python.*runner.py' | grep -Ewv 'bash|grep'

You may need to vary this if it does not yield exactly one line of output.

  • Note the process identifier (the value in the PID column, generally the second column of the output).
  • Issue the commands
kill -TRAP nnnn
nc 127.0.0.1 4444

where nnnn is the process identifier. If the python process is in a state where it can accept the connection, the "nc" command will print a prompt from which you can issue pdb commands.
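The process lookup and the kill/nc step above can also be sketched in Python. This is a hedged illustration, not part of the InnerEye toolbox: `find_runner_pids` and `trap_and_connect` are hypothetical helpers, the `ps aux` text is made-up sample data, and the filter is applied to the COMMAND column only, whereas the grep pipeline matches the whole line.

```python
import os
import re
import signal
import socket
import time
from typing import List

def find_runner_pids(ps_output: str, pattern: str = r"python.*runner\.py") -> List[int]:
    """Return the PIDs of `ps aux` lines whose command matches `pattern`,
    skipping commands that contain the whole words 'bash' or 'grep'."""
    pids = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or malformed row
        command = fields[10]
        if re.search(r"\b(bash|grep)\b", command):
            continue  # same intent as: grep -Ewv 'bash|grep'
        if re.search(pattern, command):
            pids.append(int(fields[1]))
    return pids

def trap_and_connect(pid: int, port: int = 4444) -> socket.socket:
    """Rough equivalent of `kill -TRAP pid; nc 127.0.0.1 port`.
    Needs sufficient privileges, and works at most once per process."""
    os.kill(pid, signal.SIGTRAP)
    time.sleep(1.0)  # assumed grace period for the debugger to start listening
    return socket.create_connection(("127.0.0.1", port), timeout=10)

# Made-up sample output, for illustration only.
sample = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1234 99.0  5.0 123456 65432 ?        Sl   10:00  12:34 python runner.py --train
root      1300  0.0  0.0   4321  1234 pts/0    S+   10:05   0:00 grep python.*runner.py
root      1100  0.0  0.0   4321  1234 ?        Ss   10:00   0:00 bash -c python runner.py --train
"""
print(find_runner_pids(sample))  # prints [1234]
```

On a real node, the input would come from something like `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`. Call `trap_and_connect` only when the lookup yields exactly one PID.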

Notes:

  • The last step (kill and nc) can be carried out successfully at most once for a given process. If you might want a colleague to do the debugging, think carefully before issuing these commands yourself.
