dccli is a command-line interface for training deep learning models on a GPU cluster.
Read more about the platform
Install the latest CLI with pip:
pip install dccli
Warning: dccli<=0.0.11 no longer works because of updates that are not backward compatible.
To use DeepCluster, your project must be a git repository. The easiest way to get started is with the template git repository.
Template repo has the following structure:
/DeepClusterTemplate
/main.py
/requirements.txt
/config.yaml
/README.md
/.gitignore
You may rename the repo, but DO NOT rename or move main.py.
config.yaml looks like the following. You can choose one of our known datasets with dataset_name, the deep learning framework with container_image, the number of workers you want to use with worker_required, and the entry command with command.
There are two environment variables: $DATASET is the path to the dataset, and $OUTPUT is the path where you can put output for download.
You can find more detailed information about each field here
# specify your container name, such as deepcluster/tensorflow:1.12-python3.6 or deepcluster/pytorch:1.0-python3.7
container_image:
# provide known dataset name or local datasets
dataset_name:
dataset_path:
# number of GPUs used to train, default is 1
worker_required: 1
# command to run
# use the environment variable $DATASET to access dataset
# and write output to $OUTPUT
command:
# other custom configs
You can also add other configs that are specific to your model here and read config.yaml from the main function of main.py at "./config.yaml".
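As a minimal sketch of reading such custom settings from main.py, the file can be loaded with PyYAML. This assumes PyYAML is available in the container or listed in requirements.txt, and that the job runs from the repository root so the file sits at ./config.yaml as in the template layout. The key `learning_rate` is a hypothetical custom setting, not a DeepCluster field.

```python
from pathlib import Path

try:
    import yaml  # PyYAML, a third-party package (assumed to be installed)
except ImportError:
    yaml = None


def load_config(path="./config.yaml"):
    """Return the parsed config.yaml as a dict (empty dict for an empty file)."""
    if yaml is None:
        raise RuntimeError("PyYAML is missing; add 'pyyaml' to requirements.txt")
    with Path(path).open() as f:
        return yaml.safe_load(f) or {}


def custom_settings(config):
    """Pick out model-specific values, falling back to defaults."""
    # 'learning_rate' is a hypothetical custom key used only for illustration.
    return {
        "workers": config.get("worker_required", 1),
        "learning_rate": config.get("learning_rate", 0.001),
    }
```

Calling `custom_settings(load_config())` at the start of the main function keeps the model-specific values in one place. Fields left blank in config.yaml parse as `None`.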
If your model requires other Python packages from PyPI, list them in requirements.txt.
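The $DATASET and $OUTPUT variables described above can be read inside main.py with the standard library. The sketch below assumes both variables point to directories; the file-counting step is only a placeholder for real training code, and `summary.txt` is a hypothetical output file name.

```python
import os
from pathlib import Path


def resolve_paths(env=os.environ):
    """Read the dataset and output locations from the environment."""
    dataset_dir = Path(env["DATASET"])  # raises KeyError if the variable is unset
    output_dir = Path(env["OUTPUT"])
    # Assumption: $OUTPUT is a directory; create it if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    return dataset_dir, output_dir


def main(env=os.environ):
    dataset_dir, output_dir = resolve_paths(env)
    # Placeholder for real training: count the files in the dataset.
    n_files = sum(1 for p in dataset_dir.rglob("*") if p.is_file())
    # Files written under output_dir are the ones intended for later download.
    summary = output_dir / "summary.txt"
    summary.write_text(f"dataset files: {n_files}\n")
    return summary
```

Passing the environment in as a parameter makes it possible to try the script locally with a plain dict before submitting a job.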
dccli register
You will be asked for an email address and password to register for the service:
===========================================
Register with DeepCluster
===========================================
Please enter your email address: <your email>
Password must be 8 - 20 characters and can contain alphanumeric and @#$%^&+=
Please enter your password: <your password>
Please re-enter your password: <your password>
Registered successfully
Login successfully
Once you have successfully registered with DeepCluster, you are already logged in. Skip to Step 3: Submit your training job.
If your login has expired, log in with the following command:
dccli login
You will be asked for your email and password to log in to the service:
===========================================
Login to DeepCluster
===========================================
Please enter your email: <your email>
Please enter your password: <your password>
Login successfully
dccli packages and submits the entire git repository by default. Use the -c flag to submit the current directory instead.
dccli submit
You should see something like the following if the submission succeeds:
===========================================
Submit Job
===========================================
Code package includes uncommitted changes
zip source code...
Upload code...
Submit job successfully
> job type: tensorflow
> job uuid: 7040d88f-d02f-4529-84e2-d1991b90afc0
The job uuid is the identifier used to track training progress, stream logs, and download artifacts.
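If you script around the CLI, the job uuid can be pulled out of the submit output with a regular expression. This is only a sketch: it assumes the output keeps the `> job uuid: ...` line format shown in the sample above, which is not a documented stable interface.

```python
import re

# Matches a line such as "> job uuid: 7040d88f-d02f-4529-84e2-d1991b90afc0".
JOB_UUID_RE = re.compile(
    r"^[ \t]*>[ \t]*job uuid:[ \t]*([0-9a-fA-F-]{36})[ \t]*$",
    re.MULTILINE,
)


def parse_job_uuid(submit_output):
    """Return the job uuid found in the submit output, or None if absent."""
    match = JOB_UUID_RE.search(submit_output)
    return match.group(1) if match else None
```

Returning `None` instead of raising lets the caller decide how to handle output that does not contain a uuid, for example after a failed submission.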
Congratulations! You have now successfully submitted a training job to DeepCluster.
You can check the progress using:
dccli progress
Optionally, you can pass --job_uuid <job uuid> on the command line if you have more than one job:
===========================================
Check Job Progress
===========================================
Job 7040d88f-d02f-4529-84e2-d1991b90afc0
> job type: tensorflow
> job state: waiting for worker to join
> duration: 00:00:58
> job local history:
[2019-03-10 12:42:13] pending upload
[2019-03-10 12:42:19] job data uploaded
[2019-03-10 12:42:20] job is ready
job state indicates the current status of the job.
job local history lists the events from earliest to latest.
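For scripted monitoring, the job state and local history can be extracted from the progress output. The sketch below assumes the line formats shown in the sample output above (`> job state: ...` and `[YYYY-MM-DD HH:MM:SS] event`); the real output may differ between dccli versions.

```python
import re
from datetime import datetime

STATE_RE = re.compile(r"^[ \t]*>[ \t]*job state:[ \t]*(.+?)[ \t]*$", re.MULTILINE)
EVENT_RE = re.compile(
    r"^[ \t]*\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\][ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_progress(text):
    """Return the job state and the (timestamp, event) history, in listed order."""
    state_match = STATE_RE.search(text)
    history = [
        (datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), event)
        for stamp, event in EVENT_RE.findall(text)
    ]
    return {
        "state": state_match.group(1) if state_match else None,
        "history": history,
    }
```

Because the history is listed from earliest to latest, the last entry of `history` is the most recent event.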
To stream console logs:
dccli stream
The output will look something like this:
===========================================
Query for logs
===========================================
first line of log ## if there is any console log of your code
second line of log
Once your job has completed, you can download its outputs, such as model artifacts or plots, to a local directory of your choice:
dccli download --dest=<local path where you want model outputs downloaded to>
The output will look like this:
===========================================
Download Job Output
===========================================
Output downloaded to: <local path where you want model outputs downloaded to>
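The commands above can also be driven from a Python script. The following sketch wraps only the invocations documented in this README (`submit`, `submit -c`, `progress --job_uuid`, `download --dest=`). The helper names are hypothetical, and the sketch deliberately does not poll for completion because the possible job state values are not listed here.

```python
import subprocess


def run_dccli(args, runner=subprocess.run):
    """Run `dccli <args>` and return its captured standard output."""
    result = runner(["dccli", *args], capture_output=True, text=True, check=True)
    return result.stdout


def submit(current_dir_only=False, runner=subprocess.run):
    # -c submits the current directory instead of the whole git repository.
    args = ["submit"] + (["-c"] if current_dir_only else [])
    return run_dccli(args, runner)


def progress(job_uuid=None, runner=subprocess.run):
    # --job_uuid is optional and only needed when there is more than one job.
    args = ["progress"] + (["--job_uuid", job_uuid] if job_uuid else [])
    return run_dccli(args, runner)


def download(dest, runner=subprocess.run):
    return run_dccli(["download", f"--dest={dest}"], runner)
```

The `runner` parameter exists so the wrappers can be exercised with a stand-in function when dccli is not installed. With `check=True`, a non-zero exit status raises `subprocess.CalledProcessError`.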