T = |G| x t
where:
- T: The total SUs charged.
- |G|: The total number of GPUs used.
- t: The total wallclock time in hours.
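For example, a job that uses 8 GPUs for 3 hours of wallclock time is charged T = 8 × 3 = 24 SUs.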
- Login via:

  ```
  $ ssh -l <username> login.xsede.org
  ```
- Enter your XUP Password when prompted. A directory at `/home/username` will be created on your first login:

  ```
  # Welcome to the XSEDE Single Sign-On (SSO) Hub!
  [username@ssohub ~]$
  ```
- By default, upon logging into the SSO Hub, the X.509 credential is obtained on your behalf and is valid only for a 12-hour period. You can check the validity of this credential via:

  ```
  [username@ssohub ~]$ grid-proxy-info
  ```

  and you should see the following:

  ```
  subject  : /C=US/O=National Center for Supercomputing Applications/CN=[YOUR NAME HERE]
  issuer   : /C=US/O=National Center for Supercomputing Applications/OU=Certificate Authorities/CN=MyProxy CA 2013
  identity : /C=US/O=National Center for Supercomputing Applications/CN=[YOUR NAME HERE]
  type     : end entity credential
  strength : 2048 bits
  path     : /tmp/x509up_u[XXXX]
  timeleft : 11:45:38
  ```
  Note the `timeleft` entry. Renew your X.509 credential while logged into the SSO Hub via the `myproxy-logon` command with your XUP Password:

  ```
  [username@ssohub ~]$ myproxy-logon
  Enter MyProxy pass phrase: [YOUR XUP PASSWORD HERE]
  A credential has been received for user [USERNAME] in /tmp/x509up_u[XXXXX].
  ```
- Once logged onto the hub, use the `gsissh` utility to log into the XStream login node:

  ```
  [username@ssohub ~]$ gsissh xstream
  # --*-*- Stanford University Research Computing Center -*-*--
  #  __  ______ _
  #  \ \/ / ___|| |_ _ __ ___  __ _ _ __ ___
  #   \  /\___ \| __| '__/ _ \/ _` | '_ ` _ \
  #   /  \ ___) | |_| | |  __/ (_| | | | | | |
  #  /_/\_\____/ \__|_|  \___|\__,_|_| |_| |_|
  [xs-username@xstream-ln0X ~]$
  ```
- Submit jobs via Slurm, do cool stuff with GPUs and then `exit`:

  ```
  [xs-username@xstream-ln0X ~]$ exit
  [username@ssohub ~]$ exit
  ```
- There are 3 filesystems, each dedicated to specific tasks:

  - `$HOME`: 5 GB of limited space to store scripts, binaries, logs, etc.
  - `$WORK`: 1 TB Lustre filesystem for computationally expensive I/O (store data here).
  - `$LSTOR` & `$TMPDIR` (on compute nodes): local scratch disks (up to 447 GB).
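  As a sketch of how these might be combined in practice (dataset and script names here are hypothetical), a job could stage its data onto local scratch and persist results back to `$WORK`:

  ```bash
  # Stage input data from $WORK onto the node-local scratch disk for fast I/O.
  cp $WORK/datasets/train.tfrecord $TMPDIR/
  cd $TMPDIR
  # Run the (hypothetical) training script against the local copy.
  python train.py --data train.tfrecord --out model.ckpt
  # Persist results back to $WORK; local scratch is typically cleared after the job.
  cp model.ckpt $WORK/results/
  ```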
- Modules a.k.a. Packages (including TensorFlow):

  - List all available modules:

    ```
    [xs-username@xstream-ln0X ~]$ module spider
    ```

  - Details on how to load a module and module support info:

    ```
    [xs-username@xstream-ln0X ~]$ module spider [MODULE]/[VERSION]
    ```

  - Example of how to load TensorFlow (which automatically loads CUDA and cuDNN):

    ```
    [xs-username@xstream-ln0X ~]$ ml tensorflow/0.10
    ```

    where `ml` is an alias for `module load`.
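  To confirm what was loaded, `module list` shows the currently loaded modules; a quick import check (a sketch, assuming the module puts TensorFlow on the default `python` path) verifies the installation:

  ```
  [xs-username@xstream-ln0X ~]$ module list
  [xs-username@xstream-ln0X ~]$ python -c "import tensorflow as tf; print(tf.__version__)"
  ```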
The only rule for job submission is that the CPU:GPU ratio, r, should be at most 5:4.
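For example, a job requesting 4 GPUs may request at most 5 CPUs, and a 16-GPU job at most 20 CPUs.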
Queues and QoS:

| Slurm QoS | Max CPUs | Max GPUs | Max Jobs | Max Nodes | Job Time Limits |
|---|---|---|---|---|---|
| normal | 320/user, 400/group | 256/user, 320/group | 512/user | 16/user, 20/group | 48 hours |
| long** | 20/user, 80/group, 200 max total | 16/user, 64/group, 160 max total | 4/user, 64 max total | N/A | 7 days |
** Enable the long QoS mode via the `--qos=long` flag when submitting jobs.
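For example, to submit the `submit.sh` script described below under the 7-day limit:

```
[xs-username@xstream-ln0X ~]$ sbatch --qos=long submit.sh
```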
Two steps to running jobs:

- Resource requests (`#SBATCH` prefix)
- Job steps (`srun` command)
A few useful `SBATCH` parameters (more in `man sbatch`) include:

- `--job-name`: Define the name of the job.
- `--output`: Define the output file for job completion information.
- `--time`: Set the runtime of the job.
- `--ntasks`: Define the number of tasks. Typically `1` for a single TensorFlow job.
- `--cpus-per-task`: Number of CPUs to be allocated.
- `--mem-per-cpu`: Total memory per CPU in MB. Max is 12800 and default is 12000.
- `--gres`: Typically used to define GPU resources, e.g. `--gres gpu:2` for 2 GPUs.
- `--gres-flags`: Used to `enforce-binding`, i.e. ensure that the allocated GPUs all reside within the same CPU socket. May improve communication speed between GPUs.
Putting it all together (a sample Slurm script):

- Create the `submit.sh` script:

  ```bash
  #!/bin/bash
  #
  #SBATCH --job-name=tf_trial
  #SBATCH --output=res_%j.txt
  #
  #SBATCH --time=12:00:00
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=4
  #SBATCH --gres gpu:4
  #SBATCH --gres-flags=enforce-binding

  ml tensorflow/0.10 protobuf/2.6.1
  python main.py ...
  ```

- Submit the job via the `sbatch` command:

  ```
  [xs-username@xstream-ln0X ~]$ sbatch submit.sh
  Submitted batch job XXXX
  ```

Monitoring, Terminating and Gathering Information:
- `scancel`: Used to kill jobs, e.g. `scancel [JOB ID]` or `scancel -u [USERNAME]`.
- `squeue`: View PENDING and RUNNING jobs, e.g. `squeue -u [USERNAME]`.
- `scontrol show job`: Get full details about a PENDING or RUNNING job, e.g. `scontrol show job [JOB ID]`.
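In the context of the sample job above, a typical check might look like this (the job ID is a placeholder; `res_%j.txt` expands to `res_[JOB ID].txt`):

```
[xs-username@xstream-ln0X ~]$ squeue -u [USERNAME]        # still PENDING or RUNNING?
[xs-username@xstream-ln0X ~]$ scontrol show job [JOB ID]  # full details while it runs
[xs-username@xstream-ln0X ~]$ cat res_[JOB ID].txt        # inspect output once the job completes
```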