# flux-tools

This repo contains demos for getting started with the University of Michigan FLUX HPC cluster and running some basic bioinformatics/microbiome scripts.

You can access the lessons at [http://deneflab.github.io/flux-tools/](http://deneflab.github.io/flux-tools/).

This repo was created by @michberr. As of September 2016, updates to flux may not be reflected in the demos.
> ## Objectives
>
> - Explain the steps involved in obtaining a flux account
> - Provide links to supplementary software that makes interacting with flux easier

ARC-TS describes how to set up a flux account here. Follow steps 1-4 before moving on to any other lesson.
Note: I recommend the software MToken, so that you don't have to keep track of a physical token!

There are some other software programs that will make your life much easier while working with flux.
+Cyberduck is a GUI program that allows you to connect to a remote server and transfer files between a remote server and local machine. You can essentially drag and drop files from your flux account onto your computer.
+You can download Cyberduck here.
You will need to configure Cyberduck to access your flux account:

- Click on the + button at the bottom left to add a new server connection to your bookmarks.
- Select SFTP (SSH File Transfer Protocol) from the drop-down menu at the top.
Globus is similar to Cyberduck in that it provides a GUI for transferring files between machines. There are a few differences, however.

Globus is much faster than Cyberduck, so it is ideal for very large files.

Globus can transfer files between two remote servers. For example, you can use Globus to transfer files between Greg's geomicro servers and your flux account. You can also use Globus to transfer files between different locations on your flux account, such as between the nfs and scratch drives. The only requirement for transferring files with Globus is that an endpoint must be set up. These have already been configured on flux and the geomicro servers. It is not too hard to set up a personal endpoint on your own computer.
Globus is accessed through a web browser. If you are transferring files between two remote servers, once you have submitted the transfer request, you can close your computer :)
You can download and register for a Globus personal endpoint here.
Once you have downloaded Globus, you can access a web page to transfer files between remote servers and your personal machine using aliases. For example, flux is accessible through the alias "umich#flux". You can set an alias for your own computer; in the example below, it is simply called "laptop".
> ## Objectives
>
> - Demonstrate how to log into FLUX
> - Describe how to navigate in FLUX
> - Provide some useful aliases to make login easier

You will need to log into flux using a command-line terminal. If you have a Mac, open your terminal program. On a PC, you will need to download a terminal emulator such as Cygwin.
+If you are logging into FLUX from off campus, you will need to set up a VPN. Information on this is available here.
+To log into FLUX, use the following command:
+ssh -l your-uniq-name flux-login.arc-ts.umich.edu
+FLUX will then ask for your password. This is your umich password. Type your password and press enter. Your password will not appear on the screen but it is there.
+Next, flux will ask for your Duo authentication code. Type it in and press enter.
If you are logging into flux frequently, it can be cumbersome to type this in all the time. Therefore, I recommend you create some aliases in your bash profile.
+Navigate to your home directory:
+# cd with no argument will take you to your home
+cd
+
+# Print working directory to make sure you're in the right place
+pwd
+See if you have a file called .bash_profile, and if not, make a file:
# List all files in your current directory (including files starting with '.')
+ls -a
+
+# Create a .bash_profile file if it doesn't exist
+touch .bash_profile
+
+# Open your file for editing with nano
+nano .bash_profile
Add these lines to your file (make sure to change your-uniq-name to your own uniqname!) and save.
+# login to flux
+alias flux='ssh -l your-uniq-name flux-login.arc-ts.umich.edu'
+
Now, when you want to log into flux, you only have to type flux to initiate the login process!
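For example, after saving the file, you can pick up the alias in your current session and use it right away:

```bash
# Reload your profile so the new alias is available in this session
source ~/.bash_profile

# A single word now starts the login
flux
```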
> ## Objectives
>
> - Introduce PBS and describe the arguments of a PBS script
> - Explain how to execute, view, and cancel jobs on FLUX
> - Discuss memory and walltime considerations

To execute programs on FLUX, you will submit them to a job scheduling software called PBS, which stands for Portable Batch System. The role of PBS is to allocate computational resources (e.g., nodes, processors, and time) among the scheduled tasks.
+This means that every time you want to run a program on FLUX, you will need a PBS script, ending in .pbs, that describes what resources are required for the program.
+To help distinguish code chunks that pertain to a pbs script from code chunks that should be typed in the terminal, chunks from scripts will have white backgrounds and chunks with terminal commands will have grey backgrounds.
+#### PBS preamble
+#PBS -N blast
+#PBS -M michberr@umich.edu
+
#PBS -l nodes=1:ppn=4,mem=100gb,walltime=10:00:00
+#PBS -V
+
+#PBS -A vdenef_fluxm
+#PBS -l qos=flux
+#PBS -q fluxm
+
+#PBS -m bea
+#PBS -j oe
+
+#### End PBS preamble
+
+# Show list of CPUs you ran on, if you're running under PBS
+if [ -n "$PBS_NODEFILE" ]; then cat $PBS_NODEFILE; fi
+
+# Change to the directory you submitted from
+if [ -n "$PBS_O_WORKDIR" ]; then cd $PBS_O_WORKDIR; fi
+pwd
+
+## Job commands
+# must load module med/ncbi-blast
+
+bash ./blast-analysis.sh
+
+
The first thing you might notice about this script is that the first several lines start with #PBS. Normally, a # indicates a comment, i.e. a line that will not be executed. In this case, however, a line starting with #PBS is a directive that is read by the PBS software. Lines that begin with a # without PBS directly following it are interpreted as comments by bash.
This first line takes an argument -N, which assigns a name to the job. Pick something descriptive but short - only the first 8 characters will be used. In this example my job is simply named “blast”.
#PBS -N blast
+This line indicates which email you want notifications sent to. By default notifications will be sent to your umich address.
+#PBS -M michberr@umich.edu
This next line indicates how many nodes, processors per node, and how much memory and walltime to allocate. The Denef lab has two types of flux nodes:
+1) Our regular flux nodes (flux) have 20 processors, 4gb per processor.
+2) Our high memory nodes (fluxm) have 40 processors, 25gb per processor.
+If you are not using the Denef lab account, check on what sort of resources you have available. Info on the LSA public account is available here.
The line below indicates we want to allocate 1 node with 4 processors, 100gb of memory, and 10 hours for the analysis. You can tell from the memory requirements (100gb / 4 processors = 25gb per processor, far more than the 4gb per processor on a regular flux node) that this job will be run on a fluxm node.
+#PBS -l nodes=1:ppn=4,mem=100gb,walltime=10:00:00
This next line is like magic. It transfers everything in your current environment into the environment of the node your job will be run on. This will become important later when we load modules.
+#PBS -V
+These two lines describe how your job is paid for and which set of resources to run against. -A indicates the account. In the example below, the job is going to be run on the Denef lab’s fluxm account. If I wanted to run it on the flux (low memory) account, I would use -A vdenef_flux. The -q line indicates which queue your job should go into. This should always match the option for -A (i.e. flux or fluxm).
#PBS -A vdenef_fluxm
+#PBS -q fluxm
This line should never change.
+#PBS -l qos=flux
+This line describes when you want to get email notifications about your job. If you input all options, bea, you will get messages when the job begins, when an error occurs, and after the job has completed. Depending on the type of job, you might find all of the messages in your email annoying. I usually just leave an e or ea on this line.
#PBS -m bea
+This line joins the output and error messages into one document rather than outputting a separate file for each. This isn’t necessary, but it helps to declutter your job output files.
+#PBS -j oe
Whew, we've gotten to the end of our PBS commands! These next two lines produce useful output from your job: they show how many CPUs the job ran on and which directory it was running from. A PBS job has certain default variables stored, like $PBS_O_WORKDIR, which holds the directory you submitted the job from.
# Show list of CPUs you ran on, if you're running under PBS
+if [ -n "$PBS_NODEFILE" ]; then cat $PBS_NODEFILE; fi
+
+# Change to the directory you submitted from
+if [ -n "$PBS_O_WORKDIR" ]; then cd $PBS_O_WORKDIR; fi
+pwd
+Finally, this last section actually executes our program. In this case we are executing a shell script called blast-analysis.sh that presumably calls one of the BLAST commands. We also wrote a comment to ourselves reminding us which modules need to be loaded before running the job.
+## Job commands
+# must load module med/ncbi-blast
+
+bash ./blast-analysis.sh
+
+If your job requires any external software such as BLAST, mothur, tophat, etc, you will need to load these modules before executing your job. Modules are available in different locations. I usually check for bioinformatics modules in med, sph, and lsa.
+Typing the following command into the terminal will allow you to view all of the modules available from the school of public health:
+module load sph
+module av
+You will notice that there are often multiple versions of a software available. If you do not specify which version to load, the default version will be selected.
To load the BLAST module (default is 2.2.29):
+module load med
+module load ncbi-blast
+
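If you ever need a version other than the default, you can name it explicitly. This is a sketch; the version string shown is just the default mentioned above, so run module av first to see exactly which versions are installed:

```bash
module load med

# List the BLAST versions available in the med collection
module av ncbi-blast

# Load a specific version explicitly instead of the default
module load ncbi-blast/2.2.29
```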
+IMPORTANT: You can technically load modules from within your pbs script, but this is not recommended. Instead, you should load a module from the command line before you submit your job. This is where the #PBS -V option comes in handy! Whatever modules you load in your local environment will be present in the processing environment.
You can start your job with this simple line:
+qsub your-pbs-file.pbs
Always make sure that you have loaded your modules beforehand!
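Putting the last two sections together, a typical submission looks something like this (using the BLAST example from above):

```bash
# Load the modules the job needs; #PBS -V will carry them into the job environment
module load med
module load ncbi-blast

# Then submit
qsub your-pbs-file.pbs
```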
+Once you have submitted your job, you can check on it in a few ways.
+To view the status of your job, use the qstat command.
qstat -u your-uniqname
+
+Each job is assigned a unique job id, which is viewable on the left. On the right, there is a column S for status where you can see whether your job is in the queue (Q), running (R), or completed (C).
+In this example, I have just submitted a job, so it is waiting in the queue:
+To peek at the output of a job as it is running, you can use the qpeek command with the job id
qpeek job-id
To view all the jobs running on an account, you can use the showq command. This is often useful to do before running a job to determine how long your job will be in the queue before it can run.
showq -w acct=your-account
+
+You can cancel a job in the queue or a running job using qdel with the job id.
qdel job-id
+Notice that the right column now displays a “C” for complete. The job didn’t actually run because we cancelled it before it started. Therefore a “C” just means the job is no longer running or waiting to run. It tells you nothing about the success of the job.
When you start out it's hard to know how many resources to allocate to a job. In general you want to overestimate your resources, because if you run out of memory or time, your job will fail and you will have to start over. This sucks if you have been running something for 3 days and you only needed 3 more hours for completion. It's a good idea to test your code on a subset of your data, both to make sure everything runs correctly and to estimate the total amount of resources you will need.
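One way to calibrate future requests is to check what a finished test job actually consumed. This is only a sketch; the exact fields depend on the Torque/Moab setup on flux, and completed jobs remain visible to qstat for a limited time:

```bash
# Show full job details, including the memory and walltime actually used
qstat -f your-job-id | grep resources_used
```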
+In the next section, we will walk through an example of running mothur on flux, which will give some guidance for the allocation this software requires.
> ## Objectives
>
> - Give an example of a full job execution workflow with mothur

In the following example we will run mothur, according to the SOP, on a set of bacterial samples whose 16S rRNA genes have been sequenced on a MiSeq.
+First, we need to get our sequences onto flux. Most likely, you had your samples sequenced at the UM Med School and they put your sequences on an MBox. Unless they have changed their setup, you will need to download these sequences to a local machine and then use Globus (recommended) or Cyberduck (fine for only a few samples) to transfer these files onto your FLUX account.
+IMPORTANT: You should keep a safe version of these files in a long-term storage location like a flux nfs drive. Then use Globus to transfer a copy to the Scratch drive for processing.
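If both copies already live on flux, you can also make the working copy from the command line instead of using Globus. The paths below are hypothetical; substitute your own nfs volume and scratch directory:

```bash
# Copy the raw reads from long-term nfs storage to scratch for processing
cp -r /nfs/your-nfs-volume/miseq-run /scratch/lsa_fluxm/your-uniq-name/
```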
+Here is a link to an example mothur batch file, following the SOP. Please refer to the mothur wiki for an in depth explanation of the commands.
+You can copy it to your current flux location with:
+wget https://github.com/DenefLab/flux-tools/raw/gh-pages/scripts/mothur.batch
+
+You will definitely need to edit the following lines of the file:
Change your input, output, and tempdefault locations. Tempdefault signifies a location that mothur will search for files that are not in the input directory. This is a good place to leave your databases for alignment/taxonomy.
+set.dir(input=/scratch/lsa_fluxm/michberr/HABS)
+set.dir(output=/scratch/lsa_fluxm/michberr/HABS)
+set.dir(tempdefault=/scratch/lsa_fluxm/vdenef/Databases/ssu_rRNA/Mothur/)
+Change the file argument to a file that contains all your sample names and links to the forward and reverse reads. See the example stability.file in the SOP. Set the processors argument to the number of processors you will request in your PBS script.
+make.contigs(file=habs.files, processors=30)
+Change the reference argument to the name of the file you want to align to
+align.seqs(fasta=current, reference=silva.seed_v119.pcr.v4.unique.align)
+Change the reference and taxonomy arguments to the files you are using for your analysis
+classify.seqs(fasta=current, count=current, reference=silva.nr_v119.pcr.v4.align, taxonomy=silva.nr_v119.tax, cutoff=60)
+Change the groups argument to the names of the mock groups included in your run. They should be separated by a ‘-’
+remove.groups(count=current, fasta=current, taxonomy=current, groups=Mock1-Mock2)
+You might want to edit several other lines of this file based on the needs of your analysis.
+Here is a link to an example PBS script to execute the mothur batch file. You can copy it to your FLUX location the same way as the batch file.
+wget https://github.com/DenefLab/flux-tools/raw/gh-pages/scripts/mothur.pbs
+You will definitely need to edit the -M line:
#PBS -M your-email
+
+If you are not in the Denef lab, you will need to edit the -A line:
#PBS -A your-account
+You might want to edit the -l line with a different memory or walltime allocation based on the size of your dataset. Usually, for ~350 samples of moderate diversity, it takes less than 24 hours on 30 processors.
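For a run of that size, the resource request might look something like the line below. These values are only illustrative; scale the processors, memory, and walltime to your own dataset and account limits:

```bash
#PBS -l nodes=1:ppn=30,mem=500gb,walltime=48:00:00
```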
To begin running mothur we need to load the mothur module.
+module load med
+module load mothur
+Now it’s time to execute!
+qsub mothur.pbs
+We can check our job with:
+qstat -u your-uniq-name
+
Most likely, your script will fail the first few times you try to run it. Usually the error is related to an incorrect path or filename, or to setting an impossible allocation in your PBS script. Read your output/error file (the file ending in .o followed by your job id) to debug.
+cat mothur-output-file
+Some mothur commands are designed to run very efficiently on a large number of processors. This is where running your job on FLUX will save you a LOT of time over a local machine. However, the OTU picking stage can only run on a single processor and this is usually the most time consuming part of the process. The more diversity you have, the longer this will take. This is good to keep in mind as you are weighing processor and time allocation. Adding 40 processors to your job will only help to a point. You should still add sufficient time for all of your OTUs to be calculated.
The most frustrating issue I have encountered with mothur is ending up with many strange unclassified OTUs that you know should really be classified. This took the Denef lab almost a year to figure out. It turns out that a few extra files are generated when you run the classification step. You need to delete these files in between mothur runs, or move your reference and taxonomy files to a new location, or you will get strange and wrong results. These are the problematic files:
> ## Objectives
>
> - Demonstrate how to run several jobs in succession using a for loop
> - Demonstrate how to run several jobs simultaneously using a PBS job array

Often, you will want to run the same analysis on several samples. Rather than submit a job for each of these samples, you would like to submit one job and apply it to all of your samples. Depending on the requirements of the task, you can either write your script in a for loop or submit a PBS job array.
If your analysis is memory intensive, i.e. requires close to your full allocation, you might use a for loop. For example, if you are assembling several metagenomes, it would be best to assemble them one at a time, so that you can allocate all 40 processors from the Denef fluxm account to each sample.
+Here is an example of a shell script that looks for directories that start with “Sample_” and then runs an analysis called my-analysis.sh on a fasta file in each of those directories.
+for i in $(find -maxdepth 1 -type d -name "Sample_*"); do
+ myPath=${i}
+ echo -e "[`date`]\tStarting analysis on ${i}"
+ cd $myPath
+ bash ./my-analysis.sh my.fasta
+ cd -
+ echo -e "[`date`]\tFinished with ${i}"
+done
+
+echo "[`date`]\tDone!"
+
For loops allow you to process many samples sequentially, but it would really speed up the process if you could submit the same job on many samples at the same time. This is exactly what a PBS job array does. In particular, if your analysis is not that memory intensive, i.e. your allocation is large enough to run several jobs at once, you will definitely want to use a job array.
+To submit a job array, you only need to add one line to your PBS script with the -t argument. The -t argument takes a set of indices for your job array. For example, if you wanted to run your script on 5 samples:
#PBS -t 1-5
+PBS will then store a variable, $PBS_ARRAYID, which takes on values from the specified range (1-5 in this case). If we then started our job with qsub, 5 different instances of the job would be executed, each uniquely identified by a value from $PBS_ARRAYID.
We can use the $PBS_ARRAYID variable to assign samples to our job array. The way I like to do this is to create a file called job.conf that contains the name of each sample directory on a separate line. It might look like this:
Sample_1
+Sample_2
+Sample_3
+Sample_4
+Sample_5
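If your sample directories follow the Sample_* naming used earlier, you can generate this file instead of typing it by hand (a sketch; adjust the pattern to your own naming scheme):

```bash
# Write one sample directory name per line into job.conf
find . -maxdepth 1 -type d -name "Sample_*" | sed 's|^\./||' | sort > job.conf
```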
+Then, I can use the following line to assign a different sample directory for each element of my job array:
+sample=$(sed "${PBS_ARRAYID}q;d" job.conf)
+This line extracts the nth line of job.conf where n is our $PBS_ARRAYID number. This means that Sample_3 will run in the 3rd index of our job array.
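You can test this sed idiom interactively before relying on it inside a job script:

```bash
# Print the 3rd line of job.conf, i.e. the sample a job with PBS_ARRAYID=3 would get
sed "3q;d" job.conf
```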
If you want to submit a job array with a large number of samples, but you only want a few of them to run at a time, you can specify this with a %. For example the following line will run an array with 20 jobs, but will only run 5 of them at a time:
#PBS -t 1-20%5
Below is an example of a PBS script that will execute an analysis using the CRASS CRISPR-finding software.
+#### PBS preamble
+#PBS -N crass
+#PBS -m abe
+
+#PBS -l nodes=1:ppn=2,mem=50gb,walltime=200:00:00
+#PBS -V
+
+#PBS -A vdenef_fluxm
+#PBS -l qos=flux
+#PBS -q fluxm
+#PBS -t 1-7
+#PBS -j oe
+
+#### End PBS preamble
+
+# Show list of CPUs you ran on, if you're running under PBS
+if [ -n "$PBS_NODEFILE" ]; then cat $PBS_NODEFILE; fi
+
+# Change to the directory you submitted from
+if [ -n "$PBS_O_WORKDIR" ]; then cd $PBS_O_WORKDIR; fi
+pwd
+
+
+# required module: lsa/crass
+
+# Run CRASS script
+bash ./crass-analysis.sh ${PBS_ARRAYID}
+
+
+Notice that here I pass $PBS_ARRAYID as an argument to my crass-analysis.sh script.
The crass-analysis.sh script looks like this:
+arrayid="$1"
+sample=$(sed "${arrayid}q;d" job.conf)
+
+cd ${sample}
+echo -e "[`date`]\t Running CRASS on $sample"
+crass dt_int.fasta
+echo -e "[`date`]\t Finished with $sample"
+
+
In the first line, the $PBS_ARRAYID value passed as the first argument ($1) gets assigned to the variable $arrayid. In the second line, $arrayid is used to assign the correct sample directory name to $sample.
You can use all of the same commands (e.g., qstat, qdel, qpeek) with a job array as with a regular job. However, you will need to use bracket indexing to select a single job from the array.
For example, for a job with id 19957691, I could peek at the output of the second sample with this command:
+qpeek 19957691[2]
+
+If I wanted to cancel all of the jobs in my job array I could do this:
+qdel 19957691[]
+
+