Example Pipeline
In our example analysis, we investigate the differences between the microbiomes of 20 rural and 20 recently urbanized subjects from the Chinese province of Hunan. For more information on this dataset, please review the analysis the Fodor Lab published in the Sep 2017 issue of the journal Microbiome: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0338-7
The BioLockJ project Config chinaKrakenFullDB.properties lists 5 BioModules to run (lines 3-7) and 13 properties:
#BioModule biolockj.module.implicit.RegisterNumReads
#BioModule biolockj.module.classifier.wgs.KrakenClassifier
#BioModule biolockj.module.report.taxa.NormalizeTaxaTables
#BioModule biolockj.module.report.r.R_PlotPvalHistograms
#BioModule biolockj.module.report.r.R_PlotOtus
In addition to the 5 listed BioModules, 4 implicit BioModules also run:
| Mod# | Module | Description |
|---|---|---|
| 1 | ImportMetadata | Always run 1st (for all pipelines) |
| 2 | KrakenParser | Always run after KrakenClassifier |
| 3 | AddMetadataToOtuTables | Always run just before the 1st R module |
| 4 | CalculateStats | Always run as the 1st R module. |
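The ordering rules in the table above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not how BioLockJ itself builds the pipeline: the function name `add_implicit_modules` and the assumption that R modules can be recognized by an `R_` prefix are hypothetical choices made for the example.

```python
# Illustrative sketch only: applies the four ordering rules from the table above.
# add_implicit_modules and the "R_" prefix test are hypothetical, not BioLockJ API.

def add_implicit_modules(configured):
    """Return the configured module names with the implicit modules inserted."""
    ordered = ["ImportMetadata"]  # rule 1: always run first
    seen_r_module = False
    for name in configured:
        if name.startswith("R_") and not seen_r_module:
            # rules 3 and 4: metadata is merged just before the first R module,
            # and CalculateStats runs as the first R module
            ordered += ["AddMetadataToOtuTables", "CalculateStats"]
            seen_r_module = True
        ordered.append(name)
        if name == "KrakenClassifier":
            ordered.append("KrakenParser")  # rule 2: always follows the classifier
    return ordered


configured = [
    "RegisterNumReads",
    "KrakenClassifier",
    "NormalizeTaxaTables",
    "R_PlotPvalHistograms",
    "R_PlotOtus",
]

pipeline = add_implicit_modules(configured)
for position, name in enumerate(pipeline, start=1):
    print(position, name)
```

Under these assumptions the 5 configured modules expand to 9 modules in total, with each implicit module placed where the table says it should run.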
Key properties:
| Line# | Property | Description |
|---|---|---|
| 08 | cluster.jobHeader | Each script will run on 1 node, 16 cores, and 128GB RAM for up to 30 minutes |
| 10 | pipeline.defaultProps | Default config file that defines most properties, in this case copperhead.properties |
| 12 | input.dirPaths | Directory path containing 40 gzipped whole genome sequencing (WGS) fastq files |
| 18 | metadata.filePath | Metadata file path: chinaMetadata.tsv |
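Because pipeline.defaultProps points to another config file, and copperhead.properties in turn names standard.properties as its own default, a property may be resolved through a chain of files. The sketch below shows one plausible way such a chain could be merged. It assumes that a value in the more specific file overrides the inherited one; the `FILES` dictionary stands in for files on disk and `resolve` is a hypothetical helper, not a BioLockJ function.

```python
# Illustrative sketch only: resolves a property through a chain of default files.
# The in-memory FILES dict stands in for files on disk; resolve() is hypothetical.

FILES = {
    "chinaKrakenFullDB.properties": {
        "pipeline.defaultProps": "copperhead.properties",
        "metadata.filePath": "chinaMetadata.tsv",
    },
    "copperhead.properties": {
        "pipeline.defaultProps": "standard.properties",
    },
    "standard.properties": {
        "input.suffixFw": "_R1",
    },
}


def resolve(file_name, files=FILES):
    """Merge a config with its defaults; the more specific file wins (assumed)."""
    props = files[file_name]
    default_name = props.get("pipeline.defaultProps")
    merged = resolve(default_name, files) if default_name else {}
    merged.update(props)  # values in the current file override inherited ones
    return merged


config = resolve("chinaKrakenFullDB.properties")
print(config["input.suffixFw"])     # inherited from standard.properties
print(config["metadata.filePath"])  # set directly in the project Config
```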
BioLockJ must associate each sequence file in input.dirPaths with the correct metadata row. This is done by matching sequence file names to the 1st column in the metadata file. If the Sample ID is not found in your file names, the file names must be updated. Use the following properties to ignore a file prefix or suffix when matching the sample IDs:
- input.suffixFw
- input.suffixRv
- input.trimPrefix
- input.trimSuffix
Sample IDs from 1st column of the metadata file: 081A, 082A, 083A...etc.
Sequence file names: 081A_R1.fq.gz, 082A_R1.fq.gz, 083A_R1.fq.gz...etc.
The default Config file, copperhead.properties, has its own default Config file standard.properties which defines the property input.suffixFw=_R1. As a result, all characters starting with (and including) “_R1” are ignored when matching the file name to the metadata sample ID.
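The matching described above can be sketched as follows. This is a minimal illustration of the stated rule (drop everything from the forward-read suffix onward, then compare with the metadata IDs), not the BioLockJ implementation; `sample_id_from_file` is a hypothetical helper and only the three sample IDs listed above are used.

```python
# Illustrative sketch only: derives a sample ID from a sequence file name by
# dropping everything from the forward-read suffix onward.
# sample_id_from_file is a hypothetical helper, not part of BioLockJ.

def sample_id_from_file(file_name, suffix_fw="_R1"):
    """Ignore all characters starting with (and including) suffix_fw."""
    index = file_name.find(suffix_fw)
    return file_name if index == -1 else file_name[:index]


metadata_ids = ["081A", "082A", "083A"]  # 1st column of the metadata file
file_names = ["081A_R1.fq.gz", "082A_R1.fq.gz", "083A_R1.fq.gz"]

matches = {name: sample_id_from_file(name) for name in file_names}
unmatched = [name for name, sid in matches.items() if sid not in metadata_ids]

print(matches)
print(unmatched)  # an empty list means every file maps to a metadata row
```

A file name that does not contain the suffix is returned unchanged, so it would show up in `unmatched` unless the whole name happens to equal a sample ID.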
- Run the biolockj command to start the pipeline.
> biolockj ~/chinaKrakenFullDB.properties
- Look in the BioLockJ pipeline output directory defined by $BLJ_PROJ for a new pipeline directory named after the property file + today’s date: ~/projects/chinaKrakenFullDB_2018Apr09
- The 5 configured modules have run in order, with the addition of 2 implicit modules (1st and last) which are added to all pipelines automatically.
- The biolockjComplete file indicates the pipeline ran successfully.
- Run the blj_summary command to review the pipeline execution summary.
> blj_summary
- Run the blj_download command to get the command needed to download the analysis.
> blj_download > rsync
- Open downloadDir on your local filesystem to review the analysis. This directory contains:
| Output | Description |
|---|---|
| /temp | Directory where R log files are saved if R script runs locally. |
| /tables | Directory containing the OTU tables. |
| /local | Directory where R script output is saved if R script runs locally and r.debug=Y. |
| *.RData | The saved R sessions for R modules run if r.saveRData=Y. |
| chinaKrakenFullDB.log | The pipeline Java log file. |
| MAIN_*.R | Each R script for each module that generated reports has been updated to run on your local filesystem. |
| *.tsv files | Spreadsheets containing p-value and R^2 statistics for each OTU in the taxonomy level. |
| *.pdf files | P-value histograms, and bar-charts or scatterplots for each OTU in the taxonomy level. |
- Each R module generates a report for each report.taxonomyLevel configured:
- The report begins with the unadjusted P-Value Distributions:
- Since r.numHistogramBreaks=20, each bar spans 0.05, so the 1st bar represents the p-values < 0.05. The ruralUrban attribute appears significant, as indicated by the high number of p-values < 0.05.
- For each OTU, a bar-chart or scatterplot is output with the adjusted parametric and non-parametric p-values shown in the plot header.
- The p-value format is defined by r.pValFormat.
- The p-adjust method is defined by rStats.pAdjustMethod.
- P-values that meet the r.pvalCutoff threshold are highlighted with r.colorHighlight.
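The link between the number of histogram breaks and the 0.05 threshold is simple arithmetic, sketched below. The p-values in the example are made-up numbers used only to demonstrate the binning; they are not results from this dataset, and the variable names are illustrative.

```python
# Illustrative sketch only: with 20 equal-width breaks on [0, 1], each histogram
# bar spans 0.05, so the 1st bar counts p-values below 0.05.
# The p-values below are made-up numbers for demonstration, not study results.

num_breaks = 20               # mirrors r.numHistogramBreaks=20
bin_width = 1.0 / num_breaks  # 0.05

p_values = [0.001, 0.012, 0.03, 0.04, 0.2, 0.47, 0.81, 0.95]

counts = [0] * num_breaks
for p in p_values:
    # p == 1.0 would index past the end, so clamp it into the last bin
    bin_index = min(int(p / bin_width), num_breaks - 1)
    counts[bin_index] += 1

print(bin_width)
print(counts[0])  # number of p-values in the 1st bar
```

A tall 1st bar relative to the others is what suggests that an attribute is associated with many OTUs, since unrelated attributes would be expected to give a roughly flat distribution.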