Commit cc2a4e3

Merge pull request #28 from AndersenLab/gcp-nextflow-container
Gcp nextflow container
2 parents 8a85b5f + 27f6f1a commit cc2a4e3

6 files changed

Lines changed: 302 additions & 5 deletions

Dockerfile

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+##############################################################################################################################
+#
+# This container includes all necessary components for initializing the NemaScan Nextflow pipeline in Google Cloud
+# Additional configuration options can be passed in via environment variables
+#
+##############################################################################################################################
+
+# Base image includes Google Cloud SDK tools
+FROM google/cloud-sdk:slim
+
+# Install OpenJDK JRE for Nextflow
+RUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends openjdk-11-jre wget procps
+
+LABEL Name="NemaScan-NXF" Author="Sam Wachspress"
+
+# Specify Nextflow version and mode
+# (21.05.0-edge is the first version to support configuring which service account acts as pipeline-runner)
+ENV NXF_VER=21.05.0-edge \
+    NXF_MODE=google \
+    NXF_EDGE=1
+
+WORKDIR /nemascan
+
+# Run the Nextflow install script (version and mode must be piped in to bash during install
+# or nextflow will initially download the latest version and only download and switch to NXF_VER when the container runs)
+RUN NXF_VER=21.05.0-edge NXF_MODE=google NXF_EDGE=1 \
+    wget -qO- https://get.nextflow.io | bash
+
+COPY nemascan-nxf.sh /nemascan/nemascan-nxf.sh
+COPY nextflow.config /nemascan/nextflow.config
+COPY main.nf /nemascan/main.nf
+COPY conf/* /nemascan/conf/
+COPY bin/* /nemascan/bin/
+
+
+# add nextflow and nemarun directory to the system path and make them executable
+ENV PATH="/nemascan:${PATH}"
+RUN chmod +x /nemascan/nemascan-nxf.sh /nemascan/nextflow
+
+WORKDIR /nemascan

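A note on the `RUN` step in the Dockerfile: in a shell pipeline, a `NAME=value` prefix applies only to the command it directly precedes. In `NXF_VER=... wget ... | bash` the assignments therefore reach `wget`, not the `bash` process that runs the installer. The earlier `ENV` instruction already exports the same variables to every later `RUN` step, which is most likely what makes the install pick up the pinned version. The sketch below demonstrates the scoping rule with a hypothetical variable, `DEMO_VER`, that is not part of the commit.

```shell
# DEMO_VER is a hypothetical variable used only for this demonstration.
unset DEMO_VER

# Prefix on the left-hand side of the pipe: only printf sees DEMO_VER.
prefix_on_producer() {
    DEMO_VER=21.05.0-edge printf '' | sh -c 'echo "${DEMO_VER:-unset}"'
}

# Prefix on the right-hand side of the pipe: the consuming shell sees it.
prefix_on_consumer() {
    printf '' | DEMO_VER=21.05.0-edge sh -c 'echo "${DEMO_VER:-unset}"'
}

prefix_on_producer   # prints "unset"
prefix_on_consumer   # prints "21.05.0-edge"
```

If the `ENV` line were ever removed, moving the prefix to the consuming side of the pipe (`wget -qO- ... | NXF_VER=... bash`) should keep the version pinned at build time.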
README.md

Lines changed: 207 additions & 1 deletion
@@ -51,7 +51,211 @@ nextflow self-update
 
 # Usage
 
-## Recommended: running remote from GitHub
+
+## Provisioning a Virtual Machine in Google Cloud
+You can create a virtual machine from the existing template: [nemascan-test-vm](https://console.cloud.google.com/compute/instanceTemplates/list?project=andersen-lab).
+
+Click on **Actions** -> **Create VM**
+
+Scroll to the bottom of the page and click **'Create'**
+
+Once the VM has been created, it should appear in the [list](https://console.cloud.google.com/compute/instances?project=andersen-lab).
+
+**Remember to Stop the VM and Delete it when you are done to avoid excess charges!!!**
+
+Click on 'SSH' to connect.
+
+Switch to the root user and install some required packages:
+```
+sudo su
+apt-get install docker.io git nano
+```
+
+
+## Using NemaScan Nextflow container
+Provision a virtual machine from the existing template and connect with SSH, then switch to the root user and install the required packages:
+```
+sudo su
+apt-get install docker.io git nano
+```
+
+To start a terminal session inside of the container, use:
+```
+docker run -i -t andersenlab/nemascan-nxf /bin/bash
+```
+
+Once you are done, exit the container with Ctrl-C or use:
+```
+exit
+```
+
+**Remember to Stop the VM and Delete it when you are done to avoid excess charges!!!**
+
+
+## Running NemaScan Nextflow container (this is what CeNDR does)
+To reproduce the details of a pipeline exactly, you can use the containerized version of the tool set to a specific release version.
+Provision a virtual machine from the existing template and connect with SSH, then switch to the root user and install the required packages:
+```
+sudo su
+apt-get install docker.io git nano
+```
+
+Use 'nano' to create a *.env file (see: example.env) with variables pointing to your Google Storage locations:
+```
+nano test.env
+```
+
+test.env:
+```
+TRAIT_FILE="gs://elegansvariation.org/reports/nemascan/abcd123/data.tsv"
+OUTPUT_DIR="gs://elegansvariation.org/reports/nemascan/abcd123/results"
+WORK_DIR="gs://nf-pipelines/workdir/abcd123"
+VCF_VERSION="20210121"
+```
+
+```
+docker run -i -t \
+--env-file test.env \
+andersenlab/nemascan-nxf:v0.01 \
+nemascan-nxf.sh
+```
+
+You can also pass them in as part of the command:
+```
+docker run -i -t \
+-e TRAIT_FILE="gs://elegansvariation.org/reports/nemascan/abcd123/data.tsv" \
+-e OUTPUT_DIR="gs://elegansvariation.org/reports/nemascan/abcd123/results" \
+-e WORK_DIR="gs://nf-pipelines/workdir/abcd123" \
+-e VCF_VERSION="20210121" \
+andersenlab/nemascan-nxf \
+nemascan-nxf.sh
+```
+
+**Remember to Stop the VM and Delete it when you are done to avoid excess charges!!!**
+
+
+
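One caveat worth checking before relying on the env file: `docker run --env-file` is generally reported to pass the text after `=` through verbatim rather than stripping quotes the way a shell does. A line such as `TRAIT_FILE="gs://..."` may therefore reach `nemascan-nxf.sh` with the quote characters included, whereas the `-e` form is unquoted by the shell first. The sketch below is one way to normalise a file; `strip_env_quotes` is a hypothetical helper, not part of the NemaScan repository.

```shell
# strip_env_quotes is a hypothetical helper, not part of the NemaScan repository.
# It prints an env file with one pair of surrounding double quotes removed from
# each value; lines without surrounding quotes pass through unchanged.
strip_env_quotes() {
    sed -E 's/^([A-Za-z_][A-Za-z0-9_]*)="(.*)"$/\1=\2/' "$1"
}
```

A possible use is `strip_env_quotes test.env > test.docker.env`, followed by `--env-file test.docker.env` in the `docker run` command. Whether this is needed depends on the Docker version in use, so it is worth confirming with `docker run --env-file test.env <image> env` first.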
+## Testing a new version of the container
+First you will have to build and test the container in Google Cloud. Provision a virtual machine from the existing template and connect with SSH, then switch to the root user and install the required packages:
+```
+sudo su
+apt-get install docker.io git nano
+```
+
+Clone the repository and check out the branch that contains the version you want to test (in this example, the branch is named 'gcp-nextflow-container'):
+```
+git clone https://github.com/AndersenLab/NemaScan.git
+cd NemaScan
+git checkout remotes/origin/gcp-nextflow-container
+git pull
+```
+
+If you aren't sure of the branch, you can list all available branches with:
+```
+git branch -a
+```
+
+Now you are ready to build the container for testing:
+```
+docker build -t "andersenlab/nemascan-nxf" .
+```
+
+If the container is built successfully you should be able to see the details with:
+```
+docker image list
+```
+example:
+```
+REPOSITORY TAG IMAGE ID CREATED SIZE
+andersenlab/nemascan-nxf latest bb4f296feec8 26 seconds ago 1.88GB
+```
+
+
+Now you can begin testing. To start the container and open a terminal prompt, substitute your container version's IMAGE ID value in the command below:
+```
+docker run -i -t bb4f296feec8 /bin/bash
+```
+
+Configure the pipeline options by setting the environment variables described in example.env. Substitute the path for your own test data:
+```
+export TRAIT_FILE="gs://elegansvariation.org/reports/nemascan/abcd123/data.tsv"
+export OUTPUT_DIR="gs://elegansvariation.org/reports/nemascan/abcd123/results"
+export WORK_DIR="gs://nf-pipelines/workdir/abcd123"
+export VCF_VERSION="20210121"
+```
+
+Run the pipeline:
+```
+./nemascan-nxf.sh
+```
+
+Because the pipeline takes so long to run, it is possible that your SSH session may time out and disconnect during the test.
+Reconnect to the VM with SSH and then list the running containers with:
+```
+docker container list
+```
+You should see an output similar to this:
+```
+CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
+d2f3dbb5e136 bb4f296feec8 "/bin/bash" 6 minutes ago Up 6 minutes jolly_cartwright
+```
+
+You can re-attach to the container by substituting your own CONTAINER ID in the command below:
+```
+docker attach d2f3dbb5e136
+```
+Warning: Nextflow only prints status updates sporadically, so you may not see any output for some time after attaching to the container.
+
+After the pipeline has completed, use Ctrl-C to exit the container, or type:
+```
+exit
+```
+
+
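The re-attach step in the testing instructions involves reading the CONTAINER ID out of the `docker container list` table by eye. As a small convenience, the lookup can be scripted. The sketch below assumes the default table layout shown in the README, where the first column is the container ID and the second is the image; `container_id_for_image` is a hypothetical helper name.

```shell
# container_id_for_image is a hypothetical helper. It reads the table printed by
# `docker container list` on standard input and prints the CONTAINER ID of every
# row whose IMAGE column equals the first argument. The header row is skipped.
container_id_for_image() {
    awk -v image="$1" 'NR > 1 && $2 == image { print $1 }'
}
```

On the VM this could be combined with the README's command as `docker attach "$(docker container list | container_id_for_image bb4f296feec8)"`, assuming exactly one container was started from that image.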
+## Publishing a new version of the container
+Once you have validated the pipeline against test data, you can publish a release to docker hub.
+Switch to the root user and log in with your credentials to docker hub:
+
+```
+sudo su
+docker login
+```
+
+List the containers, and select the IMAGE ID of the container you just tested:
+```
+docker image list
+```
+
+example:
+```
+REPOSITORY TAG IMAGE ID CREATED SIZE
+andersenlab/nemascan-nxf latest bb4f296feec8 4 hours ago 1.88GB
+google/cloud-sdk slim d6d0a7854ac3 28 hours ago 1.16GB
+```
+
+Substitute your IMAGE ID, then tag the container with the docker hub repository, container name, and version number. You can also omit the version number to make the selected version the default (or 'latest')
+```
+docker tag bb4f296feec8 andersenlab/nemascan-nxf
+```
+```
+docker tag bb4f296feec8 andersenlab/nemascan-nxf:v0.01
+```
+
+Publish the container to docker hub with a version tag:
+```
+docker push andersenlab/nemascan-nxf:v0.01
+```
+You can also publish the container to docker hub without a version number if you want it to be the default ('latest') version:
+```
+docker push andersenlab/nemascan-nxf
+```
+
+**Remember to Stop the VM and Delete it when you are done to avoid excess charges!!!**
+
+
+
+
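The tagging rule in the publishing instructions, where omitting the version makes the image the default, follows from Docker treating an untagged reference as `:latest`. A minimal sketch of that rule, using a hypothetical helper named `image_ref`, is shown here so the same reference can be reused for both `docker tag` and `docker push`.

```shell
# image_ref is a hypothetical helper: it joins a repository name and an optional
# version tag, falling back to "latest" when no tag is given (the same default
# Docker applies to an untagged reference).
image_ref() {
    echo "${1}:${2:-latest}"
}
```

For example, `docker tag bb4f296feec8 "$(image_ref andersenlab/nemascan-nxf v0.01)"` followed by `docker push "$(image_ref andersenlab/nemascan-nxf v0.01)"` keeps the two commands in step.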
+## Running remote from GitHub
 For reproducible pipelines, it is recommended to run NemaScan **without cloning the repo**. In this manner, you can also choose which branch and/or commit you wish to run. However, you can always clone the repo with:
 
 ```
@@ -68,6 +272,8 @@ nextflow run andersenlab/nemascan --debug
 
 To display the help message, run `nextflow run andersenlab/nemascan --help`
 
+
+
 # Profiles and Parameters
 
 ## Mappings Profile

conf/gcp.config

Lines changed: 2 additions & 4 deletions
@@ -7,7 +7,7 @@ process {
     executor = 'google-lifesciences'
 
     // change this container eventually
-    container = 'northwesternmti/nemascan:0.31'
+    container = 'andersenlab/nemascan-worker:v0.99'
 
     // add support for machine types
     //machineType = 'n1-standard-4'
@@ -66,10 +66,8 @@ params {
     // misc
     eigen_mem = "10 GB"
     date = new Date().format( 'yyyyMMdd' )
-    data_dir = "gs://nf-pipelines/NemaScan/input_data/"
-    bin_dir = "gs://nf-pipelines/NemaScan/bin/"
 
 }
 
-workDir = 'gs://nf-pipelines/workdir/'
+workDir = "${params.work_dir}"
 
example.env

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+TRAIT_FILE="gs://elegansvariation.org/reports/nemascan/abcd123/data.tsv"
+OUTPUT_DIR="gs://elegansvariation.org/reports/nemascan/abcd123/results"
+WORK_DIR="gs://nf-pipelines/workdir/abcd123"
+VCF_VERSION="20210121"
+DATA_DIR="gs://nf-pipelines/NemaScan/input_data"

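In example.env the trait file, output directory and work directory all share the report ID `abcd123`. If that pattern holds for real runs, the variables can be generated from the ID instead of being edited by hand. The sketch below assumes exactly the bucket layout shown in example.env; `nemascan_env_for_report` is a hypothetical helper and the path scheme is an assumption, not something the commit defines.

```shell
# nemascan_env_for_report is a hypothetical helper. The bucket layout is copied
# from example.env and is an assumption about how report IDs map to paths.
# Usage: nemascan_env_for_report REPORT_ID [VCF_VERSION]
nemascan_env_for_report() {
    report_id="$1"
    vcf_version="${2:-20210121}"   # same default as the example file
    echo "TRAIT_FILE=gs://elegansvariation.org/reports/nemascan/${report_id}/data.tsv"
    echo "OUTPUT_DIR=gs://elegansvariation.org/reports/nemascan/${report_id}/results"
    echo "WORK_DIR=gs://nf-pipelines/workdir/${report_id}"
    echo "VCF_VERSION=${vcf_version}"
}
```

Running `nemascan_env_for_report abcd123 > test.env` would produce a file with the same four values as the README example, written without quotes.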
main.nf

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ if(params.debug) {
     impute_vcf_index = Channel.fromPath("gs://elegansvariation.org/releases/${params.vcf}/variation/WI.${params.vcf}.impute.isotype.vcf.gz.tbi")
 
     ann_file = Channel.fromPath("gs://elegansvariation.org/releases/${params.vcf}/variation/WI.${params.vcf}.strain-annotation.${params.annotation}.tsv")
+    params.strains = "input_data/${params.species}/phenotypes/strain_file.tsv"
 } else if(!params.vcf) {
     // if there is no VCF date provided, pull the latest vcf from cendr.
     params.vcf = "20210121"

nemascan-nxf.sh

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+#!/bin/bash
+#
+# This script acts as a wrapper around the execution of Nextflow passing environment variables as arguments
+#
+###################################################################################################################
+
+DEFAULT_DATA_DIR="gs://nf-pipelines/NemaScan/input_data"
+DEFAULT_VCF_VERSION="20210121"
+
+# Environment variables with default values:
+
+if [[ -z "${VCF_VERSION}" ]]; then
+  VCF_VERSION=${DEFAULT_VCF_VERSION}
+  echo "VCF_VERSION environment variable is not set - defaulting to ${VCF_VERSION}"
+fi
+
+if [[ -z "${DATA_DIR}" ]]; then
+  DATA_DIR=${DEFAULT_DATA_DIR}
+  echo "DATA_DIR environment variable is not set - defaulting to ${DATA_DIR}"
+fi
+
+
+# Environment variables that MUST be set
+
+if [[ -z "${TRAIT_FILE}" ]]; then
+  echo "TRAIT_FILE environment variable must be set to the Google Storage path of the data"
+  exit 1
+fi
+
+if [[ -z "${OUTPUT_DIR}" ]]; then
+  echo "OUTPUT_DIR environment variable must be set to the Google Storage path of the output directory"
+  exit 1
+fi
+
+if [[ -z "${WORK_DIR}" ]]; then
+  echo "WORK_DIR environment variable must be set to the Google Storage path of the working directory"
+  exit 1
+fi
+
+
+nextflow run main.nf \
+  -profile gcp \
+  --traitfile "${TRAIT_FILE}" \
+  --vcf "${VCF_VERSION}" \
+  --work_dir "${WORK_DIR}" \
+  --out "${OUTPUT_DIR}" \
+  --data_dir "${DATA_DIR}"

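The wrapper's two kinds of check, defaults for optional variables and a hard stop for required ones, map onto two standard shell parameter expansions: `${VAR:-default}` and `${VAR:?message}`. The sketch below is a compact variant of the same logic, not the script from the commit. It is wrapped in a hypothetical function, `resolve_settings`, so it can be tried without Nextflow installed, and it prints the resolved values instead of launching the pipeline.

```shell
# resolve_settings is a hypothetical function mirroring the wrapper's checks.
# The parenthesised body runs in a subshell, so a failed ${VAR:?} check ends
# the subshell with a non-zero status rather than exiting the calling shell.
resolve_settings() (
    # Required: abort with a message on stderr if unset or empty.
    : "${TRAIT_FILE:?must be set to the Google Storage path of the data}"
    : "${OUTPUT_DIR:?must be set to the Google Storage path of the output directory}"
    : "${WORK_DIR:?must be set to the Google Storage path of the working directory}"

    # Optional: fall back to the same defaults the wrapper script uses.
    echo "vcf=${VCF_VERSION:-20210121}"
    echo "data_dir=${DATA_DIR:-gs://nf-pipelines/NemaScan/input_data}"
    echo "traitfile=${TRAIT_FILE}"
)
```

The explicit `if` blocks in the committed script have one advantage over this form: they print a notice when a default is applied, which the `:-` expansion does silently.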