Commit 438bf2f

Merge pull request #6 from StatFunGen/update_readme
update README and cleanup
2 parents fbb6e69 + 5b257d8 commit 438bf2f

13 files changed: 5 additions and 69 deletions

.gitignore (1 addition, 0 deletions)

@@ -0,0 +1 @@
+src/*.log
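
The effect of the new ignore rule can be sanity-checked with `git check-ignore`. A minimal sketch in a throwaway repository; the file names below are hypothetical examples, not files from this repo:

```shell
# Demo of the new src/*.log rule in a scratch repository.
# The .log and .sh file names are made up for illustration.
repo=$(mktemp -d)
git -C "$repo" init -q
printf 'src/*.log\n' > "$repo/.gitignore"
mkdir -p "$repo/src"
touch "$repo/src/f3si4u2_rstudio.log" "$repo/src/example.sh"
# Prints the path because src/*.log matches it
git -C "$repo" check-ignore src/f3si4u2_rstudio.log
# Exits non-zero: scripts in src/ are not covered by the rule
git -C "$repo" check-ignore -q src/example.sh || echo "not ignored"
```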

example/README.md (4 additions, 51 deletions)

@@ -4,7 +4,7 @@ Here we assume that
 
 1. the data to be analyzed are already uploaded to the S3 bucket by [the Data Admin](https://wanggroup.org/productivity_tips/memverge-aws#notes-for-data-admin).
 2. the analysis script is also available on S3 --- in this example the [xqtl-pipeline repo](https://github.com/cumc/xqtl-pipeline) is cloned to the bucket.
-3. the container image used for the analysis is the latest
+3. the FSx or EFS volume already has all necessary software installed
 
 We use the command below to submit the commands in `commands_to_submit.txt`.
 
@@ -18,9 +18,7 @@ username=aw3600
 -sg sg-02867677e76635b25 \
 -efs 10.1.10.210 \
 --job-script ./example/commands_to_submit.txt \
---oem-packages \
 --mount-packages \
--c 2 -m 16 \
 --job-size 100 \
 --mount "statfungen/ftp_fgc_xqtl:/home/$username/data,statfungen/ftp_fgc_xqtl/sos_cache/$username:/home/$username/.sos,statfungen/ftp_fgc_xqtl/analysis_result/finemapping_twas:/home/$username/output" \
 --mountOpt mode=r,mode=rw,mode=rw \
@@ -37,15 +35,15 @@ To explain the parameters,
 - `-g g-sidlpgb7oi9p48kxycpmn` and `-sg sg-02867677e76635b25` are the gateway ID and security group, respectively. You can ask your admin for these IDs. They configure the VM's networking.
 - `-efs 10.1.10.210` specifies the IP of the EFS used to access installed packages.
 - `--job-script ./example/commands_to_submit.txt` provides the actual commands we want to submit to the VM. Providing this specifies batch mode.
-- `--oem-packages` and `--mount-packages` are two modes that specify how the user can use certain packages. The former allows the user to use shared packages, and the latter allows the user to use user-installed packages. Specifying both will allow both these features, but either can be used.
+- `--oem-packages` and `--mount-packages` are two modes that specify how the user can use certain packages. The former allows the user to use shared packages, and the latter allows the user to use user-installed packages. One and only one can be used.
 - `-c 2` and `-m 16` specify that the VM should have 2 CPU threads and 16 GB of memory.
 - `--job-size 100` splits the commands (one per line in `commands_to_submit.txt`) into batches, each with at most 100 commands.
 - `--mount` includes three folders: the AWS folder `s3://statfungen/ftp_fgc_xqtl` is mounted to the VM as `~/data`; the AWS folder `s3://statfungen/ftp_fgc_xqtl/sos_cache/aw3600` is mounted to the VM as `~/.sos`; the AWS folder `statfungen/ftp_fgc_xqtl/analysis_result/finemapping_twas` is mounted to the VM as `~/output`. Notice how they are comma-separated.
 - `--mountOpt` specifies `mode=r` for the first folder, mounting it read-only for the analysis command. That means the analysis command cannot directly change or add anything in the `~/data` folder on the VM. The second folder is mounted with `mode=rw`, that is, the analysis command can write into the `~/.sos` folder on the VM. The third folder is also mounted with `mode=rw`, so we can write the outputs directly to that folder as they are generated. Notice how the options are comma-separated.
 - `--download` specifies the folder inside the S3 bucket that we would like to download to the VM at the beginning of the analysis. If any data has been downloaded using this option, you should update the file paths in `commands_to_submit.txt` accordingly, and **add `/` after the local folder in the download** (because we want to download into a folder). For instance, if we downloaded genotype data from `statfungen/ftp_fgc_xqtl/ROSMAP/genotype/analysis_ready/geno_by_chrom/` to the VM at `/home/$username/input/`, then the genotype data path in your `commands_to_submit.txt` should be specified as `../input`.
 - `--download-include` should be used to specify the prefix or suffix of the files you want to download from the S3 bucket.
 - `--ebs-mount` mounts a dedicated local EBS volume to the VM instance. When downloading data from an S3 bucket instead of using direct mounts, ensure you allocate sufficient storage space to the destination path by mounting a dedicated EBS volume. Its path must be different from the paths in `--mount`, which mounts folders on the S3 bucket.
-- `-jn` is the job name of the batch job. By default, the name of the batch job is the name of the image. If a job name is specified, a number suffix will be added to the job name. For example, if there were 10 jobs submitted with this command, you would see job names from `example_job_1` to `example_job_10`.
+- `-jn` is the job name of the batch job. If a job name is specified, a number suffix will be added to the job name. For example, if there were 10 jobs submitted with this command, you would see job names from `example_job_1` to `example_job_10`.
 - `--no-fail-fast` when this switch is turned on, all commands in a batch will be executed regardless of whether the previous ones failed or succeeded.
 
 To test this for yourself without submitting the job, please add `--dryrun` to the end of the command (e.g. right after `--no-fail-fast`) and run it on your computer. You should find a file called `commands_to_submit_1.mmjob.sh`; you can take a look at it to see the actual script that will be executed on the VM.
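
The batching behavior of `--job-size` can be pictured with plain coreutils. This is only a sketch of the arithmetic, not how `mm_jobman.sh` is implemented internally: 250 command lines with a batch size of 100 yield 3 batches.

```shell
# Sketch: group command lines into batches of at most 100,
# mimicking the effect of --job-size 100 on a commands file.
workdir=$(mktemp -d)
seq 1 250 | sed 's/^/echo command /' > "$workdir/commands_to_submit.txt"
split -l 100 "$workdir/commands_to_submit.txt" "$workdir/batch_"
ls "$workdir" | grep -c '^batch_'   # 250 commands / 100 per batch -> 3 batches
```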
@@ -56,57 +54,12 @@ To test this for yourself without submitting the job, please add `--dryrun` to t
 -g g-sidlpgb7oi9p48kxycpmn \
 -sg sg-02867677e76635b25 \
 -efs 10.1.10.210 \
---oem-packages \
 --mount-packages \
 -jn TEST_ROCKEFELLER_oem_mount_packages \
 -ide jupyter
 ```
 
 Some of these parameters are shared with the batch job above. They will be skipped in the following explanation:
-- `--oem-packages` and `--mount-packages` are two modes that specify how the user can use certain packages. The former allows the user to use shared packages, and the latter allows the user to use user-installed packages. Specifying both will allow both these features, but either can be used.
+- `--oem-packages` and `--mount-packages` are two modes that specify how the user can use certain packages. The former allows the user to use shared packages, and the latter allows the user to use user-installed packages. One and only one can be used.
 - `-jn` is the job name of the interactive job. By default, the name of the interactive job would be `<user>_<ide>_<port>`.
 - `-ide jupyter` specifies the IDE used for the interactive job, and providing it specifies an interactive job. By default, the shell session `tmate` is used; however, `jupyter`, `vscode`, and `rstudio` can be used instead.
-
-
-
-## Example `jupyter_setup.sh` command
-```bash
-bash jupyter_setup.sh -u <float_user> -p <password>
-```
-
-The parameters include:
-- `-u|--user` user name for your float account
-- `-p|--password` password for your float account
-- `-o|--OP_IP` the IP address of your opcenter, default is `54.81.85.209`
-- `-dv|--dataVolume` whether to mount to S3 or not, the options are `yes|no`, default is `yes`
-- `-s3|--s3_path` the data path on the S3 bucket to be mounted to the VM, default is `s3://statfungen/ftp_fgc_xqtl/`
-- `-vm|--VM_path` the VM path to be mounted to the S3 bucket, default is `/data/`
-- `-i|--image` image for the Jupyter notebook, default is `sos2:latest`
-- `-c|--core` default is `4`
-- `-m|--mem` default is `16`
-- `-pub|--publish` default is `8888:8888`
-- `-sg|--securityGroup` default
-
-## Example `hpc_jobman.sh` command
-
-This is designed for submitting jobs to our HPC.
-
-```
-bash archive/hpc_jobman.sh commands_to_submit.txt \
--c 3 -m 32 --cwd ~/output \
---walltime 40:00:00 --queue csg.q \
---entrypoint "source ~/mamba_activate.sh" \
---job-size 3 --job-name susie_rss_gwas \
---no-fail-fast --dryrun
-```
-
-Run the example command exactly as is; on your Mac is fine, or on the HPC. You should see screen output like this:
-
-```
-#-------------
-qsub /home/gw/Downloads/commands_to_submit_0.mmjob.sh
-```
-
-Check the contents of `commands_to_submit_0.mmjob.sh` to understand what it is. Then you can use this for analysis on the cluster to submit, e.g., 1300 jobs. To do so, you can set `--job-size 20`, so you will submit 1300 / 20 = 65 jobs. Each of these jobs will use 3 CPUs and 32 G of memory, which you can change. If you use multiple CPUs, the jobs will run in parallel in batches of the size specified by `--parallel-commands`, whose default value is set to `-c`.
-
-Once you are comfortable with the outcome of the `--dryrun`, you can remove `--dryrun` and run it on the HPC, which will submit all the jobs.
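
The pairing between `--mount` and `--mountOpt` described in this README is positional: the n-th option applies to the n-th mounted folder. A small bash sketch of that pairing, for illustration only (`mm_jobman.sh` does its own parsing):

```shell
# Pair each --mount entry with its --mountOpt entry by position.
username=aw3600
mounts="statfungen/ftp_fgc_xqtl:/home/$username/data,statfungen/ftp_fgc_xqtl/sos_cache/$username:/home/$username/.sos,statfungen/ftp_fgc_xqtl/analysis_result/finemapping_twas:/home/$username/output"
opts="mode=r,mode=rw,mode=rw"
IFS=',' read -r -a mount_list <<< "$mounts"
IFS=',' read -r -a opt_list <<< "$opts"
for i in "${!mount_list[@]}"; do
  s3_path=${mount_list[$i]%%:*}   # S3 folder
  vm_path=${mount_list[$i]#*:}    # mount point on the VM
  printf '%s -> %s [%s]\n' "$s3_path" "$vm_path" "${opt_list[$i]}"
done
```

So `~/data` is read-only while `~/.sos` and `~/output` are writable, exactly as the parameter explanation states.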

Deleted files:

src/2k1tyziipiydwtutzwmmn_rstudio.log (2 deletions)
src/cko2dsksc75lm9ztz21d9_tmate.log (1 deletion)
src/cko2dsksc75lm9ztz21d9_tmate_session.log (2 deletions)
src/f3si4u2lkjjvrpsn03s4p_rstudio.log (2 deletions)
src/g34dezb21v6ix18r9789f_tmate.log (1 deletion)
src/g34dezb21v6ix18r9789f_tmate_session.log (2 deletions)
src/pkj42xwwpi8m394eifvbo_tmate.log (1 deletion)
src/pkj42xwwpi8m394eifvbo_tmate_session.log (2 deletions)
