Foremost, thank you for making this pipeline!
I noticed unexpected silent bug when using a custom --host for host removal. In summary, even if I provided --host (or --reference) parameters, nextflow/amr++ was not reporting errors, but also not using the correct host for removal. Additionally, it appears that providing a --host_index is mandatory (not optional) when using the host removal functions. As a result, this might mean that the user thinks the pipeline is removing their custom host, but in reality isn't. This can impact interpretation of the data.
I included some reproducible examples below to explain.
I did find a solution (see below), but I just wanted to let you know of this potential hidden bug and provide some suggestions & contributions:
My suggestions are to:
-
Consolidate the documents to remove discrepencies in how-to use the
program (e.g. --host vs usage of --reference)
-
Clarify the use of --host_index either in the params.config file or in the documentation (example here)
-
Modify the code to use --host_index and --index to properly read in and use the null index parameter?
I will send a pull request for my suggestion 1. I am unsure how to address 2 (see what I tried below) I attempted to look into 3 but didn't solve it yet.
Please let me know if I can contribute/test things for this codebase.
Best,
Patricia
--
All these tests are performed using the test data provided in
AMRplusplus/data Nextflow: N E X T F L O W ~ version 23.10.1
The test genomes I used are the default chr21 in /data/host and the bovine and chicken reference genome from NCBI.
I first tested how using --host or --reference influenced the result, to resolve the discrepency in the doc linked above.
Control:
nextflow run main_AMR++.nf -profile conda --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --host "data/host/chr21.fasta.gz" --output "rm_host_default"
Removing bovine genome host using --host
nextflow run main_AMR++.nf -profile conda --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --host "GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "rm_host_using--host"
Removing bovine genome host using --reference
nextflow run main_AMR++.nf -profile conda --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --reference "GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "rm_host_using--reference"
Results
All worked without printing any errors, but the outcomes are unexpected: all report the same number of mapped & unmapped reads.
## Default ##
Sample NumberOfInputReads Mapped Unmapped
S2_test 40007 1288 38719
S1_test 60008 1496 58512
S3_test 60021 2607 57414
## Using --host ##
Sample NumberOfInputReads Mapped Unmapped
S2_test 40007 1288 38719
S1_test 60008 1496 58512
S3_test 60021 2607 57414
## Using --reference ##
Sample NumberOfInputReads Mapped Unmapped
S2_test 40007 1288 38719
S1_test 60008 1496 58512
S3_test 60021 2607 57414
Second strategy: changing the host
To test whether this was caused somehow by the genome of choice, I tested using the chicken genome as a host.
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/315/GCF_000002315.5_GRCg6a/GCF_000002315.5_GRCg6a_genomic.fna.gz
nextflow run main_AMR++.nf -profile conda --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --host "GCF_000002315.5_GRCg6a_genomic.fna.gz" --output "rm_host_using--host-chicken"
nextflow run main_AMR++.nf -profile conda --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --reference "GCF_000002315.5_GRCg6a_genomic.fna.gz" --output "rm_host_using--reference-chicken"
Results
Once again the same values are reported. More importantly, the same number of reads mapped + unmapped are reported as when we used the chr21 and the bovine genome, which is unexpected. No errors are reported by Nextflow/AMR++.
## chicken using --reference ##
Sample NumberOfInputReads Mapped Unmapped
S2_test 40007 1288 38719
S1_test 60008 1496 58512
S3_test 60021 2607 57414
## chicken using --host ##
Sample NumberOfInputReads Mapped Unmapped
S2_test 40007 1288 38719
S1_test 60008 1496 58512
S3_test 60021 2607 57414
Third test:
To test whether this was due to the --pipeline rm_host function, I decided to run --pipeline standard_AMR instead.
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --host "data/host/chr21.fasta.gz" --output "standard_all_default"
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --host "GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "standard_bovine_host"
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --reference "GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "standard_bovine_reference"
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --host "GCF_000002315.5_GRCg6a_genomic.fna.gz" --output "standard_chicken_host"
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --reference "GCF_000002315.5_GRCg6a_genomic.fna.gz" --output "standard_chicken_reference"
Results
Unfortunately, we get the same numbers:
cat standard_all*/Results/Stats/host.removal.stats
Sample NumberOfInputReads Mapped Unmapped
S2_test 27045 1258 25787
S1_test 34082 1446 32636
S3_test 59577 2600 56977
Sample NumberOfInputReads Mapped Unmapped
S2_test 27045 1258 25787
S1_test 34082 1446 32636
S3_test 59577 2600 56977
Sample NumberOfInputReads Mapped Unmapped
S2_test 27045 1258 25787
S1_test 34082 1446 32636
S3_test 59577 2600 56977
Sample NumberOfInputReads Mapped Unmapped
S2_test 27045 1258 25787
S1_test 34082 1446 32636
S3_test 59577 2600 56977
Sample NumberOfInputReads Mapped Unmapped
S2_test 27045 1258 25787
S1_test 34082 1446 32636
S3_test 59577 2600 56977
Fourth test:
In my next series of tests, I'm straight up deleting chr21.fasta from the /data folder and seeing if the command works. Nextflow is now properly reporting an error, and tells me that's it's expecting some index and it's not working! Upon further inspection, I found in theparams.config code these lines telling us to also use the --host_index genome, however this is not reported in the usage.md and configuration.md files.
To put that in practice, I tested:
--host_index null
--host_index NULL
--host_index 'null'
--host_index ""
However, none were giving me the expected results. Nextflow couldn't read the null parameter properly and AMR++ was telling me I needed an index file (so the loop about if param = null --> bwa index doesn't automatically go into effect)
Alternative solution
I don't know if I can/have time to fully troubleshoot the code at the moment, but if I do I will send a pull request. For now, I am running bwa prior to running the pipeline, and simply providing BOTH custom --host_index and --host parameters.
conda activate bwa
# if not done already: conda install -c bioconda bwa
bwa index /staging/ptran5/GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz
conda deactivate
Nextflow amr++ command using custom host but with default sample (S1, S2 and S3) data:
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --host_index "/staging/ptran5/GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz*" --reads "data/raw/*_R{1,2}.fastq.gz" --host "/staging/ptran5/GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "standard_bovine_reference_with_noindex"
Successful results:
(base) $ head rm_host_using--host_staging/Results/Stats/host.removal.stats
Sample NumberOfInputReads Mapped Unmapped
S2_test 27044 20757 6287
S1_test 34081 21576 12505
S3_test 59571 43946 15625
Foremost, thank you for making this pipeline!
I noticed unexpected silent bug when using a custom
--hostfor host removal. In summary, even if I provided--host(or--reference) parameters, nextflow/amr++ was not reporting errors, but also not using the correct host for removal. Additionally, it appears that providing a--host_indexis mandatory (not optional) when using the host removal functions. As a result, this might mean that the user thinks the pipeline is removing their custom host, but in reality isn't. This can impact interpretation of the data.I included some reproducible examples below to explain.
I did find a solution (see below), but I just wanted to let you know of this potential hidden bug and provide some suggestions & contributions:
My suggestions are to:
Consolidate the documents to remove discrepencies in how-to use the
program (e.g. --host vs usage of --reference)
Clarify the use of
--host_indexeither in theparams.configfile or in the documentation (example here)Modify the code to use --host_index and --index to properly read in and use the
nullindex parameter?I will send a pull request for my suggestion 1. I am unsure how to address 2 (see what I tried below) I attempted to look into 3 but didn't solve it yet.
Please let me know if I can contribute/test things for this codebase.
Best,
Patricia
--
All these tests are performed using the test data provided in
AMRplusplus/dataNextflow: N E X T F L O W ~ version 23.10.1The test genomes I used are the default chr21 in
/data/hostand the bovine and chicken reference genome from NCBI.I first tested how using --host or --reference influenced the result, to resolve the discrepency in the doc linked above.
Control:
Removing bovine genome host using
--hostRemoving bovine genome host using
--referenceResults
All worked without printing any errors, but the outcomes are unexpected: all report the same number of mapped & unmapped reads.
Second strategy: changing the host
To test whether this was caused somehow by the genome of choice, I tested using the chicken genome as a host.
Results
Once again the same values are reported. More importantly, the same number of reads mapped + unmapped are reported as when we used the chr21 and the bovine genome, which is unexpected. No errors are reported by Nextflow/AMR++.
Third test:
To test whether this was due to the
--pipeline rm_hostfunction, I decided to run--pipeline standard_AMRinstead.Results
Unfortunately, we get the same numbers:
Fourth test:
In my next series of tests, I'm straight up deleting chr21.fasta from the
/datafolder and seeing if the command works. Nextflow is now properly reporting an error, and tells me that's it's expecting some index and it's not working! Upon further inspection, I found in theparams.configcode these lines telling us to also use the--host_indexgenome, however this is not reported in theusage.mdandconfiguration.mdfiles.To put that in practice, I tested:
However, none were giving me the expected results. Nextflow couldn't read the
nullparameter properly and AMR++ was telling me I needed an index file (so the loop about if param = null --> bwa index doesn't automatically go into effect)Alternative solution
I don't know if I can/have time to fully troubleshoot the code at the moment, but if I do I will send a pull request. For now, I am running
bwaprior to running the pipeline, and simply providing BOTH custom--host_indexand--hostparameters.Nextflow amr++ command using custom host but with default sample (S1, S2 and S3) data:
Successful results: