diff --git a/README.md b/README.md
index 67374c4..48329e4 100644
--- a/README.md
+++ b/README.md
@@ -29,15 +29,18 @@

 ![](./assets/pipeline_light.svg)

-Supported profilers:
+Supported profilers and current status:

 1. [**HUMANn v3**](https://huttenhower.sph.harvard.edu/humann/) — functional profiling via MetaPhlAn + HUMANn 3 (`--run_humann_v3`)
 2. [**HUMANn v4**](https://huttenhower.sph.harvard.edu/humann/) — functional profiling via MetaPhlAn + HUMANn 4 (`--run_humann_v4`)
 3. [**FMH FunProfiler**](https://github.com/dib-lab/fmh_funprofiler) — sketch-based functional profiling (`--run_fmhfunprofiler`)
 4. [**RGI**](https://github.com/arpcard/rgi) — antimicrobial resistance gene identification (`--run_rgi`, available)
 5. [**mifaser**](https://bromberglab.org/project/mifaser/) — functional profiling via mifaser (`--run_mifaser`, available)
-6. [**DIAMOND**](https://github.com/bbuchfink/diamond) — alignment with DIAMOND blastx (`--run_diamond`, available)
-7. [**eggNOG-mapper**](https://academic.oup.com/mbe/article/38/12/5825/6379734) — functional annotation, orthology assignments and domain prediction (`--run_eggnogmapper`, available)
+6. [**DIAMOND**](https://github.com/bbuchfink/diamond) — alignment with DIAMOND blastx (`--run_diamond`, work in progress / beta)
+7. [**eggNOG-mapper**](https://academic.oup.com/mbe/article/38/12/5825/6379734) — functional annotation, orthology assignments and domain prediction (`--run_eggnogmapper`, work in progress / beta)
+
+> [!WARNING]
+> DIAMOND and eggNOG-mapper support is currently in beta and should be treated as work in progress. These modules are still being validated in the full pipeline, including database handling, output behavior, and downstream reporting. Use them with caution, expect potential issues, and independently review results before using them for production analyses or interpretation.

 ## Usage

diff --git a/docs/output.md b/docs/output.md
index b5a7710..079f4f0 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -14,8 +14,8 @@ The pipeline processes data using the following steps:
 - [HUMANn v3 / v4](#humann-v3--v4) - Functional profiling via MetaPhlAn + HUMANn
 - [FMH FunProfiler](#fmh-funprofiler) - Sketch-based functional profiling
 - [mifaser](#mifaser) - Read-level functional profiling
-- [DIAMOND blastx](#diamond-blastx) - Translated alignment against a protein database
-- [EggNOG-mapper](#eggnog-mapper) - Functional annotation via orthology assignment
+- [DIAMOND blastx](#diamond-blastx) - Translated alignment against a protein database (work in progress / beta)
+- [EggNOG-mapper](#eggnog-mapper) - Functional annotation via orthology assignment (work in progress / beta)
 - [RGI BWT](#rgi-bwt) - Antimicrobial resistance gene identification
 - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
@@ -80,6 +80,9 @@ Enabled with `--run_mifaser`. Maps reads to functional databases at the protein

 Enabled with `--run_diamond`. Performs fast translated alignment of metagenomic reads against a protein reference database. Each read is aligned in all six reading frames and only significant hits are reported.

+> [!WARNING]
+> DIAMOND support is currently in beta and should be treated as work in progress. The module is still being validated in the full pipeline, including database handling, output behavior, and downstream reporting. Use with caution and independently review results before production use or interpretation.
+
 Output files

@@ -97,6 +100,9 @@ Requires a pre-built `.dmnd` database (see [usage docs](usage.md#diamond-blastx)

 Enabled with `--run_eggnogmapper`. Assigns functional annotations to sequences by mapping them to orthologous groups in the EggNOG database.

+> [!WARNING]
+> EggNOG-mapper support is currently in beta and should be treated as work in progress. The module is still being validated in the full pipeline, including database handling, output behavior, and downstream reporting. Use with caution and independently review results before production use or interpretation.
+
 Output files

diff --git a/docs/usage.md b/docs/usage.md
index 6c2c875..14d5f1b 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -41,6 +41,23 @@ SAMPLE3,RUN1,OXFORD_NANOPORE,/data/sample3_nanopore.fastq.gz,,

 In this example, `SAMPLE1` has two runs which will be merged before profiling. `SAMPLE2` is single-end short reads. `SAMPLE3` is Oxford Nanopore long reads.

+## Enabling profilers
+
+At least one profiler must be enabled via command-line flags. The pipeline will only run the profilers you explicitly turn on:
+
+| Flag | Profiler | Status |
+| ---------------------- | --------------- | ----------------------- |
+| `--run_humann_v3` | HUMANn v3 | Available |
+| `--run_humann_v4` | HUMANn v4 | Available |
+| `--run_fmhfunprofiler` | FMH FunProfiler | Available |
+| `--run_mifaser` | mifaser | Available |
+| `--run_diamond` | diamond | Work in progress / beta |
+| `--run_eggnogmapper` | EggNOG-mapper | Work in progress / beta |
+| `--run_rgi` | RGI BWT | Available |
+
+> [!IMPORTANT]
+> Each `--run_` flag requires a matching database entry in the `--databases` CSV. Database rows for tools that are not enabled will be ignored.
+
 ## Databases input

 ```bash
@@ -49,6 +66,8 @@ In this example, `SAMPLE1` has two runs which will be merged before profiling. `

 The databases sheet is a comma-separated file that specifies which databases to use for each profiler. Only tools enabled via `--run_` flags will use the corresponding database entries.

+Use the `db_name` column to record the database release or version used for the run, for example `uniref90_v3`, `eggnog_v5`, `card_v3`, or `GS-24-all`.
+
 | Column | Required | Description |
 | ----------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `tool` | Yes | Profiler name. Must be one of: `humann_v3`, `humann_v4`, `fmhfunprofiler`, `mifaser`, `diamond`, `rgi`, `eggnogmapper`. |
@@ -60,7 +79,7 @@ The databases sheet is a comma-separated file that specifies which databases to

 ### HUMANn databases

-HUMANn requires four database components per named database, each as a separate row with the same `db_name`:
+HUMANn requires four database components per named database, each as a separate row with the same `db_name`. The example below uses a HUMANn v3-compatible UniRef90 database set; replace `uniref90_v3` with the exact release or version used in your analysis.

 ```csv
 tool,db_name,db_entity,db_params,db_type,db_path
@@ -72,7 +91,7 @@ humann_v3,uniref90_v3,humann_utility,,,/data/databases/utility_mapping

 ### FMH FunProfiler databases

-FMH FunProfiler requires a single sketch database:
+FMH FunProfiler requires a single sketch database. The example below uses a KEGG-derived sketch database labeled `kegg_v1`; replace this with the exact sketch/database version used in your analysis.

 ```csv
 tool,db_name,db_entity,db_params,db_type,db_path
@@ -81,7 +100,10 @@ fmhfunprofiler,kegg_v1,,,short;long,/data/databases/fmhfunprofiler_kegg.sig.zip

 ### EggNOG-mapper databases

-EggNOG-mapper requires two database entries per named database: the search database and the EggNOG data directory. The `db_params` field of the `eggnogmapper_db` row must specify the search mode (e.g. `diamond`, `mmseqs`, `hmmer`).
+EggNOG-mapper requires two database entries per named database: the search database and the EggNOG data directory. The `db_params` field of the `eggnogmapper_db` row must specify the search mode (e.g. `diamond`, `mmseqs`, `hmmer`). The example below uses an EggNOG v5 database label; replace `eggnog_v5` with the exact EggNOG database release used in your analysis.
+
+> [!WARNING]
+> EggNOG-mapper support is currently in beta and should be treated as work in progress. Database handling, output behavior, and downstream reporting are still being validated in the full pipeline, so use with caution and independently review results before production use or interpretation.

 ```csv
 tool,db_name,db_entity,db_params,db_type,db_path
@@ -97,15 +119,17 @@ eggnogmapper,eggnog_v5,eggnogmapper_data_dir,,,/data/databases/eggnog_mapper/dat

 #### Database preparation

-Download a pre-built mifaser database (e.g. GS-21 or GS-580) from the [mifaser website](https://bromberglab.org/project/mifaser/). The `db_path` should point to the directory containing the database files.
+Download a pre-built mifaser database (e.g. GS-21, GS-24-all, or GS-580) from the [mifaser website](https://bromberglab.org/project/mifaser/). The `db_path` should point to the directory containing the database files and `db_name` should record the downloaded database version.

 ```csv
 tool,db_name,db_entity,db_params,db_type,db_path
-mifaser,gs21,,,short,/data/databases/mifaser/GS-21
+mifaser,GS-24-all,,,short,/data/databases/mifaser/GS-24-all
 ```

 ### Full example databases sheet

+This example uses versioned database names to make the database releases traceable in the run outputs. Replace these names and paths with the exact database releases you downloaded.
+
 ```csv
 tool,db_name,db_entity,db_params,db_type,db_path
 humann_v3,uniref90_v3,humann_metaphlan,,,/data/databases/metaphlan_db
@@ -125,7 +149,7 @@ fmhfunprofiler,kegg_v1,,,short;long,/data/databases/fmhfunprofiler_kegg.sig.zip

 #### Database preparation

-Download the CARD database and extract it to a directory:
+Download the CARD database and extract it to a directory. The example CSV below labels the database as `card_v3`; replace this with the exact CARD release used in your analysis.

 ```bash
 wget https://card.mcmaster.ca/latest/data
@@ -147,9 +171,12 @@ rgi,card_v3,,,,/data/databases/card

 [DIAMOND](https://github.com/bbuchfink/diamond/wiki/) is a high-throughput sequence aligner for translated (nucleotide-vs-protein) alignment. Enable it with `--run_diamond`.

+> [!WARNING]
+> DIAMOND support is currently in beta and should be treated as work in progress. Database handling, output behavior, and downstream reporting are still being validated in the full pipeline, so use with caution and independently review results before production use or interpretation.
+
 #### Database preparation

-The database supplied in the `--databases` CSV must already be in DIAMOND binary format (`.dmnd`). Build it from a protein FASTA using `diamond makedb`:
+The database supplied in the `--databases` CSV must already be in DIAMOND binary format (`.dmnd`). Build it from a versioned protein FASTA using `diamond makedb`, and use `db_name` to record the source database and release.

 ```bash
 diamond makedb --in proteins.faa --db proteins
@@ -186,20 +213,6 @@ work # Directory containing the nextflow working files
 # Other nextflow hidden files, eg. history of pipeline runs and old logs.
 ```

-### Enabling profilers
-
-At least one profiler must be enabled via command-line flags. The pipeline will only run the profilers you explicitly turn on:
-
-| Flag | Profiler | Status |
-| ---------------------- | --------------- | --------- |
-| `--run_humann_v3` | HUMANn v3 | Available |
-| `--run_humann_v4` | HUMANn v4 | Available |
-| `--run_fmhfunprofiler` | FMH FunProfiler | Available |
-| `--run_mifaser` | mifaser | Available |
-| `--run_diamond` | diamond | Available |
-| `--run_eggnogmapper` | EggNOG-mapper | Available |
-| `--run_rgi` | RGI BWT | Available |
-
 ### Parameters

 If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file.
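
The `docs/usage.md` changes above state that each `--run_` flag requires a matching database entry in the `--databases` CSV. That rule could be checked before launching a run with something like the standalone sketch below. It is not part of the pipeline: the helper name `missing_databases` is hypothetical, and it assumes that each flag maps to a `tool` value by dropping the `--run_` prefix, which matches the flag and tool names listed in the docs but is not confirmed by the diff.

```python
import csv
import io

# Tool names copied from the `tool` column description in docs/usage.md.
KNOWN_TOOLS = {
    "humann_v3", "humann_v4", "fmhfunprofiler",
    "mifaser", "diamond", "rgi", "eggnogmapper",
}


def missing_databases(enabled_tools, databases_csv_text):
    """Return enabled tools that have no row in the databases sheet.

    Hypothetical helper: assumes `--run_<tool>` corresponds to `tool == <tool>`.
    """
    unknown = set(enabled_tools) - KNOWN_TOOLS
    if unknown:
        raise ValueError(f"Unknown tool name(s): {sorted(unknown)}")
    reader = csv.DictReader(io.StringIO(databases_csv_text))
    tools_with_rows = {row["tool"].strip() for row in reader if row.get("tool")}
    return sorted(set(enabled_tools) - tools_with_rows)


# Example sheet built from rows shown in the usage docs.
sheet = """tool,db_name,db_entity,db_params,db_type,db_path
mifaser,GS-24-all,,,short,/data/databases/mifaser/GS-24-all
rgi,card_v3,,,,/data/databases/card
"""

# `diamond` is enabled here but has no database row, so it is reported.
print(missing_databases(["mifaser", "diamond"], sheet))
```

The sketch only checks that a row exists for each enabled tool. It does not check the `db_entity` rows that HUMANn and EggNOG-mapper need, nor whether `db_path` exists.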