miFRED wiki
- Download the miFRED folder from this repository
- Unzip the training set files inside the Data folder
- Download the MICROPHERRET folder from GitHub and move it inside miFRED/Data/:
git clone https://github.com/MetabioinfomicsLab/MICROPHERRET/
- Download the MICROPHERRET ML models from the designated MEGA folder linked in MetabioinfomicsLab/MICROPHERRET/
Ensure that the saved_models folder is placed inside the MICROPHERRET directory within miFRED/Data/ so that each model follows this path structure:
miFRED/Data/MICROPHERRET/saved_models/
- miFRED requires a dedicated conda environment, which can be created as follows using the miFRED_environment.yml file in Data:
conda env create -f miFRED_env.yml
- Make scripts executable using
chmod +x script_name
miFRED requires a Linux system and at least 6 CPUs due to the computational demands of the MICROPHERRET training process.
-g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER
Directory where genomes fasta files are stored
-x GENOMES_EXTENSION, --genomes_extension GENOMES_EXTENSION
Genome fasta files extension (default: .fa)
--all_genomes ALL_GENOMES
Multi-fasta file obtained by concatenating all single fasta files. Automatically generated by miFRED in the input generation step when -g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER is provided. Incompatible with -g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER
--binning_file BINNING_FILE
.txt file with each line listing a scaffold and the corresponding genome/bin name, tab-separated. Automatically generated by miFRED in the input generation step by processing -g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER with the additional script inputwriter.py. Incompatible with -g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER; required if --all_genomes ALL_GENOMES is specified.
-r READS_FOLDER, --reads_folder READS_FOLDER
Directory where metagenomic reads fastq files are stored. Incompatible with
-B BAM_FILES, --bam_files BAM_FILES
-u {True,False}, --unpaired_reads {True,False}
Required if -r READS_FOLDER, --reads_folder READS_FOLDER is specified. True if fastq files for unpaired reads are also provided in READS_FOLDER, False otherwise (default: True)
-B BAM_FILES, --bam_files BAM_FILES
Directory where sample-specific sorted.bam files and their indexes are stored. They can be generated automatically by the miFRED input generation steps, which align -r READS_FOLDER, --reads_folder READS_FOLDER against the multi-fasta file (either --all_genomes ALL_GENOMES or the result of processing -g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER). If bam files are provided, check that the names of the aligned contigs match the names in the -g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER fasta files. If they do not, provide the multi-fasta directly with --all_genomes ALL_GENOMES and specify the correct contig names in --binning_file BINNING_FILE
-A EGGNOG_ANNOTATION, --eggnog_annotation EGGNOG_ANNOTATION
Either directory where genomes eggNOG .annotations files are stored or .csv file obtained from parsing .annotations files, with genomes as rows, KO as columns and KO counts as values. First column must be named "Genomes"
-db EGGNOG_DATABASE, --eggnog_database EGGNOG_DATABASE
Directory where eggNOG-mapper database is stored, used to obtain .annotations files by launching eggNOG-mapper on genomes fasta files in the input generation steps. Incompatible with
-A EGGNOG_ANNOTATION, --eggnog_annotation EGGNOG_ANNOTATION
-sm {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}, --eggnog-sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}
eggNOG-mapper Diamond search option (default: sensitive). Incompatible with
-A EGGNOG_ANNOTATION, --eggnog_annotation EGGNOG_ANNOTATION
-f FUNCTIONS_LIST, --functions_list FUNCTIONS_LIST
.txt file containing MICROPHERRET functions to be considered for the calculation, one per line. (default: 86 functions whose models were accurate on test set, stored in functions.txt)
-s TRAINING_SETS, --training_sets TRAINING_SETS
Folder containing the dataset.csv and dataset_acetoclastic_methanogenesis.csv files to be used in the training. (default: ./training_sets/ )
-m MICROPHERRET_PREDICTIONS, --micropherret_predictions MICROPHERRET_PREDICTIONS
.csv file containing MICROPHERRET predictions for all the genomes; the first column, which holds the genome names, must be unnamed. Incompatible with -k, --KO
-k, --KO
Calculation must be performed using KO and not MICROPHERRET phenotypes. Incompatible with -m MICROPHERRET_PREDICTIONS, --micropherret_predictions MICROPHERRET_PREDICTIONS.
If neither -k, --KO nor -m MICROPHERRET_PREDICTIONS, --micropherret_predictions MICROPHERRET_PREDICTIONS is specified, MICROPHERRET is launched to predict the functions in -f FUNCTIONS_LIST, --functions_list FUNCTIONS_LIST for FRED calculation.
If -k, --KO or -m MICROPHERRET_PREDICTIONS, --micropherret_predictions MICROPHERRET_PREDICTIONS are specified, -f FUNCTIONS_LIST, --functions_list FUNCTIONS_LIST and -s TRAINING_SETS, --training_sets TRAINING_SETS are ignored.
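Users who pass --all_genomes must also write the --binning_file themselves. The following Python sketch shows one way to produce both files from a folder of per-genome fasta files. It is not the code of inputwriter.py: the function name build_inputs, the demo file names, and the choice to take the scaffold name as the header text up to the first whitespace are all assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def build_inputs(genomes_folder, extension, multi_fasta, binning_file):
    """Concatenate per-genome fasta files and list scaffold/genome pairs.

    Hypothetical helper: the genome name is the file name without its
    extension, and every header is assumed to carry a scaffold name.
    """
    pairs = []
    fasta_files = sorted(Path(genomes_folder).glob(f"*{extension}"))
    with open(multi_fasta, "w") as fasta_out:
        for fasta in fasta_files:
            genome = fasta.name[: -len(extension)]
            with open(fasta) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # scaffold name = header up to the first whitespace
                        pairs.append((line[1:].split()[0], genome))
                    fasta_out.write(line + "\n")
    # one "scaffold<TAB>genome" pair per line
    with open(binning_file, "w") as out:
        for scaffold, genome in pairs:
            out.write(f"{scaffold}\t{genome}\n")
    return pairs


# Small demo on temporary files; outputs are written outside the genomes
# folder so that the multi-fasta is not picked up as an input genome.
with tempfile.TemporaryDirectory() as tmp:
    genomes = Path(tmp, "genomes")
    genomes.mkdir()
    Path(genomes, "bin1.fa").write_text(">scaffold_1 len=4\nACGT\n>scaffold_2\nGGCC\n")
    Path(genomes, "bin2.fa").write_text(">scaffold_3\nTTAA\n")
    pairs = build_inputs(genomes, ".fa", Path(tmp, "all_genomes.fa"), Path(tmp, "binning.txt"))
    binning_text = Path(tmp, "binning.txt").read_text()

print(binning_text, end="")
```

If bam files are supplied with -B, the scaffold names written here need to match the contig names used in the alignment.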
Parameters that control the calculation, aimed at excluding spurious associations.
-c COVERED_GENOME_FRACTION, --covered_genome_fraction COVERED_GENOME_FRACTION
Minimum fraction of genome with coverage higher than 0 (breadth of coverage) (default: 0.10)
-t RELATIVE_ABUNDANCE_THRESHOLD, --relative_abundance_threshold RELATIVE_ABUNDANCE_THRESHOLD
Minimum relative abundance required to consider a genome as present in a sample (default: 0)
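To make the two thresholds concrete, here is a minimal Python sketch of a presence filter. It assumes per-base depths are available as a list of integers and that both thresholds are inclusive (a genome passes when its value is greater than or equal to the threshold); the function names and the inclusive comparison are assumptions, not a description of how miFRED implements the check.

```python
def breadth_of_coverage(depths):
    """Fraction of genome positions with coverage higher than 0."""
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth > 0) / len(depths)


def genome_is_present(depths, relative_abundance, min_breadth=0.10, min_abundance=0.0):
    """Apply the -c and -t thresholds (inclusive comparison is an assumption)."""
    return (
        breadth_of_coverage(depths) >= min_breadth
        and relative_abundance >= min_abundance
    )


# A genome with reads on only 5% of its positions is discarded at -c 0.10.
sparse = [0] * 95 + [3] * 5
# A genome covered on half of its positions is kept.
covered = [0, 2, 0, 4]

print(breadth_of_coverage(sparse), genome_is_present(sparse, 0.2))
print(breadth_of_coverage(covered), genome_is_present(covered, 0.2))
```

Raising -c removes genomes that attract a few stray reads on a small region, which is the kind of spurious association these parameters are meant to exclude.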
-o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
Output directory
-p PROCESSORS, --processors PROCESSORS
Number of threads (default: 5)
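The compatibility rules stated in the option descriptions above can be summarised as a small checker. This Python sketch only restates those rules; the function check_options and its dictionary keys are hypothetical and do not reflect miFRED's own argument handling. Options that have defaults (such as -u and -sm) are left out because their presence cannot be detected from the value alone.

```python
def check_options(opts):
    """Return a list of messages for violated option rules.

    `opts` maps long option names to values; None or False means "not given".
    """
    def given(name):
        return opts.get(name) not in (None, False)

    # pairs of options documented as incompatible with each other
    incompatible = [
        ("genomes_folder", "all_genomes"),
        ("genomes_folder", "binning_file"),
        ("reads_folder", "bam_files"),
        ("eggnog_database", "eggnog_annotation"),
        ("micropherret_predictions", "KO"),
    ]
    errors = []
    for first, second in incompatible:
        if given(first) and given(second):
            errors.append(f"--{first} is incompatible with --{second}")
    # --binning_file is required whenever --all_genomes is specified
    if given("all_genomes") and not given("binning_file"):
        errors.append("--binning_file is required with --all_genomes")
    return errors


bad = check_options({"genomes_folder": "genomes/", "all_genomes": "all.fa"})
good = check_options({"all_genomes": "all.fa", "binning_file": "bins.txt", "bam_files": "bams/"})
print(bad)
print(good)
```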
The following folders are generated in the output folder specified by the user.
It stores the results of miFRED's input generation step procedure. The list can change depending on which files were already provided by the user.
- all_genomes.fa
fasta file obtained by concatenating the files provided by the user. It is the reference used in the alignment procedure. Generated if -g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER is specified
- info.txt
.txt file with each line listing a scaffold and the corresponding genome/bin name, tab-separated. Generated if -g GENOMES_FOLDER, --genomes_folder GENOMES_FOLDER is specified
- nreads.txt
.txt file with number of reads mapped to each sample, used for relative abundance normalisation
- bowtie2 folder
sorted and indexed bam files obtained by aligning reads against all_genomes.fa with bowtie2 and processing the resulting files with Samtools. Generated if -r READS_FOLDER is specified
- eggnog_annotations folder
contains eggNOG-mapper results. Generated if -A EGGNOG_ANNOTATION, --eggnog_annotation EGGNOG_ANNOTATION is not specified
- annotation_matrix.csv
matrix with the genetic information (KO copy number) per genome, with genomes as rows and KO as columns; input required for prediction of phenotypes. Generated if a .csv file is not already provided with -A EGGNOG_ANNOTATION, --eggnog_annotation EGGNOG_ANNOTATION
- KO_for_calculation.csv
matrix used to calculate FRED if the -k, --KO parameter is specified, with genomes as rows, KO as columns and 0/1 values to indicate presence/absence
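The relationship between annotation_matrix.csv (KO copy numbers) and KO_for_calculation.csv (presence/absence) can be illustrated with a short Python sketch. It assumes KO identifiers arrive as comma-separated strings of the form "ko:K00001,ko:K00002" with "-" for no hit, which is how recent eggNOG-mapper versions fill the KEGG_ko column; the helper names and the toy genomes are hypothetical, and this is not miFRED's parser.

```python
from collections import Counter


def ko_counts(kegg_ko_fields):
    """Count KO occurrences over the KEGG_ko values of one genome."""
    counts = Counter()
    for field in kegg_ko_fields:
        if field in ("", "-"):
            continue  # gene without a KO assignment
        for ko in field.split(","):
            counts[ko[3:] if ko.startswith("ko:") else ko] += 1
    return counts


def count_matrix(per_genome):
    """Genomes as rows, KO as columns; first column is named "Genomes"."""
    kos = sorted({ko for counts in per_genome.values() for ko in counts})
    header = ["Genomes"] + kos
    rows = [[genome] + [per_genome[genome][ko] for ko in kos]
            for genome in sorted(per_genome)]
    return header, rows


def presence_absence(rows):
    """Turn copy numbers into 0/1 values, keeping the genome name."""
    return [[row[0]] + [1 if value > 0 else 0 for value in row[1:]] for row in rows]


per_genome = {
    "bin1": ko_counts(["ko:K00001,ko:K00002", "ko:K00001", "-"]),
    "bin2": ko_counts(["ko:K00002"]),
}
header, counts = count_matrix(per_genome)
binary = presence_absence(counts)
print(header)
print(counts)
print(binary)
```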
If MICROPHERRET models are used, it stores the results of the ML predictions:
- predict_functions.csv
matrix with predicted functions per genome
- predict_sum.csv
number of genomes predicted to perform each function
They are generated only if neither -k, --KO nor -m MICROPHERRET_PREDICTIONS, --micropherret_predictions MICROPHERRET_PREDICTIONS is specified
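As an illustration of how the two prediction files relate, the per-function totals are the column sums of the genome-by-function matrix. The sketch below assumes the matrix holds 0/1 values and has an unnamed first column with genome names, as required for the -m input; the function name function_totals and the column "function_b" are made up for the example.

```python
import csv
import io


def function_totals(csv_text):
    """Sum each function column of a genome-by-function 0/1 matrix."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    functions = header[1:]  # first column (genome names) is unnamed
    totals = dict.fromkeys(functions, 0)
    for row in reader:
        for function, value in zip(functions, row[1:]):
            totals[function] += int(value)
    return totals


# Toy predictions for three genomes (illustrative values only)
predictions = (
    ",acetoclastic_methanogenesis,function_b\n"
    "bin1,1,0\n"
    "bin2,1,1\n"
    "bin3,0,0\n"
)
totals = function_totals(predictions)
print(totals)
```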
It stores results of FRED calculation procedure:
- fredc.csv file
stores the results of the FREDc calculation for each analysed sample. Additional metrics are also included: alpha diversity (Gini-Simpson index, GSI), a non-normalised version of FREDc (FREDc_tian) and Rao’s entropy for functional diversity.
- freds.csv
stores results of FREDs calculation for each function for each analysed sample
- FREDs_statistic.csv
FREDs statistics for each analysed function and sample
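Two of the additional metrics reported in fredc.csv have standard textbook definitions, sketched here in Python: the Gini-Simpson index, 1 - sum(p_i^2), and Rao's quadratic entropy, the sum over all ordered genome pairs of d_ij * p_i * p_j. This is only an illustration of those definitions under the assumption that miFRED uses them in this form; the FREDc and FREDs formulas themselves are not reproduced, and the function names are hypothetical.

```python
def relative(abundances):
    """Scale abundances so that they sum to 1."""
    total = sum(abundances)
    return [value / total for value in abundances]


def gini_simpson(abundances):
    """Gini-Simpson index: 1 minus the sum of squared relative abundances."""
    return 1 - sum(p * p for p in relative(abundances))


def rao_entropy(abundances, distances):
    """Rao's quadratic entropy over all ordered pairs (i, j)."""
    p = relative(abundances)
    n = len(p)
    return sum(distances[i][j] * p[i] * p[j] for i in range(n) for j in range(n))


abundances = [1, 1]
# Two genomes that share no functions: pairwise distance 1
distances = [[0.0, 1.0],
             [1.0, 0.0]]
gsi = gini_simpson(abundances)
rao = rao_entropy(abundances, distances)
print(gsi, rao)
```

When every pair of distinct genomes is at distance 1, Rao's entropy equals the Gini-Simpson index, which is why the two values coincide in this toy case.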
Intermediate results are also provided:
- jaccard_distances.csv
stores the pairwise functional distance for each genome pair
- used_relative_abundances.csv
relative abundances computed by miFRED and used for FRED calculation
- normalised_relative_abundances.csv
relative abundances computed by miFRED, normalised by the number of mapped reads
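For readers unfamiliar with the measure behind jaccard_distances.csv, the following Python sketch computes Jaccard distances between genomes from their sets of functions (or KO). The distance is 1 - |A ∩ B| / |A ∪ B|. Treating two empty profiles as distance 0 is an assumption, as are the function names and toy profiles; how miFRED handles that edge case is not stated in this page.

```python
from itertools import combinations


def jaccard_distance(first, second):
    """1 - |intersection| / |union| for two sets of functions."""
    union = first | second
    if not union:
        return 0.0  # assumption: two empty profiles are treated as identical
    return 1 - len(first & second) / len(union)


def pairwise_distances(profiles):
    """Distance for every unordered genome pair, keyed by (genome, genome)."""
    return {
        (g1, g2): jaccard_distance(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }


profiles = {
    "bin1": {"f1", "f2"},
    "bin2": {"f2", "f3"},
    "bin3": {"f1", "f2"},
}
distances = pairwise_distances(profiles)
for pair, value in distances.items():
    print(pair, round(value, 3))
```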
FREDc and FREDs distribution plots are also provided to aid the analysis.
Examples of miFRED input and output files are available here: Examples