- pandas
- numpy
- getopt
- hmmlearn (version: 0.2.5)
- pyarrow
- pytables
- pickle
- tqdm
- h5py
- tempfile
- multiprocessing (for multiprocess version)
- threading (for multiprocess version)
- logging (for multiprocess version)
- Command:
generate_emission-probs.py -i path_to/control_dataset -o path_to/output.tsv -f path_to/genome_name.fa -
- Optional Parameters:
-cNumber of reads to load at once (default: 1000000).-tTotal number of reads to run (default: 100000).-mMinimum methylation fraction (default: 0).-rMinimum read length (default: 1000).-sCount of reads until you write a checkpoint output (default: 10000).
- Purpose: Generates emission probabilities if you want to use your own different control datasets.
- Note: Pre-generated accessible and inaccessible probability TSV files are provided, based on experimental data described in our manuscript. Even with different organisms, changes to PacBio technology, and improvements to methylation calling we haven't found major changes in these control datasets. It's also not really necessary to run this on an entire dataset: the default parameters for total read count is more than enough to get stable probabilities. You will need to run this on both an accessible and inaccessible dataset (see our manuscript). These are also encoded in the model, so if you aren't training a new model you also don't need to worry about these.
- Command:
encode_context.py -i path_to/genome_name.fa -o path_to/genome_name.h5 - Purpose: Encodes the genome based on the hexamer sequence context into an HDF5 file for quick lookup.
- Usage: Run once per genome.
- Command:
train_model.py -i path_to/dataset_1.bed,path_to/dataset_2.bed,etc. -g path_to/genome_name.h5 -p path_to/accessible_probs.tsv,path_to/inaccessible_probs.tsv -o path_to_output_directory - Optional Parameters:
-cNumber of iterations to run.-rTotal number of reads to use across all datasets.-sRandom seed for reproducibility.-bColumn number (0-based) in BED files with reference methylation starts (e.g., 11 for m6A output from fibertools, 28 by default for full output).-eHow many bases to mask on both ends of the read as 0% methylation probability (default is 10). Required due to the fact that fibertools needs a window to call methylations.-oDirectory path for storing output files (models, list of reads used in training).-mMinimum fraction of methylations required in a read (default: 0).
- Purpose: Trains the model on a set of reads from specified datasets. Outputs the best model, a list of models in pickle format, and a TSV of reads used in training.
- Usage: In general, run once per organism. Model parameters should remain consistent across similar datasets.
- Command:
- Single-core version:
apply_model.py -i path_to/dataset.bed -m path_to/best_model.pickle -t path_to/training-reads.tsv -g path_to/genome_name.h5 -o path_to_output_directory - Multi-core version:
apply_model_multiprocess.py -i path_to/dataset.bed -m path_to/best_model.pickle -t path_to/training-reads.tsv -g path_to/genome_name.h5 -o path_to_output_directory
- Single-core version:
- Optional Parameters:
-lMinimum footprints allowed per read (default: 0).-rEnable circular mode (default: off).-bColumn number (0-based) in BED files with reference methylation starts (e.g., 12 for m6A output from fibertools, 28 by default for full output).-sChunk size (default: 50000).-eHow many bases to mask on both ends of the read as 0% methylation probability (default is 10). Required due to the fact that fibertools needs a window to call methylations.-mMinimum fraction of methylations required in a read (default: 0).- Multiprocess-specific:
-cCore count for parallel processing (default: all available CPU cores, typically I recommend 4-8 for stability).-xTimeout in seconds before restarting the pool (default: core count * 100).-dExisting tempdir. Specify the path to an existing temp directory if the script previously failed to complete. This will skip quickly through all chunks of the bedfile already footprint-called, and then resume after the last chunk read previously. It is essential to use the same chunksize here (and should be the same parameters as the first time you ran it), as this is based on the # in the tempfile name.
- Purpose: Applies the trained model to remaining data. Outputs results in BED12 format, including footprint starts and lengths. Multiprocess version offers faster execution but may require tuning for stability.
- Note: When using circular mode, reads are tiled 3x, affecting footprint count/length; further downstream processing is required.
The repository includes a folder Example files containing necessary files to run FiberHMM on a short sample Drosophila dataset, enabling you to test each step of the workflow with default parameters:
- accessible_probs.tsv and inaccessible_probs.tsv – probability files for accessible and inaccessible regions (
-pparameter). - dm6.fa – Drosophila melanogaster (dm6) reference genome in FASTA format. This file is too large for github, so please download the file yourself.
- dm6_example.bed – example FiberHMM m6A-only output file for m6A modifications. You should specify
-b 12to use the correct column.
- dm6_example_model.pickle – trained model for the example dataset.
- dm6.h5 – genome context database. This file is too big to upload, so you need to generate the file using dm6.fa.
- dm6_example_training-reads.tsv – reads used in model training.
- dm6_example_fp.bed – final footprint output.
The Example models folder is intended to house pre-trained models for the community. Currently, it contains only the dm6 model, but I’ll continue adding more as I generate them. Community contributions are welcome—if you’ve trained a model using FiberHMM, feel free to share it here for others to use!