Example Files

Required Packages:

pandas
numpy
getopt
hmmlearn (version: 0.2.5)
pyarrow
pytables
pickle
tqdm
h5py
tempfile
multiprocessing (for multiprocess version)
threading (for multiprocess version)
logging (for multiprocess version)

Script Workflow:

1. Generate emission probabilities (optional)

Command: generate_emission-probs.py -i path_to/control_dataset -o path_to/output.tsv -f path_to/genome_name.fa
- Optional Parameters:
- -c Number of reads to load at once (default: 1000000).
- -t Total number of reads to run (default: 100000).
- -m Minimum methylation fraction (default: 0).
- -r Minimum read length (default: 1000).
- -s Count of reads until you write a checkpoint output (default: 10000).
Purpose: Generates emission probabilities if you want to use your own different control datasets.
Note: Pre-generated accessible and inaccessible probability TSV files are provided, based on experimental data described in our manuscript. Even with different organisms, changes to PacBio technology, and improvements to methylation calling we haven't found major changes in these control datasets. It's also not really necessary to run this on an entire dataset: the default parameters for total read count is more than enough to get stable probabilities. You will need to run this on both an accessible and inaccessible dataset (see our manuscript). These are also encoded in the model, so if you aren't training a new model you also don't need to worry about these.

2. Encode context

Command: encode_context.py -i path_to/genome_name.fa -o path_to/genome_name.h5
Purpose: Encodes the genome based on the hexamer sequence context into an HDF5 file for quick lookup.
Usage: Run once per genome.

3. Train model

Command: train_model.py -i path_to/dataset_1.bed,path_to/dataset_2.bed,etc. -g path_to/genome_name.h5 -p path_to/accessible_probs.tsv,path_to/inaccessible_probs.tsv -o path_to_output_directory
Optional Parameters:
- -c Number of iterations to run.
- -r Total number of reads to use across all datasets.
- -s Random seed for reproducibility.
- -b Column number (0-based) in BED files with reference methylation starts (e.g., 11 for m6A output from fibertools, 28 by default for full output).
- -e How many bases to mask on both ends of the read as 0% methylation probability (default is 10). Required due to the fact that fibertools needs a window to call methylations.
- -o Directory path for storing output files (models, list of reads used in training).
- -m Minimum fraction of methylations required in a read (default: 0).
Purpose: Trains the model on a set of reads from specified datasets. Outputs the best model, a list of models in pickle format, and a TSV of reads used in training.
Usage: In general, run once per organism. Model parameters should remain consistent across similar datasets.

4. Apply model

Command:
- Single-core version: apply_model.py -i path_to/dataset.bed -m path_to/best_model.pickle -t path_to/training-reads.tsv -g path_to/genome_name.h5 -o path_to_output_directory
- Multi-core version: apply_model_multiprocess.py -i path_to/dataset.bed -m path_to/best_model.pickle -t path_to/training-reads.tsv -g path_to/genome_name.h5 -o path_to_output_directory
Optional Parameters:
- -l Minimum footprints allowed per read (default: 0).
- -r Enable circular mode (default: off).
- -b Column number (0-based) in BED files with reference methylation starts (e.g., 12 for m6A output from fibertools, 28 by default for full output).
- -s Chunk size (default: 50000).
- -e How many bases to mask on both ends of the read as 0% methylation probability (default is 10). Required due to the fact that fibertools needs a window to call methylations.
- -m Minimum fraction of methylations required in a read (default: 0).
- Multiprocess-specific:
  - -c Core count for parallel processing (default: all available CPU cores, typically I recommend 4-8 for stability).
  - -x Timeout in seconds before restarting the pool (default: core count * 100).
  - -d Existing tempdir. Specify the path to an existing temp directory if the script previously failed to complete. This will skip quickly through all chunks of the bedfile already footprint-called, and then resume after the last chunk read previously. It is essential to use the same chunksize here (and should be the same parameters as the first time you ran it), as this is based on the # in the tempfile name.
Purpose: Applies the trained model to remaining data. Outputs results in BED12 format, including footprint starts and lengths. Multiprocess version offers faster execution but may require tuning for stability.
Note: When using circular mode, reads are tiled 3x, affecting footprint count/length; further downstream processing is required.

Example Files

The repository includes a folder Example files containing necessary files to run FiberHMM on a short sample Drosophila dataset, enabling you to test each step of the workflow with default parameters:

Input Files

accessible_probs.tsv and inaccessible_probs.tsv – probability files for accessible and inaccessible regions (-p parameter).
dm6.fa – Drosophila melanogaster (dm6) reference genome in FASTA format. This file is too large for github, so please download the file yourself.
dm6_example.bed – example FiberHMM m6A-only output file for m6A modifications. You should specify -b 12 to use the correct column.

Expected Output Files

dm6_example_model.pickle – trained model for the example dataset.
dm6.h5 – genome context database. This file is too big to upload, so you need to generate the file using dm6.fa.
dm6_example_training-reads.tsv – reads used in model training.
dm6_example_fp.bed – final footprint output.

Model Sharing

The Example models folder is intended to house pre-trained models for the community. Currently, it contains only the dm6 model, but I’ll continue adding more as I generate them. Community contributions are welcome—if you’ve trained a model using FiberHMM, feel free to share it here for others to use!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Required Packages:

Script Workflow:

1. Generate emission probabilities (optional)

2. Encode context

3. Train model

4. Apply model

Example Files

Input Files

Expected Output Files

Model Sharing

About

Uh oh!

Releases 10

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
Example datasets		Example datasets
Example models		Example models
README.md		README.md
accessible_probs.tsv		accessible_probs.tsv
apply_model.py		apply_model.py
apply_model_multiprocess.py		apply_model_multiprocess.py
encode_context.py		encode_context.py
inaccessible_probs.tsv		inaccessible_probs.tsv
train_model.py		train_model.py

fiberseq/FiberHMM

Folders and files

Latest commit

History

Repository files navigation

Required Packages:

Script Workflow:

1. Generate emission probabilities (optional)

2. Encode context

3. Train model

4. Apply model

Example Files

Input Files

Expected Output Files

Model Sharing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 10

Packages 0

Languages

Packages