PyFitSeq is a Python-based fitness estimation tool for pooled amplicon sequencing studies. It is a Python re-implementation of the MATLAB tool FitSeq (https://github.com/sashaflevy/Fit-Seq). If you use this software, please cite: F. Li, et al. Unbiased Fitness Estimation of Pooled Barcode or Amplicon Sequencing Studies. Cell Systems, 7: 521-525 (2018). PyFitSeq is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
It currently has two main functions:
- evo_simulator.py performs simulations of competitive pooled growth of a population of genotypes.
- pyfitseq.py calculates the fitness of each genotype from read-count time-series data.
A walk-through is included as a Jupyter notebook.
- Python 3 is required. This version has been tested on a MacBook Pro (3.1 GHz Intel Core i5), with Python 3.7.4.
- Clone this repository by running the following in a terminal:
  git clone https://github.com/FangfeiLi05/PyFitSeq.git
- cd to the root directory of the project (the folder containing README.md).
- Install dependencies by running:
  pip install -r requirements.txt
- Install pyfitseq by running:
  pip install .
OR
- To install without cloning the repository, run:
  python3 -m pip install git+https://github.com/darachm/PyFitSeq.git
With the closing of Singularity Hub there aren't yet publicly available containers for this, but you can build your own with a command like:
singularity build fitseq-latest.simg Singularity.fitseq-latest
Then you can run on any computer running Singularity, such as your local HPC, using a command like:
singularity exec fitseq-latest.simg pyfitseq.py -h
evo_simulator.py models competitive pooled growth of a population of genotypes with different fitnesses. The simulation can include several sources of noise: growth noise and noise from cell transfers, DNA extraction, PCR, and sequencing.
- --input or -i: a .csv file, with
  - 1st column of .csv: fitness of each genotype, [x1, x2, ...]
  - 2nd column of .csv: initial cell number of each genotype at generation 0, [n1, n2, ...]
- --t_seq or -t: time-points evaluated in number of generations (format: 0 t1 t2 ...)
- --read_num_average_seq or -r: average number of reads per genotype for each time-point (format: 0 r1 r2 ...)
- --noise_option or -n: which types of noise to include in the simulation; by default all sources are included (default: growth bottleneck_transfer DNA_extraction PCR sequencing)
- --dna_copies or -d: average genome copy number per genotype used as template in PCR (default: 500)
- --pcr_cycles or -p: number of cycles of PCR (default: 25)
- --fitness_type or -f: type of fitness: Wrightian fitness (w), or Malthusian fitness (m) (default: m)
- --output_filename or -o: prefix of output .csv files (default: output)
- output_filename_EvoSimulation_Read_Number.csv: read number per genotype for each time-point
- output_filename_EvoSimulation_Mean_Fitness.csv: mean fitness for each time-point
- output_filename_EvoSimulation_Input_Log.csv: a record of all inputs
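As a rough illustration of what competitive pooled growth means, the sketch below advances a pool by one generation under Malthusian fitness while holding the total population size fixed. It is a noise-free textbook model written for this README, not the code inside evo_simulator.py; the function names `mean_fitness` and `grow_one_generation` and the example numbers are hypothetical.

```python
import math


def mean_fitness(cell_counts, fitnesses):
    """Population mean fitness, weighted by the cell number of each genotype."""
    total = sum(cell_counts)
    return sum(n * x for n, x in zip(cell_counts, fitnesses)) / total


def grow_one_generation(cell_counts, fitnesses):
    """Advance the pool by one generation (Malthusian fitness, no noise).

    Each genotype grows in proportion to exp(x_i); the pool is then rescaled
    to its original total size, so genotypes only gain cells at the expense
    of less fit competitors. This is an assumed, simplified model.
    """
    weights = [n * math.exp(x) for n, x in zip(cell_counts, fitnesses)]
    scale = sum(cell_counts) / sum(weights)
    return [w * scale for w in weights]


# Three hypothetical genotypes starting at equal abundance.
counts = [1000.0, 1000.0, 1000.0]
fitnesses = [0.0, 0.1, -0.1]
for generation in range(1, 4):
    counts = grow_one_generation(counts, fitnesses)
    print(generation,
          [round(n, 1) for n in counts],
          round(mean_fitness(counts, fitnesses), 4))
```

In this toy model the mean fitness rises over time as fitter genotypes take over the pool, which is why the simulator reports mean fitness per time-point alongside the read counts.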
python evo_simulator.py --help
python evo_simulator.py -i input_EvoSimulation.csv -t 0 3 6 9 12 -r 50 50 50 50 50 -o output
python evo_simulator.py -i input_EvoSimulation.csv -t 0 2 4 6 8 -r 75 75 75 75 50 -n DNA_extraction PCR sequencing -d 300 -p 27 -f w -o output
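The examples above expect an input file such as input_EvoSimulation.csv. A minimal sketch for generating one is shown below, following the two-column layout described for --input (fitness, then initial cell number). It assumes the file has no header row, which this README does not state; the helper name `write_simulation_input`, the Gaussian fitness distribution, and its parameters are arbitrary choices made for illustration.

```python
import csv
import random


def write_simulation_input(path, num_genotypes, initial_cells=100,
                           fitness_sd=0.1, seed=0):
    """Write a two-column .csv: fitness, initial cell number (one row per genotype).

    Assumption: no header row. Fitness values are drawn from a normal
    distribution centred on 0 purely as an example.
    """
    rng = random.Random(seed)
    rows = [(rng.gauss(0.0, fitness_sd), initial_cells)
            for _ in range(num_genotypes)]
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rows


rows = write_simulation_input("input_EvoSimulation.csv", num_genotypes=5)
for fitness, cells in rows:
    print(f"{fitness:+.4f}", cells)
```

Fixing the seed keeps the generated fitness values reproducible between runs, which helps when comparing estimated fitness against the simulated ground truth.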
pyfitseq.py estimates the fitness of each genotype from read-count time-series data.
- --input or -i: a .csv file, with each column being the read number per genotype at each sequenced time-point
- --t_seq or -t: sequenced time-points in number of generations (format: 0 t1 t2 ...)
- --max_iter_num or -m: maximum number of iterations in the optimization (smaller values can reduce running time at the cost of accuracy) (default: 10)
- --kappa or -k: a noise parameter that characterizes the total noise introduced by growth, cell transfer, DNA extraction, PCR, and sequencing (to measure kappa empirically, see the reference: S. F. Levy, et al. Quantitative Evolutionary Dynamics Using High-resolution Lineage Tracking. Nature, 519: 181–186 (2015)) (default: 2.5)
- --regression_num or -g: number of points used in the initial linear-regression-based fitness estimate (default: 2)
- --fitness_type or -f: type of fitness: Wrightian fitness (w), or Malthusian fitness (m) (default: m)
- --output_filename or -o: prefix of output .csv files (default: output)
- output_filename_FitSeq_Result.csv: a .csv file, with
  - 1st column of .csv: estimated fitness of each genotype, [x1, x2, ...]
  - 2nd column of .csv: log likelihood value of each genotype, [f1, f2, ...]
  - 3rd column of .csv: estimated mean fitness per sequenced time-point, [x_mean(0), x_mean(t1), ...]
  - 4th+ columns of .csv: estimated read number per genotype per time-point, with each time-point being a column
python pyfitseq.py --help
python pyfitseq.py -i output_EvoSimulation_Read_Number.csv -t 0 3 6 9 12 -o output
python pyfitseq.py -i output_EvoSimulation_Read_Number2.csv -t 0 2 6 8 -m 12 -k 2 -g 3 -f w -o output
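To give a feel for the initial linear-regression-based estimate that --regression_num controls, the sketch below fits a least-squares slope to each genotype's log read frequency over the first few time-points. Under Malthusian fitness that slope approximates a genotype's fitness relative to the population mean. This is an illustrative sketch only, not the code in pyfitseq.py; the function name `log_frequency_slopes`, the `pseudocount` parameter, and the example read counts are assumptions.

```python
import math


def log_frequency_slopes(read_counts, t_seq, regression_num=2, pseudocount=1.0):
    """Least-squares slope of log frequency vs. time for each genotype.

    read_counts: one row per genotype, one column per sequenced time-point.
    t_seq: sequenced time-points in generations, e.g. [0, 3, 6, 9, 12].
    Only the first `regression_num` time-points are used. The pseudocount
    (an assumption of this sketch) avoids log(0) for genotypes with no reads.
    """
    times = t_seq[:regression_num]
    # Total reads at each time-point used, to convert counts to frequencies.
    depth = [sum(row[k] for row in read_counts) for k in range(regression_num)]
    t_mean = sum(times) / len(times)
    denom = sum((t - t_mean) ** 2 for t in times)

    slopes = []
    for row in read_counts:
        log_freq = [math.log((row[k] + pseudocount) / depth[k])
                    for k in range(regression_num)]
        y_mean = sum(log_freq) / len(log_freq)
        numer = sum((t - t_mean) * (y - y_mean)
                    for t, y in zip(times, log_freq))
        slopes.append(numer / denom)
    return slopes


# Hypothetical read counts for three genotypes at time-points 0, 3, 6.
reads = [
    [100, 80, 60],
    [100, 100, 100],
    [100, 130, 170],
]
print(log_frequency_slopes(reads, t_seq=[0, 3, 6], regression_num=3))
```

A crude estimate like this ignores how the mean fitness changes over time, which is presumably why it serves only as a starting point for the iterative optimization bounded by --max_iter_num.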