Method scPlOver for inferring DNA content from overlapping fragments in scDNA-seq data.
COMING SOON: end-to-end pipeline to run scPlOver starting from a BAM file (including counting fragment overlaps and running HMMCopy)
Tested version in parentheses (scPlOver is likely flexible to different versions)
- Python (3.9)
- numpy (1.23.0)
- scipy (1.13.1)
- pandas (2.1.4)
- statsmodels (0.14.1)
- anndata (0.10.3)
- click (8.1.7)
See test_input/FUCCI.h5ad for example.
Anndata of cells x bins, with obs indices containing cell IDs, var indices containing bins in chr:start-end format, and layers:
Xcontaining read countstatecontaining integer copy number stateoverlapscontaining the number of fragment overlapsoverlap_basescontaining the number of overlap basesn_fragmentscontaining the total number of fragmentsmean_fragment_lengthcontaining the average fragment length
Required var fields:
chr: chromosomestart: bin start positionend: bin end positiongc: GC contentin_blacklist: flag indicating whether bin is present in blacklistmap: mappability (currently unused)
python run_scplover_adata.py \
--adata <path> \
--output_row <path> \
--output_table <path> \
--output_adata <path> \
--cell_df_dir <dir> \
[options]
Running on test data (should take on the order of 10 minutes with 8 cores):
python run_scplover_adata.py \
--adata test/FUCCI_test_input.h5ad \
--output_row test/output/FUCCI_test_best_fits.csv \
--output_table test/output/FUCCI_test_full_table.csv \
--output_adata test/output/FUCCI_test_adata.h5ad \
--cell_df_dir test/output/cell_dfs \
--max_k 12 \
--covariance_type full \
--min_mean_scale 0.8 \
--max_mean_scale 1.2 \
--bases_dist_quantile 0.8 \
--lowess_frac 0.2 \
--fit_transitions \
--iqr_threshold 2 \
--min_bins_per_state 50
--cores 8
Typical usage in paper for experimental datasets:
python run_scplover_adata.py --adata {input.adata} \
--output_row {output.row} \
--output_table {output.full_table} \
--output_adata {output.adata} \
--cell_df_dir {params.cell_df_dir} \
--max_k 12 \
--cells {params.cells_arg} \
--iqr_threshold 2 \
--covariance_type full \
--correct_gc \
--bases_dist_quantile 0.8 \
--lowess_frac 0.2 \
--min_mean_scale 0.8 \
--max_mean_scale 1.2 \
--means scale \
--fit_transitions \
--min_bins_per_state 50 \
| Argument | Description |
|---|---|
--adata |
Path to input .h5ad file containing read counts and overlap bases per bin per cell |
--output_row |
Output CSV with one row per cell (best-scoring model result) |
--output_table |
Output CSV with all results across all ploidy initializations |
--output_adata |
Output .h5ad with ghmm_state layer added containing inferred copy number states |
--cell_df_dir |
Directory to write per-cell regression DataFrames (one CSV per cell) |
| Argument | Default | Description |
|---|---|---|
--cells |
all cells | Comma-separated list of cell IDs to process |
--cells_file |
— | File with one cell ID per line to process |
| Argument | Default | Description |
|---|---|---|
--max_k |
12 |
Maximum copy number state (max supported: 29) |
--covariance_type |
full |
Covariance structure: full, diag, or spherical |
--means |
fixed |
Mean treatment: fixed (held at initial values), scale (learned per-feature scale factor), or free (fully learned) |
--fit_transitions |
False |
If set, learn transition matrix from data (default: fixed) |
--include_increments |
False |
If set, also explore ploidy initializations formed by integer increments from the input copy number states (otherwise, consider multiples only) |
| Argument | Default | Description |
|---|---|---|
--min_mean_scale |
0 |
Lower bound on per-feature mean scale factor |
--max_mean_scale |
inf |
Upper bound on per-feature mean scale factor |
--scale_reads |
False |
If set, apply mean scaling to the reads dimension; otherwise only the overlap bases dimension is scaled |
| Argument | Default | Description |
|---|---|---|
--iqr_threshold |
None |
IQR multiplier for outlier removal per state (e.g. 5); disabled if not set |
--min_bins_per_state |
0 |
Minimum number of bins a state must have; bins in rarer states are removed before fitting |
| Argument | Default | Description |
|---|---|---|
--correct_gc |
False |
If set, apply LOWESS-based modal quantile GC correction to reads and overlap bases |
--lowess_frac |
0.2 |
Fraction of data used in LOWESS smoothing during GC correction |
--clip_corrected_values |
False |
If set, clip GC-corrected values to a valid range |
--bases_dist_quantile |
0.8 |
Quantile of the overlap bases distribution used during GC correction |
| Argument | Default | Description |
|---|---|---|
--cores |
1 |
Number of parallel worker processes for fitting cells |