Skip to content

Commit 4de82fe

Browse files
committed
Document markII
1 parent 0348b22 commit 4de82fe

7 files changed

Lines changed: 103 additions & 32 deletions

File tree

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,9 @@ This project adheres to [Semantic Versioning](http://semver.org/).
44

55
## [Unreleased]
66

7+
### Added
8+
- Panel design notebooks (#157).
9+
710
### Fixed
811
- Debugged a test that counts observed haplotypes (#154).
912
- Replaced global pooled Ae values with 26-population average as the default Ae reported (#155).

notebooks/panel-design/markII/README

Lines changed: 0 additions & 14 deletions
This file was deleted.
Lines changed: 90 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,90 @@
1+
# Panel design algorithm, mark II
2+
3+
## Running the procedure
4+
5+
- Make sure software prerequisites are installed (see below)
6+
- Make sure databases are in place (see below)
7+
- Edit the JSON config file `config-example.json` to point to the correct databases (or fiddle with the parameters if you're brave)
8+
- Run the Snakemake workflow
9+
10+
```
11+
snakemake --configfiles config-example.json -c 1 -p
12+
```
13+
14+
### Software prerequisites
15+
16+
- intervaltree
17+
- matplotlib
18+
- networkx
19+
- pandas
20+
- polars
21+
- scipy
22+
- snakemake
23+
- tagore
24+
- tqdm
25+
- upsetplot
26+
27+
### Required databases
28+
29+
- RepeatMasker track from UCSC genome browser
30+
- dbSNP combined VCF file (along with .tbi index)
31+
32+
Both databases must use GRC838 coordinates.
33+
34+
35+
## How it works
36+
37+
The design algorithm has three stages: a filtering stage, a panel scaffolding stage, and a panel fill-out stage.
38+
39+
1. The filtering stage applies five filters to exclude microhaps likely to perform poorly in a multiplex targeted amplicon sequencing assay: details are shown below.
40+
2. In the scaffolding stage, each chromosome is considered independently. A *linkage graph* is constructed for each chromosome where each node represents a microhap marker and each pair of nodes is connected if the corresponding microhaps are separated by at least 9.5 Mbp (as a proxy for linkage equilibrium). All *maximal cliques* in this linkage graph are enumerated, each representing a set of mutually independent microhaps. The maximal clique with the highest aggregate Ae score is retained as the panel scaffolding for this chromosome.
41+
3. The fill-out stage also proceeds on a chromosome-by-chromosome basis. A greedy algorithm is used to add additional independently inherited microhaps: the highest-ranked microhap by Ae that is separated by at least 9.5 Mbp from all microhaps already included in the panel is added to the panel. This is repeated until no more microhaps can be added.
42+
43+
### Filtering stage
44+
45+
The filters/masks applied during this first stage are as follows.
46+
47+
1. Exclude any marker within 9.5 Mbp of a forensic STR
48+
2. Exclude any marker that overlaps with a highly conserved genomic repeat element (SINE, LINE, or LTR)
49+
3. Exclude any marker with low-complexity sequence close to an allele-defining SNP (ADS)
50+
4. Exclude any marker with an indel polymorphism close to an ADS
51+
5. Exclude any marker longer than 260 bp in length
52+
53+
The results of each individual filter and of all aggregated filters are in `data/intermediate/`. A plot showing the number of microhaps excluded by each filter is shown in `data/results/masking-results-plot.png`.
54+
55+
### Whitelist
56+
57+
A whitelist was constructed of microhaps to include regardless of filtering status. This list primarily contained microhaps whose performance has already been demonstrated empirically in previous studies. This includes:
58+
59+
- All loci from ThermoFisher 74-plex
60+
- All loci from Ken Kidd 2022 24-plex
61+
- All loci from USC panel
62+
63+
It also includes two "keepers" from manual analysis of high Ae microhaps that were filtered in the preliminary stages of algorithm development (mh06SCUZJ-0528857, mh15SCUZJ-0082880).
64+
65+
```python
66+
>>> table2= pd.read_csv("markers-failed-filter.tsv", sep="\t")
67+
>>> subtable = table[(~table.FailMode.str.contains("length")) & (~table.FailMode.str.contains("str")) & (table.Ae > 9.0)].sort_values("Ae", ascending=False)
68+
>>> subtable.to_csv("input/high-ae-filtered.tsv", sep="\t", index=False)
69+
```
70+
71+
72+
## Whence the parameters?
73+
74+
Some of the parameter values (declared in `config-example.json`) were selected based on informed intuition, and some based on empirical observations. The reasoning behind parameter selection is elaborated upon here.
75+
76+
### max_length
77+
78+
A limit of 300 bp is commonly used in the literature for defining a microhap. However, the Microhap Working Group elected to restrict the max extent of any microhap in a core panel to 250 bp, allowing the entire amplicon—primers and all—to fit within roughly 300 bp. So initially the value of this parameter was set to 250 bp. But during the preliminary stages of the panel design algorithm development, a handful of promising microhaps was observed right on the fence. So this parameter was marginally relaxed to 260 bp to capture a few high-value targets.
79+
80+
### sine/line/ltr
81+
82+
RepeatMasker reports a score for each genomic repeat it annotates. This score captures the extent to which any given repetitive sequence is conserved throughout the genome. Higher scores correspond to longer sequences conserved at higher fidelity, while lower scores indicate shorter sequences with more distant similarities. The distribution of these scores for all SINEs, LINEs, and LTRs in the human genome was examined, and representatives in various score ranges were observed to assess the extent of conservation (or conversely, the amount of unique sequence) corresponding to different score ranges. These observations were then used to inform the selection of cutoffs: repeat elements with scores exceeding the cutoffs were retained to filter out microhaps; repeats with lower scores were ignored at the filtering stage.
83+
84+
### ld_dist
85+
86+
A distance of 10 Mbp was initially selected as a proxy for linkage equilibrium, i.e., the physical distance required between a pair of loci to be considered independently inherited and thus suitable for the probability product rule. This threshold is applied both to filtering based on forensic STRs and to populating the linkage graph. During the preliminary stages of the panel design algorithm development, a handful of promising microhaps was observed just below that threshold. After confirming that none of these microhaps was in linkage disequilibrium with the closest candidate markers in the panel, the threshold was relaxed to 9.5 Mbp.
87+
88+
### max_short_mh_per_chrom
89+
90+
During the preliminary stages of algorithm development, it was noted that populating the linkage graph with all candidates from the chromosome would lead to maximal cliques composed of numerous short microhaps with mediocre Ae scores, crowding out microhaps with higher individual Ae scores. Rather than developing a more sophisticated clique ranking score that gives more weight to higher Ae values, it was decided to include only a handful of short microhaps in the initial linkage graph construction, and then fill in later with a greedy algorithm. This parameter limited the linkage graph to 6 microhaps < 100 bp in length for each chromosome.

notebooks/panel-design/markII/Snakefile

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -55,9 +55,9 @@ rule design_panel:
5555
cutlist="data/intermediate/cut-list.tsv",
5656
params:
5757
ld_dist=config["ld_dist"],
58-
max_per_chrom=config["max_candidates_per_chrom"],
58+
max_per_chrom=config["max_short_mh_per_chrom"],
5959
shell:
60-
"python {input} --distance {params.ld_dist} --max-per-chrom {params.max_per_chrom} --cut-list {output.cutlist} > {output.tsv}"
60+
"python {input} --distance {params.ld_dist} --max-short-mh-per-chrom {params.max_per_chrom} --cut-list {output.cutlist} > {output.tsv}"
6161

6262

6363
rule apply_masks:

notebooks/panel-design/markII/code/design_panel.py

Lines changed: 7 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -77,26 +77,18 @@ def get_parser():
7777
parser.add_argument("markers", help="path to MicroHapDB marker definitions in CSV format")
7878
parser.add_argument("aes", help="path to MicroHapDB Ae table in CSV format")
7979
parser.add_argument(
80-
"--max-per-chrom",
80+
"--max-short-mh-per-chrom",
8181
type=int,
82-
nargs=2,
83-
default=(8, 8),
82+
default=6,
8483
metavar="M",
85-
help="select the M highest ranked short and long markers (respectively) per chromosome by Ae; by default M=(8, 8)",
86-
)
87-
parser.add_argument(
88-
"--batches",
89-
type=int,
90-
default=8,
91-
metavar="B",
92-
help="split the linkage graph into B batches; by default B=8",
84+
help="when building the linkage graph, exclude all but the M highest-ranked (by Ae) short microhaps for each chromosome; by default M=6",
9385
)
9486
parser.add_argument(
9587
"--distance",
9688
type=float,
97-
default=10e6,
89+
default=9.5e6,
9890
metavar="D",
99-
help="two markers must be separated by more than D bp to be considered independently inherited (as a heuristic); by default D=10000000",
91+
help="two markers must be separated by more than D bp to be considered independently inherited (as a heuristic); by default D=95000000 (9.5 Mbp)",
10092
)
10193
parser.add_argument(
10294
"--cut-list",
@@ -110,8 +102,8 @@ def get_parser():
110102
args = get_parser().parse_args()
111103
markers = load_markers(args.markers, args.aes)
112104
thresholds = LinkageGraphThresholds(
113-
max_per_chrom_short=args.max_per_chrom[0],
114-
max_per_chrom_long=args.max_per_chrom[1],
105+
max_per_chrom_short=args.max_short_mh_per_chrom,
106+
max_per_chrom_long=100, # effectively disable this filter
115107
ld_distance=args.distance,
116108
)
117109
main(markers, thresholds, cutlist=args.cut_list)

notebooks/panel-design/markII/config-example.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,5 +8,5 @@
88
"line": 411,
99
"ltr": 909,
1010
"ld_dist": 9.5e6,
11-
"max_candidates_per_chrom": "6 100"
11+
"max_short_mh_per_chrom": 6
1212
}

notebooks/panel-design/markII/data/intermediate/high-ae-filtered.tsv renamed to notebooks/panel-design/markII/data/input/high-ae-prelim-filtered.tsv

File renamed without changes.

0 commit comments

Comments
 (0)