Document markII

standage · standage · commit 4de82fef130b · 2025-04-23T15:25:12.000-04:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,9 @@ This project adheres to [Semantic Versioning](http://semver.org/).
 
 ## [Unreleased]
 
+### Added
+- Panel design notebooks (#157).
+
 ### Fixed
 - Debugged a test that counts observed haplotypes (#154).
 - Replaced global pooled Ae values with 26-population average as the default Ae reported (#155).
diff --git a/notebooks/panel-design/markII/README b/notebooks/panel-design/markII/README
diff --git a/notebooks/panel-design/markII/README.md b/notebooks/panel-design/markII/README.md
@@ -0,0 +1,90 @@
+# Panel design algorithm, mark II
+
+## Running the procedure
+
+- Make sure software prerequisites are installed (see below)
+- Make sure databases are in place (see below)
+- Edit the JSON config file `config-example.json` to point to the correct databases (or fiddle with the parameters if you're brave)
+- Run the Snakemake workflow
+
+```
+snakemake --configfiles config-example.json -c 1 -p
+```
+
+### Software prerequisites
+
+- intervaltree
+- matplotlib
+- networkx
+- pandas
+- polars
+- scipy
+- snakemake
+- tagore
+- tqdm
+- upsetplot
+
+### Required databases
+
+- RepeatMasker track from UCSC genome browser
+- dbSNP combined VCF file (along with .tbi index)
+
+Both databases must use GRCh38 coordinates.
+
+
+## How it works
+
+The design algorithm has three stages: a filtering stage, a panel scaffolding stage, and a panel fill-out stage.
+
+1. The filtering stage applies five filters to exclude microhaps likely to perform poorly in a multiplex targeted amplicon sequencing assay (details below).
+2. In the scaffolding stage, each chromosome is considered independently. A *linkage graph* is constructed for each chromosome where each node represents a microhap marker and each pair of nodes is connected if the corresponding microhaps are separated by at least 9.5 Mbp (as a proxy for linkage equilibrium). All *maximal cliques* in this linkage graph are enumerated, each representing a set of mutually independent microhaps. The maximal clique with the highest aggregate Ae score is retained as the panel scaffolding for this chromosome.
+3. The fill-out stage also proceeds chromosome by chromosome. A greedy algorithm adds further independently inherited microhaps: the highest-ranked microhap by Ae that is separated by at least 9.5 Mbp from every microhap already in the panel is added, and this is repeated until no more microhaps can be added.
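
The scaffolding and fill-out stages can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in `design_panel.py`: the marker names, positions, and Ae values are invented, the helper names (`independent`, `maximal_cliques`, `scaffold`, `fill_out`) are hypothetical, and the clique enumeration is a basic Bron-Kerbosch routine written here so the example needs no third-party packages.

```python
LD_DIST = 9.5e6  # ld_dist from config-example.json

# Hypothetical markers on one chromosome: (name, position in bp, Ae)
MARKERS = [
    ("mhA", 1_000_000, 4.0),
    ("mhB", 6_000_000, 6.5),
    ("mhC", 12_000_000, 5.0),
    ("mhD", 22_000_000, 3.5),
    ("mhE", 33_000_000, 7.0),
]


def independent(a, b, dist=LD_DIST):
    """Two markers count as independent if they are at least `dist` bp apart."""
    return abs(a[1] - b[1]) >= dist


def maximal_cliques(nodes, adjacent):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    def expand(clique, cand, excluded):
        if not cand and not excluded:
            yield clique
            return
        for node in list(cand):
            nbrs = adjacent[node]
            yield from expand(clique | {node}, cand & nbrs, excluded & nbrs)
            cand = cand - {node}
            excluded = excluded | {node}
    yield from expand(frozenset(), frozenset(nodes), frozenset())


def scaffold(graph_markers):
    """Return the maximal clique with the highest aggregate Ae."""
    adjacent = {
        m: frozenset(o for o in graph_markers if o != m and independent(m, o))
        for m in graph_markers
    }
    cliques = maximal_cliques(graph_markers, adjacent)
    return max(cliques, key=lambda c: sum(m[2] for m in c))


def fill_out(panel, all_markers):
    """Greedily add the best remaining marker that is independent of the panel."""
    panel = set(panel)
    for m in sorted(all_markers, key=lambda m: m[2], reverse=True):
        if m not in panel and all(independent(m, p) for p in panel):
            panel.add(m)
    return panel


# Pretend mhD was held out of the linkage graph, then offered during fill-out.
in_graph = [m for m in MARKERS if m[0] != "mhD"]
scaffold_names = sorted(m[0] for m in scaffold(in_graph))
panel_names = sorted(m[0] for m in fill_out(scaffold(in_graph), MARKERS))
print(scaffold_names)  # ['mhA', 'mhC', 'mhE']
print(panel_names)     # ['mhA', 'mhC', 'mhD', 'mhE']
```

In this toy data `mhB` has a high Ae but sits within 9.5 Mbp of both `mhA` and `mhC`, so the clique containing it scores lower in aggregate and it is also rejected during fill-out. A maximal clique cannot be extended by any node of its own graph, so the fill-out step only adds markers that were held out of the linkage graph.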
+
+### Filtering stage
+
+The filters/masks applied during this first stage are as follows.
+
+1. Exclude any marker within 9.5 Mbp of a forensic STR
+2. Exclude any marker that overlaps with a highly conserved genomic repeat element (SINE, LINE, or LTR)
+3. Exclude any marker with low-complexity sequence close to an allele-defining SNP (ADS)
+4. Exclude any marker with an indel polymorphism close to an ADS
+5. Exclude any marker longer than 260 bp
+
+The results of each individual filter and of all aggregated filters are in `data/intermediate/`. A plot of the number of microhaps excluded by each filter is in `data/results/masking-results-plot.png`.
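
As a rough illustration of how two of these filters (STR proximity and length) might be evaluated per marker, here is a small sketch. It is an assumption-laden example rather than the workflow's actual masking code: the `fail_modes` helper, the dictionary layout, and the coordinates are all hypothetical, and the labels `"str"` and `"length"` are chosen only to echo the substrings queried in the `FailMode` column elsewhere in this README.

```python
LD_DIST = 9.5e6   # ld_dist: minimum distance from a forensic STR
MAX_LENGTH = 260  # max_length in bp


def fail_modes(marker, str_positions):
    """Return the filters a marker fails; an empty list means it passes both."""
    modes = []
    # Filter 1: marker lies within LD_DIST of any forensic STR on the chromosome
    if any(abs(marker["pos"] - s) < LD_DIST for s in str_positions):
        modes.append("str")
    # Filter 5: marker is longer than the maximum allowed length
    if marker["length"] > MAX_LENGTH:
        modes.append("length")
    return modes


# Hypothetical data for a single chromosome
str_positions = [15_000_000]
markers = [
    {"name": "mhX", "pos": 10_000_000, "length": 180},  # 5 Mbp from the STR
    {"name": "mhY", "pos": 40_000_000, "length": 275},  # too long
    {"name": "mhZ", "pos": 60_000_000, "length": 120},  # passes both
]
results = {m["name"]: fail_modes(m, str_positions) for m in markers}
print(results)
```

Recording every failed filter, rather than stopping at the first, is what makes it possible to later ask which markers failed only for reasons other than length or STR proximity.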
+
+### Whitelist
+
+A whitelist was constructed of microhaps to include regardless of filtering status. This list primarily contained microhaps whose performance has already been demonstrated empirically in previous studies. This includes:
+
+- All loci from ThermoFisher 74-plex
+- All loci from Ken Kidd 2022 24-plex
+- All loci from USC panel
+
+It also includes two "keepers" from manual analysis of high Ae microhaps that were filtered in the preliminary stages of algorithm development (mh06SCUZJ-0528857, mh15SCUZJ-0082880).
+
+```python
+>>> import pandas as pd
+>>> table = pd.read_csv("markers-failed-filter.tsv", sep="\t")
+>>> subtable = table[(~table.FailMode.str.contains("length")) & (~table.FailMode.str.contains("str")) & (table.Ae > 9.0)].sort_values("Ae", ascending=False)
+>>> subtable.to_csv("input/high-ae-filtered.tsv", sep="\t", index=False)
+```
+
+
+## Whence the parameters?
+
+Some of the parameter values (declared in `config-example.json`) were selected based on informed intuition, and some based on empirical observations. The reasoning behind parameter selection is elaborated upon here.
+
+### max_length
+
+A limit of 300 bp is commonly used in the literature for defining a microhap. However, the Microhap Working Group elected to restrict the max extent of any microhap in a core panel to 250 bp, allowing the entire amplicon—primers and all—to fit within roughly 300 bp. So initially the value of this parameter was set to 250 bp. But during the preliminary stages of the panel design algorithm development, a handful of promising microhaps was observed right on the fence. So this parameter was marginally relaxed to 260 bp to capture a few high-value targets.
+
+### sine/line/ltr
+
+RepeatMasker reports a score for each genomic repeat it annotates. This score captures the extent to which any given repetitive sequence is conserved throughout the genome. Higher scores correspond to longer sequences conserved at higher fidelity, while lower scores indicate shorter sequences with more distant similarities. The distribution of these scores for all SINEs, LINEs, and LTRs in the human genome was examined, and representatives in various score ranges were observed to assess the extent of conservation (or conversely, the amount of unique sequence) corresponding to different score ranges. These observations were then used to inform the selection of cutoffs: repeat elements with scores exceeding the cutoffs were retained to filter out microhaps; repeats with lower scores were ignored at the filtering stage.
+
+### ld_dist
+
+A distance of 10 Mbp was initially selected as a proxy for linkage equilibrium, i.e., the physical distance required between a pair of loci to be considered independently inherited and thus suitable for the probability product rule. This threshold is applied both to filtering based on forensic STRs and to populating the linkage graph. During the preliminary stages of the panel design algorithm development, a handful of promising microhaps was observed just below that threshold. After confirming that none of these microhaps was in linkage disequilibrium with the closest candidate markers in the panel, the threshold was relaxed to 9.5 Mbp.
+
+### max_short_mh_per_chrom
+
+During the preliminary stages of algorithm development, it was noted that populating the linkage graph with all candidates from the chromosome would lead to maximal cliques composed of numerous short microhaps with mediocre Ae scores, crowding out microhaps with higher individual Ae scores. Rather than developing a more sophisticated clique ranking score that gives more weight to higher Ae values, it was decided to include only a handful of short microhaps in the initial linkage graph construction, and then fill in later with a greedy algorithm. This parameter limited the linkage graph to 6 microhaps < 100 bp in length for each chromosome.
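
The effect of this parameter can be illustrated with a short sketch. It assumes markers are simple dictionaries with `length` and `ae` keys and uses a hypothetical helper named `graph_candidates`; the real selection logic lives in `design_panel.py` and may differ in detail (for instance, the commit also hard-codes a cap of 100 on long microhaps, which is ignored here).

```python
def graph_candidates(markers, max_short=6, short_len=100):
    """Keep every long marker but only the `max_short` best short ones by Ae."""
    short = [m for m in markers if m["length"] < short_len]
    long_ = [m for m in markers if m["length"] >= short_len]
    short.sort(key=lambda m: m["ae"], reverse=True)
    return short[:max_short] + long_


# Hypothetical chromosome: eight short markers plus two longer ones
markers = [{"name": f"short{i}", "length": 60, "ae": 2.0 + i * 0.1} for i in range(8)]
markers += [
    {"name": "longA", "length": 150, "ae": 5.5},
    {"name": "longB", "length": 220, "ae": 4.8},
]
kept = graph_candidates(markers)
kept_names = [m["name"] for m in kept]
print(kept_names)
```

The two lowest-Ae short markers (`short0` and `short1`) are left out of the linkage graph but remain eligible for the greedy fill-out stage.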
diff --git a/notebooks/panel-design/markII/Snakefile b/notebooks/panel-design/markII/Snakefile
@@ -55,9 +55,9 @@ rule design_panel:
         cutlist="data/intermediate/cut-list.tsv",
     params:
         ld_dist=config["ld_dist"],
-        max_per_chrom=config["max_candidates_per_chrom"],
+        max_per_chrom=config["max_short_mh_per_chrom"],
     shell:
-        "python {input} --distance {params.ld_dist} --max-per-chrom {params.max_per_chrom} --cut-list {output.cutlist} > {output.tsv}"
+        "python {input} --distance {params.ld_dist} --max-short-mh-per-chrom {params.max_per_chrom} --cut-list {output.cutlist} > {output.tsv}"
 
 
 rule apply_masks:
diff --git a/notebooks/panel-design/markII/code/design_panel.py b/notebooks/panel-design/markII/code/design_panel.py
@@ -77,26 +77,18 @@ def get_parser():
     parser.add_argument("markers", help="path to MicroHapDB marker definitions in CSV format")
     parser.add_argument("aes", help="path to MicroHapDB Ae table in CSV format")
     parser.add_argument(
-        "--max-per-chrom",
+        "--max-short-mh-per-chrom",
         type=int,
-        nargs=2,
-        default=(8, 8),
+        default=6,
         metavar="M",
-        help="select the M highest ranked short and long markers (respectively) per chromosome by Ae; by default M=(8, 8)",
-    )
-    parser.add_argument(
-        "--batches",
-        type=int,
-        default=8,
-        metavar="B",
-        help="split the linkage graph into B batches; by default B=8",
+        help="when building the linkage graph, exclude all but the M highest-ranked (by Ae) short microhaps for each chromosome; by default M=6",
     )
     parser.add_argument(
         "--distance",
         type=float,
-        default=10e6,
+        default=9.5e6,
         metavar="D",
-        help="two markers must be separated by more than D bp to be considered independently inherited (as a heuristic); by default D=10000000",
+        help="two markers must be separated by more than D bp to be considered independently inherited (as a heuristic); by default D=9500000 (9.5 Mbp)",
     )
     parser.add_argument(
         "--cut-list",
@@ -110,8 +102,8 @@ def get_parser():
     args = get_parser().parse_args()
     markers = load_markers(args.markers, args.aes)
     thresholds = LinkageGraphThresholds(
-        max_per_chrom_short=args.max_per_chrom[0],
-        max_per_chrom_long=args.max_per_chrom[1],
+        max_per_chrom_short=args.max_short_mh_per_chrom,
+        max_per_chrom_long=100,  # effectively disable this filter
         ld_distance=args.distance,
     )
     main(markers, thresholds, cutlist=args.cut_list)
diff --git a/notebooks/panel-design/markII/config-example.json b/notebooks/panel-design/markII/config-example.json
@@ -8,5 +8,5 @@
     "line": 411,
     "ltr": 909,
     "ld_dist": 9.5e6,
-    "max_candidates_per_chrom": "6 100"
+    "max_short_mh_per_chrom": 6
 }
diff --git a/notebooks/panel-design/markII/data/input/high-ae-prelim-filtered.tsv b/notebooks/panel-design/markII/data/input/high-ae-prelim-filtered.tsv
