Merged
93 commits
bd57e00
set up project structure
jonperdomo Mar 4, 2025
b58db8f
work on train model
jonperdomo Mar 14, 2025
cb9bac1
work on annotations
jonperdomo Mar 16, 2025
05c0991
add fragile sites
jonperdomo Mar 17, 2025
f8619f6
work on annotations
jonperdomo Mar 17, 2025
3c131d5
update annotations
jonperdomo Mar 17, 2025
e4ce34f
cytoband annotations
jonperdomo Mar 28, 2025
bcc296f
update annotations
jonperdomo Mar 28, 2025
f9a0776
work on features
jonperdomo Mar 29, 2025
b1f5cec
work on features
jonperdomo Mar 29, 2025
309a140
update df
jonperdomo Apr 1, 2025
4ee6b81
fix annotations
jonperdomo Apr 1, 2025
4bb1f54
remove test code
jonperdomo Apr 1, 2025
e4e4728
model training
jonperdomo Apr 2, 2025
c3169da
test multiple models
jonperdomo Apr 3, 2025
8744fe0
caller model update
jonperdomo Apr 3, 2025
c08b339
cross validation
jonperdomo Apr 3, 2025
98a1777
update training model
jonperdomo Apr 4, 2025
5279105
add copy number state and read alignment offset features
jonperdomo Apr 15, 2025
1fbb273
create extract features module
jonperdomo Apr 15, 2025
017b920
implement predictions
jonperdomo Apr 15, 2025
80d8362
feature corr analysis
jonperdomo Apr 15, 2025
6078b96
add id column
jonperdomo Apr 15, 2025
a46794b
add hg002 hg19 to training
jonperdomo Apr 16, 2025
4ddde04
key fix
jonperdomo Apr 16, 2025
430d7f9
add hg19 filtering
jonperdomo Apr 17, 2025
d0c86ec
update plot
jonperdomo Apr 28, 2025
32eccd6
normalize features and add annovar annotations
jonperdomo May 1, 2025
7419bfc
fix segdup scores
jonperdomo May 1, 2025
990dfad
remove test code
jonperdomo May 1, 2025
23f98e6
remove unused code
jonperdomo May 1, 2025
27e6def
normalize coverage based features
jonperdomo May 4, 2025
ca4c9e9
fix scaling
jonperdomo May 5, 2025
cd4f099
improve large sv scores
jonperdomo Jul 10, 2025
ae4c11c
feature engineering
jonperdomo Jul 12, 2025
d9715e3
feature updates
jonperdomo Jul 13, 2025
257ed57
update features
jonperdomo Jul 14, 2025
1b0c25a
scale read depth and cluster size
jonperdomo Jul 14, 2025
01b327d
add threshold parameter and update cross validation plot
jonperdomo Sep 3, 2025
be6cf9b
work on leave out model training
jonperdomo Feb 15, 2026
94a00d6
parameter optimization for precision
jonperdomo Feb 15, 2026
58673c5
full model train
jonperdomo Feb 18, 2026
a057fcc
update prediction
jonperdomo Feb 18, 2026
1004ea6
comment previous code
jonperdomo Feb 23, 2026
fe3e27c
remove interaction terms
jonperdomo Feb 23, 2026
65b740d
shap analysis
jonperdomo Feb 27, 2026
618b3b2
update features
jonperdomo Feb 28, 2026
36a329f
work on feature importance
jonperdomo Feb 28, 2026
3dec63d
fix shap plot
jonperdomo Feb 28, 2026
6281306
update predict
jonperdomo Feb 28, 2026
fd2d2dd
normalize read depth with mean coverage
jonperdomo Mar 2, 2026
a84be5f
less than 10kb model
jonperdomo Mar 2, 2026
ffcc591
add filepath config file and prediction threshold sweep
jonperdomo Mar 5, 2026
b25ae22
optimal thresholds by sv type
jonperdomo Mar 8, 2026
884c89b
add sv length cutoff parameter
jonperdomo Mar 10, 2026
0378958
conda project restructure
jonperdomo Apr 12, 2026
e141707
work on unit tests
jonperdomo Apr 12, 2026
9d65823
add test data
jonperdomo Apr 12, 2026
57f1b18
handle gzip
jonperdomo Apr 12, 2026
30bed9b
add tests
jonperdomo Apr 12, 2026
618db2f
add unit tests workflow
jonperdomo Apr 12, 2026
c868f2c
Add unit tests badge to README
jonperdomo Apr 12, 2026
2996178
update test name
jonperdomo Apr 12, 2026
70ee53d
update gh action
jonperdomo Apr 12, 2026
9d6687f
update environment
jonperdomo Apr 12, 2026
3cc62cf
try mamba
jonperdomo Apr 12, 2026
e9d8134
mamba test
jonperdomo Apr 12, 2026
8cbc50b
test mamba
jonperdomo Apr 12, 2026
0779f97
wokr on conda package
jonperdomo Apr 13, 2026
64f80e0
improve installation
jonperdomo Apr 13, 2026
e599dd5
fix chr format mismatch
jonperdomo Apr 13, 2026
f66bfb4
fix warnings
jonperdomo Apr 16, 2026
b4fc5d9
restore plot
jonperdomo Apr 16, 2026
760f64c
save evaluation plots
jonperdomo Apr 20, 2026
ba4cba5
update readme
jonperdomo Apr 20, 2026
ca19b4a
update gitignore
jonperdomo May 10, 2026
fb9875c
update bed filepaths
jonperdomo May 10, 2026
fe46b3f
add pytest
jonperdomo May 10, 2026
7b10db4
Update unit tests badge in README.md
jonperdomo May 10, 2026
cb3aab3
update README
jonperdomo May 11, 2026
acdb5b6
Revise README content and improve clarity
jonperdomo May 11, 2026
dff54d0
clean up comments and logs
jonperdomo May 11, 2026
8086f93
Potential fix for pull request finding
jonperdomo May 11, 2026
8f20a2e
Potential fix for pull request finding
jonperdomo May 11, 2026
4fc28c9
Potential fix for pull request finding
jonperdomo May 11, 2026
4ad957f
Potential fix for pull request finding
jonperdomo May 11, 2026
df81ab2
Fix score intermediate BED path and cleanup behavior
Copilot May 11, 2026
ac82bf5
Lazy-load optional training dependencies in train_full_model
Copilot May 11, 2026
118f4d9
Potential fix for pull request finding
jonperdomo May 11, 2026
0f34095
Potential fix for pull request finding
jonperdomo May 11, 2026
9be53dc
Use safe subprocess invocation for ANNOVAR DB download
Copilot May 11, 2026
df9503c
Change branch for workflow trigger from 'initial-commit' to 'main'
jonperdomo May 11, 2026
874318c
update setup.py
jonperdomo May 11, 2026
43 changes: 43 additions & 0 deletions .github/workflows/unit-tests.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# This is a basic workflow to help you get started with Actions

name: unit tests

# Controls when the workflow will run
on:
  # Triggers the workflow on push or pull request events but only for the "main" branch
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v4

      - name: Set up conda environment
        uses: conda-incubator/setup-miniconda@v3
        with:
          miniforge-variant: Miniforge3 # uses mamba automatically
          activate-environment: contextscore
          environment-file: environment.yml
          auto-activate-base: false
          use-mamba: true
          cache-environment: true # ← caches the env
          cache-downloads: true # ← caches downloaded packages

      - name: Run tests
        shell: bash --login {0}
        run: |
          mkdir tests/output
          python -m pytest
19 changes: 19 additions & 0 deletions .gitignore
@@ -169,3 +169,22 @@ cython_debug/

# PyPI configuration file
.pypirc

# Ignore the output/ folder
output/
scripts/

# VS Code settings
.vscode/launch.json

# Testing scripts
linktoscripts
truvari_results_Simulated_*/
conda/contextscore-models/
tests/fixtures/output.vcf.avinput
tests/fixtures/output.vcf.bed
tests/fixtures/annotations/features.tsv
tests/fixtures/annotations/regions.hg38_multianno.txt

# Database files
data/
7 changes: 7 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,7 @@
{
    "python.testing.pytestArgs": [
        "tests"
    ],
    "python.testing.unittestEnabled": false,
    "python.testing.pytestEnabled": true
}
3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,3 @@
include README.md
include LICENSE
recursive-include data *
39 changes: 38 additions & 1 deletion README.md
@@ -1,2 +1,39 @@
[![unit tests](https://github.com/WGLab/ContextScore/actions/workflows/unit-tests.yml/badge.svg)](https://github.com/WGLab/ContextScore/actions/workflows/unit-tests.yml)

# ContextScore
Assign confidence scores to SV datasets based on coverage, genomic context, and other important alignment features
<p>
<img src="https://github.com/user-attachments/assets/03603ad1-df9d-438d-911c-81af0cf612e3" alt="ContextSV" align="left" style="width:100px;"/>
Filtering step for the <a href="https://github.com/WGLab/ContextSV">ContextSV</a> long-read structural variant (SV) caller, utilizing a Random Forest model trained on SV validation features. Assign confidence scores to SV datasets based on coverage, genomic context, and other important alignment features, then filter low-confidence SVs to increase the precision of the final callset. Genomic context is determined from annotations using ANNOVAR and UCSC databases.
</p>
<br clear="left"/>

## Installation
```bash
conda install -c wglab -c bioconda -c conda-forge contextscore

# Or using mamba (faster dependency resolution):
mamba install -c wglab -c bioconda -c conda-forge contextscore
```

## ANNOVAR setup
[ANNOVAR](https://annovar.openbioinformatics.org/en/latest/user-guide/download/) is required for annotations and must be installed separately.

ContextScore needs two ANNOVAR paths, passed on the command line:
- `--annovar`: directory containing `annotate_variation.pl` and `table_annovar.pl`
- `--annovar-db`: ANNOVAR database directory
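
A quick pre-flight check can catch a mis-pointed `--annovar` directory before a run. The sketch below is not part of ContextScore: `missing_annovar_scripts` is a hypothetical helper that only looks for the two script names listed above and makes no claim about how the tool itself validates its inputs.

```python
from pathlib import Path

# Script names taken from the README; the helper itself is hypothetical.
REQUIRED_ANNOVAR_SCRIPTS = ("annotate_variation.pl", "table_annovar.pl")


def missing_annovar_scripts(annovar_dir: str) -> list[str]:
    """Return the required ANNOVAR scripts that are absent from annovar_dir."""
    root = Path(annovar_dir)
    return [name for name in REQUIRED_ANNOVAR_SCRIPTS if not (root / name).is_file()]


if __name__ == "__main__":
    # Placeholder path, same as in the README usage example.
    missing = missing_annovar_scripts("/path/to/annovar")
    if missing:
        print("Missing ANNOVAR scripts: " + ", ".join(missing))
    else:
        print("ANNOVAR directory looks complete.")
```

Returning the list of missing names, rather than a boolean, keeps the error message specific when only one script is absent.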

## User Workflow
```bash
# Set --buildver to hg38 or hg19
contextscore --input input.vcf --output scored.vcf --sample-coverage 30 --buildver hg38 --threshold 0.2 \
    --annovar /path/to/annovar --annovar-db /path/to/humandb
```
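
When the scoring step is driven from a Python pipeline rather than a shell, the same call can be assembled as an argument list. This is a minimal sketch under assumptions: `build_contextscore_command` and `run_contextscore` are hypothetical wrappers, and only the flags shown in the README command are used.

```python
import subprocess


def build_contextscore_command(input_vcf, output_vcf, coverage, buildver,
                               threshold, annovar_dir, annovar_db):
    """Build the argv list for the CLI call shown in the README."""
    # The README names hg38 and hg19 as the accepted genome builds.
    if buildver not in ("hg38", "hg19"):
        raise ValueError("buildver must be 'hg38' or 'hg19'")
    return [
        "contextscore",
        "--input", str(input_vcf),
        "--output", str(output_vcf),
        "--sample-coverage", str(coverage),
        "--buildver", buildver,
        "--threshold", str(threshold),
        "--annovar", str(annovar_dir),
        "--annovar-db", str(annovar_db),
    ]


def run_contextscore(**kwargs):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    # A list without shell=True avoids shell quoting and expansion surprises.
    subprocess.run(build_contextscore_command(**kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, since the paths are placeholders.
    print(" ".join(build_contextscore_command(
        "input.vcf", "scored.vcf", 30, "hg38", 0.2,
        "/path/to/annovar", "/path/to/humandb")))
```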

## Sources for additional annotations (under `data/` directory):
| File | Source | Description | Link |
| --- | --- | --- | --- |
| `cytobands_hg{19,38}.txt` | UCSC Genome Browser | Cytoband annotations for human genome builds hg19 and hg38 | [UCSC hg19](https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz) / [UCSC hg38](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz) |
| `hg{19,38}_segmental_duplications.bed` | UCSC Genome Browser | Segmental duplication annotations for human genome builds hg19 and hg38 | [UCSC hg19](https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/segmentalDuplications.txt.gz) / [UCSC hg38](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/segmentalDuplications.txt.gz) |
| `phastcons100way_hg{19,38}.bed` | UCSC Genome Browser | PhastCons conservation scores for human genome builds hg19 and hg38 | [UCSC hg19](https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/phastCons100way.txt.gz) / [UCSC hg38](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/phastCons100way.txt.gz) |
| `simple_repeats_hg{19,38}.bed` | UCSC Genome Browser | Simple repeat annotations for human genome builds hg19 and hg38 | [UCSC hg19](https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/simpleRepeat.txt.gz) / [UCSC hg38](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz) |
| `fragile_sites_hg38.bed` / `fragile_sites_hg19_liftover.bed` | [HumCFS](https://webs.iiitd.edu.in/raghava/humcfs/download.html) | Fragile site annotations for human genome builds hg38 and hg19 (liftover) | [HumCFS](https://webs.iiitd.edu.in/raghava/humcfs/fragile_site_bed.zip) |
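
The UCSC links in the table share one URL pattern, which can be expressed as a small helper. This is a sketch, not code from the repository: `ucsc_table_url` is a hypothetical name, and the function only formats a URL without checking that the table exists on the server.

```python
UCSC_BASE = "https://hgdownload.soe.ucsc.edu/goldenPath"
SUPPORTED_BUILDS = ("hg19", "hg38")


def ucsc_table_url(table_name: str, build: str) -> str:
    """Format the download URL for a UCSC database table dump."""
    if build not in SUPPORTED_BUILDS:
        raise ValueError(f"Unsupported build: {build}")
    return f"{UCSC_BASE}/{build}/database/{table_name}.txt.gz"


if __name__ == "__main__":
    # cytoBand and simpleRepeat are table names taken from the links above.
    for table in ("cytoBand", "simpleRepeat"):
        for build in SUPPORTED_BUILDS:
            print(ucsc_table_url(table, build))
```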
42 changes: 42 additions & 0 deletions conda/meta.yaml
@@ -0,0 +1,42 @@
{% set name = "contextscore" %}
{% set version = "0.1.0" %}

package:
  name: {{ name|lower }}
  version: {{ version }}

source:
  path: ..

build:
  number: 0
  skip: true  # [win]
  script: "{{ PYTHON }} -m pip install . --no-deps -vv"

requirements:
  host:
    - python >=3.10,<3.11
    - pip
    - setuptools
  run:
    - python >=3.10,<3.11
    - numpy
    - pandas
    - scikit-learn =1.6.1  # For consistency with model training environment
    - joblib
    - bedtools
    - contextscore-models

about:
  home: https://github.com/WGLab/ContextScore
  summary: Assign confidence scores to structural variant datasets.
  description: |
    ContextScore prediction package. Model weights are distributed separately
    (for example via contextscore-models) and can be provided via --model or
    CONTEXTSCORE_MODEL_PATH.
  license: MIT
  license_file: LICENSE

extra:
  recipe-maintainers:
    - WGLab
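
The recipe description says the model can come from `--model` or `CONTEXTSCORE_MODEL_PATH`. One plausible way to resolve between the two is sketched below. The precedence (command-line value first, then the environment variable) and the name `resolve_model_path` are assumptions for illustration, not the package's confirmed behavior.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_model_path(cli_model: Optional[str] = None) -> Path:
    """Pick the model file: explicit CLI value first, then the environment."""
    # Assumed precedence: --model overrides CONTEXTSCORE_MODEL_PATH.
    candidate = cli_model or os.environ.get("CONTEXTSCORE_MODEL_PATH")
    if not candidate:
        raise FileNotFoundError(
            "No model given: pass --model or set CONTEXTSCORE_MODEL_PATH.")
    path = Path(candidate).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    return path


if __name__ == "__main__":
    try:
        print(resolve_model_path())
    except FileNotFoundError as err:
        print(err)
```

Failing early with a message that names both options is friendlier than a later deserialization error from a missing file.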
2 changes: 2 additions & 0 deletions contextscore/TrainingAnnotationsSummary.tsv
@@ -0,0 +1,2 @@
True Positives
Total Fragile Sites Telomeres Centromeres Segmental Duplications Conserved Regions
Empty file added contextscore/__init__.py
Empty file.
5 changes: 5 additions & 0 deletions contextscore/__main__.py
@@ -0,0 +1,5 @@
from .predict import main


if __name__ == '__main__':
    main()
61 changes: 61 additions & 0 deletions contextscore/download_tables.py
@@ -0,0 +1,61 @@
import pandas as pd
import pymysql
from pathlib import Path

def download_ucsc(table_name: str,
                  genome_version: str = "hg38",
                  output_file: str = "ucsc_table.bed") -> None:
    """
    Downloads a UCSC table (e.g., simpleRepeat) and saves it as a BED file for use with BEDTools.
    Note: This function requires access to the UCSC MySQL database.
    """
    print("Downloading UCSC " + table_name + " table for " + genome_version + "...")

    # Connect to the UCSC MySQL database for the requested genome version (e.g., hg38, hg19)
    conn = pymysql.connect(host="genome-mysql.soe.ucsc.edu",
                           user="genome",
                           password="",
                           database=genome_version)

    # table_name is interpolated directly, so pass only trusted UCSC table names
    query = f"""
    SELECT
        chrom AS chr,
        chromStart AS start,
        chromEnd AS end,
        name
    FROM
        {table_name}
    WHERE
        chrom IS NOT NULL AND
        chromStart IS NOT NULL AND
        chromEnd IS NOT NULL
        AND
        chromStart >= 0 AND
        chromEnd > chromStart;
    """
    try:
        df = pd.read_sql(query, conn)
    finally:
        # Close the connection even if the query fails
        conn.close()

    # Save as BED file for BEDTools
    df.to_csv(output_file, sep="\t", index=False, header=False)
    print("Downloaded UCSC " + table_name + " table for " + genome_version + " and saved as " + output_file)

if __name__ == "__main__":
    data_dir = Path(__file__).resolve().parents[1] / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Download the UCSC Simple Repeats table for hg38
    simple_repeat_file = str(data_dir / "simple_repeats_hg38.bed")
    download_ucsc(table_name="simpleRepeat",
                  genome_version="hg38",
                  output_file=simple_repeat_file)

    # Download the UCSC phastCons100way table for hg38
    phastcons_file = str(data_dir / "phastcons100way_hg38.bed")
    download_ucsc(table_name="phastCons100way",
                  genome_version="hg38",
                  output_file=phastcons_file)
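
Because `download_ucsc` places `table_name` into the SQL text with an f-string, and SQL identifiers cannot be bound as query parameters, one option is to check the name against a fixed allow-list before building the query. The sketch below is an illustration, not part of this pull request: `ALLOWED_TABLES` and `build_bed_query` are hypothetical names, and the allow-list contains only the two tables requested in the `__main__` block.

```python
# Hypothetical allow-list: the two tables downloaded in the script's __main__ block.
ALLOWED_TABLES = frozenset({"simpleRepeat", "phastCons100way"})


def build_bed_query(table_name: str) -> str:
    """Return the BED-style SELECT for an allow-listed UCSC table."""
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"Table not allowed: {table_name!r}")
    # Safe to interpolate here because the name matched the allow-list exactly.
    return (
        "SELECT chrom, chromStart, chromEnd, name "
        f"FROM {table_name} "
        "WHERE chromStart >= 0 AND chromEnd > chromStart"
    )


if __name__ == "__main__":
    print(build_bed_query("simpleRepeat"))
```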