Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 25 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
The lab offers a diagnostic test for patients with either a congenital myopathy or congenital muscular dystrophy. The test sequences 83 genes that are associated with these conditions. The labwork for the test involves use of a capture kit (Agilent SureSelect) to pulldown the DNA corresponding to these genes for each patient sample. The captured DNA is then sequenced by NGS using an Illumina NextSeq. The sequence data is run through an analytical pipeline that detects variants and reports the level to which each gene has been sequenced. Performance metrics for this test require that every coding base of each gene is covered by at least 30 reads (each gene should be covered to 30x). The coverage data generation part of the pipeline is incomplete and requires some further work to produce a report that highlights any genes that are not covered at 30x.

## Current state
A tool called "sambamba" generates coverage data for each sample that has been tested. The output from sambamba lists each exon of each gene and the percentage coverage at 30x.
A tool called "sambamba" generates coverage data for each sample that has been tested. The output from sambamba lists each exon of each gene and the percentage coverage at 30x.
- see an example output: NGS148_34_139558_CB_CMCMD_S33_R1_001.sambamba_output.txt
- the `percentage30` column in the sambamba output indicates the percentage of each region covered at >= 30x

Expand All @@ -13,3 +13,27 @@ Generate a report that lists any genes that have less than 100% coverage at 30x.

- Ideally using python, write a script that takes the sambamba output and generates a report listing any genes that have less than 100% coverage at 30x
- This script should be able to be applied to any gene panel

## Usage
Install the required Python libraries with the `requirements.txt` file:
```
pip install -r requirements.txt
```

Run the tool with the following options:
```
python find_low_coverage_genes.py \
-i / --input [str] Path to the Sambamba output file \
-o / --output (optional) [str] Name of the output report \
-t / --threshold (optional) [int] Find genes that have less than this coverage threshold (%) at 30x (default 100%)
```

Example data can be found in the `example` directory:
- Input: `NGS148_34_139558_CB_CMCMD_S33_R1_001.sambamba_output.txt`
- Output:`NGS148_34_139558_CB_CMCMD_S33_R1_001.gene_coverage_30x_below_100pc.csv`

The output file was generated using the below command:
```
python find_low_coverage_genes.py \
-i example/NGS148_34_139558_CB_CMCMD_S33_R1_001.sambamba_output.txt
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
GeneSymbol,Accession,AvgGeneCoverage30x
MSTO1,NM_018116.3,98.65
NEB,NM_001271208.1,90.22
SELENON,NM_020451.2,94.93
ZC4H2,NM_018684.3,96.39
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
GeneSymbol,Accession,AvgGeneCoverage30x
NEB,NM_001271208.1,90.22
SELENON,NM_020451.2,94.93
ZC4H2,NM_018684.3,96.39
232 changes: 232 additions & 0 deletions find_low_coverage_genes.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,232 @@
#!/usr/bin/env python3

"""
Author: Rebecca Locke
Date: 03/05/24

Script to parse the output from Sambamba and return genes which are below 30x
coverage.

Usage:
python find_low_coverage_genes.py [-h] \
-i / --input [str] Path to the Sambamba output file \
-o / --output (optional) [str] Name of the output report \
-t / --threshold (optional) [int] Percentage threshold to find genes below at
30x coverage
"""

import argparse
import numpy as np
import pandas as pd

from pathlib import Path


def parse_args() -> argparse.Namespace:
"""
Parse the command line arguments given

Returns
-------
argparse.Namespace
Namespace of passed command line argument inputs
"""
parser = argparse.ArgumentParser(
description='Information required to find low coverage genes'
)

parser.add_argument(
'-i',
'--input',
type=str,
required=True,
help='.txt file input which is output of Sambamba'
)

parser.add_argument(
'-o',
'--output',
type=str,
help='Name for output report of genes below 30x threshold (optional)'
)

parser.add_argument(
'-t',
'--threshold',
type=int,
help='Get genes < this percentage threshold at 30x (default 100pc)',
default=100
)

args = parser.parse_args()

return args


def read_in_sambamba_file(input_file) -> pd.DataFrame:
"""
Read in the Sambamba output file to a pandas df and split the
GeneSymbol;Accession column into two separate columns so we can do
checks and group by gene / transcript

Parameters
----------
input_file : str
Sambamba per-exon coverage .txt file

Returns
-------
sambamba_subset : pd.DataFrame
pandas df with subset of Sambamba info, with columns split out

Example output format:
+-------------+---------------+-------------+------------+-------------+--------------+
| #chromosome | StartPosition | EndPosition | GeneSymbol | Accession | percentage30 |
+-------------+---------------+-------------+------------+-------------+--------------+
| 1 | 26126711 | 26126914 | SELENON | NM_020451.2 | 49.2611 |
| 1 | 26127523 | 26127661 | SELENON | NM_020451.2 | 100.0000 |
+-------------+---------------+-------------+------------+-------------+--------------+
"""
# Read in .txt file, accounting for whitespace
sambamba_df = pd.read_csv(input_file, delim_whitespace=True)

# Remove any full duplicate rows
sambamba_df.drop_duplicates(keep='first', inplace=True)

# Subset to useful columns; if required columns not found raise KeyError
try:
sambamba_subset = sambamba_df[[
'#chromosome', 'StartPosition', 'EndPosition', 'GeneSymbol;Accession', 'percentage30'
]]
except KeyError:
raise KeyError(
"Expecting columns #chromosome, StartPosition, EndPosition, "
"GeneSymbol;Accession, percentage30. Please check input file"
)

# Split out GeneSymbol;Accession column into two separate columns
sambamba_subset[['GeneSymbol', 'Accession']] = sambamba_subset[
'GeneSymbol;Accession'
].str.split(';', expand=True)

return sambamba_subset


def calculate_gene_30x_coverage(sambamba_df) -> pd.DataFrame:
"""
Calculate the coverage of each gene at 30x based on exon 30x % coverage

Parameters
----------
sambamba_df : pd.DataFrame
pandas df with subset of Sambamba info, with columns split out

Returns
-------
gene_cov_30x_df: pd.DataFrame
pandas df with each gene + transcript and the coverage at 30x

Example output format:
+------------+----------------+--------------------+
| GeneSymbol | Accession | AvgGeneCoverage30x |
+------------+----------------+--------------------+
| ACTA1 | NM_001100.3 | 100.000000 |
| B3GALNT2 | NM_152490.4 | 100.000000 |
| MSTO1 | NM_018116.3 | 98.6452609633718 |
| NEB | NM_001271208.1 | 90.2228444732621 |
+------------+----------------+--------------------+
"""
# Add column representing length of each exon
sambamba_df['ExonLength'] = (
sambamba_df['EndPosition'] - sambamba_df['StartPosition']
)

# This calculates gene converage at 30x by:
# calculating fraction of gene each exon covers wrt all tx exon lengths
# then multiplies each exon's gene fraction by its percentage30 value
# then averages the result of this over all of the gene's exons
gene_cov_30x_df = sambamba_df.groupby(['GeneSymbol', 'Accession']).apply(
lambda row: np.average(row['percentage30'], weights=row['ExonLength'])
).to_frame('AvgGeneCoverage30x').reset_index()

return gene_cov_30x_df


def get_genes_below_30x_threshold(gene_coverage_df, threshold) -> pd.DataFrame:
"""
Filter and create a dataframe containing any genes below 30x coverage

Parameters
----------
gene_coverage_df : pd.DataFrame
pandas df of all genes / transcripts and their coverage at 30x
threshold: float
the threshold for coverage at 30x to retrieve genes / transcripts
below this
Returns
-------
genes_below_100x : pd.DataFrame
pandas df with any genes / transcripts with 30x coverage < threshold

Example output format:
+------------+----------------+--------------------+
| GeneSymbol | Accession | AvgGeneCoverage30x |
+------------+----------------+--------------------+
| MSTO1 | NM_018116.3 | 98.65 |
| NEB | NM_001271208.1 | 90.22 |
+------------+----------------+--------------------+
"""
# Extract any genes below 100% at 30x and round values to 2dp
genes_below_100x = gene_coverage_df.loc[
gene_coverage_df['AvgGeneCoverage30x'] < float(threshold)
].round(2).reset_index(drop=True)

return genes_below_100x


def write_out_report(input_file, genes_below_30x, threshold, outfile_name):
"""
Write out the low coverage genes to a CSV

Parameters
----------
input_file : str
path to the input Sambamba file given
genes_below_30x : pd.DataFrame
pandas df with any genes / transcripts with 30x coverage < threshold
threshold: int
threshold which was used to find genes below that coverage at 30x
e.g. 100
outfile_name : str
Name of output report
"""
# If output file name not give as command line arg, name output file
# after the sample name from the input file
# e.g. input of NGS148_34_139558_CB_CMCMD_S33_R1_001.sambamba_output.txt
# and threshold of 100 percent would give output name of
# NGS148_34_139558_CB_CMCMD_S33_R1_001.gene_coverage_30x_below_100pc.csv
if not outfile_name:
outfile_name = (
f"{Path(input_file).stem.split('.')[0]}.gene_coverage_30x_below"
f"_{threshold}pc.csv"
)

# Write out to CSV, without index
genes_below_30x.to_csv(outfile_name, index=False)


def main():
"""
Main function to generate gene 30x coverage CSV report output at threshold
given
"""
args = parse_args()
sambamba_df = read_in_sambamba_file(args.input)
gene_coverage_30x = calculate_gene_30x_coverage(sambamba_df)
genes_below_30x = get_genes_below_30x_threshold(
gene_coverage_30x, args.threshold
)
write_out_report(args.input, genes_below_30x, args.threshold, args.output)

if __name__ == '__main__':
main()
3 changes: 3 additions & 0 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
numpy==1.26.4
pandas==2.1.0
pytest==7.1.3
3 changes: 3 additions & 0 deletions tests/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
from pathlib import Path

TEST_DATA_DIR = f"{Path(__file__).parent.resolve()}/test_data"
Loading