Skip to content

Quant to BED6 Input Format#17

Merged
hbeale merged 45 commits intoBrooksLabUCSC:masterfrom
pRottinghuis:feat/quant-bed6-input
Apr 2, 2026
Merged

Quant to BED6 Input Format#17
hbeale merged 45 commits intoBrooksLabUCSC:masterfrom
pRottinghuis:feat/quant-bed6-input

Conversation

@pRottinghuis
Copy link
Copy Markdown
Contributor

Refactor Quant to BED6 Input Format

Background

This PR removes junction-calling logic from quant and enforces a BED6 input format for junction counts.

The motivation for this PR is to refactor the SpliceDICE workflow to use a different junction caller than bam_to_junc_bed.py. Previously, SpliceDICE used bam_to_junc_bed.py to call junctions from BAM alignment files and then used quant to calculate PS values from those junctions. However, bam_to_junc_bed.py produces redundant junction calls, and the accuracy of these calls is unclear. In addition, quant must filter out the redundant junctions generated by bam_to_junc_bed.py. Since quant is intended only to quantify junctions, having it also perform junction calling creates a design conflict.

This PR removes bam_to_junc_bed.py entirely in favor of a more robust and trusted junction-calling tool, diekhans/intronProspector. intronProspector performs all junction filtering and calling before writing its output, making it safe to assume that all reported junctions can be used directly by quant. In contrast, this assumption could not be safely made for bam_to_junc_bed.py. Because intronProspector eliminates the need for downstream filtering, the junction count output format is also simplified and made more generalizable to other junction-calling tools.

Previously, quant supported multiple input formats, which led to complex, case-by-case handling and inconsistent application of filtering steps. This PR standardizes the input to the BED6 format used by intronProspector (described below) and removes all junction filtering steps. The quantification logic of quant remains unchanged. Adjusting quant to use this more general input format makes it easier to integrate different junction-calling tools in the future.

The output format for quant is changed to be 1-based instead of 0-based half-open. Because downstream tools use 1-based coordinates, all splice junctions are reported by quant using 1-based coordinates.

Changes

bam_to_junc_bed

Removed. Use diekhans/intronProspector instead.

quant

Lives in splicedice/SPLICEDICE.py

Input format

  • Tab-delimited BED6 with columns: chrom, start, end, name, score, strand
  • score is used as the junction read count
  • strand must be + or -

Refactor details

  1. Remove all case-handling for multiple input formats (no more STAR or bam_to_junc_bed.py input)
  2. Parse only BED6 input as defined above
  3. Remove CLI arguments for junction filtering options
  4. Report junctions in 1-based coordinates

Build

  • Use requirements.txt for dependencies
  • Pin all dependency versions
  • Version bump to 1.1.0

Tests

tests/e2e/quant

exon_skip.sh is a script that runs validation using toy data for quant. Run this script from the project root.

The dataset is a toy example for a single exon-skipping event. Three samples represent different PS values for the splice event. A fourth sample contains an unrelated junction as a control

Inputs and expected outputs are located in data/example_data/exon_skip/. See data/example_data/exon_skip/README.md for more details.

Other

  • Add CHANGELOG.md
  • Update README.md documentation

- bed files for exon skipping event
- manifest for quant run
- README.md describing sample data

.gitignore update for output directory in data for test runs.
These serve as benchmark for development.
- require 6 column bed format
- only include junctions that have a strand assigned. This is default from previous implementation
- Remove handling for multiple input formats.
- Don't track counts that are too low. This isn't needed assuming input is properly junction called
Irrelevant now that quant only takes one input bed format
- Remove groups, metadata, and condition attributes which are unused
Includes a bed file to the exon skipping event bed files that includes an unrelated junction. This checks the nan reporting function for ps values
Low coverage report nan is irrlelvant now that the precondition of quant is that all junctions are already filtered.
-Pin tools to newer version. Old versions were not used anyways so there is no way to really know if this is safe.
- Reference requirements.txt in the setup.py
Last tagged version on github is 1.0.0 for mesa. Assuming this should bump on that version.
Update what the expected PS value is for sj3. Should be 33% based on the way that PS is calculated with clusters
New param one_based accepts boolean to determine if coordinates are output in 1 based (True) or 0 based half open (False. Default)
@pRottinghuis pRottinghuis marked this pull request as ready for review March 19, 2026 21:05
Copilot AI review requested due to automatic review settings March 19, 2026 21:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors SpliceDICE’s quant workflow to accept only BED6 junction-count inputs (intended for upstream callers like intronProspector), removes the legacy bam_to_junc_bed command, and updates packaging/docs plus adds a small e2e fixture.

Changes:

  • Standardize quant input parsing to BED6 and remove legacy multi-format/junction-filtering logic.
  • Remove splicedice bam_to_junc_bed CLI command and delete the implementation.
  • Pin dependencies / bump version, add changelog, and add a toy e2e dataset + diff-based validation script.

Reviewed changes

Copilot reviewed 18 out of 20 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
tests/e2e/quant/exon_skip.sh New e2e script that runs splicedice quant and diffs outputs against expected fixtures
splicedice/bam_to_junc_bed.py Removed legacy BAM→junction BED generation implementation
splicedice/main.py Removes the bam_to_junc_bed subcommand wiring
splicedice/SPLICEDICE.py Refactors quant to parse BED6 only; updates output junction identifiers to 1-based strings
setup.py Bumps version to 1.1.0 and sources install_requires from requirements.txt
requirements.txt Updates and pins runtime dependency versions
data/example_data/exon_skip/ps_0.bed Toy BED6 input sample for e2e
data/example_data/exon_skip/ps_50.bed Toy BED6 input sample for e2e
data/example_data/exon_skip/ps_100.bed Toy BED6 input sample for e2e
data/example_data/exon_skip/cntrl.bed Toy BED6 control sample for e2e
data/example_data/exon_skip/manifest.tsv Manifest for the toy e2e run
data/example_data/exon_skip/expected_out/_junctions.bed Expected quant junction BED output for e2e
data/example_data/exon_skip/expected_out/_inclusionCounts.tsv Expected inclusion count matrix for e2e
data/example_data/exon_skip/expected_out/_drimTable.tsv Expected DRIMSeq table for e2e
data/example_data/exon_skip/expected_out/_allPS.tsv Expected PS table for e2e
data/example_data/exon_skip/expected_out/_allClusters.tsv Expected cluster file for e2e
data/example_data/exon_skip/README.md Documents the toy exon-skip dataset semantics
README.md Updates usage docs to describe BED6 input and removal of bam_to_junc_bed
CHANGELOG.md New changelog entry for 1.1.0 describing the refactor
.gitignore Ignores data/output/ generated by the e2e script

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 114 to 120
with open(self.manifestFilename,"r") as manifestFile:
for line in manifestFile:
row = line.rstrip().split("\t")
if len(row) != 4:
pass # improperly formatted manifest
sample = Sample(row)
manifest.append(sample)
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In parseManifest, the if len(row) != 4: branch just passes and then still constructs Sample(row). For malformed/blank lines this will raise IndexError (or silently include garbage). Consider continueing on malformed rows or raising a ValueError with filename + line number so users get a clear manifest-format error.

Copilot uses AI. Check for mistakes.
Comment on lines 194 to +201
for sampleIndex,sample in enumerate(self.manifest):

with open(sample.filename,"r") as sampleFile:

if sample.type == "bed" or sample.type == "splicedicebed" or sample.type == "leafcutter":
for line in sampleFile:
row = line.rstrip().split("\t")

junction = (row[0], int(row[1]), int(row[2]), row[5])

if junction in self.junctionIndex:
score = int(row[4])
counts[self.junctionIndex[junction],sampleIndex] = score
if self.args.lowCoverageNan and score < self.args.minUnique:
low.append((self.junctionIndex[junction],sampleIndex))

elif sample.type == "SJ":

strandSymbol = {'0':'0', '1':'+', '2':'-'}

for line in sampleFile:
row = line.rstrip().split("\t")

junction = (row[0], int(row[1])-1, int(row[2]), strandSymbol[row[3]])


if junction in self.junctionIndex:
if self.args.noMultimap:
counts[self.junctionIndex[junction],sampleIndex] = int(row[6])
else:
counts[self.junctionIndex[junction],sampleIndex] = int(row[6]) + int(row[7])


return counts, low
for line in sampleFile:
row = line.rstrip().split("\t")
junction = (row[0], int(row[1]), int(row[2]), row[5])
if junction in self.junctionIndex:
score = int(row[4])
counts[self.junctionIndex[junction],sampleIndex] = score
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same parsing robustness issue in getJunctionCounts(): it assumes BED6 and does int(row[4]) / row[5] without validating column count or skipping blank/comment lines. This can either crash or (worse) mis-read columns if the input contains extra whitespace. Consider sharing a single BED6 parsing helper that validates and returns (chrom,start,end,strand,count) with good error messages.

Copilot uses AI. Check for mistakes.
Comment on lines 230 to 235
def writeJunctionBed(self):
with open(f"{self.outputPrefix}_junctions.bed", "w") as outbed:
for junction in sorted(self.junctions):
chromosome,left,right,strand = junction
name = f"{chromosome}:{left}-{right}:{strand}"
name = self.junctionString(junction, True)
outbed.write(f"{chromosome}\t{left}\t{right}\t{name}\t0\t{strand}\n")
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeJunctionBed() writes BED columns left/right as the original (0-based half-open) coordinates, but the name field is now emitted as 1-based (junctionString(..., True)). If the PR intent is that all reported junction coordinates are 1-based, the BED columns should be converted too (or the output file should be renamed away from .bed). If the intent is to keep BED semantics (0-based columns), then the docs/CHANGELOG/PR description should clarify that only the identifier strings are 1-based while the BED columns remain 0-based.

Copilot uses AI. Check for mistakes.
Comment on lines 148 to 163

# Specify which Python versions you support. In contrast to the
# 'Programming Language' classifiers above, 'pip install' will check this
# and refuse to install the project if the version does not match. If you
# do not support Python 2, you can simplify this to '>=3.5' or similar, see
# https://packaging.python.org/guides/distributing-packages-using-setuptools/#python-requires
# python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, <4',

# This field lists other packages that your project depends on to run.
# Any package you put here will be installed by pip when your project is
# installed, so they must be valid existing projects.
#
# For an analysis of "install_requires" vs pip's requirements files see:
# https://packaging.python.org/en/latest/requirements.html
install_requires=[
"numpy",
"scipy",
"statsmodels",
"scikit-learn",
"umap-learn",
], # Optional
install_requires=read_requirements("requirements.txt"),

Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that install_requires is sourced from requirements.txt (with newer pins that likely require Python >=3.8), it would be good to set python_requires in setup() so users on unsupported Python versions fail fast with a clear message instead of a cryptic dependency-resolution error.

Copilot uses AI. Check for mistakes.

### Removed
- `bam_to_junc_bed.py`. Use diekhans/intronProspector instead.
- `quant` input support and cl parameters
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog wording: "cl parameters" looks like it should be "CLI parameters" for clarity.

Suggested change
- `quant` input support and cl parameters
- `quant` input support and CLI parameters

Copilot uses AI. Check for mistakes.
Comment on lines 145 to +154
for sample in self.manifest:
with open(sample.filename,"r") as junctionFile:

if sample.type == "SJ":
for line in junctionFile:
row = line.rstrip().split("\t")
chromosome= row[0]
left = int(row[1]) - 1
right = int(row[2])
strand = strandSymbol[row[3]]
intronMotif = int(row[4])
#annotation = int(row[5])
#overhang = int(row[8])
if self.args.noMultimap:
score = int(row[6])
else:
score = int(row[6]) + int(row[7])
if (right-left < self.args.maxLength and
right-left > self.args.minLength and
strand != "0" and
score >= self.args.minUnique and
intronMotif in validMotifs):
junctions.add((chromosome,left,right,strand))

elif sample.type == "splicedicebed":
for line in junctionFile:
row = line.rstrip().split("\t")

score = int(row[4])

info = [x.split(':') for x in row[3].split(';')]
left = int(row[1])
right = int(row[2])
length = right-left

if info[3][1] == "?":

if score < self.args.minUnique:
continue
if length > self.args.maxLength or length < self.args.minLength:
continue
if int(info[1][1]) < self.args.minOverhang:
continue
if float(info[0][1]) < self.args.minEntropy or float(info[0][2]) < self.args.minEntropy:
continue

strand = row[5]
if strand in plusminus:
chromosome = row[0]
junctions.add((chromosome,left,right,strand))

elif sample.type == "bed" or sample.type == "leafcutter":
for line in junctionFile:
row = line.rstrip().split("\t")

score = int(row[4])
if score < self.args.minUnique:
continue

left = int(row[1])
right = int(row[2])
length = right-left
if length > self.args.maxLength or length < self.args.minLength:
continue
strand = row[5]
if strand in plusminus:
chromosome = row[0]
junctions.add((chromosome,left,right,strand))

for line in junctionFile:
row = line.rstrip().split("\t")
chrom = row[0]
start = int(row[1])
end = int(row[2])
strand = row[5]
if strand in plusminus:
junctions.add((chrom, start, end, strand))
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getAllJunctions() assumes every line in every input file is valid BED6 and immediately indexes row[5] / casts row[1:3] to int. Any header line, blank line, comment line, or malformed row will crash with IndexError/ValueError. Add basic row-length checks (and skip empty/comment lines), or raise a clear parse error that includes the offending filename + line number.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +4
splicedice quant \
--manifest=data/example_data/exon_skip/manifest.tsv \
--drim \
--output_prefix=data/output/
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This e2e script doesn’t ensure data/output/ exists or is clean before running splicedice quant. That can cause flaky results (e.g., diffing stale output from a previous run if splicedice quant fails early). Consider rm -rf data/output && mkdir -p data/output (or use a temp dir) and also fail fast if the splicedice quant command exits non-zero.

Suggested change
splicedice quant \
--manifest=data/example_data/exon_skip/manifest.tsv \
--drim \
--output_prefix=data/output/
rm -rf data/output && mkdir -p data/output
splicedice quant \
--manifest=data/example_data/exon_skip/manifest.tsv \
--drim \
--output_prefix=data/output/ || {
echo "ERROR: splicedice quant failed; aborting quant e2e checks" >&2
exit 1
}

Copilot uses AI. Check for mistakes.
@hbeale
Copy link
Copy Markdown
Collaborator

hbeale commented Mar 19, 2026

I like the co-pilot suggestion that we document that "only the identifier strings are 1-based while the BED columns remain 0-based."

hbeale and others added 3 commits April 2, 2026 09:32
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@hbeale
Copy link
Copy Markdown
Collaborator

hbeale commented Apr 2, 2026

I tested this code and confirmed that the intron prospector implementation generates the same results as what we previously saw with bam_to_junc_bed.py in splicedice.

Specifically, building a signature based on the u2af1 genotype (35 wt, 11 s34f) creates a signature that then correctly re-identifies each of the samples used to generate it.

@hbeale hbeale merged commit 28875c0 into BrooksLabUCSC:master Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants