Quant to BED6 Input Format by pRottinghuis · Pull Request #17 · BrooksLabUCSC/splicedice

pRottinghuis · 2026-03-19T21:01:58Z

Refactor Quant to BED6 Input Format

Background

This PR removes junction-calling logic from quant and enforces a BED6 input format for junction counts.

The motivation for this PR is to refactor the SpliceDICE workflow to use a different junction caller than bam_to_junc_bed.py. Previously, SpliceDICE used bam_to_junc_bed.py to call junctions from BAM alignment files and then used quant to calculate PS values from those junctions. However, bam_to_junc_bed.py produces redundant junction calls, and the accuracy of these calls is unclear. In addition, quant must filter out the redundant junctions generated by bam_to_junc_bed.py. Since quant is intended only to quantify junctions, having it also perform junction calling creates a design conflict.

This PR removes bam_to_junc_bed.py entirely in favor of a more robust and trusted junction-calling tool, diekhans/intronProspector. intronProspector performs all junction filtering and calling before writing its output, making it safe to assume that all reported junctions can be used directly by quant. In contrast, this assumption could not be safely made for bam_to_junc_bed.py. Because intronProspector eliminates the need for downstream filtering, the junction count output format is also simplified and made more generalizable to other junction-calling tools.

Previously, quant supported multiple input formats, which led to complex, case-by-case handling and inconsistent application of filtering steps. This PR standardizes the input to the BED6 format used by intronProspector (described below) and removes all junction filtering steps. The quantification logic of quant remains unchanged. Adjusting quant to use this more general input format makes it easier to integrate different junction-calling tools in the future.

The junction identifiers written by quant change from 0-based half-open to 1-based coordinates, because downstream tools use 1-based coordinates. The BED columns of _junctions.bed are not converted and remain 0-based half-open.

Changes

`bam_to_junc_bed`

Removed. Use diekhans/intronProspector instead.

`quant`

Lives in splicedice/SPLICEDICE.py

Input format

Tab-delimited BED6 with columns: chrom, start, end, name, score, strand
score is used as the junction read count
strand must be + or -
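
As a rough illustration of the format above, the following sketch reads BED6 text and keys each junction's read count on `(chrom, start, end, strand)`. The function name `read_junction_counts` and the sample rows are made up for this example; this is not the code in `splicedice/SPLICEDICE.py`, and it assumes well-formed input.

```python
import io


def read_junction_counts(handle):
    """Map (chrom, start, end, strand) -> read count from BED6 lines.

    Hypothetical helper for illustration only; assumes every line has
    exactly six tab-delimited fields.
    """
    counts = {}
    for line in handle:
        chrom, start, end, _name, score, strand = line.rstrip("\n").split("\t")
        if strand not in ("+", "-"):
            continue  # junctions without an assigned strand are not used
        # The score column is interpreted as the junction read count.
        counts[(chrom, int(start), int(end), strand)] = int(score)
    return counts


# Made-up rows: chrom, start, end, name, score, strand
example = io.StringIO(
    "chr1\t100\t200\tsj1\t12\t+\n"
    "chr1\t100\t300\tsj2\t4\t+\n"
    "chr1\t150\t250\tsj3\t7\t.\n"
)
counts = read_junction_counts(example)
```

The third row is dropped because its strand is neither `+` nor `-`.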

Refactor details

Remove all case-handling for multiple input formats (no more STAR or bam_to_junc_bed.py input)
Parse only BED6 input as defined above
Remove CLI arguments for junction filtering options
Report junctions in 1-based coordinates

Build

Use requirements.txt for dependencies
Pin all dependency versions
Version bump to 1.1.0

Tests

`tests/e2e/quant`

exon_skip.sh is a script that runs validation using toy data for quant. Run this script from the project root.

The dataset is a toy example of a single exon-skipping event. Three samples represent different PS values for the splice event. A fourth sample contains an unrelated junction as a control.

Inputs and expected outputs are located in data/example_data/exon_skip/. See data/example_data/exon_skip/README.md for more details.

Other

Add CHANGELOG.md
Update README.md documentation

Copilot

Pull request overview

Refactors SpliceDICE’s quant workflow to accept only BED6 junction-count inputs (intended for upstream callers like intronProspector), removes the legacy bam_to_junc_bed command, and updates packaging/docs plus adds a small e2e fixture.

Changes:

Standardize quant input parsing to BED6 and remove legacy multi-format/junction-filtering logic.
Remove splicedice bam_to_junc_bed CLI command and delete the implementation.
Pin dependencies / bump version, add changelog, and add a toy e2e dataset + diff-based validation script.

Reviewed changes

Copilot reviewed 18 out of 20 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| `tests/e2e/quant/exon_skip.sh` | New e2e script that runs `splicedice quant` and diffs outputs against expected fixtures |
| `splicedice/bam_to_junc_bed.py` | Removed legacy BAM→junction BED generation implementation |
| `splicedice/__main__.py` | Removes the `bam_to_junc_bed` subcommand wiring |
| `splicedice/SPLICEDICE.py` | Refactors `quant` to parse BED6 only; updates output junction identifiers to 1-based strings |
| `setup.py` | Bumps version to 1.1.0 and sources `install_requires` from `requirements.txt` |
| `requirements.txt` | Updates and pins runtime dependency versions |
| `data/example_data/exon_skip/ps_0.bed` | Toy BED6 input sample for e2e |
| `data/example_data/exon_skip/ps_50.bed` | Toy BED6 input sample for e2e |
| `data/example_data/exon_skip/ps_100.bed` | Toy BED6 input sample for e2e |
| `data/example_data/exon_skip/cntrl.bed` | Toy BED6 control sample for e2e |
| `data/example_data/exon_skip/manifest.tsv` | Manifest for the toy e2e run |
| `data/example_data/exon_skip/expected_out/_junctions.bed` | Expected `quant` junction BED output for e2e |
| `data/example_data/exon_skip/expected_out/_inclusionCounts.tsv` | Expected inclusion count matrix for e2e |
| `data/example_data/exon_skip/expected_out/_drimTable.tsv` | Expected DRIMSeq table for e2e |
| `data/example_data/exon_skip/expected_out/_allPS.tsv` | Expected PS table for e2e |
| `data/example_data/exon_skip/expected_out/_allClusters.tsv` | Expected cluster file for e2e |
| `data/example_data/exon_skip/README.md` | Documents the toy exon-skip dataset semantics |
| `README.md` | Updates usage docs to describe BED6 input and removal of `bam_to_junc_bed` |
| `CHANGELOG.md` | New changelog entry for 1.1.0 describing the refactor |
| `.gitignore` | Ignores `data/output/` generated by the e2e script |

Copilot · 2026-03-19T21:09:57Z

splicedice/SPLICEDICE.py

        with open(self.manifestFilename,"r") as manifestFile:
            for line in manifestFile:
                row = line.rstrip().split("\t")
                if len(row) != 4:
                    pass # improperly formatted manifest
                sample = Sample(row)
                manifest.append(sample)


In `parseManifest`, the `if len(row) != 4:` branch just `pass`es and then still constructs `Sample(row)`. For malformed/blank lines this will raise IndexError (or silently include garbage). Consider skipping malformed rows with `continue`, or raising a ValueError with the filename and line number so users get a clear manifest-format error.
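
One way to act on this suggestion is sketched below: fail loudly with the filename and line number. `parse_manifest` is a hypothetical stand-alone function, not the project's `parseManifest`; the only detail taken from the diff is that a manifest row has four tab-separated fields. Tolerating blank lines is a choice made for this sketch.

```python
def parse_manifest(path, expected_columns=4):
    """Return manifest rows as lists of fields, rejecting malformed lines."""
    rows = []
    with open(path, "r") as manifest_file:
        for line_number, line in enumerate(manifest_file, start=1):
            stripped = line.rstrip("\n")
            if not stripped.strip():
                continue  # tolerate blank lines (assumption of this sketch)
            row = stripped.split("\t")
            if len(row) != expected_columns:
                # Point the user at the exact offending line.
                raise ValueError(
                    f"{path}:{line_number}: expected {expected_columns} "
                    f"tab-separated fields, found {len(row)}"
                )
            rows.append(row)
    return rows
```

Raising rather than skipping means a truncated manifest cannot silently drop samples from a run.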

Copilot · 2026-03-19T21:09:58Z

splicedice/SPLICEDICE.py

        for sampleIndex,sample in enumerate(self.manifest):
-
            with open(sample.filename,"r") as sampleFile:
-
-                if sample.type == "bed" or sample.type == "splicedicebed" or sample.type == "leafcutter":
-                    for line in sampleFile:
-                        row = line.rstrip().split("\t")
-
-                        junction = (row[0], int(row[1]), int(row[2]), row[5])
-
-                        if junction in self.junctionIndex:
-                            score = int(row[4])
-                            counts[self.junctionIndex[junction],sampleIndex] = score
-                            if self.args.lowCoverageNan and score < self.args.minUnique:
-                                low.append((self.junctionIndex[junction],sampleIndex))
-
-                elif sample.type == "SJ":
-
-                    strandSymbol = {'0':'0', '1':'+', '2':'-'}
-
-                    for line in sampleFile:
-                        row = line.rstrip().split("\t")
-
-                        junction = (row[0], int(row[1])-1, int(row[2]), strandSymbol[row[3]])
-
-
-                        if junction in self.junctionIndex:
-                            if self.args.noMultimap:
-                                counts[self.junctionIndex[junction],sampleIndex] = int(row[6])
-                            else:
-                                counts[self.junctionIndex[junction],sampleIndex] = int(row[6]) + int(row[7]) 
-
-
-        return counts, low
+                for line in sampleFile:
+                    row = line.rstrip().split("\t")
+                    junction = (row[0], int(row[1]), int(row[2]), row[5])
+                    if junction in self.junctionIndex:
+                        score = int(row[4])
+                        counts[self.junctionIndex[junction],sampleIndex] = score


Same parsing robustness issue in getJunctionCounts(): it assumes BED6 and does int(row[4]) / row[5] without validating column count or skipping blank/comment lines. This can either crash or (worse) mis-read columns if the input contains extra whitespace. Consider sharing a single BED6 parsing helper that validates and returns (chrom,start,end,strand,count) with good error messages.
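
A shared helper along the lines Copilot describes might look like the sketch below. The name `iter_bed6`, the decision to skip `#` comment lines, and the error wording are assumptions for illustration; none of this is taken from the repository. Note that this sketch rejects rows whose strand is not `+` or `-`, whereas the new `getAllJunctions` code in this PR silently skips them.

```python
def iter_bed6(path):
    """Yield (chrom, start, end, strand, count) for each BED6 row in path."""
    with open(path, "r") as bed_file:
        for line_number, line in enumerate(bed_file, start=1):
            text = line.rstrip("\r\n")
            if not text.strip() or text.startswith("#"):
                continue  # skip blank and comment lines
            where = f"{path}:{line_number}"
            row = text.split("\t")
            if len(row) != 6:
                raise ValueError(f"{where}: expected 6 columns, found {len(row)}")
            chrom, start, end, _name, score, strand = row
            if strand not in ("+", "-"):
                raise ValueError(f"{where}: strand must be + or -, got {strand!r}")
            try:
                start_i, end_i, count = int(start), int(end), int(score)
            except ValueError:
                raise ValueError(
                    f"{where}: start, end and score must be integers"
                ) from None
            yield chrom, start_i, end_i, strand, count
```

Both the junction-collection pass and the counting pass could then iterate over `iter_bed6(sample.filename)` instead of splitting lines themselves, so a malformed file produces the same error in either place.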

Copilot · 2026-03-19T21:09:58Z

splicedice/SPLICEDICE.py

    def writeJunctionBed(self):
        with open(f"{self.outputPrefix}_junctions.bed", "w") as outbed:
            for junction in sorted(self.junctions):
                chromosome,left,right,strand = junction
-                name = f"{chromosome}:{left}-{right}:{strand}"
+                name = self.junctionString(junction, True)
                outbed.write(f"{chromosome}\t{left}\t{right}\t{name}\t0\t{strand}\n")


writeJunctionBed() writes BED columns left/right as the original (0-based half-open) coordinates, but the name field is now emitted as 1-based (junctionString(..., True)). If the PR intent is that all reported junction coordinates are 1-based, the BED columns should be converted too (or the output file should be renamed away from .bed). If the intent is to keep BED semantics (0-based columns), then the docs/CHANGELOG/PR description should clarify that only the identifier strings are 1-based while the BED columns remain 0-based.
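
To make the two conventions concrete: a 0-based half-open interval `[start, end)` covers the same bases as the 1-based closed interval `[start + 1, end]`. The sketch below keeps the BED columns 0-based while writing a 1-based name. `junction_string` here is a stand-in written for this example; the real `junctionString` method is not shown in this diff, so its signature and behavior are assumed, and the `chrom:left-right:strand` layout is borrowed from the removed line above.

```python
def junction_string(junction, one_based=False):
    """Format a (chrom, left, right, strand) junction stored as 0-based half-open."""
    chrom, left, right, strand = junction
    if one_based:
        # Only the start shifts; the end is the same number in both systems.
        left += 1
    return f"{chrom}:{left}-{right}:{strand}"


def junction_bed_line(junction):
    """BED6 line with 0-based half-open columns and a 1-based identifier."""
    chrom, left, right, strand = junction
    name = junction_string(junction, one_based=True)
    return f"{chrom}\t{left}\t{right}\t{name}\t0\t{strand}\n"


line = junction_bed_line(("chr1", 100, 200, "+"))
```

Under these assumptions the columns stay `100` and `200` while the name reads `chr1:101-200:+`, which is the mixed convention the comment asks to have documented.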

Copilot · 2026-03-19T21:09:59Z

setup.py


    # Specify which Python versions you support. In contrast to the
    # 'Programming Language' classifiers above, 'pip install' will check this
    # and refuse to install the project if the version does not match. If you
    # do not support Python 2, you can simplify this to '>=3.5' or similar, see
    # https://packaging.python.org/guides/distributing-packages-using-setuptools/#python-requires
    # python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, <4',

    # This field lists other packages that your project depends on to run.
    # Any package you put here will be installed by pip when your project is
    # installed, so they must be valid existing projects.
    #
    # For an analysis of "install_requires" vs pip's requirements files see:
    # https://packaging.python.org/en/latest/requirements.html
-    install_requires=[
-            "numpy",
-            "scipy",
-            "statsmodels",
-            "scikit-learn",
-            "umap-learn",
-            ],  # Optional
+    install_requires=read_requirements("requirements.txt"),



Now that install_requires is sourced from requirements.txt (with newer pins that likely require Python >=3.8), it would be good to set python_requires in setup() so users on unsupported Python versions fail fast with a clear message instead of a cryptic dependency-resolution error.
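
For reference, a helper like `read_requirements` could be as small as the sketch below. The body of the PR's `read_requirements` is not shown in the diff, so this implementation is a guess at its intent, and the `>=3.8` floor in the comment is Copilot's estimate rather than a verified minimum.

```python
from pathlib import Path


def read_requirements(path):
    """Return requirement specifiers from a pip requirements file.

    Blank lines and '#' comment lines are ignored. Sketch only: options
    such as '-r other.txt' are not handled.
    """
    requirements = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            requirements.append(line)
    return requirements


# In setup.py the result could be passed alongside python_requires, e.g.:
#   setup(
#       ...,
#       python_requires=">=3.8",  # assumed floor; check against the pinned versions
#       install_requires=read_requirements("requirements.txt"),
#   )
```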

Copilot · 2026-03-19T21:09:59Z

CHANGELOG.md

+
+### Removed
+- `bam_to_junc_bed.py`. Use diekhans/intronProspector instead.
+- `quant` input support and cl parameters


Changelog wording: "cl parameters" looks like it should be "CLI parameters" for clarity.

Suggested change

-- `quant` input support and cl parameters
+- `quant` input support and CLI parameters

Copilot · 2026-03-19T21:09:59Z

splicedice/SPLICEDICE.py

        for sample in self.manifest:
            with open(sample.filename,"r") as junctionFile:
-
-                if sample.type == "SJ":
-                    for line in junctionFile:
-                        row = line.rstrip().split("\t")
-                        chromosome= row[0]
-                        left = int(row[1]) - 1
-                        right = int(row[2])
-                        strand = strandSymbol[row[3]]
-                        intronMotif = int(row[4])
-                        #annotation = int(row[5])
-                        #overhang = int(row[8])
-                        if self.args.noMultimap:
-                            score = int(row[6])
-                        else:
-                            score = int(row[6]) + int(row[7]) 
-                        if (right-left < self.args.maxLength and 
-                            right-left > self.args.minLength and
-                            strand != "0" and
-                            score >= self.args.minUnique and
-                            intronMotif in validMotifs):
-                            junctions.add((chromosome,left,right,strand))
-
-                elif sample.type == "splicedicebed":
-                    for line in junctionFile:
-                        row = line.rstrip().split("\t")
-
-                        score = int(row[4])
-
-                        info = [x.split(':') for x in row[3].split(';')]
-                        left = int(row[1])
-                        right = int(row[2])
-                        length = right-left
-
-                        if info[3][1] == "?":
-
-                            if score < self.args.minUnique:
-                                continue
-                            if length > self.args.maxLength or length < self.args.minLength:    
-                                continue
-                            if int(info[1][1]) < self.args.minOverhang:
-                                continue
-                            if float(info[0][1]) < self.args.minEntropy or float(info[0][2]) < self.args.minEntropy:
-                                continue
-
-                        strand = row[5]
-                        if strand in plusminus:
-                            chromosome = row[0]
-                            junctions.add((chromosome,left,right,strand))
-
-                elif sample.type == "bed" or sample.type == "leafcutter":
-                    for line in junctionFile:
-                        row = line.rstrip().split("\t")
-
-                        score = int(row[4])
-                        if score < self.args.minUnique:
-                            continue
-
-                        left = int(row[1])
-                        right = int(row[2])
-                        length = right-left
-                        if length > self.args.maxLength or length < self.args.minLength:    
-                            continue
-                        strand = row[5]
-                        if strand in plusminus:
-                            chromosome = row[0]
-                            junctions.add((chromosome,left,right,strand))
-
+                for line in junctionFile:
+                    row = line.rstrip().split("\t")                    
+                    chrom = row[0]
+                    start = int(row[1])
+                    end = int(row[2])
+                    strand = row[5]
+                    if strand in plusminus:
+                        junctions.add((chrom, start, end, strand))


getAllJunctions() assumes every line in every input file is valid BED6 and immediately indexes row[5] / casts row[1:3] to int. Any header line, blank line, comment line, or malformed row will crash with IndexError/ValueError. Add basic row-length checks (and skip empty/comment lines), or raise a clear parse error that includes the offending filename + line number.

Copilot · 2026-03-19T21:10:00Z

tests/e2e/quant/exon_skip.sh

+splicedice quant \
+    --manifest=data/example_data/exon_skip/manifest.tsv \
+    --drim \
+    --output_prefix=data/output/


This e2e script doesn’t ensure data/output/ exists or is clean before running splicedice quant. That can cause flaky results (e.g., diffing stale output from a previous run if splicedice quant fails early). Consider rm -rf data/output && mkdir -p data/output (or use a temp dir) and also fail fast if the splicedice quant command exits non-zero.

Suggested change

-splicedice quant \
-    --manifest=data/example_data/exon_skip/manifest.tsv \
-    --drim \
-    --output_prefix=data/output/
+rm -rf data/output && mkdir -p data/output
+splicedice quant \
+    --manifest=data/example_data/exon_skip/manifest.tsv \
+    --drim \
+    --output_prefix=data/output/ || {
+    echo "ERROR: splicedice quant failed; aborting quant e2e checks" >&2
+    exit 1
+}
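
The same idea (start from a clean output directory, stop if the command fails, then compare against the fixtures) can also be written in Python. This is a sketch rather than a replacement for `exon_skip.sh`: the function names are invented, and it assumes that `--output_prefix=data/output/` produces files named like those under `expected_out/`.

```python
import filecmp
import shutil
import subprocess
from pathlib import Path


def mismatched_outputs(expected_dir, actual_dir):
    """Return names of expected files that are missing or differ in actual_dir."""
    bad = []
    for expected in sorted(Path(expected_dir).iterdir()):
        actual = Path(actual_dir) / expected.name
        # shallow=False forces a byte-by-byte comparison.
        if not actual.is_file() or not filecmp.cmp(expected, actual, shallow=False):
            bad.append(expected.name)
    return bad


def run_exon_skip_check(root="."):
    """Run quant on the toy data from a clean directory and list mismatches."""
    root = Path(root)
    out_dir = root / "data" / "output"
    shutil.rmtree(out_dir, ignore_errors=True)  # avoid diffing stale results
    out_dir.mkdir(parents=True)
    subprocess.run(
        [
            "splicedice", "quant",
            "--manifest=data/example_data/exon_skip/manifest.tsv",
            "--drim",
            "--output_prefix=data/output/",
        ],
        cwd=root,
        check=True,  # raises CalledProcessError if quant exits non-zero
    )
    expected_dir = root / "data" / "example_data" / "exon_skip" / "expected_out"
    return mismatched_outputs(expected_dir, out_dir)
```

An empty list from `run_exon_skip_check()` would mean every expected file was reproduced exactly.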

hbeale · 2026-03-19T21:39:09Z

I like the Copilot suggestion that we document that "only the identifier strings are 1-based while the BED columns remain 0-based."

hbeale · 2026-04-02T16:41:22Z

I tested this code and confirmed that the intronProspector implementation generates the same results as what we previously saw with bam_to_junc_bed.py in splicedice.

Specifically, building a signature based on the U2AF1 genotype (35 WT, 11 S34F) creates a signature that then correctly re-identifies each of the samples used to generate it.

pRottinghuis added 30 commits February 26, 2026 15:57

feat: exon skipping example data

9707150

- bed files for exon skipping event
- manifest for quant run
- README.md describing sample data
.gitignore update for output directory in data for test runs.

feat: add expected output files for exon skip data

e5df32c

These serve as benchmark for development.

feat: getAllJunctions enforces standard bed format

6db2498

- require 6 column bed format
- only include junctions that have a strand assigned. This is the default from the previous implementation

feat: enforce bed6 format on getJunctionCounts

6771cc6

- Remove handling for multiple input formats.
- Don't track counts that are too low. This isn't needed, assuming the input has already been properly junction-called

docs: quant docstring to outline input bed format and filtered precon…

24481b7

…dition

feat: remove type attribute from class Sample

96938f4

Irrelevant now that quant only takes one input bed format

docs: todo marker for potential input validation

9020496

refactor: remove unused attributes from quant class Sample

3912af7

- Remove groups, metadata, and condition attributes which are unused

refactor: remove unused min and max intron size filters

414b0ed

refactor: remove unused filters in quant

669d3db

feat: exon skip event control bed file

5bc93d2

Adds a bed file containing an unrelated junction to the exon skipping event bed files. This checks the NaN reporting for PS values

feat: remove low coverage nan argument

d404573

Reporting NaN for low coverage is irrelevant now that the precondition of quant is that all junctions are already filtered.

refactor: remove unused --filter argument.

dc63cfc

chore: quant file formatting

a974792

build: update requirements.txt and set in setup.py

72e98d1

- Pin tools to newer versions. The old versions were not used anyway, so there is no real way to know whether this is safe.
- Reference requirements.txt in setup.py

chore: version bump to 1.0.0

3ec8e6e

Last tagged version on github is 1.0.0 for mesa. Assuming this should bump on that version.

docs: include changelog

c8caf50

docs: update example data readme for ps_50.bed

483504c

Update what the expected PS value is for sj3. Should be 33% based on the way that PS is calculated with clusters

test: update expected outputs on toy data for control sample

ca01649

test: include drim output for validation

b466f13

feat: junctionString conversion function can output 1-based

3a95eb0

New param one_based accepts a boolean that determines whether coordinates are output as 1-based (True) or 0-based half-open (False, the default)

docs: remove TODO marker for quant input validation

37928fb

Not implementing this feature

refactor: use junctionString method to generate strings in writeClusters

8afd3b4

feat: _allClusters.tsv outputs with 1-based coordinates

5bee48c

test: correct expected output for _allClusters.tsv for 1-based

31612b5

test: _allPS.tsv expected output updated to 1-based coordinates

e922a0b

feat: quant _allPS.tsv outputs 1-based junction coordinates

6040a5a

test: update _drimTable.tsv expected out to use 1-based junction coords

2c363de

feat: quant _drimTable.tsv output uses 1-based junction coords

8e9ad7e

test: _inclusionCounts.tsv expected output updated to 1-based junctio…

ad1933a

…n coords

pRottinghuis added 12 commits March 11, 2026 10:55

feat: quant _inclusionCounts.tsv outputs 1-based junction coords

68abc3b

test: _junctions.bed expected output uses 1-based junction coordinates

89a41f0

Does not change the bed coordinates which should still be 0-based half open

feat: _junctions.bed output cluster coordinates in 1-based

874b0da

feat: remove bam_to_junc_bed.py script and update __main__.py

1a9e7e4

docs: mark suggested alternative for bam_to_junc_bed.py in changelog.md

f01de70

docs: remove bam_to_junc_bed from main readme

1ec37e1

docs: update quant description in readme

3378758

refactor: remove old class attribute in quant

7c99c7c

docs: remove todo marker in getJunctionCounts in quant

36993d0

docs: update changelog to reflect quant refactor changes

33aed9c

test: correct 1-based coordinate in expected drim table output

0329562

test: include a quick e2e test script that runs toy exon skip validat…

0f01907

…ion for quant

pRottinghuis marked this pull request as ready for review March 19, 2026 21:05

Copilot AI review requested due to automatic review settings March 19, 2026 21:05

Copilot started reviewing on behalf of pRottinghuis March 19, 2026 21:06

Copilot AI reviewed Mar 19, 2026

hbeale and others added 3 commits April 2, 2026 09:32

Update splicedice/SPLICEDICE.py

4b7fc7b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update README.md

3d1f659

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix typo in splicedice/SPLICEDICE.py

fcab149

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

hbeale merged commit 28875c0 into BrooksLabUCSC:master Apr 2, 2026

hbeale merged 45 commits into BrooksLabUCSC:master from pRottinghuis:feat/quant-bed6-input