Quant to BED6 Input Format#17
Conversation
- bed files for exon skipping event - manifest for quant run - README.md describing sample data .gitignore update for output directory in data for test runs.
These serve as benchmark for development.
- require 6 column bed format - only include junctions that have a strand assigned. This is default from previous implementation
- Remove handling for multiple input formats. - Don't track counts that are too low. This isn't needed assuming input is properly junction called
Irrelevant now that quant only takes one input bed format
- Remove groups, metadata, and condition attributes which are unused
Includes a bed file to the exon skipping event bed files that includes an unrelated junction. This checks the nan reporting function for ps values
Low coverage report nan is irrlelvant now that the precondition of quant is that all junctions are already filtered.
-Pin tools to newer version. Old versions were not used anyways so there is no way to really know if this is safe. - Reference requirements.txt in the setup.py
Last tagged version on github is 1.0.0 for mesa. Assuming this should bump on that version.
Update what the expected PS value is for sj3. Should be 33% based on the way that PS is calculated with clusters
New param one_based accepts boolean to determine if coordinates are output in 1 based (True) or 0 based half open (False. Default)
Not implementing this feature
Does not change the bed coordinates which should still be 0-based half open
There was a problem hiding this comment.
Pull request overview
Refactors SpliceDICE’s quant workflow to accept only BED6 junction-count inputs (intended for upstream callers like intronProspector), removes the legacy bam_to_junc_bed command, and updates packaging/docs plus adds a small e2e fixture.
Changes:
- Standardize
quantinput parsing to BED6 and remove legacy multi-format/junction-filtering logic. - Remove
splicedice bam_to_junc_bedCLI command and delete the implementation. - Pin dependencies / bump version, add changelog, and add a toy e2e dataset + diff-based validation script.
Reviewed changes
Copilot reviewed 18 out of 20 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/e2e/quant/exon_skip.sh | New e2e script that runs splicedice quant and diffs outputs against expected fixtures |
| splicedice/bam_to_junc_bed.py | Removed legacy BAM→junction BED generation implementation |
| splicedice/main.py | Removes the bam_to_junc_bed subcommand wiring |
| splicedice/SPLICEDICE.py | Refactors quant to parse BED6 only; updates output junction identifiers to 1-based strings |
| setup.py | Bumps version to 1.1.0 and sources install_requires from requirements.txt |
| requirements.txt | Updates and pins runtime dependency versions |
| data/example_data/exon_skip/ps_0.bed | Toy BED6 input sample for e2e |
| data/example_data/exon_skip/ps_50.bed | Toy BED6 input sample for e2e |
| data/example_data/exon_skip/ps_100.bed | Toy BED6 input sample for e2e |
| data/example_data/exon_skip/cntrl.bed | Toy BED6 control sample for e2e |
| data/example_data/exon_skip/manifest.tsv | Manifest for the toy e2e run |
| data/example_data/exon_skip/expected_out/_junctions.bed | Expected quant junction BED output for e2e |
| data/example_data/exon_skip/expected_out/_inclusionCounts.tsv | Expected inclusion count matrix for e2e |
| data/example_data/exon_skip/expected_out/_drimTable.tsv | Expected DRIMSeq table for e2e |
| data/example_data/exon_skip/expected_out/_allPS.tsv | Expected PS table for e2e |
| data/example_data/exon_skip/expected_out/_allClusters.tsv | Expected cluster file for e2e |
| data/example_data/exon_skip/README.md | Documents the toy exon-skip dataset semantics |
| README.md | Updates usage docs to describe BED6 input and removal of bam_to_junc_bed |
| CHANGELOG.md | New changelog entry for 1.1.0 describing the refactor |
| .gitignore | Ignores data/output/ generated by the e2e script |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| with open(self.manifestFilename,"r") as manifestFile: | ||
| for line in manifestFile: | ||
| row = line.rstrip().split("\t") | ||
| if len(row) != 4: | ||
| pass # improperly formatted manifest | ||
| sample = Sample(row) | ||
| manifest.append(sample) |
There was a problem hiding this comment.
In parseManifest, the if len(row) != 4: branch just passes and then still constructs Sample(row). For malformed/blank lines this will raise IndexError (or silently include garbage). Consider continueing on malformed rows or raising a ValueError with filename + line number so users get a clear manifest-format error.
| for sampleIndex,sample in enumerate(self.manifest): | ||
|
|
||
| with open(sample.filename,"r") as sampleFile: | ||
|
|
||
| if sample.type == "bed" or sample.type == "splicedicebed" or sample.type == "leafcutter": | ||
| for line in sampleFile: | ||
| row = line.rstrip().split("\t") | ||
|
|
||
| junction = (row[0], int(row[1]), int(row[2]), row[5]) | ||
|
|
||
| if junction in self.junctionIndex: | ||
| score = int(row[4]) | ||
| counts[self.junctionIndex[junction],sampleIndex] = score | ||
| if self.args.lowCoverageNan and score < self.args.minUnique: | ||
| low.append((self.junctionIndex[junction],sampleIndex)) | ||
|
|
||
| elif sample.type == "SJ": | ||
|
|
||
| strandSymbol = {'0':'0', '1':'+', '2':'-'} | ||
|
|
||
| for line in sampleFile: | ||
| row = line.rstrip().split("\t") | ||
|
|
||
| junction = (row[0], int(row[1])-1, int(row[2]), strandSymbol[row[3]]) | ||
|
|
||
|
|
||
| if junction in self.junctionIndex: | ||
| if self.args.noMultimap: | ||
| counts[self.junctionIndex[junction],sampleIndex] = int(row[6]) | ||
| else: | ||
| counts[self.junctionIndex[junction],sampleIndex] = int(row[6]) + int(row[7]) | ||
|
|
||
|
|
||
| return counts, low | ||
| for line in sampleFile: | ||
| row = line.rstrip().split("\t") | ||
| junction = (row[0], int(row[1]), int(row[2]), row[5]) | ||
| if junction in self.junctionIndex: | ||
| score = int(row[4]) | ||
| counts[self.junctionIndex[junction],sampleIndex] = score |
There was a problem hiding this comment.
Same parsing robustness issue in getJunctionCounts(): it assumes BED6 and does int(row[4]) / row[5] without validating column count or skipping blank/comment lines. This can either crash or (worse) mis-read columns if the input contains extra whitespace. Consider sharing a single BED6 parsing helper that validates and returns (chrom,start,end,strand,count) with good error messages.
| def writeJunctionBed(self): | ||
| with open(f"{self.outputPrefix}_junctions.bed", "w") as outbed: | ||
| for junction in sorted(self.junctions): | ||
| chromosome,left,right,strand = junction | ||
| name = f"{chromosome}:{left}-{right}:{strand}" | ||
| name = self.junctionString(junction, True) | ||
| outbed.write(f"{chromosome}\t{left}\t{right}\t{name}\t0\t{strand}\n") |
There was a problem hiding this comment.
writeJunctionBed() writes BED columns left/right as the original (0-based half-open) coordinates, but the name field is now emitted as 1-based (junctionString(..., True)). If the PR intent is that all reported junction coordinates are 1-based, the BED columns should be converted too (or the output file should be renamed away from .bed). If the intent is to keep BED semantics (0-based columns), then the docs/CHANGELOG/PR description should clarify that only the identifier strings are 1-based while the BED columns remain 0-based.
|
|
||
| # Specify which Python versions you support. In contrast to the | ||
| # 'Programming Language' classifiers above, 'pip install' will check this | ||
| # and refuse to install the project if the version does not match. If you | ||
| # do not support Python 2, you can simplify this to '>=3.5' or similar, see | ||
| # https://packaging.python.org/guides/distributing-packages-using-setuptools/#python-requires | ||
| # python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, <4', | ||
|
|
||
| # This field lists other packages that your project depends on to run. | ||
| # Any package you put here will be installed by pip when your project is | ||
| # installed, so they must be valid existing projects. | ||
| # | ||
| # For an analysis of "install_requires" vs pip's requirements files see: | ||
| # https://packaging.python.org/en/latest/requirements.html | ||
| install_requires=[ | ||
| "numpy", | ||
| "scipy", | ||
| "statsmodels", | ||
| "scikit-learn", | ||
| "umap-learn", | ||
| ], # Optional | ||
| install_requires=read_requirements("requirements.txt"), | ||
|
|
There was a problem hiding this comment.
Now that install_requires is sourced from requirements.txt (with newer pins that likely require Python >=3.8), it would be good to set python_requires in setup() so users on unsupported Python versions fail fast with a clear message instead of a cryptic dependency-resolution error.
|
|
||
| ### Removed | ||
| - `bam_to_junc_bed.py`. Use diekhans/intronProspector instead. | ||
| - `quant` input support and cl parameters |
There was a problem hiding this comment.
Changelog wording: "cl parameters" looks like it should be "CLI parameters" for clarity.
| - `quant` input support and cl parameters | |
| - `quant` input support and CLI parameters |
| for sample in self.manifest: | ||
| with open(sample.filename,"r") as junctionFile: | ||
|
|
||
| if sample.type == "SJ": | ||
| for line in junctionFile: | ||
| row = line.rstrip().split("\t") | ||
| chromosome= row[0] | ||
| left = int(row[1]) - 1 | ||
| right = int(row[2]) | ||
| strand = strandSymbol[row[3]] | ||
| intronMotif = int(row[4]) | ||
| #annotation = int(row[5]) | ||
| #overhang = int(row[8]) | ||
| if self.args.noMultimap: | ||
| score = int(row[6]) | ||
| else: | ||
| score = int(row[6]) + int(row[7]) | ||
| if (right-left < self.args.maxLength and | ||
| right-left > self.args.minLength and | ||
| strand != "0" and | ||
| score >= self.args.minUnique and | ||
| intronMotif in validMotifs): | ||
| junctions.add((chromosome,left,right,strand)) | ||
|
|
||
| elif sample.type == "splicedicebed": | ||
| for line in junctionFile: | ||
| row = line.rstrip().split("\t") | ||
|
|
||
| score = int(row[4]) | ||
|
|
||
| info = [x.split(':') for x in row[3].split(';')] | ||
| left = int(row[1]) | ||
| right = int(row[2]) | ||
| length = right-left | ||
|
|
||
| if info[3][1] == "?": | ||
|
|
||
| if score < self.args.minUnique: | ||
| continue | ||
| if length > self.args.maxLength or length < self.args.minLength: | ||
| continue | ||
| if int(info[1][1]) < self.args.minOverhang: | ||
| continue | ||
| if float(info[0][1]) < self.args.minEntropy or float(info[0][2]) < self.args.minEntropy: | ||
| continue | ||
|
|
||
| strand = row[5] | ||
| if strand in plusminus: | ||
| chromosome = row[0] | ||
| junctions.add((chromosome,left,right,strand)) | ||
|
|
||
| elif sample.type == "bed" or sample.type == "leafcutter": | ||
| for line in junctionFile: | ||
| row = line.rstrip().split("\t") | ||
|
|
||
| score = int(row[4]) | ||
| if score < self.args.minUnique: | ||
| continue | ||
|
|
||
| left = int(row[1]) | ||
| right = int(row[2]) | ||
| length = right-left | ||
| if length > self.args.maxLength or length < self.args.minLength: | ||
| continue | ||
| strand = row[5] | ||
| if strand in plusminus: | ||
| chromosome = row[0] | ||
| junctions.add((chromosome,left,right,strand)) | ||
|
|
||
| for line in junctionFile: | ||
| row = line.rstrip().split("\t") | ||
| chrom = row[0] | ||
| start = int(row[1]) | ||
| end = int(row[2]) | ||
| strand = row[5] | ||
| if strand in plusminus: | ||
| junctions.add((chrom, start, end, strand)) |
There was a problem hiding this comment.
getAllJunctions() assumes every line in every input file is valid BED6 and immediately indexes row[5] / casts row[1:3] to int. Any header line, blank line, comment line, or malformed row will crash with IndexError/ValueError. Add basic row-length checks (and skip empty/comment lines), or raise a clear parse error that includes the offending filename + line number.
| splicedice quant \ | ||
| --manifest=data/example_data/exon_skip/manifest.tsv \ | ||
| --drim \ | ||
| --output_prefix=data/output/ |
There was a problem hiding this comment.
This e2e script doesn’t ensure data/output/ exists or is clean before running splicedice quant. That can cause flaky results (e.g., diffing stale output from a previous run if splicedice quant fails early). Consider rm -rf data/output && mkdir -p data/output (or use a temp dir) and also fail fast if the splicedice quant command exits non-zero.
| splicedice quant \ | |
| --manifest=data/example_data/exon_skip/manifest.tsv \ | |
| --drim \ | |
| --output_prefix=data/output/ | |
| rm -rf data/output && mkdir -p data/output | |
| splicedice quant \ | |
| --manifest=data/example_data/exon_skip/manifest.tsv \ | |
| --drim \ | |
| --output_prefix=data/output/ || { | |
| echo "ERROR: splicedice quant failed; aborting quant e2e checks" >&2 | |
| exit 1 | |
| } |
|
I like the co-pilot suggestion that we document that "only the identifier strings are 1-based while the BED columns remain 0-based." |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
I tested this code and confirmed that the intron prospector implementation generates the same results as what we previously saw with bam_to_junc_bed.py in splicedice. Specifically, building a signature based on the u2af1 genotype (35 wt, 11 s34f) creates a signature that then correctly re-identifies each of the samples used to generate it. |
Refactor Quant to BED6 Input Format
Background
This PR removes junction-calling logic from
quantand enforces a BED6 input format for junction counts.The motivation for this PR is to refactor the SpliceDICE workflow to use a different junction caller than
bam_to_junc_bed.py. Previously, SpliceDICE usedbam_to_junc_bed.pyto call junctions from BAM alignment files and then usedquantto calculate PS values from those junctions. However,bam_to_junc_bed.pyproduces redundant junction calls, and the accuracy of these calls is unclear. In addition,quantmust filter out the redundant junctions generated bybam_to_junc_bed.py. Sincequantis intended only to quantify junctions, having it also perform junction calling creates a design conflict.This PR removes
bam_to_junc_bed.pyentirely in favor of a more robust and trusted junction-calling tool, diekhans/intronProspector. intronProspector performs all junction filtering and calling before writing its output, making it safe to assume that all reported junctions can be used directly byquant. In contrast, this assumption could not be safely made forbam_to_junc_bed.py. Because intronProspector eliminates the need for downstream filtering, the junction count output format is also simplified and made more generalizable to other junction-calling tools.Previously,
quantsupported multiple input formats, which led to complex, case-by-case handling and inconsistent application of filtering steps. This PR standardizes the input to the BED6 format used by intronProspector (described below) and removes all junction filtering steps. The quantification logic ofquantremains unchanged. Adjustingquantto use this more general input format makes it easier to integrate different junction-calling tools in the future.The output format for
quantis changed to be 1-based instead of 0-based half-open. Because downstream tools use 1-based coordinates, all splice junctions are reported byquantusing 1-based coordinates.Changes
bam_to_junc_bedRemoved. Use diekhans/intronProspector instead.
quantLives in
splicedice/SPLICEDICE.pyInput format
chrom,start,end,name,score,strandscoreis used as the junction read countstrandmust be+or-Refactor details
bam_to_junc_bed.pyinput)Build
requirements.txtfor dependencies1.1.0Tests
tests/e2e/quantexon_skip.shis a script that runs validation using toy data forquant. Run this script from the project root.The dataset is a toy example for a single exon-skipping event. Three samples represent different PS values for the splice event. A fourth sample contains an unrelated junction as a control
Inputs and expected outputs are located in
data/example_data/exon_skip/. Seedata/example_data/exon_skip/README.mdfor more details.Other
CHANGELOG.mdREADME.mddocumentation