Skip to content

Fix off-by-one error#7

Merged
imallona merged 7 commits into
mainfrom
fix-coords-stitching
May 19, 2026
Merged

Fix off-by-one error#7
imallona merged 7 commits into
mainfrom
fix-coords-stitching

Conversation

@imallona
Copy link
Copy Markdown
Owner

@imallona imallona commented May 18, 2026

This is a major PR fixing off-by-one coordinate errors; with these fixes, performance is comparable or better than derfinder's.

@imallona
Copy link
Copy Markdown
Owner Author

This is already competitive, comparable/better than derfinder

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR corrects coordinate-system mismatches between fastder’s internal 0-based half-open intervals (BedGraph/BigWig-style), STAR RR splice-junction coordinates, and the 1-based inclusive GTF format—addressing off-by-one errors in junction alignment and GTF output.

Changes:

  • Convert RR splice junction starts to 0-based half-open on read, and convert GTF start to 1-based inclusive on write.
  • Snap exon boundaries to splice junction donor/acceptor coordinates when emitting GTF (with fallback to coverage-derived bounds if snapping becomes invalid).
  • Update unit tests, test fixture GTFs, CLI/docs defaults/wording, and add a CHANGELOG entry describing the behavior changes.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/UnitTests.cpp Updates RR parsing expectations to reflect 1-based → 0-based conversion.
tests/gtfs/splicing_scenarios_test1.gtf Updates expected/sample GTF output to new coordinate and strand behavior.
tests/gtfs/splicing_scenarios_test2.gtf Updates expected/sample GTF output to new coordinate and strand behavior.
tests/gtfs/splicing_scenarios_test3.gtf Updates expected/sample GTF output to new coordinate and strand behavior.
tests/gtfs/splicing_scenarios_test4.gtf Updates expected/sample GTF output to new coordinate and strand behavior.
tests/gtfs/splicing_scenarios_test5.gtf Updates expected/sample GTF output to new coordinate and strand behavior.
tests/gtfs/splicing_scenarios_test6.gtf Updates expected/sample GTF output to new coordinate and strand behavior.
tests/gtfs/splicing_scenarios_test7.gtf Updates expected/sample GTF output to new coordinate and strand behavior.
tests/gtfs/splicing_scenarios_test8.gtf Updates expected/sample GTF output to new coordinate and strand behavior.
tests/gtfs/parser_test3.gtf Updates expected/sample GTF output for parser/integrator scenario.
README.md Updates CLI documentation (defaults/flags) to match new stitching + tolerance behavior.
cpp/StitchedER.h Extends stitched ER state to retain per-segment genomic bounds for GTF snapping.
cpp/StitchedER.cpp Populates the new per-segment bounds and extends append to carry them.
cpp/SJRow.h Shifts RR junction start by 1 to align STAR SJ.out.tab coordinates with internal 0-based half-open conventions.
cpp/main.cpp Changes default coverage_tolerance and updates CLI help text.
cpp/Integrator.cpp Records junction/ER bounds when stitching and snaps exon edges to junction coordinates during GTF emission.
cpp/GTFRow.h Converts internal 0-based start to 1-based GTF start on serialization.
CHANGELOG.md Documents the coordinate fixes and default stitching behavior change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/SJRow.h
Comment on lines +43 to 50
// The RR file gives 1-based inclusive intron coordinates (STAR
// SJ.out.tab convention). fastder works in 0-based half-open
// coordinates, like the BedGraph / BigWig coverage, so shift the
// start by one; the end already equals the 0-based half-open end.
// This keeps junction-to-ER comparisons and the exon-boundary
// snapping consistent with the coverage coordinates.
row.start -= 1;
return is;
Comment thread cpp/main.cpp
Comment on lines +55 to +59
<< " --coverage-tolerance <float> Permitted coverage deviation when stitching, as a proportion: a\n"
<< " neighbour passes if its coverage is within (1 +/- tolerance)\n"
<< " of the running average. Not bound to (0,1); 2.0 allows a 3x\n"
<< " spread. Default = 1000 (the gate is effectively off, stitching\n"
<< " is driven by splice junctions).\n"
Comment thread cpp/GTFRow.h
Comment on lines +31 to 36
// Internal coordinates are 0-based half-open (BedGraph / BigWig
// convention). GTF is 1-based inclusive, so the start is shifted by
// one. The end needs no shift: a 0-based half-open end equals the
// 1-based inclusive end of the same interval.
return os << row.seqname << "\t" << row.source << "\t" << row.feature << "\t" << (row.start + 1) << "\t" << row.end << "\t" <<
row.score << "\t" << row.strand << "\t" << row.frame << "\t" << row.attribute;
Comment thread cpp/Integrator.cpp
Comment on lines +257 to +286
// Resolve each exon's [start, end]. The default is the ER's
// coverage-derived extent; an edge that abuts a splice junction is
// snapped to the junction coordinate (the donor for an exon end, the
// acceptor for the next exon start), since the junction marks the
// exact splice site. Edges with no adjacent junction, namely the
// outer ends of a chain and both ends of a monoexonic ER, keep the
// coverage extent.
std::vector<std::pair<uint64_t, uint64_t>> exons;
std::vector<double> exon_scores;
for (unsigned int k = 0; k < ser.er_ids.size(); ++k)
{
if (ser.er_ids.at(k) == -1) continue; // spliced region, not an exon
const uint64_t cov_start = ser.er_bounds.at(k).first;
const uint64_t cov_end = ser.er_bounds.at(k).second;
uint64_t ex_start = cov_start;
uint64_t ex_end = cov_end;
if (k > 0 && ser.er_ids.at(k - 1) == -1)
ex_start = ser.er_bounds.at(k - 1).second; // splice acceptor
if (k + 1 < ser.er_ids.size() && ser.er_ids.at(k + 1) == -1)
ex_end = ser.er_bounds.at(k + 1).first; // splice donor
// Keep the snapped edges only if they still form a valid exon.
// With a large position_tolerance two junctions flanking a short
// expressed region can snap past each other (start >= end); fall
// back to the coverage extent, which is always a valid interval.
if (ex_start >= ex_end)
{
ex_start = cov_start;
ex_end = cov_end;
}
exons.emplace_back(ex_start, ex_end);
@imallona imallona merged commit 3e78131 into main May 19, 2026
2 checks passed
@imallona imallona deleted the fix-coords-stitching branch May 19, 2026 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants