Fix off-by-one error#7
Merged
Merged
Conversation
Owner
Author
|
This is already competitive, comparable/better than derfinder |
There was a problem hiding this comment.
Pull request overview
This PR corrects coordinate-system mismatches between fastder’s internal 0-based half-open intervals (BedGraph/BigWig-style), STAR RR splice-junction coordinates, and the 1-based inclusive GTF format—addressing off-by-one errors in junction alignment and GTF output.
Changes:
- Convert RR splice junction starts to 0-based half-open on read, and convert GTF start to 1-based inclusive on write.
- Snap exon boundaries to splice junction donor/acceptor coordinates when emitting GTF (with fallback to coverage-derived bounds if snapping becomes invalid).
- Update unit tests, test fixture GTFs, CLI/docs defaults/wording, and add a CHANGELOG entry describing the behavior changes.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/UnitTests.cpp | Updates RR parsing expectations to reflect 1-based → 0-based conversion. |
| tests/gtfs/splicing_scenarios_test1.gtf | Updates expected/sample GTF output to new coordinate and strand behavior. |
| tests/gtfs/splicing_scenarios_test2.gtf | Updates expected/sample GTF output to new coordinate and strand behavior. |
| tests/gtfs/splicing_scenarios_test3.gtf | Updates expected/sample GTF output to new coordinate and strand behavior. |
| tests/gtfs/splicing_scenarios_test4.gtf | Updates expected/sample GTF output to new coordinate and strand behavior. |
| tests/gtfs/splicing_scenarios_test5.gtf | Updates expected/sample GTF output to new coordinate and strand behavior. |
| tests/gtfs/splicing_scenarios_test6.gtf | Updates expected/sample GTF output to new coordinate and strand behavior. |
| tests/gtfs/splicing_scenarios_test7.gtf | Updates expected/sample GTF output to new coordinate and strand behavior. |
| tests/gtfs/splicing_scenarios_test8.gtf | Updates expected/sample GTF output to new coordinate and strand behavior. |
| tests/gtfs/parser_test3.gtf | Updates expected/sample GTF output for parser/integrator scenario. |
| README.md | Updates CLI documentation (defaults/flags) to match new stitching + tolerance behavior. |
| cpp/StitchedER.h | Extends stitched ER state to retain per-segment genomic bounds for GTF snapping. |
| cpp/StitchedER.cpp | Populates the new per-segment bounds and extends append to carry them. |
| cpp/SJRow.h | Shifts RR junction start by 1 to align STAR SJ.out.tab coordinates with internal 0-based half-open conventions. |
| cpp/main.cpp | Changes default coverage_tolerance and updates CLI help text. |
| cpp/Integrator.cpp | Records junction/ER bounds when stitching and snaps exon edges to junction coordinates during GTF emission. |
| cpp/GTFRow.h | Converts internal 0-based start to 1-based GTF start on serialization. |
| CHANGELOG.md | Documents the coordinate fixes and default stitching behavior change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+43
to
50
| // The RR file gives 1-based inclusive intron coordinates (STAR | ||
| // SJ.out.tab convention). fastder works in 0-based half-open | ||
| // coordinates, like the BedGraph / BigWig coverage, so shift the | ||
| // start by one; the end already equals the 0-based half-open end. | ||
| // This keeps junction-to-ER comparisons and the exon-boundary | ||
| // snapping consistent with the coverage coordinates. | ||
| row.start -= 1; | ||
| return is; |
Comment on lines
+55
to
+59
| << " --coverage-tolerance <float> Permitted coverage deviation when stitching, as a proportion: a\n" | ||
| << " neighbour passes if its coverage is within (1 +/- tolerance)\n" | ||
| << " of the running average. Not bound to (0,1); 2.0 allows a 3x\n" | ||
| << " spread. Default = 1000 (the gate is effectively off, stitching\n" | ||
| << " is driven by splice junctions).\n" |
Comment on lines
+31
to
36
| // Internal coordinates are 0-based half-open (BedGraph / BigWig | ||
| // convention). GTF is 1-based inclusive, so the start is shifted by | ||
| // one. The end needs no shift: a 0-based half-open end equals the | ||
| // 1-based inclusive end of the same interval. | ||
| return os << row.seqname << "\t" << row.source << "\t" << row.feature << "\t" << (row.start + 1) << "\t" << row.end << "\t" << | ||
| row.score << "\t" << row.strand << "\t" << row.frame << "\t" << row.attribute; |
Comment on lines
+257
to
+286
| // Resolve each exon's [start, end]. The default is the ER's | ||
| // coverage-derived extent; an edge that abuts a splice junction is | ||
| // snapped to the junction coordinate (the donor for an exon end, the | ||
| // acceptor for the next exon start), since the junction marks the | ||
| // exact splice site. Edges with no adjacent junction, namely the | ||
| // outer ends of a chain and both ends of a monoexonic ER, keep the | ||
| // coverage extent. | ||
| std::vector<std::pair<uint64_t, uint64_t>> exons; | ||
| std::vector<double> exon_scores; | ||
| for (unsigned int k = 0; k < ser.er_ids.size(); ++k) | ||
| { | ||
| if (ser.er_ids.at(k) == -1) continue; // spliced region, not an exon | ||
| const uint64_t cov_start = ser.er_bounds.at(k).first; | ||
| const uint64_t cov_end = ser.er_bounds.at(k).second; | ||
| uint64_t ex_start = cov_start; | ||
| uint64_t ex_end = cov_end; | ||
| if (k > 0 && ser.er_ids.at(k - 1) == -1) | ||
| ex_start = ser.er_bounds.at(k - 1).second; // splice acceptor | ||
| if (k + 1 < ser.er_ids.size() && ser.er_ids.at(k + 1) == -1) | ||
| ex_end = ser.er_bounds.at(k + 1).first; // splice donor | ||
| // Keep the snapped edges only if they still form a valid exon. | ||
| // With a large position_tolerance two junctions flanking a short | ||
| // expressed region can snap past each other (start >= end); fall | ||
| // back to the coverage extent, which is always a valid interval. | ||
| if (ex_start >= ex_end) | ||
| { | ||
| ex_start = cov_start; | ||
| ex_end = cov_end; | ||
| } | ||
| exons.emplace_back(ex_start, ex_end); |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a major PR fixing off-by-one coordinate errors; with these fixes, performance is comparable or better than derfinder's.