Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 40 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Changelog

All notable changes to fastder are recorded in this file.

## [0.1.0] - 2026-05-19

### Fixed

- GTF output is now 1-based. fastder's internal coordinates are 0-based
half-open (the BedGraph and BigWig convention), but the GTF writer emitted
the start column without converting it, so every gene, transcript and exon
start was one base too low. The start is now shifted by one on output. The
end needs no shift, because a 0-based half-open end equals the 1-based
inclusive end of the same interval.
- RR splice junction coordinates are converted to 0-based on read. The RR
file carries 1-based inclusive intron coordinates (the STAR SJ.out.tab
convention), but fastder works in 0-based half-open coordinates like the
coverage. The mismatch left junction-snapped exon ends one base too long.
The RR parser now shifts the junction start by one on read, and rejects a
junction start of 0, which is malformed under the 1-based convention,
rather than letting the unsigned subtraction wrap.

### Changed

- Stitching is no longer gated on coverage similarity. The default
`--coverage-tolerance` is raised to 1000, which makes the gate effectively
off. A splice junction is the evidence that two regions belong to one
transcript, and exon coverage varies widely within a transcript, so the
previous default (0.8) wrongly rejected legitimate stitches. The flag is
kept as a knob.
- Exon boundaries are snapped to splice junctions. When a stitched chain
crosses a junction, the exon edges on either side are set to the junction's
donor and acceptor coordinates instead of the coverage-derived expressed-
region edges. Edges with no adjacent junction, namely the outer ends of a
chain and both ends of a monoexonic region, keep the coverage extent.

## [Legacy]

Baseline forked from the upstream fastder repository, which carried no tagged
release. Versioned history begins at 0.1.0 above.
4 changes: 2 additions & 2 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,8 @@ endif()
set(CPACK_PACKAGE_NAME "fastder")
set(CPACK_PACKAGE_VENDOR "Martina Lehmann")
set(CPACK_PACKAGE_DESCRIPTION_SUMMARY "fastder: A Fast, Agnostic, Splice-Junction-Based Tool for the Detection of Expressed Regions")
set(CPACK_PACKAGE_VERSION "0.0.1")
set(CPACK_PACKAGE_CONTACT "martina.lavanya@gmail.com")
set(CPACK_PACKAGE_VERSION "0.1.0")
set(CPACK_PACKAGE_CONTACT "martina.lavanya@gmail.com, izaskun.mallona.work@gmail.com")

# Debian-specific packaging settings
set(CPACK_DEBIAN_PACKAGE_MAINTAINER "Martina Lehmann")
Expand Down
32 changes: 17 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -91,8 +91,7 @@ that the tool will only output expressed regions on chromosome 1, and will ignor
- `--min-length 5` describes the minimum length (in bp) that an ER must have. For instance, three consecutive base pairs with coverage > 0.25 CPM will be ignored if the min length is set to 5 bp.
- `--position-tolerance 5` describes the maximum permitted offset of the end position of an exon and the starting position of a splice junction. If this tolerance is set to 5,
an ER with end position = 1000 bp and a splice junction with start position = 1005 bp will be stitched together (if the coverage and end junction match).
- `--coverage-tolerance 0.1` describes the maximum permitted coverage deviation between two ER that are separated by a spliced region. For a coverage tolerance of 0.1,
two ERs with coverage = 10 CPM and 11 CPM will be stitched together (if there is a matching splice junction).
- `--coverage-tolerance 1000` describes the permitted coverage deviation between two ERs separated by a spliced region. It defaults to 1000, effectively off, so stitching is driven by splice junctions rather than coverage similarity.

A visualization of the different parameters is provided below.

Expand All @@ -103,13 +102,14 @@ Usage:
--dir <path> ... \
[--chr <chr1> <chr2> ...] \
[--min-coverage <float>] \
[--min-length <int>] \
[--position-tolerance <int>] \
[--coverage-tolerance <float>] \
[--help]
[--cores <int>]

Required inputs:

--dir <path> ... Relative path from the build directory to the directory containing the input files.
--dir <path> ... Relative path from the build directory, or an absolute path, to the directory containing the input files.
Example: --dir ../../data/test_exon_skipping

Optional inputs:
Expand All @@ -118,24 +118,26 @@ Optional inputs:
Default: all (chr1-chr22, chrX)
Example: --chr chr1 chr2 chr3

--min-length <float> Minimum length [#bp] required for a region to qualify as an expressed region (ER).
Default: 5 bp
Example: --min-length 5
--min-length <int> Minimum length [#bp] required for a region to qualify as an expressed region (ER).
Default: 10 bp
Example: --min-length 10

--min-coverage <float> Minimum coverage [CPM] required for a region to qualify as an ER.
Normalized in-place by library size.
Default: 0.25 CPM
Default: 0.05 CPM
Example: --min-coverage 0.25

--position-tolerance <int> Maximum allowed positional deviation between splice junction and ER coordinates [bp].
Default: 5 bp
Example: --position-tolerance 5

--coverage-tolerance <float> Allowed relative deviation in coverage between stitched ERs (e.g. 0.1 = 10%).
Default: 0.1
Example: --coverage-tolerance 0.1
--coverage-tolerance <float> Permitted coverage deviation between stitched ERs, as a proportion of the running average.
Default: 1000 (gate effectively off; stitching is junction-driven).
Example: --coverage-tolerance 2.0

--help Show this help message.
--cores <int> Number of cores fastder may use.
Default: 10
Example: --cores 23


Example:
Expand All @@ -144,9 +146,9 @@ Example:
--dir ../../data/input \
--chr chr1 chr2 \
--position-tolerance 5 \
--min-length 5 \
--min-coverage 0.25 \
--coverage-tolerance 0.1
--min-length 10 \
--min-coverage 0.05 \
--cores 10
```


Expand Down
6 changes: 5 additions & 1 deletion cpp/GTFRow.h
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,11 @@ class GTFRow
// overload output operator for GTFRow
friend std::ostream& operator<< (std::ostream& os, const GTFRow& row)
{
return os << row.seqname << "\t" << row.source << "\t" << row.feature << "\t" << row.start << "\t" << row.end << "\t" <<
// Internal coordinates are 0-based half-open (BedGraph / BigWig
// convention). GTF is 1-based inclusive, so the start is shifted by
// one. The end needs no shift: a 0-based half-open end equals the
// 1-based inclusive end of the same interval.
return os << row.seqname << "\t" << row.source << "\t" << row.feature << "\t" << (row.start + 1) << "\t" << row.end << "\t" <<
row.score << "\t" << row.strand << "\t" << row.frame << "\t" << row.attribute;
Comment on lines +31 to 36

}
Expand Down
81 changes: 61 additions & 20 deletions cpp/Integrator.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -123,11 +123,23 @@ void Integrator::stitch_one_strand(const std::string& chrom,
}
const auto sj_to_check = (sj_it == strand_sjs.end()) ? std::prev(strand_sjs.end()) : sj_it;

// TODO reject geometrically inconsistent stitches here, rather than
// only repairing them in write_to_gtf. When position_tolerance is
// larger than a short ER, the junctions matched to its two edges can
// snap past each other and imply a negative-length exon. write_to_gtf
// currently falls back to that ER's coverage extent; a cleaner fix is
// to refuse the stitch when position_tolerance exceeds the ER length
// or when the flanking junctions would produce a non-positive exon.
if (is_similar(candidate, expressed_region, rr_all_sj[*sj_to_check - 1]))
{
const SJRow& used_sj = rr_all_sj[*sj_to_check - 1];
uint64_t sj_length = expressed_region.start - ers[candidate.er_ids.back()].end;
candidate.append(-1, sj_length, 0.0);
candidate.append(i, expressed_region.length, expressed_region.coverage);
// record the splice junction's [donor, acceptor] for the spliced
// region, and the ER's coverage extent for the appended exon, so
// write_to_gtf can snap exon edges to the splice sites.
candidate.append(-1, sj_length, 0.0, used_sj.start, used_sj.end);
candidate.append(i, expressed_region.length, expressed_region.coverage,
expressed_region.start, expressed_region.end);
int er_count = 0;
for (int id : candidate.er_ids) if (id >= 0) ++er_count;
if (er_count > max_chain_len) max_chain_len = er_count;
Expand Down Expand Up @@ -240,28 +252,57 @@ void Integrator::write_to_gtf(const std::string& output_path)

for (unsigned int i = 0; i < this->stitched_ERs.size(); ++i)
{
// each stitched_er is both a gene and a transcript
GTFRow gtf_row = GTFRow(stitched_ERs[i], "gene", i + 1);
const StitchedER& ser = stitched_ERs[i];

// Resolve each exon's [start, end]. The default is the ER's
// coverage-derived extent; an edge that abuts a splice junction is
// snapped to the junction coordinate (the donor for an exon end, the
// acceptor for the next exon start), since the junction marks the
// exact splice site. Edges with no adjacent junction, namely the
// outer ends of a chain and both ends of a monoexonic ER, keep the
// coverage extent.
std::vector<std::pair<uint64_t, uint64_t>> exons;
std::vector<double> exon_scores;
for (unsigned int k = 0; k < ser.er_ids.size(); ++k)
{
if (ser.er_ids.at(k) == -1) continue; // spliced region, not an exon
const uint64_t cov_start = ser.er_bounds.at(k).first;
const uint64_t cov_end = ser.er_bounds.at(k).second;
uint64_t ex_start = cov_start;
uint64_t ex_end = cov_end;
if (k > 0 && ser.er_ids.at(k - 1) == -1)
ex_start = ser.er_bounds.at(k - 1).second; // splice acceptor
if (k + 1 < ser.er_ids.size() && ser.er_ids.at(k + 1) == -1)
ex_end = ser.er_bounds.at(k + 1).first; // splice donor
// Keep the snapped edges only if they still form a valid exon.
// With a large position_tolerance two junctions flanking a short
// expressed region can snap past each other (start >= end); fall
// back to the coverage extent, which is always a valid interval.
if (ex_start >= ex_end)
{
ex_start = cov_start;
ex_end = cov_end;
}
exons.emplace_back(ex_start, ex_end);
Comment on lines +257 to +286
exon_scores.push_back(ser.all_coverages.at(k).second);
}
if (exons.empty()) continue;

// each stitched_er is both a gene and a transcript; their span runs
// from the first exon start to the last exon end
GTFRow gtf_row = GTFRow(ser, "gene", i + 1);
gtf_row.start = exons.front().first;
gtf_row.end = exons.back().second;
out << gtf_row << std::endl;
gtf_row.change_feature("transcript", i + 1, 0);
out << gtf_row << std::endl;
int exon_nr = 1;
// add the ERs within the stitched_er
for (unsigned int k = 0; k < stitched_ERs[i].er_ids.size(); ++k)
for (unsigned int e = 0; e < exons.size(); ++e)
{
if (stitched_ERs[i].er_ids.at(k) != -1){
gtf_row.change_feature("exon", i + 1, exon_nr);
// need to include SJ length as well
gtf_row.end = gtf_row.start + stitched_ERs.at(i).all_coverages.at(k).first; // start + length = end
gtf_row.score = stitched_ERs.at(i).all_coverages.at(k).second; // use the per-exon average coverage here instead of the overall coverage
out << gtf_row << std::endl;
gtf_row.start = gtf_row.end;
++exon_nr;
}
else
{
gtf_row.start += stitched_ERs.at(i).all_coverages.at(k).first; // add length of the SJ
}
gtf_row.change_feature("exon", i + 1, e + 1);
gtf_row.start = exons.at(e).first;
gtf_row.end = exons.at(e).second;
gtf_row.score = exon_scores.at(e);
out << gtf_row << std::endl;
}
}
out.close();
Expand Down
14 changes: 14 additions & 0 deletions cpp/SJRow.h
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,20 @@ class SJRow
return is;
}
row.strand = (strand_ == '+'); // 1 if +
// The RR file gives 1-based inclusive intron coordinates (STAR
// SJ.out.tab convention). fastder works in 0-based half-open
// coordinates, like the BedGraph / BigWig coverage, so shift the
// start by one; the end already equals the 0-based half-open end.
// This keeps junction-to-ER comparisons and the exon-boundary
// snapping consistent with the coverage coordinates.
// A valid 1-based start is at least 1; a 0 here means malformed or
// truncated input. Reject the row instead of underflowing uint64_t.
if (row.start == 0)
{
is.setstate(std::ios::failbit);
return is;
}
row.start -= 1;
return is;
Comment on lines +43 to 57
}

Expand Down
8 changes: 6 additions & 2 deletions cpp/StitchedER.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ StitchedER::StitchedER(const BedGraphRow& expressed_region, int er_id)
er_ids = {er_id};
across_er_coverage = expressed_region.coverage;
all_coverages = {std::make_pair(expressed_region.length, expressed_region.coverage)};
er_bounds = {std::make_pair(expressed_region.start, expressed_region.end)};
total_length = expressed_region.length;
}

Expand All @@ -39,11 +40,14 @@ double StitchedER::get_avg_coverage() const
return sum / full_length;
}

// note that er_id is 0-indexed
void StitchedER::append(int er_id, unsigned int length, double coverage)
// note that er_id is 0-indexed. lo/hi is the ER's coverage extent, or the
// splice junction's [donor, acceptor] when er_id is -1 (a spliced region).
void StitchedER::append(int er_id, unsigned int length, double coverage,
uint64_t lo, uint64_t hi)
{
er_ids.emplace_back(er_id); // -1 for spliced regions
all_coverages.push_back({std::make_pair(length, coverage)});
er_bounds.push_back(std::make_pair(lo, hi));
// only update avg coverage if an exon was added
if (er_id != -1)
{
Expand Down
9 changes: 8 additions & 1 deletion cpp/StitchedER.h
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,8 @@ class StitchedER
StitchedER() = default;

StitchedER(const BedGraphRow& expressed_region, int er_id);
void append(int er_id, unsigned int length, double coverage);
void append(int er_id, unsigned int length, double coverage,
uint64_t lo, uint64_t hi);
double get_avg_coverage() const;

//bool is_similar(double val1, double val2);
Expand All @@ -28,6 +29,12 @@ class StitchedER
// example: stitched_ER consists of er_ids 45, 46, 47, 49 == vector indices of expressed_regions
double across_er_coverage; // avg (weighted) coverage of all exons that are part of the stitched ER so far
std::vector<std::pair<unsigned int, double>> all_coverages; // stores a pair of er length (= weight) + normalized average coverage of the er
// Genomic [start, end) of each entry in er_ids, aligned by index. For a
// real ER it is the coverage-derived extent; for a spliced region (-1) it
// is the splice junction's [donor, acceptor]. write_to_gtf snaps an exon
// edge to the adjacent junction coordinate and falls back to the coverage
// extent where an edge has no junction.
std::vector<std::pair<uint64_t, uint64_t>> er_bounds;
unsigned int total_length; // combined length of all ERs (excluding spliced regions)
uint64_t start;
uint64_t end;
Expand Down
20 changes: 15 additions & 5 deletions cpp/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,12 @@ int main(int argc, char* argv[]) {
// default values (if not provided by user)
int position_tolerance = 5;
int min_length = 10;
double coverage_tolerance = 0.8;
// coverage_tolerance gates stitching on how similar two regions' coverage
// is. A splice junction is the real evidence that two exons belong to one
// transcript, and the coverage of exons within a transcript varies widely,
// so this gate defaults effectively off (any fold-difference passes) and
// is left as a knob only for deliberate experiments.
double coverage_tolerance = 1000.0;
double min_coverage = 0.05;
std::string directory;
int cores = 10;
Expand All @@ -43,16 +48,21 @@ int main(int argc, char* argv[]) {
" Normalization is done in-place by library size. \n"
" Default = 0.05 CPM.\n"
<< " Example: --min-coverage 0.25\n\n"
<< " --min-length <int> Minimum length, in [nt], for a region to qualify as an expressed region (ER). Default = 10 nt.\n"
<< " Example: --min-length 10\n\n"
<< " --position-tolerance <int> Maximum permitted position deviation of splice junction and ER coordinates, in [nt]. Default = 5 nt.\n"
<< " Example: --position-tolerance 5\n\n"
<< " --coverage-tolerance <float> Permitted coverage deviation within a stitched ER, as a proportion (e.g. 0.1 = 10 %).\n"
<< " The value is not strictly bound in (0,1). Default = 0.7.\n"
<< " Example: --coverage-tolerance 0.8\n\n"
<< " --coverage-tolerance <float> Permitted coverage deviation when stitching, as a proportion: a\n"
<< " neighbour passes if its coverage is within (1 +/- tolerance)\n"
<< " of the running average. Not bound to (0,1); 2.0 allows a 3x\n"
<< " spread. Default = 1000 (the gate is effectively off, stitching\n"
<< " is driven by splice junctions).\n"
Comment on lines +55 to +59
<< " Example: --coverage-tolerance 2.0\n\n"
<< " --cores <int> Number of cores that fastder may use. Default = 10 cores.\n"
<< " Example: --cores 23\n\n"
<< "Example:\n"
<< " ./fastder --dir ../data --chr chr1 chr2 --position-tolerance 5 "
"--min-coverage 0.05 --coverage-tolerance 0.7 --cores 23\n"
"--min-coverage 0.05 --coverage-tolerance 2.0 --cores 23\n"
<< std::endl;

// parse command-line arguments
Expand Down
Loading
Loading