-
Notifications
You must be signed in to change notification settings - Fork 0
Methods and Model
Post-mortem deamination hydrolytically removes the amine group from cytosine, converting it to uracil (sequenced as thymine). The reaction preferentially affects single-stranded overhangs at fragment termini, giving a characteristic gradient of C→T substitutions at 5′ ends and (in double-stranded libraries) complementary G→A substitutions at 3′ ends [Briggs et al. 2007]. The rate follows exponential decay from the terminus:
where d is the damage amplitude at position 0 (≈ the C→T rate at the outermost base) and λ is the decay constant (typically 0.2–0.6 nats/position). AMBER estimates these parameters per contig by maximum likelihood from the observed C→T and G→A rates at each terminal position.
For each mapped read r, AMBER computes a per-read posterior p_ancient using a log-likelihood ratio over terminal positions:
At a cytosine position i with sequenced base b:
where ε is the base quality error probability. The modern likelihood assumes no damage (δ = 0).
The LLR is converted to a posterior by Bayesian update from sample-level prior π₀:
Reads with p_ancient > 0.8 are classified as confident ancient; < 0.2 as confident modern; 0.2–0.8 as ambiguous.
Beyond the 128-dimensional encoder output, each contig carries a 20-dimensional aDNA feature vector:
Position-specific terminal damage rates estimated by pooling all reads mapped to the contig:
Positions 1–5 from the 5′ end (C→T) and positions 1–5 from the 3′ end (G→A). Default: 0.5 (uninformative prior) for contigs with insufficient coverage.
Soft-clip offsets from the CIGAR string are applied before computing rates to exclude adapter bases.
λ₅ and λ₃ from fitting the per-position rates:
Mean and standard deviation of aligned read lengths for reads mapped to the contig. Ancient reads are typically 50–100 bp; modern reads 150–300 bp.
Log-normalised coverage from confident-ancient reads (p_ancient > 0.6) and confident-modern reads (p_ancient < 0.4): log1p(cov_ancient)/5 and log1p(cov_modern)/5. A truly ancient genome has high cov_ancient and low cov_modern; a modern genome the reverse.
Four terminal mismatch rates: T→C at 5′, C→T at 3′, other mismatches at 5′, other mismatches at 3′. These complement the primary C→T / G→A damage signal with broader coverage of substitution patterns at fragment termini.
AMBER extends COMEBin's self-supervised InfoNCE with two modifications. Positive pairs are the 6 augmented views of the same contig (3 coverage subsampling levels × 2 feature-noise intensities), identical to COMEBin. The two extensions concern the denominator:
Modification 1 — SCG hard negatives. Single-copy marker genes (SCGs) are typically present exactly once per genome. Two contigs both containing the same SCG are therefore likely from different genomes — high-confidence (though not guaranteed) negative pairs. AMBER amplifies these pairs in the denominator by a factor α (default 2.0), providing stronger repulsion for contigs that embeddings might otherwise conflate:
SCG membership is determined by HMM profile search with HMMER3 against checkm_markers_only.hmm (206 CheckM universal bacterial and archaeal marker profiles).
Six views per contig are constructed:
- 3 coverage subsampling levels (33%, 66%, 100% of reads)
- 2 feature noise intensities (low and medium Gaussian noise on TNF and coverage)
All 6 views pass through the shared MLP encoder (138→512→256→128, BatchNorm+ReLU, L2-normalised).
where τ = 0.1 and w_ik = max(w_ik^SCG, w_ik^damage) — SCG hard negatives always win over damage down-weighting.
Standard contrastive learning treats all negatives equally. For ancient DNA this is wrong: a contig pair where one member is ancient and the other is modern from the same environment should not be strongly repelled. AMBER downweights such negatives by w_ij ∈ [w_min, 1]:
Confidence — c_i = n_{eff,i} / (n_{eff,i} + n_0) where n_eff is the Kish effective read depth and n_0 = 60 is the half-saturation constant. Contigs with few reads get low confidence and weights close to 1 (effectively standard negatives).
Compatibility — f_compat combines two signals:
- f_raw = exp(−|δ_i − δ_j| / 0.2): similar terminal damage rates → compatible
- f_panc = exp(−0.5 · z²) where z = |p_anc,i − p_anc,j| / √(var_i + var_j + τ²)
When w_ij < 1, the gradient pushing i and j apart is reduced. The encoder cannot use damage state as a discriminative feature, so ancient and modern strains of the same genome stay together.
With --encoder-restarts N, N independent encoders are trained with different random seeds. The consensus kNN graph averages their edge weights before Leiden clustering, reducing sensitivity to any single unlucky initialisation.
Chaos Game Representation (CGR) plots k-mer frequencies in a unit square. AMBER computes CGR at three resolutions (16×16, 32×32, 64×64) and extracts three metrics at each:
- Entropy: Shannon entropy of the normalised CGR image — captures global sequence complexity
- Occupancy: Fraction of non-zero cells — correlates with genomic diversity
- Lacunarity: Gap structure in the CGR image — sensitive to repetitive sequences and GC content
The slopes of these metrics across resolutions (16→32→64) capture how composition scales with k-mer length, giving a scale-invariant fingerprint of genomic composition.
After encoding, AMBER builds a kNN graph using HNSW approximate nearest neighbours [Malkov & Yashunin 2018]. For each contig i, the k most similar contigs by cosine distance in the 157-dim embedding space are connected with edge weight:
where d_ij is the cosine distance and bw = 0.2 is the bandwidth. SCG-sharing contigs receive an additional penalty: w'_ij = w_ij · exp(−3 · n_shared_markers). This prevents single-copy marker genes from bridging contigs that belong to different genomes.
Leiden clustering [Traag et al. 2019] is run with resolution sweep over [0.5, 5.0] and 25 random seeds per resolution. The partition maximising the SCG quality score:
is selected, where strict-HQ = completeness ≥ 90% and contamination < 5%, pre-HQ ≥ 72%/5%, MQ ≥ 50%/10%.
Completeness and contamination are estimated entirely in C++ from SCG counts — no external CheckM2 call is made during clustering.
For each bin with dup_excess > 0 (more duplicate SCGs than expected):
- Extract the subgraph of all contigs in the bin
- Re-run Leiden at 3× the current resolution
- If the split reduces total dup_excess, accept it
This iteratively peels off contaminating contigs without global re-clustering.
Bins at 75–90% estimated completeness recruit missing-marker contigs from kNN neighbours:
- Identify SCGs absent from the target bin
- Find all contigs carrying those markers via
MarkerIndex - For each candidate contig: accept if none of its markers are already present in the target (strict no-duplication guarantee)
- Reject candidates from large, viable bins (n_contigs > 5, completeness > 60%) unless they are themselves contaminated
Given R independent AMBER runs (each producing a run.abin archive), amber resolve builds a co-binning affinity graph and re-clusters it.
For each pair of contigs (i, j), let:
- E_ij = number of runs in which i and j are assigned to the same bin
- R_i = number of runs in which contig i received any bin assignment
The per-pair affinity is:
The denominator min(R_i, R_j) rather than R matters: if a short contig is unbinned in some runs, dividing by R would suppress its affinity with its true bin-mates.
An affinity graph is built with edge weight p_cobin(i,j) for all pairs with E_ij > 0. Leiden is then run on this graph with the same SCG-guided quality protocol as amber bin, and the partition maximising the SCG quality score is returned.
With 3 runs the reliable bin core is recovered; with ≥ 5 runs, borderline bins accumulate enough co-binning evidence to appear in the consensus.
AMBER models the mapped BAM as a mixture of two read populations:
- Ancient: high terminal damage δ(p), short fragment length l ~ LogNormal(μ_a, σ_a)
- Modern: low damage, longer l ~ LogNormal(μ_m, σ_m)
E-step: For each read r compute soft assignment:
The damage likelihood marginalises over all terminal C/G positions using the current parameter estimates. The length likelihood uses LogNormal distributions fitted per iteration.
M-step: Update mixture fraction π_a and distribution parameters (d, λ, μ_a, σ_a, μ_m, σ_m) from weighted counts.
Convergence is declared when Δπ_a < 0.01 between iterations (typically 5–15 iterations).
Consensus calling: Reads with p_a > 0.8 contribute to the ancient consensus; reads with p_a < 0.2 contribute to the modern consensus. Positions with < min_depth weighted coverage are masked with N.
Damage correction (default: on): The ancient consensus is corrected for assembly-baked damage before writing ancient_consensus.fa. Disable with --no-polish.
When ancient reads are assembled, the assembler calls each position by majority vote. At positions where many reads begin or end, a large fraction show C→T damage at their 5′ termini. If damage exceeds 50% of reads, the assembler calls T instead of C — incorporating the damaged base into the reference. This is assembly-baked damage.
Consequences:
- Ancient reads now match the baked-in T → their C→T mismatch goes uncounted
- Modern reads show T→C mismatches (the reverse direction)
- Apparent damage rates are suppressed at heavily damaged positions
- The assembled sequence is wrong — T where the true genome has C
When it matters: Baking occurs when (ancient_fraction × damage_rate) > 0.5. At 40% ancient reads with 10% damage, baking is negligible. At 80% ancient with 30% damage, most terminal positions are baked in.
How AMBER handles it: The mismatch spectrum features include both C→T (residual damage) and T→C (baked-in proxy). The encoder learns to combine them: ct_rate + tc_rate ≈ total terminal mismatch, recovering the full damage signal regardless of baking. The amber deconvolve command corrects baked-in damage when writing the ancient consensus.
AMBER estimates bin completeness and contamination on-the-fly using the 206-marker CheckM universal single-copy gene set (checkm_markers_only.hmm). No external tools are called during binning.
For each bin B:
Quality thresholds:
| Tier | Completeness | Contamination |
|---|---|---|
| Strict-HQ | ≥ 90% | < 5% |
| Pre-HQ | ≥ 72% | < 5% |
| MQ | ≥ 50% | < 10% |
CheckM2 is only used post-hoc for external validation, not during binning.
- Briggs AW et al. (2007) Patterns of damage in genomic DNA sequences from a Neandertal. PNAS 104:14616–21.
- Malkov YA, Yashunin DA (2018) Efficient and robust approximate nearest neighbor search using HNSW. IEEE TPAMI 42:824–36.
- Oord A van den et al. (2018) Representation Learning with Contrastive Predictive Coding. arXiv 1807.03748.
- Traag VA et al. (2019) From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9:5233.
- Wang Z et al. (2024) COMEBin allows effective binning of metagenomic contigs using coverage multi-view encoder. Nature Communications 15:1119.