Benchmarking checklists

Some basic optimization recommendations for `learn-errors` and `dada`, suggested by Claude from the current state of the code. We should check each of them.

### High-impact candidates

- [x]  **Skip redundant per-iteration setup in `learn_errors`**. Every outer iteration reconstructs `Raw` objects from `RawInput` and reruns `raw_assign_kmers`. The inputs don't change iteration-to-iteration — this is wasted work. Estimate: ~50ms/iter × 4 iters = ~200ms. Small absolute, but scales linearly with sample count for multi-sample runs.

- [x]  **Investigate why iteration 1 is 2× slower than iterations 2–5**. First iter takes ~1.0s, later ones ~0.46s at 8 threads. Partly expected (no locks → no greedy skips), but the full gap isn't explained. Could reveal something like excess cluster budding on a noisy initial error matrix.

- [x]  **Parallelize `build_trans_mat`**. Currently serial per sample at ~42ms/iter × 5 iters = ~210ms. Trivial to `par_iter` over raws. Negligible on single-sample F3D0 but scales with sample count.

- [x]  **Manual SIMD for `kmer_dist8`**. We rely on LLVM auto-vectorization, but the byte-wise min + overflow detection may defeat it. Checking generated asm would tell us; if it's scalar, `std::simd` could give 4–16× on this hot path. Matters most at small k (where screen throughput dominates) — so mostly relevant to Illumina pipelines.
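For the `kmer_dist8` item, one thing worth trying before reaching for `std::simd` is a branch-free loop over fixed-width chunks, followed by a look at the generated asm. The sketch below is not the project's `kmer_dist8`. The name `kmer_min_sum`, the 32-byte chunk width, and the idea that overflow shows up as a count saturated at `u8::MAX` are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for the inner loop of `kmer_dist8`.
/// Returns (sum of byte-wise minima, whether any count was saturated).
/// ASSUMPTION: an overflowed k-mer count is stored as `u8::MAX`.
fn kmer_min_sum(a: &[u8], b: &[u8]) -> (u32, bool) {
    assert_eq!(a.len(), b.len());
    const LANES: usize = 32; // hypothetical chunk width

    let mut total: u32 = 0;
    let mut saturated = false;

    let mut chunks_a = a.chunks_exact(LANES);
    let mut chunks_b = b.chunks_exact(LANES);
    for (x, y) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        // 32 lanes of at most 255 each sum to at most 8160, which fits in u16.
        let mut acc: u16 = 0;
        let mut sat = false;
        for (&p, &q) in x.iter().zip(y) {
            acc += p.min(q) as u16;
            // Non-short-circuiting `|` keeps the loop body free of branches.
            sat |= (p == u8::MAX) | (q == u8::MAX);
        }
        total += acc as u32;
        saturated |= sat;
    }

    // Scalar tail for lengths that are not a multiple of LANES.
    for (&p, &q) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        total += p.min(q) as u32;
        saturated |= (p == u8::MAX) | (q == u8::MAX);
    }
    (total, saturated)
}
```

Whether this form actually vectorizes depends on the compiler version and target features, so it still needs the asm check described above. If it stays scalar, the fixed-width inner loop is at least the shape a manual SIMD version would take.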

### Diminishing returns

- [ ]  **Reuse `Vec<(f64, u32, bool)>` in `b_compare_parallel`**. 2000 × 16B × 1100 calls ≈ 35 MB of allocation churn per run. The allocator seems to handle it well (sys time is down to ~0.9s at 8 threads), but pooling could shave a bit.

- [ ] **Early-terminate k-mer screen**. Accumulate the dotsum while checking it against `(1 - kdist_cutoff) × scale`; once the remaining k-mers can no longer lift the sum to that threshold, return early. Matters most when the screen is the bottleneck, which is a specific corner case.
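A minimal sketch of the early-terminating screen, assuming the dotsum is a sum of byte-wise minima over two k-mer count vectors. The function name `screen_min_sum`, the `total_a` parameter (the sum of all counts in `a`, assumed to be known up front), and the 64-element check interval are hypothetical. `threshold` plays the role of `(1 - kdist_cutoff) × scale`.

```rust
/// Hypothetical early-exit screen. Returns `Some(shared)` if the min-sum
/// reaches `threshold`, or `None` as soon as that becomes impossible.
/// ASSUMPTION: `total_a` equals the sum of all counts in `a`.
fn screen_min_sum(a: &[u8], b: &[u8], total_a: u32, threshold: u32) -> Option<u32> {
    assert_eq!(a.len(), b.len());
    const CHECK_EVERY: usize = 64; // hypothetical check interval

    let mut shared: u32 = 0;       // running sum of min(a[i], b[i])
    let mut remaining = total_a;   // counts of `a` not yet visited

    for (ca, cb) in a.chunks(CHECK_EVERY).zip(b.chunks(CHECK_EVERY)) {
        for (&x, &y) in ca.iter().zip(cb) {
            shared += x.min(y) as u32;
            remaining -= x as u32;
        }
        // Each unvisited position adds at most a[i], so `shared + remaining`
        // is an upper bound on the final sum.
        if shared + remaining < threshold {
            return None;
        }
    }
    if shared >= threshold { Some(shared) } else { None }
}
```

Checking once per chunk rather than per element keeps the inner loop simple; the right interval would need measuring. The bound uses only `a`, so it is loose when `b` is the sparser vector.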

### Unclear-benefit (would need measurement first)

- [ ] **Pipeline the outer errors-from-sample loop**. Currently: dada → build_trans → errfun → (barrier) → next iter. Could run `build_trans_mat` concurrently with the next iter's dada if we accept a one-iter lag in the error matrix. Structural change, might not converge identically.

- [ ] **Bimera detection profiling**. Not in the learn-errors hot path, but `remove-bimera-denovo` hasn't been measured. If users run that on a big sequence table, same buffer reuse / load balancing tricks probably apply.




