Benchmarking checklists #3

@cjfields

Description

Some basic optimization recommendations for learn-errors and dada that Claude made based on the current state of the code. We should check each of these.

High-impact candidates

  • Skip redundant per-iteration setup in learn_errors. Every outer iteration reconstructs Raw objects from RawInput and reruns raw_assign_kmers. The inputs don't change iteration-to-iteration — this is wasted work. Estimate: ~50ms/iter × 4 iters = ~200ms. Small absolute, but scales linearly with sample count for multi-sample runs.

  • Investigate why iteration 1 is 2× slower than iterations 2–5. First iter takes ~1.0s, later ones ~0.46s at 8 threads. Partly expected (no locks → no greedy skips), but the full gap isn't explained. Could reveal something like excess cluster budding on a noisy initial error matrix.

  • Parallelize build_trans_mat. Currently serial per sample at ~42ms/iter × 5 iters = ~210ms. Trivial to par_iter over raws. Negligible on single-sample F3D0 but scales with sample count.

  • Manual SIMD for kmer_dist8. We rely on LLVM auto-vectorization, but the byte-wise min + overflow detection may defeat it. Checking generated asm would tell us; if it's scalar, std::simd could give 4–16× on this hot path. Matters most at small k (where screen throughput dominates) — so mostly relevant to Illumina pipelines.
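
Before reaching for manual SIMD on kmer_dist8, one cheap experiment is to restructure the scalar loop so the compiler has an easier job. The sketch below is not the current implementation: `min_sum_u8` is a hypothetical stand-in for the inner loop, and it assumes "overflow" means a saturated count of 255 in either k-mer vector. It sums byte-wise minima in fixed-size blocks with a narrow accumulator and tracks saturation with a running max instead of an early-exit branch. Whether this actually vectorizes still has to be confirmed in the generated asm.

```rust
/// Hypothetical stand-in for the inner loop of kmer_dist8.
/// Returns (sum of byte-wise minima, whether any count was saturated at 255).
/// Assumption: a saturated count is what the overflow check is looking for.
fn min_sum_u8(a: &[u8], b: &[u8]) -> (u32, bool) {
    assert_eq!(a.len(), b.len());
    const LANES: usize = 32; // fixed block size, chosen arbitrarily for the sketch

    let mut total = 0u32;
    let mut max_seen = 0u8; // branch-free saturation tracking

    let mut ca = a.chunks_exact(LANES);
    let mut cb = b.chunks_exact(LANES);
    for (xa, xb) in ca.by_ref().zip(cb.by_ref()) {
        // 32 * 255 = 8160 fits in u16, so the block sum cannot overflow.
        let mut block = 0u16;
        for i in 0..LANES {
            block += u16::from(xa[i].min(xb[i]));
            max_seen = max_seen.max(xa[i]).max(xb[i]);
        }
        total += u32::from(block);
    }

    // Scalar tail for lengths that are not a multiple of LANES.
    for (&x, &y) in ca.remainder().iter().zip(cb.remainder()) {
        total += u32::from(x.min(y));
        max_seen = max_seen.max(x).max(y);
    }

    (total, max_seen == u8::MAX)
}
```

The caller would check the flag once after the loop and fall back to a wider path if it is set, so the hot loop itself carries no data-dependent branch.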

Diminishing returns

  • Reuse Vec<(f64, u32, bool)> in b_compare_parallel. 2000 × 16B × 1100 calls ≈ 35 MB of allocation churn per run. The allocator seems to handle it well (sys time is down to ~0.9s at 8 threads), but pooling could shave a bit.

  • Early-terminate k-mer screen. Run dotsum with a running check against (1 - kdist_cutoff) × scale; once the threshold can no longer be reached, return early. Matters most when the screen is the bottleneck, which is a specific corner case.
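
A rough sketch of what the early-terminating screen could look like, to make the bound concrete. The function names and signatures here are hypothetical, not the current API. It assumes equal-length `u8` k-mer count vectors, that `total_a` is the precomputed sum of the counts in `a`, and that `threshold` is (1 - kdist_cutoff) × scale rounded up to an integer. Since min(x, y) <= x, the positions not yet visited can add at most the unseen counts of `a`, which gives an upper bound on the final dotsum. The check runs once per block to keep the inner loop branch-light.

```rust
/// Plain min-sum over two k-mer count vectors (reference version).
fn dotsum(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| u32::from(x.min(y))).sum()
}

/// Returns Some(dotsum) if the sum reaches `threshold`, or None as soon as
/// the threshold is provably out of reach.
/// Assumptions: a.len() == b.len(), and `total_a` is the sum of all counts in `a`.
fn dotsum_early_exit(a: &[u8], b: &[u8], total_a: u32, threshold: u32) -> Option<u32> {
    const BLOCK: usize = 64; // check interval, chosen arbitrarily for the sketch

    let mut sum = 0u32;
    let mut seen_a = 0u32;
    for (ca, cb) in a.chunks(BLOCK).zip(b.chunks(BLOCK)) {
        for (&x, &y) in ca.iter().zip(cb) {
            sum += u32::from(x.min(y));
            seen_a += u32::from(x);
        }
        // The rest of the vector can contribute at most the unseen counts of `a`.
        let upper_bound = sum + total_a.saturating_sub(seen_a);
        if upper_bound < threshold {
            return None;
        }
    }
    (sum >= threshold).then_some(sum)
}
```

When the pair passes the screen, this does the same work as the plain dotsum plus one extra add per element, so it only pays off if a meaningful fraction of comparisons fail early. That is worth measuring before committing to it.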

Unclear-benefit (would need measurement first)

  • Pipeline the outer errors-from-sample loop. Currently: dada → build_trans → errfun → (barrier) → next iter. Could run build_trans_mat concurrently with the next iter's dada if we accept a one-iter lag in the error matrix. Structural change, might not converge identically.

  • Bimera detection profiling. Not in the learn-errors hot path, but remove-bimera-denovo hasn't been measured. If users run that on a big sequence table, same buffer reuse / load balancing tricks probably apply.

Metadata

Assignees

No one assigned

    Labels

    low priority: Lower priority tickets; good to check but not necessary
