Claude suggested the following optimizations for learn-errors and dada, based on the current state of the code. We should check each one before acting on it.
High-impact candidates
Skip redundant per-iteration setup in learn_errors. Every outer iteration reconstructs Raw objects from RawInput and reruns raw_assign_kmers. The inputs don't change iteration-to-iteration — this is wasted work. Estimate: ~50ms/iter × 4 iters = ~200ms. Small absolute, but scales linearly with sample count for multi-sample runs.
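A minimal sketch of the hoisting, assuming the per-iteration work only needs read access to the prepared objects. `RawInput`, `Raw`, `assign_kmers` and `learn_errors_sketch` here are simplified stand-ins for illustration, not the project's real definitions or signatures.

```rust
/// Simplified stand-in for the project's input record (assumption).
pub struct RawInput {
    pub seq: Vec<u8>,
    pub abundance: u32,
}

/// Simplified stand-in for the per-sequence working object (assumption).
pub struct Raw {
    pub seq: Vec<u8>,
    pub abundance: u32,
    pub kmers: Vec<u8>, // 4^k saturating k-mer counts
}

/// Stand-in for `raw_assign_kmers`: count k-mers over A/C/G/T (k >= 1).
fn assign_kmers(seq: &[u8], k: usize) -> Vec<u8> {
    let mut counts = vec![0u8; 1usize << (2 * k)];
    for window in seq.windows(k) {
        let mut idx = 0usize;
        let mut valid = true;
        for &b in window {
            let code = match b {
                b'A' => 0usize,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => {
                    // Skip windows containing a non-ACGT base.
                    valid = false;
                    break;
                }
            };
            idx = (idx << 2) | code;
        }
        if valid {
            counts[idx] = counts[idx].saturating_add(1);
        }
    }
    counts
}

/// Build the iteration-invariant state exactly once.
pub fn build_raws(inputs: &[RawInput], k: usize) -> Vec<Raw> {
    inputs
        .iter()
        .map(|inp| Raw {
            seq: inp.seq.clone(),
            abundance: inp.abundance,
            kmers: assign_kmers(&inp.seq, k),
        })
        .collect()
}

/// Outer loop with the setup hoisted: every iteration borrows the same raws.
/// `run_iter` returns true once the error model has converged.
/// Returns the number of iterations that ran.
pub fn learn_errors_sketch<F>(inputs: &[RawInput], k: usize, max_iter: usize, mut run_iter: F) -> usize
where
    F: FnMut(&[Raw], usize) -> bool,
{
    let raws = build_raws(inputs, k); // done once, not once per iteration
    for iter in 0..max_iter {
        if run_iter(&raws, iter) {
            return iter + 1;
        }
    }
    max_iter
}
```

If an iteration mutates per-Raw state (lock flags, cluster assignments), a cheap reset of just those fields could replace the full rebuild. Whether that applies depends on what dada actually writes to.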
Investigate why iteration 1 is 2× slower than iterations 2–5. First iter takes ~1.0s, later ones ~0.46s at 8 threads. Partly expected (no locks → no greedy skips), but the full gap isn't explained. Could reveal something like excess cluster budding on a noisy initial error matrix.
Parallelize build_trans_mat. Currently serial per sample at ~42ms/iter × 5 iters = ~210ms. Trivial to par_iter over raws. Negligible on single-sample F3D0 but scales with sample count.
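A rough sketch of the per-sample fan-out, assuming per-sample transition counts are independent and can be summed afterwards. It uses `std::thread::scope` only to stay dependency-free. With rayon already in the project, the same shape would be a `par_iter().map(..).reduce(..)` over raws. `TransMat`, `build_one` and `build_all_parallel` are hypothetical names, and the flat 16 × n_qual layout is an assumption.

```rust
use std::thread;

/// Hypothetical per-sample transition counts: a flat 16 x n_qual matrix.
pub type TransMat = Vec<u64>;

/// Stand-in for the per-sample work (the real `build_trans_mat` is not shown here).
/// Each observation is (transition index in 0..16, quality index in 0..n_qual).
pub fn build_one(sample: &[(usize, usize)], n_qual: usize) -> TransMat {
    let mut m = vec![0u64; 16 * n_qual];
    for &(t, q) in sample {
        m[t * n_qual + q] += 1;
    }
    m
}

/// Build per-sample matrices on worker threads, then sum them serially.
pub fn build_all_parallel(
    samples: &[Vec<(usize, usize)>],
    n_qual: usize,
    n_threads: usize,
) -> TransMat {
    let mut total = vec![0u64; 16 * n_qual];
    if samples.is_empty() {
        return total;
    }
    let n = n_threads.max(1);
    let chunk = (samples.len() + n - 1) / n; // ceil division, always >= 1 here

    let partials: Vec<TransMat> = thread::scope(|s| {
        // Spawn one worker per group of samples.
        let handles: Vec<_> = samples
            .chunks(chunk)
            .map(|group| {
                s.spawn(move || {
                    let mut acc = vec![0u64; 16 * n_qual];
                    for sample in group {
                        for (a, b) in acc.iter_mut().zip(build_one(sample, n_qual)) {
                            *a += b;
                        }
                    }
                    acc
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    // Integer sums are order-independent, so the result matches the serial build.
    for p in partials {
        for (a, b) in total.iter_mut().zip(p) {
            *a += b;
        }
    }
    total
}
```

Because the counts are integers, the reduction order cannot change the result, so this should be a drop-in change as far as convergence goes.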
Manual SIMD for kmer_dist8. We rely on LLVM auto-vectorization, but the byte-wise min + overflow detection may defeat it. Checking generated asm would tell us; if it's scalar, std::simd could give 4–16× on this hot path. Matters most at small k (where screen throughput dominates) — so mostly relevant to Illumina pipelines.
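Before reaching for `std::simd` (currently nightly-only), it may be worth trying a loop shape that gives LLVM an easier job on stable Rust. The sketch below is one such shape, not the current `kmer_dist8`: fixed-size blocks, a narrow per-block accumulator, and the overflow check as a branch-free OR. It assumes "overflow" means a count saturated at `u8::MAX`, and it assumes the distance is `1 - dotsum / scale` with `scale = min(len_a, len_b) - k + 1`, as suggested by the `(1 - kdist_cutoff) × scale` threshold. Function names are hypothetical. Whether this actually vectorizes still needs the asm check.

```rust
/// Returns (sum of element-wise minima, whether any count is saturated at u8::MAX).
pub fn min_sum_u8(a: &[u8], b: &[u8]) -> (u32, bool) {
    assert_eq!(a.len(), b.len());
    const LANES: usize = 32;
    let mut total: u32 = 0;
    let mut sat: u8 = 0;

    let mut ca = a.chunks_exact(LANES);
    let mut cb = b.chunks_exact(LANES);
    for (xa, xb) in (&mut ca).zip(&mut cb) {
        // 32 * 255 = 8160 fits in u16, so the block sum cannot overflow.
        let mut block: u16 = 0;
        for i in 0..LANES {
            block += xa[i].min(xb[i]) as u16;
            // Branch-free saturation flag instead of an early exit.
            sat |= ((xa[i] == u8::MAX) | (xb[i] == u8::MAX)) as u8;
        }
        total += block as u32;
    }
    // Scalar tail for lengths that are not a multiple of LANES.
    for (&x, &y) in ca.remainder().iter().zip(cb.remainder()) {
        total += x.min(y) as u32;
        sat |= ((x == u8::MAX) | (y == u8::MAX)) as u8;
    }
    (total, sat != 0)
}

/// Hypothetical wrapper: None means "saturated, fall back to wider counts".
/// Assumes len_a >= k and len_b >= k.
pub fn kmer_dist8_sketch(a: &[u8], len_a: usize, b: &[u8], len_b: usize, k: usize) -> Option<f64> {
    let (dot, saturated) = min_sum_u8(a, b);
    if saturated {
        return None;
    }
    let scale = (len_a.min(len_b) - k + 1) as f64;
    Some(1.0 - dot as f64 / scale)
}
```

Keeping the saturation flag out of the control flow is the main point. An early return inside the inner loop is one of the things that commonly blocks auto-vectorization.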
Diminishing returns
Reuse Vec<(f64, u32, bool)> in b_compare_parallel. 2000 × 16B × 1100 calls ≈ 35 MB of allocation churn per run. The allocator seems to handle it well (sys time is down to ~0.9s at 8 threads), but pooling could shave a bit.
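The pooling could be as simple as a scratch buffer that is cleared and refilled on each call. A minimal sketch, assuming each call produces one `(f64, u32, bool)` per raw and the caller is done with the results before the next call. `CompareScratch` and `compare_all` are hypothetical names.

```rust
/// One comparison result; the meaning of each field is assumed for illustration.
pub type Cmp = (f64, u32, bool);

/// Reusable scratch space for comparison results.
pub struct CompareScratch {
    buf: Vec<Cmp>,
}

impl CompareScratch {
    /// Allocate once, sized for the largest expected call.
    pub fn with_capacity(n: usize) -> Self {
        Self { buf: Vec::with_capacity(n) }
    }

    /// Refill the buffer for one call and return a view of the results.
    /// `clear` drops the old results but keeps the allocation.
    pub fn compare_all<F>(&mut self, n_raws: usize, mut compare: F) -> &[Cmp]
    where
        F: FnMut(usize) -> Cmp,
    {
        self.buf.clear();
        self.buf.extend((0..n_raws).map(|i| compare(i)));
        &self.buf
    }

    /// Current capacity, useful for checking that no reallocation happened.
    pub fn capacity(&self) -> usize {
        self.buf.capacity()
    }
}
```

Since `b_compare_parallel` runs across threads, each worker would need its own scratch (for example one per thread) rather than a shared one.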
Early-terminate k-mer screen. Keep a running check against (1 - kdist_cutoff) × scale while accumulating dotsum, and return as soon as the threshold can no longer be reached. Matters most when the screen is the bottleneck, which is a specific corner case.
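One possible bound for the running check, as a sketch: the final dotsum can never exceed the total count of one vector minus the shortfall seen so far, so once that upper bound drops below the threshold the pair cannot pass. This assumes `total_a` (the sum of `a`'s counts) is precomputed per sequence and that `threshold` is `(1 - kdist_cutoff) × scale`. `passes_screen` is a hypothetical name.

```rust
/// Returns true if sum(min(a[i], b[i])) >= threshold, bailing out early when
/// that has become impossible. `total_a` must equal the sum of all counts in `a`.
pub fn passes_screen(a: &[u8], b: &[u8], total_a: u32, threshold: f64) -> bool {
    assert_eq!(a.len(), b.len());
    // Check once per block so the inner loop stays simple and branch-free.
    const BLOCK: usize = 64;
    let mut deficit: u32 = 0; // sum of (a[i] - min(a[i], b[i])) seen so far

    for (xa, xb) in a.chunks(BLOCK).zip(b.chunks(BLOCK)) {
        for (&x, &y) in xa.iter().zip(xb) {
            deficit += (x - x.min(y)) as u32;
        }
        // Upper bound on the final dotsum: everything not yet lost.
        let upper_bound = total_a.saturating_sub(deficit);
        if (upper_bound as f64) < threshold {
            return false;
        }
    }
    // After the last block the bound equals the exact dotsum, and it passed.
    true
}
```

The block size trades check overhead against how soon a hopeless pair is rejected. 64 is an arbitrary starting point that would need measuring.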
Unclear-benefit (would need measurement first)
Pipeline the outer errors-from-sample loop. Currently: dada → build_trans → errfun → (barrier) → next iter. Could run build_trans_mat concurrently with the next iter's dada if we accept a one-iter lag in the error matrix. Structural change, might not converge identically.
Bimera detection profiling. Not in the learn-errors hot path, but remove-bimera-denovo hasn't been measured. If users run that on a big sequence table, same buffer reuse / load balancing tricks probably apply.