A compiled, statically-typed rewrite of Bulik-Sullivan et al.'s LDSC in Rust.
Implements six subcommands — munge-sumstats, l2, h2, rg, make-annot, cts-annot — with
numerically identical output and a ~38× speedup on LD score computation
(exact mode, 1000G N=2,490; up to 101× with --sketch). Approximate modes
(--sketch, --stochastic, --subsample) trade per-SNP precision for additional throughput.
Fastest (no Rust required):
docker run --rm ghcr.io/sharifhsn/ldsc:latest --help

Local install options:
- Prebuilt binaries from GitHub Releases (see “Prebuilt Binaries” below).
- Cargo install (requires Rust):
cargo install ldsc
The typical LDSC workflow — preprocess summary statistics, then estimate heritability or genetic correlation — mirrors the upstream wiki tutorial.
Step 1: Download pre-computed European LD scores (skip l2 for European GWAS)
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
tar -jxvf eur_w_ld_chr.tar.bz2 # inner .l2.ldscore.gz files are already gzip-compressed
bunzip2 w_hm3.snplist.bz2

Step 2: Pre-process summary statistics
ldsc munge-sumstats \
--sumstats my_gwas.txt \
--N 50000 \
--merge-alleles w_hm3.snplist \
--out my_trait

Step 3a: Estimate heritability
ldsc h2 \
--h2 my_trait.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out my_trait_h2

Step 3b: Estimate genetic correlation
ldsc rg \
--rg trait1.sumstats.gz,trait2.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out trait1_vs_trait2

Pre-processes GWAS summary statistics into the .sumstats.gz format consumed by h2 and rg.
Input summary statistics may be plain, .gz, or .bz2, and can be tab- or whitespace-delimited.
ldsc munge-sumstats \
--sumstats my_gwas.txt.gz \
--out output_prefix \
[--merge-alleles w_hm3.snplist] \
[--signed-sumstats BETA,0] \
[--N 50000] \
[--info-min 0.9] \
[--maf-min 0.01]

Key flags: --signed-sumstats COLNAME,null_value tells the tool which column carries effect direction and what the
null value is (e.g. BETA,0, OR,1, Z,0). Without this flag the tool auto-detects from BETA/LOG_ODDS/OR/Z columns.
--a1-inc skips the signed column and treats all Z-scores as positive (A1 is always the risk allele).
--merge-alleles enforces allele concordance (mismatches are removed), matching Python behavior.
Use --daner or --daner-n for Ripke daner formats (infers N from FRQ_[A/U] headers or Nca/Nco columns).
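For intuition, the signed-stats convention boils down to a p-value-to-Z conversion plus a sign taken from the effect column. A simplified Python sketch (the helper name `p_to_z` is hypothetical, not part of this crate):

```python
from statistics import NormalDist

def p_to_z(p: float, signed_stat: float, null_value: float) -> float:
    """Convert a two-sided p-value plus an effect direction into a signed
    Z-score, following the --signed-sumstats COLNAME,null convention
    (e.g. BETA with null 0, OR with null 1)."""
    z = abs(NormalDist().inv_cdf(p / 2))           # |Z| from the two-sided p
    return z if signed_stat >= null_value else -z  # sign from the effect column

# An odds ratio above 1 yields a positive Z; a negative beta a negative Z.
print(round(p_to_z(0.05, 1.2, 1.0), 2))   # 1.96
print(round(p_to_z(0.05, -0.3, 0.0), 2))  # -1.96
```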
Computes LD scores from a PLINK binary file set (.bed/.bim/.fam).
Annotation inputs (.annot) may be plain, .gz, or .bz2.
Tip for European GWAS: pre-computed 1000G phase 3 LD scores are available from the Broad LDSCORE page. Download
eur_w_ld_chr.tar.bz2; after tar -jxvf, the inner .l2.ldscore.gz files are already gzip-compressed and work directly with ldsc. Non-European populations require computing your own LD scores from an appropriate reference panel.
ldsc l2 \
--bfile /path/to/1000G_EUR \
--out out/eur \
--ld-wind-cm 1.0 \
[--annot annotations/BaselineLD.] \
[--extract snplist.txt] \
[--maf 0.01] \
[--keep keep_individuals.txt] \
[--per-allele] \
[--pq-exp 1.0]

ldsc l2 warns if the LD window spans an entire chromosome; use --yes-really to silence.
Window flags are mutually exclusive: --ld-wind-cm (genetic distance, default 1.0), --ld-wind-kb
(physical distance), or --ld-wind-snps (fixed flanking SNP count).
Partitioned LD scores with --annot prefix: accepts either a single {prefix}.annot[.gz|.bz2] file or
per-chromosome {prefix}{chr}.annot[.gz|.bz2] files for each chromosome present in the BIM. Outputs one L2 column
per annotation and corresponding .l2.M / .l2.M_5_50 files.
--per-allele is equivalent to --pq-exp 1 (weights each r² by p·(1−p)). Use --pq-exp S to
apply (p·(1−p))^S weighting; output columns and .M files receive a _S{S} suffix.
--no-print-annot suppresses the .annot.gz output produced by --cts-bin.
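As a toy numeric illustration of the (p·(1−p))^S weighting described above (the helper name is hypothetical, not the crate's API):

```python
def weighted_l2(r2_row, freqs, S):
    """LD score of one SNP under (p*(1-p))**S weighting of each r^2 term.
    S=0 gives the ordinary LD score; S=1 matches --per-allele."""
    return sum(r2 * (p * (1.0 - p)) ** S for r2, p in zip(r2_row, freqs))

r2 = [1.0, 0.5]   # r^2 with itself and one neighbor
p = [0.5, 0.1]    # allele frequencies of the two SNPs
print(weighted_l2(r2, p, 0))            # 1.5 (unweighted LD score)
print(round(weighted_l2(r2, p, 1), 3))  # 0.295 (per-allele: 1.0*0.25 + 0.5*0.09)
```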
Estimates SNP heritability.
LD score inputs may be plain, .gz, or .bz2.
ldsc h2 \
--h2 trait.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out results

The --ref-ld-chr prefix appends the chromosome number, then .l2.ldscore.gz. So
--ref-ld-chr eur_w_ld_chr/ reads eur_w_ld_chr/1.l2.ldscore.gz … eur_w_ld_chr/22.l2.ldscore.gz.
If the chromosome number falls in the middle of the filename, use @ as a placeholder:
--ref-ld-chr ld/chr@_scores → ld/chr1_scores.l2.ldscore.gz, etc.
The same convention applies to --w-ld-chr.
You may pass a comma-separated list to --ref-ld / --ref-ld-chr (Python behavior);
--w-ld / --w-ld-chr must point to a single fileset.
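The prefix expansion can be mimicked in a few lines (a sketch of the naming convention only; the function name is hypothetical):

```python
def expand_ld_prefix(prefix: str, chrom: int) -> str:
    """Expand a --ref-ld-chr/--w-ld-chr prefix into a per-chromosome
    LD score filename: substitute '@' if present, otherwise append the
    chromosome number, then add the .l2.ldscore.gz suffix."""
    stem = prefix.replace("@", str(chrom)) if "@" in prefix else f"{prefix}{chrom}"
    return stem + ".l2.ldscore.gz"

print(expand_ld_prefix("eur_w_ld_chr/", 1))   # eur_w_ld_chr/1.l2.ldscore.gz
print(expand_ld_prefix("ld/chr@_scores", 7))  # ld/chr7_scores.l2.ldscore.gz
```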
Common options: --no-intercept, --intercept-h2 VALUE, --two-step 30, --chisq-max 80,
--samp-prev 0.1 --pop-prev 0.01 (liability-scale conversion),
--print-coefficients (partitioned h2: per-annotation τ and enrichment).
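The liability-scale conversion driven by --samp-prev/--pop-prev is the standard observed-to-liability transformation (Lee et al. 2011). A stdlib-only Python sketch (the helper name is hypothetical):

```python
from statistics import NormalDist

def h2_liability(h2_obs: float, samp_prev: float, pop_prev: float) -> float:
    """Convert observed-scale h2 to the liability scale.
    K = population prevalence, P = sample case proportion."""
    K, P = pop_prev, samp_prev
    nd = NormalDist()
    t = nd.inv_cdf(1.0 - K)  # liability threshold for prevalence K
    z = nd.pdf(t)            # normal density at the threshold
    return h2_obs * K**2 * (1.0 - K)**2 / (z**2 * P * (1.0 - P))

# Example: observed h2 = 0.10, 10% cases in the sample, 1% population prevalence
print(round(h2_liability(0.10, 0.10, 0.01), 4))  # 0.1533
```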
Overlapping annotations: use --overlap-annot with --frqfile-chr prefix (or --frqfile for
single filesets) to match Python’s overlap-adjusted results. When enabled, LDSC writes
<out>.results with overlap-aware proportion/enrichment columns.
Cell-type-specific h2: use --h2-cts and --ref-ld-chr-cts (see the LDSC wiki for .ldcts
format). Output is written to <out>.cell_type_results.txt. Add --print-all-cts to report
coefficients for all CTS LD score prefixes in each line.
Estimates genetic correlations across all pairs from a list of summary statistic files.
LD score inputs may be plain, .gz, or .bz2.
ldsc rg \
--rg trait1.sumstats.gz,trait2.sumstats.gz,trait3.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out results

--ref-ld-chr / --w-ld-chr follow the same prefix convention as h2 (see above).
Common options: --no-intercept, --intercept-h2 1,1,1 (one per trait), --intercept-gencov 0.0,0.0 (per-pair), --two-step 30,
--samp-prev / --pop-prev (comma-separated, one value per input file).
Generates 0/1 annotation files from a UCSC BED file or a gene set.
# From a BED file:
ldsc make-annot \
--bimfile my_data.bim \
--bed-file regions.bed \
--annot-file output.annot.gz \
--windowsize 100000
# From a gene set:
ldsc make-annot \
--bimfile my_data.bim \
--gene-set-file immune_genes.txt \
--gene-coord-file ENSG_coord.txt \
--annot-file output.annot.gz \
--windowsize 100000

Bins one or more continuous annotations into categories and writes a .annot file
compatible with l2 --annot (Python --cts-bin preprocessing).
ldsc cts-annot \
--bimfile my_data.bim \
--cts-bin DAF.txt,DIST.txt \
--cts-breaks 0.1,0.25,0.4x10,100,1000 \
--cts-names DAF,DIST_TO_GENE \
--annot-file cts.annot.gz

Native builds require Rust ≥ 1.85. The Rust implementation uses faer for dense linear algebra.
Pass --fast-f32 to ldsc l2 to run core matmuls in f32 while accumulating
in f64. This halves the memory bandwidth for GEMM and can be significantly faster on CPUs
with 256-bit SIMD.
Observed speedup varies with panel size (hyperfine-validated, AWS EPYC 7R13):
- N=2,490 (1000G): ~1.3× faster vs f64 exact
- N=50,000 (biobank): ~1.85× faster vs f64 exact (bandwidth-limited, f32 halves per-chunk BED footprint)
- Per-SNP LD score deltas vs f64: max_abs_diff ≈ 0.008.
- Downstream h2/rg regression: identical to f64; the h2 estimate, SE, intercept, and ratio all match to displayed precision. The per-SNP noise is far below the regression's sensitivity threshold.

This is the recommended mode for biobank-scale data when exact h2/rg is needed.
Note: --sketch automatically enables f32 (see below), so --fast-f32 is only needed
for exact or --stochastic modes.
# f32 matmul (runtime flag, no rebuild needed)
ldsc l2 --bfile … --out … --fast-f32
# default f64 (parity-safe)
ldsc l2 --bfile … --out …

Pass --stochastic T to ldsc l2 to use Hutchinson's stochastic trace estimator
with T random Rademacher probe vectors instead of exact GEMM. This approximates
diag(R²) without forming the full correlation matrix, trading precision for speed.
This is an approximation. Per-SNP LD scores will differ from the exact (default) path. The expected relative error per SNP is ~sqrt(2/T). Downstream h2/rg estimates are typically robust to this noise, but users should validate on their specific application before relying on stochastic scores.
| T (probes) | Mean relative error | Median |rel| error | Wall time (1.66M SNPs) |
|---|---|---|---|
| 50 | ~2% | ~7% | 36.2 s (13% faster) |
| 100 | ~1% | ~5% | not recommended* |
| 1000 | ~0.4% | ~1.5% | slower than exact |
*T=100 currently exhibits a memory performance regression on some hardware. Use T=50 for the speed benefit, or omit the flag for exact computation.
# Stochastic (faster, approximate)
ldsc l2 --bfile … --out … --stochastic 50
# Exact (default, numerically identical to Python)
ldsc l2 --bfile … --out …

Limitations:
- Scalar (non-partitioned) LD scores only. If --annot is provided with multiple annotation columns, the flag is ignored and exact GEMM is used.
- Incompatible with --gpu.
- Results are deterministic (fixed PRNG seed) but not identical across runs with different T values.
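Hutchinson's diagonal estimator is simple enough to sketch in numpy. This is an illustration of the technique only; R is formed explicitly for clarity here, whereas the real code evaluates R·z from genotypes without materializing the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(42)
M, N, T = 200, 500, 4000  # SNPs, individuals, probe vectors

# Standardized genotype-like matrix (rows = SNPs) and its correlation matrix.
X = rng.standard_normal((M, N))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
R = (X @ X.T) / N

exact = (R**2).sum(axis=1)  # diag(R^2): the per-SNP sum of r^2 (LD score analog)

# Hutchinson-style estimate: for Rademacher z, E[(Rz) * (Rz)] = diag(R R^T) = diag(R^2).
Z = rng.choice([-1.0, 1.0], size=(M, T))
est = ((R @ Z) ** 2).mean(axis=1)

rel_err = np.abs(est - exact) / exact
print(f"mean relative error: {rel_err.mean():.3f}")  # shrinks roughly like sqrt(2/T)
```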
Pass --sketch d to ldsc l2 to compress the individual dimension from N to d via
a random projection before all GEMM operations. This reduces the inner dimension of
every matrix multiply, trading precision for speed. After projection, each column is
re-normalized (ratio estimator) to reduce per-pair variance by ~2×.
--sketch automatically enables f32. Sketch projections in f64 and f32 produce
bit-identical output because the projection entries (±1/√d for Rademacher, ±1 for
CountSketch) are exactly representable in f32. There is zero accuracy cost and ~1.3×
speed gain, so f64 sketch is strictly dominated. You do not need to pass --fast-f32.
This is an approximation. All SNPs share the same random projection matrix, so errors are correlated across SNPs. The bias is corrected in expectation via adjusted r²_unbiased constants, but a single run can show systematic shifts. Requires d ≥ 3.
| d (dim) | Speedup vs Python (1.66M SNPs, N=2,490) | Pearson r vs exact | Median |rel| error | Recommended for |
|---|---|---|---|---|
| 25 | ~108× (14.3s) | ~0.73 | ~97% | quick screening |
| 50 | ~101× (15.3s) | ~0.81 | ~52% | rough estimates, biobank speed |
| 100 | ~82× (~19s est.) | ~0.85 | ~13% | moderate accuracy |
| 200 | ~61× (25.4s) | ~0.93 | ~6% | recommended default |
| 500 | ~49× (31.5s) | ~0.97 | ~3% | high accuracy |
Accuracy measured at N=2,490 (1000G). At biobank scale (N=50K), CountSketch is recommended over Rademacher — see Projection methods below.
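The core idea, a shared random projection over individuals followed by re-normalization of each projected column (the ratio estimator), can be sketched in numpy. This is an illustration of the technique, not the crate's code:

```python
import numpy as np

rng = np.random.default_rng(42)
N, d, pairs = 2000, 200, 50

# Rademacher projection: compress the individual dimension from N to d.
S = rng.choice([-1.0, 1.0], size=(d, N)) / np.sqrt(d)

errs = []
for _ in range(pairs):
    x = rng.standard_normal(N)
    y = 0.6 * x + 0.8 * rng.standard_normal(N)  # true correlation ~ 0.6
    r_exact = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    sx, sy = S @ x, S @ y
    # Ratio estimator: re-normalize the projected columns before the dot product.
    r_sketch = sx @ sy / (np.linalg.norm(sx) * np.linalg.norm(sy))
    errs.append(abs(r_sketch - r_exact))

print(f"mean |r_sketch - r_exact| at d={d}: {np.mean(errs):.3f}")
```

Because every pair reuses the same S, these errors are correlated across pairs, which is why a single run can show a systematic shift even though the estimator is unbiased in expectation.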
Important: downstream h2/rg regression accuracy. Sketch LD scores introduce measurement
error that causes attenuation bias in h2 regression — the h2 estimate is systematically low
and the intercept is inflated. This effect is much larger than the per-SNP LD score error
suggests. See Downstream regression impact for detailed
benchmarks. If exact h2/rg is needed, use --fast-f32 (exact, 1.84× faster) or CountSketch
at d ≥ 5000.
# Sketch (faster, approximate — d=200 is a good default for LD scores)
ldsc l2 --bfile … --out … --sketch 200
# High-accuracy sketch for downstream h2/rg (d=5000, ~2% h2 bias)
ldsc l2 --bfile … --out … --sketch 5000 --sketch-method countsketch
# Exact (default, numerically identical to Python)
ldsc l2 --bfile … --out …

Two projection methods are available via --sketch-method:
- rademacher (default): dense ±1/√d random matrix. Projection cost is O(d×N×c) via GEMM, leveraging faer's SIMD-optimized matmul for high throughput. Slightly more accurate than CountSketch at the same d.
- countsketch: hash-based projection in which each individual is assigned a random bucket b ∈ {0, …, d−1} and sign σ ∈ {±1}. Projection cost is O(N×c), a scatter-add instead of a full GEMM. Faster than Rademacher at large N (fewer FLOPs), but slightly noisier because each individual contributes to exactly one bucket instead of all d dimensions.
| Method | d=50 median error | d=200 median error | Projection cost | Best for |
|---|---|---|---|---|
| Rademacher | 20.6% | 6.1% | O(d×N×c) | small-to-medium N, best accuracy |
| CountSketch | 21.8% | 7.5% | O(N×c) | large N (biobank), fastest projection |
To match Rademacher d=50 accuracy with CountSketch, use d=100–200.
# CountSketch projection (faster at large N)
ldsc l2 --bfile … --out … --sketch 100 --sketch-method countsketch

- Rademacher: avoid d > 500; cache thrashing from the d×N projection matrix causes a severe regression at d=1000.
- CountSketch: d up to √N is free (the scatter-add is O(N×c), independent of d). Above √N, the downstream d×d matmul begins to matter. Even at d=10000 (20% of N=50K), CountSketch is still ~2× faster than exact-f32 and ~4× faster than exact-f64.
- Downstream h2/rg bias: sketch LD scores cause errors-in-variables attenuation in regression. See Downstream regression impact for benchmarks. Use d ≥ 5000 for h2/rg, or --fast-f32 for exact regression.
- Incompatible with --gpu and --stochastic.
- Results are deterministic (fixed PRNG seed 42).
- Works with partitioned annotations (unlike --stochastic).
Sketch LD scores cause errors-in-variables attenuation bias in h2 regression: the h2 estimate is systematically low and the intercept is inflated. exact-f32 has zero downstream error. For full benchmarks, see docs/performance-deep-dive.md.
| Use case | Recommended mode | Speedup vs exact-f64 |
|---|---|---|
| Exact h2/rg needed | --fast-f32 | 1.84× |
| h2 within ~2% | --sketch 5000 --sketch-method countsketch | ~5× |
| h2 within ~4% | --sketch 1000 --sketch-method countsketch | ~9× |
| LD scores only (QC, visualization) | --sketch 200 --sketch-method countsketch | ~20× |
Pass --subsample N' to ldsc l2 to randomly select N' individuals from the reference
panel before LD computation. This reduces both BED I/O and GEMM cost proportionally —
runtime scales with N' instead of N.
This is an approximation. LD scores from a subsample are noisier than from the full
panel. For biobank-scale data (N > 10K), N' = 2,000–5,000 gives accurate LD scores for
common variants (MAF > 5%). The subsample is deterministic (seed 42) and sorted to preserve
sequential BED read order. Cannot be combined with --keep.
# Subsample 5000 individuals from a 50K panel
ldsc l2 --bfile biobank_50k --out out --subsample 5000
# Full panel (default)
ldsc l2 --bfile biobank_50k --out out

On HPC clusters with networked filesystems (GPFS, NFS, Lustre), BED file reads travel over
the network and can block for tens of milliseconds per chunk. --prefetch-bed reads the next
BED chunk on a background thread while the compute thread runs GEMM, overlapping I/O latency
with computation.
# Recommended for GPFS/NFS (HPC) — first run or cold filesystem cache
ldsc l2 --bfile … --out … --prefetch-bed
# Default (no flag) — always correct, optimal for local SSD
ldsc l2 --bfile … --out …

When to use it:
| Storage | Cache state | Use --prefetch-bed? |
|---|---|---|
| Local SSD | Warm (repeated runs) | No — regresses ~10% |
| Local SSD | Cold (first run) | Neutral to slight benefit |
| GPFS / NFS / Lustre | Cold (typical HPC job) | Yes — hides network latency |
| GPFS / NFS / Lustre | Warm (same job, 2nd pass) | Probably neutral |
Why it regresses on local SSD: With a warm OS page cache the BED file is already in RAM,
so read() is just a memcpy from kernel pages — no blocking I/O. The reader thread still
uses CPU for 2-bit genotype decoding and competes with the rayon GEMM thread pool. On a
6-core / 12-thread machine, adding a 13th thread preempts GEMM workers and slows the whole run.
Why it helps on GPFS: Each read() call blocks waiting for data over InfiniBand. While the
reader thread waits, the CPU is free for GEMM — genuine parallelism with no resource contention.
Benefit scales with GPFS cache miss rate (highest on the first job run after data is staged).
Note for PMACS users: The cluster runs CentOS 7 (kernel 3.10). io_uring (kernel 5.1+) is
not available; --prefetch-bed uses standard POSIX pread on a background std::thread.
Output is always bit-identical to the default path.
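The overlap pattern is a standard single-producer pipeline. A minimal Python sketch of the idea (illustrative only; the actual implementation is a Rust std::thread issuing pread calls, per the note above):

```python
import threading
import queue

def read_chunks(n):
    """Stand-in for sequential BED chunk reads (possibly slow on NFS/GPFS)."""
    for i in range(n):
        yield [i] * 4

def compute(chunk):
    """Stand-in for the per-chunk GEMM."""
    return sum(chunk)

def run_with_prefetch(n):
    q = queue.Queue(maxsize=1)  # the reader stays exactly one chunk ahead
    done = object()

    def reader():
        for chunk in read_chunks(n):
            q.put(chunk)        # blocks while the compute thread is behind
        q.put(done)

    threading.Thread(target=reader, daemon=True).start()
    total = 0
    while (chunk := q.get()) is not done:
        total += compute(chunk)
    return total

# Same result as a plain sequential loop: prefetching changes timing, not output.
print(run_with_prefetch(10))  # 180
```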
For HPC deployments with GPFS, Lustre, or other networked filesystems, --mmap uses
memory-mapped I/O instead of buffered reads. This provides:
- Zero-copy access for the fused CountSketch path (eliminates ~20GB of memcpy at biobank scale)
- OS-managed readahead via MADV_SEQUENTIAL: GPFS can prefetch across storage nodes in parallel
- Async prefetch via MADV_WILLNEED on the next chunk, replacing --prefetch-bed without thread contention
- No seek invalidation: unlike BufReader, mmap'd pages stay resident once faulted
# Recommended for GPFS/Lustre HPC
ldsc l2 --bfile … --out … --mmap
# Can combine with any mode
ldsc l2 --bfile … --out … --mmap --sketch 200 --sketch-method countsketch

Note: on local SSD with warm cache, --mmap regresses ~15% due to page-fault overhead.
Use the default (no flag) for local storage. --mmap is designed for networked filesystems
where it replaces both --prefetch-bed and buffered reads.
Build with --features gpu to enable CUDA-accelerated matrix multiplication via
CubeCL. Requires a CUDA toolkit at build time.
cargo build --release --features gpu
ldsc l2 --bfile … --out … --gpu # f32 compute (default)
ldsc l2 --bfile … --out … --gpu --gpu-f64 # native f64 compute
ldsc l2 --bfile … --out … --gpu --gpu-flex32 # half-precision compute, f32 accumulation

Precision options:
- --gpu: default f32 compute
- --gpu-f64: native f64 on the GPU (slower but numerically exact)
- --gpu-flex32: half-precision compute with f32 accumulation (fastest, slight accuracy loss)
- --gpu-tile-cols N: split large window matrices into VRAM-fitting tiles
At 1000G scale (n=2,490), the GPU path is transfer-bound and slower than CPU. GPU acceleration targets biobank-scale cohorts (n ≥ 50k), where each chunk's GEMM is large enough for compute to dominate PCIe transfer.
Default build (CPU only):
cargo build --release

Releases include Linux, macOS, and Windows archives that contain ldsc, LICENSE, and README.md.
# Linux (x86_64)
curl -L -o ldsc_linux-x86_64.tar.gz \
https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_linux-x86_64.tar.gz
tar -xzf ldsc_linux-x86_64.tar.gz
./ldsc --help
# macOS (Apple Silicon)
curl -L -o ldsc_macos-aarch64.tar.gz \
https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_macos-aarch64.tar.gz
tar -xzf ldsc_macos-aarch64.tar.gz
./ldsc --help
# Windows (x86_64)
# Download the zip from the release page and extract:
# https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_windows-x86_64.zip
Images are published to the GitHub Container Registry on every push to main and for each version tag.
docker pull ghcr.io/sharifhsn/ldsc:latest
# Run with local data mounted
docker run --rm \
-v /path/to/data:/data \
ghcr.io/sharifhsn/ldsc:latest \
h2 --h2 /data/trait.sumstats.gz \
--ref-ld-chr /data/eur_w_ld_chr/ \
--w-ld-chr /data/eur_w_ld_chr/ \
--out /data/results

Version tags (v1.2.3) produce :1.2.3, :1.2, and :latest. Pushes to main produce a :main
tag and a short-SHA tag (:sha-XXXXXXX).
Requires a Rust toolchain (≥ 1.85; edition 2024 features used).
cargo build --release
# binary: target/release/ldsc

The release profile sets opt-level = 3, lto = "thin", codegen-units = 1.
For a fully static binary that runs on any Linux (including CentOS 7 / RHEL 7 HPC clusters),
build with musl and the mimalloc feature:
# Requires the musl target (one-time setup)
rustup target add x86_64-unknown-linux-musl
cargo build --release --features mimalloc --target x86_64-unknown-linux-musl
strip target/x86_64-unknown-linux-musl/release/ldsc
# Copy the single binary to your cluster (no dependencies needed)

This produces a static-pie binary with zero shared library dependencies. The mimalloc
feature replaces musl's default allocator with Microsoft's mimalloc,
which eliminates the ~12% performance penalty musl's single-threaded allocator incurs under
rayon parallelism. On AWS EPYC benchmarks, musl + mimalloc is actually 4.5% faster than
the default glibc build (69.9s vs 73.2s on full 1000G).
Without --features mimalloc, a musl build works but is ~12% slower due to allocator contention.
The following flags are available for performance tuning:
- --rayon-threads N: Rayon thread count for jackknife in h2/rg.
- --polars-threads N: Polars thread count for CSV streaming in munge-sumstats.
- --prefetch-bed: background BED reader thread for l2. Beneficial on networked filesystems (GPFS/NFS); hurts on local SSD with warm cache. See BED prefetch for full guidance.
- --mmap: memory-mapped BED I/O for l2. Recommended for GPFS/Lustre HPC; replaces --prefetch-bed. See mmap for details.
Benchmarks on AWS c6a.4xlarge (AMD EPYC 7R13, 16 vCPU) using 1000 Genomes Phase 3
(1,664,852 SNPs, n = 2,490 individuals). Measured with hyperfine (1 warmup + 3 timed runs).
Static musl binary with mimalloc, AVX2+FMA target features.
| Mode | Full genome wall time | vs Python | Accuracy |
|---|---|---|---|
| Python | 25 min 49 s | 1.0× | reference |
| Rust f64 (default) | 41.1 s | ~38× | exact (max_abs_diff = 0) |
| Rust f32 (--fast-f32) | ~32 s | ~48× | max_abs_diff ≈ 0.008 |
| Rust stochastic (--stochastic 50) | 36.2 s | ~43× | ~7% median per-SNP error |
| Rust sketch (--sketch 200) | 25.4 s | ~61× | Pearson r ≈ 0.93 vs exact |
| Rust sketch (--sketch 50) | 15.3 s | ~101× | Pearson r ≈ 0.81 vs exact |
--ld-wind-kb 1000, --chunk-size 200. --sketch automatically enables f32 (bit-identical
to f64 for sketch, ~1.3× faster). The --fast-f32, --stochastic, and --sketch paths
trade exact parity for speed; the default f64 path is numerically identical to Python across
all 1,664,852 SNPs.
At biobank-scale N, GEMM dominates and larger panels expose more parallelism. Python runtime is extrapolated assuming linear O(N) scaling; Rust runtimes are hyperfine-measured (AWS EPYC 7R13, same hardware as above).
| Mode | Wall time | vs exact-f64 | Notes |
|---|---|---|---|
| exact-f64 | 665.9 s | 1.0× | numerically exact |
| exact-f32 (--fast-f32) | 361.9 s | 1.84× | halved BED bandwidth |
| --sketch 50 (Rademacher) | 118.4 s | 5.6× | r ≈ 0.81 |
| --sketch 200 (Rademacher) | 151.8 s | 4.4× | r ≈ 0.93 |
| --sketch 50 --sketch-method countsketch | 33.1 s | 20.1× | r ≈ 0.81 |
| --sketch 200 --sketch-method countsketch | 33.8 s | 19.7× | r ≈ 0.93 |
| --sketch 500 --sketch-method countsketch | 36.2 s | 18.4× | r ≈ 0.97 |
| --sketch 1000 --sketch-method countsketch | 39.7 s | 16.8× | r ≈ 0.99 |
| --sketch 5000 --sketch-method countsketch | ~55 s* | ~12×* | r ≈ 0.998, h2 ~2% low |
| --sketch 10000 --sketch-method countsketch | ~75 s* | ~9×* | r ≈ 0.999, h2 exact |
| --subsample 5000 --sketch 50 | 24.9 s | 26.7× | fastest (compounds two approximations) |
*d=5000 and d=10000 timings estimated from local scaling (1.7× and 2.3× vs d=200); AWS times will differ but the relative ordering is stable. Even at d=10000 (20% of N), CountSketch is still ~5× faster than exact-f32 and ~9× faster than exact-f64.
Fused CountSketch reads packed BED bytes and scatter-adds directly into the d×c sketch buffer, eliminating the N×c intermediate entirely. Cost is O(N×c) independent of d, so d=200 has the same speed as d=50 but much better accuracy. A 28 GB container is recommended for N=50K.
Speedup varies with window size (200k-SNP extract, same machine):
| Window | Python | Rust f64 | Speedup |
|---|---|---|---|
| --ld-wind-kb 100 | 44.2 s | 5.1 s | 8.7× |
| --ld-wind-kb 500 | 48.4 s | 6.2 s | 7.8× |
| --ld-wind-kb 1000 | 53.7 s | 8.6 s | 6.2× |
| --ld-wind-kb 2000 | 61.8 s | 12.9 s | 4.8× |
Runtime scales with both M (SNP count) and N (individual count). The ring-buffer algorithm keeps memory bounded by the LD window size, not the total SNP count.
Scaling with M (fixed N=2,490, exact-f64):
| M (SNPs) | Est. wall time (AWS EPYC 16v) | BED size | Peak memory |
|---|---|---|---|
| 1.66M | 41 s (measured) | 1 GB | ~100 MB |
| 5M | ~124 s | 3 GB | ~100 MB |
| 10M | ~247 s | 6 GB | ~100 MB |
| 50M | ~1,235 s (~21 min) | 30 GB | ~100 MB |
Scaling with N (fixed M=1.66M, full-genome):
| N (individuals) | Mode | Wall time | vs Python |
|---|---|---|---|
| 2,490 (1000G) | exact-f64 | 41.1 s (measured) | ~38× |
| 2,490 (1000G) | --sketch 50 | 15.3 s (measured) | ~101× |
| 50,000 (biobank) | exact-f64 | 665.9 s (measured) | — |
| 50,000 (biobank) | countsketch-200 | 33.8 s (measured) | — |
| 50,000 (biobank) | countsketch-5000 | ~55 s (estimated) | — |
| 50,000 (biobank) | countsketch-10000 | ~75 s (estimated) | — |
At biobank N, GEMM cost is O(N × w × c) per chunk; CountSketch reduces this to O(N×c) scatter-add,
independent of d. Rademacher sketch reduces inner dimension from N to d but is still GEMM-bound.
BED I/O is sequential and throughput-bound; --fast-f32 halves BED read bytes per chunk.
Additional UKBB I/O benchmarks on this machine (Apple M4, 10 CPU cores, 24 GB RAM, macOS 26.3 build 25D125). These highlight I/O-heavy workflows and the impact of the Rust pipeline’s faster parsing and joins.
Dataset: /Users/sharif/Code/ldsc/data/biomarkers-30600-both_sexes-irnt.sample8m.tsv
(~497 MB) for munge-sumstats; and /Users/sharif/Code/ldsc/data/UKBB.ALL.ldscore/UKBB.EUR.l2.ldscore.gz
for h2/rg (381,831 SNPs after merge).
Quick local checks for l2 default to a 50k SNP extract via scripts/bench_l2_py3_vs_rust.sh.
| Workflow | Rust | Python | Speedup |
|---|---|---|---|
| munge-sumstats | 3.74 s | 62.65 s | 16.75× |
| h2 | 0.90 s | 7.81 s | 8.68× |
| rg (two traits) | 2.93 s | 28.09 s | 9.59× |
Reference HPC hardware (Penn PMACS) for scaling tests:
- 19 Dell C6420 quad-node systems (76 compute nodes, 6,080 cores total, CentOS 7.8, 80 CPU cores per node with hyper-threading, 256 GB or 512 GB RAM per node, 56 Gb/s FDR or 100 Gb/s EDR InfiniBand to the filesystem, 10 Gb/s Ethernet).
- 1 Dell R940 big-memory system (1.5 TB RAM, 96 CPU cores, 10 Gb/s Ethernet, 100 Gb/s EDR InfiniBand).
- 2 GPU nodes (1× Nvidia Tesla P100, 512 GB RAM, 88 CPU cores, 10 Gb/s Ethernet, 100 Gb/s EDR InfiniBand).
- 4.2 PB IBM Spectrum Scale (GPFS) disk storage (2 tiers, no backup).
- 1.3 PB mirrored archive tape storage.
- LSF job scheduling system.
Python LDSC consists of three separate scripts; this crate consolidates them into subcommands of a
single ldsc binary:
| Python | Rust |
|---|---|
| python munge_sumstats.py --sumstats … --out … | ldsc munge-sumstats --sumstats … --out … |
| python ldsc.py --l2 --bfile … --out … | ldsc l2 --bfile … --out … |
| python ldsc.py --h2 … --ref-ld-chr … | ldsc h2 --h2 … --ref-ld-chr … |
| python ldsc.py --rg … --ref-ld-chr … | ldsc rg --rg … --ref-ld-chr … |
| python make_annot.py --bimfile … --bed-file … | ldsc make-annot --bimfile … --bed-file … |
| python ldsc.py --cts-bin … | ldsc cts-annot … |
Python's --l2 flag (LD score estimation mode) becomes the l2 subcommand. The --h2 and
--rg flags (regression modes) become h2 and rg subcommands.
Python flag names are supported directly.
- --maf in l2: default now matches Python (MAF prefilter before LD computation).
- --n-min default: when --n-min is 0, Rust now matches Python (90th percentile / 1.5).
- --yes-really: Rust warns when the LD window spans a whole chromosome and --yes-really is not set (Python errors).
- --chunksize: Python requires explicit chunking for large files; Rust uses Polars LazyFrame streaming and ignores chunk size for munge.
- --return-silly-things / --invert-anyway: accepted for CLI parity; Rust never clips results and always uses a least-squares solver (warnings emitted).
- --no-print-annot: only affects --cts-bin output; suppresses the .annot.gz file.
- --cts-bin workflow: supported directly by ldsc l2 (also available as a separate preprocessor via ldsc cts-annot).
The following Python flags are accepted for CLI parity but do not change behavior in Rust:
- h2/rg --return-silly-things
- h2/rg --invert-anyway
- Performance deep-dive — algorithmic complexity for each mode, scaling analysis for dense SNP panels (O(M² × N) with distance-based windows), downstream h2/rg regression accuracy by sketch dimension, and why Python is slow.
- Architecture & source map — module-level code map, key data-flow invariants, and dependency rationale.
Releases are cut with cargo-release and tagged as vX.Y.Z. Tag pushes trigger the release
workflow, which builds and uploads platform tarballs to GitHub Releases.
cargo release patch
cargo release patch --execute

Requires Docker with BuildKit (default since Docker 23):

docker build -t ldsc .

The multi-stage Dockerfile uses cargo-chef to
cache dependency compilation in a separate layer, so incremental rebuilds only recompile changed
source files.