A compiled, statically-typed rewrite of Bulik-Sullivan et al.'s LDSC in Rust.
Implements six subcommands — munge-sumstats, l2, h2, rg, make-annot, cts-annot — with
numerically identical output and a ~38× speedup on LD score computation
(exact mode, 1000G N=2,490; up to 101× with --sketch). Approximate modes
(--sketch, --stochastic, --subsample) trade per-SNP precision for additional throughput.
Fastest (no Rust required):
docker run --rm ghcr.io/sharifhsn/ldsc:latest --help

Local install options:
- Prebuilt binaries from GitHub Releases (see “Prebuilt Binaries” below).
- Cargo install (requires Rust):
cargo install ldsc
The typical LDSC workflow — preprocess summary statistics, then estimate heritability or genetic correlation — mirrors the upstream wiki tutorial.
Step 1: Download pre-computed European LD scores (skip l2 for European GWAS)
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
tar -jxvf eur_w_ld_chr.tar.bz2 # inner .l2.ldscore.gz files are already gzip-compressed
bunzip2 w_hm3.snplist.bz2

Step 2: Pre-process summary statistics
ldsc munge-sumstats \
--sumstats my_gwas.txt \
--N 50000 \
--merge-alleles w_hm3.snplist \
--out my_trait

Step 3a: Estimate heritability
ldsc h2 \
--h2 my_trait.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out my_trait_h2

Step 3b: Estimate genetic correlation
ldsc rg \
--rg trait1.sumstats.gz,trait2.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out trait1_vs_trait2

Pre-processes GWAS summary statistics into the .sumstats.gz format consumed by h2 and rg.
Input summary statistics may be plain, .gz, or .bz2, and can be tab- or whitespace-delimited.
ldsc munge-sumstats \
--sumstats my_gwas.txt.gz \
--out output_prefix \
[--merge-alleles w_hm3.snplist] \
[--signed-sumstats BETA,0] \
[--N 50000] \
[--info-min 0.9] \
[--maf-min 0.01]

Key flags: --signed-sumstats COLNAME,null_value tells the tool which column carries effect direction and what the
null value is (e.g. BETA,0, OR,1, Z,0). Without this flag the tool auto-detects from BETA/LOG_ODDS/OR/Z columns.
--a1-inc skips the signed column and treats all Z-scores as positive (A1 is always the risk allele).
--merge-alleles enforces allele concordance (mismatches are removed), matching Python behavior.
Use --daner or --daner-n for Ripke daner formats (infers N from FRQ_[A/U] headers or Nca/Nco columns).
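For intuition, the signed-stats convention boils down to a p-value-to-Z conversion plus a sign taken from the effect column. A simplified Python sketch (the helper name `p_to_z` is hypothetical, not part of this crate):

```python
from statistics import NormalDist

def p_to_z(p: float, signed_stat: float, null_value: float) -> float:
    """Convert a two-sided p-value plus an effect direction into a signed
    Z-score, following the --signed-sumstats COLNAME,null convention
    (e.g. BETA with null 0, OR with null 1)."""
    z = abs(NormalDist().inv_cdf(p / 2))           # |Z| from the two-sided p
    return z if signed_stat >= null_value else -z  # sign from the effect column

# An odds ratio above 1 yields a positive Z; a negative beta a negative Z.
print(round(p_to_z(0.05, 1.2, 1.0), 2))   # 1.96
print(round(p_to_z(0.05, -0.3, 0.0), 2))  # -1.96
```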
Computes LD scores from a PLINK binary file set (.bed/.bim/.fam).
Annotation inputs (.annot) may be plain, .gz, or .bz2.
Tip for European GWAS: pre-computed 1000G phase 3 LD scores are available from the Broad LDSCORE page. Download
eur_w_ld_chr.tar.bz2; after tar -jxvf, the inner .l2.ldscore.gz files are already gzip-compressed and work directly with ldsc. Non-European populations require computing your own LD scores from an appropriate reference panel.
ldsc l2 \
--bfile /path/to/1000G_EUR \
--out out/eur \
--ld-wind-cm 1.0 \
[--annot annotations/BaselineLD.] \
[--extract snplist.txt] \
[--maf 0.01] \
[--keep keep_individuals.txt] \
[--per-allele] \
[--pq-exp 1.0]

ldsc l2 warns if the LD window spans an entire chromosome; use --yes-really to silence.
Window flags are mutually exclusive: --ld-wind-cm (genetic distance, default 1.0), --ld-wind-kb
(physical distance), or --ld-wind-snps (fixed flanking SNP count).
Partitioned LD scores with --annot prefix: accepts either a single {prefix}.annot[.gz|.bz2] file or
per-chromosome {prefix}{chr}.annot[.gz|.bz2] files for each chromosome present in the BIM. Outputs one L2 column
per annotation and corresponding .l2.M / .l2.M_5_50 files.
--per-allele is equivalent to --pq-exp 1 (weights each r² by p·(1−p)). Use --pq-exp S to
apply (p·(1−p))^S weighting; output columns and .M files receive a _S{S} suffix.
--no-print-annot suppresses the .annot.gz output produced by --cts-bin.
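As a toy numeric illustration of the (p·(1−p))^S weighting described above (the helper name is hypothetical, not the crate's API):

```python
def weighted_l2(r2_row, freqs, S):
    """LD score of one SNP under (p*(1-p))**S weighting of each r^2 term.
    S=0 gives the ordinary LD score; S=1 matches --per-allele."""
    return sum(r2 * (p * (1.0 - p)) ** S for r2, p in zip(r2_row, freqs))

r2 = [1.0, 0.5]   # r^2 with itself and one neighbor
p = [0.5, 0.1]    # allele frequencies of the two SNPs
print(weighted_l2(r2, p, 0))            # 1.5 (unweighted LD score)
print(round(weighted_l2(r2, p, 1), 3))  # 0.295 (per-allele: 1.0*0.25 + 0.5*0.09)
```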
Estimates SNP heritability.
LD score inputs may be plain, .gz, or .bz2.
ldsc h2 \
--h2 trait.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out results

The --ref-ld-chr prefix appends the chromosome number, then .l2.ldscore.gz. So
--ref-ld-chr eur_w_ld_chr/ reads eur_w_ld_chr/1.l2.ldscore.gz … eur_w_ld_chr/22.l2.ldscore.gz.
If the chromosome number falls in the middle of the filename, use @ as a placeholder:
--ref-ld-chr ld/chr@_scores → ld/chr1_scores.l2.ldscore.gz, etc.
The same convention applies to --w-ld-chr.
You may pass a comma-separated list to --ref-ld / --ref-ld-chr (Python behavior);
--w-ld / --w-ld-chr must point to a single fileset.
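The prefix expansion can be mimicked in a few lines (a sketch of the naming convention only; the function name is hypothetical):

```python
def expand_ld_prefix(prefix: str, chrom: int) -> str:
    """Expand a --ref-ld-chr/--w-ld-chr prefix into a per-chromosome
    LD score filename: substitute '@' if present, otherwise append the
    chromosome number, then add the .l2.ldscore.gz suffix."""
    stem = prefix.replace("@", str(chrom)) if "@" in prefix else f"{prefix}{chrom}"
    return stem + ".l2.ldscore.gz"

print(expand_ld_prefix("eur_w_ld_chr/", 1))   # eur_w_ld_chr/1.l2.ldscore.gz
print(expand_ld_prefix("ld/chr@_scores", 7))  # ld/chr7_scores.l2.ldscore.gz
```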
Common options: --no-intercept, --intercept-h2 VALUE, --two-step 30, --chisq-max 80,
--samp-prev 0.1 --pop-prev 0.01 (liability-scale conversion),
--print-coefficients (partitioned h2: per-annotation τ and enrichment).
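The liability-scale conversion driven by --samp-prev/--pop-prev is the standard observed-to-liability transformation (Lee et al. 2011). A stdlib-only Python sketch (the helper name is hypothetical):

```python
from statistics import NormalDist

def h2_liability(h2_obs: float, samp_prev: float, pop_prev: float) -> float:
    """Convert observed-scale h2 to the liability scale.
    K = population prevalence, P = sample case proportion."""
    K, P = pop_prev, samp_prev
    nd = NormalDist()
    t = nd.inv_cdf(1.0 - K)  # liability threshold for prevalence K
    z = nd.pdf(t)            # normal density at the threshold
    return h2_obs * K**2 * (1.0 - K)**2 / (z**2 * P * (1.0 - P))

# Example: observed h2 = 0.10, 10% cases in the sample, 1% population prevalence
print(round(h2_liability(0.10, 0.10, 0.01), 4))  # 0.1533
```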
Overlapping annotations: use --overlap-annot with --frqfile-chr prefix (or --frqfile for
single filesets) to match Python’s overlap-adjusted results. When enabled, LDSC writes
<out>.results with overlap-aware proportion/enrichment columns.
Cell-type-specific h2: use --h2-cts and --ref-ld-chr-cts (see the LDSC wiki for .ldcts
format). Output is written to <out>.cell_type_results.txt. Add --print-all-cts to report
coefficients for all CTS LD score prefixes in each line.
Estimates genetic correlations across all pairs from a list of summary statistic files.
LD score inputs may be plain, .gz, or .bz2.
ldsc rg \
--rg trait1.sumstats.gz,trait2.sumstats.gz,trait3.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out results

--ref-ld-chr / --w-ld-chr follow the same prefix convention as h2 (see above).
Common options: --no-intercept, --intercept-h2 1,1,1 (one per trait), --intercept-gencov 0.0,0.0 (per-pair), --two-step 30,
--samp-prev / --pop-prev (comma-separated, one value per input file).
Generates 0/1 annotation files from a UCSC BED file or a gene set.
# From a BED file:
ldsc make-annot \
--bimfile my_data.bim \
--bed-file regions.bed \
--annot-file output.annot.gz \
--windowsize 100000
# From a gene set:
ldsc make-annot \
--bimfile my_data.bim \
--gene-set-file immune_genes.txt \
--gene-coord-file ENSG_coord.txt \
--annot-file output.annot.gz \
--windowsize 100000

Bins one or more continuous annotations into categories and writes a .annot file
compatible with l2 --annot (Python --cts-bin preprocessing).
ldsc cts-annot \
--bimfile my_data.bim \
--cts-bin DAF.txt,DIST.txt \
--cts-breaks 0.1,0.25,0.4x10,100,1000 \
--cts-names DAF,DIST_TO_GENE \
--annot-file cts.annot.gz

Native builds require Rust ≥ 1.85. The Rust implementation uses faer for dense linear algebra.
Pass --fast-f32 to ldsc l2 to run core matmuls in f32 while accumulating
in f64. This halves the memory bandwidth for GEMM and can be significantly faster on CPUs
with 256-bit SIMD.
Observed speedup varies with panel size (hyperfine-validated, AWS EPYC 7R13):
- N=2,490 (1000G): ~1.3× faster vs f64 exact
- N=50,000 (biobank): ~1.85× faster vs f64 exact (bandwidth-limited, f32 halves per-chunk BED footprint)
- Per-SNP LD score deltas vs f64: max_abs_diff ≈ 0.008.
- Downstream h2/rg regression: identical to f64; the h2 estimate, SE, intercept, and ratio all match to displayed precision. The per-SNP noise is far below the regression's sensitivity threshold.

This is the recommended mode for biobank-scale data when exact h2/rg is needed.
Note: --sketch automatically enables f32 (see below), so --fast-f32 is only needed
for exact or --stochastic modes.
# f32 matmul (runtime flag, no rebuild needed)
ldsc l2 --bfile … --out … --fast-f32
# default f64 (parity-safe)
ldsc l2 --bfile … --out …

Pass --stochastic T to ldsc l2 to use Hutchinson's stochastic trace estimator
with T random Rademacher probe vectors instead of exact GEMM. This approximates
diag(R²) without forming the full correlation matrix, trading precision for speed.
This is an approximation. Per-SNP LD scores will differ from the exact (default) path. The expected relative error per SNP is ~sqrt(2/T). Downstream h2/rg estimates are typically robust to this noise, but users should validate on their specific application before relying on stochastic scores.
| T (probes) | Mean relative error | Median |rel| error | Wall time (1.66M SNPs) |
|---|---|---|---|
| 50 | ~2% | ~7% | 36.2 s (13% faster) |
| 100 | ~1% | ~5% | not recommended* |
| 1000 | ~0.4% | ~1.5% | slower than exact |
*T=100 currently exhibits a memory performance regression on some hardware. Use T=50 for the speed benefit, or omit the flag for exact computation.
# Stochastic (faster, approximate)
ldsc l2 --bfile … --out … --stochastic 50
# Exact (default, numerically identical to Python)
ldsc l2 --bfile … --out …

Limitations:
- Scalar (non-partitioned) LD scores only. If --annot is provided with multiple annotation columns, the flag is ignored and exact GEMM is used.
- Incompatible with --gpu.
- Results are deterministic (fixed PRNG seed) but not identical across runs with different T values.
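Hutchinson's diagonal estimator is simple enough to sketch in numpy. This is an illustration of the technique only; R is formed explicitly for clarity here, whereas the real code evaluates R·z from genotypes without materializing the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(42)
M, N, T = 200, 500, 4000  # SNPs, individuals, probe vectors

# Standardized genotype-like matrix (rows = SNPs) and its correlation matrix.
X = rng.standard_normal((M, N))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
R = (X @ X.T) / N

exact = (R**2).sum(axis=1)  # diag(R^2): the per-SNP sum of r^2 (LD score analog)

# Hutchinson-style estimate: for Rademacher z, E[(Rz) * (Rz)] = diag(R R^T) = diag(R^2).
Z = rng.choice([-1.0, 1.0], size=(M, T))
est = ((R @ Z) ** 2).mean(axis=1)

rel_err = np.abs(est - exact) / exact
print(f"mean relative error: {rel_err.mean():.3f}")  # shrinks roughly like sqrt(2/T)
```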
Pass --sketch d to ldsc l2 to compress the individual dimension from N to d via
a random projection before all GEMM operations. This reduces the inner dimension of
every matrix multiply, trading precision for speed. After projection, each column is
re-normalized (ratio estimator) to reduce per-pair variance by ~2×.
--sketch automatically enables f32. Sketch projections in f64 and f32 produce
bit-identical output because the projection entries (±1/√d for Rademacher, ±1 for
CountSketch) are exactly representable in f32. There is zero accuracy cost and ~1.3×
speed gain, so f64 sketch is strictly dominated. You do not need to pass --fast-f32.
This is an approximation. All SNPs share the same random projection matrix, so errors are correlated across SNPs. The bias is corrected in expectation via adjusted r²_unbiased constants, but a single run can show systematic shifts. Requires d ≥ 3.
| d (dim) | Speedup vs Python (1.66M SNPs, N=2,490) | Pearson r vs exact | Median |rel| error | Recommended for |
|---|---|---|---|---|
| 25 | ~108× (14.3s) | ~0.73 | ~97% | quick screening |
| 50 | ~101× (15.3s) | ~0.81 | ~52% | rough estimates, biobank speed |
| 100 | ~82× (~19s est.) | ~0.85 | ~13% | moderate accuracy |
| 200 | ~61× (25.4s) | ~0.93 | ~6% | recommended default |
| 500 | ~49× (31.5s) | ~0.97 | ~3% | high accuracy |
Accuracy measured at N=2,490 (1000G). At biobank scale (N=50K), CountSketch is recommended over Rademacher — see Projection methods below.
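The core idea, a shared random projection over individuals followed by re-normalization of each projected column (the ratio estimator), can be sketched in numpy. This is an illustration of the technique, not the crate's code:

```python
import numpy as np

rng = np.random.default_rng(42)
N, d, pairs = 2000, 200, 50

# Rademacher projection: compress the individual dimension from N to d.
S = rng.choice([-1.0, 1.0], size=(d, N)) / np.sqrt(d)

errs = []
for _ in range(pairs):
    x = rng.standard_normal(N)
    y = 0.6 * x + 0.8 * rng.standard_normal(N)  # true correlation ~ 0.6
    r_exact = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    sx, sy = S @ x, S @ y
    # Ratio estimator: re-normalize the projected columns before the dot product.
    r_sketch = sx @ sy / (np.linalg.norm(sx) * np.linalg.norm(sy))
    errs.append(abs(r_sketch - r_exact))

print(f"mean |r_sketch - r_exact| at d={d}: {np.mean(errs):.3f}")
```

Because every pair reuses the same S, these errors are correlated across pairs, which is why a single run can show a systematic shift even though the estimator is unbiased in expectation.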
Important: downstream h2/rg regression accuracy. Sketch LD scores introduce measurement
error that causes attenuation bias in h2 regression — the h2 estimate is systematically low
and the intercept is inflated. This effect is much larger than the per-SNP LD score error
suggests. See Downstream regression impact for detailed
benchmarks. If exact h2/rg is needed, use --fast-f32 (exact, 1.84× faster) or CountSketch
at d ≥ 5000.
# Sketch (faster, approximate — d=200 is a good default for LD scores)
ldsc l2 --bfile … --out … --sketch 200
# High-accuracy sketch for downstream h2/rg (d=5000, ~2% h2 bias)
ldsc l2 --bfile … --out … --sketch 5000 --sketch-method countsketch
# Exact (default, numerically identical to Python)
ldsc l2 --bfile … --out …

Two projection methods are available via --sketch-method:
- rademacher (default): dense ±1/√d random matrix. Projection cost is O(d×N×c) via GEMM, leveraging faer's SIMD-optimized matmul for high throughput. Slightly more accurate than CountSketch at the same d.
- countsketch: hash-based projection in which each individual is assigned a random bucket b ∈ {0, …, d−1} and sign σ ∈ {±1}. Projection cost is O(N×c), a scatter-add instead of a full GEMM. Faster than Rademacher at large N (fewer FLOPs), but slightly noisier because each individual contributes to exactly one bucket instead of all d dimensions.
| Method | d=50 median error | d=200 median error | Projection cost | Best for |
|---|---|---|---|---|
| Rademacher | 20.6% | 6.1% | O(d×N×c) | small-to-medium N, best accuracy |
| CountSketch | 21.8% | 7.5% | O(N×c) | large N (biobank), fastest projection |
To match Rademacher d=50 accuracy with CountSketch, use d=100–200.
# CountSketch projection (faster at large N)
ldsc l2 --bfile … --out … --sketch 100 --sketch-method countsketch

- Rademacher: avoid d > 500; cache thrashing from the d×N projection matrix causes a severe regression at d=1000.
- CountSketch: d up to √N is free (the scatter-add is O(N×c), independent of d). Above √N, the downstream d×d matmul begins to matter. Even at d=10000 (20% of N=50K), CountSketch is still ~2× faster than exact-f32 and ~4× faster than exact-f64.
- Downstream h2/rg bias: sketch LD scores cause errors-in-variables attenuation in regression. See Downstream regression impact for benchmarks. Use d ≥ 5000 for h2/rg, or --fast-f32 for exact regression.
- Incompatible with --gpu and --stochastic.
- Results are deterministic (fixed PRNG seed 42).
- Works with partitioned annotations (unlike --stochastic).
Sketch LD scores cause errors-in-variables attenuation bias in h2 regression: the h2 estimate is systematically low and the intercept is inflated. exact-f32 has zero downstream error. For full benchmarks, see docs/performance-deep-dive.md.
| Use case | Recommended mode | Speedup vs exact-f64 |
|---|---|---|
| Exact h2/rg needed | --fast-f32 | 1.84× |
| h2 within ~2% | --sketch 5000 --sketch-method countsketch | ~5× |
| h2 within ~4% | --sketch 1000 --sketch-method countsketch | ~9× |
| LD scores only (QC, visualization) | --sketch 200 --sketch-method countsketch | ~20× |
Pass --subsample N' to ldsc l2 to randomly select N' individuals from the reference
panel before LD computation. This reduces both BED I/O and GEMM cost proportionally —
runtime scales with N' instead of N.
This is an approximation. LD scores from a subsample are noisier than from the full
panel. For biobank-scale data (N > 10K), N' = 2,000–5,000 gives accurate LD scores for
common variants (MAF > 5%). The subsample is deterministic (seed 42) and sorted to preserve
sequential BED read order. Cannot be combined with --keep.
# Subsample 5000 individuals from a 50K panel
ldsc l2 --bfile biobank_50k --out out --subsample 5000
# Full panel (default)
ldsc l2 --bfile biobank_50k --out out

On HPC clusters with networked filesystems (GPFS, NFS, Lustre), BED file reads travel over
the network and can block for tens of milliseconds per chunk. --prefetch-bed reads the next
BED chunk on a background thread while the compute thread runs GEMM, overlapping I/O latency
with computation.
# Recommended for GPFS/NFS (HPC) — first run or cold filesystem cache
ldsc l2 --bfile … --out … --prefetch-bed
# Default (no flag) — always correct, optimal for local SSD
ldsc l2 --bfile … --out …

When to use it:
| Storage | Cache state | Use --prefetch-bed? |
|---|---|---|
| Local SSD | Warm (repeated runs) | No — regresses ~10% |
| Local SSD | Cold (first run) | Neutral to slight benefit |
| GPFS / NFS / Lustre | Cold (typical HPC job) | Yes — hides network latency |
| GPFS / NFS / Lustre | Warm (same job, 2nd pass) | Probably neutral |
Why it regresses on local SSD: With a warm OS page cache the BED file is already in RAM,
so read() is just a memcpy from kernel pages — no blocking I/O. The reader thread still
uses CPU for 2-bit genotype decoding and competes with the rayon GEMM thread pool. On a
6-core / 12-thread machine, adding a 13th thread preempts GEMM workers and slows the whole run.
Why it helps on GPFS: Each read() call blocks waiting for data over InfiniBand. While the
reader thread waits, the CPU is free for GEMM — genuine parallelism with no resource contention.
Benefit scales with GPFS cache miss rate (highest on the first job run after data is staged).
Note for PMACS users: The cluster runs CentOS 7 (kernel 3.10). io_uring (kernel 5.1+) is
not available; --prefetch-bed uses standard POSIX pread on a background std::thread.
Output is always bit-identical to the default path.
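The overlap pattern is a standard single-producer pipeline. A minimal Python sketch of the idea (illustrative only; the actual implementation is a Rust std::thread issuing pread calls, per the note above):

```python
import threading
import queue

def read_chunks(n):
    """Stand-in for sequential BED chunk reads (possibly slow on NFS/GPFS)."""
    for i in range(n):
        yield [i] * 4

def compute(chunk):
    """Stand-in for the per-chunk GEMM."""
    return sum(chunk)

def run_with_prefetch(n):
    q = queue.Queue(maxsize=1)  # the reader stays exactly one chunk ahead
    done = object()

    def reader():
        for chunk in read_chunks(n):
            q.put(chunk)        # blocks while the compute thread is behind
        q.put(done)

    threading.Thread(target=reader, daemon=True).start()
    total = 0
    while (chunk := q.get()) is not done:
        total += compute(chunk)
    return total

# Same result as a plain sequential loop: prefetching changes timing, not output.
print(run_with_prefetch(10))  # 180
```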
For HPC deployments with GPFS, Lustre, or other networked filesystems, --mmap uses
memory-mapped I/O instead of buffered reads. This provides:
- Zero-copy access for the fused CountSketch path (eliminates ~20GB of memcpy at biobank scale)
- OS-managed readahead via MADV_SEQUENTIAL: GPFS can prefetch across storage nodes in parallel
- Async prefetch via MADV_WILLNEED on the next chunk, replacing --prefetch-bed without thread contention
- No seek invalidation: unlike BufReader, mmap'd pages stay resident once faulted
# Recommended for GPFS/Lustre HPC
ldsc l2 --bfile … --out … --mmap
# Can combine with any mode
ldsc l2 --bfile … --out … --mmap --sketch 200 --sketch-method countsketch

Note: on local SSD with warm cache, --mmap regresses ~15% due to page-fault overhead.
Use the default (no flag) for local storage. --mmap is designed for networked filesystems
where it replaces both --prefetch-bed and buffered reads.
Build with --features gpu to enable CUDA-accelerated matrix multiplication via
CubeCL. Requires a CUDA toolkit at build time.
cargo build --release --features gpu
ldsc l2 --bfile … --out … --gpu # f32 compute (default)
ldsc l2 --bfile … --out … --gpu --gpu-f64 # native f64 compute
ldsc l2 --bfile … --out … --gpu --gpu-flex32 # half-precision compute, f32 accumulation

Precision options:
- --gpu: default f32 compute
- --gpu-f64: native f64 on the GPU (slower but numerically exact)
- --gpu-flex32: half-precision compute with f32 accumulation (fastest, slight accuracy loss)
- --gpu-tile-cols N: split large window matrices into VRAM-fitting tiles
At 1000G scale (n=2,490), the GPU path is transfer-bound and slower than CPU. GPU acceleration targets biobank-scale cohorts (n ≥ 50k), where each chunk's GEMM is large enough for compute to dominate PCIe transfer.
Default build (CPU only):
cargo build --release

Releases include Linux, macOS, and Windows archives that contain ldsc, LICENSE, and README.md.
# Linux (x86_64)
curl -L -o ldsc_linux-x86_64.tar.gz \
https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_linux-x86_64.tar.gz
tar -xzf ldsc_linux-x86_64.tar.gz
./ldsc --help
# macOS (Apple Silicon)
curl -L -o ldsc_macos-aarch64.tar.gz \
https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_macos-aarch64.tar.gz
tar -xzf ldsc_macos-aarch64.tar.gz
./ldsc --help
# Windows (x86_64)
# Download the zip from the release page and extract:
# https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_windows-x86_64.zip
Images are published to the GitHub Container Registry on every push to main and for each version tag.
docker pull ghcr.io/sharifhsn/ldsc:latest
# Run with local data mounted
docker run --rm \
-v /path/to/data:/data \
ghcr.io/sharifhsn/ldsc:latest \
h2 --h2 /data/trait.sumstats.gz \
--ref-ld-chr /data/eur_w_ld_chr/ \
--w-ld-chr /data/eur_w_ld_chr/ \
--out /data/results

Version tags (v1.2.3) produce :1.2.3, :1.2, and :latest. Pushes to main produce a :main
tag and a short-SHA tag (:sha-XXXXXXX).
Requires a Rust toolchain (≥ 1.85; edition 2024 features used).
cargo build --release
# binary: target/release/ldsc

The release profile sets opt-level = 3, lto = "thin", codegen-units = 1.
For a fully static binary that runs on any Linux (including CentOS 7 / RHEL 7 HPC clusters),
build with musl and the mimalloc feature:
# Requires the musl target (one-time setup)
rustup target add x86_64-unknown-linux-musl
cargo build --release --features mimalloc --target x86_64-unknown-linux-musl
strip target/x86_64-unknown-linux-musl/release/ldsc
# Copy the single binary to your cluster (no dependencies needed)

This produces a static-pie binary with zero shared library dependencies. The mimalloc
feature replaces musl's default allocator with Microsoft's mimalloc,
which eliminates the ~12% performance penalty musl's single-threaded allocator incurs under
rayon parallelism. On AWS EPYC benchmarks, musl + mimalloc is actually 4.5% faster than
the default glibc build (69.9s vs 73.2s on full 1000G).
Without --features mimalloc, a musl build works but is ~12% slower due to allocator contention.
The following flags are available for performance tuning:
- --rayon-threads N: Rayon thread count for jackknife in h2/rg.
- --polars-threads N: Polars thread count for CSV streaming in munge-sumstats.
- --prefetch-bed: background BED reader thread for l2. Beneficial on networked filesystems (GPFS/NFS); hurts on local SSD with warm cache. See BED prefetch for full guidance.
- --mmap: memory-mapped BED I/O for l2. Recommended for GPFS/Lustre HPC; replaces --prefetch-bed. See mmap for details.
Benchmarks on AWS c6a.4xlarge (AMD EPYC 7R13, 16 vCPU) using 1000 Genomes Phase 3
(1,664,852 SNPs, n = 2,490 individuals). Measured with hyperfine (1 warmup + 3 timed runs).
Static musl binary with mimalloc, AVX2+FMA target features.
| Mode | Full genome wall time | vs Python | Accuracy |
|---|---|---|---|
| Python | 25 min 49 s | 1.0× | reference |
| Rust f64 (default) | 41.1 s | ~38× | exact (max_abs_diff = 0) |
| Rust f32 (--fast-f32) | ~32 s | ~48× | max_abs_diff ≈ 0.008 |
| Rust stochastic (--stochastic 50) | 36.2 s | ~43× | ~7% median per-SNP error |
| Rust sketch (--sketch 200) | 25.4 s | ~61× | Pearson r ≈ 0.93 vs exact |
| Rust sketch (--sketch 50) | 15.3 s | ~101× | Pearson r ≈ 0.81 vs exact |
--ld-wind-kb 1000, --chunk-size 200. --sketch automatically enables f32 (bit-identical
to f64 for sketch, ~1.3× faster). The --fast-f32, --stochastic, and --sketch paths
trade exact parity for speed; the default f64 path is numerically identical to Python across
all 1,664,852 SNPs.
At biobank-scale N, GEMM dominates and larger panels expose more parallelism. Python runtime is extrapolated assuming linear O(N) scaling; Rust runtimes are hyperfine-measured (AWS EPYC 7R13, same hardware as above).
| Mode | Wall time | vs exact-f64 | Notes |
|---|---|---|---|
| exact-f64 | 665.9 s | 1.0× | numerically exact |
| exact-f32 (--fast-f32) | 361.9 s | 1.84× | halved BED bandwidth |
| --sketch 50 (Rademacher) | 118.4 s | 5.6× | r ≈ 0.81 |
| --sketch 200 (Rademacher) | 151.8 s | 4.4× | r ≈ 0.93 |
| --sketch 50 --sketch-method countsketch | 33.1 s | 20.1× | r ≈ 0.81 |
| --sketch 200 --sketch-method countsketch | 33.8 s | 19.7× | r ≈ 0.93 |
| --sketch 500 --sketch-method countsketch | 36.2 s | 18.4× | r ≈ 0.97 |
| --sketch 1000 --sketch-method countsketch | 39.7 s | 16.8× | r ≈ 0.99 |
| --sketch 5000 --sketch-method countsketch | ~55 s* | ~12×* | r ≈ 0.998, h2 ~2% low |
| --sketch 10000 --sketch-method countsketch | ~75 s* | ~9×* | r ≈ 0.999, h2 exact |
| --subsample 5000 --sketch 50 | 24.9 s | 26.7× | fastest (compounds two approximations) |
*d=5000 and d=10000 timings estimated from local scaling (1.7× and 2.3× vs d=200); AWS times will differ but the relative ordering is stable. Even at d=10000 (20% of N), CountSketch is still ~5× faster than exact-f32 and ~9× faster than exact-f64.
Fused CountSketch reads packed BED bytes and scatter-adds directly into the d×c sketch buffer, eliminating the N×c intermediate entirely. Cost is O(N×c) independent of d, so d=200 has the same speed as d=50 but much better accuracy. A 28 GB container is recommended for N=50K.
Speedup varies with window size (200k-SNP extract, same machine):
| Window | Python | Rust f64 | Speedup |
|---|---|---|---|
| --ld-wind-kb 100 | 44.2 s | 5.1 s | 8.7× |
| --ld-wind-kb 500 | 48.4 s | 6.2 s | 7.8× |
| --ld-wind-kb 1000 | 53.7 s | 8.6 s | 6.2× |
| --ld-wind-kb 2000 | 61.8 s | 12.9 s | 4.8× |
Runtime scales with both M (SNP count) and N (individual count). The ring-buffer algorithm keeps memory bounded by the LD window size, not the total SNP count.
Scaling with M (fixed N=2,490, exact-f64):
| M (SNPs) | Est. wall time (AWS EPYC 16v) | BED size | Peak memory |
|---|---|---|---|
| 1.66M | 41 s (measured) | 1 GB | ~100 MB |
| 5M | ~124 s | 3 GB | ~100 MB |
| 10M | ~247 s | 6 GB | ~100 MB |
| 50M | ~1,235 s (~21 min) | 30 GB | ~100 MB |
Scaling with N (fixed M=1.66M, full-genome):
| N (individuals) | Mode | Wall time | vs Python |
|---|---|---|---|
| 2,490 (1000G) | exact-f64 | 41.1 s (measured) | ~38× |
| 2,490 (1000G) | --sketch 50 | 15.3 s (measured) | ~101× |
| 50,000 (biobank) | exact-f64 | 665.9 s (measured) | — |
| 50,000 (biobank) | countsketch-200 | 33.8 s (measured) | — |
| 50,000 (biobank) | countsketch-5000 | ~55 s (estimated) | — |
| 50,000 (biobank) | countsketch-10000 | ~75 s (estimated) | — |
At biobank N, GEMM cost is O(N × w × c) per chunk; CountSketch reduces this to O(N×c) scatter-add,
independent of d. Rademacher sketch reduces inner dimension from N to d but is still GEMM-bound.
BED I/O is sequential and throughput-bound; --fast-f32 halves BED read bytes per chunk.
Additional UKBB I/O benchmarks on this machine (Apple M4, 10 CPU cores, 24 GB RAM, macOS 26.3 build 25D125). These highlight I/O-heavy workflows and the impact of the Rust pipeline’s faster parsing and joins.
Dataset: /Users/sharif/Code/ldsc/data/biomarkers-30600-both_sexes-irnt.sample8m.tsv
(~497 MB) for munge-sumstats; and /Users/sharif/Code/ldsc/data/UKBB.ALL.ldscore/UKBB.EUR.l2.ldscore.gz
for h2/rg (381,831 SNPs after merge).
Quick local checks for l2 default to a 50k SNP extract via scripts/bench_l2_py3_vs_rust.sh.
| Workflow | Rust | Python | Speedup |
|---|---|---|---|
| munge-sumstats | 3.74 s | 62.65 s | 16.75× |
| h2 | 0.90 s | 7.81 s | 8.68× |
| rg (two traits) | 2.93 s | 28.09 s | 9.59× |
Reference HPC hardware (Penn PMACS) for scaling tests:
- 19 Dell C6420 quad-node systems (76 compute nodes, 6,080 cores total, CentOS 7.8, 80 CPU cores per node with hyper-threading, 256 GB or 512 GB RAM per node, 56 Gb/s FDR or 100 Gb/s EDR InfiniBand to the filesystem, 10 Gb/s Ethernet).
- 1 Dell R940 big-memory system (1.5 TB RAM, 96 CPU cores, 10 Gb/s Ethernet, 100 Gb/s EDR InfiniBand).
- 2 GPU nodes (1× Nvidia Tesla P100, 512 GB RAM, 88 CPU cores, 10 Gb/s Ethernet, 100 Gb/s EDR InfiniBand).
- 4.2 PB IBM Spectrum Scale (GPFS) disk storage (2 tiers, no backup).
- 1.3 PB mirrored archive tape storage.
- LSF job scheduling system.
Python LDSC consists of three separate scripts; this crate consolidates them into subcommands of a
single ldsc binary:
| Python | Rust |
|---|---|
| python munge_sumstats.py --sumstats … --out … | ldsc munge-sumstats --sumstats … --out … |
| python ldsc.py --l2 --bfile … --out … | ldsc l2 --bfile … --out … |
| python ldsc.py --h2 … --ref-ld-chr … | ldsc h2 --h2 … --ref-ld-chr … |
| python ldsc.py --rg … --ref-ld-chr … | ldsc rg --rg … --ref-ld-chr … |
| python make_annot.py --bimfile … --bed-file … | ldsc make-annot --bimfile … --bed-file … |
| python ldsc.py --cts-bin … | ldsc cts-annot … |
Python's --l2 flag (LD score estimation mode) becomes the l2 subcommand. The --h2 and
--rg flags (regression modes) become h2 and rg subcommands.
Python flag names are supported directly.
- --maf in l2: default now matches Python (MAF prefilter before LD computation).
- --n-min default: when --n-min is 0, Rust now matches Python (90th percentile / 1.5).
- --yes-really: Rust warns when the LD window spans a whole chromosome and --yes-really is not set (Python errors).
- --chunksize: Python requires explicit chunking for large files; Rust uses Polars LazyFrame streaming and ignores chunk size for munge.
- --return-silly-things / --invert-anyway: accepted for CLI parity; Rust never clips results and always uses a least-squares solver (warnings emitted).
- --no-print-annot: only affects --cts-bin output; suppresses the .annot.gz file.
- --cts-bin workflow: supported directly by ldsc l2 (also available as a separate preprocessor via ldsc cts-annot).
The following Python flags are accepted for CLI parity but do not change behavior in Rust:
- h2/rg --return-silly-things
- h2/rg --invert-anyway
- Performance deep-dive — algorithmic complexity for each mode, scaling analysis for dense SNP panels (O(M² × N) with distance-based windows), downstream h2/rg regression accuracy by sketch dimension, and why Python is slow.
- Architecture & source map — module-level code map, key data-flow invariants, and dependency rationale.
Releases are cut with cargo-release and tagged as vX.Y.Z. Tag pushes trigger the release
workflow, which builds and uploads platform tarballs to GitHub Releases.
cargo release patch
cargo release patch --execute

Requires Docker with BuildKit (default since Docker 23):

docker build -t ldsc .

The multi-stage Dockerfile uses cargo-chef to
cache dependency compilation in a separate layer, so incremental rebuilds only recompile changed
source files.