LDZipMatrix is a suite of tools for compressing and randomly accessing large Linkage Disequilibrium (LD) matrices.
It is designed for workflows where LD matrices are too large to store uncompressed but still need fast, targeted access. Data are stored as flat files, so no database server is required and deployment stays simple and portable. Multiple LD metrics are supported (e.g., phased/unphased r and r-squared, delta, Dprime). Common use cases include:
- retrieving individual LD values between variant pairs (e.g., A vs. B)
- identifying variants in high LD with a given variant (above a specified threshold)
- extracting LD submatrices for downstream analyses (e.g., SuSiE, fine-mapping)
- generating inputs for LocusZoom plots, variant annotation, and related workflows
This repository includes three main components:
- C++ binary (ldzip) - Compresses plink2 LD matrices into the .ldzip format and supports related operations such as decompression, filtering, and concatenation across chromosomes.
- R package (LDZipMatrix) - Opens and queries .ldzip files efficiently from R with random access.
- Nextflow pipeline - Automates whole-genome .ldzip generation (including LD calculation using plink2) by running jobs on small chunks and combining the outputs.

- The snippet below compiles the ldzip C++ binary and places it in cpp/bin/ldzip.
- Use the ldzip binary only for compressing PLINK LD matrices.
- To read existing compressed data, install the R package LDZipMatrix instead.
- For more details on usage of the C++ binary, please see the C++ documentation.

git clone git@github.com:23andMe/LDZip.git
cd LDZip/
make cpp

- The snippet below builds and installs the R package LDZipMatrix.
- This package is required for random access to compressed matrices in R.
- You do not need to build the C++ binary to use the R package.
- Ensure that roxygen2 is installed for documentation and NAMESPACE generation.
- For more details on the R package, please see the R documentation.

git clone git@github.com:23andMe/LDZip.git
cd LDZip/
make r-package

The Nextflow pipeline automates creation of a whole-genome compressed LD archive by scattering work across chunks and concatenating the resulting outputs.
- For details on configuration and execution, see the Nextflow documentation.

- I already have a .ldzip file and want to query it. What should I do?
  Install the R package and use the R API to fetch LD values and neighboring linked variants. Go to: R Package
- I have a PLINK LD matrix and want to create a .ldzip file. What should I do?
  Build the C++ ldzip binary and run the compress command. Go to: C++ Binary
- I have PLINK pgen files and want to build whole-genome .ldzip outputs in a pipeline. What should I do?
  Use the Nextflow workflow. Go to: Nextflow
- I already have a .ldzip file and want to convert it back to my own format. What should I do?
  Build the C++ ldzip binary and run the decompress command. Go to: C++ Binary

If you find a bug or have a feature request, please open a GitHub Issue in this repository.
When reporting an issue, it is helpful to include:
- what you were trying to do
- the command or R code you ran
- your OS and compiler / R versions
- a minimal reproducible example, if possible
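
The OS and tool versions asked for above can be gathered in one step. The following is a minimal sketch, not part of this repository: the function names are made up for illustration, and it assumes g++ is the compiler used for the build (substitute clang++ or another compiler as appropriate). Tools that are not installed are reported as "not found" rather than causing an error.

```shell
#!/usr/bin/env bash
# Print environment details that are useful in a bug report.
# Illustrative helper only; it is not shipped with LDZip.

report_version() {
    # $1 = label, $2 = command name, remaining args = version flag(s)
    rv_label="$1"
    rv_cmd="$2"
    shift 2
    if command -v "$rv_cmd" >/dev/null 2>&1; then
        # Keep only the first line; most tools print the version there.
        printf '%s: %s\n' "$rv_label" "$("$rv_cmd" "$@" 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$rv_label"
    fi
}

collect_env_info() {
    printf 'OS: %s\n' "$(uname -srm)"
    # Assumption: g++ is the compiler used to build the C++ binary.
    report_version "C++ compiler" g++ --version
    report_version "make" make --version
    report_version "R" R --version
}

collect_env_info
```

The output can be pasted directly into the issue alongside the command or R code that triggered the problem.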
This tool is intended for trusted workflows and assumes that input .ldzip files are well-formed and generated by trusted sources. Do not run this tool on untrusted or user-supplied .ldzip files. The parser is optimized for performance and does not perform full defensive validation against maliciously crafted inputs.
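
One way to gain some confidence that a file really comes from a trusted source is to compare its checksum against a digest published by that source before opening it. The sketch below is an illustration under stated assumptions, not a feature of LDZip: it assumes the producer publishes a SHA-256 digest for each file, and the function names, file name, and digest file in the usage comment are hypothetical placeholders. A matching checksum only shows the file is unchanged since the digest was published; it does not validate the file's contents.

```shell
#!/usr/bin/env bash
# Verify a file against an expected SHA-256 digest before using it.
# Illustrative helper only; it is not shipped with LDZip.

sha256_of() {
    # Print the SHA-256 hex digest of file $1.
    # Uses sha256sum (GNU coreutils) if present, otherwise shasum (e.g., macOS).
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'
    else
        shasum -a 256 "$1" | awk '{print $1}'
    fi
}

verify_sha256() {
    # $1 = file path, $2 = expected hex digest (lowercase).
    # Returns 0 on an exact match, 1 on a mismatch, 2 if the file is missing.
    [ -f "$1" ] || { echo "missing file: $1" >&2; return 2; }
    [ "$(sha256_of "$1")" = "$2" ]
}

# Hypothetical usage: the file name and the digest file are placeholders.
# verify_sha256 chr1.ldzip "$(cat chr1.ldzip.sha256)" && echo "checksum OK"
```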
For questions or issues related to LDZipMatrix, please use the GitHub issue tracker or email:
sayantand@23andme.com