In many fields of genomics, most of the information can be stored as matrices. The most striking example is SNP data, which can be stored as matrices with thousands to hundreds of thousands of rows (samples) and hundreds of thousands to tens of millions of columns (SNPs) (Bycroft et al. 2017). This results in datasets of gigabytes to terabytes of data.
Other types of genomic data, such as proteomics or expression data, are also stored as matrices that can be larger than the available memory.
To handle such large data in R, we can use memory-mapping: large matrices are stored on disk rather than in RAM, and only the parts being accessed are loaded into memory. This has been available in R for several years thanks to the package bigmemory (Kane, Emerson, and Weston 2013).
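As a minimal sketch of this idea with bigmemory (the dimensions and file names below are purely illustrative):

```r
library(bigmemory)

# Create a matrix backed by files on disk; only the chunks
# you access are actually loaded into RAM.
X <- filebacked.big.matrix(
  nrow = 1e4, ncol = 1e3, type = "double",
  backingfile = "example.bin",
  descriptorfile = "example.desc"
)

# Read and write subsets as with a standard R matrix
X[1:5, 1:5] <- rnorm(25)
X[1:5, 1:5]
```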
More recently, two packages that build on the same principle as bigmemory have been developed: bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). Package bigstatsr implements many statistical tools for several types of Filebacked Big Matrices (FBMs), making it usable for any type of genomic data that can be encoded as a matrix. The statistical tools in bigstatsr include implementations of multivariate sparse linear models, Principal Component Analysis (PCA), matrix operations, and numerical summaries. Package bigsnpr implements algorithms that are specific to the analysis of SNP arrays, building on features already implemented in bigstatsr.
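For instance, here is a minimal sketch of creating an FBM and computing some of these statistics on it (the dimensions and number of components are illustrative, not recommendations):

```r
library(bigstatsr)

# A Filebacked Big Matrix lives on disk, not in RAM
X <- FBM(nrow = 1e3, ncol = 500, init = rnorm(1e3 * 500))

# Numerical summaries (column sums and variances),
# computed without loading the whole matrix in memory
stats <- big_colstats(X)
head(stats)

# Partial SVD (the basis for PCA), computed on the FBM
svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 10)
```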
In this short tutorial, we illustrate the potential benefits of using memory-mapping, via bigstatsr and bigsnpr, instead of standard in-memory R matrices.
You can find the first version of this tutorial there.