We introduced Biomarker expression Level Imputation using Summary-level Statistics (BLISS), a novel method designed for developing protein imputation models using summary-level pQTL data. This software supports our GCB Hub and facilitates researchers to conduct their own proteome-wide association studies (PWAS).
Simply clone from our GitHub repo to get the up-to-date version:
git clone https://github.com/gcb-hub/BLISS.gitDownload our pre-constructed model and LD matrix files from our 🤗 Hugging Face repo and place them in the appropriate directory:
cd BLISS
# Download model
mkdir model && cd model
wget https://huggingface.co/datasets/chongwulab/BLISS-models/resolve/main/UKBPPP_EUR.zip
unzip UKBPPP_EUR.zip
cd ..
# Download example GWAS data
wget https://gcbhub.s3.us-east-2.amazonaws.com/example.zip && unzip example.zipFor more models, please refer to our 🤗 Hugging Face repo page and get the needed download links.
We strongly advise using quality control tools before running the main analysis:
- Use our QC tool: Run
APSS.Rfor interactive quality control - Check allele specifications: Ensure A1 and A2 are correctly specified
- Verify Z-score calculations: If Z-scores are missing, they can be calculated as Z = BETA/SE
- Handle missing data: Remove or impute missing values appropriately
If your GWAS summary statistics include BETA and SE columns but lack
# Z-score calculation
Z = BETA / SE
# For odds ratios, first convert to log scale
BETA = log(OR)
Z = BETA / SETips: let APSS.R handle this, if you are not 100% sure what columns should be used to compute
BLISS performs analyses by combining user-specified protein expression prediction models with GWAS summary statistics to identify significant protein-trait associations. We offer multiple pre-built sets of protein imputation models tailored for various proteomic platforms and ancestries. Users only need to provide GWAS summary data and specify the imputation models to be used.
GWAS summary statistics must be formatted as a flat file with the following mandatory columns:
- CHR - chromosome ID
- SNP - SNP identifier (rsID)
- A1 - effect allele (effect/risk/coding allele)
- A2 - other allele (non-effect allele)
- Z - Z-scores, sign with respect to A1
We also strongly recommend including one optional column:
- N - sample size (discovery stage sample size, not maximum sample size)
Note: Column names are case-sensitive. Make sure your input file uses the exact column names listed above.
We strongly advise using quality control tools before running the main analysis:
- Use our QC tool: Run
APSS.Rfor interactive quality control - Check allele specifications: Ensure A1 and A2 are correctly specified
- Verify Z-score calculations: If Z-scores are missing, they can be calculated as Z = BETA/SE
- Handle missing data: Remove or impute missing values appropriately
APSS.R offers an option to keep only HapMap3 SNPs, we DO NOT recommend you using that option. Since BLISS models were constructed using more than just HapMap3 SNPs, not restricting the analysis to only HapMap3 SNPs is generally the better approach.
Generally, rename the mandatory column exactly as stated above and use Y/N to respond to the prompted questions as needed. For non-mandatory columns, you can name them as you like, as long as the names are reasonable.
# First source the function
source("APSS.R")
# An example of using APSS interactively
APSS(
directory.working = YOUR_WORKING_DIR,
filename = "RandomGWAS_chr12.sumstats",
do.return = FALSE,
BIG = 2
)| Parameter | Type | Description | Default |
|---|---|---|---|
directory.working |
string | Path to the GWAS summary statistics file. | Required |
filename |
string | Name of the GWAS summary statistics file. | Required |
do.return |
boolean | Return the processed GWAS summary statistics file or not? | FALSE |
BIG |
numeric | If the GWAS summary statistics is bigger than BIGGb, APSS will do an exploratory read first to save memory. |
2 |
To identify protein-trait associations, use the main analysis script. Here's an example using an example GWAS data (included in our S3 bucket, all numbers randomly generated):
Rscript BLISS_Association.R \
--path_sumstats example/RandomGWAS_chr12.sumstats \
--n 300000 \
--model UKBPPP_EUR \
--chr 12 \
--output_dir results/ \
--output_name result_RandomGWAS_chr12.txt \
--output_augmented TRUE \
--output_twas_fusion FALSE \
--clean_slate FALSE \| Parameter | Type | Description | Default |
|---|---|---|---|
path_sumstats |
string | Path to the processed GWAS summary statistics file | Required |
n |
numeric | Sample size of GWAS (if not provided in N column) | Optional |
model |
string | Protein prediction model to use | UKBPPP_EUR |
chr |
numeric | Specific chromosome to analyze (1-22) | All chromosomes |
output_dir |
string | Directory to store output files | Current directory |
output_name |
string | Name of the output file | PWAS_results |
output_augmented |
logical | Output augmented results with additional annotations | FALSE |
output_twas_fusion |
logical | TWAS-FUSION-style output | FALSE |
clean_slate |
logical | Discard unfinished result and start fresh? | FALSE |
The output file contains the following columns:
| Column | Description |
|---|---|
protein |
Protein name/identifier |
beta |
Estimated effect size (recommended for interpretation) |
se |
Standard error of the effect size |
p |
P-value for the protein-trait association |
model_r2 |
Predictive R² of the protein expression model |
n_snps |
Number of SNPs used in the protein prediction model |
MHC |
Indicator for MHC region (major histocompatibility complex) |
When --output_augmented TRUE is specified, additional columns are included:
filename: Model filename usedchromosome: Chromosome locationstart: Gene start positionend: Gene end positionq: FDR-adjusted p-values
We provide protein expression imputation models across various platforms and ancestries:
| Model Name | Platform | Method | Ancestry | Training Sample Size | Number of proteins | Number of proteins with cis-h2 > 0.01 |
|---|---|---|---|---|---|---|
| UKBPPP_EUR | Olink | BLISS | European | 49,341 | 2,808 | 1,407 |
| UKBPPP_AFR | Olink | BLISS | African | 1,181 | 2,790 | 916 |
| UKBPPP_ASN | Olink | BLISS | Asian | 923 | 2,791 | 996 |
| deCODE | SomaScan | BLISS | European (Icelandic) | 35,892 | 4,711 | 1,183 |
| ARIC_EA | SomaScan | BLISS | European American | 7,213 | 4,439 | 1,214 |
| ARIC_AA | SomaScan | BLISS | African American | 1,181 | 4,435 | 1,352 |
Recommendation: Although we are providing you with results for all the available proteins, we recommend using models with estimated heritability exceeding 0.01 as analyzed in our manuscript.
The script automatically handles interrupted runs:
- Creates
.finishedfiles upon completion - Can resume from interruption points
- Use
--clean_slate TRUEto start fresh
Q: How do I calculate Z-scores from my GWAS summary statistics?
A: Use our APSS.R script for preprocessing, or calculate manually: Z = BETA/SE. For odds ratios, first take the natural logarithm: BETA = log(OR), then Z = BETA/SE.
Q: What if I get "model not found" errors?
A: Ensure model files are properly downloaded and placed in the model/ directory. Each model must have a corresponding .manifest file.
Q: Why do I get warnings about replacement elements?
A: This typically indicates improperly formatted GWAS summary data. Use APSS.R to preprocess your data for robust results.
Q: Can I analyze a specific chromosome only?
A: Yes, use the --chr parameter to specify a single chromosome (1-22). Omit this parameter to analyze all chromosomes.
Q: How do I interpret the MHC column?
A: The MHC column indicates whether the protein/gene is located in the major histocompatibility complex region, which may require special consideration due to its unique linkage disequilibrium patterns.
Q: Can I extract the LD matrices?
A: Load the model file of interest and use the recover_corr_matrix function from BLISS_Association.R.
load("PROTEIN_OF_INTEREST.RData")
matrix.LD <- recover_corr_matrix(matrix.LD)Common Issues:
- File path errors: Ensure all file paths are correct and files exist
- Memory issues: For large datasets, consider processing chromosome by chromosome
- Column naming: Verify that input files use exact column names (case-sensitive)
- Sample size: Provide sample size via
--nif not available in the N column
If you find our resources helpful, please cite:
Large-scale imputation models for multi-ancestry proteome-wide association analysis, bioRxiv
This work is licensed under the CC BY-NC-ND 4.0 DEED.
For questions or issues:
- GitHub Issues: [Recommended] Submit issues on our GitHub repository
- Email: Contact us directly
We typically respond within a few hours.
-
10-03-2023:
First version release -
07-27-2025:
Added multi-ancestry support with UKBPPP_AFR and UKBPPP_ASN models.
Enhanced parameter structure with augmented output options and TWAS-FUSION compatibility.
Added deCODE (SomaScan) and ARIC models for expanded platform coverage.
Documentation update - Added LD matrix extraction guide, improved FAQ section, and clarified QC recommendations -
10-01-2025:
Fixed a minor issue where an undefined object occurred in some cases.
