This pipeline is structured to generate polygenic risk scores with the PRS-CSx tool in the Pan-UK Biobank dataset and validate and test them in the All of Us Research Program dataset. The phenotype I worked with is Standing Height.
-
R and Python Programming languages
-
Python packages scipy and h5py
-
Download the GWAS summary statistics from the Pan-UK Biobank Phenotypes page
-
Run the script to add rsids to the sumstats,
00_add_rsids.sh -
The GWAS sumstats must be in the following format to run PRScsx:
SNP A1 A2 BETA SE
rs4970383 C A -0.0064 0.0090
rs4475691 C T -0.0145 0.0094
rs13302982 A G -0.0232 0.0199
...
-
If not already in the above format, run the script to reformat the PUKBB sumstats,
01_reformat_sumstats.R -
Clone the PRS-CSx GitHub page
git clone https://github.com/getian107/PRScsx.git
- Download and extract LD Reference Panel files using the following commands:
tar -zxvf ldblk_ukbb_afr.tar.gz
tar -zxvf ldblk_ukbb_amr.tar.gz
tar -zxvf ldblk_ukbb_eas.tar.gz
tar -zxvf ldblk_ukbb_eur.tar.gz
tar -zxvf ldblk_ukbb_sas.tar.gz
- Download the SNP information file and put it in the same folder containing the reference panels
-
Run the script
02_run_prscsx.shwith the correct population sample sizes, seed, and chromosome numbers -
You will need to edit the first part of the script that looks like this:
pop1=afr
n1=6556 # pop1 sample size
pop2=amr
n2=972 # pop2 sample size
pop3=sas
n3=8657 # pop3 sample size
pop4=eas
n4=2397 # pop4 sample size
pop5=eur
n5=419596 # pop5 sample size
phi=1e-04 # phi value
seed= 60556 # seed value
-
I ran the following populations: AFR, AMR, EAS, EUR, SAS
-
The following seeds: 60556, 2001, 4928
-
And the following phi values: 1e-02, 1e-04, 1e-06, 1e-08
-
I split the runs into separate chromosomes in the following way to be more efficient:
1, 8, 9, 16, 17
2, 7, 10, 15, 18
3, 6, 11, 14, 19, 22
4, 5, 12, 13, 20, 21
-
Files with the following format:
Standing_height_POP_pst_eff_a1_b0.5_phiX_chr#.txtare generated-
POPis replaced with each population that you ran -
Xis replaced with the phi value that you used -
The chromosome numbers correspond to the chromosome each file was generated from
-
-
These files include the following unlabelled columns: rsid, base position, A1, A2, posterior effect size estimate
-
This data is what is used to generate individual polygenic risk scores
-
They will be used as input for the All of Us Validation step
-
Upload the PRScsx output files to your Google bucket by running the following commands on the command line:
-
gcloud auth login -
Follow the steps, copy the url and then paste the verification code
-
Then run this command with the prefix of your PRS-CSx files in place of
Standingand the file path that you want to use in your Google Bucket:
gsutil cp Standing* gs://fc-secure-0b5d7336-c242-426a-8854-548d4ed254d8/data/PRScsx_Standing_Height_output -
-
This section of the pipeline is designed to be run on the All of Us Research Program server
-
03_retrieve_height_weight.htmlretrieves the height and weight data for the All of Us research subjects- This file is specific to the height phenotype, and will need to be edited to work with other phenotypes
-
04_calc_PRS_in_AoU.ipynbcalculates polygenic risk scores with the All of Us dataset using the weights generated with PRS-CSx-
You will need to change a few lines of this script to fit the phenotype and populations you are working with
-
Change the
Standing*prefix to the prefix of your PRS-CSx output in line 11 -
Change the populations in this script to the populations that you are using in the bash portion of the script:
for pop in AFR AMR EAS EUR SAS
-
-
-
05_part1_compare_PRS_to_observed_join_data.htmlretrieves the phenotypes and ancestry PCs of the All of Us dataset and joins this data with the calculated PRS's to create joint phenotype and PRS file:Pan-UKB_Standing_height_PRSCSx_phiX_in_AoU_w_pheno.txtwhere X is the phi value -
06_Validation.txtto output Validation weights- Use this command with the pheno file created in the previous step, the name of your desired validation weights file, and the populuations you wish to run the script with
Rscript 06_Validation.txt -f Pan-UKB_Standing_height_PRSCSx_phi1e-02_in_AoU_w_pheno.txt -v Validation_output_weights.txt -p afr,amr,eas,eur,sas -
07_Testing.txtto output Testing Adjusted R Squared Values for each pop- Use this command with the validation weights file created int eh previous step and the list of populations that you wish to test in
Rscript 07_Testing.txt -v Validation_output_weights.txt -p afr,amr,eas,eur,sas -o Adjusted_R_squared_phiX_SEED.txt- Replace
Xwith the correct phi value andSEEDwith the correct seed value