This repository contains a complete computational workflow for comparing genomic selection (GS) performance between whole-genome sequencing (WGS) and AgriPlex genotyping platforms in the Sorghum Association Panel (SAP).
- Overview
- Repository Structure
- Prerequisites
- Installation
- Data Acquisition
- Workflow
- Script Reference
- Output Files
- Citation
This analysis pipeline performs:
- VCF preprocessing - Convert VCF files to PLINK binary format with quality control
- LD pruning - Generate LD-pruned marker sets for PCA and kinship calculation
- Principal Component Analysis (PCA) - Population structure visualization for WGS and AgriPlex panels
- Procrustes Analysis - Alignment comparison between WGS and AgriPlex PCA spaces
- Kinship Matrix Construction - VanRaden genomic relationship matrices (G-matrices)
- Genomic Selection Cross-Validation - k-fold CV using rrBLUP and ASReml-R
- GWAS - Genome-wide association studies with PCA covariates
- Visualization - Manhattan plots, QQ plots, confidence ellipses, violin plots, and more
Key Finding: The analysis demonstrates that the AgriPlex platform (1,942 markers) achieves comparable genomic prediction accuracy to WGS (~250K markers) for the traits analyzed in the SAP panel.
All scripts resolve paths relative to the project root (via SBI_AGRIPLEX_ROOT or auto-detection), and assume the following standardized structure:
.
├── data/
│ ├── input/ # Input data (VCF files, phenotypes, derived genotype formats)
│ │ ├── agp/ # AgriPlex genotyping platform data
│ │ │ ├── vcf/ # Required: Validated.SAP.VCF
│ │ │ ├── raw/
│ │ │ │ ├── full/ # Generated: SAP_genotypes_agp_full.raw
│ │ │ │ └── filtered/ # Generated: SAP_genotypes_agp_filtered_by_wgs.*
│ │ │ ├── bfiles/
│ │ │ │ ├── full/ # Generated: SAP_genotypes_agp_full.{bed,bim,fam}
│ │ │ │ └── filtered/ # Generated: SAP_genotypes_agp_filtered_by_wgs.{bed,bim,fam}
│ │ │ └── kinship/ # Generated: VanRaden G-matrices for AGP full
│ │ │ └── SAP_genotypes_agp_full_G_matrix.rds
│ │ ├── wgs/ # Whole-genome sequencing data (Boatwright)
│ │ │ ├── vcf/ # Required: SAP.raw.vcf.gz
│ │ │ ├── raw/
│ │ │ │ ├── full/ # Generated: SAP_genotypes_wgs_full.raw
│ │ │ │ ├── pruned/ # Generated: SAP_genotypes_wgs_pruned02.raw + pruning lists
│ │ │ │ └── filtered/ # Generated: SAP_genotypes_wgs_filtered_by_agp.*
│ │ │ ├── bfiles/
│ │ │ │ ├── full/ # Generated: SAP_genotypes_wgs_full.{bed,bim,fam}
│ │ │ │ ├── pruned/ # Generated: SAP_genotypes_wgs_pruned02.{bed,bim,fam}
│ │ │ │ └── filtered/ # Generated: SAP_genotypes_wgs_filtered_by_agp.{bed,bim,fam}
│ │ │ └── kinship/ # Generated: VanRaden G-matrices for WGS pruned02
│ │ │ └── SAP_genotypes_wgs_pruned02_G_matrix.rds
│ │ └── pheno/ # Phenotype data
│ │ ├── SAP_Field_Data_Set.csv # Required: phenotype source
│ │ └── pheno_plink_format.tsv # Generated: PLINK phenotype table for GWAS
│ └── output/ # Analysis outputs
│ ├── pca/ # PCA eigenvec/eigenval + PCA-derived figures
│ │ ├── *.eigenvec
│ │ ├── *.eigenval
│ │ ├── plots/
│ │ └── procrustes/
│ ├── gwas/ # PLINK2 GWAS outputs + figures
│ │ ├── *.glm.linear
│ │ └── plots/
│ ├── gs/ # Genomic selection (k-fold CV) outputs + figures
│ │ ├── sap_gs_cv_results.csv
│ │ └── plots/
│ └── kinship_eval/ # Kinship evaluation (correlation/Bland-Altman/etc.)
│ ├── kinship_bland_altman.png
│ └── square/
├── scripts/ # Analysis pipeline scripts
│ ├── 00_env_setup/ # Environment setup (conda, optional)
│ ├── 01_plink_pheno/ # Phenotype formatting
│ ├── 02_VCF2RAW/ # VCF to PLINK .raw conversion
│ ├── 03_VCF2Bfiles/ # VCF to PLINK binary files
│ ├── 04_filter_panels/ # Cross-platform marker filtering (QC)
│ ├── 05_calc_eigenvalues_PCA/ # Principal component analysis
│ ├── 06_GWAS/ # Genome-wide association studies
│ ├── 07_kinship/ # Kinship matrix construction
│ ├── 08_iterate_kfold_GP/ # Genomic selection cross-validation
│ └── 09_figures/ # Visualization scripts
├── workflow/ # Snakemake workflow orchestration (RECOMMENDED)
│ ├── Snakefile # Main workflow definition
│ ├── config/
│ │ ├── config.yaml # Production configuration
│ │ └── config_test.yaml # Test/development configuration
│ ├── rules/ # Modular Snakemake rules
│ │ ├── 01_phenotype_prep.smk
│ │ ├── 02_genotype_conversion.smk
│ │ ├── 03_filtering.smk
│ │ ├── 04_pca.smk
│ │ ├── 05_gwas.smk
│ │ ├── 06_kinship.smk
│ │ ├── 07_genomic_selection.smk
│ │ └── 08_visualization.smk
│ ├── README.md # Snakemake workflow documentation
│ └── QUICKSTART.md # Quick reference guide
└── README.md # This file
Important Notes:
- VanRaden matrices live under
data/input/*/kinship/by design: downstream GS and evaluation scripts treat these as canonical inputs (AGP full and WGS pruned02)
- PLINK2 (v2.0+) - Genetic data management and analysis
- Option 1 (Recommended): Use the provided conda environment (see Installation)
- Option 2: Manual installation from https://www.cog-genomics.org/plink/2.0/
- Used for: VCF conversion, LD pruning, PCA, GWAS
Core Packages (available via CRAN):
install.packages(c(
"tidyverse", # Data manipulation and visualization
"data.table", # Fast data reading/writing
"ggplot2", # Plotting
"ggforce", # Extended ggplot2 geoms
"magrittr", # Pipe operators
"furrr", # Parallel processing
"future", # Parallel backend
"progressr" # Progress bars
))Genetic Analysis Packages:
install.packages(c(
"rrBLUP", # Ridge regression BLUP (genomic selection)
"vcfR", # VCF file handling
"genio", # PLINK file I/O
"caret", # Cross-validation framework
"matrixcalc", # Matrix operations
"Matrix", # Sparse matrices
"vegan", # Procrustes analysis
"qqman" # Manhattan and QQ plots
))ASReml-R (OPTIONAL - Commercial License Required):
# ASReml-R provides more advanced mixed model capabilities
# If you don't have an ASReml license, the pipeline will use rrBLUP exclusively
# rrBLUP is a free, open-source alternative that performs comparably for most traitsImportant Note on ASReml:
- ASReml-R is a commercial package requiring a paid license (~$600-1000/year for academic use)
- The pipeline works perfectly fine using only rrBLUP (free, open-source)
- Both methods are implemented in
scripts/08_iterate_kfold_GP/pred_SAP.r- Results show rrBLUP and ASReml achieve nearly identical prediction accuracies
- If ASReml is unavailable, the script will simply skip ASReml results and continue with rrBLUP
If you have an ASReml license, place your asreml.lic file in your home directory:
cp /path/to/your/asreml.lic ~/asreml.licThe genomic selection script (pred_SAP.r) will automatically detect and use it. If the license is not found, only rrBLUP results will be generated.
git clone <repository-url>
cd sorghum_validate_panelFor explicit control over the project root directory:
export SBI_AGRIPLEX_ROOT="/path/to/sorghum_validate_panel"If not set, scripts will automatically detect the project root by searching for the data/ directory.
If you don't already have PLINK2 installed, use the provided conda environment:
The repository includes a conda environment specification that installs PLINK2 and all required command-line tools.
# Create the conda environment
bash scripts/00_env_setup/setup_conda_env.sh
# Activate the environment
conda activate sorghum_gs_validation-env
# Verify PLINK2 is installed
plink2 --versionNotes:
- The script will create or update the
sorghum_gs_validation-envenvironment - Use
--forceflag to recreate from scratch:bash scripts/00_env_setup/setup_conda_env.sh --force - You'll need to activate this environment (
conda activate sorghum_gs_validation-env) whenever running the shell scripts - This environment includes all required tools: PLINK2, Snakemake (≥7.0), wget, and system utilities
Included tools:
plink2- Main genotyping analysis toolwget- For downloading data filesopensslandca-certificates- For secure connections
If you prefer not to use conda, manually install PLINK2:
# Download and install from:
# https://www.cog-genomics.org/plink/2.0/
# Verify installation
plink2 --version# Launch R and install required packages
install.packages(c("tidyverse", "data.table", "ggforce", "rrBLUP",
"caret", "vcfR", "genio", "matrixcalc", "Matrix",
"furrr", "future", "progressr", "vegan", "qqman"))Important: The following data files are NOT included in this repository but are required to run the analysis pipeline. All collaborators should have access to these files. Before running any scripts, you must place these files in the correct locations:
-
WGS VCF File:
SAP.raw.vcf.gz- Place in:
data/input/wgs/vcf/SAP.raw.vcf.gz - Description: Sorghum Association Panel whole-genome sequencing data
- Important Note: The
prep_genotypes.shscript (Step 1 of workflow) will attempt to auto-download this file from ENA if not found. However, if the download fails, you must manually place the file in this location.
- Place in:
-
AgriPlex VCF File:
Validated.SAP.VCF- Place in:
data/input/agp/vcf/Validated.SAP.VCF - Description: AgriPlex genotyping panel data
- Place in:
-
Phenotype Data:
SAP_Field_Data_Set.csv- Place in:
data/input/pheno/SAP_Field_Data_Set.csv - Description: Field trial phenotypic measurements
- Note Any SAP file with phenomic info can be used here, but trait columns may need to be updated to match the script (or vice-versa)
- Place in:
# 1. Create the required directory structure (if not already present)
mkdir -p data/input/wgs/vcf
mkdir -p data/input/wgs/raw/full
mkdir -p data/input/wgs/raw/pruned
mkdir -p data/input/wgs/raw/filtered
mkdir -p data/input/wgs/bfiles/full
mkdir -p data/input/wgs/bfiles/pruned
mkdir -p data/input/wgs/bfiles/filtered
mkdir -p data/input/agp/vcf
mkdir -p data/input/agp/raw/full
mkdir -p data/input/agp/raw/filtered
mkdir -p data/input/agp/bfiles/full
mkdir -p data/input/agp/bfiles/filtered
mkdir -p data/input/pheno
# 2. Copy the files from your shared location to the correct directories
# Example (adjust paths based on where you received the files):
cp /path/to/your/SAP.raw.vcf.gz data/input/wgs/raw/
cp /path/to/your/Validated.SAP.VCF data/input/agp/vcf/
cp /path/to/your/SAP_Field_Data_Set.csv data/input/pheno/
# 3. Verify the files are in place
ls -lh data/input/wgs/raw/SAP.raw.vcf.gz
ls -lh data/input/agp/vcf/Validated.SAP.VCF
ls -lh data/input/pheno/SAP_Field_Data_Set.csvOnce these files are in place, you can proceed with the Workflow steps below.
The Sorghum Association Panel (SAP) WGS data (Boatwright) can be downloaded from the European Nucleotide Archive:
Project: PRJEB51985
Direct Link: https://www.ebi.ac.uk/ena/browser/view/PRJEB51985
# Download SAP WGS VCF (example using wget)
wget -O data/input/wgs/raw/SAP.raw.vcf.gz \
ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ125/ERZ12588732/SAP.raw.vcf.gzNote: The prep_genotypes.sh script (used in Workflow Step 1) includes an interactive prompt to automatically download this file if not found. If the auto-download fails, you can manually download it using the command above.
This pipeline can be executed in two ways:
- Snakemake Workflow (RECOMMENDED) - Automated, reproducible, with automatic dependency management
- Manual Step-by-Step - Individual script execution with manual dependency tracking
The workflow/ directory contains a complete Snakemake pipeline that orchestrates all analysis steps automatically. This is the recommended approach for most users.
- Automatic Checkpointing: Resume from failures without re-running completed steps
- Parallel Execution: Independent tasks (WGS/AGP processing) run simultaneously
- Dependency Management: Ensures correct execution order automatically
- Configuration-Driven: Centralized parameter management
- Reproducibility: Single command executes entire pipeline consistently
# 1. Activate conda environment (includes Snakemake ≥7.0)
conda activate sorghum_gs_validation-env
# 2. Verify Snakemake installation
snakemake --version
# 3. Preview what will run (dry run)
snakemake --dry-run --cores 1
# 4. Execute full pipeline
snakemake --cores 16
# 5. Test run (skip intensive genomic selection)
snakemake test_run --cores 8Edit workflow/config/config.yaml to customize analysis parameters:
# Enable/disable analysis components
analysis:
run_matched_panels: true # Cross-platform QC filtering
run_gwas: true # Association studies
run_genomic_selection: true # CV cross-validation (2-6 hours)
use_asreml: true # Use ASReml if license available
# Computational resources
resources:
plink_threads: 16 # Threads for PLINK operations
r_cores_fraction: 0.8 # Use 80% of cores for R parallel tasks
# Genomic selection parameters
genomic_selection:
num_iters: 50 # CV iterations (use 5 for testing)
cv_folds: 5 # Folds per iteration# Run only GWAS (skip genomic selection)
snakemake gwas_only --cores 16
# Run only genomic selection
snakemake genomic_selection_only --cores 16
# Generate all figures only
snakemake all_figures --cores 8
# Resume after failure (automatic checkpointing)
snakemake --cores 16 --rerun-incomplete
# Visualize workflow as DAG
snakemake --dag | dot -Tpng > workflow_dag.png
# Check execution plan with details
snakemake --dry-run --printshellcmdsInput Data (VCF + Phenotypes)
├─> Phenotype Preparation (PLINK format)
├─> VCF to RAW Conversion (WGS + AGP)
│ └─> VCF to Binary Files (PLINK .bed/.bim/.fam)
│ ├─> LD Pruning (WGS only, r² ≤ 0.2)
│ └─> Optional: Cross-Platform Filtering (QC)
├─> Principal Component Analysis (all panels)
│ ├─> GWAS (optional, with PCA covariates)
│ └─> PCA Visualizations (plots + Procrustes)
├─> Kinship Matrix Construction (VanRaden G-matrices)
│ ├─> Genomic Selection CV (optional, rrBLUP + ASReml)
│ ├─> Kinship Evaluation (Bland-Altman, correlation)
│ └─> GS Visualizations (ellipses, violins)
└─> All Publication Figures
For comprehensive Snakemake workflow documentation, see:
workflow/README.md- Complete guide with troubleshooting and advanced usageworkflow/QUICKSTART.md- Quick reference for common tasksworkflow/config/config.yaml- All configurable parameters with descriptions
If you prefer manual execution or need to run individual steps independently, follow this sequential workflow. Ensure you've completed Installation (PLINK setup) and Data Acquisition first.
Before running analyses, prepare the phenotype file for PLINK/GWAS compatibility.
Purpose: Convert the SAP field data CSV to PLINK-compatible TSV format with averaged replicate values.
Rscript scripts/01_plink_pheno/prepare_pheno_plink.RInput:
data/input/pheno/SAP_Field_Data_Set.csv
Output:
data/input/pheno/pheno_plink_format.tsv- Averaged phenotypes in PLINK format (IID + trait columns)
Traits Processed:
- Days_to_flowering_DTF
- Height_inches
- Exertion_inches
- Seed_weight_grams_100_seeds
Note: This step is required before running GWAS (Step 6). The script averages replicate measurements for each accession.
This section prepares raw genotype data (VCF files) into quality-controlled formats for downstream analyses.
Purpose: Convert VCF files to additive genotype matrices (0/1/2 format). For WGS data, generates both unpruned and LD-pruned versions (MAF ≥ 0.05, missingness ≤ 10% applied before pruning). For AgriPlex, generates unpruned only (panel is pre-designed). The pruned SNP list (.prune.in) produced here is reused by downstream scripts (e.g. prune_wgs_bfiles.sh) to build binary files without re-running LD pruning.
# Process both WGS and AgriPlex VCF files
bash scripts/02_VCF2RAW/prep_genotypes.shHelpful Feature: If the WGS VCF file (
SAP.raw.vcf.gz) is not found, this script will prompt you to automatically download it from ENA. If the download fails, you'll need to manually obtain the file (see Data Acquisition).
Outputs:
WGS Unpruned (QC filtered only):
data/input/wgs/raw/full/SAP_genotypes_wgs_full.raw
WGS LD Pruned (r² ≤ 0.2):
data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02.rawdata/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.indata/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.out
AgriPlex (if VCF provided):
data/input/agp/raw/full/SAP_genotypes_agp_full.raw
Purpose: Convert raw VCF files to PLINK binary format. This step builds the core genomic resources needed for all downstream analyses (PCA, GWAS, genomic selection). The prune_wgs_bfiles.sh script reuses the .prune.in list from step 1 (no redundant LD pruning).
# Build full PLINK binary files from VCFs
bash scripts/03_VCF2Bfiles/build_full_wgs_agp_bfiles.sh
# Build pruned WGS bfiles (uses .prune.in from step 1)
bash scripts/03_VCF2Bfiles/prune_wgs_bfiles.shOutputs:
Full WGS and AGP Bfiles:
data/input/wgs/bfiles/full/SAP_genotypes_wgs_full.{bed,bim,fam}(PLINK binary files)data/input/agp/bfiles/full/SAP_genotypes_agp_full.{bed,bim,fam}
LD-Pruned WGS Bfiles (r² ≤ 0.2, MAF ≥ 0.05, missingness ≤ 10%):
data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02.{bed,bim,fam}data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.in(kept SNPs)data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.out(removed SNPs)
Purpose: Perform bidirectional filtering to ensure both WGS and AgriPlex panels contain only the intersection of markers present in both platforms, enabling direct comparison and quality control.
# Bidirectionally filter WGS and AgriPlex panels
bash scripts/04_filter_panels/filter_wgs_by_agp.shInputs:
- Requires the AgriPlex full
.rawatdata/input/agp/raw/full/SAP_genotypes_agp_full.raw(generated in Step 1). - If your AGP
.rawlives elsewhere, pass it as the 3rd argument:
# bash filter_wgs_by_agp.sh [wgs_out_prefix] [agp_out_prefix] [agp_raw] [wgs_vcf]
bash scripts/04_filter_panels/filter_wgs_by_agp.sh \
data/input/wgs/raw/filtered/SAP_genotypes_wgs_filtered_by_agp \
data/input/agp/raw/filtered/SAP_genotypes_agp_filtered_by_wgs \
path/to/your_agp.rawProcess:
- Extract markers from AgriPlex panel
- Filter WGS VCF to markers present in AgriPlex
- Extract markers successfully found in WGS
- Filter AgriPlex to markers that exist in filtered WGS
Outputs:
data/input/wgs/raw/filtered/SAP_genotypes_wgs_filtered_by_agp.raw- WGS filtered to AgriPlex markersdata/input/agp/raw/filtered/SAP_genotypes_agp_filtered_by_wgs.raw- AgriPlex filtered to WGS markersdata/input/wgs/raw/filtered/agp_markers.list- Original AgriPlex marker listdata/input/wgs/raw/filtered/wgs_markers.list- Markers found in WGS (intersection)
Purpose: Build PLINK bfiles for both WGS and AgriPlex using the matched marker list produced in Step 3.
# Build matched/filtered panels (same markers, overlapping samples)
bash scripts/03_VCF2Bfiles/build_matched_wgs_agp_bfiles.shOutputs:
data/input/wgs/bfiles/filtered/SAP_genotypes_wgs_filtered_by_agp.{bed,bim,fam}data/input/agp/bfiles/filtered/SAP_genotypes_agp_filtered_by_wgs.{bed,bim,fam}
Purpose: Compute population structure for both marker panels.
# Run PCA on all datasets (WGS unpruned, WGS pruned, AGP full, filtered panels)
bash scripts/05_calc_eigenvalues_PCA/pca.shOutputs:
data/output/PCA/wgs_full_pca.eigenvecand.eigenvaldata/output/PCA/wgs_pruned_pca.eigenvecand.eigenvaldata/output/PCA/agp_pca.eigenvecand.eigenvaldata/output/PCA/wgs_filtered_overlap_pca.*data/output/PCA/agp_filtered_overlap_pca.*
Note: PCA outputs use uppercase PCA/ directory for compatibility with legacy scripts.
Visualize PCA Results:
Rscript scripts/09_figures/pca_SAP.r
Rscript scripts/09_figures/pca_plink2.rPurpose: Identify marker-trait associations while controlling for population structure.
# Run GWAS for all traits using PCA covariates
bash scripts/06_GWAS/run_gwas.shTraits Analyzed:
- Days to flowering (DTF)
- Plant height (inches)
- Panicle exertion (inches)
- Seed weight (grams per 100 seeds)
Outputs:
data/output/gwas/wgs_pruned_<trait>.glm.linear- WGS GWAS resultsdata/output/gwas/agp_full_<trait>.glm.linear- AgriPlex GWAS results
Visualize GWAS Results:
# Manhattan + QQ plots for plant height
Rscript scripts/09_figures/manhattan_height.r
# Manhattan + QQ plots for all traits
Rscript scripts/09_figures/manhattan_all.rOutputs:
data/output/gwas/plots/<trait>_manhattan.pngdata/output/gwas/plots/<trait>_qq.png
Purpose:
- Build VanRaden genomic relationship matrices (G-matrices) for AGP full and WGS pruned02
- Evaluate agreement (correlation/Bland-Altman/visualizations)
# Build VanRaden matrices (writes to data/input/*/kinship/)
Rscript scripts/07_kinship/create_vanraden_k2.r --preset agp
Rscript scripts/07_kinship/create_vanraden_k2.r --preset wgs
# Evaluate kinship agreement (writes to data/output/kinship_eval/)
Rscript scripts/07_kinship/bland-altman.r
Rscript scripts/09_figures/visualize_kin.rOutputs:
- VanRaden matrices (canonical inputs for GS):
data/input/agp/kinship/SAP_genotypes_agp_full_G_matrix.rdsdata/input/wgs/kinship/SAP_genotypes_wgs_pruned02_G_matrix.rds
- Kinship evaluation:
data/output/kinship_eval/*
Purpose: Evaluate genomic prediction accuracy using k-fold cross-validation with both rrBLUP and ASReml.
Note: This is the most computationally intensive step. It runs 50 iterations of 5-fold cross-validation across multiple traits in parallel.
Rscript scripts/08_iterate_kfold_GP/pred_SAP.rConfiguration (edit within the script if needed):
NUM_ITERS- Number of CV iterations (default: 50)CV_FOLDS- Number of folds per iteration (default: 5)FUTURE_CORES_FRACTION- CPU usage fraction (default: 0.8)
Computational Requirements:
- Runtime: 2-6 hours on a modern workstation (depends on core count)
- RAM: 16-32 GB recommended
- CPU: Benefits from 8+ cores
Outputs:
data/output/gs/sap_gs_cv_results.csv- Main results file- Columns:
trait,k2(marker panel),method(rrBLUP/ASReml),fold,test_cor_pearson,test_bias,iter
- Columns:
Important: If ASReml is not available, the script will automatically skip ASReml analysis and only report rrBLUP results. No code changes are needed.
Purpose: Generate publication-quality figures for PCA, Procrustes, genomic selection, and GWAS results.
Purpose: Assess alignment between WGS and AgriPlex PCA spaces.
Rscript scripts/09_figures/PCA_procrustes.r \
--panel1=data/input/agp/raw/full/SAP_genotypes_agp_full.raw \
--panel2=data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02.raw \
--pcs=5 \
--permutations=999Outputs:
data/output/pca/procrustes/*data/output/pca/plots/*
Alternative Procrustes Scripts:
PCA_procrustes_plink2_filtered.r- Filtered marker setsPCA_procrustes_plink2_prune02_vs_full.r- Compare pruned vs full WGS
Purpose: Visualize genomic prediction accuracy and bias.
# Confidence ellipses (accuracy vs bias)
Rscript scripts/09_figures/confidence_ellipses.r
# Violin plots (accuracy distributions by trait)
Rscript scripts/09_figures/violin_sap_pred.r
# Combined patchwork figure
Rscript scripts/09_figures/patchwork.rOutputs:
data/output/gs/plots/*.png
Purpose: Visualize genomic relationship matrices.
Rscript scripts/09_figures/visualize_kin.r| Script | Purpose | Key Outputs |
|---|---|---|
01_plink_pheno/prepare_pheno_plink.R |
Convert SAP field data CSV to PLINK-compatible TSV with averaged replicates | data/input/pheno/pheno_plink_format.tsv |
| Script | Purpose | Key Outputs |
|---|---|---|
02_VCF2RAW/prep_genotypes.sh |
Convert VCF → PLINK .raw (0/1/2 additive matrix). WGS: full + LD-pruned (MAF ≥ 0.05, geno ≤ 0.10); AGP: full only | WGS: data/input/wgs/raw/full/*.raw & data/input/wgs/raw/pruned/*.raw + .prune.in; AGP: data/input/agp/raw/full/*.raw |
03_VCF2Bfiles/build_full_wgs_agp_bfiles.sh |
Convert WGS and AgriPlex VCFs → PLINK binaries (no filtering) | WGS: data/input/wgs/bfiles/full/*.{bed,bim,fam}; AGP: data/input/agp/bfiles/full/*.{bed,bim,fam} |
03_VCF2Bfiles/prune_wgs_bfiles.sh |
Build pruned WGS bfiles using .prune.in from step 02 (no redundant LD pruning) |
data/input/wgs/bfiles/pruned/*.{bed,bim,fam} |
03_VCF2Bfiles/build_matched_wgs_agp_bfiles.sh |
Build matched/filtered bfiles (same markers across platforms) | data/input/*/bfiles/filtered/*.{bed,bim,fam} |
| Script | Purpose | Key Outputs |
|---|---|---|
05_calc_eigenvalues_PCA/pca.sh |
PCA on WGS full, WGS pruned, AGP full, and filtered panels | data/output/PCA/{wgs_full,wgs_pruned,agp,wgs_filtered_overlap,agp_filtered_overlap}_pca.{eigenvec,eigenval} |
| Script | Purpose | Key Outputs |
|---|---|---|
06_GWAS/run_gwas.sh |
GWAS with PCA covariates (10 PCs) | data/output/gwas/*.glm.linear |
| Script | Purpose | Key Outputs |
|---|---|---|
04_filter_panels/filter_wgs_by_agp.sh |
Subset WGS to markers present in AgriPlex panel (defaults to data/input/agp/raw/full/SAP_genotypes_agp_full.raw, or pass agp_raw) |
Marker lists + filtered .raw under data/input/*/raw/filtered/ |
| Script | Purpose | Method | Key Outputs |
|---|---|---|---|
07_kinship/create_vanraden_k2.r |
Compute VanRaden G-matrix from PLINK .raw (AGP full, WGS pruned02 presets) | rrBLUP::A.mat() |
data/input/*/kinship/*_G_matrix.rds |
07_kinship/bland-altman.r |
Bland–Altman agreement plots for kinship matrices | Bland–Altman | data/output/kinship_eval/* |
07_kinship/kinship_corr.r |
Correlation between WGS and AGP kinship matrices | Pearson correlation | Console output + optional files |
| Script | Purpose | Methods | Runtime | Key Outputs |
|---|---|---|---|---|
pred_SAP.r |
k-fold CV genomic prediction | rrBLUP, ASReml (optional) | 2-6 hours | sap_gs_cv_results.csv |
Key Features:
- Parallelized across CV iterations using
furrr - Automatic handling of missing ASReml license
- Progress bar via
progressr - Handles traits with/without replication
- Computes Pearson accuracy and regression bias
| Script | Purpose | Figure Type | Key Outputs |
|---|---|---|---|
pca_SAP.r |
Simple PCA scatter plots | 2D scatter (PC1 vs PC2) | PNG plots |
pca_plink2.r |
PCA plots for PLINK2-generated PCs | 2D scatter with variance labels | PNG plots |
PCA_procrustes.r |
Procrustes alignment between marker panels | Rotated PC spaces with residuals | PNG/PDF, summary CSV |
PCA_procrustes_plink2_filtered.r |
Procrustes for filtered marker sets | Rotated alignment plot | PNG/PDF |
PCA_procrustes_plink2_prune02_vs_full.r |
Compare pruned vs full WGS PCA | Rotated alignment plot | PNG/PDF |
| Script | Purpose | Figure Type | Key Outputs |
|---|---|---|---|
confidence_ellipses.r |
Accuracy vs bias distributions | 95% confidence ellipses by method/platform | PNG (separate for rrBLUP and ASReml) |
violin_sap_pred.r |
Accuracy distributions by trait | Violin + boxplot panels | PNG multipanel |
patchwork.r |
Combined GS summary figure | Multi-panel composite | PNG composite |
| Script | Purpose | Figure Type | Key Outputs |
|---|---|---|---|
manhattan_height.r |
Manhattan + QQ plots for plant height | Standard GWAS plots | PNG manhattan + QQ |
manhattan_all.r |
Manhattan + QQ plots for all traits | Standard GWAS plots | PNG manhattan + QQ (per trait) |
| Script | Purpose | Figure Type | Key Outputs |
|---|---|---|---|
visualize_kin.r |
Heatmap of kinship matrices | Heatmap with dendrograms | PNG heatmaps |
| Script | Purpose | Key Outputs |
|---|---|---|
07_kinship/kinship_corr.r |
Compute correlation between WGS and AgriPlex kinship | Console output, optional plot |
| File Path | Description | Format |
|---|---|---|
data/output/gs/sap_gs_cv_results.csv |
Cross-validation results (all traits, folds, iterations) | CSV |
data/output/gs/plots/*.png |
Genomic selection figures | PNG |
data/output/pca/*.eigenvec |
Principal component scores | TSV |
data/output/pca/*.eigenval |
Variance explained per PC | TSV |
data/output/gwas/*.glm.linear |
GWAS association results | PLINK2 GLM format |
data/output/gwas/plots/*.png |
GWAS figures | PNG |
data/input/*/kinship/*_G_matrix.rds |
VanRaden kinship matrices (canonical inputs) | RDS (R serialized) |
data/output/kinship_eval/* |
Kinship evaluation outputs | PNG/TSV/etc. |
data/output/pca/procrustes/* |
Procrustes correlation statistics + reports | CSV/TXT |
| Column | Description |
|---|---|
trait |
Phenotype name |
k2 |
Marker panel (wgs or agp) |
method |
GS method (rrblup or asreml) |
fold |
Fold identifier (e.g., Fold1, Fold2, ...) |
test_cor_pearson |
Pearson correlation (observed vs predicted) |
test_bias |
Regression slope (observed ~ predicted) |
iter |
CV iteration number (1-50) |
Interpretation:
test_cor_pearson: Genomic prediction accuracy (higher = better, range 0-1)test_bias: Prediction bias (closer to 1.0 = unbiased)
All scripts use dynamic path resolution to ensure portability across systems:
- Environment Variable: Set
SBI_AGRIPLEX_ROOTto explicitly define the project root - Auto-detection: Scripts search upward from their location for a
data/directory - No hardcoded paths: All file paths are constructed relative to
PROJECT_ROOT
Shell scripts:
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd -P)"
PROJECT_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd -P)"R scripts:
get_project_root <- function() {
env_root <- Sys.getenv("SBI_AGRIPLEX_ROOT", unset = "")
if (nzchar(env_root)) {
return(normalizePath(env_root, mustWork = TRUE))
}
# ... auto-detection logic ...
}
PROJECT_ROOT <- get_project_root()| Step | Runtime | RAM | CPU Cores |
|---|---|---|---|
| VCF Conversion | 10-30 min | 8 GB | 16 threads (configurable) |
| PCA | 5-15 min | 8 GB | 1-4 cores |
| Kinship Calculation | 10-30 min | 16 GB | 1 core |
| Genomic Selection CV | 2-6 hours | 16-32 GB | 8+ cores recommended |
| GWAS | 30-60 min | 8 GB | 16 threads (configurable) |
| Visualization | 5-10 min per script | 4 GB | 1 core |
The genomic selection script (pred_SAP.r) uses parallel processing:
- Controlled by
FUTURE_CORES_FRACTION(default: 0.8 = 80% of available cores) - Example: On a 16-core machine, uses ~13 cores
- Adjust in the script if needed:
FUTURE_CORES_FRACTION <- 0.5 # Use only 50% of coresIssue: ASReml license not found
Solution: This is expected if you don't have an ASReml license. The script will automatically use rrBLUP only. No action needed.
Issue: VCF file not found
Solution: Download the SAP WGS VCF from ENA (see Data Acquisition). Some scripts will offer to download it automatically.
Issue: PLINK2 not found in PATH
Solution: Install PLINK2 from https://www.cog-genomics.org/plink/2.0/ and ensure it's in your $PATH.
Issue: R package 'asreml' not found
Solution: Expected if you don't have an ASReml license. The genomic selection script will skip ASReml and proceed with rrBLUP.
Issue: Out of memory during genomic selection
Solution:
- Reduce
NUM_ITERSorCV_FOLDSinpred_SAP.r - Decrease
FUTURE_CORES_FRACTIONto reduce parallel load - Close other applications
Issue: Scripts can't find data/ directory
Solution:
- Run scripts from the project root directory, OR
- Set
SBI_AGRIPLEX_ROOTenvironment variable, OR - Ensure
data/directory exists relative to script locations
If you use this pipeline or data in your research, please cite:
[Your citation here - include DOI, journal, authors, etc.]
Data Source:
Boatwright, J.L., et al. (2022). Sorghum Association Panel whole-genome sequencing data.
European Nucleotide Archive, Project PRJEB51985.
https://www.ebi.ac.uk/ena/browser/view/PRJEB51985
For questions, issues, or contributions, please contact:
- Author: [Your Name]
- Email: [Your Email]
- Institution: [Your Institution]
[Specify your license here - e.g., MIT, GPL-3, etc.]
- Sorghum Association Panel data provided by [appropriate acknowledgment]
- ASReml-R developed by VSN International
- rrBLUP package developed by Jeffrey Endelman
- PLINK2 developed by Christopher Chang
- v1.0 (YYYY-MM-DD) - Initial release
Happy analyzing! 🌾