Skip to content

Breeding-Insight/sorghum_validate_panel

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Sorghum Marker Panel Analysis

This repository contains a complete computational workflow for comparing genomic selection (GS) performance between whole-genome sequencing (WGS) and AgriPlex genotyping platforms in the Sorghum Association Panel (SAP).

Table of Contents


Overview

This analysis pipeline performs:

  1. VCF preprocessing - Convert VCF files to PLINK binary format with quality control
  2. LD pruning - Generate LD-pruned marker sets for PCA and kinship calculation
  3. Principal Component Analysis (PCA) - Population structure visualization for WGS and AgriPlex panels
  4. Procrustes Analysis - Alignment comparison between WGS and AgriPlex PCA spaces
  5. Kinship Matrix Construction - VanRaden genomic relationship matrices (G-matrices)
  6. Genomic Selection Cross-Validation - k-fold CV using rrBLUP and ASReml-R
  7. GWAS - Genome-wide association studies with PCA covariates
  8. Visualization - Manhattan plots, QQ plots, confidence ellipses, violin plots, and more

Key Finding: The analysis demonstrates that the AgriPlex platform (1,942 markers) achieves comparable genomic prediction accuracy to WGS (~250K markers) for the traits analyzed in the SAP panel.


Repository Structure

All scripts resolve paths relative to the project root (via SBI_AGRIPLEX_ROOT or auto-detection), and assume the following standardized structure:

.
├── data/
│   ├── input/                        # Input data (VCF files, phenotypes, derived genotype formats)
│   │   ├── agp/                      # AgriPlex genotyping platform data
│   │   │   ├── vcf/                  # Required: Validated.SAP.VCF
│   │   │   ├── raw/
│   │   │   │   ├── full/             # Generated: SAP_genotypes_agp_full.raw
│   │   │   │   └── filtered/         # Generated: SAP_genotypes_agp_filtered_by_wgs.*
│   │   │   ├── bfiles/
│   │   │   │   ├── full/             # Generated: SAP_genotypes_agp_full.{bed,bim,fam}
│   │   │   │   └── filtered/         # Generated: SAP_genotypes_agp_filtered_by_wgs.{bed,bim,fam}
│   │   │   └── kinship/              # Generated: VanRaden G-matrices for AGP full
│   │   │       └── SAP_genotypes_agp_full_G_matrix.rds
│   │   ├── wgs/                      # Whole-genome sequencing data (Boatwright)
│   │   │   ├── vcf/                  # Required: SAP.raw.vcf.gz
│   │   │   ├── raw/
│   │   │   │   ├── full/             # Generated: SAP_genotypes_wgs_full.raw
│   │   │   │   ├── pruned/           # Generated: SAP_genotypes_wgs_pruned02.raw + pruning lists
│   │   │   │   └── filtered/         # Generated: SAP_genotypes_wgs_filtered_by_agp.*
│   │   │   ├── bfiles/
│   │   │   │   ├── full/             # Generated: SAP_genotypes_wgs_full.{bed,bim,fam}
│   │   │   │   ├── pruned/           # Generated: SAP_genotypes_wgs_pruned02.{bed,bim,fam}
│   │   │   │   └── filtered/         # Generated: SAP_genotypes_wgs_filtered_by_agp.{bed,bim,fam}
│   │   │   └── kinship/              # Generated: VanRaden G-matrices for WGS pruned02
│   │   │       └── SAP_genotypes_wgs_pruned02_G_matrix.rds
│   │   └── pheno/                    # Phenotype data
│   │       ├── SAP_Field_Data_Set.csv    # Required: phenotype source
│   │       └── pheno_plink_format.tsv    # Generated: PLINK phenotype table for GWAS
│   └── output/                       # Analysis outputs
│       ├── pca/                      # PCA eigenvec/eigenval + PCA-derived figures
│       │   ├── *.eigenvec
│       │   ├── *.eigenval
│       │   ├── plots/
│       │   └── procrustes/
│       ├── gwas/                     # PLINK2 GWAS outputs + figures
│       │   ├── *.glm.linear
│       │   └── plots/
│       ├── gs/                       # Genomic selection (k-fold CV) outputs + figures
│       │   ├── sap_gs_cv_results.csv
│       │   └── plots/
│       └── kinship_eval/             # Kinship evaluation (correlation/Bland-Altman/etc.)
│           ├── kinship_bland_altman.png
│           └── square/
├── scripts/                          # Analysis pipeline scripts
│   ├── 00_env_setup/                 # Environment setup (conda, optional)
│   ├── 01_plink_pheno/               # Phenotype formatting
│   ├── 02_VCF2RAW/                   # VCF to PLINK .raw conversion
│   ├── 03_VCF2Bfiles/                # VCF to PLINK binary files
│   ├── 04_filter_panels/             # Cross-platform marker filtering (QC)
│   ├── 05_calc_eigenvalues_PCA/      # Principal component analysis
│   ├── 06_GWAS/                      # Genome-wide association studies
│   ├── 07_kinship/                   # Kinship matrix construction
│   ├── 08_iterate_kfold_GP/          # Genomic selection cross-validation
│   └── 09_figures/                   # Visualization scripts
├── workflow/                         # Snakemake workflow orchestration (RECOMMENDED)
│   ├── Snakefile                     # Main workflow definition
│   ├── config/
│   │   ├── config.yaml               # Production configuration
│   │   └── config_test.yaml          # Test/development configuration
│   ├── rules/                        # Modular Snakemake rules
│   │   ├── 01_phenotype_prep.smk
│   │   ├── 02_genotype_conversion.smk
│   │   ├── 03_filtering.smk
│   │   ├── 04_pca.smk
│   │   ├── 05_gwas.smk
│   │   ├── 06_kinship.smk
│   │   ├── 07_genomic_selection.smk
│   │   └── 08_visualization.smk
│   ├── README.md                     # Snakemake workflow documentation
│   └── QUICKSTART.md                 # Quick reference guide
└── README.md                         # This file

Important Notes:

  • VanRaden matrices live under data/input/*/kinship/ by design: downstream GS and evaluation scripts treat these as canonical inputs (AGP full and WGS pruned02)

Prerequisites

Required Software

Command-line Tools

  • PLINK2 (v2.0+) - Genetic data management and analysis

R (v4.0+) with Packages

Core Packages (available via CRAN):

install.packages(c(
  "tidyverse",      # Data manipulation and visualization
  "data.table",     # Fast data reading/writing
  "ggplot2",        # Plotting
  "ggforce",        # Extended ggplot2 geoms
  "magrittr",       # Pipe operators
  "furrr",          # Parallel processing
  "future",         # Parallel backend
  "progressr"       # Progress bars
))

Genetic Analysis Packages:

install.packages(c(
  "rrBLUP",         # Ridge regression BLUP (genomic selection)
  "vcfR",           # VCF file handling
  "genio",          # PLINK file I/O
  "caret",          # Cross-validation framework
  "matrixcalc",     # Matrix operations
  "Matrix",         # Sparse matrices
  "vegan",          # Procrustes analysis
  "qqman"           # Manhattan and QQ plots
))

ASReml-R (OPTIONAL - Commercial License Required):

# ASReml-R provides more advanced mixed model capabilities
# If you don't have an ASReml license, the pipeline will use rrBLUP exclusively
# rrBLUP is a free, open-source alternative that performs comparably for most traits

Important Note on ASReml:

  • ASReml-R is a commercial package requiring a paid license (~$600-1000/year for academic use)
  • The pipeline works perfectly fine using only rrBLUP (free, open-source)
  • Both methods are implemented in scripts/08_iterate_kfold_GP/pred_SAP.r
  • Results show rrBLUP and ASReml achieve nearly identical prediction accuracies
  • If ASReml is unavailable, the script will simply skip ASReml results and continue with rrBLUP

Optional: ASReml License Configuration

If you have an ASReml license, place your asreml.lic file in your home directory:

cp /path/to/your/asreml.lic ~/asreml.lic

The genomic selection script (pred_SAP.r) will automatically detect and use it. If the license is not found, only rrBLUP results will be generated.


Installation

1. Clone the Repository

git clone <repository-url>
cd sorghum_validate_panel

2. Set Environment Variable (Optional)

For explicit control over the project root directory:

export SBI_AGRIPLEX_ROOT="/path/to/sorghum_validate_panel"

If not set, scripts will automatically detect the project root by searching for the data/ directory.

3. Install PLINK2

If you don't already have PLINK2 installed, use the provided conda environment:

Option A: Conda Environment Setup (Recommended)

The repository includes a conda environment specification that installs PLINK2 and all required command-line tools.

# Create the conda environment
bash scripts/00_env_setup/setup_conda_env.sh

# Activate the environment
conda activate sorghum_gs_validation-env

# Verify PLINK2 is installed
plink2 --version

Notes:

  • The script will create or update the sorghum_gs_validation-env environment
  • Use --force flag to recreate from scratch: bash scripts/00_env_setup/setup_conda_env.sh --force
  • You'll need to activate this environment (conda activate sorghum_gs_validation-env) whenever running the shell scripts
  • This environment includes all required tools: PLINK2, Snakemake (≥7.0), wget, and system utilities

Included tools:

  • plink2 - Main genotyping analysis tool
  • wget - For downloading data files
  • openssl and ca-certificates - For secure connections

Option B: Manual PLINK2 Installation

If you prefer not to use conda, manually install PLINK2:

# Download and install from:
# https://www.cog-genomics.org/plink/2.0/

# Verify installation
plink2 --version

4. Install R Packages

# Launch R and install required packages
install.packages(c("tidyverse", "data.table", "ggforce", "rrBLUP", 
                   "caret", "vcfR", "genio", "matrixcalc", "Matrix",
                   "furrr", "future", "progressr", "vegan", "qqman"))

Data Acquisition

REQUIRED FIRST STEP

Important: The following data files are NOT included in this repository but are required to run the analysis pipeline. All collaborators should have access to these files. Before running any scripts, you must place these files in the correct locations:

Required Files and Locations

  1. WGS VCF File: SAP.raw.vcf.gz

    • Place in: data/input/wgs/vcf/SAP.raw.vcf.gz
    • Description: Sorghum Association Panel whole-genome sequencing data
    • Important Note: The prep_genotypes.sh script (Step 1 of workflow) will attempt to auto-download this file from ENA if not found. However, if the download fails, you must manually place the file in this location.
  2. AgriPlex VCF File: Validated.SAP.VCF

    • Place in: data/input/agp/vcf/Validated.SAP.VCF
    • Description: AgriPlex genotyping panel data
  3. Phenotype Data: SAP_Field_Data_Set.csv

    • Place in: data/input/pheno/SAP_Field_Data_Set.csv
    • Description: Field trial phenotypic measurements
    • Note Any SAP file with phenomic info can be used here, but trait columns may need to be updated to match the script (or vice-versa)

Setup Instructions

# 1. Create the required directory structure (if not already present)
mkdir -p data/input/wgs/vcf
mkdir -p data/input/wgs/raw/full
mkdir -p data/input/wgs/raw/pruned
mkdir -p data/input/wgs/raw/filtered
mkdir -p data/input/wgs/bfiles/full
mkdir -p data/input/wgs/bfiles/pruned
mkdir -p data/input/wgs/bfiles/filtered
mkdir -p data/input/agp/vcf
mkdir -p data/input/agp/raw/full
mkdir -p data/input/agp/raw/filtered
mkdir -p data/input/agp/bfiles/full
mkdir -p data/input/agp/bfiles/filtered
mkdir -p data/input/pheno

# 2. Copy the files from your shared location to the correct directories
# Example (adjust paths based on where you received the files):
cp /path/to/your/SAP.raw.vcf.gz data/input/wgs/raw/
cp /path/to/your/Validated.SAP.VCF data/input/agp/vcf/
cp /path/to/your/SAP_Field_Data_Set.csv data/input/pheno/

# 3. Verify the files are in place
ls -lh data/input/wgs/raw/SAP.raw.vcf.gz
ls -lh data/input/agp/vcf/Validated.SAP.VCF
ls -lh data/input/pheno/SAP_Field_Data_Set.csv

Once these files are in place, you can proceed with the Workflow steps below.


Public Data Source (WGS VCF only)

The Sorghum Association Panel (SAP) WGS data (Boatwright) can be downloaded from the European Nucleotide Archive:

Project: PRJEB51985
Direct Link: https://www.ebi.ac.uk/ena/browser/view/PRJEB51985

# Download SAP WGS VCF (example using wget)
wget -O data/input/wgs/raw/SAP.raw.vcf.gz \
  ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ125/ERZ12588732/SAP.raw.vcf.gz

Note: The prep_genotypes.sh script (used in Workflow Step 1) includes an interactive prompt to automatically download this file if not found. If the auto-download fails, you can manually download it using the command above.

Workflow

This pipeline can be executed in two ways:

  1. Snakemake Workflow (RECOMMENDED) - Automated, reproducible, with automatic dependency management
  2. Manual Step-by-Step - Individual script execution with manual dependency tracking

Snakemake Workflow (Recommended)

The workflow/ directory contains a complete Snakemake pipeline that orchestrates all analysis steps automatically. This is the recommended approach for most users.

Why Use Snakemake?

  • Automatic Checkpointing: Resume from failures without re-running completed steps
  • Parallel Execution: Independent tasks (WGS/AGP processing) run simultaneously
  • Dependency Management: Ensures correct execution order automatically
  • Configuration-Driven: Centralized parameter management
  • Reproducibility: Single command executes entire pipeline consistently

Quick Start with Snakemake

# 1. Activate conda environment (includes Snakemake ≥7.0)
conda activate sorghum_gs_validation-env

# 2. Verify Snakemake installation
snakemake --version

# 3. Preview what will run (dry run)
snakemake --dry-run --cores 1

# 4. Execute full pipeline
snakemake --cores 16

# 5. Test run (skip intensive genomic selection)
snakemake test_run --cores 8

Configuration

Edit workflow/config/config.yaml to customize analysis parameters:

# Enable/disable analysis components
analysis:
  run_matched_panels: true      # Cross-platform QC filtering
  run_gwas: true                # Association studies
  run_genomic_selection: true   # CV cross-validation (2-6 hours)
  use_asreml: true              # Use ASReml if license available

# Computational resources
resources:
  plink_threads: 16             # Threads for PLINK operations
  r_cores_fraction: 0.8         # Use 80% of cores for R parallel tasks

# Genomic selection parameters
genomic_selection:
  num_iters: 50                 # CV iterations (use 5 for testing)
  cv_folds: 5                   # Folds per iteration

Common Snakemake Usage Patterns

# Run only GWAS (skip genomic selection)
snakemake gwas_only --cores 16

# Run only genomic selection
snakemake genomic_selection_only --cores 16

# Generate all figures only
snakemake all_figures --cores 8

# Resume after failure (automatic checkpointing)
snakemake --cores 16 --rerun-incomplete

# Visualize workflow as DAG
snakemake --dag | dot -Tpng > workflow_dag.png

# Check execution plan with details
snakemake --dry-run --printshellcmds

Snakemake Workflow Structure

Input Data (VCF + Phenotypes)
  ├─> Phenotype Preparation (PLINK format)
  ├─> VCF to RAW Conversion (WGS + AGP)
  │   └─> VCF to Binary Files (PLINK .bed/.bim/.fam)
  │       ├─> LD Pruning (WGS only, r² ≤ 0.2)
  │       └─> Optional: Cross-Platform Filtering (QC)
  ├─> Principal Component Analysis (all panels)
  │   ├─> GWAS (optional, with PCA covariates)
  │   └─> PCA Visualizations (plots + Procrustes)
  ├─> Kinship Matrix Construction (VanRaden G-matrices)
  │   ├─> Genomic Selection CV (optional, rrBLUP + ASReml)
  │   ├─> Kinship Evaluation (Bland-Altman, correlation)
  │   └─> GS Visualizations (ellipses, violins)
  └─> All Publication Figures

Detailed Documentation

For comprehensive Snakemake workflow documentation, see:

  • workflow/README.md - Complete guide with troubleshooting and advanced usage
  • workflow/QUICKSTART.md - Quick reference for common tasks
  • workflow/config/config.yaml - All configurable parameters with descriptions

Manual Step-by-Step Workflow

If you prefer manual execution or need to run individual steps independently, follow this sequential workflow. Ensure you've completed Installation (PLINK setup) and Data Acquisition first.

Phenotype Data Preparation

Before running analyses, prepare the phenotype file for PLINK/GWAS compatibility.

Step 0: Format Phenotype Data for PLINK

Purpose: Convert the SAP field data CSV to PLINK-compatible TSV format with averaged replicate values.

Rscript scripts/01_plink_pheno/prepare_pheno_plink.R

Input:

  • data/input/pheno/SAP_Field_Data_Set.csv

Output:

  • data/input/pheno/pheno_plink_format.tsv - Averaged phenotypes in PLINK format (IID + trait columns)

Traits Processed:

  • Days_to_flowering_DTF
  • Height_inches
  • Exertion_inches
  • Seed_weight_grams_100_seeds

Note: This step is required before running GWAS (Step 6). The script averages replicate measurements for each accession.


Genomic Data Preparation

This section prepares raw genotype data (VCF files) into quality-controlled formats for downstream analyses.

Step 1: Generate PLINK .raw Files (VCF to RAW)

Purpose: Convert VCF files to additive genotype matrices (0/1/2 format). For WGS data, generates both unpruned and LD-pruned versions (MAF ≥ 0.05, missingness ≤ 10% applied before pruning). For AgriPlex, generates unpruned only (panel is pre-designed). The pruned SNP list (.prune.in) produced here is reused by downstream scripts (e.g. prune_wgs_bfiles.sh) to build binary files without re-running LD pruning.

# Process both WGS and AgriPlex VCF files
bash scripts/02_VCF2RAW/prep_genotypes.sh

Helpful Feature: If the WGS VCF file (SAP.raw.vcf.gz) is not found, this script will prompt you to automatically download it from ENA. If the download fails, you'll need to manually obtain the file (see Data Acquisition).

Outputs:

WGS Unpruned (QC filtered only):

  • data/input/wgs/raw/full/SAP_genotypes_wgs_full.raw

WGS LD Pruned (r² ≤ 0.2):

  • data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02.raw
  • data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.in
  • data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.out

AgriPlex (if VCF provided):

  • data/input/agp/raw/full/SAP_genotypes_agp_full.raw

Step 2: Build PLINK Binary Files (VCF to Bfiles)

Purpose: Convert raw VCF files to PLINK binary format. This step builds the core genomic resources needed for all downstream analyses (PCA, GWAS, genomic selection). The prune_wgs_bfiles.sh script reuses the .prune.in list from step 1 (no redundant LD pruning).

# Build full PLINK binary files from VCFs
bash scripts/03_VCF2Bfiles/build_full_wgs_agp_bfiles.sh

# Build pruned WGS bfiles (uses .prune.in from step 1)
bash scripts/03_VCF2Bfiles/prune_wgs_bfiles.sh

Outputs:

Full WGS and AGP Bfiles:

  • data/input/wgs/bfiles/full/SAP_genotypes_wgs_full.{bed,bim,fam} (PLINK binary files)
  • data/input/agp/bfiles/full/SAP_genotypes_agp_full.{bed,bim,fam}

LD-Pruned WGS Bfiles (r² ≤ 0.2, MAF ≥ 0.05, missingness ≤ 10%):

  • data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02.{bed,bim,fam}
  • data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.in (kept SNPs)
  • data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.out (removed SNPs)

Step 3: Filter Panels for QC Matching

Purpose: Perform bidirectional filtering to ensure both WGS and AgriPlex panels contain only the intersection of markers present in both platforms, enabling direct comparison and quality control.

# Bidirectionally filter WGS and AgriPlex panels
bash scripts/04_filter_panels/filter_wgs_by_agp.sh

Inputs:

  • Requires the AgriPlex full .raw at data/input/agp/raw/full/SAP_genotypes_agp_full.raw (generated in Step 1).
  • If your AGP .raw lives elsewhere, pass it as the 3rd argument:
# bash filter_wgs_by_agp.sh [wgs_out_prefix] [agp_out_prefix] [agp_raw] [wgs_vcf]
bash scripts/04_filter_panels/filter_wgs_by_agp.sh \
  data/input/wgs/raw/filtered/SAP_genotypes_wgs_filtered_by_agp \
  data/input/agp/raw/filtered/SAP_genotypes_agp_filtered_by_wgs \
  path/to/your_agp.raw

Process:

  1. Extract markers from AgriPlex panel
  2. Filter WGS VCF to markers present in AgriPlex
  3. Extract markers successfully found in WGS
  4. Filter AgriPlex to markers that exist in filtered WGS

Outputs:

  • data/input/wgs/raw/filtered/SAP_genotypes_wgs_filtered_by_agp.raw - WGS filtered to AgriPlex markers
  • data/input/agp/raw/filtered/SAP_genotypes_agp_filtered_by_wgs.raw - AgriPlex filtered to WGS markers
  • data/input/wgs/raw/filtered/agp_markers.list - Original AgriPlex marker list
  • data/input/wgs/raw/filtered/wgs_markers.list - Markers found in WGS (intersection)

Step 4: Build Matched PLINK Binary Files (Matched Marker Set)

Purpose: Build PLINK bfiles for both WGS and AgriPlex using the matched marker list produced in Step 3.

# Build matched/filtered panels (same markers, overlapping samples)
bash scripts/03_VCF2Bfiles/build_matched_wgs_agp_bfiles.sh

Outputs:

  • data/input/wgs/bfiles/filtered/SAP_genotypes_wgs_filtered_by_agp.{bed,bim,fam}
  • data/input/agp/bfiles/filtered/SAP_genotypes_agp_filtered_by_wgs.{bed,bim,fam}

PLINK Analyses

Principal Component Analysis (PCA)

Purpose: Compute population structure for both marker panels.

# Run PCA on all datasets (WGS unpruned, WGS pruned, AGP full, filtered panels)
bash scripts/05_calc_eigenvalues_PCA/pca.sh

Outputs:

  • data/output/PCA/wgs_full_pca.eigenvec and .eigenval
  • data/output/PCA/wgs_pruned_pca.eigenvec and .eigenval
  • data/output/PCA/agp_pca.eigenvec and .eigenval
  • data/output/PCA/wgs_filtered_overlap_pca.*
  • data/output/PCA/agp_filtered_overlap_pca.*

Note: PCA outputs use uppercase PCA/ directory for compatibility with legacy scripts.

Visualize PCA Results:

Rscript scripts/09_figures/pca_SAP.r
Rscript scripts/09_figures/pca_plink2.r

Genome-Wide Association Studies (GWAS)

Purpose: Identify marker-trait associations while controlling for population structure.

# Run GWAS for all traits using PCA covariates
bash scripts/06_GWAS/run_gwas.sh

Traits Analyzed:

  • Days to flowering (DTF)
  • Plant height (inches)
  • Panicle exertion (inches)
  • Seed weight (grams per 100 seeds)

Outputs:

  • data/output/gwas/wgs_pruned_<trait>.glm.linear - WGS GWAS results
  • data/output/gwas/agp_full_<trait>.glm.linear - AgriPlex GWAS results

Visualize GWAS Results:

# Manhattan + QQ plots for plant height
Rscript scripts/09_figures/manhattan_height.r

# Manhattan + QQ plots for all traits
Rscript scripts/09_figures/manhattan_all.r

Outputs:

  • data/output/gwas/plots/<trait>_manhattan.png
  • data/output/gwas/plots/<trait>_qq.png

R Analyses

Kinship Matrix Construction + Evaluation

Purpose:

  • Build VanRaden genomic relationship matrices (G-matrices) for AGP full and WGS pruned02
  • Evaluate agreement (correlation/Bland-Altman/visualizations)
# Build VanRaden matrices (writes to data/input/*/kinship/)
Rscript scripts/07_kinship/create_vanraden_k2.r --preset agp
Rscript scripts/07_kinship/create_vanraden_k2.r --preset wgs

# Evaluate kinship agreement (writes to data/output/kinship_eval/)
Rscript scripts/07_kinship/bland-altman.r
Rscript scripts/09_figures/visualize_kin.r

Outputs:

  • VanRaden matrices (canonical inputs for GS):
    • data/input/agp/kinship/SAP_genotypes_agp_full_G_matrix.rds
    • data/input/wgs/kinship/SAP_genotypes_wgs_pruned02_G_matrix.rds
  • Kinship evaluation:
    • data/output/kinship_eval/*

Genomic Selection Cross-Validation

Purpose: Evaluate genomic prediction accuracy using k-fold cross-validation with both rrBLUP and ASReml.

Note: This is the most computationally intensive step. It runs 50 iterations of 5-fold cross-validation across multiple traits in parallel.

Rscript scripts/08_iterate_kfold_GP/pred_SAP.r

Configuration (edit within the script if needed):

  • NUM_ITERS - Number of CV iterations (default: 50)
  • CV_FOLDS - Number of folds per iteration (default: 5)
  • FUTURE_CORES_FRACTION - CPU usage fraction (default: 0.8)

Computational Requirements:

  • Runtime: 2-6 hours on a modern workstation (depends on core count)
  • RAM: 16-32 GB recommended
  • CPU: Benefits from 8+ cores

Outputs:

  • data/output/gs/sap_gs_cv_results.csv - Main results file
    • Columns: trait, k2 (marker panel), method (rrBLUP/ASReml), fold, test_cor_pearson, test_bias, iter

Important: If ASReml is not available, the script will automatically skip ASReml analysis and only report rrBLUP results. No code changes are needed.


Visualization

Purpose: Generate publication-quality figures for PCA, Procrustes, genomic selection, and GWAS results.

Procrustes Analysis

Purpose: Assess alignment between WGS and AgriPlex PCA spaces.

Rscript scripts/09_figures/PCA_procrustes.r \
  --panel1=data/input/agp/raw/full/SAP_genotypes_agp_full.raw \
  --panel2=data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02.raw \
  --pcs=5 \
  --permutations=999

Outputs:

  • data/output/pca/procrustes/*
  • data/output/pca/plots/*

Alternative Procrustes Scripts:

  • PCA_procrustes_plink2_filtered.r - Filtered marker sets
  • PCA_procrustes_plink2_prune02_vs_full.r - Compare pruned vs full WGS
Genomic Selection Plots

Purpose: Visualize genomic prediction accuracy and bias.

# Confidence ellipses (accuracy vs bias)
Rscript scripts/09_figures/confidence_ellipses.r

# Violin plots (accuracy distributions by trait)
Rscript scripts/09_figures/violin_sap_pred.r

# Combined patchwork figure
Rscript scripts/09_figures/patchwork.r

Outputs:

  • data/output/gs/plots/*.png
Kinship Matrix Visualization

Purpose: Visualize genomic relationship matrices.

Rscript scripts/09_figures/visualize_kin.r

Script Reference

Shell Scripts (scripts/)

Phenotype Preparation (01_plink_pheno/)

Script Purpose Key Outputs
01_plink_pheno/prepare_pheno_plink.R Convert SAP field data CSV to PLINK-compatible TSV with averaged replicates data/input/pheno/pheno_plink_format.tsv

VCF Conversion (02_VCF2RAW/, 03_VCF2Bfiles/)

Script Purpose Key Outputs
02_VCF2RAW/prep_genotypes.sh Convert VCF → PLINK .raw (0/1/2 additive matrix). WGS: full + LD-pruned (MAF ≥ 0.05, geno ≤ 0.10); AGP: full only WGS: data/input/wgs/raw/full/*.raw & data/input/wgs/raw/pruned/*.raw + .prune.in; AGP: data/input/agp/raw/full/*.raw
03_VCF2Bfiles/build_full_wgs_agp_bfiles.sh Convert WGS and AgriPlex VCFs → PLINK binaries (no filtering) WGS: data/input/wgs/bfiles/full/*.{bed,bim,fam}; AGP: data/input/agp/bfiles/full/*.{bed,bim,fam}
03_VCF2Bfiles/prune_wgs_bfiles.sh Build pruned WGS bfiles using .prune.in from step 02 (no redundant LD pruning) data/input/wgs/bfiles/pruned/*.{bed,bim,fam}
03_VCF2Bfiles/build_matched_wgs_agp_bfiles.sh Build matched/filtered bfiles (same markers across platforms) data/input/*/bfiles/filtered/*.{bed,bim,fam}

Population Structure (05_calc_eigenvalues_PCA/)

Script Purpose Key Outputs
05_calc_eigenvalues_PCA/pca.sh PCA on WGS full, WGS pruned, AGP full, and filtered panels data/output/PCA/{wgs_full,wgs_pruned,agp,wgs_filtered_overlap,agp_filtered_overlap}_pca.{eigenvec,eigenval}

Association Mapping (06_GWAS/)

Script Purpose Key Outputs
06_GWAS/run_gwas.sh GWAS with PCA covariates (10 PCs) data/output/gwas/*.glm.linear

Cross-Platform Filtering (04_filter_panels/)

Script Purpose Key Outputs
04_filter_panels/filter_wgs_by_agp.sh Subset WGS to markers present in AgriPlex panel (defaults to data/input/agp/raw/full/SAP_genotypes_agp_full.raw, or pass agp_raw) Marker lists + filtered .raw under data/input/*/raw/filtered/

R Scripts (scripts/)

Kinship Computation + Evaluation (07_kinship/)

Script Purpose Method Key Outputs
07_kinship/create_vanraden_k2.r Compute VanRaden G-matrix from PLINK .raw (AGP full, WGS pruned02 presets) rrBLUP::A.mat() data/input/*/kinship/*_G_matrix.rds
07_kinship/bland-altman.r Bland–Altman agreement plots for kinship matrices Bland–Altman data/output/kinship_eval/*
07_kinship/kinship_corr.r Correlation between WGS and AGP kinship matrices Pearson correlation Console output + optional files

Genomic Selection (08_iterate_kfold_GP/)

Script Purpose Methods Runtime Key Outputs
pred_SAP.r k-fold CV genomic prediction rrBLUP, ASReml (optional) 2-6 hours sap_gs_cv_results.csv

Key Features:

  • Parallelized across CV iterations using furrr
  • Automatic handling of missing ASReml license
  • Progress bar via progressr
  • Handles traits with/without replication
  • Computes Pearson accuracy and regression bias

Visualization (09_figures/)

PCA & Procrustes
Script Purpose Figure Type Key Outputs
pca_SAP.r Simple PCA scatter plots 2D scatter (PC1 vs PC2) PNG plots
pca_plink2.r PCA plots for PLINK2-generated PCs 2D scatter with variance labels PNG plots
PCA_procrustes.r Procrustes alignment between marker panels Rotated PC spaces with residuals PNG/PDF, summary CSV
PCA_procrustes_plink2_filtered.r Procrustes for filtered marker sets Rotated alignment plot PNG/PDF
PCA_procrustes_plink2_prune02_vs_full.r Compare pruned vs full WGS PCA Rotated alignment plot PNG/PDF
Genomic Selection Results
Script Purpose Figure Type Key Outputs
confidence_ellipses.r Accuracy vs bias distributions 95% confidence ellipses by method/platform PNG (separate for rrBLUP and ASReml)
violin_sap_pred.r Accuracy distributions by trait Violin + boxplot panels PNG multipanel
patchwork.r Combined GS summary figure Multi-panel composite PNG composite
GWAS Results
Script Purpose Figure Type Key Outputs
manhattan_height.r Manhattan + QQ plots for plant height Standard GWAS plots PNG manhattan + QQ
manhattan_all.r Manhattan + QQ plots for all traits Standard GWAS plots PNG manhattan + QQ (per trait)
Kinship Visualization
Script Purpose Figure Type Key Outputs
visualize_kin.r Heatmap of kinship matrices Heatmap with dendrograms PNG heatmaps

Utility Scripts

Script Purpose Key Outputs
07_kinship/kinship_corr.r Compute correlation between WGS and AgriPlex kinship Console output, optional plot

Output Files

Key Result Files

File Path Description Format
data/output/gs/sap_gs_cv_results.csv Cross-validation results (all traits, folds, iterations) CSV
data/output/gs/plots/*.png Genomic selection figures PNG
data/output/pca/*.eigenvec Principal component scores TSV
data/output/pca/*.eigenval Variance explained per PC TSV
data/output/gwas/*.glm.linear GWAS association results PLINK2 GLM format
data/output/gwas/plots/*.png GWAS figures PNG
data/input/*/kinship/*_G_matrix.rds VanRaden kinship matrices (canonical inputs) RDS (R serialized)
data/output/kinship_eval/* Kinship evaluation outputs PNG/TSV/etc.
data/output/pca/procrustes/* Procrustes correlation statistics + reports CSV/TXT

GS Results Format (sap_gs_cv_results.csv)

Column Description
trait Phenotype name
k2 Marker panel (wgs or agp)
method GS method (rrblup or asreml)
fold Fold identifier (e.g., Fold1, Fold2, ...)
test_cor_pearson Pearson correlation (observed vs predicted)
test_bias Regression slope (observed ~ predicted)
iter CV iteration number (1-50)

Interpretation:

  • test_cor_pearson: Genomic prediction accuracy (higher = better, range 0-1)
  • test_bias: Prediction bias (closer to 1.0 = unbiased)

Notes on Path Management

All scripts use dynamic path resolution to ensure portability across systems:

  1. Environment Variable: Set SBI_AGRIPLEX_ROOT to explicitly define the project root
  2. Auto-detection: Scripts search upward from their location for a data/ directory
  3. No hardcoded paths: All file paths are constructed relative to PROJECT_ROOT

Shell scripts:

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd -P)"
PROJECT_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd -P)"

R scripts:

get_project_root <- function() {
  env_root <- Sys.getenv("SBI_AGRIPLEX_ROOT", unset = "")
  if (nzchar(env_root)) {
    return(normalizePath(env_root, mustWork = TRUE))
  }
  # ... auto-detection logic ...
}
PROJECT_ROOT <- get_project_root()

Computational Considerations

Resource Requirements

Step Runtime RAM CPU Cores
VCF Conversion 10-30 min 8 GB 16 threads (configurable)
PCA 5-15 min 8 GB 1-4 cores
Kinship Calculation 10-30 min 16 GB 1 core
Genomic Selection CV 2-6 hours 16-32 GB 8+ cores recommended
GWAS 30-60 min 8 GB 16 threads (configurable)
Visualization 5-10 min per script 4 GB 1 core

Parallelization

The genomic selection script (pred_SAP.r) uses parallel processing:

  • Controlled by FUTURE_CORES_FRACTION (default: 0.8 = 80% of available cores)
  • Example: On a 16-core machine, uses ~13 cores
  • Adjust in the script if needed:
FUTURE_CORES_FRACTION <- 0.5  # Use only 50% of cores

Troubleshooting

Common Issues

Issue: ASReml license not found
Solution: This is expected if you don't have an ASReml license. The script will automatically use rrBLUP only. No action needed.

Issue: VCF file not found
Solution: Download the SAP WGS VCF from ENA (see Data Acquisition). Some scripts will offer to download it automatically.

Issue: PLINK2 not found in PATH
Solution: Install PLINK2 from https://www.cog-genomics.org/plink/2.0/ and ensure it's in your $PATH.

Issue: R package 'asreml' not found
Solution: Expected if you don't have an ASReml license. The genomic selection script will skip ASReml and proceed with rrBLUP.

Issue: Out of memory during genomic selection
Solution:

  • Reduce NUM_ITERS or CV_FOLDS in pred_SAP.r
  • Decrease FUTURE_CORES_FRACTION to reduce parallel load
  • Close other applications

Issue: Scripts can't find data/ directory
Solution:

  • Run scripts from the project root directory, OR
  • Set SBI_AGRIPLEX_ROOT environment variable, OR
  • Ensure data/ directory exists relative to script locations

Citation

If you use this pipeline or data in your research, please cite:

[Your citation here - include DOI, journal, authors, etc.]

Data Source:

Boatwright, J.L., et al. (2022). Sorghum Association Panel whole-genome sequencing data. 
European Nucleotide Archive, Project PRJEB51985.
https://www.ebi.ac.uk/ena/browser/view/PRJEB51985

Contact

For questions, issues, or contributions, please contact:

  • Author: [Your Name]
  • Email: [Your Email]
  • Institution: [Your Institution]

License

[Specify your license here - e.g., MIT, GPL-3, etc.]


Acknowledgments

  • Sorghum Association Panel data provided by [appropriate acknowledgment]
  • ASReml-R developed by VSN International
  • rrBLUP package developed by Jeffrey Endelman
  • PLINK2 developed by Christopher Chang

Version History

  • v1.0 (YYYY-MM-DD) - Initial release

Happy analyzing! 🌾

About

This repository contains a complete computational workflow for comparing genomic selection (GS) performance between whole-genome sequencing (WGS) and AgriPlex genotyping platforms in the Sorghum Association Panel (SAP).

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors