Sorghum Marker Panel Analysis

This repository contains a complete computational workflow for comparing genomic selection (GS) performance between whole-genome sequencing (WGS) and AgriPlex genotyping platforms in the Sorghum Association Panel (SAP).

Overview

This analysis pipeline performs:

VCF preprocessing - Convert VCF files to PLINK binary format with quality control
LD pruning - Generate LD-pruned marker sets for PCA and kinship calculation
Principal Component Analysis (PCA) - Population structure visualization for WGS and AgriPlex panels
Procrustes Analysis - Alignment comparison between WGS and AgriPlex PCA spaces
Kinship Matrix Construction - VanRaden genomic relationship matrices (G-matrices)
Genomic Selection Cross-Validation - k-fold CV using rrBLUP and ASReml-R
GWAS - Genome-wide association studies with PCA covariates
Visualization - Manhattan plots, QQ plots, confidence ellipses, violin plots, and more

Key Finding: The analysis demonstrates that the AgriPlex platform (1,942 markers) achieves comparable genomic prediction accuracy to WGS (~250K markers) for the traits analyzed in the SAP panel.

Repository Structure

All scripts resolve paths relative to the project root (via SBI_AGRIPLEX_ROOT or auto-detection), and assume the following standardized structure:

.
├── data/
│   ├── input/                        # Input data (VCF files, phenotypes, derived genotype formats)
│   │   ├── agp/                      # AgriPlex genotyping platform data
│   │   │   ├── vcf/                  # Required: Validated.SAP.VCF
│   │   │   ├── raw/
│   │   │   │   ├── full/             # Generated: SAP_genotypes_agp_full.raw
│   │   │   │   └── filtered/         # Generated: SAP_genotypes_agp_filtered_by_wgs.*
│   │   │   ├── bfiles/
│   │   │   │   ├── full/             # Generated: SAP_genotypes_agp_full.{bed,bim,fam}
│   │   │   │   └── filtered/         # Generated: SAP_genotypes_agp_filtered_by_wgs.{bed,bim,fam}
│   │   │   └── kinship/              # Generated: VanRaden G-matrices for AGP full
│   │   │       └── SAP_genotypes_agp_full_G_matrix.rds
│   │   ├── wgs/                      # Whole-genome sequencing data (Boatwright)
│   │   │   ├── vcf/                  # Required: SAP.raw.vcf.gz
│   │   │   ├── raw/
│   │   │   │   ├── full/             # Generated: SAP_genotypes_wgs_full.raw
│   │   │   │   ├── pruned/           # Generated: SAP_genotypes_wgs_pruned02.raw + pruning lists
│   │   │   │   └── filtered/         # Generated: SAP_genotypes_wgs_filtered_by_agp.*
│   │   │   ├── bfiles/
│   │   │   │   ├── full/             # Generated: SAP_genotypes_wgs_full.{bed,bim,fam}
│   │   │   │   ├── pruned/           # Generated: SAP_genotypes_wgs_pruned02.{bed,bim,fam}
│   │   │   │   └── filtered/         # Generated: SAP_genotypes_wgs_filtered_by_agp.{bed,bim,fam}
│   │   │   └── kinship/              # Generated: VanRaden G-matrices for WGS pruned02
│   │   │       └── SAP_genotypes_wgs_pruned02_G_matrix.rds
│   │   └── pheno/                    # Phenotype data
│   │       ├── SAP_Field_Data_Set.csv    # Required: phenotype source
│   │       └── pheno_plink_format.tsv    # Generated: PLINK phenotype table for GWAS
│   └── output/                       # Analysis outputs
│       ├── pca/                      # PCA eigenvec/eigenval + PCA-derived figures
│       │   ├── *.eigenvec
│       │   ├── *.eigenval
│       │   ├── plots/
│       │   └── procrustes/
│       ├── gwas/                     # PLINK2 GWAS outputs + figures
│       │   ├── *.glm.linear
│       │   └── plots/
│       ├── gs/                       # Genomic selection (k-fold CV) outputs + figures
│       │   ├── sap_gs_cv_results.csv
│       │   └── plots/
│       └── kinship_eval/             # Kinship evaluation (correlation/Bland-Altman/etc.)
│           ├── kinship_bland_altman.png
│           └── square/
├── scripts/                          # Analysis pipeline scripts
│   ├── 00_env_setup/                 # Environment setup (conda, optional)
│   ├── 01_plink_pheno/               # Phenotype formatting
│   ├── 02_VCF2RAW/                   # VCF to PLINK .raw conversion
│   ├── 03_VCF2Bfiles/                # VCF to PLINK binary files
│   ├── 04_filter_panels/             # Cross-platform marker filtering (QC)
│   ├── 05_calc_eigenvalues_PCA/      # Principal component analysis
│   ├── 06_GWAS/                      # Genome-wide association studies
│   ├── 07_kinship/                   # Kinship matrix construction
│   ├── 08_iterate_kfold_GP/          # Genomic selection cross-validation
│   └── 09_figures/                   # Visualization scripts
├── workflow/                         # Snakemake workflow orchestration (RECOMMENDED)
│   ├── Snakefile                     # Main workflow definition
│   ├── config/
│   │   ├── config.yaml               # Production configuration
│   │   └── config_test.yaml          # Test/development configuration
│   ├── rules/                        # Modular Snakemake rules
│   │   ├── 01_phenotype_prep.smk
│   │   ├── 02_genotype_conversion.smk
│   │   ├── 03_filtering.smk
│   │   ├── 04_pca.smk
│   │   ├── 05_gwas.smk
│   │   ├── 06_kinship.smk
│   │   ├── 07_genomic_selection.smk
│   │   └── 08_visualization.smk
│   ├── README.md                     # Snakemake workflow documentation
│   └── QUICKSTART.md                 # Quick reference guide
└── README.md                         # This file

Important Notes:

VanRaden matrices live under data/input/*/kinship/ by design: downstream GS and evaluation scripts treat these as canonical inputs (AGP full and WGS pruned02)

Prerequisites

Required Software

Command-line Tools

PLINK2 (v2.0+) - Genetic data management and analysis
- Option 1 (Recommended): Use the provided conda environment (see Installation)
- Option 2: Manual installation from https://www.cog-genomics.org/plink/2.0/
- Used for: VCF conversion, LD pruning, PCA, GWAS

R (v4.0+) with Packages

Core Packages (available via CRAN):

install.packages(c(
  "tidyverse",      # Data manipulation and visualization
  "data.table",     # Fast data reading/writing
  "ggplot2",        # Plotting
  "ggforce",        # Extended ggplot2 geoms
  "magrittr",       # Pipe operators
  "furrr",          # Parallel processing
  "future",         # Parallel backend
  "progressr"       # Progress bars
))

Genetic Analysis Packages:

install.packages(c(
  "rrBLUP",         # Ridge regression BLUP (genomic selection)
  "vcfR",           # VCF file handling
  "genio",          # PLINK file I/O
  "caret",          # Cross-validation framework
  "matrixcalc",     # Matrix operations
  "Matrix",         # Sparse matrices
  "vegan",          # Procrustes analysis
  "qqman"           # Manhattan and QQ plots
))

ASReml-R (OPTIONAL - Commercial License Required):

# ASReml-R provides more advanced mixed model capabilities
# If you don't have an ASReml license, the pipeline will use rrBLUP exclusively
# rrBLUP is a free, open-source alternative that performs comparably for most traits

Important Note on ASReml:

ASReml-R is a commercial package requiring a paid license (~$600-1000/year for academic use)

The pipeline works perfectly fine using only rrBLUP (free, open-source)

Both methods are implemented in scripts/08_iterate_kfold_GP/pred_SAP.r

Results show rrBLUP and ASReml achieve nearly identical prediction accuracies

If ASReml is unavailable, the script will simply skip ASReml results and continue with rrBLUP

Optional: ASReml License Configuration

If you have an ASReml license, place your asreml.lic file in your home directory:

cp /path/to/your/asreml.lic ~/asreml.lic

The genomic selection script (pred_SAP.r) will automatically detect and use it. If the license is not found, only rrBLUP results will be generated.

Installation

1. Clone the Repository

git clone <repository-url>
cd sorghum_validate_panel

2. Set Environment Variable (Optional)

For explicit control over the project root directory:

export SBI_AGRIPLEX_ROOT="/path/to/sorghum_validate_panel"

If not set, scripts will automatically detect the project root by searching for the data/ directory.

3. Install PLINK2

If you don't already have PLINK2 installed, use the provided conda environment:

Option A: Conda Environment Setup (Recommended)

The repository includes a conda environment specification that installs PLINK2 and all required command-line tools.

# Create the conda environment
bash scripts/00_env_setup/setup_conda_env.sh

# Activate the environment
conda activate sorghum_gs_validation-env

# Verify PLINK2 is installed
plink2 --version

Notes:

The script will create or update the sorghum_gs_validation-env environment
Use --force flag to recreate from scratch: bash scripts/00_env_setup/setup_conda_env.sh --force
You'll need to activate this environment (conda activate sorghum_gs_validation-env) whenever running the shell scripts
This environment includes all required tools: PLINK2, Snakemake (≥7.0), wget, and system utilities

Included tools:

plink2 - Main genotyping analysis tool
wget - For downloading data files
openssl and ca-certificates - For secure connections

Option B: Manual PLINK2 Installation

If you prefer not to use conda, manually install PLINK2:

# Download and install from:
# https://www.cog-genomics.org/plink/2.0/

# Verify installation
plink2 --version

4. Install R Packages

# Launch R and install required packages
install.packages(c("tidyverse", "data.table", "ggforce", "rrBLUP", 
                   "caret", "vcfR", "genio", "matrixcalc", "Matrix",
                   "furrr", "future", "progressr", "vegan", "qqman"))

Data Acquisition

REQUIRED FIRST STEP

Important: The following data files are NOT included in this repository but are required to run the analysis pipeline. All collaborators should have access to these files. Before running any scripts, you must place these files in the correct locations:

Required Files and Locations

WGS VCF File: SAP.raw.vcf.gz
- Place in: data/input/wgs/vcf/SAP.raw.vcf.gz
- Description: Sorghum Association Panel whole-genome sequencing data
- Important Note: The prep_genotypes.sh script (Step 1 of workflow) will attempt to auto-download this file from ENA if not found. However, if the download fails, you must manually place the file in this location.
AgriPlex VCF File: Validated.SAP.VCF
- Place in: data/input/agp/vcf/Validated.SAP.VCF
- Description: AgriPlex genotyping panel data
Phenotype Data: SAP_Field_Data_Set.csv
- Place in: data/input/pheno/SAP_Field_Data_Set.csv
- Description: Field trial phenotypic measurements
- Note Any SAP file with phenomic info can be used here, but trait columns may need to be updated to match the script (or vice-versa)

Setup Instructions

# 1. Create the required directory structure (if not already present)
mkdir -p data/input/wgs/vcf
mkdir -p data/input/wgs/raw/full
mkdir -p data/input/wgs/raw/pruned
mkdir -p data/input/wgs/raw/filtered
mkdir -p data/input/wgs/bfiles/full
mkdir -p data/input/wgs/bfiles/pruned
mkdir -p data/input/wgs/bfiles/filtered
mkdir -p data/input/agp/vcf
mkdir -p data/input/agp/raw/full
mkdir -p data/input/agp/raw/filtered
mkdir -p data/input/agp/bfiles/full
mkdir -p data/input/agp/bfiles/filtered
mkdir -p data/input/pheno

# 2. Copy the files from your shared location to the correct directories
# Example (adjust paths based on where you received the files):
cp /path/to/your/SAP.raw.vcf.gz data/input/wgs/raw/
cp /path/to/your/Validated.SAP.VCF data/input/agp/vcf/
cp /path/to/your/SAP_Field_Data_Set.csv data/input/pheno/

# 3. Verify the files are in place
ls -lh data/input/wgs/raw/SAP.raw.vcf.gz
ls -lh data/input/agp/vcf/Validated.SAP.VCF
ls -lh data/input/pheno/SAP_Field_Data_Set.csv

Once these files are in place, you can proceed with the Workflow steps below.

Public Data Source (WGS VCF only)

The Sorghum Association Panel (SAP) WGS data (Boatwright) can be downloaded from the European Nucleotide Archive:

Project: PRJEB51985
Direct Link: https://www.ebi.ac.uk/ena/browser/view/PRJEB51985

# Download SAP WGS VCF (example using wget)
wget -O data/input/wgs/raw/SAP.raw.vcf.gz \
  ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ125/ERZ12588732/SAP.raw.vcf.gz

Note: The prep_genotypes.sh script (used in Workflow Step 1) includes an interactive prompt to automatically download this file if not found. If the auto-download fails, you can manually download it using the command above.

Workflow

This pipeline can be executed in two ways:

Snakemake Workflow (RECOMMENDED) - Automated, reproducible, with automatic dependency management
Manual Step-by-Step - Individual script execution with manual dependency tracking

Snakemake Workflow (Recommended)

The workflow/ directory contains a complete Snakemake pipeline that orchestrates all analysis steps automatically. This is the recommended approach for most users.

Why Use Snakemake?

Automatic Checkpointing: Resume from failures without re-running completed steps
Parallel Execution: Independent tasks (WGS/AGP processing) run simultaneously
Dependency Management: Ensures correct execution order automatically
Configuration-Driven: Centralized parameter management
Reproducibility: Single command executes entire pipeline consistently

Quick Start with Snakemake

# 1. Activate conda environment (includes Snakemake ≥7.0)
conda activate sorghum_gs_validation-env

# 2. Verify Snakemake installation
snakemake --version

# 3. Preview what will run (dry run)
snakemake --dry-run --cores 1

# 4. Execute full pipeline
snakemake --cores 16

# 5. Test run (skip intensive genomic selection)
snakemake test_run --cores 8

Configuration

Edit workflow/config/config.yaml to customize analysis parameters:

# Enable/disable analysis components
analysis:
  run_matched_panels: true      # Cross-platform QC filtering
  run_gwas: true                # Association studies
  run_genomic_selection: true   # CV cross-validation (2-6 hours)
  use_asreml: true              # Use ASReml if license available

# Computational resources
resources:
  plink_threads: 16             # Threads for PLINK operations
  r_cores_fraction: 0.8         # Use 80% of cores for R parallel tasks

# Genomic selection parameters
genomic_selection:
  num_iters: 50                 # CV iterations (use 5 for testing)
  cv_folds: 5                   # Folds per iteration

Common Snakemake Usage Patterns

# Run only GWAS (skip genomic selection)
snakemake gwas_only --cores 16

# Run only genomic selection
snakemake genomic_selection_only --cores 16

# Generate all figures only
snakemake all_figures --cores 8

# Resume after failure (automatic checkpointing)
snakemake --cores 16 --rerun-incomplete

# Visualize workflow as DAG
snakemake --dag | dot -Tpng > workflow_dag.png

# Check execution plan with details
snakemake --dry-run --printshellcmds

Snakemake Workflow Structure

Input Data (VCF + Phenotypes)
  ├─> Phenotype Preparation (PLINK format)
  ├─> VCF to RAW Conversion (WGS + AGP)
  │   └─> VCF to Binary Files (PLINK .bed/.bim/.fam)
  │       ├─> LD Pruning (WGS only, r² ≤ 0.2)
  │       └─> Optional: Cross-Platform Filtering (QC)
  ├─> Principal Component Analysis (all panels)
  │   ├─> GWAS (optional, with PCA covariates)
  │   └─> PCA Visualizations (plots + Procrustes)
  ├─> Kinship Matrix Construction (VanRaden G-matrices)
  │   ├─> Genomic Selection CV (optional, rrBLUP + ASReml)
  │   ├─> Kinship Evaluation (Bland-Altman, correlation)
  │   └─> GS Visualizations (ellipses, violins)
  └─> All Publication Figures

Detailed Documentation

For comprehensive Snakemake workflow documentation, see:

workflow/README.md - Complete guide with troubleshooting and advanced usage
workflow/QUICKSTART.md - Quick reference for common tasks
workflow/config/config.yaml - All configurable parameters with descriptions

Manual Step-by-Step Workflow

If you prefer manual execution or need to run individual steps independently, follow this sequential workflow. Ensure you've completed Installation (PLINK setup) and Data Acquisition first.

Phenotype Data Preparation

Before running analyses, prepare the phenotype file for PLINK/GWAS compatibility.

Step 0: Format Phenotype Data for PLINK

Purpose: Convert the SAP field data CSV to PLINK-compatible TSV format with averaged replicate values.

Rscript scripts/01_plink_pheno/prepare_pheno_plink.R

Input:

data/input/pheno/SAP_Field_Data_Set.csv

Output:

data/input/pheno/pheno_plink_format.tsv - Averaged phenotypes in PLINK format (IID + trait columns)

Traits Processed:

Days_to_flowering_DTF
Height_inches
Exertion_inches
Seed_weight_grams_100_seeds

Note: This step is required before running GWAS (Step 6). The script averages replicate measurements for each accession.

Genomic Data Preparation

This section prepares raw genotype data (VCF files) into quality-controlled formats for downstream analyses.

Step 1: Generate PLINK .raw Files (VCF to RAW)

Purpose: Convert VCF files to additive genotype matrices (0/1/2 format). For WGS data, generates both unpruned and LD-pruned versions (MAF ≥ 0.05, missingness ≤ 10% applied before pruning). For AgriPlex, generates unpruned only (panel is pre-designed). The pruned SNP list (.prune.in) produced here is reused by downstream scripts (e.g. prune_wgs_bfiles.sh) to build binary files without re-running LD pruning.

# Process both WGS and AgriPlex VCF files
bash scripts/02_VCF2RAW/prep_genotypes.sh

Helpful Feature: If the WGS VCF file (SAP.raw.vcf.gz) is not found, this script will prompt you to automatically download it from ENA. If the download fails, you'll need to manually obtain the file (see Data Acquisition).

Outputs:

WGS Unpruned (QC filtered only):

data/input/wgs/raw/full/SAP_genotypes_wgs_full.raw

WGS LD Pruned (r² ≤ 0.2):

data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02.raw
data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.in
data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.out

AgriPlex (if VCF provided):

data/input/agp/raw/full/SAP_genotypes_agp_full.raw

Step 2: Build PLINK Binary Files (VCF to Bfiles)

Purpose: Convert raw VCF files to PLINK binary format. This step builds the core genomic resources needed for all downstream analyses (PCA, GWAS, genomic selection). The prune_wgs_bfiles.sh script reuses the .prune.in list from step 1 (no redundant LD pruning).

# Build full PLINK binary files from VCFs
bash scripts/03_VCF2Bfiles/build_full_wgs_agp_bfiles.sh

# Build pruned WGS bfiles (uses .prune.in from step 1)
bash scripts/03_VCF2Bfiles/prune_wgs_bfiles.sh

Outputs:

Full WGS and AGP Bfiles:

data/input/wgs/bfiles/full/SAP_genotypes_wgs_full.{bed,bim,fam} (PLINK binary files)
data/input/agp/bfiles/full/SAP_genotypes_agp_full.{bed,bim,fam}

LD-Pruned WGS Bfiles (r² ≤ 0.2, MAF ≥ 0.05, missingness ≤ 10%):

data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02.{bed,bim,fam}
data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.in (kept SNPs)
data/input/wgs/bfiles/pruned/SAP_genotypes_wgs_pruned02_pruning.prune.out (removed SNPs)

Step 3: Filter Panels for QC Matching

Purpose: Perform bidirectional filtering to ensure both WGS and AgriPlex panels contain only the intersection of markers present in both platforms, enabling direct comparison and quality control.

# Bidirectionally filter WGS and AgriPlex panels
bash scripts/04_filter_panels/filter_wgs_by_agp.sh

Inputs:

Requires the AgriPlex full .raw at data/input/agp/raw/full/SAP_genotypes_agp_full.raw (generated in Step 1).
If your AGP .raw lives elsewhere, pass it as the 3rd argument:

# bash filter_wgs_by_agp.sh [wgs_out_prefix] [agp_out_prefix] [agp_raw] [wgs_vcf]
bash scripts/04_filter_panels/filter_wgs_by_agp.sh \
  data/input/wgs/raw/filtered/SAP_genotypes_wgs_filtered_by_agp \
  data/input/agp/raw/filtered/SAP_genotypes_agp_filtered_by_wgs \
  path/to/your_agp.raw

Process:

Extract markers from AgriPlex panel
Filter WGS VCF to markers present in AgriPlex
Extract markers successfully found in WGS
Filter AgriPlex to markers that exist in filtered WGS

Outputs:

data/input/wgs/raw/filtered/SAP_genotypes_wgs_filtered_by_agp.raw - WGS filtered to AgriPlex markers
data/input/agp/raw/filtered/SAP_genotypes_agp_filtered_by_wgs.raw - AgriPlex filtered to WGS markers
data/input/wgs/raw/filtered/agp_markers.list - Original AgriPlex marker list
data/input/wgs/raw/filtered/wgs_markers.list - Markers found in WGS (intersection)

Step 4: Build Matched PLINK Binary Files (Matched Marker Set)

Purpose: Build PLINK bfiles for both WGS and AgriPlex using the matched marker list produced in Step 3.

# Build matched/filtered panels (same markers, overlapping samples)
bash scripts/03_VCF2Bfiles/build_matched_wgs_agp_bfiles.sh

Outputs:

data/input/wgs/bfiles/filtered/SAP_genotypes_wgs_filtered_by_agp.{bed,bim,fam}
data/input/agp/bfiles/filtered/SAP_genotypes_agp_filtered_by_wgs.{bed,bim,fam}

PLINK Analyses

Principal Component Analysis (PCA)

Purpose: Compute population structure for both marker panels.

# Run PCA on all datasets (WGS unpruned, WGS pruned, AGP full, filtered panels)
bash scripts/05_calc_eigenvalues_PCA/pca.sh

Outputs:

data/output/PCA/wgs_full_pca.eigenvec and .eigenval
data/output/PCA/wgs_pruned_pca.eigenvec and .eigenval
data/output/PCA/agp_pca.eigenvec and .eigenval
data/output/PCA/wgs_filtered_overlap_pca.*
data/output/PCA/agp_filtered_overlap_pca.*

Note: PCA outputs use uppercase PCA/ directory for compatibility with legacy scripts.

Visualize PCA Results:

Rscript scripts/09_figures/pca_SAP.r
Rscript scripts/09_figures/pca_plink2.r

Genome-Wide Association Studies (GWAS)

Purpose: Identify marker-trait associations while controlling for population structure.

# Run GWAS for all traits using PCA covariates
bash scripts/06_GWAS/run_gwas.sh

Traits Analyzed:

Days to flowering (DTF)
Plant height (inches)
Panicle exertion (inches)
Seed weight (grams per 100 seeds)

Outputs:

data/output/gwas/wgs_pruned_<trait>.glm.linear - WGS GWAS results
data/output/gwas/agp_full_<trait>.glm.linear - AgriPlex GWAS results

Visualize GWAS Results:

# Manhattan + QQ plots for plant height
Rscript scripts/09_figures/manhattan_height.r

# Manhattan + QQ plots for all traits
Rscript scripts/09_figures/manhattan_all.r

Outputs:

data/output/gwas/plots/<trait>_manhattan.png
data/output/gwas/plots/<trait>_qq.png

R Analyses

Kinship Matrix Construction + Evaluation

Purpose:

Build VanRaden genomic relationship matrices (G-matrices) for AGP full and WGS pruned02
Evaluate agreement (correlation/Bland-Altman/visualizations)

# Build VanRaden matrices (writes to data/input/*/kinship/)
Rscript scripts/07_kinship/create_vanraden_k2.r --preset agp
Rscript scripts/07_kinship/create_vanraden_k2.r --preset wgs

# Evaluate kinship agreement (writes to data/output/kinship_eval/)
Rscript scripts/07_kinship/bland-altman.r
Rscript scripts/09_figures/visualize_kin.r

Outputs:

VanRaden matrices (canonical inputs for GS):
- data/input/agp/kinship/SAP_genotypes_agp_full_G_matrix.rds
- data/input/wgs/kinship/SAP_genotypes_wgs_pruned02_G_matrix.rds
Kinship evaluation:
- data/output/kinship_eval/*

Genomic Selection Cross-Validation

Purpose: Evaluate genomic prediction accuracy using k-fold cross-validation with both rrBLUP and ASReml.

Note: This is the most computationally intensive step. It runs 50 iterations of 5-fold cross-validation across multiple traits in parallel.

Rscript scripts/08_iterate_kfold_GP/pred_SAP.r

Configuration (edit within the script if needed):

NUM_ITERS - Number of CV iterations (default: 50)
CV_FOLDS - Number of folds per iteration (default: 5)
FUTURE_CORES_FRACTION - CPU usage fraction (default: 0.8)

Computational Requirements:

Runtime: 2-6 hours on a modern workstation (depends on core count)
RAM: 16-32 GB recommended
CPU: Benefits from 8+ cores

Outputs:

data/output/gs/sap_gs_cv_results.csv - Main results file
- Columns: trait, k2 (marker panel), method (rrBLUP/ASReml), fold, test_cor_pearson, test_bias, iter

Important: If ASReml is not available, the script will automatically skip ASReml analysis and only report rrBLUP results. No code changes are needed.

Visualization

Purpose: Generate publication-quality figures for PCA, Procrustes, genomic selection, and GWAS results.

Procrustes Analysis

Purpose: Assess alignment between WGS and AgriPlex PCA spaces.

Rscript scripts/09_figures/PCA_procrustes.r \
  --panel1=data/input/agp/raw/full/SAP_genotypes_agp_full.raw \
  --panel2=data/input/wgs/raw/pruned/SAP_genotypes_wgs_pruned02.raw \
  --pcs=5 \
  --permutations=999

Outputs:

data/output/pca/procrustes/*
data/output/pca/plots/*

Alternative Procrustes Scripts:

PCA_procrustes_plink2_filtered.r - Filtered marker sets
PCA_procrustes_plink2_prune02_vs_full.r - Compare pruned vs full WGS

Genomic Selection Plots

Purpose: Visualize genomic prediction accuracy and bias.

# Confidence ellipses (accuracy vs bias)
Rscript scripts/09_figures/confidence_ellipses.r

# Violin plots (accuracy distributions by trait)
Rscript scripts/09_figures/violin_sap_pred.r

# Combined patchwork figure
Rscript scripts/09_figures/patchwork.r

Outputs:

data/output/gs/plots/*.png

Kinship Matrix Visualization

Purpose: Visualize genomic relationship matrices.

Rscript scripts/09_figures/visualize_kin.r

Script Reference

Shell Scripts (`scripts/`)

Phenotype Preparation (`01_plink_pheno/`)

Script	Purpose	Key Outputs
`01_plink_pheno/prepare_pheno_plink.R`	Convert SAP field data CSV to PLINK-compatible TSV with averaged replicates	`data/input/pheno/pheno_plink_format.tsv`

VCF Conversion (`02_VCF2RAW/`, `03_VCF2Bfiles/`)

Script	Purpose	Key Outputs
`02_VCF2RAW/prep_genotypes.sh`	Convert VCF → PLINK .raw (0/1/2 additive matrix). WGS: full + LD-pruned (MAF ≥ 0.05, geno ≤ 0.10); AGP: full only	WGS: `data/input/wgs/raw/full/.raw` & `data/input/wgs/raw/pruned/.raw` + `.prune.in`; AGP: `data/input/agp/raw/full/*.raw`
`03_VCF2Bfiles/build_full_wgs_agp_bfiles.sh`	Convert WGS and AgriPlex VCFs → PLINK binaries (no filtering)	WGS: `data/input/wgs/bfiles/full/.{bed,bim,fam}`; AGP: `data/input/agp/bfiles/full/.{bed,bim,fam}`
`03_VCF2Bfiles/prune_wgs_bfiles.sh`	Build pruned WGS bfiles using `.prune.in` from step 02 (no redundant LD pruning)	`data/input/wgs/bfiles/pruned/*.{bed,bim,fam}`
`03_VCF2Bfiles/build_matched_wgs_agp_bfiles.sh`	Build matched/filtered bfiles (same markers across platforms)	`data/input//bfiles/filtered/.{bed,bim,fam}`

Population Structure (`05_calc_eigenvalues_PCA/`)

Script	Purpose	Key Outputs
`05_calc_eigenvalues_PCA/pca.sh`	PCA on WGS full, WGS pruned, AGP full, and filtered panels	`data/output/PCA/{wgs_full,wgs_pruned,agp,wgs_filtered_overlap,agp_filtered_overlap}_pca.{eigenvec,eigenval}`

Association Mapping (`06_GWAS/`)

Script	Purpose	Key Outputs
`06_GWAS/run_gwas.sh`	GWAS with PCA covariates (10 PCs)	`data/output/gwas/*.glm.linear`

Cross-Platform Filtering (`04_filter_panels/`)

Script	Purpose	Key Outputs
`04_filter_panels/filter_wgs_by_agp.sh`	Subset WGS to markers present in AgriPlex panel (defaults to `data/input/agp/raw/full/SAP_genotypes_agp_full.raw`, or pass `agp_raw`)	Marker lists + filtered `.raw` under `data/input/*/raw/filtered/`

R Scripts (`scripts/`)

Kinship Computation + Evaluation (`07_kinship/`)

Script	Purpose	Method	Key Outputs
`07_kinship/create_vanraden_k2.r`	Compute VanRaden G-matrix from PLINK .raw (AGP full, WGS pruned02 presets)	`rrBLUP::A.mat()`	`data/input//kinship/_G_matrix.rds`
`07_kinship/bland-altman.r`	Bland–Altman agreement plots for kinship matrices	Bland–Altman	`data/output/kinship_eval/*`
`07_kinship/kinship_corr.r`	Correlation between WGS and AGP kinship matrices	Pearson correlation	Console output + optional files

Genomic Selection (`08_iterate_kfold_GP/`)

Script	Purpose	Methods	Runtime	Key Outputs
`pred_SAP.r`	k-fold CV genomic prediction	rrBLUP, ASReml (optional)	2-6 hours	`sap_gs_cv_results.csv`

Key Features:

Parallelized across CV iterations using furrr
Automatic handling of missing ASReml license
Progress bar via progressr
Handles traits with/without replication
Computes Pearson accuracy and regression bias

Visualization (`09_figures/`)

PCA & Procrustes

Script	Purpose	Figure Type	Key Outputs
`pca_SAP.r`	Simple PCA scatter plots	2D scatter (PC1 vs PC2)	PNG plots
`pca_plink2.r`	PCA plots for PLINK2-generated PCs	2D scatter with variance labels	PNG plots
`PCA_procrustes.r`	Procrustes alignment between marker panels	Rotated PC spaces with residuals	PNG/PDF, summary CSV
`PCA_procrustes_plink2_filtered.r`	Procrustes for filtered marker sets	Rotated alignment plot	PNG/PDF
`PCA_procrustes_plink2_prune02_vs_full.r`	Compare pruned vs full WGS PCA	Rotated alignment plot	PNG/PDF

Genomic Selection Results

Script	Purpose	Figure Type	Key Outputs
`confidence_ellipses.r`	Accuracy vs bias distributions	95% confidence ellipses by method/platform	PNG (separate for rrBLUP and ASReml)
`violin_sap_pred.r`	Accuracy distributions by trait	Violin + boxplot panels	PNG multipanel
`patchwork.r`	Combined GS summary figure	Multi-panel composite	PNG composite

GWAS Results

Script	Purpose	Figure Type	Key Outputs
`manhattan_height.r`	Manhattan + QQ plots for plant height	Standard GWAS plots	PNG manhattan + QQ
`manhattan_all.r`	Manhattan + QQ plots for all traits	Standard GWAS plots	PNG manhattan + QQ (per trait)

Kinship Visualization

Script	Purpose	Figure Type	Key Outputs
`visualize_kin.r`	Heatmap of kinship matrices	Heatmap with dendrograms	PNG heatmaps

Utility Scripts

Script	Purpose	Key Outputs
`07_kinship/kinship_corr.r`	Compute correlation between WGS and AgriPlex kinship	Console output, optional plot

Output Files

Key Result Files

File Path	Description	Format
`data/output/gs/sap_gs_cv_results.csv`	Cross-validation results (all traits, folds, iterations)	CSV
`data/output/gs/plots/*.png`	Genomic selection figures	PNG
`data/output/pca/*.eigenvec`	Principal component scores	TSV
`data/output/pca/*.eigenval`	Variance explained per PC	TSV
`data/output/gwas/*.glm.linear`	GWAS association results	PLINK2 GLM format
`data/output/gwas/plots/*.png`	GWAS figures	PNG
`data/input//kinship/_G_matrix.rds`	VanRaden kinship matrices (canonical inputs)	RDS (R serialized)
`data/output/kinship_eval/*`	Kinship evaluation outputs	PNG/TSV/etc.
`data/output/pca/procrustes/*`	Procrustes correlation statistics + reports	CSV/TXT

GS Results Format (`sap_gs_cv_results.csv`)

Column	Description
`trait`	Phenotype name
`k2`	Marker panel (`wgs` or `agp`)
`method`	GS method (`rrblup` or `asreml`)
`fold`	Fold identifier (e.g., `Fold1`, `Fold2`, ...)
`test_cor_pearson`	Pearson correlation (observed vs predicted)
`test_bias`	Regression slope (observed ~ predicted)
`iter`	CV iteration number (1-50)

Interpretation:

test_cor_pearson: Genomic prediction accuracy (higher = better, range 0-1)
test_bias: Prediction bias (closer to 1.0 = unbiased)

Notes on Path Management

All scripts use dynamic path resolution to ensure portability across systems:

Environment Variable: Set SBI_AGRIPLEX_ROOT to explicitly define the project root
Auto-detection: Scripts search upward from their location for a data/ directory
No hardcoded paths: All file paths are constructed relative to PROJECT_ROOT

Shell scripts:

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd -P)"
PROJECT_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd -P)"

R scripts:

get_project_root <- function() {
  env_root <- Sys.getenv("SBI_AGRIPLEX_ROOT", unset = "")
  if (nzchar(env_root)) {
    return(normalizePath(env_root, mustWork = TRUE))
  }
  # ... auto-detection logic ...
}
PROJECT_ROOT <- get_project_root()

Computational Considerations

Resource Requirements

Step	Runtime	RAM	CPU Cores
VCF Conversion	10-30 min	8 GB	16 threads (configurable)
PCA	5-15 min	8 GB	1-4 cores
Kinship Calculation	10-30 min	16 GB	1 core
Genomic Selection CV	2-6 hours	16-32 GB	8+ cores recommended
GWAS	30-60 min	8 GB	16 threads (configurable)
Visualization	5-10 min per script	4 GB	1 core

Parallelization

The genomic selection script (pred_SAP.r) uses parallel processing:

Controlled by FUTURE_CORES_FRACTION (default: 0.8 = 80% of available cores)
Example: On a 16-core machine, uses ~13 cores
Adjust in the script if needed:

FUTURE_CORES_FRACTION <- 0.5  # Use only 50% of cores

Troubleshooting

Common Issues

Issue: ASReml license not found
Solution: This is expected if you don't have an ASReml license. The script will automatically use rrBLUP only. No action needed.

Issue: VCF file not found
Solution: Download the SAP WGS VCF from ENA (see Data Acquisition). Some scripts will offer to download it automatically.

Issue: PLINK2 not found in PATH
Solution: Install PLINK2 from https://www.cog-genomics.org/plink/2.0/ and ensure it's in your $PATH.

Issue: R package 'asreml' not found
Solution: Expected if you don't have an ASReml license. The genomic selection script will skip ASReml and proceed with rrBLUP.

Issue: Out of memory during genomic selection
Solution:

Reduce NUM_ITERS or CV_FOLDS in pred_SAP.r
Decrease FUTURE_CORES_FRACTION to reduce parallel load
Close other applications

Issue: Scripts can't find data/ directory
Solution:

Run scripts from the project root directory, OR
Set SBI_AGRIPLEX_ROOT environment variable, OR
Ensure data/ directory exists relative to script locations

Citation

If you use this pipeline or data in your research, please cite:

[Your citation here - include DOI, journal, authors, etc.]

Data Source:

Boatwright, J.L., et al. (2022). Sorghum Association Panel whole-genome sequencing data. 
European Nucleotide Archive, Project PRJEB51985.
https://www.ebi.ac.uk/ena/browser/view/PRJEB51985

Contact

For questions, issues, or contributions, please contact:

Author: [Your Name]
Email: [Your Email]
Institution: [Your Institution]

License

[Specify your license here - e.g., MIT, GPL-3, etc.]

Acknowledgments

Sorghum Association Panel data provided by [appropriate acknowledgment]
ASReml-R developed by VSN International
rrBLUP package developed by Jeffrey Endelman
PLINK2 developed by Christopher Chang

Version History

v1.0 (YYYY-MM-DD) - Initial release

Happy analyzing! 🌾

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
scripts		scripts
workflow		workflow
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Sorghum Marker Panel Analysis

Table of Contents

Overview

Repository Structure

Prerequisites

Required Software

Command-line Tools

R (v4.0+) with Packages

Optional: ASReml License Configuration

Installation

1. Clone the Repository

2. Set Environment Variable (Optional)

3. Install PLINK2

Option A: Conda Environment Setup (Recommended)

Option B: Manual PLINK2 Installation

4. Install R Packages

Data Acquisition

REQUIRED FIRST STEP

Required Files and Locations

Setup Instructions

Public Data Source (WGS VCF only)

Workflow

Snakemake Workflow (Recommended)

Why Use Snakemake?

Quick Start with Snakemake

Configuration

Common Snakemake Usage Patterns

Snakemake Workflow Structure

Detailed Documentation

Manual Step-by-Step Workflow

Phenotype Data Preparation

Step 0: Format Phenotype Data for PLINK

Genomic Data Preparation

Step 1: Generate PLINK .raw Files (VCF to RAW)

Step 2: Build PLINK Binary Files (VCF to Bfiles)

Step 3: Filter Panels for QC Matching

Step 4: Build Matched PLINK Binary Files (Matched Marker Set)

PLINK Analyses

Principal Component Analysis (PCA)

Genome-Wide Association Studies (GWAS)

R Analyses

Kinship Matrix Construction + Evaluation

Genomic Selection Cross-Validation

Visualization

Procrustes Analysis

Genomic Selection Plots

Kinship Matrix Visualization

Script Reference

Shell Scripts (scripts/)

Phenotype Preparation (01_plink_pheno/)

VCF Conversion (02_VCF2RAW/, 03_VCF2Bfiles/)

Population Structure (05_calc_eigenvalues_PCA/)

Association Mapping (06_GWAS/)

Cross-Platform Filtering (04_filter_panels/)

R Scripts (scripts/)

Kinship Computation + Evaluation (07_kinship/)

Genomic Selection (08_iterate_kfold_GP/)

Visualization (09_figures/)

PCA & Procrustes

Genomic Selection Results

GWAS Results

Kinship Visualization

Utility Scripts

Output Files

Key Result Files

GS Results Format (sap_gs_cv_results.csv)

Notes on Path Management

Computational Considerations

Resource Requirements

Parallelization

Troubleshooting

Common Issues

Citation

Contact

License

Acknowledgments

Shell Scripts (`scripts/`)

Phenotype Preparation (`01_plink_pheno/`)

VCF Conversion (`02_VCF2RAW/`, `03_VCF2Bfiles/`)

Population Structure (`05_calc_eigenvalues_PCA/`)

Association Mapping (`06_GWAS/`)

Cross-Platform Filtering (`04_filter_panels/`)

R Scripts (`scripts/`)

Kinship Computation + Evaluation (`07_kinship/`)

Genomic Selection (`08_iterate_kfold_GP/`)

Visualization (`09_figures/`)

GS Results Format (`sap_gs_cv_results.csv`)

Packages