This repository contains simulation code and analysis scripts for evaluating how experimental design impacts prediction accuracy in plant breeding programs. The study demonstrates that randomizing breeding materials across cohorts improves selection accuracy compared to traditional cohort-separated trials, particularly when data is incomplete or sparse. A detailed user guide covering simulation parameters and specifications is available in docs/simulations_parameters_and_specifications_guide.md (also provided as a .pdf).
Key Finding: Complete randomization (CR) across breeding cohorts outperforms restricted randomization (RR) by up to 15.7% when phenotypic or genomic data is limited, though both designs perform equivalently with complete genomic and phenotypic datasets.
Ackerman, A.J. & Rutkoski, J. (2025). Randomization across breeding cohorts improves accuracy of conventional and genomic selection. Manuscript in preparation.
Breeding programs typically evaluate materials in separate yield trials based on their advancement stage (cohorts). This spatial separation can confound genetic effects with non-genetic trial effects, potentially reducing selection accuracy—especially when:
- Genomic relationship data is unavailable (conventional BLUP)
- Testing designs are sparse (not all lines in all environments)
- Genotype-by-environment (G×E) interactions are strong
We compared two randomization strategies:
- Restricted Randomization (RR): Traditional approach where cohorts occupy separate trials within environments
- Complete Randomization (CR): Alternative approach where all cohorts are randomized together (e.g., p-rep designs)
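The contrast between the two schemes can be illustrated with a toy allocation. This is a minimal sketch, not the repository's design-generation code: the 24 lines, the cohort sizes, and the `unit` column (a trial under RR, a block of plots under CR) are all hypothetical.

```r
set.seed(1)  # reproducible toy example

# Hypothetical population: 4 cohorts with 6 lines each
lines <- data.frame(
  line   = sprintf("L%02d", 1:24),
  cohort = rep(c("S1", "S2", "S3", "S4"), each = 6)
)

# Restricted randomization (RR): each cohort is its own trial, and
# plot order is randomized only within that trial.
rr <- lines[order(lines$cohort, runif(nrow(lines))), ]
rr$unit <- rr$cohort

# Complete randomization (CR): cohort labels are ignored and all lines
# are shuffled across the same number of equally sized field units.
cr <- lines[sample(nrow(lines)), ]
cr$unit <- rep(paste0("U", 1:4), each = 6)

# Cohort-by-unit counts: RR is diagonal (cohort and unit are fully
# confounded); CR typically spreads each cohort over several units.
rr_tab <- table(rr$cohort, rr$unit)
cr_tab <- table(cr$cohort, cr$unit)
print(rr_tab)
print(cr_tab)
```

Under RR, any effect of the field unit is inseparable from the cohort effect, which is the confounding described above. Under CR, each unit usually contains lines from several cohorts, so the two effects can be estimated separately.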
.
├── R/
│ ├── breedSimV7_BLUP_1Phase_1-16.R # Conventional BLUP simulation
│ ├── breedSimV9_gBLUP_1-26.R # Genomic BLUP (balanced MET)
│ └── breedSimV9_GEBV_1-26.R # Genomic-enabled sparse testing (unbalanced MET)
├── data/
│ └── data_accessibility.md
├── docs/
│ ├── simulations_parameters_and_specifications_guide.md # Detailed parameter specifications
│ └── results_summary.md # Key findings summary
├── LICENSE
└── README.md
Real wheat breeding lines from the University of Illinois wheat breeding program:
- 4 breeding cohorts (S1-S4) representing different advancement stages
- 3,102 experimental lines + check varieties
- 9,262 SNP markers (GBS, MAF > 0.05, <10% missing data)
Randomization schemes:
- Restricted Randomization (RR): RCBD or IBD designs with cohorts in separate trials
- Complete Randomization (CR): p-rep designs with all cohorts randomized together
Replication levels:
- High: S4 replicated 3×, S1-S3 replicated 2×
- Intermediate: All cohorts replicated 2×
- Low: S1-S2 unreplicated, S3-S4 replicated 2×
Environmental parameters:
- 5 environments (location × year combinations)
- Inter-environment genetic correlation: r = 0.5 (fixed)
- Intra-environment genetic correlation (rGE): 0.2, 0.4, 0.6, 0.8, 1.0
- Heritability (h²): 0.2, 0.4, 0.6, 0.8
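One way to enumerate the factorial of these settings is sketched below. The column names and the `RR`/`CR` labels are illustrative assumptions; the simulation scripts may organize their parameter grid differently.

```r
# Full factorial of design x replication x rGE x heritability.
# Names and labels are hypothetical, chosen to mirror the lists above.
grid <- expand.grid(
  design = c("RR", "CR"),
  repCat = c("High", "Intermediate", "Low"),
  rGE    = c(0.2, 0.4, 0.6, 0.8, 1.0),
  h2     = c(0.2, 0.4, 0.6, 0.8),
  stringsAsFactors = FALSE
)

n_combinations <- nrow(grid)  # 2 * 3 * 5 * 4
print(n_combinations)
head(grid)
```

Crossing two designs, three replication levels, five rGE values, and four heritabilities gives 120 rows, which agrees with the number of parameter combinations quoted for the BLUP simulation.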
Conventional Selection (BLUP):
y_ijk = μ + e_i + b_k(i) + g_j(i) + ε_ijk
Genomic Selection (GBLUP):
y_ijk = μ + e_i + b_k(i) + g_j(i) + ε_ijk
where g ~ N(0, G_m ⊗ G_e)
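The Kronecker structure can be made concrete with a small numerical sketch. It assumes that G_m is a marker-based relationship matrix among lines and that G_e is a correlation matrix among environments with a common off-diagonal value of 0.5; both interpretations, the toy sizes, and the small ridge added to G_m are assumptions for illustration only.

```r
set.seed(42)

n_lines <- 3
n_env   <- 5

# Assumed G_e: compound-symmetric correlation among environments
G_e <- matrix(0.5, n_env, n_env)
diag(G_e) <- 1

# Assumed G_m: toy relationship matrix from random -1/0/1 markers,
# with a small ridge so the matrix is safely positive definite
M   <- matrix(sample(c(-1, 0, 1), n_lines * 50, replace = TRUE), nrow = n_lines)
G_m <- tcrossprod(M) / ncol(M) + diag(0.01, n_lines)

# Covariance of the stacked line-by-environment genetic effects
K <- kronecker(G_m, G_e)  # (n_lines * n_env) x (n_lines * n_env)

# Draw one realization g ~ N(0, K) via the Cholesky factor
g <- drop(t(chol(K)) %*% rnorm(nrow(K)))

# kronecker(G_m, G_e) orders effects by line, then environment within line
g_mat <- matrix(g, nrow = n_lines, byrow = TRUE,
                dimnames = list(paste0("line", 1:n_lines), paste0("E", 1:n_env)))
print(round(g_mat, 2))
```

Each diagonal block of `K` equals the corresponding line's diagonal element of `G_m` times `G_e`, so related lines share correlated effects across all environments.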
Sparse Testing (GEBV):
- S1 and S2 cohorts evaluated in only 1 of 5 environments
- S3 and S4 cohorts evaluated in all 5 environments
- Prediction accuracy assessed for untested cohorts
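A sparse allocation of this kind can be sketched as follows. The example assumes that each sparse cohort is assigned to a single randomly chosen environment as a whole; the scripts may instead assign environments per line. Line names and cohort sizes are hypothetical.

```r
set.seed(7)

envs  <- paste0("E", 1:5)
lines <- data.frame(
  line   = sprintf("G%02d", 1:12),
  cohort = rep(c("S1", "S2", "S3", "S4"), each = 3),
  stringsAsFactors = FALSE
)

# Balanced MET: every line in every environment (cross join)
met <- merge(lines, data.frame(env = envs, stringsAsFactors = FALSE), by = NULL)

# Assumption: each sparse cohort is tested in one random environment
sparse_cohorts <- c("S1", "S2")
tested_env <- setNames(sample(envs, length(sparse_cohorts), replace = TRUE),
                       sparse_cohorts)

# Keep all plots of S3/S4, and only the assigned environment for S1/S2
is_sparse <- met$cohort %in% sparse_cohorts
keep <- !is_sparse | met$env == tested_env[as.character(met$cohort)]
sparse_met <- met[keep, ]

# Number of environments in which each line is observed
envs_per_line <- tapply(sparse_met$env, sparse_met$line,
                        function(x) length(unique(x)))
print(envs_per_line)
```

The line-by-environment combinations dropped from `met` are the ones whose genetic values have to be predicted rather than observed.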
Difference-in-Differences (DiD) Framework:
E(Y | Z, T) = β₀ + β₁I(T=1) + β₂Z + β₃I(T=1)Z
Where:
- Y = prediction accuracy (rbv)
- Z = randomization scheme (0=RR, 1=CR)
- T = parameter conditions (0=optimal, 1=suboptimal)
- β₃ = DiD estimator (differential response between designs; reported as δ̂ in the results)
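The interaction coefficient can be seen directly in a saturated linear model fitted to four cell means. The accuracies below are hypothetical values chosen for illustration, not study results, and `T_sub` is an illustrative name for the indicator I(T=1).

```r
# Hypothetical mean accuracies for the four design-by-condition cells
cells <- data.frame(
  Z     = c(0, 1, 0, 1),             # 0 = RR, 1 = CR
  T_sub = c(0, 0, 1, 1),             # 0 = optimal, 1 = suboptimal
  Y     = c(0.80, 0.81, 0.60, 0.70)  # prediction accuracy
)

# E(Y | Z, T) = b0 + b1*T + b2*Z + b3*T*Z
fit  <- lm(Y ~ T_sub * Z, data = cells)
beta <- coef(fit)
print(round(beta, 3))

# The interaction equals the difference of the two CR-minus-RR gaps
did_manual <- (0.70 - 0.60) - (0.81 - 0.80)
print(did_manual)
```

In this toy case CR leads RR by 0.01 under optimal conditions and by 0.10 under suboptimal conditions, so the DiD estimate is 0.09: the accuracy loss from suboptimal conditions is 0.09 smaller under CR than under RR.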
- R (≥ 4.0.0)
- ASReml-R (v4.2) - Requires license
# Core packages
library(data.table)
library(tidyverse)
library(magrittr)
# Parallel processing
library(furrr)
library(future)
# Statistical modeling
library(asreml4) # Commercial license required
library(MASS)
library(matrixcalc)
library(MBESS)
# Design generation
library(caret)
library(purrr)
git clone https://github.com/yourusername/breeding-cohort-randomization.git
cd breeding-cohort-randomization
Due to data sharing restrictions, genotypic data is not included. To replicate:
# Load your own marker data (matrix format: lines × markers)
# Rows = genotype names, Columns = SNP markers coded as -1, 0, 1
geno <- your_marker_matrix
# Calculate genomic relationship matrix
library(rrBLUP)
K2 <- A.mat(geno) # A.mat expects -1/0/1 coding; use A.mat(geno - 1) only if markers are coded 0/1/2
# Save for simulation scripts
saveRDS(geno, "data/geno.RData")
saveRDS(K2, "data/K2.RData")
Conventional Selection (BLUP):
Rscript R/breedSimV7_BLUP_1Phase_1-16.R
- 450 iterations
- Tests all 120 parameter combinations
- Output: output/BLUPoutput/BLUPresults_final/
Genomic Selection (balanced MET):
Rscript R/breedSimV9_gBLUP_1-26.R
- 50 iterations
- Low replication scenarios only
- Output: output/gBLUPoutput/gBLUPresults_final/
Genomic Prediction (sparse testing):
Rscript R/breedSimV9_GEBV_1-26.R
- 100 iterations per heritability level
- Sparse testing scenarios
- Output: output/GEBVoutput/GEBVresults_final/
Each simulation outputs CSV files containing:
- germplasmName: Genotype identifier
- cor: Prediction accuracy (correlation between true and predicted breeding values)
- design: RCBD or PREP
- heritability: Simulated h²
- nLoc: Number of environments
- macroGxE: Inter-environment genetic correlation
- microGxE: Intra-environment genetic correlation (rGE)
- repCat: Replication category
- group: Overall, by-cohort, or by-test results
- model: BLUP, gBLUP, or GEBV
- iteration: Simulation replicate number
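Output in this layout can be summarized with base R. The sketch below uses a small hypothetical data frame in place of a real results file, since output file names are not listed here. It assumes that `RCBD` marks the restricted design and `PREP` the completely randomized design.

```r
# Hypothetical stand-in for one results file (values are made up)
res <- data.frame(
  design    = rep(c("RCBD", "PREP"), each = 4),
  microGxE  = rep(c(0.2, 0.2, 1.0, 1.0), times = 2),
  iteration = rep(1:2, times = 4),
  cor       = c(0.50, 0.54, 0.80, 0.82, 0.66, 0.70, 0.81, 0.83)
)

# Mean accuracy per rGE level (rows) and design (columns)
mean_acc <- tapply(res$cor, list(rGE = res$microGxE, design = res$design), mean)
print(mean_acc)

# CR advantage over RR at each rGE level, in accuracy units
cr_advantage <- mean_acc[, "PREP"] - mean_acc[, "RCBD"]
print(cr_advantage)
```

With real output, `res` would come from `read.csv()` on a file in the relevant results directory, and further columns such as `heritability` or `repCat` could be added to the grouping list.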
| Scenario | CR Advantage | Significance |
|---|---|---|
| Overall (incomplete data) | +8.3 pp (11.7%) | *** |
| Low replication | +10.1 pp (15.7%) | *** |
| rGE = 0.2 | +18.6 pp | *** |
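The table gives advantages both in percentage points (pp) and as percentages. A minimal sketch of the distinction follows, assuming the percentage is the gain relative to RR accuracy; the input accuracies are hypothetical and the helper `cr_advantage_summary` is illustrative, not part of the repository.

```r
# Absolute (pp) and relative (%) advantage of CR over RR
cr_advantage_summary <- function(acc_rr, acc_cr) {
  diff <- acc_cr - acc_rr
  c(pp = 100 * diff,                # percentage-point difference
    rel_pct = 100 * diff / acc_rr)  # gain relative to RR accuracy
}

# Hypothetical accuracies, not study results
out <- cr_advantage_summary(acc_rr = 0.50, acc_cr = 0.55)
print(out)
```

Here a 5 pp difference corresponds to a 10% relative gain because the RR baseline is 0.50; the same pp difference is a smaller relative gain when baseline accuracy is higher.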
With complete phenotypic and genomic data:
- No significant difference between designs (rbv = 0.888 vs 0.885)
- Genomic relationships eliminate confounding effects
| Condition | CR Advantage | DiD coefficient |
|---|---|---|
| Overall | +1.5% | δ̂ = 0.018 ** |
| h² = 0.2, rGE = 0.2 | +5.5% | Highly significant |
Most Critical: rGE (intra-environment genetic correlation)
- DiD: δ̂ = 0.082, p < 0.001 (BLUP)
- Each 0.2 decrease in rGE corresponds to an 8.2 pp advantage for CR
Moderate Impact: Replication level
- DiD: δ̂ = 0.005, p < 0.001
- Largest effect when moving to unreplicated entries
Minimal Impact: Heritability
- Both designs respond similarly to decreasing h²
- DiD: non-significant across all models
Complete randomization (CR) is most beneficial when:
- Genomic data is limited or selection relies on phenotypic BLUP
- Sparse testing designs are implemented
- Strong G×E interactions are expected (low rGE)
- Resource constraints limit replication
Restricted randomization (RR) remains adequate when:
- Comprehensive genomic data is available for all lines
- Balanced multi-environment testing is implemented
- Within-cohort selection is the primary objective
- Logistical constraints favor separate trials
| Simulation Type | Iterations | Cores | RAM | Time |
|---|---|---|---|---|
| BLUP | 450 | 6 | 16 GB | ~48 hrs |
| GBLUP | 50 | 4 | 32 GB | ~24 hrs |
| GEBV | 400 | 4 | 32 GB | ~72 hrs |
Times are approximate and depend on hardware specifications.
ASReml convergence failures:
# Increase workspace
asreml.options(pworkspace = "8gb")
# Adjust AI settings
asreml.options(ai.sing = TRUE, fail = "soft")
Memory issues with large datasets:
# Use data.table for efficiency
library(data.table)
setDTthreads(threads = 0) # Use all available threads
Matrix singularity warnings:
- Ensure sufficient genetic variation in cohorts
- Check for duplicate genotypes
- Verify genomic relationship matrix is positive definite
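The last two checks can be scripted in base R. The helper `check_grm` below is a hypothetical sketch rather than part of the repository; it tests positive definiteness through the smallest eigenvalue and flags pairs of lines whose rows in the relationship matrix are numerically identical.

```r
# Hypothetical helper: diagnose a genomic relationship matrix K
check_grm <- function(K, tol = 1e-8) {
  stopifnot(is.matrix(K), nrow(K) == ncol(K))
  # Eigenvalues of the symmetrized matrix; all must exceed tol for PD
  ev <- eigen((K + t(K)) / 2, symmetric = TRUE, only.values = TRUE)$values
  # Identical rows in K point to duplicate genotypes
  same_row <- as.matrix(dist(K)) < tol & upper.tri(K)
  list(
    min_eigenvalue    = min(ev),
    positive_definite = min(ev) > tol,
    duplicate_pairs   = which(same_row, arr.ind = TRUE)
  )
}

# Toy example 1: a well-conditioned matrix
K_ok <- diag(3) + 0.1
ok   <- check_grm(K_ok)

# Toy example 2: lines 1 and 2 share identical marker profiles
M     <- rbind(c(1, 0, -1, 1), c(1, 0, -1, 1), c(-1, 1, 0, 0))
K_dup <- tcrossprod(M) / ncol(M)
bad   <- check_grm(K_dup)

print(ok$positive_definite)
print(bad$positive_definite)
print(bad$duplicate_pairs)
```

If a matrix fails the check, two common remedies are removing the duplicated lines or adding a small constant to the diagonal. Whether the simulation scripts apply either step is not stated here.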
Due to data sharing agreements:
- Genotypic data: Available upon reasonable request to the authors
- Simulated phenotypes: Generated de novo by scripts in this repository
- Summary statistics: Included in docs/results_summary.md
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/improvement)
- Commit changes (git commit -am 'Add improvement')
- Push to branch (git push origin feature/improvement)
- Open a Pull Request
This project is licensed under the MIT License - see LICENSE file for details.
Arlyn Ackerman
Breeding Insight, Cornell University
Email: aja258@cornell.edu
Jessica Rutkoski
Department of Crop Sciences, University of Illinois at Urbana-Champaign
Email: rutkoski@illinois.edu
- Eastern Regional Small Grains Genotyping Lab for GBS services
- Breeding Insight for computational resources
Key methodological references:
- Experimental Design:
  - Cullis et al. (2006) - p-rep designs
  - Clarke & Stefanova (2011) - Optimal designs for early-generation trials
  - Piepho & Williams (2006) - Restricted vs. complete randomization
- Genomic Selection:
  - Combs & Bernardo (2013) - GS accuracy factors
  - Atanda et al. (2022) - Sparse testing with GS
- Statistical Methods:
  - Smith et al. (2007) - Environment-specific variance models
  - Rothbard et al. (2024) - Difference-in-differences methodology
Last Updated: January 2025
Status: Manuscript in preparation