A comprehensive Nextflow pipeline for multi-ancestry genome-wide association studies (GWAS) with fine-mapping, meta-analysis, polygenic risk score calculation, and functional annotation. Optimized for admixed populations with Tractor, GENESIS, and SAIGE support.
This pipeline supports ancestry-aware genetic analysis through the following components:
- Ancestry-stratified GWAS: Runs separate GWAS for each major ancestry group (EUR, AFR, EAS, SAS, AMR, MID, AAC, AHI, LAT1, LAT2, HET)
- Admixed-optimized analysis: GENESIS, SAIGE, and Tractor for local ancestry-aware GWAS
- Multi-ancestry meta-analysis: Uses MR-MEGA to model ancestry-correlated heterogeneity
- Within-ancestry fine-mapping: PolyFun+SuSIE for annotation-prioritized fine-mapping
- Multi-ancestry fine-mapping: MG-FLASH-FM for related traits, SuSIE-ME for single traits
- Ancestry-aware PRS: Multiple methods including PRS-CSx, GAUDI, and local ancestry-informed approaches
- Heritability analysis: Ancestry-specific h² and cross-ancestry genetic correlations
- Functional annotation: MAGMA, FUMA, LAVA, FLAMES with cell-type specificity
- Diverse QTL colocalization: Curated datasets beyond GTEx for blood/immune, mQTL, caQTL, and more
- Survival analysis: SPA-Cox for time-to-event GWAS
Tractor decomposes genetic effects by ancestral origin in admixed individuals:
- African American (AAC): 2-way admixture (EUR, AFR)
- Latino (LAT1+LAT2 combined): 3-way admixture (EUR, AFR, NAT)
  - LAT1 = Mexican/Central American
  - LAT2 = Caribbean/South American
  - Combined for better statistical power
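The pooling of LAT1 and LAT2 into one Tractor group can be sketched in shell. This is an illustration only, not the pipeline's implementation: the file names are hypothetical, and the two-column `IID`/`ancestry` layout is assumed to match the ancestry calls file described in this README.

```shell
#!/bin/sh
# Sketch: relabel LAT1 and LAT2 as a single LATINO group.
# File names are hypothetical; a tab-separated IID/ancestry layout is assumed.
workdir=$(mktemp -d)
printf 'IID\tancestry\ns1\tLAT1\ns2\tLAT2\ns3\tAAC\ns4\tEUR\n' > "$workdir/ancestry_calls.tsv"

# Keep the header, rewrite LAT1/LAT2 in column 2, leave other groups unchanged.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { print; next }
     $2 == "LAT1" || $2 == "LAT2" { $2 = "LATINO" }
     { print }' "$workdir/ancestry_calls.tsv" > "$workdir/tractor_groups.tsv"

# Count samples in the pooled group.
n_latino=$(awk -F '\t' '$2 == "LATINO"' "$workdir/tractor_groups.tsv" | wc -l | tr -d ' ')
echo "LATINO samples: $n_latino"
```

Pooling before counting matters when each subgroup alone would fall below a minimum sample size but the combined group would not.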
GENESIS is an R/Bioconductor package optimized for:
- Complex population structure
- Related individuals (with kinship matrix)
- Heterogeneous ancestry
SAIGE is optimized for:
- Case-control imbalance
- Rare variants
- Large biobank data
┌─────────────────┐
│   Input Data    │
│  (Genotypes +   │
│   Phenotypes)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌──────────────────────┐
│  QC + Ancestry  │────▶│  Optional Ancestry   │
│   (GRAF-ANC)    │     │  Inference (or use   │
│                 │     │  pre-computed calls) │
└────────┬────────┘     └──────────────────────┘
         │
    ┌────┴────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
    ▼         ▼         ▼         ▼         ▼         ▼         ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│  EUR  │ │  AFR  │ │  EAS  │ │  SAS  │ │  AMR  │ │  MID  │ │  AAC  │ │ LATINO│
│ GWAS  │ │ GWAS  │ │ GWAS  │ │ GWAS  │ │ GWAS  │ │ GWAS  │ │Tractor│ │Tractor│
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
    │         │         │         │         │         │         │         │
    └────┬────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
         │
         ▼
┌─────────────────┐
│     MR-MEGA     │
│  Meta-Analysis  │
└────────┬────────┘
         │
       ┌─┴──────────────┐
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│ Within-Anc  │  │  Multi-Anc  │
│  Fine-Map   │  │  Fine-Map   │
│  PolyFun+   │  │ MG-FLASH-FM │
│    SuSIE    │  │ or SuSIE-ME │
└──────┬──────┘  └──────┬──────┘
       │                │
       └────────┬───────┘
                ▼
┌───────────────────────────────────────────────────────────┐
│                    Downstream Analysis                    │
├─────────────┬──────────────┬────────────────┬─────────────┤
│ PRS         │ Heritability │ Colocalization │ Functional  │
│ PRS-CSx     │ LDSC         │ Diverse QTL    │ MAGMA       │
│ PRS-CS      │ GCTA         │ Datasets       │ FUMA        │
│ GAUDI       │ Cross-Anc    │                │ LAVA        │
│ LDpred2     │ rg           │                │ FLAMES      │
└─────────────┴──────────────┴────────────────┴─────────────┘
# Basic run with REGENIE
nextflow run main.nf \
--input samplesheet.csv \
--phenotype_file phenotypes.txt \
--phenotype_cols "trait1,trait2" \
--outdir results \
-profile docker
# With pre-computed ancestry calls (recommended)
nextflow run main.nf \
--input samplesheet.csv \
--phenotype_file phenotypes.txt \
--phenotype_cols "WBC,Lymphocytes,Monocytes" \
--covariate_cols "age,sex,PC1,PC2,PC3,PC4,PC5" \
--run_ancestry_inference false \
--ancestry_calls_file ancestry_calls.tsv \
--gwas_tool genesis \
--kinship_matrix grm.rds \
--outdir results \
-profile singularity
# For admixed populations with Tractor
nextflow run main.nf \
--input samplesheet.csv \
--phenotype_file phenotypes.txt \
--phenotype_cols "trait1" \
--run_tractor true \
--local_ancestry_files local_ancestry.msp \
--tractor_aac_pops "EUR,AFR" \
--tractor_lat_pops "EUR,AFR,NAT" \
--outdir results \
-profile singularity
# For survival/time-to-event analysis
nextflow run main.nf \
--input samplesheet.csv \
--phenotype_file phenotypes.txt \
--phenotype_cols "leukemia" \
--survival_analysis true \
--time_col "time_to_event" \
--event_col "event_status" \
--outdir results \
-profile singularity
Create a CSV file with genotype file paths:
sample_id,cohort,bed,bim,fam
study1,cohort_A,/path/to/study1.bed,/path/to/study1.bim,/path/to/study1.fam
study2,cohort_B,/path/to/study2.bed,/path/to/study2.bim,/path/to/study2.fam
Supported formats: PLINK1 (bed/bim/fam), PLINK2 (pgen/pvar/psam), BGEN, VCF
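Before launching a run, it can help to confirm that every path in the samplesheet exists. The following is a minimal sketch for the PLINK1 layout shown above; it is not part of the pipeline, and the temporary files stand in for real genotype data.

```shell
#!/bin/sh
# Sketch: check that each bed/bim/fam path listed in a samplesheet exists.
# The samplesheet columns follow the example above; all files here are dummies.
workdir=$(mktemp -d)
touch "$workdir/study1.bed" "$workdir/study1.bim" "$workdir/study1.fam"
cat > "$workdir/samplesheet.csv" <<EOF
sample_id,cohort,bed,bim,fam
study1,cohort_A,$workdir/study1.bed,$workdir/study1.bim,$workdir/study1.fam
study2,cohort_B,$workdir/study2.bed,$workdir/study2.bim,$workdir/study2.fam
EOF

# Drop the header so the loop only sees data rows.
tail -n +2 "$workdir/samplesheet.csv" > "$workdir/rows.csv"

missing=0
while IFS=, read -r sample_id cohort bed bim fam; do
    for f in "$bed" "$bim" "$fam"; do
        if [ ! -f "$f" ]; then
            echo "missing file for $sample_id: $f" >&2
            missing=$((missing + 1))
        fi
    done
done < "$workdir/rows.csv"
echo "missing files: $missing"
```

The loop reads from a file rather than a pipe so that the `missing` counter is still set after the loop ends. Paths containing commas would need a proper CSV parser.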
Tab-separated file with sample IDs and phenotype values:
IID trait1 trait2 age sex PC1 PC2 PC3 PC4 PC5
sample1 25.3 170.2 45 1 0.01 -0.02 0.005 0.001 -0.003
sample2 28.1 165.8 38 2 -0.01 0.03 -0.002 0.004 0.001
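A quick header check can catch misspelled column names before they reach the pipeline. This sketch assumes a tab-separated phenotype file as described above; the `requested` list is a hypothetical stand-in for the values passed to `--phenotype_cols` and `--covariate_cols`.

```shell
#!/bin/sh
# Sketch: report requested columns that are absent from the phenotype header.
workdir=$(mktemp -d)
printf 'IID\ttrait1\ttrait2\tage\tsex\tPC1\n' > "$workdir/phenotypes.txt"
printf 'sample1\t25.3\t170.2\t45\t1\t0.01\n' >> "$workdir/phenotypes.txt"

# Hypothetical request; "bmi" is deliberately not in the file.
requested="trait1,trait2,age,bmi"

absent=""
for col in $(printf '%s' "$requested" | tr ',' ' '); do
    # Split the header into one name per line and look for an exact match.
    if ! head -n 1 "$workdir/phenotypes.txt" | tr '\t' '\n' | grep -qxF "$col"; then
        absent="$absent $col"
    fi
done
echo "columns not found:$absent"
```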
If you have pre-computed ancestry calls, provide a TSV file:
IID ancestry
sample1 EUR
sample2 AFR
sample3 AAC
sample4 LAT1
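The ancestry calls file can be checked against the allowed group codes and a minimum group size before a run. This is a sketch under stated assumptions, not the pipeline's own validation: the file is taken to be tab-separated, and `min_n` is set to 2 only to keep the toy example small (the documented default for `--min_ancestry_n` is 100).

```shell
#!/bin/sh
# Sketch: flag unknown ancestry codes and groups below a minimum size.
workdir=$(mktemp -d)
printf 'IID\tancestry\ns1\tEUR\ns2\tEUR\ns3\tEUR\ns4\tAFR\ns5\tXYZ\n' > "$workdir/ancestry_calls.tsv"

allowed="EUR,AFR,EAS,SAS,AMR,MID,AAC,AHI,LAT1,LAT2,HET"
min_n=2   # toy value for illustration

report=$(awk -F '\t' -v allowed="$allowed" -v min_n="$min_n" '
    BEGIN { n = split(allowed, a, ","); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    NR == 1 { next }                                   # skip header
    !($2 in ok) { print "unknown\t" $2 "\t" $1; next } # code not in allowed list
    { count[$2]++ }
    END { for (g in count) if (count[g] < min_n) print "too_small\t" g "\t" count[g] }
' "$workdir/ancestry_calls.tsv" | sort)
printf '%s\n' "$report"
```

Each output row starts with the kind of problem (`unknown` or `too_small`), followed by the group code and either the sample ID or the group count.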
| Parameter | Default | Description |
|---|---|---|
| `--run_ancestry_inference` | `false` | Run ancestry inference in pipeline |
| `--ancestry_calls_file` | `null` | Pre-computed ancestry calls file |
| `--ancestry_method` | `graf-anc` | Method: graf-anc, admixture, pca |
| `--ancestry_groups` | `EUR,AFR,EAS,SAS,AMR,MID,AAC,AHI,LAT1,LAT2,HET` | Ancestry groups |
| `--min_ancestry_n` | `100` | Minimum samples per ancestry |
| Parameter | Default | Description |
|---|---|---|
| `--gwas_tool` | `regenie` | Tool: regenie, saige, bolt-lmm, plink2, genesis |
| `--gwas_model` | `additive` | Model: additive, dominant, recessive |
| `--kinship_matrix` | `null` | Pre-computed kinship matrix (for GENESIS) |
| `--p_threshold` | `5e-8` | Genome-wide significance threshold |
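
To illustrate how a genome-wide significance threshold is applied, the sketch below filters a summary statistics file by p-value. The column name `P`, the file layout, and the strict `<` comparison are assumptions for the example; the pipeline's actual output columns may differ.

```shell
#!/bin/sh
# Sketch: keep variants with P below the genome-wide significance threshold.
workdir=$(mktemp -d)
printf 'SNP\tCHR\tBP\tP\nrs1\t1\t1000\t3e-9\nrs2\t1\t2000\t0.04\nrs3\t2\t500\t5e-8\nrs4\t2\t900\t1.2e-12\n' > "$workdir/sumstats.tsv"

p_threshold=5e-8
awk -F '\t' -v thr="$p_threshold" '
    # Locate the P column by name so column order does not matter.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "P") pcol = i; print; next }
    ($pcol + 0) < (thr + 0) { print }
' "$workdir/sumstats.tsv" > "$workdir/significant.tsv"

# Subtract the header line from the row count.
n_sig=$(($(wc -l < "$workdir/significant.tsv") - 1))
echo "significant variants: $n_sig"
```

Adding `+ 0` forces a numeric comparison, so values written in scientific notation are compared as numbers rather than strings.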
| Parameter | Default | Description |
|---|---|---|
| `--run_tractor` | `false` | Run Tractor for admixed populations |
| `--tractor_groups` | `AAC,LATINO` | Groups for Tractor (LATINO = LAT1+LAT2) |
| `--tractor_aac_pops` | `EUR,AFR` | Ancestral populations for African American |
| `--tractor_lat_pops` | `EUR,AFR,NAT` | Ancestral populations for Latino |
| `--local_ancestry_files` | `null` | Local ancestry MSP files (RFMix format) |
| Parameter | Default | Description |
|---|---|---|
| `--survival_analysis` | `false` | Run time-to-event GWAS |
| `--time_col` | `null` | Time-to-event column name |
| `--event_col` | `null` | Event indicator column (1=event, 0=censored) |
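
The survival columns can be sanity-checked before a run. This sketch assumes a tab-separated phenotype file and treats non-positive follow-up times as invalid, which is an assumption made here rather than a documented pipeline rule; the event coding (1=event, 0=censored) follows the table above.

```shell
#!/bin/sh
# Sketch: list samples whose time or event values look invalid.
workdir=$(mktemp -d)
printf 'IID\ttime_to_event\tevent_status\ns1\t12.5\t1\ns2\t30\t0\ns3\t-4\t1\ns4\t8\t2\n' > "$workdir/phenotypes.txt"

time_col="time_to_event"
event_col="event_status"
bad_rows=$(awk -F '\t' -v tcol="$time_col" -v ecol="$event_col" '
    # Find both columns by name in the header.
    NR == 1 { for (i = 1; i <= NF; i++) { if ($i == tcol) t = i; if ($i == ecol) e = i }; next }
    # Flag non-positive times or event codes other than 0/1.
    ($t + 0) <= 0 || ($e != "0" && $e != "1") { print $1 }
' "$workdir/phenotypes.txt" | tr '\n' ' ')
echo "rows failing checks: $bad_rows"
```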
| Parameter | Default | Description |
|---|---|---|
| `--meta_method` | `mr-mega` | Method: mr-mega, metal, metasoft |
| `--mr_mega_env` | `false` | Include environmental variable |
| Parameter | Default | Description |
|---|---|---|
| `--within_ancestry_fm` | `polyfun-susie` | Within-ancestry method |
| `--multi_ancestry_fm` | `mg-flash-fm` | Multi-ancestry method for related traits |
| `--single_trait_fm` | `susie-me` | Multi-ancestry method for single traits |
| `--related_traits` | `true` | Whether traits are related |
| Parameter | Default | Description |
|---|---|---|
| `--use_curated_qtl` | `true` | Use curated diverse QTL datasets |
| `--qtl_ancestry_matching` | `true` | Match QTL datasets to GWAS ancestry |
| `--custom_qtl_datasets` | `null` | Path to custom QTL samplesheet |
Curated QTL datasets include:
- Blood/Immune: OneK1K (14 immune cell types), DICE, BLUEPRINT, eQTLGen
- African ancestry: GENOA (mQTL/eQTL), AFGR (caQTL)
- Multi-ethnic: MESA (EUR, AFR, HISP, EAS), TOPMed
- Methylation: GENOA mQTL, GoDMC, BLUEPRINT mQTL
- Chromatin accessibility: DICE caQTL, BLUEPRINT hQTL
- Single-cell: OneK1K, HCA Immune Atlas
| Parameter | Default | Description |
|---|---|---|
| `--prs_methods` | `prs-csx,prs-cs,gaudi,ldpred2` | PRS methods to run |
| `--prs_validation` | `true` | Validate PRS in target cohort |
| `--prs_best_method` | `true` | Determine best method per ancestry |
| Code | Description | Tractor Support |
|---|---|---|
| EUR | European | - |
| AFR | African | - |
| EAS | East Asian | - |
| SAS | South Asian | - |
| AMR | Native American | - |
| MID | Middle Eastern | - |
| AAC | African American (admixed) | Yes (2-way: EUR,AFR) |
| AHI | American Hispanic (admixed) | Yes (3-way: EUR,AFR,NAT) |
| LAT1 | Latino Type 1 (Mexican/Central American) | Combined as LATINO |
| LAT2 | Latino Type 2 (Caribbean/South American) | Combined as LATINO |
| HET | Heterogeneous/Other | - |
results/
├── ancestry/
│ ├── graf/ # GRAF ancestry calls
│ └── stratified/ # Ancestry-stratified genotypes
├── gwas/
│ ├── {ancestry}/{trait}/ # Per-ancestry GWAS results
│ └── tractor/ # Tractor results (AAC, LATINO)
│ ├── AAC/ # 2-way admixture results
│ └── LATINO/ # 3-way admixture results (LAT1+LAT2)
├── survival/
│ └── {ancestry}/{trait}/ # SPA-Cox survival GWAS
├── meta_analysis/
│ └── {trait}/ # MR-MEGA meta-analysis
├── fine_mapping/
│ ├── {ancestry}/{trait}/ # Within-ancestry fine-mapping
│ └── multi_ancestry/{trait}/ # Multi-ancestry fine-mapping
├── prs/
│ └── {trait}/ # PRS weights and scores
├── heritability/
│ └── {ancestry}/{trait}/ # h² estimates
├── functional/
│ └── {trait}/ # Functional annotations
├── colocalization/
│ └── {trait}/ # Colocalization results
├── plots/
│ └── {trait}/ # Manhattan, QQ, regional plots
└── reports/
├── multiqc_report.html # MultiQC report
└── ancestry_gwas_report.html # Custom pipeline report
Tractor performs local ancestry-aware GWAS that decomposes genetic effects by ancestral origin:
- Extracts ancestry-specific haplotype dosages from local ancestry calls
- Tests for ancestry-specific effects and heterogeneity
- Particularly powerful for identifying ancestry-specific associations
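As a rough guide to what "decomposing effects by ancestral origin" means, the Tractor regression for a two-way admixed sample can be written as follows. The notation is chosen here for illustration and is not taken from the pipeline code.

$$
g\big(\mathbb{E}[Y_i]\big) = \beta_0 + \beta_1 L_i + \beta_2 G_i^{(A)} + \beta_3 G_i^{(B)} + \sum_{k} \gamma_k C_{ik}
$$

Here $g$ is the link function (identity for quantitative traits, logit for binary traits), $L_i \in \{0, 1, 2\}$ is the number of haplotypes of ancestry $A$ that individual $i$ carries at the locus, $G_i^{(A)}$ and $G_i^{(B)}$ are the counts of the tested allele on haplotypes of ancestry $A$ and $B$, and $C_{ik}$ are covariates. The ancestry-specific effects are $\beta_2$ and $\beta_3$, and a difference between them indicates effect heterogeneity by ancestral background. A three-way model adds one more local ancestry term and one more dosage term.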
GENESIS is an R/Bioconductor package for:
- Mixed model association testing accounting for population structure and relatedness
- Score, Wald, and BinomiRare tests
- PC-AiR for ancestry inference in related samples
- PC-Relate for kinship estimation in structured populations
SPA-Cox applies a saddlepoint approximation to Cox proportional hazards GWAS:
- Efficient for large-scale survival analysis
- Handles time-to-event outcomes (e.g., disease onset, death)
- Accounts for censoring
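For reference, the Cox proportional hazards model underlying a time-to-event GWAS can be written as below. The symbols are chosen here for illustration.

$$
\lambda_i(t) = \lambda_0(t)\, \exp\!\big(\gamma\, G_i + X_i^{\top} \beta\big)
$$

Here $\lambda_0(t)$ is the baseline hazard, $G_i$ is the genotype of individual $i$ at the tested variant, and $X_i$ holds the covariates. Each variant is tested against the null hypothesis $\gamma = 0$. The saddlepoint approximation is used to calibrate the p-value of the score statistic, which is most relevant for low-frequency variants and heavily censored outcomes where a normal approximation can be inaccurate.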
MR-MEGA performs meta-regression of genetic association data:
- Models allelic effects as a function of axes of genetic variation
- Accounts for ancestry-correlated heterogeneity
- Detects residual heterogeneity beyond ancestry
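The meta-regression idea can be summarized in one equation. The notation below follows the general form of the method and is given as a sketch, not as a description of how the pipeline invokes the tool.

$$
\mathbb{E}\big[b_{kj}\big] = \alpha_j + \sum_{t=1}^{T} \beta_{tj}\, x_{kt}
$$

Here $b_{kj}$ is the allelic effect of variant $j$ in cohort $k$, $x_{kt}$ is the position of cohort $k$ on the $t$-th axis of genetic variation (derived from allele frequency differences between cohorts), and each cohort is weighted by the inverse variance of its effect estimate. With $K$ cohorts and $T$ axes, the association test has $T + 1$ degrees of freedom, the test for ancestry-correlated heterogeneity ($\beta_{1j} = \dots = \beta_{Tj} = 0$) has $T$, and the residual heterogeneity test has $K - T - 2$.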
Annotation-informed fine-mapping:
- PolyFun: Functional annotation-informed prior probabilities
- SuSIE: Sum of Single Effects regression for multiple causal variants
- 95% credible set construction
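The "sum of single effects" model and the role of the PolyFun priors can be stated compactly. The notation is chosen here for illustration.

$$
y = X b + e, \qquad b = \sum_{l=1}^{L} \gamma_l\, \beta_l, \qquad \gamma_l \sim \mathrm{Mult}(1, \pi), \qquad \beta_l \sim \mathcal{N}(0, \sigma_{0l}^2)
$$

Each of the $L$ single effects has exactly one non-zero entry, chosen by the indicator vector $\gamma_l$. PolyFun supplies the prior causal probabilities $\pi$ from functional annotations in place of a uniform prior. If $\alpha_{lj}$ is the posterior probability that effect $l$ falls on variant $j$, the posterior inclusion probability is

$$
\mathrm{PIP}_j = 1 - \prod_{l=1}^{L} \big(1 - \alpha_{lj}\big),
$$

and a 95% credible set for effect $l$ is the smallest set of variants whose $\alpha_{lj}$ values sum to at least 0.95.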
Multi-ancestry fine-mapping:
- MG-FLASH-FM: For related traits (shares information across traits)
- SuSIE-ME: For single/unrelated traits
Cross-population PRS:
- PRS-CSx: Coupled continuous shrinkage priors across ancestries
- GAUDI: Local ancestry-informed PRS for admixed individuals
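In generic terms, a polygenic risk score is a weighted sum of allele dosages, and a local ancestry-informed score lets the weights depend on the ancestral background of each haplotype. The formulas below are a simplified sketch; they do not reproduce the estimation procedures of PRS-CSx or GAUDI.

$$
\mathrm{PRS}_i = \sum_{j} \hat{\beta}_j\, G_{ij}
\qquad\text{and}\qquad
\mathrm{PRS}_i^{\text{local}} = \sum_{j} \Big( \hat{\beta}_j^{(A)}\, G_{ij}^{(A)} + \hat{\beta}_j^{(B)}\, G_{ij}^{(B)} \Big)
$$

Here $G_{ij}$ is the dosage of the effect allele at variant $j$ in individual $i$, and $G_{ij}^{(A)}$, $G_{ij}^{(B)}$ split that dosage by the local ancestry ($A$ or $B$) of the haplotype carrying the allele. The methods differ mainly in how the weights $\hat{\beta}$ are estimated and shrunk.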
If you use this pipeline, please cite:
[Pipeline citation - to be added]
Key methods:
- REGENIE: Mbatchou et al. (2021) Nature Genetics
- SAIGE: Zhou et al. (2018) Nature Genetics
- GENESIS: Gogarten et al. (2019) Bioinformatics
- Tractor: Atkinson et al. (2021) Nature Genetics
- MR-MEGA: Mägi et al. (2017) Human Molecular Genetics
- PolyFun: Weissbrod et al. (2020) Nature Genetics
- SuSIE: Wang et al. (2020) J R Stat Soc Series B
- PRS-CSx: Ruan et al. (2022) Nature Genetics
- GAUDI: Sun et al. (2024) Nature Communications
- Nextflow >= 23.04.0
- Container runtime: Docker, Singularity, or Apptainer
- OR Conda/Mamba for local execution
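A version requirement like the one above can be checked with a small shell helper. This sketch relies on `sort -V` (version sort), which is available in GNU coreutils and recent BSD `sort`; the `installed` value is a placeholder that would in practice be parsed from the output of `nextflow -version`.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to version B.
version_ge() {
    # After a version sort, the smaller version comes first; A >= B when that is B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="23.04.0"
installed="23.10.1"   # placeholder value for illustration

if version_ge "$installed" "$required"; then
    status="ok"
else
    status="too old"
fi
echo "Nextflow $installed: $status"
```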
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please read our contributing guidelines first.