Please cite our paper:
Bellavance, J., Xiao, H., Chang, L., Kazemi, M., Wickramasinghe, S., Mayhew, A.J., Raina, P., VandeHaar, P., Taliun, D., & Gagliano Taliun, S.A. (2026). Exploring and visualizing stratified GWAS results with PheWeb2. Nature Genetics https://doi.org/10.21203/rs.3.rs-7463215/v1
This is an implementation of the data model and API for PheWeb2 — an enhanced version of the original PheWeb web-based tool for interactive querying, visualizing, and sharing summary-level results from genome-wide and phenome-wide association studies (GWAS/PheWAS), which offers intuitive and efficient support for stratified analysis results. PheWeb2 decouples the data model and API from the user interface (UI) to improve code maintenance and reusability and allow results querying by other external resources and applications.
Tip
If you've already set up the PheWeb2 API, you can proceed to install and launch the PheWeb2 user interface.
Note
The code was developed and tested with Python 3.12+ on Linux-based OS.
You can install PheWeb2 and all required dependencies within a virtual environment using the following steps:
- Clone this repository:
git clone https://github.com/GaglianoTaliun-Lab/PheWeb2-API.git cd PheWeb2-API - Create and activate Python virtual environment:
python -m venv .venv source .venv/bin/activate - Install PheWeb2 Python package and its dependencies:
pip install -e .
To familiarize yourself with PheWeb2, we recommend first trying to configure and run it with the provided example dataset by following the steps below.
🚨 Important: Please avoid modifying the default values in the
config.pyfile when working with the example dataset. This file specifies the default versions of dbSNP (v157, genome build 38) and GENCODE (v48, genome build 38) databases. We host these versions on our servers to make your work easier and to ensure a faster workflow.
-
Download and unarchive the example data (~13 GB):
wget https://objets.juno.calculquebec.ca/swift/v1/AUTH_290e6dcc5e264b34b401f54358bd4c54/pheweb_example_data/example_regenie.tar.gz tar -xzvf example_regenie.tar.gz -
Import the example manifest file describing phenotypes:
pheweb2 phenolist import-phenolist manifest-example.csv -
Ingest the example data into PheWeb2 (this can take some time):
pheweb2 process -
Run automated tests of the API routes:
pytest tests/test_routes.py -s -vClick to see an example of passed tests.
tests/test_routes.py::test_get_phenotypes PASSED tests/test_routes.py::test_get_gene_names PASSED tests/test_routes.py::test_get_gene_PCSK9 PASSED tests/test_routes.py::test_get_tophits PASSED tests/test_routes.py::test_get_stratifications PASSED tests/test_routes.py::test_get_variant_10_112999020_G_T PASSED -
Launch PheWeb2 API endpoint which will be available at
http://127.0.0.1:9543:pheweb2 serve --host 127.0.0.1 --port 9543 -
To access the interactive API documentation, open your internet browser and navigate to
http://localhost:9543/docs, assuming you are running it on the same machine at port 9543.
The self-documenting config.py configuration file includes all the variables that determine where PheWeb 2 API stores ingested GWAS data, how it processes this data, and how it serves it through HTTP.
When performing GWAS data ingestion, you should focus on adjusting the variables in SECTION A, SECTION B, and SECTION C of this file:
- SECTION A of the config.py file lists the configuration variables that control the location of the processed GWAS summary statistics.
- SECTION B of the config.py file lists the configuration variables that control the versions of the external public databases such as dbSNP and GENCODE.
- SECTION C of the config.py file lists the configuration variables that control GWAS summary statistics ingestion.
When running an API endpoint, you should focus on adjusting the variables in SECTION D of the config.py file that control the API URL address, the number of API workers to handle HTTP requests, and so on.
Note
The current version of PheWeb 2 API has been tested using GWAS summary statistics output files generated by Regenie.
Important
PheWeb 2 relies on the TEST/test column in GWAS summary statistics files to distinguish variant effects from interactions. While Regenie and PLINK2 GWAS ummary statistics files by default include the TEST column, which provides details on the statistical tests reported (e.g., ADD, ADD-CONDTL, ADD-INT_SNPxVAR, ADDxCOVAR1), other tools may not include this column. In such cases, we recommend adding a TEST column with the appropriate values to your GWAS ummary statistics files. Then, you can define the values representing each test in your GWAS ummary statistics files using ASSOC_TEST_NAME and INTERACTION_TEST_NAME variables in the config.py file.
The table below presents a minimal list of expected columns in GWAS summary statistics files. You can modify their names using the FIELD_ALIASES variable in the config.py file.
| column description | column name | allowed values |
|---|---|---|
| Chromosome | CHROM | 1-22, 23/X |
| Position | GENPOS | Integer |
| Reference Allele | ALLELE0 | Must match reference genome |
| Effect Allele | ALLELE1 | Anything |
| Beta | BETA | Float |
| Standard Error | SE | Float |
| Log-10 P-value | LOG10P | Float |
| Statistical Test | TEST | Anything. Typical values: ADD, ADD-CONDTL, ADD-INT_SNPxVAR, ... |
During the ingestion of GWAS summary statistics, PheWeb 2 will apply the following exclusions:
- If a required field is null (i.e., any of the following: [’’, ‘.’, ‘NA’, ‘N/A’, ‘n/a’, ‘nan’, ‘-nan’, ‘NaN’, ‘-NaN’, ‘null’, ‘NULL’]), the variant will be discarded.
- If the minor allele frequency (MAF) of the variant falls below a specified threshold (defined by the
ASSOC_MIN_MAFvariable in the config.py file), the variant will be excluded. - If the MAF of the variant is below a specified threshold (defined by the
INTERACTION_MIN_MAFvariable in the config.py file), the interaction effect for this variant will not be imported. - When genotype imputation quality scores are available, if the imputation quality for a variant falls below a specified threshold (defined by the
MIN_IMP_QUALITYvariable in the config.py file), the variant will be excluded.
Note
When filtering by imputation quality scores, the scores can be provided as a column inside the GWAS summary statistics file (e.g., the Regenie outputs imputation quality scores in the INFO column) or in the external indexed VCF files outputted by the genotype imputation software (e.g., minimac4). See an example inside the config.py file.
Important
The imputation quality scores recomputed by Regenie may differ from the scores outputted by the genotype imputation software, as Regenie uses dosages only for a subset of individuals included in GWAS who have phenotype and covariate data. When conducting stratified analyses, we recommend using the original genotype imputation quality scores computed by the imputation software.
To ingest GWAS summary statistics files into PheWeb2, you need to create the Manifest file. The Manifest file is a comma-separated (CSV) file that describes each GWAS summary statistics file you have. The manifest-example.csv file serves as an example of the Manifest file, and the table below lists the required and optional columns.
| column description | value | allowed values | required? |
|---|---|---|---|
| Phenotype Code | phenocode | string | true |
| Phenotype Description | phenostring | string | true |
| Location of summary statistics | assoc_files | string | true |
| Number of tested samples / participants | num_samples | int | true |
| Number of tested cases | num_cases | int | false |
| Number of tested controls | num_controls | int | false |
| Category of trait | category | float | false |
| Variable of Interaction Testing | interaction | string | false |
| Category of stratification (Can be more than one) | interaction | "stratification.*" (where *=string) | false |
Note
Although parallelized, data ingestion may still take some time, depending on the number and size of your GWAS summary statistics files.
The ingested data will be by default stored inside the generated_by_pheweb folder, but this can be controlled using the PHEWEB_DATA_DIR variable inside the config.py file.
After the data ingestion, you may archive your original GWAS summary statistics files, as PheWeb2 will no longer need them.
Once you have edited the config.py file and created the Manifest file, execute the following commands inside the root directory of the PheWeb2 API.
-
Import the manifest file into PheWeb2:
pheweb2 phenolist import-phenolist /path/to/manifest.csvThis command creates a
pheno-list.jsonfile in the root directory. -
Set the number of parallel ingestion jobs using the
NUM_PROCSparameter in the config.py file. -
Ingest the GWAS summary statistics into PheWeb2:
pheweb2 process
Note
Although parallelized, data ingestion may still take some time, depending on the number and size of your GWAS summary statistics files.
The ingested data will be by default stored inside the generated_by_pheweb folder, but this can be controlled using the PHEWEB_DATA_DIR variable inside the config.py file.
After the data ingestion, you may archive your original GWAS summary statistics files, as PheWeb2 will no longer need them.
Once you have edited the config.py file and created the Manifest file, execute the following commands inside the root directory of the PheWeb2 API.
-
Import the manifest file into PheWeb2:
pheweb2 phenolist import-phenolist /path/to/manifest.csvThis command creates a
pheno-list.jsonfile in the root directory. -
Create a SLURM/SGE job submission bash script for parsing the GWAS summary statistics files:
pheweb2 cluster --engine=slurm --step=parse --N_per_job=3or
pheweb2 cluster --engine=sge --step=parse --N_per_job=3A bash script named
slurm-parse-[DATETIME].shorsge-parse-[DATETIME].shwill be generated in thegenerated-by-pheweb/tmp/directory. If you have modified thePHEWEB_DATA_DIRvariable in the config.py configuration file, the script will be placed in the[PHEWEB_DATA_DIR]/tmp/directory instead.To specify the maximum number of GWAS files to parse sequentially per parallel job, use the
--N_per_joboption (the default is 3). You can also include an optional--accountargument in thepheweb2 clustercommand to specify your SLURM account name if needed. -
Submit the SLURM/SGE jobs for parsing GWAS summary statistics files to the queue:
sbatch generated-by-pheweb/tmp/slurm-parse-[DATETIME].shor
sge generated-by-pheweb/tmp/sge-parse-[DATETIME].sh
Important
Depending on your GWAS files' size and your cluster's configuration, you may need to adjust the job submission parameters in the bash script. This could involve increasing the job time limit or memory limit, or specifying a queue/account name.
-
After all SLURM/SGE jobs submitted to the queue have successfully finished running, you can proceed to the next step. This involves generating a union of all genetic variants from the GWAS summary statistics files. Note that this step does not utilize SLURM/SGE; instead, it employs local parallelization, which is controlled by the
NUM_PROCSvariable in the config.py configuration file.pheweb2 sitesThis creates the
generated-by-pheweb/sites/sites-unannotated.tsvfile. -
Next, download the gene aliases database (internet access is required):
pheweb2 make-gene-aliases-sqlite3 -
Download the rsID database and annotate it with gene names in the sites file (this may take a few hours; internet access is required):
pheweb2 add-rsids pheweb2 add-genes -
Create a mapping between GWAS variants and rsIDs:
pheweb2 make-cpras-rsids-sqlite3 -
The upcoming processing steps will again utilize SLURM/SGE for parallelization, similar to steps 2-3. Please create bash scripts for SLURM/SGE to augment variant-phenotype data and generate Manhattan and QQ plots:
pheweb2 cluster --engine=slurm --step=augment-phenos --N_per_job=3 pheweb2 cluster --engine=slurm --step=manhattan --N_per_job=3 pheweb2 cluster --engine=slurm --step=qq --N_per_job=3or
pheweb2 cluster --engine=sge --step=augment-phenos --N_per_job=3 pheweb2 cluster --engine=sge --step=manhattan --N_per_job=3 pheweb2 cluster --engine=sge --step=qq --N_per_job=3After generating the job submission bash scripts, you can submit your jobs to the SLURM or SGE queue, just as you did in step 3.
sbatch generated-by-pheweb/tmp/slurm-augment-phenos-[DATETIME].sh sbatch generated-by-pheweb/tmp/slurm-manhattan-[DATETIME].sh sbatch generated-by-pheweb/tmp/slurm-qq-[DATETIME].shor
sge generated-by-pheweb/tmp/sge-augment-phenos-[DATETIME].sh sge generated-by-pheweb/tmp/sge-manhattan-[DATETIME].sh sge generated-by-pheweb/tmp/sge-qq-[DATETIME].sh -
Once all SLURM/SGE jobs from the previous step have successfully completed, proceed to create the phenotype matrix:
pheweb2 matrix -
Then, create the database to map each gene to phenotypes:
pheweb2 gather-pvalues-for-each-gene -
Then, prepare a list of all phenotypes and all statistically significant results:
pheweb2 phenotypes pheweb2 top-hits -
Generate a list of phenotypes with the statistically significant associations:
pheweb2 best-of-pheno -
Generate a database for quick lookups of variants, genes, and phenotypes using incomplete names. This will support the functionality of the search boxes effectively.
pheweb2 generate-autocomplete-db -
Run this command to validate the process and ensure that all expected files were generated.
pheweb2 process
-
Adjust the configuration variables in SECTION D of the configuration file located at config.py. Among other things, you may want to enable debug mode by setting the
ENABLE_DEBUGvariable. -
To run the API in development mode, run:
pheweb2 serveClick to see an example of terminal output in the development mode.
* Serving Flask app 'pheweb_api.api_app' * Debug mode: on WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead. * Debugger is active! -
To test the API once it is running, execute the following commands in a different terminal on the same machine. If necessary, change the port number from 9543.
curl -X GET "http://localhost:9543/phenotypes/" curl -X GET "http://localhost:9543/gene/" curl -X GET "http://localhost:9543/phenotypes/tophits" curl -X GET "http://localhost:9543/variant/stratification_list"You should see the messages in the terminal that run the API, prefixed with
[DEBUG]. -
To access the interactive API documentation and test API queries, open your internet browser and go to
http://localhost:9543/, provided you are running it on the same machine at port 9543. If you are using a different host or port, please update the HTTP address in your browser to match the settings in the config.py configuration file.
- Adjust the configuration variables in SECTION D of the configuration file located at config.py. Make sure to disable debug mode by setting the
ENABLE_DEBUGvariable toFalse.
Note
If you plan to run the API behind a reverse proxy, such as Apache or Nginx, and want all API routes to start with a specific prefix (like /api or /api/v1), you should set the API_URL_PREFIX variable in the config.py configuration file.
Warning
If you need to allow access to API only from a specific machine, then set the CORS_ORIGINS variable in the config.py configuration file.
-
To run the API server in production mode, run:
pheweb2 serve --gunicorn --enable-cache -
Once the API is running, you can test it by executing the following commands on the same machine. Change the 9543 port number if required.
curl -X GET "http://localhost:9543/phenotypes/" curl -X GET "http://localhost:9543/gene/" curl -X GET "http://localhost:9543/phenotypes/tophits" curl -X GET "http://localhost:9543/variant/stratification_list" -
To access the interactive API documentation, open your internet browser and navigate to
http://localhost:9543/, assuming you are running it on the same machine at port 9543. If not, you will need to modify the HTTP address in your browser to match the host and port specified in the config.py configuration file.