This guide shows practical, copy-pasteable examples to build Hail Tables (HT) and MatrixTables (MT) from explicit raw files and from recipes (JSON/YAML).
If you haven't installed hvantk yet, see the main README for install steps.
For downloading raw data files (built-in downloaders and manual steps), see Data Sources.
Build one table at a time with explicit inputs and options.
- ClinVar (VCF → HT keyed by `[locus, alleles]`)

```bash
hvantk mktable clinvar \
--raw-input /data/clinvar_2024.vcf.bgz \
--output-ht /out/clinvar.ht \
--ref-genome GRCh38 \
--overwrite
```

- Interactome (BED intervals → HT keyed by `interval`)

```bash
hvantk mktable interactome \
--raw-input /data/insider.bed.bgz \
--output-ht /out/interactome.ht
```

- GeVIR (TSV keyed by `gene_id`)

```bash
hvantk mktable gevir \
--raw-input /data/gevir.tsv.bgz \
--output-ht /out/gevir.ht \
--fields oe_syn_upper,oe_mis_upper
```

- gnomAD constraint metrics (TSV keyed by `gene_id`)

```bash
hvantk mktable gnomad-metrics \
--raw-input /data/gnomad.tsv.bgz \
--output-ht /out/gnomad.ht
```

- dbNSFP variant annotations (TSV keyed by `locus, alleles`)

```bash
hvantk mktable dbnsfp \
--raw-input /data/dbNSFP4_variant.bgz \
--output-ht /out/dbnsfp.ht \
--ref-genome GRCh38 \
--auto-convert-bgz
```

Tip: If your dbNSFP file is standard gzip (`.gz`) rather than BGZF, add `--auto-convert-bgz` to automatically convert it before import.
- Ensembl gene annotations (Biomart TSV keyed by `gene_id`)

```bash
hvantk mktable ensembl-gene \
--raw-input /data/biomart.tsv.bgz \
--output-ht /out/ensembl.ht \
--no-canonical
```

- ClinGen Gene-Disease validity (CSV keyed by `gene_id` + `disease_id`)

```bash
hvantk mktable clingen-gene-disease \
--raw-input /data/clingen/Clingen-Gene-Disease-Summary-2026-03-22.csv \
--output-ht /out/clingen_gene_disease.ht
```

- GenCC submissions (CSV keyed by `gene_id` + `disease_id`)

```bash
hvantk mktable gencc-submissions \
--raw-input /data/gencc/gencc-submissions.csv \
--output-ht /out/gencc_submissions.ht
```

- COSMIC Cancer Gene Census (TSV keyed by `gene_id`)

```bash
hvantk mktable cosmic-cgc \
--raw-input /data/cosmic/cancer_gene_census.tsv \
--output-ht /out/cosmic_cgc.ht \
--mutation-context somatic
```

- HGNC gene nomenclature (TSV keyed by `hgnc_id`)

```bash
hvantk mktable hgnc \
--raw-input /data/hgnc/hgnc_complete_set.tsv \
--output-ht /out/hgnc.ht
```

- PTM sites (UniProt PTM → genomic coordinates, keyed by `locus`)

```bash
hvantk ptm build \
--output-dir /data/ptm/ \
--output-ht /out/ptm_sites.ht
```

Note: The PTM build command downloads Ensembl GTF and UniProt PTM data automatically. Use `--gtf-path` and `--ptm-tsv` to provide pre-downloaded files.
- PTM constraint (stratified AF depletion at PTM codons, by tissue / cell type)

```bash
# 1. Annotate variants with PTM flags first
hvantk ptm annotate --variants-ht clinvar.ht --ptm-ht /out/ptm_sites.ht -o clinvar_ptm.ht

# 2. Run stratified constraint analysis
hvantk ptm constraint \
--variants-ht clinvar_ptm.ht \
--expression-source anndata \
--expression-path /data/farah_2024.h5ad \
--grouping major_cell_class \
--output-dir /out/ptm-farah/
```

Note: See `tools/ptm-constraint.md` for the full flag reference and backend-specific notes (Hail MT / AnnData / tabular).
- eQTL (keyed by `locus, alleles, gene_id`)

```bash
# GTEx v11 significant pairs (Parquet)
hvantk mktable eqtl \
--raw-input /data/gtex_v11/Liver.v11.signif_pairs.parquet \
--output-ht /out/eqtl_liver.ht \
--source gtex_v11 \
--tissue Liver

# GTEx v8 (TSV)
hvantk mktable eqtl \
--raw-input /data/gtex_v8/Liver.v8.signif_variant_gene_pairs.txt.gz \
--output-ht /out/eqtl_liver.ht \
--source gtex_v8

# eQTLGen (blood cis-eQTLs)
hvantk mktable eqtl \
--raw-input /data/eqtlgen/cis-eQTLs_full.txt.gz \
--output-ht /out/eqtl_blood.ht \
--source eqtlgen

# Allpairs for coloc (set p-threshold to 0)
hvantk mktable eqtl \
--raw-input /data/gtex_v11/allpairs/Liver/ \
--output-ht /out/eqtl_allpairs_liver.ht \
--source gtex_v11 --tissue Liver --p-threshold 0
```

- pQTL (keyed by `locus, alleles, gene_id`)

```bash
# Fang et al. 2025 (space-delimited allpairs)
hvantk mktable pqtl \
--raw-input /data/fang_pqtl/Liver_allpairs.txt.gz \
--output-ht /out/pqtl_liver.ht \
--source gtex_fang \
--tissue Liver \
--gene-map-ht /data/ensembl_gene.ht \
--p-threshold 5e-8
```

Note: Fang pQTL data uses gene symbols. Provide `--gene-map-ht` (an Ensembl gene table with a `gene_name` field) for symbol → Ensembl ID mapping.
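The `mktable` builders write standard Hail Tables, so any of the outputs above can be sanity-checked with plain Hail. A minimal sketch, using the ClinVar output path from the first example (adjust paths to your own builds):

```python
import hail as hl

hl.init()

# Read a table produced by `hvantk mktable` (here, the ClinVar example above)
ht = hl.read_table("/out/clinvar.ht")

# Print the key and row schema, count rows, and preview a few records
ht.describe()
print(f"rows: {ht.count()}")
ht.show(5)
```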
Use a recipe to build many tables at once. JSON and YAML are both supported (YAML requires PyYAML installed).
Example JSON recipe (save as `examples/recipes/tables.example.json`):

```json
{
  "tables": [
    {
      "name": "clinvar",
      "input": "/data/clinvar_2024.vcf.bgz",
      "output": "/out/clinvar.ht",
      "params": {"reference_genome": "GRCh38", "export_tsv": true}
    },
    {
      "name": "interactome",
      "input": "/data/insider.bed.bgz",
      "output": "/out/interactome.ht",
      "params": {"reference_genome": "GRCh38"}
    }
  ]
}
```

Run:

```bash
hvantk mktable-batch --recipe examples/recipes/tables.example.json
```

YAML variant (`examples/recipes/tables.example.yaml`):
```yaml
---
tables:
  - name: clinvar
    input: /data/clinvar_2024.vcf.bgz
    output: /out/clinvar.ht
    params:
      reference_genome: GRCh38
      export_tsv: true
  - name: interactome
    input: /data/insider.bed.bgz
    output: /out/interactome.ht
    params:
      reference_genome: GRCh38
```
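If you want to sanity-check a YAML recipe before running the batch command, a small PyYAML sketch like the one below verifies that it parses and has the expected top-level layout. This is only an illustration whose checks mirror the example recipe above; it is not part of hvantk:

```python
import yaml  # requires PyYAML

with open("examples/recipes/tables.example.yaml") as fh:
    recipe = yaml.safe_load(fh)

# Basic structural checks mirroring the example recipe above
assert isinstance(recipe.get("tables"), list), "recipe must define a 'tables' list"
for entry in recipe["tables"]:
    missing = [k for k in ("name", "input", "output") if k not in entry]
    assert not missing, f"recipe entry {entry} is missing {missing}"
    print(entry["name"], "->", entry["output"])
```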
- UCSC Cell Browser (TSV matrix + TSV metadata)

First, find and download the dataset you need:

```bash
# Search for datasets by tissue or keyword
hvantk download ucsc --list_datasets --search heart

# Download a dataset (use child path for collections)
hvantk download ucsc --dataset hoc/all-heart --output-dir data/ucsc
```

Then build the MatrixTable:

```bash
hvantk mkmatrix ucsc \
--expression-matrix data/ucsc/hoc/all-heart/exprMatrix.tsv.gz \
--metadata data/ucsc/hoc/all-heart/meta.tsv \
--output-mt /out/ucsc.mt \
--gene-column gene \
--auto-convert-bgz \
--overwrite
```

- Expression Atlas (TSV matrix + SDRF TSV)

```bash
hvantk mkmatrix expression-atlas \
--expression-matrix /data/atlas/matrix.tsv \
--sdrf /data/atlas/atlas.sdrf.tsv \
--output-mt /out/atlas.mt \
--gene-column "Gene ID" \
--sample-id-column sample_id \
--overwrite
```

If your expression matrix is a plain `.gz` file, add `--auto-convert-bgz`:

```bash
hvantk mkmatrix expression-atlas \
--expression-matrix /data/atlas/matrix.tsv.gz \
--sdrf /data/atlas/atlas.sdrf.tsv \
--output-mt /out/atlas.mt \
--auto-convert-bgz
```

- CPTAC (TSV/CSV expression + TSV/CSV metadata)

```bash
hvantk mkmatrix cptac \
--expression /data/cptac/expression.tsv \
--metadata /data/cptac/metadata.tsv \
--output-mt /out/cptac.mt \
--gene-id-col GeneID \
--sample-id-col SampleID \
--categorical-cols TumorType,Stage \
--overwrite
```
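As with tables, the resulting MatrixTables can be inspected directly with Hail. A minimal sketch, using the CPTAC output path from the example above:

```python
import hail as hl

hl.init()

# Read a MatrixTable produced by `hvantk mkmatrix`
mt = hl.read_matrix_table("/out/cptac.mt")

# Print row/column/entry schemas, the dimensions, and a peek at the sample metadata
mt.describe()
print(mt.count())  # (n_rows, n_cols)
mt.cols().show(5)
```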
Example JSON recipe (save as `examples/recipes/matrices.example.json`):

```json
{
  "matrices": [
    {
      "name": "ucsc",
      "inputs": {
        "expression_matrix": "/data/ucsc/expr.tsv.bgz",
        "metadata": "/data/ucsc/meta.tsv"
      },
      "output": "/out/ucsc.mt",
      "params": {"gene_column": "gene", "overwrite": true}
    },
    {
      "name": "expression-atlas",
      "inputs": {
        "expression_matrix": "/data/atlas/matrix.tsv",
        "sdrf": "/data/atlas/atlas.sdrf.tsv"
      },
      "output": "/out/atlas.mt",
      "params": {"gene_column": "Gene ID", "sample_id_column": "sample_id"}
    }
  ]
}
```

Run:

```bash
hvantk mkmatrix-batch --recipe examples/recipes/matrices.example.json
```

YAML variant (`examples/recipes/matrices.example.yaml`):
```yaml
---
matrices:
  - name: ucsc
    inputs:
      expression_matrix: /data/ucsc/expr.tsv.bgz
      metadata: /data/ucsc/meta.tsv
    output: /out/ucsc.mt
    params:
      gene_column: gene
      overwrite: true
  - name: expression-atlas
    inputs:
      expression_matrix: /data/atlas/matrix.tsv
      sdrf: /data/atlas/atlas.sdrf.tsv
    output: /out/atlas.mt
    params:
      gene_column: "Gene ID"
      sample_id_column: sample_id
```

CPTAC JSON recipe (save as `examples/recipes/cptac.example.json`):
```json
{
  "matrices": [
    {
      "name": "cptac",
      "inputs": {
        "expression": "/data/cptac/expression.tsv",
        "metadata": "/data/cptac/metadata.tsv"
      },
      "output": "/out/cptac.mt",
      "params": {
        "gene_id_col": "GeneID",
        "gene_name_col": "Gene Name",
        "sample_id_col": "SampleID",
        "expression_col": "Expression",
        "categorical_cols": "TumorType,Stage",
        "overwrite": true
      }
    }
  ]
}
```

Run:

```bash
hvantk mkmatrix-batch --recipe examples/recipes/cptac.example.json
```

Predict genetic ancestry for samples using PCA and Random Forest classification against a labeled reference panel.
```bash
# Predict ancestry using 1000 Genomes as reference
hvantk ancestry-inference \
-q /data/my_cohort.mt \
-r /data/1kg_phase3.mt \
--ancestry-col super_pop \
-o /out/ancestry_predictions.ht \
--generate-report \
--export-tsv

# Conservative assignment with custom filtering
hvantk ancestry-inference \
-q /data/my_cohort.mt \
-r /data/1kg_phase3.mt \
--ancestry-col super_pop \
-o /out/ancestry_predictions.ht \
--min-af 0.05 \
--min-call-rate 0.99 \
--n-pcs 30 \
--n-pcs-classify 15 \
--min-prob 0.90 \
--generate-report
```
```python
import hail as hl
from hvantk.ancestry import run_ancestry_inference

# Initialize Hail
hl.init()

# Load data
query_mt = hl.read_matrix_table("my_cohort.mt")
reference_mt = hl.read_matrix_table("1kg_phase3.mt")

# Run inference
result = run_ancestry_inference(
    query_mt=query_mt,
    reference_mt=reference_mt,
    ancestry_col="super_pop",
    min_prob=0.75,
)

# Get results
predictions = result.get_predictions_df()
print(predictions['predicted_ancestry'].value_counts())

# Generate visualizations
result.generate_report("ancestry_report.html")
fig = result.plot_pca()
fig.savefig("pca_plot.png", dpi=300)

# Annotate original MT with predictions
annotated_mt = result.annotate_matrixtable(query_mt)
```
| File | Description |
|---|---|
| `predictions.ht` | Hail Table with ancestry predictions |
| `predictions.tsv` | TSV export (with `--export-tsv`) |
| `ancestry_report.html` | HTML report with visualizations |
| `rf_model.pkl` | Trained model (with `--save-model`) |
| `pca_loadings.ht` | PCA loadings (with `--save-loadings`) |
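For intuition only, the sketch below mimics the PCA → Random Forest workflow described above using scikit-learn on random toy data. It is not hvantk's implementation; the array shapes, labels, and PC count are made up for illustration, and the probability threshold plays the role of `--min-prob`:

```python
# Toy illustration of PCA + Random Forest ancestry classification (not hvantk's code)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
ref_genotypes = rng.integers(0, 3, size=(200, 500))    # reference samples x variants
ref_labels = rng.choice(["AFR", "EUR", "EAS"], size=200)
query_genotypes = rng.integers(0, 3, size=(20, 500))   # query cohort

# 1. Fit PCA on the reference panel and project the query samples into the same space
pca = PCA(n_components=10)
ref_pcs = pca.fit_transform(ref_genotypes)
query_pcs = pca.transform(query_genotypes)

# 2. Train a Random Forest on the labeled reference PCs and classify the query samples
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(ref_pcs, ref_labels)
probs = rf.predict_proba(query_pcs)

# 3. Assign an ancestry only when the top probability clears a threshold (cf. --min-prob)
min_prob = 0.75
best = probs.argmax(axis=1)
calls = [rf.classes_[i] if p[i] >= min_prob else "unassigned" for i, p in zip(best, probs)]
print(calls)
```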
Full Ancestry Documentation | Examples
Inspect, summarize, and extract markers from expression MatrixTables.
```bash
# Inspect column metadata fields
hvantk expression describe -m /data/heart_sc.mt

# Collapse into gene-level summary grouped by cell type
hvantk expression summarize \
-m /data/heart_sc.mt \
--group-by cell_type \
-o /out/heart_celltype_summary.ht

# Multi-field grouping with pre-filtering
hvantk expression summarize \
-m /data/heart_sc.mt \
--group-by cell_type --group-by region \
--filter-by time_point=9wpc \
--min-cells 50 \
-o /out/heart_summary.ht

# Extract marker genes (fold-change method)
hvantk expression markers \
-s /out/heart_celltype_summary.ht \
--method fold_change \
--top-n 200 \
-o /out/heart_markers.json

# Extract markers using Wilcoxon rank-sum test
hvantk expression markers \
-m /data/heart_sc.mt \
--method wilcoxon \
--group-by cell_type \
--top-n 200 \
-o /out/heart_wilcoxon_markers.json
```
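The `summarize` output is a Hail Table, so grouped expression summaries can be examined directly in Python. A minimal sketch, using the cell-type summary path from the example above (the row fields present depend on your grouping):

```python
import hail as hl

hl.init()

# Read the gene-level summary produced by `hvantk expression summarize`
summary_ht = hl.read_table("/out/heart_celltype_summary.ht")

# Inspect the schema and preview the first few grouped records
summary_ht.describe()
summary_ht.show(10)
```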
Convert plain-text gene panels into GeneSetCollection JSON files for use with `hvantk psroc`, `hvantk enrichex burden`, and `hvantk enrichex overlap`.
```bash
# Format: headerless two-column TSV (gene_set_name<TAB>gene_symbol)
hvantk genesets prepare -i panels.tsv -o panels.json

# With HGNC validation and alias resolution (recommended)
hvantk genesets prepare -i panels.tsv -o panels.json --hgnc /data/hgnc.ht

# Filter small sets and provide explicit background
hvantk genesets prepare -i panels.tsv -o panels.json \
--hgnc /data/hgnc.ht --min-genes 5 --background bg_genes.txt

# Also export as GMT for GSEA compatibility
hvantk genesets prepare -i panels.tsv -o panels.json --export-gmt panels.gmt

# Extract gene sets from COSMIC Cancer Gene Census
hvantk genesets cosmic --ht /data/cosmic_cgc.ht -o cosmic_gene_sets.json
```
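For reference, a tiny hypothetical `panels.tsv` in the headerless two-column, tab-separated format described above (`gene_set_name<TAB>gene_symbol`); the panel names and gene choices are placeholders, not a shipped example:

```text
cardiomyopathy_panel	MYH7
cardiomyopathy_panel	TNNT2
cardiomyopathy_panel	LMNA
arrhythmia_panel	SCN5A
arrhythmia_panel	KCNQ1
```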
```python
from hvantk.utils.geneset_io import parse_geneset_tsv, validate_with_hgnc
from hvantk.utils.gene_sets import load_gene_sets_from_dict

# Parse TSV
result = parse_geneset_tsv("panels.tsv")
# Optional: validate against HGNC
vr = validate_with_hgnc(result.gene_sets, "/data/hgnc.ht")
# Build and save collection
collection = load_gene_sets_from_dict(vr.gene_sets, source="prepare-geneset")
collection.save("panels.json")
```

```python
from hvantk.data.clingen_streamer import ClinGenStreamer

streamer = ClinGenStreamer("/data/clingen/clingen_gene_disease.ht")

# High-confidence genes
definitive = streamer.get_genes_by_classification("Definitive")

# Disease keyword search
cancer_genes = streamer.get_genes_by_disease(
    ["cancer", "carcinoma", "tumor"],
    match_mode="contains",
    min_classification="Moderate",
)

# Dataset stats + summary
stats = streamer.compute_stats()
summary = streamer.classification_summary()

# Export gene sets for EnrichEx
streamer.export_for_enrichex(
    "/out/clingen_gene_sets.json",
    min_classification="Moderate",
)

# Translate output to Ensembl IDs using GeneMapper
import hail as hl
from hvantk.data.gene_mapper import GeneMapper

hgnc_ht = hl.read_table("/data/hgnc/hgnc.ht")
mapper = GeneMapper(hgnc_ht)
ensembl_ids = streamer.get_genes_by_classification(
    "Definitive",
    gene_mapper=mapper,
    output_id_type="ensembl_gene_id",
)

# Or translate to HGNC IDs
hgnc_ids = streamer.to_gene_set(
    min_classification="Moderate",
    gene_mapper=mapper,
    output_id_type="hgnc_id",
)
```

Hail supports standard gzip (`.gz`) and uncompressed files but processes them single-threaded. Block-gzipped (BGZF) `.bgz` files enable parallel import and are strongly recommended for large datasets. hvantk provides two ways to convert:
Several CLI commands support the `--auto-convert-bgz` flag, which detects plain `.gz` files and converts them to BGZF before import:

```bash
hvantk mktable dbnsfp --raw-input data.gz --output-ht out.ht --auto-convert-bgz
hvantk mkmatrix ucsc -e expr.tsv.gz -m meta.tsv -o out.mt --auto-convert-bgz
hvantk mkmatrix expression-atlas -e matrix.tsv.gz -s atlas.sdrf.tsv -o out.mt --auto-convert-bgz
```

The converted `.bgz` file is written alongside the original (e.g., `data.gz` → `data.bgz`) and reused on subsequent runs.
For batch or one-off conversion:

```bash
# Default: replaces .gz extension with .bgz
hvantk utils convert-bgz input.tsv.gz

# Custom output path and thread count
hvantk utils convert-bgz input.tsv.gz -o output.tsv.bgz --threads 4
```

The command auto-detects whether the file is already BGZF and skips conversion if so.
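Under the hood, BGZF files are gzip files whose header carries a `BC` extra subfield, which is what distinguishes them from plain gzip. A rough Python sketch of that check, assuming the `BC` subfield sits first in the extra field as htslib-style writers produce; this is an illustration of the idea, not hvantk's detection code:

```python
def looks_like_bgzf(path: str) -> bool:
    """Heuristically detect BGZF by inspecting the gzip header's extra field."""
    with open(path, "rb") as fh:
        header = fh.read(18)
    # gzip magic bytes, deflate compression, and the FEXTRA flag set
    if len(header) < 18 or header[:4] != b"\x1f\x8b\x08\x04":
        return False
    # BGZF writers place a 'BC' (block size) subfield at the start of the extra field
    return header[12:14] == b"BC"

# Example usage with the filenames from the commands above (assuming they exist locally)
print(looks_like_bgzf("input.tsv.gz"))    # expected False for plain gzip
print(looks_like_bgzf("output.tsv.bgz"))  # expected True for block-gzipped output
```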
- Use `--overwrite` to replace an existing output. Without it, builders abort if the output exists.
- JSON vs YAML: JSON works out of the box; YAML recipes require `PyYAML` installed.
- For UCSC, gene labels may be pipe-delimited (e.g., `A|B`); `--split-gene-field` defaults to true.
- MatrixTables typically store sample/cell metadata under `mt.col_key` and `cols` metadata; inspect with `mt.describe()` in Python or via the CLI logs.
- gzip vs BGZF: Hail reads standard gzip files single-threaded, which is significantly slower for large files. Convert to BGZF with `--auto-convert-bgz` or `hvantk utils convert-bgz` for parallel import.
- Architecture – system design and extension points
- Contributing – development workflow and contribution guidelines
- Recipe Examples – ready-to-edit recipe templates