This document describes the unified schema for genomic annotation modules used in the just-dna-seq project.
The schema consists of three Parquet tables that work together with VCF variant data:
- annotations.parquet — Variant-level facts (what a variant is associated with)
- studies.parquet — Per-study evidence (literature references and study details)
- weights.parquet — Curator-defined scoring (genotype-specific weights)
All tables join with VCF data on rsid to provide genomic coordinates and allele information.
The annotation tables are designed to join with VCF/Parquet variant data:

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Chromosome |
| `start` | Int64 | Start position (0-based) |
| `end` | Int64 | End position |
| `id` | String | Variant identifier (rsid), the JOIN KEY |
| `ref` | String | Reference allele |
| `alt` | String | Alternative allele |
| `qual` | Float64 | Quality score |
| `filter` | String | Filter status |
| `END` | Int64 | End position (VCF INFO field) |

Note: The `id` field in VCF corresponds to `rsid` in annotation tables.
Purpose: Store variant-level facts — what each variant is associated with.
Granularity: One row per (rsid, module) combination.

| Column | Type | Required | Description |
|---|---|---|---|
| `rsid` | String | ✓ | Variant identifier (e.g., "rs7412") |
| `module` | String | ✓ | Source module name |
| `gene` | String | | Curated gene symbol (may differ from VCF annotations) |
| `phenotype` | String | | Trait or phenotype affected |
| `category` | String | | Category within the module |

Primary key: (rsid, module)
- gene: This is the curator-assigned gene, which may differ from automatic VCF annotations (e.g., VEP). Useful when curators have specific gene associations.
- phenotype: Describes the trait or condition the variant is associated with (e.g., "longevity", "lipid_metabolism", "coronary_disease").
- category: Module-specific categorization (e.g., "lipids", "insulin" for longevitymap; drug names for drugs module).
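
The one-row-per-(rsid, module) granularity can be checked before a table is written. The following is a minimal sketch in plain Python; the helper name `find_duplicate_keys` and the sample rows are assumptions made for illustration and are not part of the project.

```python
from collections import Counter


def find_duplicate_keys(rows: list[dict], key_fields: tuple[str, ...]) -> list[tuple]:
    """Return every key tuple that occurs more than once in `rows`."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Placeholder rows for illustration only, not curated data.
annotation_rows = [
    {"rsid": "rs7412", "module": "longevitymap", "phenotype": "longevity"},
    {"rsid": "rs7412", "module": "lipidmetabolism", "phenotype": "lipid_metabolism"},
    {"rsid": "rs7412", "module": "longevitymap", "phenotype": "longevity"},
]

# The first and third rows share the same (rsid, module) key.
duplicates = find_duplicate_keys(annotation_rows, ("rsid", "module"))
print(duplicates)
```

The same helper would apply to the other tables by passing their key columns, for example `("rsid", "module", "pmid")` for studies.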
Purpose: Store per-study evidence — literature references and study details.
Granularity: One row per (rsid, module, pmid) combination.

| Column | Type | Required | Description |
|---|---|---|---|
| `rsid` | String | ✓ | Variant identifier |
| `module` | String | ✓ | Source module name |
| `pmid` | String | | PubMed ID |
| `population` | String | | Study population (e.g., "European", "Asian") |
| `p_value` | String | | Statistical significance (kept as string due to mixed formats) |
| `conclusion` | String | | Study-specific conclusion text |
| `study_design` | String | | Study design description |

Primary key: (rsid, module, pmid)
- pmid: May need parsing from legacy data. Some sources store it as "[PMID 12345]; [PMID 67890]", others as plain "12345".
- p_value: Stored as string because formats vary (e.g., "2.00E-28", "< 0.05", "= 0.017", "1 x 10-7").
- Multiple studies per variant: This is expected and normal. A single rsid can have many studies across different populations.
Purpose: Store curator-defined scoring — genotype-specific weights for variant interpretation.
Granularity: One row per (rsid, genotype, module) combination.

| Column | Type | Required | Description |
|---|---|---|---|
| `rsid` | String | ✓ | Variant identifier |
| `genotype` | List[String] | ✓ | Genotype this weight applies to (NORMALIZED alphabetically) |
| `module` | String | ✓ | Source module name |
| `weight` | Float64 | | Numeric weight/score (nullable) |
| `state` | String | | Effect direction: risk/protective/neutral/significant/not_significant |
| `priority` | String | | Priority level (module-specific) |
| `conclusion` | String | | Genotype-specific text description |
| `curator` | String | ✓ | Curator organization |
| `method` | String | ✓ | Curation method |

Primary key: (rsid, genotype, module)
- genotype: Normalized to alphabetical order (e.g., "AG" not "GA", "CT" not "TC"). This ensures consistent matching regardless of source format.
- weight: Nullable because some modules (e.g., superhuman) have qualitative annotations without numeric scores.
- state: Semantic effect direction only. Values: `risk`, `protective`, `neutral`, `significant`, `not_significant`. Does NOT include `ref`/`alt` (computed at query time).
- curator/method: Required for provenance tracking.
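
The constraints in these notes lend themselves to a small row-level check. What follows is only a sketch under the rules stated above; `validate_weight_row` and both sample rows are hypothetical, and the sample values are placeholders rather than curated data.

```python
ALLOWED_STATES = {"risk", "protective", "neutral", "significant", "not_significant"}
REQUIRED_FIELDS = ("rsid", "genotype", "module", "curator", "method")


def validate_weight_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not row.get(field):
            problems.append(f"missing required field: {field}")
    genotype = row.get("genotype") or []
    if list(genotype) != sorted(genotype):
        problems.append(f"genotype is not in alphabetical order: {genotype}")
    state = row.get("state")
    if state is not None and state not in ALLOWED_STATES:
        problems.append(f"unknown state: {state}")
    return problems


# Placeholder rows for illustration only.
good_row = {
    "rsid": "rs7412", "genotype": ["C", "T"], "module": "longevitymap",
    "weight": None, "state": "neutral",
    "curator": "Olga Borysova", "method": "literature_review",
}
bad_row = {
    "rsid": "rs7412", "genotype": ["T", "C"], "module": "longevitymap",
    "weight": None, "state": "alt", "method": "literature_review",
}

print(validate_weight_row(good_row))
for problem in validate_weight_row(bad_row):
    print(problem)
```

A null `weight` is accepted on purpose, because the schema allows qualitative annotations without a numeric score.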
The following fields are not stored but can be computed at query time:

| Field | Formula | Description |
|---|---|---|
| `zygosity` | `"hom" if genotype[0] == genotype[1] else "het"` | Homozygous or heterozygous |
| `allele_type` | `"ref" if effect_allele == vcf.ref else "alt"` | Whether the allele is reference or alternative |
| `is_homozygous` | `genotype[0] == genotype[1]` | Boolean for homozygosity |

```python
import polars as pl

weights = pl.scan_parquet("weights.parquet")

# Add computed zygosity
weights_with_zygosity = weights.with_columns(
    pl.when(pl.col("genotype").list.get(0) == pl.col("genotype").list.get(1))
    .then(pl.lit("hom"))
    .otherwise(pl.lit("het"))
    .alias("zygosity")
)
```

Genotypes are stored in alphabetical order for consistent matching:

| Original | Normalized | Zygosity |
|---|---|---|
| `"GA"` | `["A", "G"]` | het |
| `"TC"` | `["C", "T"]` | het |
| `"AA"` | `["A", "A"]` | hom |
| `"GG"` | `["G", "G"]` | hom |

```python
from typing import Optional


def normalize_genotype(genotype: Optional[str]) -> Optional[list[str]]:
    """Normalize genotype to alphabetical list of alleles."""
    if genotype is None:
        return None
    return sorted(genotype)
```

```python
import polars as pl

# The same normalization as a Polars expression.
# df is assumed to have a string column named genotype_raw.
df = df.with_columns(
    pl.when(pl.col("genotype_raw").str.len_chars() == 2)
    .then(
        # Two-character genotype: swap the alleles if they are out of order
        pl.when(pl.col("genotype_raw").str.slice(0, 1) > pl.col("genotype_raw").str.slice(1, 1))
        .then(
            pl.concat_list([
                pl.col("genotype_raw").str.slice(1, 1),
                pl.col("genotype_raw").str.slice(0, 1)
            ])
        )
        .otherwise(
            pl.concat_list([
                pl.col("genotype_raw").str.slice(0, 1),
                pl.col("genotype_raw").str.slice(1, 1)
            ])
        )
    )
    .otherwise(
        # Fallback for genotypes that are not exactly two characters
        pl.col("genotype_raw").str.split("").list.slice(1, -1)
    )
    .alias("genotype")
)
```

The state field in weights uses these semantic values:

| Value | Description |
|---|---|
| `risk` | Increases disease/negative outcome risk |
| `protective` | Decreases disease/negative outcome risk |
| `neutral` | No significant effect (typically weight ≈ 0) |
| `significant` | Statistically significant (used by drugs module) |
| `not_significant` | Not statistically significant |

Note: `ref` and `alt` are not stored as state values. Whether an allele is reference or alternative can be computed by comparing with VCF data.
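
The comparison with VCF data mentioned in the note could look like the sketch below. It follows the `allele_type` formula from the computable-fields table; the function names are assumptions, and the sample genotype and reference allele are illustrative only.

```python
def allele_type(effect_allele: str, vcf_ref: str) -> str:
    """Label an allele "ref" if it equals the VCF ref column, otherwise "alt"."""
    return "ref" if effect_allele == vcf_ref else "alt"


def genotype_allele_types(genotype: list[str], vcf_ref: str) -> list[str]:
    """Apply allele_type to every allele of a normalized genotype."""
    return [allele_type(allele, vcf_ref) for allele in genotype]


# Illustrative input: a heterozygous genotype at a site whose VCF ref is "C".
print(genotype_allele_types(["C", "T"], "C"))  # ['ref', 'alt']
```

Note that this formula labels any allele that differs from `ref` as "alt", including one that matches neither VCF column; a stricter check would also compare against `alt`.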
Each weight entry includes provenance information:

| Module | Curator | Method |
|---|---|---|
| longevitymap | Olga Borysova | literature_review |
| lipidmetabolism | Olga Borysova | literature_review |
| vo2max | Olga Borysova | literature_review |
| superhuman | Olga Borysova | literature_review |
| coronary | Olga Borysova | gwas_literature |
| drugs | PharmGKB | pharmacogenomics_db |

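
During conversion, the provenance columns could be filled from a per-module lookup. This is a sketch only: the mapping repeats the table above, while the `PROVENANCE` constant and the `with_provenance` helper are hypothetical names, not part of the pipeline.

```python
# Curator and method per module, copied from the provenance table.
PROVENANCE = {
    "longevitymap": ("Olga Borysova", "literature_review"),
    "lipidmetabolism": ("Olga Borysova", "literature_review"),
    "vo2max": ("Olga Borysova", "literature_review"),
    "superhuman": ("Olga Borysova", "literature_review"),
    "coronary": ("Olga Borysova", "gwas_literature"),
    "drugs": ("PharmGKB", "pharmacogenomics_db"),
}


def with_provenance(row: dict) -> dict:
    """Return a copy of `row` with curator and method set from its module."""
    curator, method = PROVENANCE[row["module"]]
    return {**row, "curator": curator, "method": method}


print(with_provenance({"rsid": "rs7412", "module": "drugs"}))
```

An unknown module raises `KeyError`, which keeps rows without provenance from slipping into weights.parquet unnoticed.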
```python
import polars as pl

# Load data
vcf = pl.scan_parquet("vcf/*.parquet")
annotations = pl.scan_parquet("annotations.parquet")

# Join annotations with VCF
annotated = (
    vcf
    .rename({"id": "rsid"})
    .join(annotations, on="rsid")
)
```

```python
# User's genotype data (from VCF)
user_genotypes = pl.DataFrame({
    "rsid": ["rs7412", "rs429358"],
    "user_genotype": ["CT", "TC"]  # Will be normalized
})

# Normalize user genotypes
user_genotypes = user_genotypes.with_columns(
    pl.col("user_genotype").map_elements(
        lambda g: "".join(sorted(g)) if g and len(g) == 2 else g,
        return_dtype=pl.Utf8
    ).alias("user_genotype")
)

# Load weights. genotype is stored as List[String], so flatten it to a
# string such as "CT" to make it comparable with user_genotype.
weights = pl.scan_parquet("weights.parquet").with_columns(
    pl.col("genotype").list.join("")
)

# Join to get applicable weights
scored = user_genotypes.join(
    weights.collect(),
    left_on=["rsid", "user_genotype"],
    right_on=["rsid", "genotype"]
)
```

```python
studies = pl.scan_parquet("studies.parquet")

# All studies for rs7412
rs7412_studies = (
    studies
    .filter(pl.col("rsid") == "rs7412")
    .collect()
)
```

```python
import polars as pl

# Load all data
vcf = pl.scan_parquet("vcf/*.parquet").rename({"id": "rsid"})
annotations = pl.scan_parquet("annotations.parquet")
studies = pl.scan_parquet("studies.parquet")
# genotype is stored as List[String]; flatten it to a string for the join below
weights = pl.scan_parquet("weights.parquet").with_columns(
    pl.col("genotype").list.join("")
)

# User's normalized genotypes
user_genotypes = pl.DataFrame({
    "rsid": ["rs7412"],
    "genotype": ["CT"]
})

# Get full annotation with weights
result = (
    user_genotypes
    .join(vcf.collect(), on="rsid")
    .join(annotations.collect(), on="rsid")
    .join(
        weights.collect(),
        on=["rsid", "genotype", "module"],
        how="left"
    )
    .with_columns(
        # Add computed zygosity
        pl.when(pl.col("genotype").str.slice(0, 1) == pl.col("genotype").str.slice(1, 1))
        .then(pl.lit("hom"))
        .otherwise(pl.lit("het"))
        .alias("zygosity")
    )
)
```

```
modules/
├── annotations.parquet      # 5 columns - variant facts
├── studies.parquet          # 7 columns - study evidence
├── weights.parquet          # 9 columns - curator scores
└── by_module/               # Optional: split by module
    ├── longevitymap/
    │   ├── annotations.parquet
    │   ├── studies.parquet
    │   └── weights.parquet
    ├── lipidmetabolism/
    │   └── ...
    └── ...
```
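
A loader might resolve table locations in this layout along the following lines. The helper `table_path` is a hypothetical name used for illustration, and the sketch assumes the optional `by_module/` split mirrors the three top-level file names.

```python
from pathlib import PurePosixPath
from typing import Optional

TABLES = ("annotations", "studies", "weights")


def table_path(root: str, table: str, module: Optional[str] = None) -> PurePosixPath:
    """Build the path of a combined table, or of a per-module table if `module` is given."""
    if table not in TABLES:
        raise ValueError(f"unknown table: {table}")
    base = PurePosixPath(root)
    if module is not None:
        base = base / "by_module" / module
    return base / f"{table}.parquet"


print(table_path("modules", "weights"))
print(table_path("modules", "studies", module="longevitymap"))
```

The resulting paths can be passed to `pl.scan_parquet` after converting them with `str()`.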

```
┌─────────────────────────────────────────────────────────────────┐
│                     VCF DATA (join source)                      │
├─────────────────────────────────────────────────────────────────┤
│  chrom, start, end, id (rsid), ref, alt, qual, filter, END      │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 │ JOIN ON rsid
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    MODULE ANNOTATION TABLES                     │
└─────────────────────────────────────────────────────────────────┘

┌───────────────────────────┐
│    annotations.parquet    │
│   (variant-level facts)   │
├───────────────────────────┤
│ PK: rsid, module          │◄──────────────────────┐
├───────────────────────────┤                       │
│ gene                      │                       │
│ phenotype                 │                       │
│ category                  │                       │
└───────────────────────────┘                       │
              │                                     │
              │ 1:N                                 │
              ▼                                     │
┌───────────────────────────┐                       │
│      studies.parquet      │                       │
│   (per-study evidence)    │                       │
├───────────────────────────┤                       │
│ FK: rsid, module          │───────────────────────┤
│ PK: + pmid                │                       │
├───────────────────────────┤                       │
│ population                │                       │
│ p_value                   │                       │
│ conclusion                │                       │
│ study_design              │                       │
└───────────────────────────┘                       │
                                                    │
┌───────────────────────────┐                       │
│      weights.parquet      │                       │
│ (curator-defined scores)  │                       │
├───────────────────────────┤                       │
│ FK: rsid                  │───────────────────────┘
│ PK: + genotype, module    │
├───────────────────────────┤
│ weight                    │
│ state                     │
│ priority                  │
│ conclusion                │
├───────────────────────────┤
│ curator                   │
│ method                    │
└───────────────────────────┘

COMPUTABLE (not stored):
  zygosity    ← from genotype
  allele_type ← from comparing with VCF ref/alt
```

The following legacy modules can be converted to this schema:
| Legacy Module | Annotations | Studies | Weights | Notes |
|---|---|---|---|---|
| just_longevitymap | ✓ | ✓ | ✓ | Full conversion |
| just_lipidmetabolism | ✓ | ✓ | ✓ | Full conversion |
| just_vo2max | ✓ | ✓ | ✓ | Full conversion |
| just_superhuman | ✓ | ✓ | ✓ | No numeric weights (NULL) |
| just_coronary | ✓ | ✓ | ✓ | Full conversion |
| just_drugs | ✓ | ✓ | ✓ | Missing some allele info |
- Separation of concerns: Facts, evidence, and scoring are distinct concepts
- Different update frequencies: Weights may update independently of study evidence
- Query efficiency: Most queries need only one or two tables
Different sources use different conventions ("AG" vs "GA"). Alphabetical normalization ensures consistent matching.
P-values come in many formats ("2.00E-28", "< 0.05", "1 x 10-7"). Storing as string preserves original precision and format.
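
Keeping the original string does not prevent numeric use at query time. The sketch below parses the formats quoted above into a comparator and a float; `parse_p_value` is a hypothetical helper, and treating a missing comparator as "=" is an assumption. Strings it cannot interpret yield `None`.

```python
import re
from typing import Optional

_COMPARATOR = re.compile(r"^\s*(<=|>=|<|>|=)?\s*(.*?)\s*$")
_POWER_OF_TEN = re.compile(r"^(\d+(?:\.\d+)?)\s*[xX*]\s*10\s*\^?\s*(-?\d+)$")


def parse_p_value(raw: Optional[str]) -> Optional[tuple[str, float]]:
    """Return (comparator, value) for a p-value string, or None if it cannot be parsed."""
    if not raw:
        return None
    comparator, body = _COMPARATOR.match(raw).groups()
    comparator = comparator or "="  # assumption: no comparator means equality
    power = _POWER_OF_TEN.match(body)
    if power:
        # "1 x 10-7" style: rebuild as scientific notation before converting
        mantissa, exponent = power.groups()
        return comparator, float(f"{mantissa}e{exponent}")
    try:
        return comparator, float(body)
    except ValueError:
        return None


for text in ["2.00E-28", "< 0.05", "= 0.017", "1 x 10-7", "n/a"]:
    print(repr(text), "->", parse_p_value(text))
```

Returning the comparator alongside the value preserves the difference between an exact p-value and an upper bound such as "< 0.05".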
Zygosity is trivially computable from genotype (genotype[0] == genotype[1]). Storing it would be redundant.
`allele_type` requires VCF data to compute. Since the annotation tables are always joined with VCF data, it can be computed at query time.
- Hugging Face Module Consumption Guide — How to use these modules from the Hugging Face repository.
- Dagster Modules Pipeline — Implementation details of the conversion pipeline.