This repository prepares genomic annotation data (Ensembl, ClinVar, dbSNP, gnomAD, etc.) and converts OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories).

---

## General Standards (Reusable)

### Coding Standards

- **Type hints**: Mandatory for all Python code.
- **Pathlib**: Always use for all file paths.
- **Pydantic 2**: Mandatory for data classes.
- **Avoid `__all__`**: Avoid `__init__.py` files with `__all__`; they obscure where names are actually defined.

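A minimal sketch of these standards (the names are illustrative, not from the repo; a frozen `dataclass` stands in for the mandated Pydantic 2 model only to keep the sketch dependency-free):

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ConversionResult:
    """Stand-in for a Pydantic 2 BaseModel; fields are fully type-hinted."""
    module: str
    output: Path
    rows: int


def list_parquet_files(root: Path) -> list[Path]:
    """Pathlib everywhere: return parquet files under root, sorted for determinism."""
    return sorted(root.rglob("*.parquet"))


# Usage on a throwaway directory.
root = Path(tempfile.mkdtemp())
(root / "a.parquet").touch()
(root / "b.parquet").touch()
found = list_parquet_files(root)
result = ConversionResult(module="demo", output=found[0], rows=0)
```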
### Dagster Best Practices (Universal)

#### Asset Return Types

| Asset Returns | IO Manager | Use Case |
|---------------|------------|----------|
| `pl.LazyFrame` | `polars_parquet_io_manager` | Small parquet, schema visibility |
| `Path` | Custom IO manager | Large data, DuckDB joins, file uploads |
| `dict` | Default | API responses, upload results |

#### Key Rules

- **dagster-polars**: Use `PolarsParquetIOManager` for `LazyFrame` assets → automatic schema/row count in UI
- **Path assets**: Add `"dagster/column_schema": polars_schema_to_table_schema(path)` for schema visibility
- **Asset checks**: Use `@asset_check` for validation; include via `AssetSelection.checks_for_assets(...)`
- **Streaming**: Use `lazy_frame.sink_parquet()`, never `.collect().write_parquet()` on large data
- **DuckDB**: Use for large joins (out-of-core); set `memory_limit` and `temp_directory`
- **Concurrency**: Use `op_tags={"dagster/concurrency_key": "name"}` to limit parallel execution

#### Dynamic Partitions Pattern

1. Create partition def: `PARTS = DynamicPartitionsDefinition(name="files")`
2. Discovery asset registers partitions: `context.instance.add_dynamic_partitions(PARTS.name, keys)`
3. Partitioned assets use `partitions_def=PARTS` and access `context.partition_key`
4. Collector depends on partitioned output via `deps=[partitioned_asset]`, scans filesystem for results

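Dagster is not imported in the sketch below; the stand-in names (`registry`, `discovery`, `collector`) are illustrative and only mirror the control flow of steps 1-4:

```python
from pathlib import Path
from tempfile import TemporaryDirectory

# Stand-in for the Dagster instance's dynamic-partition registry.
registry: dict[str, set[str]] = {}


def discovery(files: list[str]) -> list[str]:
    # Step 2: the discovery asset registers one partition per upstream file.
    registry.setdefault("files", set()).update(files)
    return files


def partitioned_asset(partition_key: str, out_dir: Path) -> Path:
    # Step 3: each run handles exactly one partition key.
    out = out_dir / f"{partition_key}.parquet"
    out.write_text("placeholder")
    return out


def collector(out_dir: Path) -> list[str]:
    # Step 4: the collector scans the filesystem for partitioned outputs.
    return sorted(p.name for p in out_dir.glob("*.parquet"))


with TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    for key in discovery(["clinvar", "dbsnp"]):
        partitioned_asset(key, out_dir)
    results = collector(out_dir)
```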
#### Execution

- **Python API only**: `defs.resolve_job_def(name)` + `job.execute_in_process(instance=instance)`
- **Same DAGSTER_HOME** for UI and execution: `dagster dev -m module.definitions`
- **All assets in `Definitions(assets=[...])`** for lineage visibility in UI

#### Anti-Patterns

- `dagster job execute` CLI (deprecated)
- Hardcoded asset names; use `defs.get_all_asset_specs()`
- Config for unselected assets (validation errors)
- Suspended jobs holding DuckDB file locks

### Test Generation Guidelines (Universal)

```python
assert source_ids == output_ids
assert valid_states == {"active", "inactive", "pending"}  # from API spec
```

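A self-contained sketch of source-comparison validation (the rows below are hypothetical; the point is to compare against the source, not against hardcoded counts):

```python
# Hypothetical source rows and converted output rows.
source = [
    {"id": "rs7412", "state": "active"},
    {"id": "rs429358", "state": "pending"},
]
output = [
    {"id": "rs429358", "state": "pending"},
    {"id": "rs7412", "state": "active"},
]

# Set equality: conversion must neither drop nor invent records.
source_ids = {row["id"] for row in source}
output_ids = {row["id"] for row in output}
assert source_ids == output_ids

# Enum fields must stay within the documented vocabulary.
allowed_states = {"active", "inactive", "pending"}
observed_states = {row["state"] for row in output}
assert observed_states <= allowed_states
```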
---

## Project-Specific: prepare-annotations

### Key Entry Points

- `src/prepare_annotations/definitions.py`: Main Dagster definitions (assets, jobs, resources)
- `src/prepare_annotations/cli.py`: Typer CLI; run `uv run prepare --help`
- `src/prepare_annotations/assets/`: Dagster assets (`ensembl.py`, `modules.py`)
- `src/prepare_annotations/converters/`: OakVar module converters

### CLI Commands

```bash
uv run prepare longevitymap                           # Full pipeline: convert + Ensembl join + upload
uv run prepare longevitymap --convert-only            # Convert only
uv run dagster-ui                                     # Launch Dagster UI
uv run modules data --repo dna-seq/just_longevitymap  # Download module data
```

### Unified Annotation Schema

Module conversion produces three standardized parquet files:

1. **annotations.parquet**: `rsid, module, gene, phenotype, category`
2. **studies.parquet**: `rsid, module, pmid, population, p_value, conclusion, study_design`
3. **weights.parquet**: `rsid, genotype, module, weight, state, priority, conclusion, curator, method`
   - State: `protective`, `risk`, or `neutral`
   - Genotype: list of 2 alleles, alphabetically sorted

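A minimal validator for the weights schema, assuming only the conventions listed above (column names and state values come from this document; the helper names are illustrative):

```python
VALID_STATES = {"protective", "risk", "neutral"}


def normalize_genotype(allele1: str, allele2: str) -> list[str]:
    """Two alleles, alphabetically sorted, e.g. ('T', 'C') -> ['C', 'T']."""
    return sorted((allele1.upper(), allele2.upper()))


def check_weights_row(row: dict) -> None:
    """Assert the documented invariants for one weights.parquet row."""
    assert row["state"] in VALID_STATES, f"unknown state: {row['state']}"
    genotype = row["genotype"]
    assert len(genotype) == 2 and genotype == sorted(genotype)


row = {
    "rsid": "rs7412",
    "genotype": normalize_genotype("T", "C"),
    "module": "longevitymap",
    "weight": 1.0,
    "state": "protective",
}
check_weights_row(row)
```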
### Available Modules

Modules from https://github.com/orgs/dna-seq/repositories:
- `just_longevitymap`, `just_coronary`, `just_vo2max`, `just_lipidmetabolism`
- `just_superhuman`, `just_drugs`, `just_pathogenic`, `just_cancer`, `just_prs`

300- 2 . ** APOE Variants** (Critical longevity markers)
301- - rs7412 (APOE e2): expected protective weight set ` {0.5, 1.0} `
302- - rs429358 (APOE e4): expected risk weight set ` {-0.5, -1.0} `
167+ ### Testing
303168
304- 3 . ** Schema Transformations**
305- - Genotype format: list of 2 alleles, alphabetically sorted
306- - State values: valid enum (` protective ` , ` risk ` , ` alt ` , ` ref ` )
307- - Module column: all rows have ` "longevitymap" `
169+ ``` bash
170+ uv run pytest # Run all tests
171+ uv run pytest tests/test_longevitymap_module.py # Specific module
172+ uv run pytest -v --tb=short # Verbose with short traceback
173+ ```
308174
### Deployment

Datasets are uploaded to the `just-dna-seq` organization on the Hugging Face Hub.