Commit 3264612: updated readme and validation checks
1 parent 3ddd8b6

9 files changed: 965 additions & 342 deletions
AGENTS.md

Lines changed: 66 additions & 206 deletions
@@ -2,43 +2,11 @@
 
 This repository is dedicated to the preparation of genomic annotation data (Ensembl, ClinVar, dbSNP, gnomAD, etc.) and conversion of OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories).
 
-## Repository Layout (uv package)
-
-The package follows Dagster best practices with utilities organized in subpackages:
-
-- `src/prepare_annotations/`: Main package
-  - `definitions.py`: **Main Dagster definitions** (assets, jobs, resources)
-  - `pipelines.py`: **Standalone API** for ClinVar, dbSNP, gnomAD (non-Dagster sources)
-  - `cli.py`: Main Typer CLI entrypoint
-
-  - `core/`: Core utilities
-    - `io.py`: VCF/Parquet I/O utilities
-    - `models.py`: Pydantic models for results
-    - `paths.py`: Path helpers and resource locations
-    - `runtime.py`: Execution environment and profiling
-    - `config.py`: Configuration helpers
-    - `splitter.py`: Variant splitting by type
-
-  - `assets/`: Dagster assets
-    - `ensembl.py`: Ensembl VCF pipeline assets
-    - `modules.py`: OakVar module conversion assets
-
-  - `downloaders/`: Download utilities
-    - `vcf.py`: VCF download with retry/resume
-    - `genome.py`: Ensembl genome FASTA download
-
-  - `huggingface/`: HuggingFace Hub integration
-    - `uploader.py`: Upload utilities
-    - `dataset_cards.py`: Dataset card templates
-
-  - `converters/`: OakVar module converters
-    - `longevitymap.py`, `coronary.py`, `drugs.py`, etc.
-    - `common.py`: Shared conversion utilities
-
-- `dataset_cards/`: Markdown templates for Hugging Face dataset cards
-- `tests/`: Unit and integration tests
-
-## Coding Standards
+---
+
+## General Standards (Reusable)
+
+### Coding Standards
 
 - **Type hints**: Mandatory for all Python code.
 - **Pathlib**: Always use for all file paths.
@@ -49,149 +17,44 @@ The package follows Dagster best practices with utilities organized in subpackages:
 - **Pydantic 2**: Mandatory for data classes.
 - **Avoid `__all__`**: Avoid `__init__.py` with `__all__`, as it obscures where things are located.
 
-## Import Guidelines
-
-For new code, use the organized subpackages:
-
-```python
-# Dagster definitions
-from prepare_annotations.definitions import defs
-
-# Standalone API
-from prepare_annotations.pipelines import PreparationPipelines
-
-# Assets
-from prepare_annotations.assets import ensembl_vcf_urls, longevitymap_weights
-
-# Core utilities
-from prepare_annotations.core.io import read_vcf_file, vcf_to_parquet
-from prepare_annotations.core.models import PreparationResult
-from prepare_annotations.core.paths import get_cache_dir, LOGS_DIR
-
-# Downloaders
-from prepare_annotations.downloaders.vcf import download_path, list_paths
-from prepare_annotations.downloaders.genome import download_ensembl_genome
-
-# HuggingFace
-from prepare_annotations.huggingface.uploader import upload_parquet_to_hf
-from prepare_annotations.huggingface.dataset_cards import generate_ensembl_card
-
-# Converters
-from prepare_annotations.converters import convert_longevitymap
-```
-
-## Dagster Guide (Agents)
-
-These pipelines are Dagster-first. Follow these rules to avoid the issues we already hit:
-
-### 1) Use modern API (no legacy CLI)
-
-- Do not use `dagster job execute` or other deprecated CLI for orchestration.
-- Prefer the Python API: `materialize()` for assets, `execute_job()` for non-partitioned jobs.
-- If a CLI is needed, use the Dagster dev server only (`uv run dagster dev -m prepare_annotations.definitions`).
-
-### 2) Dynamic partitions must be explicit
-
-- Use dynamic partitions whenever the upstream file list is external or changing (FTP, HTTP, HF repo, etc.).
-- The discovery asset must register partitions via `DynamicPartitionsDefinition`.
-- Partitioned assets must use `context.partition_key` and must not be run without a partition key.
-
-### 3) Materialize with full asset list + selection
-
-When using `materialize()`:
-
-- Always pass the full asset graph (all upstream assets).
-- Use `selection=["asset_name"]` to run a single asset.
-- Do not pass config for assets not present in the selection.
-
-Example pattern:
-
-- `materialize(assets=all_assets, selection=["download_asset"], partition_key="...")`
-
-### 4) IO manager must resolve partition paths
-
-Partitioned assets need file-level input paths:
-
-- Inputs must resolve to concrete files for each partition key.
-- If an IO manager returns a directory for a partitioned asset, downstream processing will crash.
-
-### 5) Collector must depend on partitioned outputs
-
-Collector assets must declare deps on the partitioned asset to ensure correct lineage and ordering.
+### Dagster Best Practices (Universal)
 
-### 6) Filter temporary outputs before upload
+#### Asset Return Types
 
-Do not upload temp files. Filter out:
+| Asset Returns | IO Manager | Use Case |
+|---------------|------------|----------|
+| `pl.LazyFrame` | `polars_parquet_io_manager` | Small parquet, schema visibility |
+| `Path` | Custom IO manager | Large data, DuckDB joins, file uploads |
+| `dict` | Default | API responses, upload results |

-- files starting with `tmp`
-- files ending with `.tmp.parquet` or `.parquet.tmp`
-- dotfiles
+#### Key Rules
 
-### 7) Concurrency and memory safety
+- **dagster-polars**: Use `PolarsParquetIOManager` for `LazyFrame` assets → automatic schema/row count in UI
+- **Path assets**: Add `"dagster/column_schema": polars_schema_to_table_schema(path)` for schema visibility
+- **Asset checks**: Use `@asset_check` for validation; include via `AssetSelection.checks_for_assets(...)`
+- **Streaming**: Use `lazy_frame.sink_parquet()`, never `.collect().write_parquet()` on large data
+- **DuckDB**: Use for large joins (out-of-core); set `memory_limit` and `temp_directory`
+- **Concurrency**: Use `op_tags={"dagster/concurrency_key": "name"}` to limit parallel execution

-- Use download parallelism (I/O bound), but keep conversion limited.
-- Concurrency should be enforced via Dagster tag limits in `dagster.yaml`.
+#### Dynamic Partitions Pattern
 
-### 8) Dagster home
+1. Create partition def: `PARTS = DynamicPartitionsDefinition(name="files")`
+2. Discovery asset registers partitions: `context.instance.add_dynamic_partitions(PARTS.name, keys)`
+3. Partitioned assets use `partitions_def=PARTS`, access `context.partition_key`
+4. Collector depends on partitioned output via `deps=[partitioned_asset]`, scans filesystem for results

-- `DAGSTER_HOME` is `data/interim/dagster`
-- Always set it for runs so UI and API share the same instance.
+#### Execution
 
-### Primary Dagster Pipelines (Recommended)
+- **Python API only**: `defs.resolve_job_def(name)` + `job.execute_in_process(instance=instance)`
+- **Same DAGSTER_HOME** for UI and execution: `dagster dev -m module.definitions`
+- **All assets in `Definitions(assets=[...])`** for lineage visibility in UI

-- `uv run dagster-ensembl`: Run the full Ensembl pipeline (download, convert, upload).
-- `uv run prepare longevitymap`: Run the LongevityMap pipeline (convert, join with Ensembl, upload).
-- `uv run dagster-ui`: Launch Dagster UI for monitoring and lineage visualization.
+#### Anti-Patterns
 
-### OakVar Module Management
-
-- `uv run modules data --repo dna-seq/just_longevitymap`: Download module data files.
-- `uv run modules clone --repo dna-seq/just_longevitymap`: Clone full module repository.
-- `uv run prepare longevitymap`: Run full Dagster pipeline (convert + Ensembl join + upload).
-- `uv run prepare longevitymap --convert-only`: Convert only (no Ensembl join, no upload).
-
-### Unified Annotation Schema
-
-The module conversion produces three standardized parquet files:
-
-1. **annotations.parquet**: Variant-level facts
-   - Schema: `rsid, module, gene, phenotype, category`
-   - Links variants to genes and phenotype categories
-
-2. **studies.parquet**: Per-study evidence
-   - Schema: `rsid, module, pmid, population, p_value, conclusion, study_design`
-   - Scientific evidence from publications
-
-3. **weights.parquet**: Curator-defined scoring
-   - Schema: `rsid, genotype, module, weight, state, priority, conclusion, curator, method`
-   - Curated weight assignments for variant impact
-   - State: `protective`, `risk`, or `neutral`
-   - Genotype: Normalized (e.g., `CT`, `TT`, `AA`)
-
-### Available Modules
-
-Modules from https://github.com/orgs/dna-seq/repositories:
-- `just_longevitymap`: Longevity-associated variants
-- `just_pathogenic`: Pathogenic variant annotations
-- `just_cancer`: Cancer-associated genes
-- `just_coronary`: Coronary disease variants
-- `just_vo2max`: VO2max-related variants
-- `just_lipidmetabolism`: Lipid metabolism variants
-- `just_prs`: Polygenic risk score data
-- `just_drugs`: Pharmacogenomic data
-- `just_superhuman`: Elite performance genetics
-
-## Deployment
-
-Datasets are typically uploaded to the `just-dna-seq` organization on Hugging Face Hub.
-
-## Testing
-
-### Test Philosophy
-
-- **Integration tests**: Use real data, no mocking unless necessary
-- **Auto-download**: Tests automatically download required data from GitHub
-- **Validation**: Comprehensive checks ensuring data integrity during conversion
+- `dagster job execute` CLI (deprecated)
+- Hardcoded asset names; use `defs.get_all_asset_specs()`
+- Config for unselected assets (validation errors)
+- Suspended jobs holding DuckDB file locks
 
 ### Test Generation Guidelines (Universal)

@@ -265,53 +128,50 @@ assert source_ids == output_ids
 assert valid_states == {"active", "inactive", "pending"}  # from API spec
 ```
 
-### Running Tests
+---
 
-```bash
-# Run all tests (excluding large downloads)
-uv run pytest
+## Project-Specific: prepare-annotations
 
-# Run specific module tests
-uv run pytest tests/test_longevitymap_module.py -v
+### Key Entry Points
 
-# Run with verbose output
-uv run pytest -vvv
-```
+- `src/prepare_annotations/definitions.py`: Main Dagster definitions (assets, jobs, resources)
+- `src/prepare_annotations/cli.py`: Typer CLI - run `uv run prepare --help`
+- `src/prepare_annotations/assets/`: Dagster assets (ensembl.py, modules.py)
+- `src/prepare_annotations/converters/`: OakVar module converters
 
-### Test Fixtures
+### CLI Commands
 
-The `conftest.py` provides shared fixtures for OakVar module testing:
+```bash
+uv run prepare longevitymap                           # Full pipeline: convert + Ensembl join + upload
+uv run prepare longevitymap --convert-only            # Convert only
+uv run dagster-ui                                     # Launch Dagster UI
+uv run modules data --repo dna-seq/just_longevitymap  # Download module data
+```
 
-- `ensure_oakvar_module_data()`: Downloads module data if not present
-- `download_oakvar_module_data()`: Directly downloads from GitHub repositories
+### Unified Annotation Schema
 
-These fixtures are automatically used by test modules to ensure data availability.
+Module conversion produces three standardized parquet files:
 
-### Example: LongevityMap Validation
+1. **annotations.parquet**: `rsid, module, gene, phenotype, category`
+2. **studies.parquet**: `rsid, module, pmid, population, p_value, conclusion, study_design`
+3. **weights.parquet**: `rsid, genotype, module, weight, state, priority, conclusion, curator, method`
+   - State: `protective`, `risk`, or `neutral`
+   - Genotype: List of 2 alleles, alphabetically sorted

-The `test_longevitymap_module.py` validates conversion integrity:
+### Available Modules
 
-1. **Weights Table**
-   - Row counts: parquet ≥ sqlite (due to genotype expansion)
-   - Unique rsid counts match between formats
-   - Weight values preserved (min/max match, all unique values present)
-   - Per-rsid weight sets match (not sums, due to expansion)
+Modules from https://github.com/orgs/dna-seq/repositories:
+- `just_longevitymap`, `just_coronary`, `just_vo2max`, `just_lipidmetabolism`
+- `just_superhuman`, `just_drugs`, `just_pathogenic`, `just_cancer`, `just_prs`
 
-2. **APOE Variants** (Critical longevity markers)
-   - rs7412 (APOE e2): expected protective weight set `{0.5, 1.0}`
-   - rs429358 (APOE e4): expected risk weight set `{-0.5, -1.0}`
+### Testing
 
-3. **Schema Transformations**
-   - Genotype format: list of 2 alleles, alphabetically sorted
-   - State values: valid enum (`protective`, `risk`, `alt`, `ref`)
-   - Module column: all rows have `"longevitymap"`
+```bash
+uv run pytest                                    # Run all tests
+uv run pytest tests/test_longevitymap_module.py  # Specific module
+uv run pytest -v --tb=short                      # Verbose with short traceback
+```
 
-4. **Studies & Annotations**
-   - All PMIDs: set equality between sqlite and parquet
-   - Categories: parquet subset of sqlite categories
-   - Populations: parquet subset of sqlite populations
+### Deployment
 
-Tests automatically:
-1. Download SQLite from `dna-seq/just_longevitymap` if missing
-2. Convert to parquet if needed
-3. Validate data integrity via source comparison (not hardcoded counts)
+Datasets are uploaded to the `just-dna-seq` organization on HuggingFace Hub.
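The testing philosophy in this diff (validate against the source, not hardcoded counts) can be sketched with the standard-library `sqlite3` module; the table and column names below are illustrative assumptions, not the real module schema:

```python
# Sketch of source-comparison validation: the converted output must contain
# exactly the rsids present in the source SQLite database. The `variants`
# table and `rsid` column are illustrative assumptions.
import sqlite3


def source_rsids(db_path: str) -> set[str]:
    """Collect rsids from the source SQLite (the ground truth)."""
    with sqlite3.connect(db_path) as con:
        return {row[0] for row in con.execute("SELECT DISTINCT rsid FROM variants")}


def validate_conversion(db_path: str, converted_rsids: set[str]) -> None:
    """Assert set equality with the source, never a hardcoded count."""
    src = source_rsids(db_path)
    assert converted_rsids == src, (
        f"missing: {src - converted_rsids}, extra: {converted_rsids - src}"
    )
```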

README.md

Lines changed: 18 additions & 1 deletion
@@ -67,9 +67,26 @@ The `modules` command manages OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories)
 uv run modules data --repo dna-seq/just_longevitymap
 
 # Convert module data to unified schema
-uv run modules convert-longevitymap
+uv run prepare longevitymap --convert-only
 ```
 
+**Available modules**: `just_longevitymap`, `just_coronary`, `just_vo2max`, `just_lipidmetabolism`, `just_superhuman`, `just_drugs`, `just_pathogenic`, `just_cancer`, `just_prs`
+
+### Unified Annotation Schema
+
+Module conversion produces three standardized parquet files:
+
+| File | Schema |
+|------|--------|
+| **annotations.parquet** | `rsid, module, gene, phenotype, category` |
+| **studies.parquet** | `rsid, module, pmid, population, p_value, conclusion, study_design` |
+| **weights.parquet** | `rsid, genotype, module, weight, state, priority, conclusion, curator, method` |
+
+- **State**: `protective`, `risk`, or `neutral`
+- **Genotype**: List of 2 alleles, alphabetically sorted
+
+Converted datasets are uploaded to the [`just-dna-seq`](https://huggingface.co/just-dna-seq) organization on HuggingFace Hub.
+
 ## Package Structure
 
 The package follows Dagster best practices with utilities organized in subpackages:
