Commit 3264612: updated readme and validation checks
1 parent 3ddd8b6

9 files changed: 965 additions & 342 deletions
AGENTS.md

Lines changed: 66 additions & 206 deletions
@@ -2,43 +2,11 @@
 
 This repository is dedicated to the preparation of genomic annotation data (Ensembl, ClinVar, dbSNP, gnomAD, etc.) and conversion of OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories).
 
-## Repository Layout (uv package)
-
-The package follows Dagster best practices with utilities organized in subpackages:
-
-- `src/prepare_annotations/`: Main package
-  - `definitions.py`: **Main Dagster definitions** (assets, jobs, resources)
-  - `pipelines.py`: **Standalone API** for ClinVar, dbSNP, gnomAD (non-Dagster sources)
-  - `cli.py`: Main Typer CLI entrypoint
-
-  - `core/`: Core utilities
-    - `io.py`: VCF/Parquet I/O utilities
-    - `models.py`: Pydantic models for results
-    - `paths.py`: Path helpers and resource locations
-    - `runtime.py`: Execution environment and profiling
-    - `config.py`: Configuration helpers
-    - `splitter.py`: Variant splitting by type
-
-  - `assets/`: Dagster assets
-    - `ensembl.py`: Ensembl VCF pipeline assets
-    - `modules.py`: OakVar module conversion assets
-
-  - `downloaders/`: Download utilities
-    - `vcf.py`: VCF download with retry/resume
-    - `genome.py`: Ensembl genome FASTA download
-
-  - `huggingface/`: HuggingFace Hub integration
-    - `uploader.py`: Upload utilities
-    - `dataset_cards.py`: Dataset card templates
-
-  - `converters/`: OakVar module converters
-    - `longevitymap.py`, `coronary.py`, `drugs.py`, etc.
-    - `common.py`: Shared conversion utilities
-
-- `dataset_cards/`: Markdown templates for Hugging Face dataset cards
-- `tests/`: Unit and integration tests
-
-## Coding Standards
+---
+
+## General Standards (Reusable)
+
+### Coding Standards
 
 - **Type hints**: Mandatory for all Python code.
 - **Pathlib**: Always use for all file paths.
@@ -49,149 +17,44 @@ The package follows Dagster best practices with utilities organized in subpackages:
 - **Pydantic 2**: Mandatory for data classes.
 - **Avoid `__all__`**: Avoid `__init__.py` with `__all__`, as it obscures where things are located.
 
-## Import Guidelines
-
-For new code, use the organized subpackages:
-
-```python
-# Dagster definitions
-from prepare_annotations.definitions import defs
-
-# Standalone API
-from prepare_annotations.pipelines import PreparationPipelines
-
-# Assets
-from prepare_annotations.assets import ensembl_vcf_urls, longevitymap_weights
-
-# Core utilities
-from prepare_annotations.core.io import read_vcf_file, vcf_to_parquet
-from prepare_annotations.core.models import PreparationResult
-from prepare_annotations.core.paths import get_cache_dir, LOGS_DIR
-
-# Downloaders
-from prepare_annotations.downloaders.vcf import download_path, list_paths
-from prepare_annotations.downloaders.genome import download_ensembl_genome
-
-# HuggingFace
-from prepare_annotations.huggingface.uploader import upload_parquet_to_hf
-from prepare_annotations.huggingface.dataset_cards import generate_ensembl_card
-
-# Converters
-from prepare_annotations.converters import convert_longevitymap
-```
-
-## Dagster Guide (Agents)
-
-These pipelines are Dagster-first. Follow these rules to avoid the issues we already hit:
-
-### 1) Use modern API (no legacy CLI)
-
-- Do not use `dagster job execute` or other deprecated CLI for orchestration.
-- Prefer the Python API: `materialize()` for assets, `execute_job()` for non-partitioned jobs.
-- If a CLI is needed, use the Dagster dev server only (`uv run dagster dev -m prepare_annotations.definitions`).
-
-### 2) Dynamic partitions must be explicit
-
-- Use dynamic partitions whenever the upstream file list is external or changing (FTP, HTTP, HF repo, etc.).
-- The discovery asset must register partitions via `DynamicPartitionsDefinition`.
-- Partitioned assets must use `context.partition_key` and must not be run without a partition key.
-
-### 3) Materialize with full asset list + selection
-
-When using `materialize()`:
-
-- Always pass the full asset graph (all upstream assets).
-- Use `selection=["asset_name"]` to run a single asset.
-- Do not pass config for assets not present in the selection.
-
-Example pattern:
-
-- `materialize(assets=all_assets, selection=["download_asset"], partition_key="...")`
-
-### 4) IO manager must resolve partition paths
-
-Partitioned assets need file-level input paths:
-
-- Inputs must resolve to concrete files for each partition key.
-- If an IO manager returns a directory for a partitioned asset, downstream processing will crash.
-
-### 5) Collector must depend on partitioned outputs
-
-Collector assets must declare deps on the partitioned asset to ensure correct lineage and ordering.
+### Dagster Best Practices (Universal)
 
-### 6) Filter temporary outputs before upload
+#### Asset Return Types
 
-Do not upload temp files. Filter out:
+| Asset Returns | IO Manager | Use Case |
+|---------------|------------|----------|
+| `pl.LazyFrame` | `polars_parquet_io_manager` | Small parquet, schema visibility |
+| `Path` | Custom IO manager | Large data, DuckDB joins, file uploads |
+| `dict` | Default | API responses, upload results |

-- files starting with `tmp`
-- files ending with `.tmp.parquet` or `.parquet.tmp`
-- dotfiles
+#### Key Rules
 
-### 7) Concurrency and memory safety
+- **dagster-polars**: Use `PolarsParquetIOManager` for `LazyFrame` assets → automatic schema/row count in UI
+- **Path assets**: Add `"dagster/column_schema": polars_schema_to_table_schema(path)` for schema visibility
+- **Asset checks**: Use `@asset_check` for validation; include via `AssetSelection.checks_for_assets(...)`
+- **Streaming**: Use `lazy_frame.sink_parquet()`, never `.collect().write_parquet()` on large data
+- **DuckDB**: Use for large joins (out-of-core); set `memory_limit` and `temp_directory`
+- **Concurrency**: Use `op_tags={"dagster/concurrency_key": "name"}` to limit parallel execution

-- Use download parallelism (I/O bound), but keep conversion limited.
-- Concurrency should be enforced via Dagster tag limits in `dagster.yaml`.
+#### Dynamic Partitions Pattern
 
-### 8) Dagster home
+1. Create partition def: `PARTS = DynamicPartitionsDefinition(name="files")`
+2. Discovery asset registers partitions: `context.instance.add_dynamic_partitions(PARTS.name, keys)`
+3. Partitioned assets use `partitions_def=PARTS`, access `context.partition_key`
+4. Collector depends on partitioned output via `deps=[partitioned_asset]`, scans filesystem for results

-- `DAGSTER_HOME` is `data/interim/dagster`
-- Always set it for runs so UI and API share the same instance.
+#### Execution
 
-### Primary Dagster Pipelines (Recommended)
+- **Python API only**: `defs.resolve_job_def(name)` + `job.execute_in_process(instance=instance)`
+- **Same DAGSTER_HOME** for UI and execution: `dagster dev -m module.definitions`
+- **All assets in `Definitions(assets=[...])`** for lineage visibility in UI

-- `uv run dagster-ensembl`: Run the full Ensembl pipeline (download, convert, upload).
-- `uv run prepare longevitymap`: Run the LongevityMap pipeline (convert, join with Ensembl, upload).
-- `uv run dagster-ui`: Launch Dagster UI for monitoring and lineage visualization.
+#### Anti-Patterns
 
-### OakVar Module Management
-
-- `uv run modules data --repo dna-seq/just_longevitymap`: Download module data files.
-- `uv run modules clone --repo dna-seq/just_longevitymap`: Clone full module repository.
-- `uv run prepare longevitymap`: Run full Dagster pipeline (convert + Ensembl join + upload).
-- `uv run prepare longevitymap --convert-only`: Convert only (no Ensembl join, no upload).
-
-### Unified Annotation Schema
-
-The module conversion produces three standardized parquet files:
-
-1. **annotations.parquet**: Variant-level facts
-   - Schema: `rsid, module, gene, phenotype, category`
-   - Links variants to genes and phenotype categories
-
-2. **studies.parquet**: Per-study evidence
-   - Schema: `rsid, module, pmid, population, p_value, conclusion, study_design`
-   - Scientific evidence from publications
-
-3. **weights.parquet**: Curator-defined scoring
-   - Schema: `rsid, genotype, module, weight, state, priority, conclusion, curator, method`
-   - Curated weight assignments for variant impact
-   - State: `protective`, `risk`, or `neutral`
-   - Genotype: Normalized (e.g., `CT`, `TT`, `AA`)
-
-### Available Modules
-
-Modules from https://github.com/orgs/dna-seq/repositories:
-- `just_longevitymap`: Longevity-associated variants
-- `just_pathogenic`: Pathogenic variant annotations
-- `just_cancer`: Cancer-associated genes
-- `just_coronary`: Coronary disease variants
-- `just_vo2max`: VO2max-related variants
-- `just_lipidmetabolism`: Lipid metabolism variants
-- `just_prs`: Polygenic risk score data
-- `just_drugs`: Pharmacogenomic data
-- `just_superhuman`: Elite performance genetics
-
-## Deployment
-
-Datasets are typically uploaded to the `just-dna-seq` organization on Hugging Face Hub.
-
-## Testing
-
-### Test Philosophy
-
-- **Integration tests**: Use real data, no mocking unless necessary
-- **Auto-download**: Tests automatically download required data from GitHub
-- **Validation**: Comprehensive checks ensuring data integrity during conversion
+- `dagster job execute` CLI (deprecated)
+- Hardcoded asset names; use `defs.get_all_asset_specs()`
+- Config for unselected assets (validation errors)
+- Suspended jobs holding DuckDB file locks
 
 ### Test Generation Guidelines (Universal)

@@ -265,53 +128,50 @@ assert source_ids == output_ids
 assert valid_states == {"active", "inactive", "pending"}  # from API spec
 ```
 
-### Running Tests
+---
 
-```bash
-# Run all tests (excluding large downloads)
-uv run pytest
+## Project-Specific: prepare-annotations
 
-# Run specific module tests
-uv run pytest tests/test_longevitymap_module.py -v
+### Key Entry Points
 
-# Run with verbose output
-uv run pytest -vvv
-```
+- `src/prepare_annotations/definitions.py`: Main Dagster definitions (assets, jobs, resources)
+- `src/prepare_annotations/cli.py`: Typer CLI - run `uv run prepare --help`
+- `src/prepare_annotations/assets/`: Dagster assets (ensembl.py, modules.py)
+- `src/prepare_annotations/converters/`: OakVar module converters
 
-### Test Fixtures
+### CLI Commands
 
-The `conftest.py` provides shared fixtures for OakVar module testing:
+```bash
+uv run prepare longevitymap                           # Full pipeline: convert + Ensembl join + upload
+uv run prepare longevitymap --convert-only            # Convert only
+uv run dagster-ui                                     # Launch Dagster UI
+uv run modules data --repo dna-seq/just_longevitymap  # Download module data
+```
 
-- `ensure_oakvar_module_data()`: Downloads module data if not present
-- `download_oakvar_module_data()`: Directly downloads from GitHub repositories
+### Unified Annotation Schema
 
-These fixtures are automatically used by test modules to ensure data availability.
+Module conversion produces three standardized parquet files:
 
-### Example: LongevityMap Validation
+1. **annotations.parquet**: `rsid, module, gene, phenotype, category`
+2. **studies.parquet**: `rsid, module, pmid, population, p_value, conclusion, study_design`
+3. **weights.parquet**: `rsid, genotype, module, weight, state, priority, conclusion, curator, method`
+   - State: `protective`, `risk`, or `neutral`
+   - Genotype: List of 2 alleles, alphabetically sorted

-The `test_longevitymap_module.py` validates conversion integrity:
+### Available Modules
 
-1. **Weights Table**
-   - Row counts: parquet ≥ sqlite (due to genotype expansion)
-   - Unique rsid counts match between formats
-   - Weight values preserved (min/max match, all unique values present)
-   - Per-rsid weight sets match (not sums, due to expansion)
+Modules from https://github.com/orgs/dna-seq/repositories:
+- `just_longevitymap`, `just_coronary`, `just_vo2max`, `just_lipidmetabolism`
+- `just_superhuman`, `just_drugs`, `just_pathogenic`, `just_cancer`, `just_prs`
 
-2. **APOE Variants** (Critical longevity markers)
-   - rs7412 (APOE e2): expected protective weight set `{0.5, 1.0}`
-   - rs429358 (APOE e4): expected risk weight set `{-0.5, -1.0}`
+### Testing
 
-3. **Schema Transformations**
-   - Genotype format: list of 2 alleles, alphabetically sorted
-   - State values: valid enum (`protective`, `risk`, `alt`, `ref`)
-   - Module column: all rows have `"longevitymap"`
+```bash
+uv run pytest                                    # Run all tests
+uv run pytest tests/test_longevitymap_module.py  # Specific module
+uv run pytest -v --tb=short                      # Verbose with short traceback
+```
 
-4. **Studies & Annotations**
-   - All PMIDs: set equality between sqlite and parquet
-   - Categories: parquet subset of sqlite categories
-   - Populations: parquet subset of sqlite populations
+### Deployment
 
-Tests automatically:
-1. Download SQLite from `dna-seq/just_longevitymap` if missing
-2. Convert to parquet if needed
-3. Validate data integrity via source comparison (not hardcoded counts)
+Datasets are uploaded to the `just-dna-seq` organization on HuggingFace Hub.
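The testing philosophy in this diff (validate against the source, not hardcoded counts) can be sketched with the standard-library `sqlite3` module; the table and column names below are illustrative assumptions, not the real module schema:

```python
# Sketch of source-comparison validation: the converted output must contain
# exactly the rsids present in the source SQLite database. The `variants`
# table and `rsid` column are illustrative assumptions.
import sqlite3


def source_rsids(db_path: str) -> set[str]:
    """Collect rsids from the source SQLite (the ground truth)."""
    with sqlite3.connect(db_path) as con:
        return {row[0] for row in con.execute("SELECT DISTINCT rsid FROM variants")}


def validate_conversion(db_path: str, converted_rsids: set[str]) -> None:
    """Assert set equality with the source, never a hardcoded count."""
    src = source_rsids(db_path)
    assert converted_rsids == src, (
        f"missing: {src - converted_rsids}, extra: {converted_rsids - src}"
    )
```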

README.md

Lines changed: 18 additions & 1 deletion
@@ -67,9 +67,26 @@ The `modules` command manages OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories)
 uv run modules data --repo dna-seq/just_longevitymap
 
 # Convert module data to unified schema
-uv run modules convert-longevitymap
+uv run prepare longevitymap --convert-only
 ```
 
+**Available modules**: `just_longevitymap`, `just_coronary`, `just_vo2max`, `just_lipidmetabolism`, `just_superhuman`, `just_drugs`, `just_pathogenic`, `just_cancer`, `just_prs`
+
+### Unified Annotation Schema
+
+Module conversion produces three standardized parquet files:
+
+| File | Schema |
+|------|--------|
+| **annotations.parquet** | `rsid, module, gene, phenotype, category` |
+| **studies.parquet** | `rsid, module, pmid, population, p_value, conclusion, study_design` |
+| **weights.parquet** | `rsid, genotype, module, weight, state, priority, conclusion, curator, method` |
+
+- **State**: `protective`, `risk`, or `neutral`
+- **Genotype**: List of 2 alleles, alphabetically sorted
+
+Converted datasets are uploaded to the [`just-dna-seq`](https://huggingface.co/just-dna-seq) organization on HuggingFace Hub.
+
 ## Package Structure
 
 The package follows Dagster best practices with utilities organized in subpackages:
