
Training: DataLoader, feature cache, AMP, lazy package exports#3

Open
cropsgg wants to merge 1 commit into pr/02-clinvar-data-loader from pr/03-training-pipeline

@cropsgg cropsgg commented Apr 15, 2026

Summary

Updates training/inference plumbing so pipelines consume AppSettings: batched DNABERT encoding, optional on-disk feature cache, DataLoader-based training, optional AMP, and lazy imports in bloom_dnabert/__init__.py so importing light modules does not require every heavy dependency at import time.

Base branch: pr/02-clinvar-data-loader (stacked — merge after #2 or retarget once earlier PRs land).
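
As a concrete illustration, here is a minimal sketch of the settings-driven DataLoader + optional-AMP loop. The field values (batch size, workers, AMP flag) and the toy model/data are stand-ins for AppSettings and the real precomputed features; the actual wiring lives in classifier.py.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-ins for AppSettings fields (names/values are assumptions).
batch_size, num_workers = 32, 0
use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

# Toy tensors standing in for precomputed Bloom/DNABERT features.
features = torch.randn(256, 768)
labels = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=batch_size, shuffle=True, num_workers=num_workers)

model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # passthrough when AMP is off
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad(set_to_none=True)
    # Mixed precision only when enabled; otherwise runs in full precision.
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()  # scaled backward guards against fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```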

What changed

  • bloom_dnabert/feature_cache.py — Fingerprinted cache for precomputed Bloom/DNABERT features when data.feature_cache_dir is set (see the sketch after this list).
  • bloom_dnabert/classifier.py — HybridClassifierPipeline and BloomGuidedPipeline read batch sizes, workers, AMP, max length, etc. from settings.
  • bloom_dnabert/dnabert_wrapper.py — Batched encoding and token-level outputs aligned with the training loops.
  • bloom_dnabert/bloom_filter.py — Aligns with the config paths and seed-loading APIs used by the app/CLI.
  • bloom_dnabert/__init__.py — Lazy __getattr__ exports for torch/transformers-heavy submodules.
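
The cache sketch referenced above. It assumes a SHA-256 fingerprint over the input sequences plus the encoding config, with one .pt file per entry; the real feature_cache.py may key entries differently.

```python
import hashlib
import json
from pathlib import Path

import torch

def _fingerprint(sequences, config):
    # Key entries by the input sequences plus the encoding config, so a
    # change to either one invalidates the cached features.
    payload = json.dumps({"seqs": list(sequences), "cfg": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_features(sequences, config, cache_dir, compute_fn):
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{_fingerprint(sequences, config)}.pt"
    if path.exists():
        return torch.load(path)        # cache hit: skip DNABERT encoding
    features = compute_fn(sequences)   # cache miss: compute once, persist
    torch.save(features, path)
    return features
```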

Why this PR exists

Scales experimentation beyond notebook-style loops: reproducible batching via DataLoader, less redundant DNABERT work thanks to the feature cache, and a cleaner import graph for tests and partial installs.
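
The import-graph cleanup uses module-level __getattr__ (PEP 562). A minimal sketch with assumed symbol names (FeatureCache is hypothetical; the real __init__.py maps its own exports):

```python
# bloom_dnabert/__init__.py (sketch)
import importlib

_LAZY = {
    "HybridClassifierPipeline": ".classifier",
    "BloomGuidedPipeline": ".classifier",
    "FeatureCache": ".feature_cache",  # assumed class name
}

def __getattr__(name):
    # Resolve heavy torch/transformers submodules on first attribute access,
    # so `import bloom_dnabert` stays cheap for light modules and tests.
    if name in _LAZY:
        module = importlib.import_module(_LAZY[name], __name__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```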

Dependencies

Stacked on #2 (pr/02-clinvar-data-loader); merge after it lands, as noted above.

How to verify

Run the classifier/bloom tests in an environment that has torch and the dependencies from requirements.txt installed (full suite: pytest tests/).
