Skip to content

Latest commit

 

History

History
186 lines (150 loc) · 14.5 KB

File metadata and controls

186 lines (150 loc) · 14.5 KB

2025-10-19, 05:30 : System prep

  • Verified repo structure and key components of the TRM codebase.
  • Installed PyYAML, hydra-core, argdantic, and adam-atan2-pytorch into trm_env.
  • Adjusted pretrain.py to use AdamAtan2 with a non-zero LR and recorded the dependency in requirements.txt.

2025-10-19, 05:45 : ARC dataset work

  • Generated the full ARC dataset (arc1concept-aug-1000) to validate augmentation behaviour.
  • Added dataset/__init__.py and produced data/arc1concept-mini (800 puzzles, 801 IDs) for sanity checks.

2025-10-19, 06:00 : Training sanity checks (RTX 3050, 4 GB)

  • Ran the tiny TRM config with DISABLE_COMPILE=1, global_batch_size=4, hidden_size=128, num_heads=2, expansion=2, L_layers=1, L_cycles=2, H_cycles=1.
  • Confirmed the loop trains end-to-end, saving checkpoints and logging to WandB (debug_run_tiny).

2025-10-19, 06:10 : Next focus (TinyVariant POC)

  • Ready to implement ClinVar ingestion and stand up the VariantTRM pipeline per the proof-of-concept plan.

2025-10-19, 06:30 : ClinVar dataset prep

  • Added VERSION (0.1.0) and pulled pandas into requirements.txt for preprocessing work.
  • Authored tools/prepare_clinvar_dataset.py to filter GRCh38 germline missense SNVs with high-confidence labels.
  • Downloaded variant_summary.txt.gz (378 MB) via data/download_data.sh.
  • Produced a balanced 5 k-sample dataset under data/clinvar/processed with versioned stats.

2025-10-19, 06:45 : ClinVar TRM dataset

  • Tokenized the balanced variants into 13-token sequences covering gene, chromosome, alleles, amino-acid change, and position digits.
  • Emitted train/test splits in data/clinvar/processed/clinvar_trm along with vocab.json and identifiers.json.
  • Verified splits at 4 k train / 1 k test with label tokens LABEL_BENIGN (3) and LABEL_PATHOGENIC (4).

2025-10-19, 07:00 : ClinVar training setup

  • Added cfg_clinvar.yaml and the variant_trm arch to point pretrain.py at the ClinVar dataset.
  • Implemented VariantClassificationHead to supervise the final token via cross entropy on the binary labels.
  • Attempted an offline smoke run; GPU wasn’t exposed inside the sandbox at the time.

2025-10-22, 19:30 : ClinVar smoke run on GPU

  • Ran the 1-epoch cfg_clinvar smoke test locally on Pop!_OS (RTX 3050) with offline WandB logging.
  • Training completed successfully; checkpoint and logs saved under checkpoints/Clinvar_trm_ACT-torch/clinvar_smoke.
  • Metrics show the wiring works, but longer training and baselines are still pending for the full POC.

2025-10-22, 20:00 : Next steps (10k dataset & baselines)

  • To scale the data, rerun tools/prepare_clinvar_dataset.py --max-per-class 5000 followed by tools/build_clinvar_trm_dataset.py (yields 10 k balanced variants).
  • Plan: train with higher epochs/batch size using cfg_clinvar overrides and compare against a logistic regression baseline.
  • ClinVar evaluator and evaluate_clinvar_checkpoint.py will track accuracy/AUC for every run.

2025-10-22, 20:10 : Training configuration for extended run

  • Added cfg_clinvar_long.yaml (50 epochs, LR 5e-4, batch 256) for 10 k experiments.
  • Command: WANDB_MODE=offline DISABLE_COMPILE=1 python pretrain.py --config-name cfg_clinvar_long +run_name=clinvar_long
  • Evaluator logs ClinVar/accuracy and ClinVar/roc_auc every 5 epochs; checkpoints saved after each evaluation.

2025-10-22, 20:20 : Logistic regression baseline helper

  • Added tools/train_baseline_logreg.py (one-hot features + standardized position) as a quick comparison.
  • Command: python tools/train_baseline_logreg.py --input data/clinvar/processed/clinvar_missense_balanced.tsv
  • Outputs accuracy/ROC AUC to stdout (optional --output saves JSON); split matches the TRM preprocessing seed.

2025-10-22, 20:30 : First ClinVar training + baseline results

  • Ran cfg_clinvar_long for 50 epochs (10 k data). Final evaluation: accuracy ≈ 0.7339, ROC AUC ≈ 0.8097.
  • Logistic regression baseline (tools/train_baseline_logreg.py) achieved accuracy 0.7660, ROC AUC 0.8435.
  • Baseline currently outperforms VariantTRM; next steps: feature engineering, architecture tweaks, or hyperparameter search.

2025-10-22, 21:15 : Added review status and ClinSig tokens

  • Updated tools/build_clinvar_trm_dataset.py to insert review-status and clinical-significance tokens (sequence length now 15).
  • Logistic baseline now one-hot encodes those fields as well.
  • Reminder: regenerate data via
    python tools/prepare_clinvar_dataset.py --max-per-class 5000
    python tools/build_clinvar_trm_dataset.py
    before rerunning training/baseline.

2025-10-22, 22:20 : Results with enriched features

  • cfg_clinvar_long (50 epochs, 10 k data): accuracy 0.9165, ROC AUC 0.9771.
  • Logistic regression baseline: accuracy 0.9690, ROC AUC 0.9849 (feature leakage suspected).
  • CPU evaluation script now supports loading checkpoints without CUDA.

2025-10-22, 22:40 : Removed ClinSig-derived leakage

  • Stripped all ClinicalSignificance-based tokens from dataset builder and baseline.
  • Logistic baseline (8k/2k split) now reports accuracy ≈0.746, ROC AUC ≈0.838.
  • Request: regenerate data and rerun cfg_clinvar_long for fresh TRM metrics.

2025-10-22, 21:30 : TRM architecture alignment

  • Re-read Samsung SAIT’s Tiny Recursive Models documentation to confirm architecture parity.
  • Noted in README that we inherit TRM’s recursive reasoning core and only swap the input tokens/labels.
  • Future changes: respect their embedding/halting conventions when adding new features.

2025-10-22, 19:50 : ClinVar evaluation utilities

  • Added tools/evaluate_clinvar_checkpoint.py to score checkpoints with accuracy and ROC AUC on the ClinVar test split.
  • Introduced evaluators/clinvar.py and wired it into cfg_clinvar.yaml so evaluation metrics log automatically during training.
  • Updated the README Quickstart with a sample evaluation command. 2025-10-22, 23:30 : Amino-acid property features
  • Added AA property buckets (nonpolar/charged/etc.) and change-class tokens to the TRM dataset.
  • Logistic baseline (80k/20k split) after rebuild: accuracy 0.823, ROC AUC 0.907.
  • Reran cfg_clinvar_long on the 50k-per-class dataset: accuracy 0.841, ROC AUC 0.915.
  • Baseline remains at 0.823 / 0.907 with the enriched features.

2025-10-23, 00:30 : Added no-leakage test

  • Created tests/test_clinvar_dataset.py to ensure ClinSig tokens never appear in the TRM vocab and sequence length matches expectations.
  • README now documents how to run the quick pytest check.

2025-10-23, 00:45 : Hyperparameter sweep config

  • Added config/clinvar_sweep.yaml to drive Hydra multiruns over hidden size, L layers, L cycles, and learning rate.
  • README documents the sweep command (python pretrain.py --config-name clinvar_sweep --multirun).

2025-10-23, 19:00 : Integrated sweep analyzer

  • Updated config/clinvar_sweep.yaml to rely on Hydra override directories.
  • Added scripts/analyze_sweep.py and documented the workflow (run sweep + analyzer).

2025-10-23, 19:20 : Added early stopping support

  • PretrainConfig now accepts early_stop_patience, early_stop_metric, and early_stop_delta.
  • Training loop stops when the chosen metric fails to improve beyond early_stop_delta for patience evaluations.
  • README documents usage via +early_stop_patience=....

2025-10-24, 09:10 : ClinVar preprocessing enrichment

  • Expanded tools/prepare_clinvar_dataset.py to retain phenotype terms/IDs, submitter counts, ISO-normalized review dates, and derived provenance buckets.
  • Normalized phenotype strings (dedupe, placeholder filtering) and tracked coverage stats in clinvar_missense_balanced_stats.json.
  • Verified pipeline with a 50-per-class smoke build under data/clinvar/processed/tmp_balanced.

2025-10-24, 09:40 : Phenotype context in TRM + baselines

  • Updated tools/build_clinvar_trm_dataset.py to emit phenotype/source/submitter/evaluation tokens (sequence length 24) and persist feature metadata in dataset manifests.
  • Synced logistic regression baseline with the new context features via shared preprocessing helpers.
  • Extended pytest coverage to assert phenotype tokens are present and leak-free.

2025-10-24, 10:15 : Baseline runner ergonomics

  • Added tools/__init__.py and path bootstrapping so python tools/train_baseline_logreg.py works without manual PYTHONPATH tweaks.
  • Successfully reran the logistic regression baseline on the refreshed 5k/5k dataset (accuracy ≈0.861, ROC AUC ≈0.930).

2025-10-24, 17:25 : VariantTRM run with phenotype context

  • Trained cfg_clinvar_long on the 5k/5k dataset with +early_stop_patience=5 (offline WandB run clinvar_long_5k).
  • Final evaluation: accuracy ≈0.819, ROC AUC ≈0.884; checkpoints saved under checkpoints/Clinvar_trm-ACT-torch/clinvar_long_5k/.
  • Noted the performance gap vs. the logistic baseline, highlighting the need for further feature/model work.

2025-10-24, 17:45 : Expanded ClinVar dataset tests

  • Augmented tests/test_clinvar_dataset.py to verify phenotype/provenance buckets, feature metadata, and updated sequence length (now 25).
  • All ClinVar tests pass against the refreshed 5k/5k dataset (python -m pytest tests/test_clinvar_dataset.py → 5 passed).

2025-10-24, 18:10 : Added tertiary phenotype slots

  • Extended tools/build_clinvar_trm_dataset.py to encode up to three phenotype tokens (sequence length 25) and preserved slot metadata.
  • Updated the logistic baseline, docs, and tests to stay in sync; new baseline run: accuracy ≈0.860, ROC AUC ≈0.930.

2025-10-24, 18:55 : ClinVar dataset rebuild + run recap

  • Regenerated the balanced ClinVar table (5k per class) and rebuilt the TRM dataset so phenotype/provenance fields populate all tokens.
  • Trained cfg_clinvar_long with +early_stop_patience=5 (clinvar_long_20251024-175518); early stopping fired at step 1248, final eval accuracy 0.819, ROC AUC 0.8859.
  • Logistic baseline rerun on the refreshed dataset: accuracy 0.860, ROC AUC 0.9300 (8k/2k split).
  • Evaluated step_1248 with both CPU and CUDA modes after fixing tools/evaluate_clinvar_checkpoint.py to create carries on-device; scores match training (accuracy 0.8195, ROC AUC 0.8858).

2025-10-24, 19:20 : Plotting helpers for documentation

  • Extended tools/evaluate_clinvar_checkpoint.py with --save-preds so per-example scores can be captured alongside summary metrics.
  • Added scripts/plot_eval_comparison.py to visualize TinyVariant vs. baseline accuracy/AUC from saved JSON metrics.
  • Added scripts/plot_roc_curve.py to render ROC curves from prediction JSONL files (pairs with the new evaluation flag).
  • README Quickstart now documents the evaluation output flags and plotting workflow for future write-ups.

2025-10-24, 19:40 : Feature ablation toggles + scaling notes

  • Added --phenotype-ablation and --provenance-ablation flags to tools/build_clinvar_trm_dataset.py so we can rebuild datasets with phenotype tokens or provenance buckets zeroed out.
  • Feature metadata now records the ablation state; downstream baselines/TRM runs pick up the altered dataset automatically.
  • README/Instructions document how to run ablation rebuilds and how to scale preprocessing to 50k+ examples via --max-per-class.

2025-10-24, 21:05 : 50k-per-class ClinVar run results

  • Rebuilt the balanced dataset with --max-per-class 50000 (100k examples total; ~61k carry phenotype annotations) and refreshed the TRM arrays.
  • Logistic baseline on the 80k/20k split reached accuracy 0.8955, ROC AUC 0.9591.
  • Trained cfg_clinvar_long with early stopping (run clinvar_long_20251024-194100); stopped at step 9372 with eval accuracy 0.8828 and ROC AUC 0.9450.
  • Updated tools/evaluate_clinvar_checkpoint.py to move carries to the active device so GPU evaluation succeeds; metrics for step_9372 are stored in outputs/clinvar_trm_metrics.json and the ROC curve in docs/figures/clinvar_trm_roc.png.

2025-10-24, 21:55 : Phenotype ablation baseline

  • Rebuilt TRM dataset with --phenotype-ablation (phenotype slots forced to <none>).
  • TRM run clinvar_long_phenotype_ablation_20251024-215110 (W&B disabled, single-worker loader) early-stopped at step 14058: eval accuracy 0.8741, ROC AUC 0.9417.
  • Evaluated checkpoint step_14058 on CPU (GPU evaluation timing out) → metrics saved to outputs/clinvar_long_phenotype_ablation_20251024-215110_metrics.json, predictions in _predictions.jsonl.

2025-10-25, 07:55 : Provenance ablation baseline

  • Rebuilt TRM dataset with --provenance-ablation (submitter/evaluation buckets set to the lowest bucket).
  • TRM run clinvar_long_provenance_ablation_20251025-074717 (WANDB offline, single-worker loader) completed full schedule; eval accuracy 0.8755, ROC AUC 0.9438.
  • Evaluated checkpoint step_15620 on CPU → metrics in outputs/clinvar_long_provenance_ablation_20251025-074717_metrics.json, predictions in _predictions.jsonl.

2025-10-26, 05:00 : ClinVar 50k hyperparameter sweep

  • Hydra sweep (arch.hidden_size ∈ {256, 384}, L_layers ∈ {2, 3}, L_cycles ∈ {2, 3}, lr ∈ {5e-4, 3e-4}) completed; per-run metrics regenerated via scripts/evaluate_sweep_runs.py.
  • python scripts/analyze_sweep.py highlights best config hidden_size=384, L_layers=2, L_cycles=2, lr=3e-4 with ROC AUC 0.9513 and accuracy 0.8867 (checkpoint step_9372).
  • Sweep summary saved to sweep_summary.csv; heatmap generation skipped (pandas/seaborn unavailable in sandbox).

2025-10-26, 15:30 : Best-config TRM run (384 hidden, 2×2 cycles, lr=3e-4)

  • Run clinvar_long_best384_20251026-140402 reached ROC AUC 0.9513, accuracy 0.8867 on the 20k test split; parameter count ≈17.0M.
  • Artifacts: outputs/clinvar_long_best384_20251026-140402_{metrics,predictions}.jsonl, plots refreshed under docs/figures/.

2025-10-25, 17:45 : Sweep preparation

  • Added explicit run_name / checkpoint_path templating to config/clinvar_sweep.yaml so each Hydra job writes to its own folder (checkpoints/Clinvar_trm-ACT-torch/<override_dirname>).
  • README now documents the sweep command with WANDB_DISABLED=true and TINYVARIANT_NUM_WORKERS=0 for smoother offline runs.

2025-10-26, 17:10 : Documentation wrap-up & hypothesis outcome

  • Updated README.md with a key results table, clarified that the logistic regression baseline (ROC AUC 0.959 / accuracy 0.896) remains ahead of the best TRM run (ROC AUC 0.951 / accuracy 0.887, ~17 M params, hidden_size 384, 2×2 cycles).
  • Added context that phenotype/provenance ablations barely dent performance, reinforcing that the recursive halting core adds little for ClinVar missense classification.
  • Closed out the README with next-step ideas (write-up, error slicing, new hypotheses) and captured the negative result so this proof of concept can be considered complete.