2025-10-19, 05:30 : System prep
- Verified repo structure and key components of the TRM codebase.
- Installed PyYAML, hydra-core, argdantic, and adam-atan2-pytorch into `trm_env`.
- Adjusted `pretrain.py` to use `AdamAtan2` with a non-zero LR and recorded the dependency in `requirements.txt`.
2025-10-19, 05:45 : ARC dataset work
- Generated the full ARC dataset (arc1concept-aug-1000) to validate augmentation behaviour.
- Added `dataset/__init__.py` and produced `data/arc1concept-mini` (800 puzzles, 801 IDs) for sanity checks.
2025-10-19, 06:00 : Training sanity checks (RTX 3050, 4 GB)
- Ran the tiny TRM config with `DISABLE_COMPILE=1`, `global_batch_size=4`, `hidden_size=128`, `num_heads=2`, `expansion=2`, `L_layers=1`, `L_cycles=2`, `H_cycles=1`.
- Confirmed the loop trains end-to-end, saving checkpoints and logging to WandB (`debug_run_tiny`).
2025-10-19, 06:10 : Next focus (TinyVariant POC)
- Ready to implement ClinVar ingestion and stand up the VariantTRM pipeline per the proof-of-concept plan.
2025-10-19, 06:30 : ClinVar dataset prep
- Added `VERSION` (0.1.0) and pulled pandas into `requirements.txt` for preprocessing work.
- Authored `tools/prepare_clinvar_dataset.py` to filter GRCh38 germline missense SNVs with high-confidence labels.
- Downloaded `variant_summary.txt.gz` (378 MB) via `data/download_data.sh`.
- Produced a balanced 5k-sample dataset under `data/clinvar/processed` with versioned stats.
2025-10-19, 06:45 : ClinVar TRM dataset
- Tokenized the balanced variants into 13-token sequences covering gene, chromosome, alleles, amino-acid change, and position digits.
- Emitted train/test splits in `data/clinvar/processed/clinvar_trm` along with `vocab.json` and `identifiers.json`.
- Verified splits at 4k train / 1k test with label tokens `LABEL_BENIGN` (3) and `LABEL_PATHOGENIC` (4).
2025-10-19, 07:00 : ClinVar training setup
- Added `cfg_clinvar.yaml` and the `variant_trm` arch to point `pretrain.py` at the ClinVar dataset.
- Implemented `VariantClassificationHead` to supervise the final token via cross entropy on the binary labels.
- Attempted an offline smoke run; the GPU wasn't exposed inside the sandbox at the time.
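A minimal sketch of what a head like `VariantClassificationHead` could look like, assuming the core emits per-token hidden states; the actual module in this repo may differ in shape handling and loss bookkeeping:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariantClassificationHead(nn.Module):
    """Sketch: project the final token's hidden state to two logits
    (benign vs. pathogenic) and supervise with cross entropy."""
    def __init__(self, hidden_size: int, num_classes: int = 2):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor, labels: torch.Tensor):
        # hidden_states: [batch, seq_len, hidden_size]; supervise the last position.
        logits = self.proj(hidden_states[:, -1, :])
        loss = F.cross_entropy(logits, labels)
        return logits, loss

head = VariantClassificationHead(hidden_size=128)
h = torch.randn(4, 13, 128)          # batch of 4 tokenized variants
labels = torch.tensor([0, 1, 1, 0])  # 0 = benign, 1 = pathogenic
logits, loss = head(h, labels)
```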
2025-10-22, 19:30 : ClinVar smoke run on GPU
- Ran the 1-epoch `cfg_clinvar` smoke test locally on Pop!_OS (RTX 3050) with offline WandB logging.
- Training completed successfully; checkpoint and logs saved under `checkpoints/Clinvar_trm_ACT-torch/clinvar_smoke`.
- Metrics show the wiring works, but longer training and baselines are still pending for the full POC.
2025-10-22, 20:00 : Next steps (10k dataset & baselines)
- To scale the data, rerun `tools/prepare_clinvar_dataset.py --max-per-class 5000` followed by `tools/build_clinvar_trm_dataset.py` (yields 10k balanced variants).
- Plan: train with higher epochs/batch size using `cfg_clinvar` overrides and compare against a logistic regression baseline.
- The ClinVar evaluator and `evaluate_clinvar_checkpoint.py` will track accuracy/AUC for every run.
2025-10-22, 20:10 : Training configuration for extended run
- Added `cfg_clinvar_long.yaml` (50 epochs, LR 5e-4, batch 256) for 10k experiments.
- Command: `WANDB_MODE=offline DISABLE_COMPILE=1 python pretrain.py --config-name cfg_clinvar_long +run_name=clinvar_long`
- Evaluator logs `ClinVar/accuracy` and `ClinVar/roc_auc` every 5 epochs; checkpoints are saved after each evaluation.
2025-10-22, 20:20 : Logistic regression baseline helper
- Added `tools/train_baseline_logreg.py` (one-hot features + standardized position) as a quick comparison.
- Command: `python tools/train_baseline_logreg.py --input data/clinvar/processed/clinvar_missense_balanced.tsv`
- Outputs accuracy/ROC AUC to stdout (optional `--output` saves JSON); the split matches the TRM preprocessing seed.
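The baseline recipe (one-hot categorical features plus a standardized numeric position) can be sketched with scikit-learn on toy data; the column names and synthetic label below are illustrative, not the real TSV schema:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "ref": rng.choice(list("ACGT"), n),
    "alt": rng.choice(list("ACGT"), n),
    "position": rng.integers(1, 2_000_000, n),
})
# Toy label correlated with the ref allele, just to exercise the pipeline.
y = (df["ref"] == "G").astype(int)

model = Pipeline([
    ("features", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["ref", "alt"]),
        ("scale", StandardScaler(), ["position"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.2, random_state=42)
model.fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Fixing the `random_state` of the split is what keeps such a baseline comparable to the TRM preprocessing seed.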
2025-10-22, 20:30 : First ClinVar training + baseline results
- Ran `cfg_clinvar_long` for 50 epochs (10k data). Final evaluation: accuracy ≈ 0.7339, ROC AUC ≈ 0.8097.
- Logistic regression baseline (`tools/train_baseline_logreg.py`) achieved accuracy 0.7660, ROC AUC 0.8435.
- The baseline currently outperforms VariantTRM; next steps: feature engineering, architecture tweaks, or hyperparameter search.
2025-10-22, 21:15 : Added review status and ClinSig tokens
- Updated `tools/build_clinvar_trm_dataset.py` to insert review-status and clinical-significance tokens (sequence length now 15).
- The logistic baseline now one-hot encodes those fields as well.
- Reminder: regenerate data via `python tools/prepare_clinvar_dataset.py --max-per-class 5000` and `python tools/build_clinvar_trm_dataset.py` before rerunning training/baseline.
2025-10-22, 22:20 : Results with enriched features
- `cfg_clinvar_long` (50 epochs, 10k data): accuracy 0.9165, ROC AUC 0.9771.
- Logistic regression baseline: accuracy 0.9690, ROC AUC 0.9849 (feature leakage suspected).
- CPU evaluation script now supports loading checkpoints without CUDA.
2025-10-22, 22:40 : Removed ClinSig-derived leakage
- Stripped all ClinicalSignificance-based tokens from dataset builder and baseline.
- Logistic baseline (8k/2k split) now reports accuracy ≈0.746, ROC AUC ≈0.838.
- Request: regenerate data and rerun `cfg_clinvar_long` for fresh TRM metrics.
2025-10-22, 21:30 : TRM architecture alignment
- Re-read Samsung SAIT’s Tiny Recursive Models documentation to confirm architecture parity.
- Noted in README that we inherit TRM’s recursive reasoning core and only swap the input tokens/labels.
- Future changes: respect their embedding/halting conventions when adding new features.
2025-10-22, 19:50 : ClinVar evaluation utilities
- Added `tools/evaluate_clinvar_checkpoint.py` to score checkpoints with accuracy and ROC AUC on the ClinVar test split.
- Introduced `evaluators/clinvar.py` and wired it into `cfg_clinvar.yaml` so evaluation metrics log automatically during training.
- Updated the README Quickstart with a sample evaluation command.
2025-10-22, 23:30 : Amino-acid property features
- Added AA property buckets (nonpolar/charged/etc.) and change-class tokens to the TRM dataset.
- Logistic baseline (80k/20k split) after rebuild: accuracy 0.823, ROC AUC 0.907.
- Reran `cfg_clinvar_long` on the 50k-per-class dataset: accuracy 0.841, ROC AUC 0.915.
- The baseline remains at 0.823 / 0.907 with the enriched features.
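A plain-Python sketch of the property bucketing and change-class tokens mentioned above; the bucket membership and token spelling are assumptions, not the repo's exact tables:

```python
# Hypothetical amino-acid property buckets (one-letter codes).
AA_PROPERTY = {
    **{aa: "nonpolar" for aa in "AVLIMFWPG"},  # nonpolar / hydrophobic
    **{aa: "polar" for aa in "STYNQC"},        # polar uncharged
    **{aa: "positive" for aa in "KRH"},        # positively charged
    **{aa: "negative" for aa in "DE"},         # negatively charged
}

def change_class(aa_ref: str, aa_alt: str) -> str:
    """Token describing whether a substitution crosses property buckets."""
    a, b = AA_PROPERTY[aa_ref], AA_PROPERTY[aa_alt]
    return "CHANGE_CONSERVED" if a == b else f"CHANGE_{a.upper()}_TO_{b.upper()}"

# Arg -> Gln crosses from a charged to a polar bucket.
assert change_class("R", "Q") == "CHANGE_POSITIVE_TO_POLAR"
```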
2025-10-23, 00:30 : Added no-leakage test
- Created `tests/test_clinvar_dataset.py` to ensure ClinSig tokens never appear in the TRM vocab and the sequence length matches expectations.
- The README now documents how to run the quick pytest check.
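The leakage check can be sketched as a small assertion helper; the real `tests/test_clinvar_dataset.py` loads the generated `vocab.json` instead of the inline dicts used here:

```python
def check_no_clinsig_leakage(vocab: dict, seq_len: int, expected_len: int):
    """Fail if any label-derived (ClinicalSignificance) token leaked into the vocab,
    or if the tokenized sequence length drifted from the expected layout."""
    leaked = [t for t in vocab if "CLINSIG" in t.upper()]
    assert not leaked, f"label-derived tokens in vocab: {leaked}"
    assert seq_len == expected_len, f"sequence length drifted: {seq_len}"

# In the real test, the vocab comes from data/clinvar/processed/clinvar_trm/vocab.json.
check_no_clinsig_leakage({"GENE_BRCA1": 5, "CHR_17": 6}, seq_len=15, expected_len=15)
```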
2025-10-23, 00:45 : Hyperparameter sweep config
- Added `config/clinvar_sweep.yaml` to drive Hydra multiruns over hidden size, L layers, L cycles, and learning rate.
- README documents the sweep command (`python pretrain.py --config-name clinvar_sweep --multirun`).
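For reference, one plausible shape for such a sweep config using Hydra's basic sweeper; the keys below are assumptions, not copied from the repo:

```yaml
# Hypothetical sketch of config/clinvar_sweep.yaml.
defaults:
  - cfg_clinvar

hydra:
  sweeper:
    params:
      arch.hidden_size: 256,384
      arch.L_layers: 2,3
      arch.L_cycles: 2,3
      lr: 5e-4,3e-4
```

With `--multirun`, Hydra expands the comma-separated `params` into the Cartesian product of override combinations, one job per combination.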
2025-10-23, 19:00 : Integrated sweep analyzer
- Updated `config/clinvar_sweep.yaml` to rely on Hydra override directories.
- Added `scripts/analyze_sweep.py` and documented the workflow (run sweep + analyzer).
2025-10-23, 19:20 : Added early stopping support
- `PretrainConfig` now accepts `early_stop_patience`, `early_stop_metric`, and `early_stop_delta`.
- The training loop stops when the chosen metric fails to improve beyond `early_stop_delta` for `patience` evaluations.
- README documents usage via `+early_stop_patience=...`.
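The stopping rule above can be sketched as a small tracker, assuming a higher-is-better metric like ROC AUC; the in-repo logic may differ in tie handling:

```python
class EarlyStopper:
    """Stop after `patience` evaluations without improvement > delta."""
    def __init__(self, patience: int, delta: float = 0.0):
        self.patience, self.delta = patience, delta
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, metric: float) -> bool:
        if metric > self.best + self.delta:
            self.best = metric       # meaningful improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1      # stagnant evaluation
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2, delta=0.001)
history = [0.80, 0.85, 0.8505, 0.8507]   # e.g. ROC AUC per evaluation
stops = [stopper.should_stop(m) for m in history]
assert stops == [False, False, False, True]
```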
2025-10-24, 09:10 : ClinVar preprocessing enrichment
- Expanded `tools/prepare_clinvar_dataset.py` to retain phenotype terms/IDs, submitter counts, ISO-normalized review dates, and derived provenance buckets.
- Normalized phenotype strings (dedupe, placeholder filtering) and tracked coverage stats in `clinvar_missense_balanced_stats.json`.
- Verified the pipeline with a 50-per-class smoke build under `data/clinvar/processed/tmp_balanced`.
2025-10-24, 09:40 : Phenotype context in TRM + baselines
- Updated `tools/build_clinvar_trm_dataset.py` to emit phenotype/source/submitter/evaluation tokens (sequence length 24) and persist feature metadata in dataset manifests.
- Synced the logistic regression baseline with the new context features via shared preprocessing helpers.
- Extended pytest coverage to assert phenotype tokens are present and leak-free.
2025-10-24, 10:15 : Baseline runner ergonomics
- Added `tools/__init__.py` and path bootstrapping so `python tools/train_baseline_logreg.py` works without manual `PYTHONPATH` tweaks.
- Successfully reran the logistic regression baseline on the refreshed 5k/5k dataset (accuracy ≈0.861, ROC AUC ≈0.930).
2025-10-24, 17:25 : VariantTRM run with phenotype context
- Trained `cfg_clinvar_long` on the 5k/5k dataset with `+early_stop_patience=5` (offline WandB run `clinvar_long_5k`).
- Final evaluation: accuracy ≈0.819, ROC AUC ≈0.884; checkpoints saved under `checkpoints/Clinvar_trm-ACT-torch/clinvar_long_5k/`.
- Noted the performance gap vs. the logistic baseline, highlighting the need for further feature/model work.
2025-10-24, 17:45 : Expanded ClinVar dataset tests
- Augmented `tests/test_clinvar_dataset.py` to verify phenotype/provenance buckets, feature metadata, and the updated sequence length (now 25).
- All ClinVar tests pass against the refreshed 5k/5k dataset (`python -m pytest tests/test_clinvar_dataset.py` → 5 passed).
2025-10-24, 18:10 : Added tertiary phenotype slots
- Extended `tools/build_clinvar_trm_dataset.py` to encode up to three phenotype tokens (sequence length 25) and preserved slot metadata.
- Updated the logistic baseline, docs, and tests to stay in sync; new baseline run: accuracy ≈0.860, ROC AUC ≈0.930.
2025-10-24, 18:55 : ClinVar dataset rebuild + run recap
- Regenerated the balanced ClinVar table (5k per class) and rebuilt the TRM dataset so phenotype/provenance fields populate all tokens.
- Trained `cfg_clinvar_long` with `+early_stop_patience=5` (`clinvar_long_20251024-175518`); early stopping fired at step 1248, final eval accuracy 0.819, ROC AUC 0.8859.
- Logistic baseline rerun on the refreshed dataset: accuracy 0.860, ROC AUC 0.9300 (8k/2k split).
- Evaluated `step_1248` in both CPU and CUDA modes after fixing `tools/evaluate_clinvar_checkpoint.py` to create carries on-device; scores match training (accuracy 0.8195, ROC AUC 0.8858).
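The on-device fix amounts to making sure every carry tensor lives on the evaluation device. A generic helper along these lines (not the repo's exact code) illustrates the idea:

```python
import torch

def move_to_device(obj, device):
    """Recursively move tensors (possibly nested in dicts/lists/tuples) to `device`,
    leaving non-tensor values untouched."""
    if torch.is_tensor(obj):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(v, device) for v in obj)
    return obj

# A carry-like structure: latent state plus a halting mask.
carry = {"z": torch.zeros(2, 3), "halted": torch.ones(2, dtype=torch.bool)}
carry = move_to_device(carry, torch.device("cpu"))
```

Without such a move, a carry created on CPU meets GPU-resident model weights and evaluation fails with a device-mismatch error.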
2025-10-24, 19:20 : Plotting helpers for documentation
- Extended `tools/evaluate_clinvar_checkpoint.py` with `--save-preds` so per-example scores can be captured alongside summary metrics.
- Added `scripts/plot_eval_comparison.py` to visualize TinyVariant vs. baseline accuracy/AUC from saved JSON metrics.
- Added `scripts/plot_roc_curve.py` to render ROC curves from prediction JSONL files (it pairs with the new evaluation flag).
- The README Quickstart now documents the evaluation output flags and plotting workflow for future write-ups.
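Given per-example labels and scores like those `--save-preds` captures, ROC AUC can be computed without sklearn via the rank statistic; the toy values here are illustrative:

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC: the probability that a random positive
    outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.7, 0.6, 0.4]
auc = roc_auc(labels, scores)   # 5 of 6 positive/negative pairs are ordered correctly
```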
2025-10-24, 19:40 : Feature ablation toggles + scaling notes
- Added `--phenotype-ablation` and `--provenance-ablation` flags to `tools/build_clinvar_trm_dataset.py` so we can rebuild datasets with phenotype tokens or provenance buckets zeroed out.
- Feature metadata now records the ablation state; downstream baseline/TRM runs pick up the altered dataset automatically.
- README/Instructions document how to run ablation rebuilds and how to scale preprocessing to 50k+ examples via `--max-per-class`.
2025-10-24, 21:05 : 50k-per-class ClinVar run results
- Rebuilt the balanced dataset with `--max-per-class 50000` (100k examples total; ~61k carry phenotype annotations) and refreshed the TRM arrays.
- Logistic baseline on the 80k/20k split reached accuracy 0.8955, ROC AUC 0.9591.
- Trained `cfg_clinvar_long` with early stopping (run `clinvar_long_20251024-194100`); stopped at step 9372 with eval accuracy 0.8828 and ROC AUC 0.9450.
- Updated `tools/evaluate_clinvar_checkpoint.py` to move carries to the active device so GPU evaluation succeeds; metrics for `step_9372` are stored in `outputs/clinvar_trm_metrics.json` and the ROC curve in `docs/figures/clinvar_trm_roc.png`.
2025-10-24, 21:55 : Phenotype ablation baseline
- Rebuilt the TRM dataset with `--phenotype-ablation` (phenotype slots forced to `<none>`).
- TRM run `clinvar_long_phenotype_ablation_20251024-215110` (W&B disabled, single-worker loader) early-stopped at step 14058: eval accuracy 0.8741, ROC AUC 0.9417.
- Evaluated checkpoint `step_14058` on CPU (GPU evaluation was timing out) → metrics saved to `outputs/clinvar_long_phenotype_ablation_20251024-215110_metrics.json`, predictions in the matching `_predictions.jsonl`.
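Conceptually, the phenotype ablation just replaces the phenotype slots with the `<none>` token id while leaving every other token intact; the slot indices and ids below are illustrative:

```python
def ablate_slots(seq, slots, none_id):
    """Return a copy of `seq` with the given slot positions forced to `none_id`."""
    out = list(seq)
    for i in slots:
        out[i] = none_id
    return out

NONE_ID = 2                                   # hypothetical <none> token id
seq = [10, 11, 12, 13, 40, 41, 42, 20, 21]    # positions 4-6 = phenotype slots
assert ablate_slots(seq, slots=(4, 5, 6), none_id=NONE_ID) == \
       [10, 11, 12, 13, 2, 2, 2, 20, 21]
```

Keeping the sequence length fixed means the ablated dataset is drop-in compatible with the same model architecture, so only the information content changes.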
2025-10-25, 07:55 : Provenance ablation baseline
- Rebuilt the TRM dataset with `--provenance-ablation` (submitter/evaluation buckets set to the lowest bucket).
- TRM run `clinvar_long_provenance_ablation_20251025-074717` (WandB offline, single-worker loader) completed the full schedule; eval accuracy 0.8755, ROC AUC 0.9438.
- Evaluated checkpoint `step_15620` on CPU → metrics in `outputs/clinvar_long_provenance_ablation_20251025-074717_metrics.json`, predictions in the matching `_predictions.jsonl`.
2025-10-26, 05:00 : ClinVar 50k hyperparameter sweep
- Hydra sweep (`arch.hidden_size` ∈ {256, 384}, `L_layers` ∈ {2, 3}, `L_cycles` ∈ {2, 3}, `lr` ∈ {5e-4, 3e-4}) completed; per-run metrics regenerated via `scripts/evaluate_sweep_runs.py`.
- `python scripts/analyze_sweep.py` highlights the best config `hidden_size=384`, `L_layers=2`, `L_cycles=2`, `lr=3e-4` with ROC AUC 0.9513 and accuracy 0.8867 (checkpoint `step_9372`).
- Sweep summary saved to `sweep_summary.csv`; heatmap generation skipped (pandas/seaborn unavailable in the sandbox).
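At its core the analyzer just ranks per-run metric records by the chosen metric; a stripped-down sketch (record fields assumed, values illustrative) is:

```python
def best_run(runs, metric="roc_auc"):
    """Return the run record with the highest value for `metric`."""
    return max(runs, key=lambda r: r[metric])

runs = [
    {"hidden_size": 256, "L_layers": 2, "lr": 5e-4, "roc_auc": 0.941, "accuracy": 0.878},
    {"hidden_size": 384, "L_layers": 2, "lr": 3e-4, "roc_auc": 0.9513, "accuracy": 0.8867},
    {"hidden_size": 384, "L_layers": 3, "lr": 5e-4, "roc_auc": 0.948, "accuracy": 0.883},
]
best = best_run(runs)
assert best["hidden_size"] == 384 and best["lr"] == 3e-4
```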
2025-10-26, 15:30 : Best-config TRM run (384 hidden, 2×2 cycles, lr=3e-4)
- Run `clinvar_long_best384_20251026-140402` reached ROC AUC 0.9513, accuracy 0.8867 on the 20k test split; parameter count ≈17.0M.
- Artifacts: `outputs/clinvar_long_best384_20251026-140402_{metrics,predictions}.jsonl`, with plots refreshed under `docs/figures/`.
2025-10-25, 17:45 : Sweep preparation
- Added explicit `run_name`/`checkpoint_path` templating to `config/clinvar_sweep.yaml` so each Hydra job writes to its own folder (`checkpoints/Clinvar_trm-ACT-torch/<override_dirname>`).
- The README now documents the sweep command with `WANDB_DISABLED=true` and `TINYVARIANT_NUM_WORKERS=0` for smoother offline runs.
2025-10-26, 17:10 : Documentation wrap-up & hypothesis outcome
- Updated `README.md` with a key results table, clarifying that the logistic regression baseline (ROC AUC 0.959 / accuracy 0.896) remains ahead of the best TRM run (ROC AUC 0.951 / accuracy 0.887, ~17M params, hidden_size 384, 2×2 cycles).
- Added context that the phenotype/provenance ablations barely dent performance, reinforcing that the recursive halting core adds little for ClinVar missense classification.
- Closed out the README with next-step ideas (write-up, error slicing, new hypotheses) and captured the negative result so this proof of concept can be considered complete.