The core workflow is:
1. `augment_fasta.py` → slice sequences and generate augmented FASTA + domain dict
2. `data_processing.py` → tokenize, label, batch, and shard datasets
3. `scripts/train/train_rsalm.py` → train/evaluate the RSALM model on shards
Splits long sequences into domain-preserving slices and optionally emits shuffled and negative variants. Produces a new FASTA and a new domain dict with aligned IDs.
Key inputs
- `--fasta`, `--domain-dict`
- `--output-fasta`, `--output-dict`
Common flags
- `--max-length`: slice length threshold
- `--negative-prob`: target fraction of negatives (approximate)
- `--include-domain-slices`, `--shuffle-only`, `--no-shuffle`, `--domain-slices-only`
- `--large-data` with `--p-shuffled`, `--domain-counts-tsv`, `--domain-slice-frac`
- `--seed`, `--verbose`
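The slicing behaviour described above can be sketched roughly as follows. This is an illustration only, not the script's actual implementation: the function name, the domain representation (a list of non-overlapping half-open `(start, end)` spans per sequence), and the fallback for oversized domains are all assumptions.

```python
def slice_sequence(seq, domains, max_length):
    """Split seq into slices of at most max_length characters,
    preferring cut points that do not fall inside a domain span.

    domains: non-overlapping, half-open (start, end) intervals into seq.
    """
    slices = []
    start = 0
    while len(seq) - start > max_length:
        cut = start + max_length
        # Move the cut left to a domain boundary if it would split a domain.
        for d_start, d_end in sorted(domains):
            if d_start < cut < d_end:
                cut = d_start
        if cut <= start:  # a domain longer than max_length: hard cut
            cut = start + max_length
        slices.append(seq[start:cut])
        start = cut
    slices.append(seq[start:])
    return slices
```

For example, with `max_length=6` and a domain at `(4, 8)`, a 10-residue sequence is cut at position 4 rather than 6, so the domain stays in one slice.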
Tokenizes sequences, generates per-token labels from the domain dict and label mapping, batches by token budget, and saves shards.
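Batching by token budget can be sketched as a greedy packer that keeps the padded batch size (batch size × longest sequence) under the budget. This is a minimal illustration of the idea, assuming length-sorted greedy packing; it is not the script's actual implementation.

```python
def batch_by_token_budget(examples, max_tokens_per_batch):
    """Greedily pack tokenized examples into batches whose padded size
    (len(batch) * longest example) stays within the token budget."""
    batches, current, longest = [], [], 0
    for ex in sorted(examples, key=len, reverse=True):
        new_longest = max(longest, len(ex))
        if current and (len(current) + 1) * new_longest > max_tokens_per_batch:
            batches.append(current)          # budget exceeded: start a new batch
            current, longest = [], 0
            new_longest = len(ex)
        current.append(ex)
        longest = new_longest
    if current:
        batches.append(current)
    return batches
```

Sorting by length first keeps similarly sized sequences together, which minimises padding waste inside each batch.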
Config handling
- This script is CLI-only; it does not read `config.yaml`.
Required args
- `--fasta`, `--domain-dict`, `--output-dir`, `--ignore-label`
- `--model-name`, `--max-length`, `--max-tokens-per-batch`
- `--label-mapping-dict`
Optional args
- `--chunk-size`, `--tmp-dir`, `--shard-size`, `--seed`, `--keep-tmp`
Notes
- ID normalization uses the FASTA header segment between `>` and the first space.
- `--ignore-label` must match the training `--ignore-label`.
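The ID normalization rule amounts to a one-liner; the function name here is illustrative, not the script's:

```python
def normalize_id(header_line):
    """Return the FASTA header segment between '>' and the first whitespace."""
    return header_line.lstrip(">").split()[0]
```

So `>seq1 Homo sapiens` and a domain-dict key of `seq1` refer to the same record.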
Trains or evaluates RSALM on preprocessed shard datasets.
Config handling
- Training always uses a YAML config.
- If `--config` is provided without a value, the script looks for `src/rsalm/config.yaml`.
- If `--config` is not provided, the script still looks for `src/rsalm/config.yaml`.
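This resolution rule matches what `argparse` produces when `nargs="?"` is given the same `const` and `default`. A minimal sketch (assuming the script uses `argparse`; the default path comes from the text above):

```python
import argparse

DEFAULT_CONFIG = "src/rsalm/config.yaml"

parser = argparse.ArgumentParser()
# A bare `--config` falls back to `const`; omitting the flag falls back
# to `default` -- both point at the same template path.
parser.add_argument("--config", nargs="?",
                    const=DEFAULT_CONFIG, default=DEFAULT_CONFIG)

assert parser.parse_args(["--config"]).config == DEFAULT_CONFIG        # bare flag
assert parser.parse_args([]).config == DEFAULT_CONFIG                  # flag omitted
assert parser.parse_args(["--config", "my.yaml"]).config == "my.yaml"  # explicit value
```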
Required args
- `--val-dir`, `--ignore-label`
- `--train-dir` if `training.total_steps > 0` in config
Optional args
- `--label-mapping-dict` to override config `model.label_mapping_path`
Checkpoint loading
- Supports `model.safetensors` or `pytorch_model.bin` within a checkpoint directory, or a direct path to a `.safetensors`/`.bin` file.
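The lookup order above can be sketched as a small resolver; the function name and error handling are assumptions for illustration:

```python
import os

def resolve_checkpoint(path):
    """Resolve a checkpoint argument to a concrete weights file.

    Accepts a directory containing model.safetensors or pytorch_model.bin,
    or a direct path to a .safetensors/.bin file.
    """
    if os.path.isdir(path):
        for name in ("model.safetensors", "pytorch_model.bin"):
            candidate = os.path.join(path, name)
            if os.path.isfile(candidate):
                return candidate
        raise FileNotFoundError(f"no weights file found in {path}")
    if path.endswith((".safetensors", ".bin")):
        return path
    raise ValueError(f"unsupported checkpoint path: {path}")
```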
Logging
- `report_to=["wandb"]` is enabled by default.
The scripts expect a YAML config with these sections:
model
- `model_name`, `max_position_embeddings`, `max_batch_size`, `output_size`
- `freeze_esm`, `use_fa`
- `pretrained_checkpoint_path`, `label_mapping_path`
training
- `gradient_accumulation_steps`, `learning_rate`, `optimizer`, `gradient_clipping`
- `lr_scheduler`, `eval_strategy`, `eval_steps`, `total_steps`, `warmup_steps`
- `logging_steps`, `save_steps`, `output_dir`
- `mixed_precision`, `dataloader_num_workers`, `dataloader_prefetch_factor`, `dataloader_pin_memory`, `seed`
data
- `chunk_size`, `default_tmp_dir`, `default_shard_size`
`src/rsalm/config.yaml` is provided as a template with null values. Populate it
before use, or pass all required values via CLI without `--config`.
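A populated config might look like the sketch below. Only the section and key names come from the schema above; every value is illustrative and should be replaced with settings appropriate to your data and hardware.

```yaml
model:
  model_name: multimolecule/rinalmo-giga
  max_position_embeddings: 4096
  max_batch_size: 8
  output_size: 2
  freeze_esm: true
  use_fa: true
  pretrained_checkpoint_path: null
  label_mapping_path: labels.pkl

training:
  gradient_accumulation_steps: 4
  learning_rate: 1.0e-4
  optimizer: adamw
  gradient_clipping: 1.0
  lr_scheduler: cosine
  eval_strategy: steps
  eval_steps: 500
  total_steps: 10000
  warmup_steps: 500
  logging_steps: 50
  save_steps: 1000
  output_dir: runs/rsalm
  mixed_precision: bf16
  dataloader_num_workers: 4
  dataloader_prefetch_factor: 2
  dataloader_pin_memory: true
  seed: 42

data:
  chunk_size: 10000
  default_tmp_dir: tmp/
  default_shard_size: 1000
```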
```bash
python scripts/data/augment_fasta.py \
  --fasta input.fa \
  --domain-dict domains.pkl \
  --output-fasta augmented.fa \
  --output-dict augmented.pkl
```
```bash
python scripts/data/data_processing.py \
  --fasta augmented.fa \
  --domain-dict augmented.pkl \
  --label-mapping-dict labels.pkl \
  --output-dir data/shards \
  --model-name multimolecule/rinalmo-giga \
  --max-length 4096 \
  --max-tokens-per-batch 8196 \
  --ignore-label -100
```
```bash
python scripts/train/train_rsalm.py \
  --config src/rsalm/config.yaml \
  --train-dir data/shards/train \
  --val-dir data/shards/val \
  --ignore-label -100
```
- `PyYAML` is required for config loading.
- `multimolecule` is required for the RiNALMo model and tokenizer.
- Core runtime uses `torch`, `transformers`, and `datasets`.