Skip to content

Protein-Sequence-Annotation/RSALM

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

95 Commits
 
 
 
 
 
 

Repository files navigation

Scripts overview

The core workflow is:

  1. augment_fasta.py → slice sequences and generate augmented FASTA + domain dict
  2. data_processing.py → tokenize, label, batch, and shard datasets
  3. scripts/train/train_rsalm.py → train/evaluate the RSALM model on shards

scripts/data/augment_fasta.py

Splits long sequences into domain-preserving slices and optionally emits shuffled and negative variants. Produces a new FASTA and a new domain dict with aligned IDs.

Key inputs

  • --fasta, --domain-dict
  • --output-fasta, --output-dict

Common flags

  • --max-length: slice length threshold
  • --negative-prob: target fraction of negatives (approximate)
  • --include-domain-slices, --shuffle-only, --no-shuffle, --domain-slices-only
  • --large-data with --p-shuffled, --domain-counts-tsv, --domain-slice-frac
  • --seed, --verbose

scripts/data/data_processing.py

Tokenizes sequences, generates per-token labels from the domain dict and label mapping, batches by token budget, and saves shards.

Config handling

  • This script is CLI-only; it does not read config.yaml.

Required args

  • --fasta, --domain-dict, --output-dir, --ignore-label
  • --model-name, --max-length, --max-tokens-per-batch
  • --label-mapping-dict

Optional args

  • --chunk-size, --tmp-dir, --shard-size, --seed, --keep-tmp

Notes

  • ID normalization uses the FASTA header segment between > and the first space.
  • --ignore-label must match the training --ignore-label.

scripts/train/train_rsalm.py

Trains or evaluates RSALM on preprocessed shard datasets.

Config handling

  • Training always uses a YAML config.
  • If --config is provided without a value, the script looks for src/rsalm/config.yaml.
  • If --config is not provided, the script still looks for src/rsalm/config.yaml.

Required args

  • --val-dir, --ignore-label
  • --train-dir if training.total_steps > 0 in config

Optional args

  • --label-mapping-dict to override config model.label_mapping_path

Checkpoint loading

  • Supports model.safetensors or pytorch_model.bin within a checkpoint directory, or a direct path to a .safetensors/.bin file.

Logging

  • report_to=["wandb"] is enabled by default.

Config format

The scripts expect a YAML config with these sections:

model

  • model_name
  • max_position_embeddings
  • max_batch_size
  • output_size
  • freeze_esm
  • use_fa
  • pretrained_checkpoint_path
  • label_mapping_path

training

  • gradient_accumulation_steps, learning_rate, optimizer, gradient_clipping
  • lr_scheduler, eval_strategy, eval_steps, total_steps, warmup_steps
  • logging_steps, save_steps, output_dir
  • mixed_precision, dataloader_num_workers, dataloader_prefetch_factor, dataloader_pin_memory, seed

data

  • chunk_size, default_tmp_dir, default_shard_size

src/rsalm/config.yaml is provided as a template with null values. Populate it before use, or pass all required values via CLI without --config.

Example usage

python scripts/data/augment_fasta.py \
  --fasta input.fa \
  --domain-dict domains.pkl \
  --output-fasta augmented.fa \
  --output-dict augmented.pkl
python scripts/data/data_processing.py \
  --fasta augmented.fa \
  --domain-dict augmented.pkl \
  --label-mapping-dict labels.pkl \
  --output-dir data/shards \
  --model-name multimolecule/rinalmo-giga \
  --max-length 4096 \
  --max-tokens-per-batch 8196 \
  --ignore-label -100
python scripts/train/train_rsalm.py \
  --config src/rsalm/config.yaml \
  --train-dir data/shards/train \
  --val-dir data/shards/val \
  --ignore-label -100

Dependencies

  • PyYAML is required for config loading.
  • multimolecule is required for the RiNALMo model and tokenizer.
  • Core runtime uses torch, transformers, and datasets.

About

RNA Sequence Annotation with Language Models

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%