This document describes the command-line arguments available for `ike`.
--config_filepaths TYPE: str, NARGS: "*"
Paths to configuration files.
See [configargparse](https://github.com/bw2/ConfigArgParse) for details on how configuration files are parsed and combined with command-line arguments.
--pretrained_model_dir TYPE: str
Path to the pretrained model directory
--tokenizer_dir TYPE: str
Path to the tokenizer directory
--use_fast_tokenizer TYPE: flag
Set the `use_fast` flag for the HF tokenizer
--use_tokenizers TYPE: flag
Use the `tokenizers` library instead of `transformers` to build the tokenizer
--attn_implementation TYPE: str, DEFAULT: "flash_attention_2"
Attention implementation to use in HF transformers
Choices: ["eager", "flash_attention_2", "sdpa"]
--train_filepaths TYPE: str, REQUIRED: True, NARGS: "+"
List of training data file paths
--valid_filepaths TYPE: str, NARGS: "+"
List of validation data file paths
If not provided, `held_out_valid_portion` or `held_out_valid_number` must be used to hold out validation data from the training data.
--data_processor_types TYPE: str, NARGS: "+"
List of data processor types to apply
Length must be 1 or the same as the number of training data files. When length is 1, the same data processor is applied to all training data files.
--valid_data_processor_types TYPE: str, NARGS: "+"
List of data processor types for validation data
Length must be 1 or the same as the number of validation data files. When length is 1, the same data processor is applied to all validation data files.
--data_reformatter_type TYPE: str
Type of data reformatter to use
If not provided, the data reformatter is not used.
--held_out_valid_portion TYPE: float
Portion of training data to hold out for validation when validation data is not provided
--held_out_valid_number TYPE: int
Number of training samples to hold out for validation when validation data is not provided
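As a rough sketch of what holding out validation data means; the exact splitting logic in ike may differ, this is only an illustration:

```python
import random

def hold_out_valid(train_data, portion=None, number=None, seed=42):
    """Illustrative split: hold out part of the training data for validation.

    `portion` corresponds to --held_out_valid_portion and `number` to
    --held_out_valid_number; ike's actual implementation may differ.
    """
    data = list(train_data)
    random.Random(seed).shuffle(data)
    n_valid = number if number is not None else int(len(data) * portion)
    return data[n_valid:], data[:n_valid]  # (train, valid)
```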
--min_seq_len TYPE: int, DEFAULT: 1
Minimum sequence length
Note that it is not applied automatically. Use it in your own implementation, for example in a `DataProcessor` or `DataReformatter`, as needed.
--max_seq_len TYPE: int, DEFAULT: 1e9
Maximum sequence length
Note that it is not applied automatically. Use it in your own implementation, for example in a `DataProcessor` or `DataReformatter`, as needed.
--min_src_seq_len TYPE: int, DEFAULT: 1
Minimum source sequence length
Note that it is not applied automatically. Use it in your own implementation, for example in a `DataProcessor` or `DataReformatter`, as needed.
--max_src_seq_len TYPE: int, DEFAULT: 1e9
Maximum source sequence length
Note that it is not applied automatically. Use it in your own implementation, for example in a `DataProcessor` or `DataReformatter`, as needed.
--min_tgt_seq_len TYPE: int, DEFAULT: 1
Minimum target sequence length
Note that it is not applied automatically. Use it in your own implementation, for example in a `DataProcessor` or `DataReformatter`, as needed.
--max_tgt_seq_len TYPE: int, DEFAULT: 1e9
Maximum target sequence length
Note that it is not applied automatically. Use it in your own implementation, for example in a `DataProcessor` or `DataReformatter`, as needed.
--multiple_samples_per_jsonl_line TYPE: flag
Set this flag if your `DataProcessor.line2data()` returns multiple samples per JSONL line rather than a single sample. It is used in the provided `load_data_from_jsonl()` function.
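For illustration, a hypothetical `DataProcessor` whose `line2data()` returns a list of samples per JSONL line (and therefore requires this flag). The class interface shown here is an assumption, and the length filtering is just one way the `min_seq_len`/`max_seq_len` arguments could be used:

```python
import json

class MyDataProcessor:  # hypothetical processor; ike's actual base class may differ
    def __init__(self, tokenizer, min_seq_len=1, max_seq_len=int(1e9)):
        self.tokenizer = tokenizer
        self.min_seq_len = min_seq_len
        self.max_seq_len = max_seq_len

    def line2data(self, line):
        """Return a list of samples for one JSONL line.

        Because multiple samples are returned per line,
        --multiple_samples_per_jsonl_line must be set.
        """
        record = json.loads(line)
        samples = []
        for text in record["texts"]:
            ids = self.tokenizer(text)["input_ids"]
            # --min_seq_len / --max_seq_len are not applied automatically;
            # this is one place where you might enforce them yourself.
            if self.min_seq_len <= len(ids) <= self.max_seq_len:
                samples.append({"input_ids": ids})
        return samples
```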
--micro_batch_size TYPE: int, DEFAULT: 32
Batch size per GPU per accumulation step
--global_batch_size TYPE: int
Global batch size, i.e., `micro_batch_size * n_accum_steps * n_gpus`
Not set by default.
If provided, it is used to recalculate `n_accum_steps` as `global_batch_size // (micro_batch_size * n_gpus)`.
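In other words, assuming the relationship stated above, the batch-size arguments are tied together like this:

```python
# Illustration of the relationship described above.
micro_batch_size = 32    # --micro_batch_size
global_batch_size = 512  # --global_batch_size (if provided)
n_gpus = 8

n_accum_steps = global_batch_size // (micro_batch_size * n_gpus)  # -> 2
assert micro_batch_size * n_accum_steps * n_gpus == global_batch_size
```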
--micro_valid_batch_size TYPE: int, DEFAULT: 32
Batch size per GPU per accumulation step for validation during training
--n_accum_steps TYPE: int, DEFAULT: 1
Number of gradient accumulation steps
--n_epochs TYPE: int, DEFAULT: 1
Number of training epochs
--n_training_steps TYPE: int, DEFAULT: 1e8
Maximum number of training steps
--freeze_word_embeddings TYPE: flag
Freeze word embeddings during training
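Freezing word embeddings in an HF model typically amounts to something like the following; exactly which parameters ike freezes is not specified here, so treat this as a sketch:

```python
# Sketch: disable gradients for the input (and, if tied, output) embeddings.
embedding = model.get_input_embeddings()  # standard HF transformers accessor
embedding.weight.requires_grad_(False)
```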
--optimizer_type TYPE: str, DEFAULT: "adam"
Type of optimizer to use
Choices: ["adam"]
--peak_lr TYPE: float, DEFAULT: 1e-3
Peak learning rate
--min_lr TYPE: float, DEFAULT: 1e-6
Minimum learning rate
--weight_decay TYPE: float, DEFAULT: 0.01
Weight decay coefficient
--adam_beta1 TYPE: float, DEFAULT: 0.9
Adam optimizer beta1 parameter
--adam_beta2 TYPE: float, DEFAULT: 0.95
Adam optimizer beta2 parameter
--adam_eps TYPE: float, DEFAULT: 1e-8
Adam optimizer epsilon parameter
--grad_clip TYPE: float, DEFAULT: 1.0
Gradient clipping value
--dropout_prob TYPE: float, DEFAULT: 0.1
Dropout probability
Note that HF transformers models use dropout probability defined in their configuration files.
--scheduler_type TYPE: str, DEFAULT: "warmup_decay"
Learning rate scheduler type
Choices: ["warmup_decay", "warmup_linear", "warmup_cosine", "constant"]
--n_warmup_steps TYPE: int, DEFAULT: 0
Number of warmup steps
--n_decay_until_steps TYPE: int, DEFAULT: None
Number of steps to decay from peak LR to min LR after warmup, for decay schedulers.
If not provided, decay until (recomputed) `n_training_steps`.
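A minimal sketch of a warmup-then-linear-decay schedule consistent with `peak_lr`, `min_lr`, `n_warmup_steps`, and `n_decay_until_steps`; ike's actual `warmup_decay` scheduler may differ in details:

```python
def lr_at_step(step, peak_lr, min_lr, n_warmup_steps, n_decay_until_steps):
    """Illustrative warmup + linear decay; not necessarily ike's exact formula."""
    if n_warmup_steps > 0 and step < n_warmup_steps:
        return peak_lr * step / n_warmup_steps
    # Linearly decay from peak_lr to min_lr until n_decay_until_steps.
    progress = (step - n_warmup_steps) / max(1, n_decay_until_steps - n_warmup_steps)
    progress = min(1.0, progress)
    return peak_lr + (min_lr - peak_lr) * progress
```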
--from_checkpoint_path TYPE: str
For a DeepSpeed checkpoint, the directory from which to load the model (and optionally the optimizer, scheduler, and training states).
For a non-DeepSpeed checkpoint, the path to the model weights file.
--resume_training TYPE: flag
Resume optimizer, scheduler, and training states from the checkpoint directory specified by `from_checkpoint_path`.
If not provided, only the model weights are loaded from the checkpoint and training starts from scratch.
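With DeepSpeed, the distinction between loading only the weights and resuming the full training state roughly corresponds to the flags of `engine.load_checkpoint()`; how ike wires `--resume_training` to these kwargs is an assumption:

```python
# Sketch: `engine` is the DeepSpeed engine returned by deepspeed.initialize().
# DeepSpeed's load_checkpoint can restore optimizer and LR-scheduler state
# alongside the model weights, or skip them.
resume_training = True  # --resume_training
load_path, client_state = engine.load_checkpoint(
    "path/to/checkpoint_dir",             # --from_checkpoint_path
    load_optimizer_states=resume_training,
    load_lr_scheduler_states=resume_training,
)
```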
--save_log TYPE: flag
Enable logging to a local log file
--save_model TYPE: flag
Save the best checkpoint to disk, based on the metric specified by `monitor_metric`.
--save_all_models TYPE: flag
Save every checkpoint, regardless of its performance
--save_ds_checkpoint TYPE: flag
Save a DeepSpeed checkpoint that includes model, optimizer, scheduler, and training states.
Must be used with `--save_model`.
Must be used if you want to resume training from the saved checkpoint later.
If not provided, only the model weights are saved.
--save_dir TYPE: str
Directory to save logs and checkpoints
--save_filename TYPE: str
An identifier for this experiment run. It is used to construct a subdirectory under `save_dir` for saving logs and checkpoints.
--log_interval TYPE: int, DEFAULT: 100
Steps between logging
If negative, the interval is calculated as `n_training_steps_per_epoch * abs(log_interval)`
--validate_interval TYPE: float, DEFAULT: 1000
Steps between validation runs
If negative, the interval is calculated as `n_training_steps_per_epoch * abs(validate_interval)`
--save_model_interval TYPE: float
Steps between model saves; only used when `save_all_models` is enabled.
If negative, the interval is calculated as `n_training_steps_per_epoch * abs(save_model_interval)`
If not provided, it defaults to `validate_interval`
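The negative-interval convention shared by `log_interval`, `validate_interval`, and `save_model_interval` can be summarized as follows (a sketch of the stated rule, not ike's exact code):

```python
def resolve_interval(interval, n_training_steps_per_epoch):
    """Negative values mean 'a multiple of one epoch', per the rule above."""
    if interval < 0:
        return int(n_training_steps_per_epoch * abs(interval))
    return int(interval)

resolve_interval(100, 5000)   # -> 100 steps
resolve_interval(-0.5, 5000)  # -> 2500 steps (half an epoch)
resolve_interval(-2, 5000)    # -> 10000 steps (every two epochs)
```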
--monitor_metric TYPE: str, DEFAULT: "loss"
The training pipeline will use `ret_stat[monitor_metric]` returned by `valid_forward_step()` to determine the best model to save.
--monitor_metric_should_ascend TYPE: flag, DEFAULT: False
Whether a higher value of `monitor_metric` is better
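The interaction between `monitor_metric` and `monitor_metric_should_ascend` presumably reduces to a comparison like this (illustrative only):

```python
def is_better(current, best, should_ascend):
    """Compare the monitored metric; higher is better only if should_ascend."""
    if best is None:
        return True
    return current > best if should_ascend else current < best

# e.g. with the default --monitor_metric "loss" (should_ascend not set),
# a lower validation loss wins.
```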
--log_grad_norms TYPE: flag
Log gradient norms of all model parameters
--log_weight_norms TYPE: flag
Log weight norms of all model parameters
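Per-parameter gradient and weight norms of a PyTorch model can be gathered as shown below; whether ike logs exactly these quantities is an assumption:

```python
import torch

def collect_norms(model: torch.nn.Module):
    """Return per-parameter weight and gradient L2 norms (a sketch)."""
    weight_norms, grad_norms = {}, {}
    for name, param in model.named_parameters():
        weight_norms[name] = param.detach().norm().item()
        if param.grad is not None:
            grad_norms[name] = param.grad.detach().norm().item()
    return weight_norms, grad_norms
```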
--local_rank TYPE: int, REQUIRED: True
Local rank for distributed training
Required for DeepSpeed.
Automatically set by DeepSpeed launcher.
--zero_stage TYPE: int, DEFAULT: 1
ZeRO optimization stage
Choices: [0, 1, 2, 3]
--fp16 TYPE: flag
Enable FP16 training
Usually NOT used.
--bf16 TYPE: flag
Enable BF16 training
Usually used.
--activation_checkpointing_layers TYPE: int
Number of layers for activation checkpointing
--offload_adam TYPE: flag
Enable optimizer state offloading to CPU
--offload_param TYPE: flag
Enable parameter offloading to CPU
Must be used with `--zero_stage 3`.
--pretrained_peft_model_dir TYPE: str
Path to pretrained PEFT model directory
--peft_modules_to_save TYPE: str, NARGS: "+"
Non-PEFT trainable parameters; corresponds to `modules_to_save` in PEFT.
--peft_type TYPE: PeftType
Type of PEFT to use
Choices: ["LORA"]
--peft_task_type TYPE: TaskType
PEFT task type
Choices: ["SEQ_CLS", "SEQ_2_SEQ_LM", "CAUSAL_LM", "TOKEN_CLS"]
--peft_inference_mode TYPE: flag
Enable PEFT inference mode
--peft_lora_r TYPE: int, DEFAULT: 8
LoRA attention dimension
--peft_lora_alpha TYPE: float, DEFAULT: 32
LoRA alpha parameter
--peft_lora_dropout TYPE: float, DEFAULT: 0.1
LoRA dropout probability
--peft_lora_fan_in_fan_out TYPE: flag
Enable LoRA fan-in/fan-out
--peft_lora_target_modules TYPE: str, NARGS: "+", DEFAULT: None
Target modules for LoRA
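The `peft_*` arguments line up with the fields of `peft.LoraConfig`. The mapping below is an assumption about how ike forwards them, but the `peft` API itself is standard:

```python
from peft import LoraConfig, TaskType, get_peft_model

# `model` is the HF model loaded from --pretrained_model_dir.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,         # --peft_task_type
    inference_mode=False,                 # --peft_inference_mode
    r=8,                                  # --peft_lora_r
    lora_alpha=32,                        # --peft_lora_alpha
    lora_dropout=0.1,                     # --peft_lora_dropout
    fan_in_fan_out=False,                 # --peft_lora_fan_in_fan_out
    target_modules=["q_proj", "v_proj"],  # --peft_lora_target_modules
    modules_to_save=["lm_head"],          # --peft_modules_to_save
)
peft_model = get_peft_model(model, lora_config)
```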
--generation_min_len TYPE: int, DEFAULT: 1
Minimum number of tokens to generate
--generation_max_len TYPE: int, DEFAULT: 128
Maximum number of tokens to generate
--generation_do_sample TYPE: flag
If set, use sampling; otherwise use greedy decoding
--generation_top_p TYPE: float, DEFAULT: 1.0
Top-p sampling parameter
[0.0, 1.0]
--generation_top_k TYPE: int, DEFAULT: 0
Top-k sampling parameter
[0, vocab_size]
0 disables top-k filtering
--generation_temperature TYPE: float, DEFAULT: 1.0
Sampling temperature
[0.0, inf]
0.0: one-hot distribution, and thus greedy decoding
0.0-1.0: sharper distribution
1.0: original distribution
1.0-inf: smoother distribution
--generation_repetition_penalty TYPE: float, DEFAULT: 1.0
Repetition penalty
--generation_num_return_sequences TYPE: int, DEFAULT: 1
Number of sequences to return
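These generation arguments correspond closely to the standard kwargs of HF `model.generate()`; the exact mapping (for example, maximum length vs. maximum new tokens) is an assumption:

```python
# Sketch: possible mapping onto HF transformers generation kwargs.
# `model` is the HF model and `inputs` the tokenized prompt batch.
outputs = model.generate(
    **inputs,
    min_new_tokens=1,         # --generation_min_len
    max_new_tokens=128,       # --generation_max_len
    do_sample=True,           # --generation_do_sample
    top_p=1.0,                # --generation_top_p
    top_k=0,                  # --generation_top_k
    temperature=1.0,          # --generation_temperature
    repetition_penalty=1.0,   # --generation_repetition_penalty
    num_return_sequences=1,   # --generation_num_return_sequences
)
```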
--use_legacy_past_key_values TYPE: flag
Use legacy data structure for past_key_values in HF transformers
--seed TYPE: int, DEFAULT: 42
Random seed
--debug_mode TYPE: flag
Enable debug mode
You may want to use this flag in your implementation.
For example, in the provided `load_data_from_jsonl()` function, multiprocessing is disabled in debug mode, so that you can debug your data processing code more easily.
--debug_mode_data_size TYPE: int, DEFAULT: 1000
Number of samples to use in debug mode
Used in the provided `load_data_from_jsonl()` function.
--data_processor_chunksize TYPE: int, DEFAULT: 1000
Chunk size for multiprocessing in data processing
Used in the provided `load_data_from_jsonl()` function.
--validation_on_rank_0_only TYPE: flag
Run validation only on rank 0 (the first GPU).
If not provided, all GPUs are used for validation.