ike Arguments Documentation

This document describes the command-line arguments available for ike.

ConfigArgParse Arguments

--config_filepaths TYPE: str, NARGS: "*"
    Paths to configuration files.
    See [configargparse](https://github.com/bw2/ConfigArgParse) for details on using configuration files.
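
As a point of reference, below is a minimal sketch of how such a parser is typically set up with configargparse; the argument names other than `--config_filepaths` are assumptions for illustration, not the actual ike source, and the sketch accepts a single config file whereas ike accepts several.

```python
# Minimal configargparse sketch (assumed names; not the actual ike parser setup).
import configargparse

parser = configargparse.ArgumentParser()
# A file passed via --config_filepaths is read as a config file with lines like
# "peak_lr = 5e-4"; values given on the command line override the file.
parser.add_argument("--config_filepaths", is_config_file=True)
parser.add_argument("--pretrained_model_dir", type=str)
parser.add_argument("--peak_lr", type=float, default=1e-3)

if __name__ == "__main__":
    args = parser.parse_args()
    print(args.pretrained_model_dir, args.peak_lr)
```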

Model Arguments

--pretrained_model_dir TYPE: str
    Path to the pretrained model directory

--tokenizer_dir TYPE: str
    Path to the tokenizer directory

--use_fast_tokenizer TYPE: flag
    Set `use_fast` flag for HF tokenizer

--use_tokenizers TYPE: flag
    Use the `tokenizers` library instead of `transformers` for building the tokenizer

--attn_implementation TYPE: str, DEFAULT: "flash_attention_2"
    Attention implementation to use in HF transformers
    Choices: ["eager", "flash_attention_2", "sdpa"]

Data Arguments

--train_filepaths TYPE: str, REQUIRED: True, NARGS: "+"
    List of training data file paths

--valid_filepaths TYPE: str, NARGS: "+"
    List of validation data file paths
    If not provided, must use `held_out_valid_portion` or `held_out_valid_number` to hold out validation data from training data.

--data_processor_types TYPE: str, NARGS: "+"
    List of data processor types to apply
    Length must be 1 or the same as the number of training data files. When length is 1, the same data processor is applied to all training data files.

--valid_data_processor_types TYPE: str, NARGS: "+"
    List of data processor types for validation data
    Length must be 1 or the same as the number of validation data files. When length is 1, the same data processor is applied to all validation data files.
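
The length-1 broadcasting rule described above, which applies to both the training and validation processor types, can be pictured with a small helper; this is a sketch of the assumed behavior, not the actual ike code.

```python
# Sketch of the assumed length-1 broadcasting rule for processor types.
def broadcast_processor_types(processor_types, filepaths):
    """Pair each data file with a processor type.

    A single processor type is reused for every file; otherwise the two
    lists must have the same length.
    """
    if len(processor_types) == 1:
        processor_types = processor_types * len(filepaths)
    if len(processor_types) != len(filepaths):
        raise ValueError("processor types must have length 1 or match the number of files")
    return list(zip(filepaths, processor_types))


pairs = broadcast_processor_types(["chat"], ["a.jsonl", "b.jsonl"])
# -> [("a.jsonl", "chat"), ("b.jsonl", "chat")]
```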

--data_reformatter_type TYPE: str
    Type of data reformatter to use
    If not provided, the data reformatter is not used.

--held_out_valid_portion TYPE: float
    Portion of training data to hold out for validation when validation data is not provided

--held_out_valid_number TYPE: int
    Number of training samples to hold out for validation when validation data is not provided

--min_seq_len TYPE: int, DEFAULT: 1
    Minimum sequence length
    Note it is not used automatically. Use it in your implementation, for example of `DataProcessor` or `DataReformatter`, as needed.

--max_seq_len TYPE: int, DEFAULT: 1e9
    Maximum sequence length
    Note it is not used automatically. Use it in your implementation, for example of `DataProcessor` or `DataReformatter`, as needed.

--min_src_seq_len TYPE: int, DEFAULT: 1
    Minimum source sequence length
    Note it is not used automatically. Use it in your implementation, for example of `DataProcessor` or `DataReformatter`, as needed.

--max_src_seq_len TYPE: int, DEFAULT: 1e9
    Maximum source sequence length
    Note it is not used automatically. Use it in your implementation, for example of `DataProcessor` or `DataReformatter`, as needed.

--min_tgt_seq_len TYPE: int, DEFAULT: 1
    Minimum target sequence length
    Note it is not used automatically. Use it in your implementation, for example of `DataProcessor` or `DataReformatter`, as needed.

--max_tgt_seq_len TYPE: int, DEFAULT: 1e9
    Maximum target sequence length
    Note it is not used automatically. Use it in your implementation, for example of `DataProcessor` or `DataReformatter`, as needed.

--multiple_samples_per_jsonl_line TYPE: flag
    Set this flag if your `DataProcessor.line2data()` returns multiple samples per JSONL line instead of a single sample.
    Used in the provided `load_data_from_jsonl()` function.
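
Since the `min/max_*_seq_len` limits above are not applied automatically, a `DataProcessor` implementation might apply them itself along the following lines; everything except the `line2data()` method name is an assumption for illustration.

```python
import json

# Sketch of a DataProcessor that applies the seq-len limits itself
# (the limits are not enforced automatically by the pipeline).
class MyDataProcessor:
    def __init__(self, tokenizer, args):
        self.tokenizer = tokenizer
        self.min_seq_len = args.min_seq_len
        self.max_seq_len = args.max_seq_len

    def line2data(self, line):
        """Turn one JSONL line into zero or more samples.

        Returning a list of samples requires --multiple_samples_per_jsonl_line.
        """
        record = json.loads(line)
        token_ids = self.tokenizer(record["text"])["input_ids"]
        if not (self.min_seq_len <= len(token_ids) <= self.max_seq_len):
            return []  # drop samples outside the configured length range
        return [{"input_ids": token_ids}]
```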

Training and Inference Arguments

--micro_batch_size TYPE: int, DEFAULT: 32
    Batch size per GPU per accumulation step

Training Arguments

--global_batch_size TYPE: int
    Global batch size, i.e., micro_batch_size * n_accum_steps * n_gpus
    Not provided by default.
    If provided, it is used to recalculate `n_accum_steps` as `global_batch_size // (micro_batch_size * n_gpus)`.

--micro_valid_batch_size TYPE: int, DEFAULT: 32
    Batch size per GPU per accumulation step for validation during training

--n_accum_steps TYPE: int, DEFAULT: 1
    Number of gradient accumulation steps
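
The relation between these batch-size arguments is plain arithmetic; a small sketch of the recomputation described above:

```python
# Sketch of the batch-size bookkeeping described above.
micro_batch_size = 32
n_gpus = 8
global_batch_size = 512

# global_batch_size = micro_batch_size * n_accum_steps * n_gpus, so:
n_accum_steps = global_batch_size // (micro_batch_size * n_gpus)  # -> 2
assert micro_batch_size * n_accum_steps * n_gpus == global_batch_size
```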

--n_epochs TYPE: int, DEFAULT: 1
    Number of training epochs

--n_training_steps TYPE: int, DEFAULT: 1e8
    Maximum number of training steps

--freeze_word_embeddings TYPE: flag
    Freeze word embeddings during training

--optimizer_type TYPE: str, DEFAULT: "adam"
    Type of optimizer to use
    Choices: ["adam"]

--peak_lr TYPE: float, DEFAULT: 1e-3
    Peak learning rate

--min_lr TYPE: float, DEFAULT: 1e-6
    Minimum learning rate

--weight_decay TYPE: float, DEFAULT: 0.01
    Weight decay coefficient

--adam_beta1 TYPE: float, DEFAULT: 0.9
    Adam optimizer beta1 parameter

--adam_beta2 TYPE: float, DEFAULT: 0.95
    Adam optimizer beta2 parameter

--adam_eps TYPE: float, DEFAULT: 1e-8
    Adam optimizer epsilon parameter

--grad_clip TYPE: float, DEFAULT: 1.0
    Gradient clipping value

--dropout_prob TYPE: float, DEFAULT: 0.1
    Dropout probability
    Note that HF transformers models use dropout probability defined in their configuration files.

--scheduler_type TYPE: str, DEFAULT: "warmup_decay"
    Learning rate scheduler type
    Choices: ["warmup_decay", "warmup_linear", "warmup_cosine", "constant"]

--n_warmup_steps TYPE: int, DEFAULT: 0
    Number of warmup steps

--n_decay_until_steps TYPE: int, DEFAULT: None
    Number of steps to decay from peak LR to min LR after warmup, for decay schedulers.
    If not provided, decay until (recomputed) `n_training_steps`.
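
As an illustration of how `peak_lr`, `min_lr`, `n_warmup_steps`, and `n_decay_until_steps` fit together, below is a sketch of a warmup-then-decay schedule; linear warmup and linear decay are assumed here, and the actual curve behind each `scheduler_type` may differ.

```python
# Sketch of a warmup-then-decay schedule; the actual curves ike uses may differ.
def lr_at_step(step, peak_lr, min_lr, n_warmup_steps, n_decay_until_steps):
    if n_warmup_steps > 0 and step < n_warmup_steps:
        return peak_lr * step / n_warmup_steps            # linear warmup to peak_lr
    decay_steps = max(n_decay_until_steps - n_warmup_steps, 1)
    progress = min((step - n_warmup_steps) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress        # linear decay to min_lr
```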

Management Arguments

--from_checkpoint_path TYPE: str
    For a DeepSpeed checkpoint, the directory from which to load the model (and, with `--resume_training`, optimizer, scheduler, and training states).
    For a non-DeepSpeed checkpoint, the path to the model weights file.

--resume_training TYPE: flag
    Resume optimizer, scheduler, and training states from the checkpoint specified by `from_checkpoint_path`.
    If not provided, only the model weights are loaded from the checkpoint and training starts from scratch.

--save_log TYPE: flag
    Enable logging to a local log file

--save_model TYPE: flag
    Save the best checkpoint to disk, based on the metric specified by `monitor_metric`.

--save_all_models TYPE: flag
    Save every checkpoint, regardless of its performance

--save_ds_checkpoint TYPE: flag
    Save a DeepSpeed checkpoint that includes model, optimizer, scheduler, and training states.
    Must be used with `--save_model`.
    Must be used if you want to resume training from the saved checkpoint later.
    If not provided, only the model weights are saved.

--save_dir TYPE: str
    Directory to save logs and checkpoints

--save_filename TYPE: str
    An identifier for this experiment run. It is used to construct a subdirectory under `save_dir` for saving logs and checkpoints.

--log_interval TYPE: int, DEFAULT: 100
    Steps between logging
    If negative, calculated as n_training_steps_per_epoch * abs(log_interval)

--validate_interval TYPE: float, DEFAULT: 1000
    Steps between validation runs
    If negative, calculated as n_training_steps_per_epoch * abs(validate_interval)

--save_model_interval TYPE: float
    Steps between model saving, only used when `save_all_models` is enabled.
    If negative, calculated as n_training_steps_per_epoch * abs(save_model_interval)
    If not provided, it is set to `validate_interval`
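
A sketch of how a negative interval is resolved into an absolute step count; the same rule applies to `log_interval`, `validate_interval`, and `save_model_interval`.

```python
# Sketch of resolving a possibly-negative interval into a step count.
def resolve_interval(interval, n_training_steps_per_epoch):
    if interval < 0:
        # Negative values are fractions of an epoch, e.g. -0.5 -> twice per epoch.
        return int(n_training_steps_per_epoch * abs(interval))
    return int(interval)


resolve_interval(-0.5, n_training_steps_per_epoch=10_000)  # -> 5000
resolve_interval(1000, n_training_steps_per_epoch=10_000)  # -> 1000
```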

--monitor_metric TYPE: str, DEFAULT: "loss"
    The training pipeline will use `ret_stat[monitor_metric]` returned by `valid_forward_step()` to determine the best model to save.

--monitor_metric_should_ascend TYPE: flag, DEFAULT: False
    Whether a higher value of `monitor_metric` is better
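
The interaction of `monitor_metric` and `monitor_metric_should_ascend` can be pictured as tracking the best value seen so far; this is a sketch of the assumed logic, not the actual pipeline code.

```python
import math

# Sketch of the assumed best-checkpoint selection around monitor_metric.
class BestModelTracker:
    def __init__(self, monitor_metric="loss", should_ascend=False):
        self.monitor_metric = monitor_metric
        self.should_ascend = should_ascend
        self.best = -math.inf if should_ascend else math.inf

    def update(self, ret_stat):
        """Return True if ret_stat[monitor_metric] is the best value so far."""
        value = ret_stat[self.monitor_metric]
        improved = value > self.best if self.should_ascend else value < self.best
        if improved:
            self.best = value
        return improved


tracker = BestModelTracker("loss", should_ascend=False)
tracker.update({"loss": 2.1})  # True  -> save checkpoint
tracker.update({"loss": 2.4})  # False -> skip
```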

--log_grad_norms TYPE: flag
    Log gradient norms of all model parameters

--log_weight_norms TYPE: flag
    Log weight norms of all model parameters

DeepSpeed Arguments

--local_rank TYPE: int, REQUIRED: True
    Local rank for distributed training
    Required for DeepSpeed.
    Automatically set by the DeepSpeed launcher.

--zero_stage TYPE: int, DEFAULT: 1
    ZeRO optimization stage
    Choices: [0, 1, 2, 3]

--fp16 TYPE: flag
    Enable FP16 training
    Usually NOT used.

--bf16 TYPE: flag
    Enable BF16 training
    Usually used.

--activation_checkpointing_layers TYPE: int
    Number of layers for activation checkpointing

--offload_adam TYPE: flag
    Enable optimizer state offloading to CPU

--offload_param TYPE: flag
    Enable parameter offloading to CPU
    Must be used with `--zero_stage 3`.

PEFT Arguments

--pretrained_peft_model_dir TYPE: str
    Path to pretrained PEFT model directory

--peft_modules_to_save TYPE: str, NARGS: "+"
    Non-PEFT trainable parameters; corresponds to `modules_to_save` in PEFT.

--peft_type TYPE: PeftType
    Type of PEFT to use
    Choices: ["LORA"]

--peft_task_type TYPE: TaskType
    PEFT task type
    Choices: ["SEQ_CLS", "SEQ_2_SEQ_LM", "CAUSAL_LM", "TOKEN_CLS"]

--peft_inference_mode TYPE: flag
    Enable PEFT inference mode

--peft_lora_r TYPE: int, DEFAULT: 8
    LoRA attention dimension

--peft_lora_alpha TYPE: float, DEFAULT: 32
    LoRA alpha parameter

--peft_lora_dropout TYPE: float, DEFAULT: 0.1
    LoRA dropout probability

--peft_lora_fan_in_fan_out TYPE: flag
    Enable LoRA fan-in/fan-out

--peft_lora_target_modules TYPE: str, NARGS: "+", DEFAULT: None
    Target modules for LoRA
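
These arguments map roughly onto a `peft.LoraConfig`; the sketch below shows the assumed mapping, with example values for the target modules and `modules_to_save` (ike may construct the config differently).

```python
from peft import LoraConfig, TaskType

# Sketch of how the --peft_* arguments map onto a LoraConfig
# (assumed mapping; not necessarily the exact ike construction).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,          # --peft_task_type
    inference_mode=False,                  # --peft_inference_mode
    r=8,                                   # --peft_lora_r
    lora_alpha=32,                         # --peft_lora_alpha
    lora_dropout=0.1,                      # --peft_lora_dropout
    fan_in_fan_out=False,                  # --peft_lora_fan_in_fan_out
    target_modules=["q_proj", "v_proj"],   # --peft_lora_target_modules (example values)
    modules_to_save=["lm_head"],           # --peft_modules_to_save (example value)
)
# The wrapped model is then built with peft.get_peft_model(base_model, lora_config).
```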

Generation Arguments

--generation_min_len TYPE: int, DEFAULT: 1
    Minimum number of tokens to generate

--generation_max_len TYPE: int, DEFAULT: 128
    Maximum number of tokens to generate

--generation_do_sample TYPE: flag
    Whether to use sampling or greedy decoding

--generation_top_p TYPE: float, DEFAULT: 1.0
    Top-p sampling parameter
    [0.0, 1.0]

--generation_top_k TYPE: int, DEFAULT: 0
    Top-k sampling parameter
    [0, vocab_size]
    0 disables top-k filtering

--generation_temperature TYPE: float, DEFAULT: 1.0
    Sampling temperature
    [0.0, inf]
    0.0: one-hot distribution, and thus greedy decoding
    0.0-1.0: sharper distribution
    1.0: original distribution
    1.0-inf: smoother distribution
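
Temperature rescales the logits before the softmax, and top-k keeps only the k highest-scoring tokens; the sketch below shows the standard computation (not ike-specific).

```python
import numpy as np

# Standard temperature scaling and top-k filtering of logits (not ike-specific).
def sample_probs(logits, temperature=1.0, top_k=0):
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        # Greedy decoding: all probability mass on the argmax token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    logits = logits / temperature          # <1.0 sharpens, >1.0 smooths the distribution
    if top_k > 0:
        cutoff = np.sort(logits)[-top_k]   # keep only the top_k highest logits
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```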

--generation_repetition_penalty TYPE: float, DEFAULT: 1.0
    Repetition penalty

--generation_num_return_sequences TYPE: int, DEFAULT: 1
    Number of sequences to return

--use_legacy_past_key_values TYPE: flag
    Use legacy data structure for past_key_values in HF transformers

Miscellaneous Arguments

--seed TYPE: int, DEFAULT: 42
    Random seed

--debug_mode TYPE: flag
    Enable debug mode
    You may want to use this flag in your implementation.
    For example, in the provided `load_data_from_jsonl()` function, multiprocessing is disabled in debug mode, so that you can debug your data processing code more easily.

--debug_mode_data_size TYPE: int, DEFAULT: 1000
    Number of samples to use in debug mode
    Used in the provided `load_data_from_jsonl()` function.

--data_processor_chunksize TYPE: int, DEFAULT: 1000
    Chunk size for multiprocessing in data processing
    Used in the provided `load_data_from_jsonl()` function.
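
A sketch of how `debug_mode`, `debug_mode_data_size`, and `data_processor_chunksize` are assumed to interact inside a `load_data_from_jsonl()`-style loader; the real function may differ.

```python
from multiprocessing import Pool

# Sketch of how debug_mode and data_processor_chunksize are assumed to be used
# by a load_data_from_jsonl()-style loader; the real function may differ.
def load_data_from_jsonl(filepath, processor, debug_mode=False,
                         debug_mode_data_size=1000, chunksize=1000):
    with open(filepath) as f:
        lines = f.readlines()
    if debug_mode:
        # Small subset, no multiprocessing: easier to step through with a debugger.
        lines = lines[:debug_mode_data_size]
        results = [processor.line2data(line) for line in lines]
    else:
        with Pool() as pool:
            results = pool.map(processor.line2data, lines, chunksize=chunksize)
    # Flatten if line2data() returns multiple samples per line.
    return [sample for samples in results for sample in samples]
```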

--validation_on_rank_0_only TYPE: flag
    Run validation on rank 0 (the first GPU) only.
    If not provided, all GPUs are used for validation.