Describe the bug
Running QLoRA finetuning of granite-34b-code-base-gptq model fails with error:
TypeError: GPTBigCodeForCausalLM.forward() got an unexpected keyword argument 'cu_seq_lens_q'
Finetuning configuration:
{
"model_name_or_path": "/mnt/model/model/granite-34b-code-base-gptq-20241001T150701",
"training_data_path": "/mnt/scratch/dataset/alpaca_data.json",
"output_dir": "/mnt/output/model",
"save_model_dir": "/mnt/output/model",
"num_train_epochs": 1.0,
"per_device_train_batch_size": 1,
"per_device_eval_batch_size": 4,
"gradient_accumulation_steps": 4,
"save_strategy": "no",
"learning_rate": 1e-5,
"weight_decay": 0.0,
"lr_scheduler_type": "cosine",
"include_tokens_per_second": true,
"response_template": "\n### Response:",
"dataset_text_field": "output",
"use_flash_attn": true,
"peft_method": "lora",
"target_modules": ["all-linear"],
"auto_gptq": ["triton_v2"],
"torch_dtype": "float16",
"fp16": true,
"fast_kernels": [true, true, true],
"fused_lora": ["auto_gptq", true],
"padding_free": ["huggingface"]
}
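For reference, a rough sketch (illustrative only; `config.json` is assumed to be a local copy of the JSON above) that separates the standard transformers `TrainingArguments` keys in this configuration from the fms-hf-tuning / fms-acceleration specific ones (`peft_method`, `auto_gptq`, `fast_kernels`, `fused_lora`, `padding_free`, ...), the latter group being the acceleration features involved in this failure:

```python
import json
from dataclasses import fields
from transformers import TrainingArguments

# "config.json" is a hypothetical local copy of the finetuning configuration above.
with open("config.json") as f:
    cfg = json.load(f)

# Keys that TrainingArguments itself understands vs. tuning/acceleration extensions.
hf_keys = {f.name for f in fields(TrainingArguments)}
standard = sorted(k for k in cfg if k in hf_keys)
extensions = sorted(k for k in cfg if k not in hf_keys)

print("TrainingArguments keys:", standard)
print("Tuning/acceleration keys:", extensions)
```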
Full Pod log:
INFO - Ignoring unknown parameter in the quantization configuration: is_marlin_format.
INFO - `checkpoint_format` is missing from the quantization configuration and is automatically inferred to gptq
INFO - Ignoring unknown parameter in the quantization configuration: is_marlin_format.
INFO - `checkpoint_format` is missing from the quantization configuration and is automatically inferred to gptq
WARNING:modeling.py:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
WARNING:sft_trainer.py:PAD token set to default, to make it different from eos token
WARNING:modeling.py:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
WARNING:modeling.py:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
INFO - Compatibility: converting `checkpoint_format` from `gptq` to `gptq_v2`.
WARNING:sft_trainer.py:PAD token set to default, to make it different from eos token
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 520 examples [00:00, 110142.31 examples/s]
Map (num_proc=80): 0%| | 0/520 [00:00<?, ? examples/s]
Map (num_proc=80): 1%|▏ | 7/520 [00:00<00:13, 39.24 examples/s]
Map (num_proc=80): 8%|▊ | 42/520 [00:00<00:02, 170.18 examples/s]
Map (num_proc=80): 15%|█▍ | 77/520 [00:00<00:01, 229.55 examples/s]
Map (num_proc=80): 22%|██▏ | 112/520 [00:00<00:01, 269.21 examples/s]
Map (num_proc=80): 28%|██▊ | 147/520 [00:00<00:01, 243.08 examples/s]
Map (num_proc=80): 34%|███▎ | 175/520 [00:00<00:01, 240.79 examples/s]
Map (num_proc=80): 42%|████▏ | 217/520 [00:00<00:01, 275.45 examples/s]
Map (num_proc=80): 48%|████▊ | 252/520 [00:01<00:00, 279.71 examples/s]
Map (num_proc=80): 55%|█████▌ | 286/520 [00:01<00:00, 284.01 examples/s]
Map (num_proc=80): 61%|██████ | 316/520 [00:01<00:00, 268.06 examples/s]
Map (num_proc=80): 68%|██████▊ | 352/520 [00:01<00:00, 278.88 examples/s]
Map (num_proc=80): 73%|███████▎ | 382/520 [00:01<00:00, 276.99 examples/s]
Map (num_proc=80): 79%|███████▉ | 412/520 [00:01<00:00, 274.43 examples/s]
Map (num_proc=80): 85%|████████▌ | 442/520 [00:01<00:00, 273.51 examples/s]
Map (num_proc=80): 91%|█████████ | 472/520 [00:01<00:00, 258.29 examples/s]
Map (num_proc=80): 97%|█████████▋| 502/520 [00:01<00:00, 269.09 examples/s]
Map (num_proc=80): 100%|██████████| 520/520 [00:02<00:00, 246.16 examples/s]
/home/tuning/.local/lib/python3.12/site-packages/fms_acceleration_foak/models/granite.py:172: UserWarning: Granite Rules: activation is gelu_pytorch_tanh, thus disabling LoRA fused-op for MLP, since only SwiGLU is supported. This only affects quantized-peft.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/fms_acceleration_foak/models/granite.py:172: UserWarning: Granite Rules: activation is gelu_pytorch_tanh, thus disabling LoRA fused-op for MLP, since only SwiGLU is supported. This only affects quantized-peft.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/fms_acceleration_foak/models/llama.py:167: UserWarning: LLamaRules: activation is gelu_pytorch_tanh, thus disabling LoRA fused-op for MLP, since only SwiGLU is supported. This only affects quantized-peft.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/fms_acceleration_foak/models/mistral.py:160: UserWarning: Mistral rules: activation is gelu_pytorch_tanh, thus disabling LoRA fused-op for MLP, since only SwiGLU is supported. This only affects quantized-peft.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/fms_acceleration_foak/models/llama.py:167: UserWarning: LLamaRules: activation is gelu_pytorch_tanh, thus disabling LoRA fused-op for MLP, since only SwiGLU is supported. This only affects quantized-peft.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/fms_acceleration_foak/models/mistral.py:160: UserWarning: Mistral rules: activation is gelu_pytorch_tanh, thus disabling LoRA fused-op for MLP, since only SwiGLU is supported. This only affects quantized-peft.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/fms_acceleration_aadp/framework_plugin_padding_free.py:132: UserWarning: transformers version supports padding free natively in various models.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/fms_acceleration_aadp/framework_plugin_padding_free.py:132: UserWarning: transformers version supports padding free natively in various models.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/transformers/training_args.py:2058: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/transformers/training_args.py:2058: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/tuning/sft_trainer.py:364: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
trainer = SFTTrainer(
/home/tuning/.local/lib/python3.12/site-packages/tuning/sft_trainer.py:364: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
trainer = SFTTrainer(
Map: 0%| | 0/520 [00:00<?, ? examples/s]
Map: 100%|██████████| 520/520 [00:00<00:00, 3944.25 examples/s]
Map: 100%|██████████| 520/520 [00:00<00:00, 3873.46 examples/s]
/home/tuning/.local/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a processing_class with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `processing_class.padding_side = 'right'` to your code.
warnings.warn(
/home/tuning/.local/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a processing_class with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `processing_class.padding_side = 'right'` to your code.
warnings.warn(
0%| | 0/65 [00:00<?, ?it/s]ERROR:sft_trainer.py:Traceback (most recent call last):
File "/home/tuning/.local/lib/python3.12/site-packages/tuning/sft_trainer.py", line 676, in main
trainer, additional_train_info = train(
^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/tuning/sft_trainer.py", line 420, in train
trainer.train(resume_from_checkpoint)
File "/home/tuning/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2171, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/transformers/trainer.py", line 3675, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/transformers/trainer.py", line 3731, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/accelerate/utils/operations.py", line 820, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/accelerate/utils/operations.py", line 808, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 863, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/accelerate/utils/operations.py", line 820, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/accelerate/utils/operations.py", line 808, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/peft/peft_model.py", line 1644, in forward
return self.base_model(
^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
return self.model.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: GPTBigCodeForCausalLM.forward() got an unexpected keyword argument 'cu_seq_lens_q'
Platform
fms-hf-tuning image: quay.io/modh/fms-hf-tuning:v2.6.0
Trained model: granite-34b-code-base-gptq-20241001T150701
Sample Code
Expected behavior
Training of the model passes successfully.
Observed behavior
Training failed, see description for logs.
Additional context
Add any other context about the problem here.
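My reading of the failure (an assumption on my part, not an official diagnosis): the padding-free plugin's data collation forwards FlashAttention kwargs such as `cu_seq_lens_q`/`cu_seq_lens_k` to the model, while `GPTBigCodeForCausalLM.forward()` in this transformers version neither declares those parameters nor accepts a `**kwargs` catch-all. A signature-only sketch that checks this without loading any weights:

```python
import inspect
from transformers import GPTBigCodeForCausalLM

# Extra kwargs typically produced by padding-free / FlashAttention data collation.
PADDING_FREE_KWARGS = ("cu_seq_lens_q", "cu_seq_lens_k", "max_length_q", "max_length_k")

sig = inspect.signature(GPTBigCodeForCausalLM.forward)
has_var_kwargs = any(
    p.kind is inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()
)
unsupported = [k for k in PADDING_FREE_KWARGS if k not in sig.parameters]

if unsupported and not has_var_kwargs:
    # Passing any of these to forward() raises the TypeError seen in the log above.
    print("GPTBigCodeForCausalLM.forward() does not accept:", unsupported)
```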