WIP [fine-tuning]: Gather more results #608

Closed
albertoperdomo2 wants to merge 63 commits into openshift-psap:main from albertoperdomo2:fine-tuning-blog

Conversation

@albertoperdomo2
Collaborator

No description provided.


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1719

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1720

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 09 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1720

🔴 Test of 'rhoai test export_artifacts /logs/artifacts' failed after 00 hours 00 minutes 06 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test export_artifacts /logs/artifacts
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1721

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 09 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: metal gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: metal
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1721

🔴 Test of 'rhoai test export_artifacts /logs/artifacts' failed after 00 hours 00 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test export_artifacts /logs/artifacts
PR_POSITIONAL_ARGS: metal gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: metal
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1722

🟢 Test of 'rhoai test test_ci' succeeded after 07 hours 24 minutes 44 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1723

🔴 Test of 'rhoai test test_ci' failed after 00 hours 03 minutes 19 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/mistral-7b-v0.3-gptq', 'storage_dir': '/model', 'name': 'mistral-7b-v0.3-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/granite-8b-code-instruct-gptq', 'storage_dir': '/model', 'name': 'granite-8b-code-instruct-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/002__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/allam-beta-13b-chat-gptq', 'storage_dir': '/model', 'name': 'allam-beta-13b-chat-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/003__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/granite-34b-code-base-gptq', 'storage_dir': '/model', 'name': 'granite-34b-code-base-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/mixtral-8x7b-instruct-v0.1-gptq', 'storage_dir': '/model', 'name': 'mixtral-8x7b-instruct-v0.1-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/002__plots/FAILURE | An error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/000_test_ci/003__prom_plots/FAILURE | An error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 5, 2024

Jenkins Job #1724

🔴 Test of 'rhoai test test_ci' failed after 01 hours 11 minutes 23 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'pod_count': 1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'qlora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'pod_count': 1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'qlora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
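The `CalledProcessError ... returned non-zero exit status 2` entries above all come from the same place: the `run()` helper in `projects/core/library/run.py` (quoted in the traceback) shells out to `./run_toolbox.py` and raises when the command exits non-zero. A simplified sketch of that pattern, assuming a POSIX shell — the real helper also handles environment setup and artifact directories, which are omitted here:

```python
import subprocess

def run(command):
    # Simplified sketch of TOPSAIL's run helper: it executes the toolbox
    # command through the shell, and check=True makes subprocess.run raise
    # CalledProcessError on a non-zero exit status -- the exception
    # recorded in the FAILURE files above.
    return subprocess.run(command, shell=True, check=True)

try:
    run("exit 2")
except subprocess.CalledProcessError as err:
    # err.returncode carries the toolbox exit status (2 in these logs)
    print(f"returned non-zero exit status {err.returncode}.")
```

This is why every failed toolbox step surfaces both a `--> 2` marker (the exit status) and a `CalledProcessError` traceback in the artifacts.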


topsail-bot Bot commented Dec 6, 2024

Jenkins Job #1725

🟢 Test of 'rhoai test test_ci' succeeded after 10 hours 14 minutes 41 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 7, 2024

Jenkins Job #1726

🔴 Test of 'rhoai test test_ci' failed after 02 hours 23 minutes 48 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
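The tracebacks reference `_dict_to_run_toolbox_args(kwargs)` in `run.py`, but its implementation is not shown in these logs. A hypothetical minimal re-creation, for illustration only, that would produce the `--extra="{...}"` form visible in the failed commands:

```python
def dict_to_run_toolbox_args(kwargs):
    # Hypothetical re-creation of _dict_to_run_toolbox_args (the real
    # TOPSAIL helper may differ): render each keyword argument as
    # --key="value", so a dict-valued 'extra' kwarg becomes the
    # --extra="{...}" argument seen in the failure indicators above.
    return " ".join(f'--{key}="{value}"' for key, value in kwargs.items())

extra = {"name": "fine-tuning", "gpu": 4}
print(dict_to_run_toolbox_args({"extra": extra}))
# → --extra="{'name': 'fine-tuning', 'gpu': 4}"
```

Under this reading, the whole `hyper_parameters` dictionary travels as a single quoted `--extra` argument on the `./run_toolbox.py from_config` command line, which is consistent with the commands echoed in each FAILURE entry.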


topsail-bot Bot commented Dec 9, 2024

Jenkins Job #1727

🔴 Test of 'rhoai test test_ci' failed after 02 hours 23 minutes 41 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 9, 2024

Jenkins Job #1728

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 07 minutes 54 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai/000__rhods__deploy_ods/FAILURE | [000__rhods__deploy_ods] ./run_toolbox.py from_config rhods deploy_ods --extra={} --> 2
/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai" ./run_toolbox.py from_config rhods deploy_ods --extra="{}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/rhods/library/prepare_rhoai.py", line 58, in install
    run.run_toolbox_from_config("rhods", "deploy_ods")
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 9, 2024

Jenkins Job #1729

🔴 Test of 'rhoai test test_ci' failed after 01 hours 48 minutes 31 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 9, 2024

Jenkins Job #1731

🔴 Test of 'rhoai test test_ci' failed after 01 hours 52 minutes 36 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 10, 2024

Jenkins Job #1732

🟢 Test of 'rhoai test test_ci' succeeded after 03 hours 54 minutes 37 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 10, 2024

Jenkins Job #1733

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 07 minutes 13 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 10, 2024

Jenkins Job #1734

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 09 minutes 15 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 10, 2024

Jenkins Job #1735

🔴 Test of 'rhoai test test_ci' failed after 08 hours 07 minutes 49 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 11, 2024

Jenkins Job #1736

🔴 Test of 'rhoai test test_ci' failed after 01 hours 56 minutes 16 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 11, 2024

Jenkins Job #1737

🟢 Test of 'rhoai test test_ci' succeeded after 03 hours 59 minutes 37 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 11, 2024

Jenkins Job #1738

🔴 Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
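The `CalledProcessError` in the traceback above is the standard behavior of Python's `subprocess.run` when invoked with `check=True` on a command that exits non-zero (here, the toolbox command wrapped in a `set -o errexit; ...` shell prefix). A minimal reproduction of that failure mode, using a stand-in command rather than TOPSAIL's actual helper:

```python
import subprocess

# Stand-in for the wrapped toolbox command: the strict-mode shell prefix
# mirrors the one in the log, and `exit 2` mimics the job's exit status.
cmd = "set -o errexit; set -o nounset; exit 2"

try:
    # check=True makes subprocess.run raise instead of returning a
    # CompletedProcess with a non-zero returncode.
    subprocess.run(cmd, shell=True, check=True)
except subprocess.CalledProcessError as exc:
    print(f"returned non-zero exit status {exc.returncode}.")
```

This is why the log shows both the per-step `FAILURE` file and the Python traceback: the shell command fails first, and the wrapper surfaces it as an exception.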


topsail-bot Bot commented Dec 12, 2024

Jenkins Job #1739

🔴 Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 04 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 13, 2024

Jenkins Job #1740

🟢 Test of 'rhoai test test_ci' succeeded after 10 hours 09 minutes 20 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 16, 2024

Jenkins Job #1741

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: cluster_rhoai_2x8h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_2x8h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 17, 2024

Jenkins Job #1744

🟢 Test of 'rhoai test test_ci' succeeded after 04 hours 25 minutes 49 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_2x8h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_2x8h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 18, 2024

Jenkins Job #1746

🔴 Test of 'rhoai test test_ci' failed after 00 hours 02 minutes 48 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 402, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 281, in _run_test_and_visualize
    if not prepare_rhoai_mod.is_rhoai_installed():
  File "/opt/topsail/src/projects/rhods/library/prepare_rhoai.py", line 40, in is_rhoai_installed
    installed_csv_cmd = run.run(f"oc get csv -loperators.coreos.com/{RHODS_OPERATOR_MANIFEST_NAME}.{RHODS_NAMESPACE}"
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

[Test ran on the internal Perflab CI]
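This traceback shows the `is_rhoai_installed` guard itself crashing: it shells out to `oc get csv -l...` through `run.run`, so an unreachable cluster or missing kubeconfig raises before the test proper starts. A hypothetical, more tolerant variant of that check (the function name and signature are illustrative, not TOPSAIL's actual API) would treat an `oc` failure as "not installed" instead of propagating the exception:

```python
import subprocess

def is_csv_installed(label: str, oc: str = "oc") -> bool:
    """Hypothetical sketch of an operator-CSV presence check.

    Returns True only if `oc get csv -l<label> -oname` exits 0 AND
    prints at least one ClusterServiceVersion name.
    """
    proc = subprocess.run(
        [oc, "get", "csv", f"-l{label}", "-oname"],
        capture_output=True,
        text=True,
    )
    # If `oc` itself fails (unreachable API server, bad kubeconfig),
    # report "not installed" rather than raising -- the real helper
    # propagated the error, which produced the traceback above.
    return proc.returncode == 0 and bool(proc.stdout.strip())
```

With this shape, a cluster connectivity problem surfaces as a clean "RHOAI not installed" decision that the caller can report, rather than a `CalledProcessError` in the prepare step.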


topsail-bot Bot commented Dec 19, 2024

Jenkins Job #1747

🔴 Test of 'rhoai test test_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 19, 2024

Jenkins Job #1748

🔴 Test of 'rhoai test test_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


kpouget commented Dec 10, 2025

outdated, closing

openshift-ci Bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Mar 18, 2026

openshift-ci Bot commented Mar 18, 2026

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
