WIP [fine-tuning]: Gather more results #608

Closed
albertoperdomo2 wants to merge 63 commits into openshift-psap:main from albertoperdomo2:fine-tuning-blog

Conversation

@albertoperdomo2
Collaborator

No description provided.


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1719

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1720

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 09 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1720

🔴 Test of 'rhoai test export_artifacts /logs/artifacts' failed after 00 hours 00 minutes 06 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test export_artifacts /logs/artifacts
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1721

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 09 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: metal gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: metal
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1721

🔴 Test of 'rhoai test export_artifacts /logs/artifacts' failed after 00 hours 00 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test export_artifacts /logs/artifacts
PR_POSITIONAL_ARGS: metal gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: metal
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1722

🟢 Test of 'rhoai test test_ci' succeeded after 07 hours 24 minutes 44 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 4, 2024

Jenkins Job #1723

🔴 Test of 'rhoai test test_ci' failed after 00 hours 03 minutes 19 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/mistral-7b-v0.3-gptq', 'storage_dir': '/model', 'name': 'mistral-7b-v0.3-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/granite-8b-code-instruct-gptq', 'storage_dir': '/model', 'name': 'granite-8b-code-instruct-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/002__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/allam-beta-13b-chat-gptq', 'storage_dir': '/model', 'name': 'allam-beta-13b-chat-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/003__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/granite-34b-code-base-gptq', 'storage_dir': '/model', 'name': 'granite-34b-code-base-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/mixtral-8x7b-instruct-v0.1-gptq', 'storage_dir': '/model', 'name': 'mixtral-8x7b-instruct-v0.1-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/002__plots/FAILURE | An error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/000_test_ci/003__prom_plots/FAILURE | An error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 5, 2024

Jenkins Job #1724

🔴 Test of 'rhoai test test_ci' failed after 01 hours 11 minutes 23 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'pod_count': 1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'qlora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'pod_count': 1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'qlora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
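The `CalledProcessError ... returned non-zero exit status 2` entries above all come from the same place: the `run()` helper in `projects/core/library/run.py` (quoted in the traceback) shells out to `./run_toolbox.py` and raises when the command exits non-zero. A simplified sketch of that pattern, assuming a POSIX shell — the real helper also handles environment setup and artifact directories, which are omitted here:

```python
import subprocess

def run(command):
    # Simplified sketch of TOPSAIL's run helper: it executes the toolbox
    # command through the shell, and check=True makes subprocess.run raise
    # CalledProcessError on a non-zero exit status -- the exception
    # recorded in the FAILURE files above.
    return subprocess.run(command, shell=True, check=True)

try:
    run("exit 2")
except subprocess.CalledProcessError as err:
    # err.returncode carries the toolbox exit status (2 in these logs)
    print(f"returned non-zero exit status {err.returncode}.")
```

This is why every failed toolbox step surfaces both a `--> 2` marker (the exit status) and a `CalledProcessError` traceback in the artifacts.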


topsail-bot Bot commented Dec 6, 2024

Jenkins Job #1725

🟢 Test of 'rhoai test test_ci' succeeded after 10 hours 14 minutes 41 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 7, 2024

Jenkins Job #1726

🔴 Test of 'rhoai test test_ci' failed after 02 hours 23 minutes 48 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
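The tracebacks reference `_dict_to_run_toolbox_args(kwargs)` in `run.py`, but its implementation is not shown in these logs. A hypothetical minimal re-creation, for illustration only, that would produce the `--extra="{...}"` form visible in the failed commands:

```python
def dict_to_run_toolbox_args(kwargs):
    # Hypothetical re-creation of _dict_to_run_toolbox_args (the real
    # TOPSAIL helper may differ): render each keyword argument as
    # --key="value", so a dict-valued 'extra' kwarg becomes the
    # --extra="{...}" argument seen in the failure indicators above.
    return " ".join(f'--{key}="{value}"' for key, value in kwargs.items())

extra = {"name": "fine-tuning", "gpu": 4}
print(dict_to_run_toolbox_args({"extra": extra}))
# → --extra="{'name': 'fine-tuning', 'gpu': 4}"
```

Under this reading, the whole `hyper_parameters` dictionary travels as a single quoted `--extra` argument on the `./run_toolbox.py from_config` command line, which is consistent with the commands echoed in each FAILURE entry.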


topsail-bot Bot commented Dec 9, 2024

Jenkins Job #1727

🔴 Test of 'rhoai test test_ci' failed after 02 hours 23 minutes 41 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 9, 2024

Jenkins Job #1728

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 07 minutes 54 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai/000__rhods__deploy_ods/FAILURE | [000__rhods__deploy_ods] ./run_toolbox.py from_config rhods deploy_ods --extra={} --> 2
/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai" ./run_toolbox.py from_config rhods deploy_ods --extra="{}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/rhods/library/prepare_rhoai.py", line 58, in install
    run.run_toolbox_from_config("rhods", "deploy_ods")
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 9, 2024

Jenkins Job #1729

🔴 Test of 'rhoai test test_ci' failed after 01 hours 48 minutes 31 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 9, 2024

Jenkins Job #1731

🔴 Test of 'rhoai test test_ci' failed after 01 hours 52 minutes 36 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 10, 2024

Jenkins Job #1732

🟢 Test of 'rhoai test test_ci' succeeded after 03 hours 54 minutes 37 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 10, 2024

Jenkins Job #1733

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 07 minutes 13 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 10, 2024

Jenkins Job #1734

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 09 minutes 15 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 10, 2024

Jenkins Job #1735

🔴 Test of 'rhoai test test_ci' failed after 08 hours 07 minutes 49 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 11, 2024

Jenkins Job #1736

🔴 Test of 'rhoai test test_ci' failed after 01 hours 56 minutes 16 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 11, 2024

Jenkins Job #1737

🟢 Test of 'rhoai test test_ci' succeeded after 03 hours 59 minutes 37 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 11, 2024

Jenkins Job #1738

🔴 Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
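The `CalledProcessError` in the traceback above is the standard behavior of Python's `subprocess.run` when invoked with `check=True` on a command that exits non-zero (here, the toolbox command wrapped in a `set -o errexit; ...` shell prefix). A minimal reproduction of that failure mode, using a stand-in command rather than TOPSAIL's actual helper:

```python
import subprocess

# Stand-in for the wrapped toolbox command: the strict-mode shell prefix
# mirrors the one in the log, and `exit 2` mimics the job's exit status.
cmd = "set -o errexit; set -o nounset; exit 2"

try:
    # check=True makes subprocess.run raise instead of returning a
    # CompletedProcess with a non-zero returncode.
    subprocess.run(cmd, shell=True, check=True)
except subprocess.CalledProcessError as exc:
    print(f"returned non-zero exit status {exc.returncode}.")
```

This is why the log shows both the per-step `FAILURE` file and the Python traceback: the shell command fails first, and the wrapper surfaces it as an exception.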


topsail-bot Bot commented Dec 12, 2024

Jenkins Job #1739

🔴 Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 04 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 13, 2024

Jenkins Job #1740

🟢 Test of 'rhoai test test_ci' succeeded after 10 hours 09 minutes 20 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 16, 2024

Jenkins Job #1741

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: cluster_rhoai_2x8h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_2x8h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 17, 2024

Jenkins Job #1744

🟢 Test of 'rhoai test test_ci' succeeded after 04 hours 25 minutes 49 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_2x8h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_2x8h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 18, 2024

Jenkins Job #1746

🔴 Test of 'rhoai test test_ci' failed after 00 hours 02 minutes 48 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 402, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 281, in _run_test_and_visualize
    if not prepare_rhoai_mod.is_rhoai_installed():
  File "/opt/topsail/src/projects/rhods/library/prepare_rhoai.py", line 40, in is_rhoai_installed
    installed_csv_cmd = run.run(f"oc get csv -loperators.coreos.com/{RHODS_OPERATOR_MANIFEST_NAME}.{RHODS_NAMESPACE}"
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

[Test ran on the internal Perflab CI]
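This traceback shows the `is_rhoai_installed` guard itself crashing: it shells out to `oc get csv -l...` through `run.run`, so an unreachable cluster or missing kubeconfig raises before the test proper starts. A hypothetical, more tolerant variant of that check (the function name and signature are illustrative, not TOPSAIL's actual API) would treat an `oc` failure as "not installed" instead of propagating the exception:

```python
import subprocess

def is_csv_installed(label: str, oc: str = "oc") -> bool:
    """Hypothetical sketch of an operator-CSV presence check.

    Returns True only if `oc get csv -l<label> -oname` exits 0 AND
    prints at least one ClusterServiceVersion name.
    """
    proc = subprocess.run(
        [oc, "get", "csv", f"-l{label}", "-oname"],
        capture_output=True,
        text=True,
    )
    # If `oc` itself fails (unreachable API server, bad kubeconfig),
    # report "not installed" rather than raising -- the real helper
    # propagated the error, which produced the traceback above.
    return proc.returncode == 0 and bool(proc.stdout.strip())
```

With this shape, a cluster connectivity problem surfaces as a clean "RHOAI not installed" decision that the caller can report, rather than a `CalledProcessError` in the prepare step.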


topsail-bot Bot commented Dec 19, 2024

Jenkins Job #1747

🔴 Test of 'rhoai test test_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


topsail-bot Bot commented Dec 19, 2024

Jenkins Job #1748

🔴 Test of 'rhoai test test_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]


kpouget commented Dec 10, 2025

outdated, closing

openshift-ci Bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Mar 18, 2026

openshift-ci Bot commented Mar 18, 2026

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
