[ROCM] Raise device memory cap for parallel GPU execution to 5GB

5487554
Open

[ROCM] Raise device memory cap for parallel GPU execution to 5GB #2840

ROCm Repo Management API / Jenkins failed Sep 25, 2025 in 5h 25m 34s

Test required TF and ROCm versions/Test required TF and ROCm versions/Run tests: error in 'error' step

Test required TF and ROCm versions / Test required TF and ROCm versions / Test required TF and ROCm versions / Run tests / Shell Script

Error in sh step, with arguments docker exec 0963e7dc7a9d2bdb0aefa0c239eb642045e6b9387dc4c93be66c4c9fa1b1ce85 bazel --bazelrc=tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/rocm.bazelrc test --local_ram_resources=60000 --local_cpu_resources=32 --jobs=64 --verbose_failures --disk_cache=/tf/cache --config=sigbuild_local_cache --config=rocm --config=nonpip_multi_gpu --repo_env=USE_PYWRAP_RULES=True --action_env=TF_PYTHON_VERSION=3.10 --test_env=TF_TESTS_PER_GPU=1 --test_env=TF_GPU_COUNT=2 --local_test_jobs=2 --repo_env=TF_ROCM_AMDGPU_TARGETS=gfx90a.

script returned exit code 1
Build log
[2025-09-25T21:36:47.774Z] + docker exec 0963e7dc7a9d2bdb0aefa0c239eb642045e6b9387dc4c93be66c4c9fa1b1ce85 bazel --bazelrc=tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/rocm.bazelrc test --local_ram_resources=60000 --local_cpu_resources=32 --jobs=64 --verbose_failures --disk_cache=/tf/cache --config=sigbuild_local_cache --config=rocm --config=nonpip_multi_gpu --repo_env=USE_PYWRAP_RULES=True --action_env=TF_PYTHON_VERSION=3.10 --test_env=TF_TESTS_PER_GPU=1 --test_env=TF_GPU_COUNT=2 --local_test_jobs=2 --repo_env=TF_ROCM_AMDGPU_TARGETS=gfx90a
[2025-09-25T21:36:47.774Z] 2025/09/25 21:36:47 Downloading https://releases.bazel.build/6.5.0/release/bazel-6.5.0-linux-x86_64...
[2025-09-25T21:36:48.751Z] Extracting Bazel installation...
[2025-09-25T21:36:49.848Z] Starting local Bazel server and connecting to it...
[2025-09-25T21:36:51.398Z] INFO: Invocation ID: 515c4737-b3e8-455c-ab42-92e2b71a3091
[2025-09-25T21:36:51.398Z] INFO: Reading 'startup' options from /tf/tensorflow/.bazelrc: --windows_enable_symlinks
[2025-09-25T21:36:51.398Z] INFO: Options provided by the client:
[2025-09-25T21:36:51.398Z]   Inherited 'common' options: --isatty=0 --terminal_columns=80
[2025-09-25T21:36:51.398Z] INFO: Reading rc options for 'test' from /tf/tensorflow/.bazelrc:
[2025-09-25T21:36:51.398Z]   Inherited 'common' options: --experimental_repo_remote_exec
[2025-09-25T21:36:51.398Z] INFO: Reading rc options for 'test' from /tf/tensorflow/.bazelrc:
[2025-09-25T21:36:51.398Z]   Inherited 'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility
[2025-09-25T21:36:51.398Z] INFO: Reading rc options for 'test' from /tf/tensorflow/tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/gpu.bazelrc:
[2025-09-25T21:36:51.398Z]   Inherited 'build' options: --action_env=CACHEBUSTER=565341047
[2025-09-25T21:36:51.398Z] INFO: Reading rc options for 'test' from /tf/tensorflow/.bazelrc:
[2025-09-25T21:36:51.398Z]   'test' options: --test_env=GTEST_INSTALL_FAILURE_SIGNAL_HANDLER=1
[2025-09-25T21:36:51.398Z] INFO: Reading rc options for 'test' from /tf/tensorflow/tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/gpu.bazelrc:
[2025-09-25T21:36:51.398Z]   'test' options: --test_output=errors --test_timeout=920,2400,7200,9600 --local_test_jobs=4 --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:short_logs in file /tf/tensorflow/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:v2 in file /tf/tensorflow/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:sigbuild_local_cache in file /tf/tensorflow/tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/gpu.bazelrc: --disk_cache=/tf/cache
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:rocm in file /tf/tensorflow/.bazelrc: --config=rocm_base --config=release_cpu_linux_base --action_env=CLANG_COMPILER_PATH=/usr/lib/llvm-18/bin/clang --action_env=TF_ROCM_CLANG=1 --linkopt=-fuse-ld=lld --host_linkopt=-fuse-ld=lld --linkopt=-Wl,--undefined-version --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-unused-result
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:rocm_base in file /tf/tensorflow/.bazelrc: --crosstool_top=@local_config_rocm//crosstool:toolchain --define=using_rocm_hipcc=true --define=tensorflow_mkldnn_contraction_kernel=0 --define=xnn_enable_avxvnniint8=false --define=xnn_enable_avx512fp16=false --repo_env TF_NEED_ROCM=1 --config=no_tfrt
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:no_tfrt in file /tf/tensorflow/.bazelrc: --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/ir,tensorflow/compiler/mlir/tfrt/ir/mlrt,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/ifrt,tensorflow/compiler/mlir/tfrt/tests/mlrt,tensorflow/compiler/mlir/tfrt/tests/ir,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_jitrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/compiler/mlir/tfrt/transforms/mlrt,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/runtime_fallback/test,tensorflow/core/runtime_fallback/test/gpu,tensorflow/core/runtime_fallback/test/saved_model,tensorflow/core/runtime_fallback/test/testdata,tensorflow/core/tfrt/stubs,tensorflow/core/tfrt/tfrt_session,tensorflow/core/tfrt/mlrt,tensorflow/core/tfrt/mlrt/attribute,tensorflow/core/tfrt/mlrt/kernel,tensorflow/core/tfrt/mlrt/bytecode,tensorflow/core/tfrt/mlrt/interpreter,tensorflow/compiler/mlir/tfrt/translate/mlrt,tensorflow/compiler/mlir/tfrt/translate/mlrt/testdata,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/graph_executor,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils,tensorflow/core/tfrt/utils/debug,tensorflow/core/tfrt/saved_model/python,tensorflow/core/tfrt/graph_executor/python,tensorflow/core/tfrt/saved_model/utils
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:release_cpu_linux_base in file /tf/tensorflow/.bazelrc: --repo_env=CC=/usr/lib/llvm-18/bin/clang --repo_env=BAZEL_COMPILER=/usr/lib/llvm-18/bin/clang --action_env=CLANG_COMPILER_PATH=/usr/lib/llvm-18/bin/clang --linkopt=-fuse-ld=lld
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition test:rocm in file /tf/tensorflow/tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/gpu.bazelrc: --test_env=HSA_TOOLS_LIB=libroctracer64.so --test_sharding_strategy=disabled --action_env=TF_ENABLE_ONEDNN_OPTS=0 --action_env=OPENBLAS_CORETYPE=Haswell
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition test:nonpip_multi_gpu in file /tf/tensorflow/tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/gpu.bazelrc: --config=nonpip_filters_multi_gpu -- //tensorflow/core/nccl:nccl_manager_test_2gpu //tensorflow/python/distribute/integration_test:mwms_peer_failure_test_2gpu //tensorflow/python/distribute:checkpoint_utils_test_2gpu //tensorflow/python/distribute:checkpointing_test_2gpu //tensorflow/python/distribute:collective_all_reduce_strategy_test_xla_2gpu //tensorflow/python/distribute:custom_training_loop_gradient_test_2gpu //tensorflow/python/distribute:custom_training_loop_input_test_2gpu //tensorflow/python/distribute:distribute_utils_test_2gpu //tensorflow/python/distribute:input_lib_test_2gpu //tensorflow/python/distribute:input_lib_type_spec_test_2gpu //tensorflow/python/distribute:metrics_v1_test_2gpu //tensorflow/python/distribute:mirrored_variable_test_2gpu //tensorflow/python/distribute:parameter_server_strategy_test_2gpu //tensorflow/python/distribute:ps_values_test_2gpu //tensorflow/python/distribute:random_generator_test_2gpu //tensorflow/python/distribute:test_util_test_2gpu //tensorflow/python/distribute:tf_function_test_2gpu //tensorflow/python/distribute:vars_test_2gpu //tensorflow/python/distribute:warm_starting_util_test_2gpu //tensorflow/python/training:saver_test_2gpu
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition test:nonpip_filters_multi_gpu in file /tf/tensorflow/tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/gpu.bazelrc: --test_tag_filters=-no_gpu,-cuda-only --build_tag_filters=-no_gpu,-cuda-only --test_lang_filters=py --flaky_test_attempts=2 --test_size_filters=small,medium,large --test_env=TF_PER_DEVICE_MEMORY_LIMIT_MB=2048
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:linux in file /tf/tensorflow/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --linkopt=-Wl,--undefined-version --host_linkopt=-Wl,--undefined-version --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
[2025-09-25T21:36:51.398Z] INFO: Found applicable config definition build:dynamic_kernels in file /tf/tensorflow/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
[2025-09-25T21:36:51.398Z] Loading: 
[2025-09-25T21:36:52.335Z] Loading: 
[2025-09-25T21:36:52.767Z] DEBUG: /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/external/local_tsl/third_party/py/python_repo.bzl:83:14: !!!Using pywrap rules instead of directly creating .so objects!!!
[2025-09-25T21:36:52.767Z] DEBUG: /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/external/local_tsl/third_party/py/python_repo.bzl:88:10: 
[2025-09-25T21:36:52.767Z] =============================
[2025-09-25T21:36:52.767Z] Hermetic Python configuration:
[2025-09-25T21:36:52.767Z] Version: "3.10"
[2025-09-25T21:36:52.767Z] Kind: ""
[2025-09-25T21:36:52.767Z] Interpreter: "default" (provided by rules_python)
[2025-09-25T21:36:52.767Z] Requirements_lock label: "@python_version_repo//:requirements_lock_3_10.txt"
[2025-09-25T21:36:52.767Z] =====================================
[2025-09-25T21:36:53.771Z] Loading: 
[2025-09-25T21:36:54.833Z] Loading: 
[2025-09-25T21:36:55.917Z] Loading: 
[2025-09-25T21:36:57.570Z] Loading: 
[2025-09-25T21:36:58.529Z] Loading: 
[2025-09-25T21:37:20.073Z] Loading: 
[2025-09-25T21:37:20.543Z] Loading: 
[2025-09-25T21:37:21.544Z] Loading: 
[2025-09-25T21:37:22.458Z] Loading: 
[2025-09-25T21:37:22.994Z] Loading: 
[2025-09-25T21:37:22.994Z] Loading: 0 packages loaded
[2025-09-25T21:37:23.510Z] Analyzing: 17 targets (4 packages loaded, 0 targets configured)
[2025-09-25T21:37:24.553Z] Analyzing: 17 targets (46 packages loaded, 10 targets configured)
[2025-09-25T21:37:25.725Z] Analyzing: 17 targets (46 packages loaded, 10 targets configured)
[2025-09-25T21:37:27.383Z] Analyzing: 17 targets (46 packages loaded, 10 targets configured)
[2025-09-25T21:37:29.544Z] Analyzing: 17 targets (46 packages loaded, 10 targets configured)
[2025-09-25T21:37:31.130Z] Analyzing: 17 targets (83 packages loaded, 306 targets configured)
[2025-09-25T21:37:32.767Z] Analyzing: 17 targets (315 packages loaded, 12732 targets configured)
[2025-09-25T21:37:33.285Z] Analyzing: 17 targets (574 packages loaded, 25420 targets configured)
[2025-09-25T21:37:33.804Z] ERROR: /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/external/local_xla/xla/stream_executor/rocm/BUILD:410:11: in cc_library rule @local_xla//xla/stream_executor/rocm:hipfft_if_static: target '@local_config_rocm//rocm:hipfft' is not visible from target '@local_xla//xla/stream_executor/rocm:hipfft_if_static'. Check the visibility declaration of the former target if you think the dependency is legitimate
[2025-09-25T21:37:33.804Z] ERROR: /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/external/local_xla/xla/stream_executor/rocm/BUILD:410:11: Analysis of target '@local_xla//xla/stream_executor/rocm:hipfft_if_static' failed
[2025-09-25T21:37:33.804Z] INFO: Repository sobol_data instantiated at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/WORKSPACE:64:14: in <toplevel>
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:934:28: in workspace
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:92:15: in _initialize_third_party
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/sobol_data/workspace.bzl:6:20: in repo
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:136:21: in tf_http_archive
[2025-09-25T21:37:33.804Z] Repository rule _tf_http_archive defined at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:89:35: in <toplevel>
[2025-09-25T21:37:33.804Z] INFO: Repository XNNPACK instantiated at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/WORKSPACE:64:14: in <toplevel>
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:941:21: in workspace
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:155:20: in _tf_repositories
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:136:21: in tf_http_archive
[2025-09-25T21:37:33.804Z] Repository rule _tf_http_archive defined at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:89:35: in <toplevel>
[2025-09-25T21:37:33.804Z] INFO: Repository stablehlo instantiated at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/WORKSPACE:64:14: in <toplevel>
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:934:28: in workspace
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:93:14: in _initialize_third_party
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/stablehlo/workspace.bzl:11:20: in repo
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:136:21: in tf_http_archive
[2025-09-25T21:37:33.804Z] Repository rule _tf_http_archive defined at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:89:35: in <toplevel>
[2025-09-25T21:37:33.804Z] INFO: Repository boringssl instantiated at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/WORKSPACE:64:14: in <toplevel>
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:941:21: in workspace
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:488:20: in _tf_repositories
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:136:21: in tf_http_archive
[2025-09-25T21:37:33.804Z] Repository rule _tf_http_archive defined at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:89:35: in <toplevel>
[2025-09-25T21:37:33.804Z] INFO: Repository curl instantiated at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/WORKSPACE:64:14: in <toplevel>
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:941:21: in workspace
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/tensorflow/workspace2.bzl:429:20: in _tf_repositories
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:136:21: in tf_http_archive
[2025-09-25T21:37:33.804Z] Repository rule _tf_http_archive defined at:
[2025-09-25T21:37:33.804Z]   /tf/tensorflow/third_party/repo.bzl:89:35: in <toplevel>
[2025-09-25T21:37:33.804Z] ERROR: Analysis of target '//tensorflow/python/distribute:metrics_v1_test_2gpu' failed; build aborted: 
[2025-09-25T21:37:33.804Z] INFO: Elapsed time: 44.977s
[2025-09-25T21:37:33.804Z] INFO: 0 processes.
[2025-09-25T21:37:33.804Z] FAILED: Build did NOT complete successfully (611 packages loaded, 26449 targets configured)
[2025-09-25T21:37:33.804Z] ERROR: Couldn't start the build. Unable to run tests
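
The build log above aborts during analysis on a Bazel visibility error, and the relevant lines are easy to miss among the INFO output. The following is a minimal Python sketch, not part of the CI tooling, for pulling the ERROR lines out of a saved copy of such a console log and extracting the two labels involved in a visibility violation. The function names (`error_lines`, `visibility_violations`) are hypothetical, and the regular expressions assume the timestamp prefix and the error wording seen in this particular log.

```python
import re

# Timestamp prefix seen on each console line in this log,
# e.g. "[2025-09-25T21:37:33.804Z] ". Assumed format, taken from the log above.
TIMESTAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\] ?")

# Wording of the visibility error as it appears in the log above.
VISIBILITY = re.compile(
    r"target '(?P<dep>[^']+)' is not visible from target '(?P<user>[^']+)'"
)


def error_lines(log_text):
    """Return the ERROR lines of a log with the timestamp prefix stripped."""
    stripped = (TIMESTAMP.sub("", line) for line in log_text.splitlines())
    return [line for line in stripped if line.startswith("ERROR:")]


def visibility_violations(log_text):
    """Return (dependency, dependent) label pairs, one per visibility error."""
    pairs = []
    for line in error_lines(log_text):
        match = VISIBILITY.search(line)
        if match:
            pairs.append((match.group("dep"), match.group("user")))
    return pairs


# Three lines copied from the build log above.
SAMPLE = "\n".join([
    "[2025-09-25T21:37:33.285Z] Analyzing: 17 targets (574 packages loaded, 25420 targets configured)",
    "[2025-09-25T21:37:33.804Z] ERROR: /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/external/local_xla/xla/stream_executor/rocm/BUILD:410:11: in cc_library rule @local_xla//xla/stream_executor/rocm:hipfft_if_static: target '@local_config_rocm//rocm:hipfft' is not visible from target '@local_xla//xla/stream_executor/rocm:hipfft_if_static'. Check the visibility declaration of the former target if you think the dependency is legitimate",
    "[2025-09-25T21:37:33.804Z] ERROR: Couldn't start the build. Unable to run tests",
])

if __name__ == "__main__":
    for dep, user in visibility_violations(SAMPLE):
        print(f"{user} depends on {dep}, which is not visible to it")
```

Applied to this run, the sketch would single out `@local_config_rocm//rocm:hipfft` as the dependency and `@local_xla//xla/stream_executor/rocm:hipfft_if_static` as the rule that cannot see it. The error occurs before any test executes.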

Test required TF and ROCm versions / Test required TF and ROCm versions / Test required TF and ROCm versions / Run tests / Error signal

Error in error step, with arguments Error detected when building or testing TensorFlow.

Error detected when building or testing TensorFlow

Details

  • Test required TF and ROCm versions (5 hr 25 min)
    • Test required TF and ROCm versions (5 hr 25 min)
      • Test required TF and ROCm versions (5 hr 25 min)
        • Clean up workspace on node (3.7 sec)
        • Initialization (1.5 sec)
        • Cloning repositories (1 min 28 sec)
        • Run tests (11 min)
          Error: script returned exit code 1 - logs
          Error: Error detected when building or testing TensorFlow - logs