XLA's bundled hwloc exports global symbols #39355

@yexiang-aws

XLA's bundled hwloc exports its symbols globally and is built without PCI discovery, which breaks NCCL OFI network plugins on AWS EFA instances.

Problem

XLA/TensorFlow bundles hwloc into libtensorflow_framework.so but the build configuration has two issues that break third-party libraries (e.g. aws-ofi-nccl, NIXL) that also depend on hwloc:

  1. hwloc symbols are exported with default visibility, so the dynamic linker resolves other libraries' hwloc_topology_init/hwloc_topology_load calls to TensorFlow's bundled hwloc instead of the system hwloc.

  2. The bundled hwloc is built without the PCI discovery backend (topology-pci.c is not in hwloc.BUILD srcs, and hwloc_pci_component is absent from static-components.h), so PCI device enumeration returns 0 devices.

The combination means: any library loaded in the same process that calls hwloc to discover PCI topology (GPUs, NICs, InfiniBand devices) silently gets zero results, because TensorFlow's incomplete hwloc implementation hijacks the calls.
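One way to see which shared object currently provides a hwloc entry point is to resolve the symbol in the process's global scope and ask dladdr(3) where it lives. The following is a minimal sketch, not part of TensorFlow or hwloc: provider_of is a hypothetical helper, and it assumes a Linux-like system where dlopen(NULL) and dladdr are reachable through ctypes. It only reflects the global lookup scope of the calling process, which may not match what a dlopen'ed plugin sees in every setup.

```python
import ctypes


class DlInfo(ctypes.Structure):
    """Mirror of the Dl_info struct that dladdr(3) fills in."""
    _fields_ = [
        ("dli_fname", ctypes.c_char_p),  # path of the object containing the address
        ("dli_fbase", ctypes.c_void_p),  # load address of that object
        ("dli_sname", ctypes.c_char_p),  # nearest symbol name
        ("dli_saddr", ctypes.c_void_p),  # address of that symbol
    ]


def provider_of(symbol):
    """Return the path of the object that defines `symbol` in the global
    scope of this process, or None if the symbol cannot be resolved.

    Hypothetical helper for illustration only.
    """
    process = ctypes.CDLL(None)  # equivalent to dlopen(NULL, ...)
    try:
        func = getattr(process, symbol)
    except AttributeError:
        return None  # symbol is not visible in the global scope

    dladdr = process.dladdr
    dladdr.argtypes = [ctypes.c_void_p, ctypes.POINTER(DlInfo)]
    dladdr.restype = ctypes.c_int

    info = DlInfo()
    addr = ctypes.cast(func, ctypes.c_void_p)
    if dladdr(addr, ctypes.byref(info)) == 0 or not info.dli_fname:
        return None
    return info.dli_fname.decode()
```

Calling provider_of("hwloc_topology_init") before and after import tensorflow should show whether the provider changes. On an affected install it would presumably name libtensorflow_framework.so.2, but that expectation is inferred from this report rather than verified separately.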

Impact

On AWS GPU instances (P5/P6 with EFA networking), the aws-ofi-nccl NCCL network plugin uses hwloc to discover EFA NIC topology for optimal GPU-NIC pairing. When TensorFlow 2.20+ is loaded:

  • hwloc_get_next_pcidev() returns NULL (0 PCI devices discovered)
  • aws-ofi-nccl computes max_group_size = 0, which produces the "Unexpected topo group size of 0" error
  • With aws-ofi-nccl 1.17.0: segmentation fault
  • With aws-ofi-nccl 1.18.1a1: falls back from RDMA to SENDRECV protocol (significant performance degradation)
  • TensorFlow 2.18.1 (which uses system hwloc) works correctly

This also affects NIXL's libfabric backend which uses the same hwloc PCI discovery path.

Reproduction

# Compile a minimal hwloc PCI enumeration program
cat > check_topo.c << 'EOF'
#include <stdio.h>
#include <hwloc.h>
int main() {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_ALL);
    hwloc_topology_load(topo);
    int pci_count = 0;
    hwloc_obj_t obj = NULL;
    while ((obj = hwloc_get_next_pcidev(topo, obj)) != NULL) pci_count++;
    printf("PCI devices: %d\n", pci_count);
    hwloc_topology_destroy(topo);
    return 0;
}
EOF
gcc -o check_topo check_topo.c -lhwloc

# System hwloc — correct
./check_topo
# Output: PCI devices: 89

# With TF 2.20 installed and preloaded: broken (TF's hwloc hijacks the calls)
LD_PRELOAD=$(python3 -c "import tensorflow; print(tensorflow.__file__.replace('__init__.py','') + 'libtensorflow_framework.so.2')") ./check_topo
# Output: PCI devices: 0

# With TF 2.18 preloaded (same command, run in an environment where TF 2.18 is installed): correct, because TF 2.18 uses system hwloc and has no bundled symbols
LD_PRELOAD=$(python3 -c "import tensorflow; print(tensorflow.__file__.replace('__init__.py','') + 'libtensorflow_framework.so.2')") ./check_topo
# Output: PCI devices: 89

Symbol collision confirmed via LD_DEBUG:

LD_DEBUG=bindings python test.py 2>&1 | grep "hwloc_topology" | grep "libnccl-net-ofi"
# binding file libnccl-net-ofi.so to libtensorflow_framework.so.2: normal symbol `hwloc_topology_init'
# binding file libnccl-net-ofi.so to libtensorflow_framework.so.2: normal symbol `hwloc_topology_load'
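For longer LD_DEBUG logs, the binding lines can be filtered programmatically. The sketch below is illustrative: hwloc_bindings and suspicious_bindings are hypothetical helpers, the regular expression is an assumption that covers both the abbreviated lines shown above and the fuller glibc form with a PID prefix and [0] markers, and the rule that a legitimate provider's file name starts with libhwloc is a heuristic.

```python
import os
import re

# Matches glibc LD_DEBUG=bindings lines, with or without the "[0]" markers.
BINDING_RE = re.compile(
    r"binding file (?P<src>\S+)(?: \[\d+\])? to (?P<dst>\S+?)(?: \[\d+\])?: "
    r"normal symbol `(?P<sym>[^']+)'"
)


def hwloc_bindings(ld_debug_text):
    """Return (caller, provider, symbol) tuples for every hwloc_* binding."""
    return [
        (m["src"], m["dst"], m["sym"])
        for m in BINDING_RE.finditer(ld_debug_text)
        if m["sym"].startswith("hwloc_")
    ]


def suspicious_bindings(ld_debug_text):
    """Keep only hwloc bindings whose provider is not a libhwloc* file.

    Heuristic: assumes the system hwloc is installed as libhwloc.so.<N>.
    """
    return [
        b for b in hwloc_bindings(ld_debug_text)
        if not os.path.basename(b[1]).startswith("libhwloc")
    ]
```

Feeding the captured stderr of the LD_DEBUG=bindings run to suspicious_bindings would list each library whose hwloc calls resolve to something other than a libhwloc file.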

Root Cause Analysis

In third_party/hwloc/hwloc.BUILD:

Issue 1: Missing -fvisibility=hidden

The copts do not include -fvisibility=hidden, so all 283 hwloc symbols are exported from libtensorflow_framework.so with default (global) visibility:

# Current
copts = COMMON_INCLUDE_COPTS + DISABLE_WARNINGS_COPTS + VAR_SETTINGS_COPTS,

# Should be
copts = COMMON_INCLUDE_COPTS + DISABLE_WARNINGS_COPTS + VAR_SETTINGS_COPTS + ["-fvisibility=hidden"],
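Whether a given build still exports hwloc symbols can be checked by inspecting its dynamic symbol table. The sketch below assumes binutils nm is on PATH. exported_hwloc_symbols and scan_library are hypothetical helper names, and filtering on the hwloc_ prefix is an assumption about how the bundled symbols are named.

```python
import subprocess


def exported_hwloc_symbols(nm_output):
    """Extract defined, globally visible hwloc_* names from `nm -D` output."""
    names = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols have three fields: address, type letter, name.
        if len(parts) != 3:
            continue
        _addr, sym_type, name = parts
        name = name.split("@")[0]  # drop any symbol-version suffix
        # Uppercase type letters mark global symbols; "U" means undefined.
        if sym_type.isupper() and sym_type != "U" and name.startswith("hwloc_"):
            names.add(name)
    return sorted(names)


def scan_library(path):
    """Run nm on a shared library and return its exported hwloc_* symbols."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        check=True, capture_output=True, text=True,
    )
    return exported_hwloc_symbols(result.stdout)
```

An empty result from scan_library on libtensorflow_framework.so.2 would suggest the symbols are no longer exported. A non-empty list means other libraries in the process can still bind to them.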

Issue 2: Missing PCI discovery backend

static-components.h does not register hwloc_pci_component, and hwloc.BUILD does not compile topology-pci.c. This means the bundled hwloc cannot discover PCI devices at all. While TensorFlow itself may not need PCI topology, the globally-visible symbols mean other libraries in the same process get this broken implementation.

Proposed Fix

Add -fvisibility=hidden to the hwloc build copts so that the bundled hwloc symbols remain internal to libtensorflow_framework.so and cannot hijack other libraries' hwloc calls:

copts = COMMON_INCLUDE_COPTS + DISABLE_WARNINGS_COPTS + VAR_SETTINGS_COPTS + ["-fvisibility=hidden"],

This is the minimal, non-breaking fix. It keeps TensorFlow's internal hwloc usage working while preventing symbol leakage to other libraries.

Environment

  • Instance: AWS P5 (8x H100, 32x EFA NICs)
  • TensorFlow: 2.20.0 (broken), 2.18.1 (works)
  • System hwloc: 2.7.0
  • TF bundled hwloc: 2.0.3 (per hwloc.BUILD version defines)
  • Affected libraries: aws-ofi-nccl (NCCL network plugin), NIXL libfabric backend

Labels

bug
