This issue arises when using multiple MI300A APUs in various scenarios, such as during high GPU memory usage or after loading AI model tensors. We observe a large performance drop (by a factor of 2 to 10), primarily because some hipMalloc allocations on one device are partially served from the HBM of another device. We expect that every allocation on a GPU device is served exclusively from the memory of the selected device, or else fails.
To reproduce, we used a Python environment with torch 2.7.0 and numpy (reproducer.py is below), but the issue can also be reproduced with plain hipMalloc by monitoring how memory usage is distributed across NUMA nodes in /sys/devices/system/node/node*/meminfo.
python reproducer.py --first_gpu_alloc_ratio 0.95 --next_gpus_relative_alloc_ratio 0.90
Output :
Number of available GPUs: 4
-------------------------------
Test 1: This pass will execute the following steps:
1. Show the current memory usage on each NUMA nodes.
2. Allocate tensors on GPU 1 to fill its memory capacity.
3. Check the actual location of the allocated memory by checking in which NUMA node's memory usage has increased.
4. Evaluate TFLOPs by selecting 3 random tensors (A, B, C) on the GPU and compute C += A . B (dot product).
5. Repeat the same process on each GPU without releasing the memory from the previous GPU.
--
Free memory layout at startup :
Numa node 0 free memory : 121 GiB, (of which pagecache memory : 1 GiB)
Numa node 1 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 2 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 3 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
--
GPU 1 : alloc 95.0% - 121.6 GiB of the GPU memory in 7782 tensors
Allocation tooks 13 seconds and comes at 93% (10.09 GiB still free of which 0.45 GiB of pagecache) from the right numa and from :
NUMA node 2: 7.72 GiB
Bench (any bank) :: 23.666840 TFLOPs (during 1.451455 sec)
Bench (right bank) :: 28.834836 TFLOPs (during 1.191314 sec)
Bench (wrong bank) :: 5.774188 TFLOPs (during 5.949122 sec)
--
GPU 2 : alloc 85.5% - 109.44 GiB of the GPU memory in 7004 tensors
Allocation tooks 13 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.405166 TFLOPs (during 0.631399 sec)
Bench (right bank) :: 54.367535 TFLOPs (during 0.631836 sec)
--
GPU 3 : alloc 76.95% - 98.5 GiB of the GPU memory in 6303 tensors
Allocation tooks 12 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.410138 TFLOPs (during 0.631341 sec)
Bench (right bank) :: 54.393623 TFLOPs (during 0.631533 sec)
--
GPU 0 : alloc 69.25% - 88.65 GiB of the GPU memory in 5673 tensors
Allocation tooks 6 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.396970 TFLOPs (during 0.631494 sec)
Bench (right bank) :: 54.373855 TFLOPs (during 0.631762 sec)
-------------------------------
Test 2: This pass will execute the following steps:
1. Fill the pagecache with a dummy 100 GiB file to demonstrate the effect of pagecache on memory allocation.
2. Retry the allocation and performance evaluation after partially filling the pagecache.
107374182400 bytes (107 GB, 100 GiB) copied, 102 s, 1.1 GB/s
100+0 records in
100+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 101.898 s, 1.1 GB/s
Successfully created 100 GiB file filled with zeros: ./zero_file.bin
Successfully synchronized file system buffers.
--
Free memory layout at startup :
Numa node 0 free memory : 111 GiB, (of which pagecache memory : 101 GiB)
Numa node 1 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 2 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 3 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
--
GPU 1 : alloc 95.0% - 121.6 GiB of the GPU memory in 7782 tensors
Allocation tooks 13 seconds and comes at 93% (9.93 GiB still free of which 0.45 GiB of pagecache) from the right numa and from :
NUMA node 2: 7.55 GiB
Bench (any bank) :: 24.073996 TFLOPs (during 1.426907 sec)
Bench (right bank) :: 28.760786 TFLOPs (during 1.194381 sec)
Bench (wrong bank) :: 5.542506 TFLOPs (during 6.197801 sec)
--
GPU 2 : alloc 85.5% - 109.44 GiB of the GPU memory in 7004 tensors
Allocation tooks 13 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.412645 TFLOPs (during 0.631312 sec)
Bench (right bank) :: 54.356111 TFLOPs (during 0.631968 sec)
--
GPU 3 : alloc 76.95% - 98.5 GiB of the GPU memory in 6303 tensors
Allocation tooks 11 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.441244 TFLOPs (during 0.630980 sec)
Bench (right bank) :: 54.393767 TFLOPs (during 0.631531 sec)
--
GPU 0 : alloc 69.25% - 88.65 GiB of the GPU memory in 5673 tensors
Allocation tooks 67 seconds and comes at 75% (49.43 GiB still free of which 0.41 GiB of pagecache) from the right numa and from :
NUMA node 3: 19.7 GiB
Bench (any bank) :: 22.395341 TFLOPs (during 1.533861 sec)
Bench (right bank) :: 18.161142 TFLOPs (during 1.891475 sec)
Bench (wrong bank) :: 54.232145 TFLOPs (during 0.633413 sec)
Successfully removed: ./zero_file.bin
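As a sanity check on the log above, the tensor counts and TFLOPs figures can be recomputed from the reproducer's default parameters. The sketch below is only illustrative: it assumes each device reports exactly 128 GiB (inferred from "121.6 GiB" at 95%, not read from the hardware), and the helper names `tensors_per_gpu` and `tflops` are hypothetical, not part of reproducer.py.

```python
# Recompute figures from the log using the reproducer's defaults.
MATRIX_SIZE = 2048                                 # default --matrix_size
N_MM = 2000                                        # default --perf_nb_loop
BYTES_PER_TENSOR = MATRIX_SIZE * MATRIX_SIZE * 4   # float32 tensors
TOTAL_MEMORY = 128 * 1024**3                       # ASSUMPTION: 121.6 GiB / 0.95


def tensors_per_gpu(first_ratio, relative_ratio, num_gpus):
    """Number of tensors allocated on each successive GPU."""
    counts = []
    ratio = first_ratio
    for _ in range(num_gpus):
        counts.append(int((TOTAL_MEMORY * ratio) // BYTES_PER_TENSOR))
        ratio *= relative_ratio  # each GPU allocates a bit less than the previous one
    return counts


def tflops(elapsed_s, n_mm=N_MM, n=MATRIX_SIZE):
    """TFLOPs for n_mm products of n x n matrices, as counted by the reproducer."""
    flops = n_mm * (n * n * (2 * n - 1))
    return flops / (elapsed_s * 1e12)


print(tensors_per_gpu(0.95, 0.90, 4))
print(f"{tflops(0.631399):.2f} TFLOPs for 0.631399 s")
print(f"{tflops(5.949122):.2f} TFLOPs for 5.949122 s")
```

Under these assumptions the same work (2000 products) takes roughly 0.63 s when the tensors are local and almost 6 s when they come from the wrong bank, which is where the factor of up to 10 comes from.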
reproducer.py :
import os
import torch
import subprocess
import time
import random
import argparse
def main(args):
    # Get the number of available GPUs
    num_gpus = torch.cuda.device_count()
    print(f"Number of available GPUs: {num_gpus}")
    # Execute 2 passes of:
    # --> Show the memory
    # --> Allocate tensors on GPU 1 to fill its memory
    # --> Check the real location of the allocated memory (which NUMA node's memory usage increased?)
    # --> Evaluate TFLOPs by selecting 3 random tensors (A, B, C) in the GPU and compute C += A . B (dot product)
    # ... same thing on each GPU (without releasing the memory of the previous GPUs)
    #
    # The second pass differs only in that we start by partially filling the pagecache
    # with evictable pages to show the effect of the pagecache on memory allocation
    for l in range(2) :
        alloc_ratio = args.first_gpu_alloc_ratio
        torch.cuda.empty_cache()
        print("-------------------------------")
        if l == 0 :
            print(f"Test 1: This pass will execute the following steps:")
            print("1. Show the current memory usage on each NUMA nodes.")
            print(f"2. Allocate tensors on GPU {args.first_gpu_index % num_gpus} to fill its memory capacity.")
            print("3. Check the actual location of the allocated memory by checking in which NUMA node's memory usage has increased.")
            print("4. Evaluate TFLOPs by selecting 3 random tensors (A, B, C) on the GPU and compute C += A . B (dot product).")
            print("5. Repeat the same process on each GPU without releasing the memory from the previous GPU.")
        else :
            if not args.do_pagecache_test :
                break
            print(f"Test 2: This pass will execute the following steps:")
            print(f"1. Fill the pagecache with a dummy {args.pagecache_fill_size} GiB file to demonstrate the effect of pagecache on memory allocation.")
            print("2. Retry the allocation and performance evaluation after partially filling the pagecache.")
            create_zero_file('./zero_file.bin', args.pagecache_fill_size)
        print("--")
        print("Free memory layout at startup :")
        memfree_start, pagecache_start = get_memfree()
        for node_id, memfree_bytes in memfree_start.items() :
            print(f"Numa node {node_id} free memory : {round(memfree_bytes/1024/1024)} GiB, (of which pagecache memory : {round(pagecache_start[node_id]/1024/1024)} GiB)")
        tensors = [None] * num_gpus
        for i in range(num_gpus) :
            print("--")
            device_idx = (i + args.first_gpu_index) % num_gpus
            total_memory = torch.cuda.get_device_properties(device_idx).total_memory
            # Get initial numa node memory info
            memfree_before, pagecache_before = get_memfree()
            # Allocate a bunch of tensors (nb_blocks) that fill alloc_ratio of the GPU
            nb_blocks = round((total_memory * alloc_ratio) // (args.matrix_size * args.matrix_size * 4))
            print(f"GPU {device_idx} : alloc {round(alloc_ratio * 100, 2)}% - {round(total_memory*alloc_ratio / 1024 ** 3, 2)} GiB of the GPU memory in {nb_blocks} tensors")
            start_time = time.time()
            tensors[device_idx] = [torch.zeros(args.matrix_size, args.matrix_size, dtype=torch.float32, device=f"cuda:{device_idx}") for j in range(nb_blocks)]
            torch.cuda.synchronize(device_idx)
            duration = time.time() - start_time
            # Get numa node memory info after allocation
            memfree_after, pagecache_after = get_memfree()
            # Calculate the difference in memory
            mem_diff = {}
            total_diff = 0
            secondary_pools = ""
            for node in memfree_before.keys():
                diff = memfree_after[node] - memfree_before[node]
                mem_diff[node] = diff
                total_diff += diff
                if node != device_idx and diff < - 1024 * 1024:
                    # Print additional nodes where the allocation difference is greater than 1024 MiB -- which is enough to skip the python memory usage increase
                    secondary_pools += f"\n NUMA node {node}: {-round(diff/1024/1024, 2)} GiB"
            ratio_in_right_numa = mem_diff[device_idx] / total_diff
            if len(secondary_pools) == 0 :
                print(f" Allocation tooks {round(duration)} seconds and comes mainly from the right memory bank")
            else :
                print(f" Allocation tooks {round(duration)} seconds and comes at {round(ratio_in_right_numa*100)}% ({round(memfree_after[device_idx] / 1024 / 1024, 2)} GiB still free of which {round(pagecache_after[device_idx]/1024/1024, 2)} GiB of pagecache) from the right numa and from :{secondary_pools}")
            # Perform matrix multiplication on each GPU
            if args.do_perf_test :
                execute_perf_test(args, tensors, device_idx, nb_blocks, ratio_in_right_numa)
            # reduce the % of the memory to show different behaviours
            alloc_ratio *= args.next_gpus_relative_alloc_ratio
        tensors = None
        print()
        print()
    remove_file('./zero_file.bin')
# Get the free and pagecache memory (in kB) for each numa node from /sys/devices/system/node/node*/meminfo.
def get_memfree():
    memfree = {}
    page_cache = {}
    try:
        output = subprocess.check_output("grep -hi 'MemFree' /sys/devices/system/node/node*/meminfo", shell=True)
        for line in output.decode().strip().split('\n'):
            # assume format is : Node 0 MemFree: 28046496 kB
            parts = line.split()
            node_id = int(parts[1])
            memfree[node_id] = int(parts[3])
        output = subprocess.check_output("grep -hi 'FilePages' /sys/devices/system/node/node*/meminfo", shell=True)
        for line in output.decode().strip().split('\n'):
            # assume format is : Node 0 FilePages: 28046496 kB
            parts = line.split()
            node_id = int(parts[1])
            page_cache[node_id] = int(parts[3])
            # pagecache pages are evictable, so count them as free memory
            memfree[node_id] += page_cache[node_id]
    except Exception as e:
        print(f"Error reading memory info: {e}")
    return memfree, page_cache
def pick_random_square_matrix(tensors, mnp, n_blocks) :
    if n_blocks < 0 :
        idx = random.randint(n_blocks, -1) # pick a random index for tensor
    else :
        idx = random.randint(0, n_blocks - 1) # pick a random index for tensor
    return tensors[idx].view(mnp, mnp) # use the tensor as a square matrix
def random_mm_sum(tensors, mnp, n_blocks, n_mm) :
    for l in range(n_mm):
        A = pick_random_square_matrix(tensors, mnp, n_blocks)
        B = pick_random_square_matrix(tensors, mnp, n_blocks)
        C = pick_random_square_matrix(tensors, mnp, n_blocks)
        C += torch.mm(A,B)
    flops = n_mm * (mnp * mnp * (2*mnp - 1))
    return C, flops
def execute_perf_test(args, tensors, device_idx, nb_blocks, ratio_in_right_numa) :
    results = []
    def bench(test_name, n_blocks, n_mm, from_end = False) :
        torch.cuda.synchronize(device=f'cuda:{device_idx}')
        start_time = time.time()
        C, total_flops = random_mm_sum(tensors[device_idx], args.matrix_size, n_blocks, n_mm)
        torch.cuda.synchronize(device=f'cuda:{device_idx}')
        end_time = time.time()
        elapsed_time = end_time - start_time # Time in seconds
        tflops = total_flops / (elapsed_time * 1e12) # Convert to TFLOPs
        if test_name is not None :
            print(f"{test_name}: {tflops:.6f} TFLOPs (during {elapsed_time:.6f} sec)")
        return C
    # warmup bench (discarded)
    results.append(bench(None, nb_blocks, args.perf_nb_warmup_loop))
    # any tensor is candidate
    results.append(bench(" Bench (any bank) :", nb_blocks, args.perf_nb_loop))
    # assume that the allocation started to use the right memory bank and then fallback to the wrong one
    nb_block_in_right_numa = round(ratio_in_right_numa * nb_blocks) - 1
    if nb_block_in_right_numa > 100 :
        results.append(bench(" Bench (right bank) :", nb_block_in_right_numa, args.perf_nb_loop))
    nb_block_in_wrong_numa = round((1 - ratio_in_right_numa) * nb_blocks) - 1
    if nb_block_in_wrong_numa > 100 :
        results.append(bench(" Bench (wrong bank) :", -nb_block_in_wrong_numa, args.perf_nb_loop))
# Write a file full of zeros that will fill the pagecache
# you can replace it by any non O_DIRECT file read (for instance cat filename > /dev/null)
# we ensure the file is properly sync at the end to be sure pages can be evicted from pagecache easily
def create_zero_file(filename, size_gb):
    # Execute the dd command
    process = subprocess.run(f'dd if=/dev/zero of={filename} bs=1G count={size_gb} status=progress', shell=True)
    # Check if the command was successful
    if process.returncode == 0:
        print(f"Successfully created {size_gb} GiB file filled with zeros: {filename}")
    else:
        print(f"Error occurred while creating the file: {process.returncode}")
    # Ensure there are no dirty pages
    sync_process = subprocess.run('sync', shell=True)
    if sync_process.returncode == 0:
        print("Successfully synchronized file system buffers.")
    else:
        print(f"Error occurred while synchronizing: {sync_process.returncode}")
# At the end we want to empty the pagecache to restore the original state
# it is done by destroying the file created by create_zero_file
def remove_file(filename):
    try:
        os.remove(filename)
        print(f"Successfully removed: {filename}")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except PermissionError:
        print(f"Permission denied: {filename}")
    except Exception as e:
        print(f"Error occurred while removing the file: {e}")
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="GPU Memory Allocation Parameters")
    parser.add_argument('--first_gpu_alloc_ratio', type=float, required=True,
                        help='Fraction (0 to 1) of the GPU capacity to allocate on the first GPU')
    parser.add_argument('--next_gpus_relative_alloc_ratio', type=float, required=True,
                        help='Each GPU allocates the ratio of the previous one multiplied by this ratio')
    parser.add_argument('--first_gpu_index', type=int, default=1,
                        help='Index of the GPU to start with (to start with a GPU other than 0)')
    parser.add_argument('--matrix_size', type=int, default=2048,
                        help='Size of the tensors to allocate (matrix_size x matrix_size)')
    # type=bool would turn any non-empty string (even "False") into True;
    # BooleanOptionalAction (Python 3.9+) provides --do_perf_test / --no-do_perf_test instead
    parser.add_argument('--do_perf_test', action=argparse.BooleanOptionalAction, default=True,
                        help='Flag to enable performance test')
    parser.add_argument('--perf_nb_loop', type=int, default=2000,
                        help='Number of dot products with 2 matrices of shape (matrix_size, matrix_size)')
    parser.add_argument('--perf_nb_warmup_loop', type=int, default=round(2000 * 0.2),
                        help='Number of warmup loops for performance test')
    # same remark as for --do_perf_test: use --no-do_pagecache_test to disable
    parser.add_argument('--do_pagecache_test', action=argparse.BooleanOptionalAction, default=True,
                        help='Flag to enable pagecache test')
    parser.add_argument('--pagecache_fill_size', type=int, default=100,
                        help='Size in GiB to fill in the pagecache for the test')
    args = parser.parse_args()
    main(args)
No specific message regarding the location of memory appears in dmesg, nor with AMD_LOG_LEVEL set.
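For anyone reproducing this with plain hipMalloc rather than the script, the per-node monitoring can be done without shelling out to grep. The following is a minimal sketch, not the reproducer's own code: the function names (`parse_node_meminfo`, `read_all_nodes`, `spilled_nodes`) are hypothetical, and it assumes, as the reproducer does, that GPU index N is backed by NUMA node N and that meminfo lines look like `Node 0 MemFree: 28046496 kB`.

```python
import glob
import re

# Matches lines such as "Node 0 MemFree:       28046496 kB"
_LINE = re.compile(r"Node[ \t]+(\d+)[ \t]+(\w+):[ \t]+(\d+)[ \t]+kB")


def parse_node_meminfo(text, fields=("MemFree", "FilePages")):
    """Return {node_id: {field: value_in_kB}} from the text of node meminfo files."""
    result = {}
    for match in _LINE.finditer(text):
        node, field, kb = int(match.group(1)), match.group(2), int(match.group(3))
        if field in fields:
            result.setdefault(node, {})[field] = kb
    return result


def read_all_nodes(pattern="/sys/devices/system/node/node*/meminfo"):
    """Read and parse every per-node meminfo file (Linux only)."""
    text = ""
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            text += f.read()
    return parse_node_meminfo(text)


def spilled_nodes(before, after, expected_node, threshold_kb=1024 * 1024):
    """Nodes other than expected_node whose free + pagecache memory dropped by more than threshold_kb."""
    spilled = {}
    for node, vals in before.items():
        free_before = vals["MemFree"] + vals["FilePages"]
        free_after = after[node]["MemFree"] + after[node]["FilePages"]
        drop = free_before - free_after
        if node != expected_node and drop > threshold_kb:
            spilled[node] = drop
    return spilled
```

A typical use would be to call `read_all_nodes()` before and after a batch of allocations on one device and pass both snapshots to `spilled_nodes`; a non-empty result indicates that part of the allocation was served by another node. The 1 GiB default threshold mirrors the one used in the reproducer to ignore host-side noise.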
Operating System
Red Hat Enterprise Linux 9.4 (Plow)
CPU
AMD Instinct MI300A Accelerator
GPU
4 * AMD Instinct MI300A Accelerator
ROCm Version
ROCm 6.4.0
ROCm Component
No response