Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
71 commits
Select commit Hold shift + click to select a range
f84410d
reduce zmobius verbosity
lehner May 20, 2023
0b88274
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Oct 8, 2023
b31766a
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Oct 23, 2023
ef0a442
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Feb 28, 2024
225ae2e
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner May 10, 2024
d20caac
Low-level half-precision comms framework and bfloat16 implementation
lehner Sep 25, 2024
0a316ad
adjust help message
lehner Sep 25, 2024
a4b9923
mute benchmarking logs
lehner Sep 27, 2024
7b7c1bf
Don't benchmark Dirichlet version for JUREAP
lehner Oct 9, 2024
6cd4ce2
Update Benchmark_dwf_fp32_jureap.cc
lehner Dec 18, 2024
64c227c
Update Benchmark_dwf_fp32_jureap.cc
lehner Dec 18, 2024
576d06e
Merge fix
lehner Jan 26, 2025
eb1058d
Merge with upstream
lehner Jan 26, 2025
ffc9de9
Merge branch 'paboyle-develop' into feature/gpt
lehner Jan 26, 2025
eceb607
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Jan 26, 2025
e138258
complete merge conflict resolution
lehner Jan 26, 2025
be50020
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Feb 19, 2025
5adb165
allow for transpose in gemm of complex matrices
lehner Feb 22, 2025
9b317b4
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Feb 22, 2025
059b6e6
default gemm needs to ignore C if beta vanishes
lehner Mar 2, 2025
2acb657
cpu target async
lehner Mar 10, 2025
e85b798
library adjustment for JEDI
lehner Mar 20, 2025
98cce16
Towards inverse and determinant in batched blas on all platforms
lehner Mar 21, 2025
99725f1
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Mar 21, 2025
93fc21f
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Mar 21, 2025
1d7edc3
Continue with cublas
lehner Mar 21, 2025
cc67283
Complete first version on CUDA/HIP/CPU of GridBlas::inverseBatched an…
lehner Mar 21, 2025
23db75d
CompactWilsonClover5D
lehner Apr 10, 2025
147c2eb
Add checkerboarding to 5D compact clover
lehner Apr 10, 2025
5e3a6f2
Allow for reduced-precision GEMM
lehner Apr 17, 2025
f0573d0
fix cuda BatchedGemmEx
lehner Apr 17, 2025
0fb69d6
no thread_for for identity operator
lehner May 8, 2025
f666dea
Making running on Aurora more debuggable
paboyle May 23, 2025
634aec2
Merge pull request #40 from paboyle/develop
lehner Jun 22, 2025
1d23492
--debug-stdout in subdirectory
lehner Jun 22, 2025
edfb09f
merge
lehner Jun 22, 2025
dcc51ad
merge feature/gpt and feature/aurora
lehner Jun 26, 2025
a04a2b7
fix communicator_none
lehner Jun 26, 2025
b1859c8
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Jun 26, 2025
1016ed5
experiment with AC 2.71
lehner Jun 27, 2025
ffc6dbd
towards argonne compatibility
lehner Jun 27, 2025
dc73400
SYCL blas compatibility
lehner Jun 28, 2025
6297001
Merge pull request #41 from paboyle/develop
lehner Jun 28, 2025
5cca827
Rollback half-prec comms for now
lehner Jun 30, 2025
87350e0
checksum comms
lehner Jul 6, 2025
912b6cf
refined checksum
lehner Jul 6, 2025
c0ca7f0
checksum fix
lehner Jul 7, 2025
6307fcb
view logging
lehner Jul 9, 2025
5ad20a9
more appropriate view modes
lehner Jul 10, 2025
84c7196
stencil view plays nicely with view logging
lehner Jul 11, 2025
965cee4
towards cleanup of LogViews
lehner Jul 14, 2025
d7cdab5
double up sycl fences and barriers
lehner Jul 22, 2025
81cc867
SYCL lane
lehner Jul 23, 2025
daf6e25
Merge branch 'feature/gpt' of https://github.com/lehner/Grid into fea…
lehner Jul 23, 2025
8fbeee8
cshift test
lehner Jul 23, 2025
08b3844
remove error message since fallback to small-sum path is taken
lehner Jul 23, 2025
d22d07b
checksum index per Grid to play nice with split grid
lehner Jul 26, 2025
b830235
DeviceMemoryAllocator
lehner Jul 28, 2025
928abe5
DeviceMemoryAllocator
lehner Jul 28, 2025
c973aa2
fix header
lehner Jul 29, 2025
5cdf5ba
fix configure.ac
lehner Jul 29, 2025
7a939ed
oversubscription factor
lehner Jul 29, 2025
8bd4562
quote reusable pages
lehner Jul 29, 2025
bc4f7bd
cleanup device allocator
lehner Jul 30, 2025
faed6d3
M5D virtual
lehner Dec 12, 2025
afd4423
try a different barrier setup at Aurora
lehner Feb 28, 2026
09589eb
merged upstream
lehner May 9, 2026
a870698
clean compile merge
lehner May 9, 2026
30cf9a0
Update Benchmark_dwf_fp32.cc
lehner May 9, 2026
14bff9e
merge adjust
lehner May 9, 2026
a6ede60
merge adjust
lehner May 9, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Grid/allocator/Allocator.h
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,4 @@
#include <Grid/allocator/MemoryStats.h>
#include <Grid/allocator/MemoryManager.h>
#include <Grid/allocator/AlignedAllocator.h>
#include <Grid/allocator/DeviceMemoryAllocator.h>
235 changes: 235 additions & 0 deletions Grid/allocator/DeviceMemoryAllocator.cc
Original file line number Diff line number Diff line change
@@ -0,0 +1,235 @@
/*************************************************************************************

Grid physics library, www.github.com/paboyle/Grid

Source file: ./lib/DeviceMemoryAllocator.h

Copyright (C) 2025

Author: Christoph Lehner <christoph@lhnr.de>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */

#include <Grid/GridCore.h>

NAMESPACE_BEGIN(Grid);

#define DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE (64*1024)
#define OVERALLOCATION_FACTOR 1.2

#ifdef GRID_DEVICE_MEMORY_ALLOCATOR
struct DeviceMemoryAllocator {

bool initialized;
char* base;
size_t size;
size_t offset;
bool verbose;

DeviceMemoryAllocator() {
initialized = false;
base = 0;
size = 0;
offset = 0;
verbose = false;
}

~DeviceMemoryAllocator() {
if (initialized) {
acceleratorFreeDevice(base);
initialized = false;
}
}

std::vector<size_t> pages;
std::map<size_t, std::vector<size_t> > size_map;

void Init(size_t _size) {
assert(!initialized);

char* str;
if ((str = getenv("GRID_OVERALLOCATION_FACTOR"))) {
_size = (size_t)(_size * atof(str));
} else {
_size = (size_t)(_size * OVERALLOCATION_FACTOR);
}

verbose = (getenv("GRID_DEBUG_DEVICE_ALLOCATOR") != 0);

size_t n_pages = (_size + DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE - 1) / DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE;
size = n_pages * DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE;
std::cout << GridLogMessage << "Init device allocator with " << size << " bytes" << std::endl;

base = (char*)acceleratorAllocDeviceInternal(size);
assert(base);

if (verbose)
std::cout << GridLogMessage << "Initialize memory to zero" << std::endl;

{
uint64_t* ba = (uint64_t*)base;
size_t n = size / sizeof(uint64_t);
size_t MAX_BLOCK_INIT = 128*1024*1024;
while (n > 0) {
size_t n0 = n;
if (n0 > MAX_BLOCK_INIT)
n0 = MAX_BLOCK_INIT;
accelerator_for(i, n0, 1, {
ba[i] = (uint64_t)-1;
});
ba += n0;
n -= n0;
}
}

if (verbose)
std::cout << GridLogMessage << "Done" << std::endl;

offset = 0;

pages.resize(n_pages, 0);

if (verbose)
std::cout << GridLogMessage << "Pages initialized" << std::endl;

initialized = true;
}

void* attemptReuseExactSize(size_t n_pages) {
auto sm = size_map.find(n_pages);
if (sm != size_map.end() && sm->second.size() > 0) {
size_t index = sm->second.back();
sm->second.pop_back();

if (sm->second.size() == 0)
size_map.erase(sm);

assert(pages[index] == 0);
pages[index] = n_pages;

return base + index * DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE;
}
return 0;
}

void* attemptAllocUnused(size_t n_pages) {
size_t end = (offset + n_pages) * DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE;
void* ptr = 0;

if (end <= size) {
pages[offset] = n_pages;

ptr = base + offset * DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE;
offset += n_pages;

if (verbose) {
size_t reusable_pages = 0;
for (auto & sm : size_map)
reusable_pages += sm.first * sm.second.size();

std::cout << GridLogMessage << (size - end) / DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE << " pages left to allocate ("
<< (size - end) * 100 / size << "% unallocated, " << reusable_pages << " reusable pages)" << std::endl;
}
}

return ptr;
}

void* alloc(size_t bytes) {
if (!initialized)
Init(MemoryManager::DeviceMaxBytes);

if (!bytes)
bytes++;

size_t n_pages = (bytes + DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE - 1) / DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE;

// first check if block of perfect size is available
void* ptr;
if ((ptr = attemptReuseExactSize(n_pages))) {

if (verbose)
std::cout << GridLogMessage << "Can re-use perfect pointer for " << n_pages << " pages" << std::endl;

return ptr;
}

// if not, attempt to allocate in the unused area
if ((ptr = attemptAllocUnused(n_pages)))
return ptr;

// last attempt, find a re-usable region that barely fits and return it
// for loop of std::map iterates in ascending order
size_t reusable_pages = 0;
size_t n_pages_usable = 0;
for (auto & sm : size_map) {
assert(sm.second.size() > 0); // should never be empty
reusable_pages += sm.first * sm.second.size();
if (n_pages_usable == 0 && sm.first > n_pages)
n_pages_usable = sm.first;
}

if (n_pages_usable == 0) {
std::cout << GridLogMessage << "Out of memory for " << n_pages << " pages! Re-usable pages at time of death:" << std::endl;

for (auto & sm : size_map) {
std::cout << GridLogMessage << sm.second.size() << " x " << sm.first << " pages" << std::endl;
}

exit(1);
}

if ((ptr = attemptReuseExactSize(n_pages_usable))) {

if (verbose)
std::cout << GridLogMessage << "Can re-use pointer for " << n_pages_usable << " pages when " << n_pages << " were needed; " << reusable_pages << " reusable pages" << std::endl;

return ptr;
}

// this should never be reached
assert(0);
return ptr;
}

void free(void* ptr) {
if (!initialized)
return;

size_t index = ((size_t)((char*)ptr - base)) / DEVICE_MEMORY_ALLOCATOR_PAGE_SIZE;
size_t n_pages = pages[index];
//std::cout << GridLogMessage << "Freeing ptr " << ptr << " has " << n_pages << " pages" << std::endl;
pages[index] = 0;
auto & sm = size_map[n_pages];
sm.push_back(index);
}
};

static DeviceMemoryAllocator dma;

void *acceleratorAllocDevice(size_t bytes) {
return dma.alloc(bytes);
}

void acceleratorFreeDevice(void *ptr) {
dma.free(ptr);
}
#endif

NAMESPACE_END(Grid);
36 changes: 36 additions & 0 deletions Grid/allocator/DeviceMemoryAllocator.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
/*************************************************************************************

Grid physics library, www.github.com/paboyle/Grid

Source file: ./lib/DeviceMemoryAllocator.h

Copyright (C) 2025

Author: Christoph Lehner <christoph@lhnr.de>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

See the full license in the file "LICENSE" in the top level distribution directory
*************************************************************************************/
/* END LEGAL */

NAMESPACE_BEGIN(Grid);

#ifdef GRID_DEVICE_MEMORY_ALLOCATOR
void *acceleratorAllocDevice(size_t bytes);
void acceleratorFreeDevice(void *ptr);
#endif

NAMESPACE_END(Grid);
24 changes: 17 additions & 7 deletions Grid/threads/Accelerator.h
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,11 @@ Author: paboyle <paboyle@ph.ed.ac.uk>
inline void *memalign(size_t align, size_t bytes) { return malloc(bytes); }
#endif

#ifdef GRID_DEVICE_MEMORY_ALLOCATOR
#define acceleratorAllocDevice acceleratorAllocDeviceInternal
#define acceleratorFreeDevice acceleratorFreeDeviceInternal
#endif

NAMESPACE_BEGIN(Grid);

//////////////////////////////////////////////////////////////////////////////////
Expand Down Expand Up @@ -346,7 +351,7 @@ accelerator_inline int acceleratorSIMTlane(int Nsimd) {
}); \
});

#define accelerator_barrier(dummy) { theGridAccelerator->wait(); }
#define accelerator_barrier(dummy) { theGridAccelerator->wait_and_throw(); theGridAccelerator->wait_and_throw(); }

inline void *acceleratorAllocShared(size_t bytes){ return malloc_shared(bytes,*theGridAccelerator);};
inline void *acceleratorAllocHost(size_t bytes) { return malloc_host(bytes,*theGridAccelerator);};
Expand All @@ -355,7 +360,7 @@ inline void acceleratorFreeHost(void *ptr){free(ptr,*theGridAccelerator);};
inline void acceleratorFreeShared(void *ptr){free(ptr,*theGridAccelerator);};
inline void acceleratorFreeDevice(void *ptr){free(ptr,*theGridAccelerator);};

inline void acceleratorCopySynchronise(void) { theCopyAccelerator->wait(); }
inline void acceleratorCopySynchronise(void) { theCopyAccelerator->wait_and_throw(); theCopyAccelerator->wait_and_throw(); }


///////
Expand All @@ -365,7 +370,7 @@ typedef sycl::event acceleratorEvent_t;

inline void acceleratorEventWait(acceleratorEvent_t ev)
{
ev.wait();
ev.wait_and_throw();
}

inline int acceleratorEventIsComplete(acceleratorEvent_t ev)
Expand All @@ -377,9 +382,9 @@ inline acceleratorEvent_t acceleratorCopyDeviceToDeviceAsynch(void *from,void *t
inline acceleratorEvent_t acceleratorCopyToDeviceAsynch(void *from,void *to,size_t bytes) { return theCopyAccelerator->memcpy(to,from,bytes); }
inline acceleratorEvent_t acceleratorCopyFromDeviceAsynch(void *from,void *to,size_t bytes) { return theCopyAccelerator->memcpy(to,from,bytes); }

inline void acceleratorCopyToDevice(const void *from,void *to,size_t bytes) { theCopyAccelerator->memcpy(to,from,bytes); theCopyAccelerator->wait();}
inline void acceleratorCopyFromDevice(const void *from,void *to,size_t bytes){ theCopyAccelerator->memcpy(to,from,bytes); theCopyAccelerator->wait();}
inline void acceleratorMemSet(void *base,int value,size_t bytes) { theCopyAccelerator->memset(base,value,bytes); theCopyAccelerator->wait();}
inline void acceleratorCopyToDevice(const void *from,void *to,size_t bytes) { theCopyAccelerator->memcpy(to,from,bytes); theCopyAccelerator->wait_and_throw();theCopyAccelerator->wait_and_throw();}
inline void acceleratorCopyFromDevice(const void *from,void *to,size_t bytes){ theCopyAccelerator->memcpy(to,from,bytes); theCopyAccelerator->wait_and_throw();theCopyAccelerator->wait_and_throw();}
inline void acceleratorMemSet(void *base,int value,size_t bytes) { theCopyAccelerator->memset(base,value,bytes); theCopyAccelerator->wait_and_throw();theCopyAccelerator->wait_and_throw();}

inline int acceleratorIsCommunicable(void *ptr)
{
Expand Down Expand Up @@ -650,7 +655,7 @@ inline void acceleratorFreeCpu (void *ptr){free(ptr);};
//////////////////////////////////////////////

#ifdef GRID_SYCL
inline void acceleratorFenceComputeStream(void){ theGridAccelerator->ext_oneapi_submit_barrier(); };
inline void acceleratorFenceComputeStream(void){ theGridAccelerator->ext_oneapi_submit_barrier(); theGridAccelerator->ext_oneapi_submit_barrier(); };
#else
// Ordering within a stream guaranteed on Nvidia & AMD
inline void acceleratorFenceComputeStream(void){ };
Expand Down Expand Up @@ -720,3 +725,8 @@ template<class T> T acceleratorGet(T& dev)


NAMESPACE_END(Grid);

#ifdef GRID_DEVICE_MEMORY_ALLOCATOR
#undef acceleratorAllocDevice
#undef acceleratorFreeDevice
#endif
Loading