OpenKL is a high-performance memory pooling library for accelerator-style workloads, with first-class C++ and Python APIs.
It supports CUDA when available and falls back to host-backed allocation stubs when CUDA is not present (useful for macOS/Windows development machines).
- Fixed-size MemoryPool with O(1) allocation/deallocation.
- Variable-size SlabPool using configurable size classes.
- Batch allocation APIs and detailed telemetry/stats.
- Runtime exhaustion policies (Throw, Wait, FallbackRaw, plus a Grow scaffold).
- RAII and typed helpers (PooledPtr, allocate_t<T>()).
- No-exception pathways (Status, ErrorCode, and the C API _ex calls).
- Multi-GPU routing with affinity policies and peer-access helpers.
- C API for embedding and Python bindings via pybind11.
- Optional backend abstraction scaffold (host, CUDA, HIP stub).
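The O(1) claim for a fixed-size pool typically comes from a free list: every block has the same size, so allocation and deallocation reduce to a pop and a push on a stack of free block indices. The Python sketch below illustrates that general technique only. It is not OpenKL's implementation, and FixedPool, owned(), and the use of byte offsets as "pointers" are hypothetical choices made for illustration. The owned() context manager plays the role an RAII handle such as PooledPtr plays in C++.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FixedPool:
    """Hypothetical fixed-size pool; "pointers" are byte offsets into one arena."""

    def __init__(self, block_size: int, num_blocks: int) -> None:
        self.block_size = block_size
        self.arena = bytearray(block_size * num_blocks)  # reserved once, up front
        # Free list as a stack of block indices (lowest index on top): O(1) push/pop.
        self._free: List[int] = list(range(num_blocks - 1, -1, -1))
        self._in_use = set()

    def allocate(self) -> int:
        if not self._free:
            raise MemoryError("pool exhausted")
        idx = self._free.pop()  # O(1)
        self._in_use.add(idx)
        return idx * self.block_size

    def deallocate(self, offset: int) -> None:
        idx, rem = divmod(offset, self.block_size)
        if rem or idx not in self._in_use:  # catches double frees and foreign offsets
            raise ValueError("offset not allocated from this pool")
        self._in_use.remove(idx)
        self._free.append(idx)  # O(1)

    def allocate_batch(self, n: int) -> List[int]:
        # All-or-nothing: check capacity first so a failed batch leaks nothing.
        if n > len(self._free):
            raise MemoryError("not enough free blocks for batch")
        return [self.allocate() for _ in range(n)]

    def in_use_bytes(self) -> int:
        return len(self._in_use) * self.block_size

    @contextmanager
    def owned(self) -> Iterator[int]:
        # RAII-style scope: the block goes back to the pool even if the body raises.
        offset = self.allocate()
        try:
            yield offset
        finally:
            self.deallocate(offset)
```

Because every block has the same size, the pool never splits or coalesces blocks, which is what keeps both paths constant-time.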
Requirements:
- CMake 3.18+
- C++17 compiler
- Python 3.8+ (for the Python package and tests)
- Optional: CUDA 11+ with nvcc in PATH
Quick build (Unix/macOS):
cd /path/to/OpenKL
./build.sh
This configures, builds, and runs the C++ tests.
Windows helpers (PowerShell or CMD):
.\build_windows.ps1
build_windows.bat
Python package install:
cd /path/to/OpenKL/python
python -m pip install -e .
or, from the repo root:
pip install -e ./python
C++ usage (MemoryPool):
#include "openkl/memory_pool.hpp"
openkl::MemoryPoolOptions opts;
opts.thread_safe = true;
opts.debug = true;
opts.alignment = 64;
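opts.exhaustion_policy = openkl::ExhaustionPolicy::FallbackRaw;  // fall back to a raw (non-pooled) allocation when the pool is exhausted
openkl::MemoryPool pool(4096, 1024, opts);  // 1024 blocks of 4096 bytes
void* ptr = pool.allocate();
pool.deallocate(ptr);
auto st = pool.stats();  // usage/telemetry snapshot
C++ usage (SlabPool):
#include "openkl/slab_pool.hpp"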
auto classes = openkl::SlabPool::default_classes(64, 1024 * 1024, 512);  // min size, max size, blocks per class
openkl::SlabPool slab(classes);
void* p = slab.allocate(200);  // served from the best-fitting size class
slab.deallocate(p);
Python usage:
import openkl
pool = openkl.MemoryPool(
block_size=4096,
num_blocks=1024,
thread_safe=True,
debug=True,
alignment=64,
exhaustion_policy=openkl.ExhaustionPolicy.Throw,
)
ptr = pool.allocate()
pool.deallocate(ptr)
print(pool.stats().in_use, pool.stats().free_blocks)
slab = openkl.make_default_slab(64, 1024 * 1024, 256, debug=True)
p = slab.allocate(128)
slab.deallocate(p)
C API usage:
#include "openkl/c_api.h"
openkl_pool* pool = openkl_pool_create_ex(4096, 1024, 1, 1, 64, 0);
if (pool != NULL) {  /* check creation before use; openkl_last_error() reports failures */
    void* ptr = NULL;
    openkl_error_code ec = openkl_pool_alloc_ex(pool, &ptr);
    if (ec == OPENKL_OK) {
        openkl_pool_free_ex(pool, ptr);
    }
    openkl_pool_destroy(pool);
}
MemoryPool core methods:
- allocate(), deallocate()
- allocate_batch(n), deallocate_batch(ptrs)
- try_allocate(out_ptr), try_deallocate(ptr)
- allocate_owned()
- fragmentation(), stats()
- set_exhaustion_policy(), set_wait_timeout(), set_device_id()
- reserve(), reset(), validate(), for_each_allocation()
Metrics helpers:
- capacity_bytes(), in_use_bytes(), free_bytes()
CUDA-only methods:
- allocate_async(stream), deallocate_async(ptr, stream) (require thread_safe=true)
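The exhaustion policy decides what allocate() does when no free block is left. The sketch below is a hypothetical illustration of how Throw, Wait (bounded by a timeout, in the spirit of set_wait_timeout()), and FallbackRaw could behave; PolicyPool, Policy, and the tuple handles are invented for this example and do not mirror OpenKL's internals. Grow is left out because it is described only as a scaffold.

```python
import enum
import threading
from typing import Tuple, Union


class Policy(enum.Enum):
    THROW = "throw"
    WAIT = "wait"
    FALLBACK_RAW = "fallback_raw"


Handle = Tuple[str, Union[int, bytearray]]


class PolicyPool:
    """Hypothetical pool showing three exhaustion policies."""

    def __init__(self, num_blocks: int, block_size: int,
                 policy: Policy = Policy.THROW, wait_timeout: float = 0.1) -> None:
        self.block_size = block_size
        self.policy = policy
        self.wait_timeout = wait_timeout  # seconds, used only by Policy.WAIT
        self._free = list(range(num_blocks))
        self._cv = threading.Condition()

    def allocate(self) -> Handle:
        with self._cv:
            if not self._free:
                if self.policy is Policy.THROW:
                    raise MemoryError("pool exhausted")
                if self.policy is Policy.FALLBACK_RAW:
                    # Bypass the pool and hand out a separate raw buffer.
                    return ("raw", bytearray(self.block_size))
                # Policy.WAIT: sleep until deallocate() frees a block or time runs out.
                if not self._cv.wait_for(lambda: self._free, timeout=self.wait_timeout):
                    raise TimeoutError("no block freed within wait_timeout")
            return ("pool", self._free.pop())

    def deallocate(self, handle: Handle) -> None:
        kind, value = handle
        if kind == "raw":
            return  # raw fallback buffers are not pool-owned; just drop them
        with self._cv:
            self._free.append(value)
            self._cv.notify()  # wake one thread blocked in Policy.WAIT
```

Wait only makes sense when other threads free blocks concurrently, which is why a pool using it would need to be thread-safe.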
SlabPool methods:
- allocate(size), allocate_exact(size), deallocate(ptr)
- default_classes(min, max, blocks_per_class)
- set_compaction_policy(...)
- stats(), slab_stats()
- class_for_size(size), rebalance(...) (rebalance is currently a scaffold)
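A size-class allocator rounds each request up to the smallest class that fits it. The sketch below assumes power-of-two classes between the minimum and maximum sizes; how OpenKL's default_classes() actually spaces its classes is not stated, so treat both functions as hypothetical stand-ins for default_classes() and class_for_size().

```python
import bisect
from typing import List, Tuple


def default_classes(min_size: int, max_size: int,
                    blocks_per_class: int) -> List[Tuple[int, int]]:
    """Hypothetical: (block_size, block_count) pairs doubling from min_size to max_size."""
    if min_size <= 0 or max_size < min_size:
        raise ValueError("need 0 < min_size <= max_size")
    classes = []
    size = min_size
    while size <= max_size:
        classes.append((size, blocks_per_class))
        size *= 2
    return classes


def class_for_size(class_sizes: List[int], size: int) -> int:
    """Index of the smallest class with block size >= size, via binary search."""
    idx = bisect.bisect_left(class_sizes, size)
    if idx == len(class_sizes):
        raise ValueError(f"{size} bytes exceeds the largest size class")
    return idx
```

With these assumed classes, a 200-byte request lands in the 256-byte class, leaving 56 bytes of that block unused; finer-grained classes reduce this internal fragmentation at the cost of more free lists to manage.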
Multi-GPU pool methods:
- allocate(device_id), allocate_auto(), allocate_on_best_fit(size_bytes)
- deallocate(ptr, device_id), plus batch variants
- set_affinity_policy(...), set_device_weight(...)
- stats(device_id), stats_all(), stats_aggregate()
- peer_access_supported(src, dst), copy_peer(...)
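Which affinity policies OpenKL offers is not listed here. As one plausible example of weight-based routing behind calls like allocate_auto() and set_device_weight(), the sketch below implements smooth weighted round-robin, a well-known scheduling technique; WeightedRouter and next_device() are hypothetical names, not OpenKL API.

```python
from typing import Dict


class WeightedRouter:
    """Hypothetical device router using smooth weighted round-robin."""

    def __init__(self, weights: Dict[int, int]) -> None:
        self.weights: Dict[int, int] = {}
        self.current: Dict[int, int] = {}
        for device_id, weight in weights.items():
            self.set_device_weight(device_id, weight)
        if not self.weights:
            raise ValueError("at least one device is required")

    def set_device_weight(self, device_id: int, weight: int) -> None:
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.weights[device_id] = weight
        self.current.setdefault(device_id, 0)

    def next_device(self) -> int:
        total = sum(self.weights.values())
        # Every device earns its weight; the device with the highest score wins...
        for device_id, weight in self.weights.items():
            self.current[device_id] += weight
        best = max(self.current, key=self.current.__getitem__)
        # ...and pays back the total, so over a full cycle picks match the weights.
        self.current[best] -= total
        return best
```

With weights {0: 2, 1: 1} the router yields 0, 1, 0 and then repeats, so device 0 receives two thirds of the allocations while the picks stay interleaved rather than bursty.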
Error handling:
- Exception classes for the high-level C++ APIs.
- Status + ErrorCode for no-exception workflows.
- C API error retrieval via openkl_last_error().
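In a no-exception workflow, failures come back as values instead of being raised. The Python sketch below shows that general pattern; the ErrorCode members and the try_allocate()/try_deallocate() signatures here are assumptions for illustration, not OpenKL's actual definitions.

```python
import enum
from dataclasses import dataclass
from typing import List, Optional, Tuple


class ErrorCode(enum.IntEnum):
    # Hypothetical codes; OpenKL's real ErrorCode values may differ.
    OK = 0
    OUT_OF_MEMORY = 1
    INVALID_ARGUMENT = 2


@dataclass(frozen=True)
class Status:
    code: ErrorCode = ErrorCode.OK
    message: str = ""

    def ok(self) -> bool:
        return self.code == ErrorCode.OK


def try_allocate(free_blocks: List[int]) -> Tuple[Status, Optional[int]]:
    """Return (status, block) instead of raising; block is None on failure."""
    if not free_blocks:
        return Status(ErrorCode.OUT_OF_MEMORY, "pool exhausted"), None
    return Status(), free_blocks.pop()


def try_deallocate(free_blocks: List[int], block: Optional[int]) -> Status:
    """Return a Status instead of raising on a bad argument."""
    if block is None:
        return Status(ErrorCode.INVALID_ARGUMENT, "null block")
    free_blocks.append(block)
    return Status()
```

Callers branch on status.ok() in place of try/except, the same shape as checking ec == OPENKL_OK in the C API example.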
Backends:
- CUDA enabled: device allocations via the CUDA runtime.
- CUDA disabled: host-backed aligned allocation stubs.
- Backend abstraction in source (backend.hpp plus implementations):
  - host backend
  - CUDA backend
  - HIP backend scaffold (stub)
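Host-backed stubs still have to honor the requested alignment (opts.alignment = 64 in the C++ example). A common technique, assumed here rather than taken from OpenKL's source, is to over-allocate by alignment - 1 bytes and round the start address up with a bit mask; the same rounding yields a block stride that keeps every block aligned. align_up(), block_stride(), and aligned_host_buffer() are hypothetical helpers.

```python
import ctypes
from typing import Tuple


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (must be a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def block_stride(block_size: int, alignment: int) -> int:
    """Spacing between consecutive blocks so each block start stays aligned."""
    return align_up(block_size, alignment)


def aligned_host_buffer(size: int, alignment: int) -> Tuple[ctypes.Array, int]:
    """Over-allocate by alignment - 1 bytes, then round the start address up."""
    raw = ctypes.create_string_buffer(size + alignment - 1)
    start = align_up(ctypes.addressof(raw), alignment)
    # The caller must keep `raw` alive for as long as `start` is in use.
    return raw, start
```

The mask trick works only for power-of-two alignments, which is why align_up() rejects values such as 48.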
Platform helpers:
- get_os(), get_os_name()
- cuda_available(), get_device_count(), get_device_name(device_id)
Build and test:
./build.sh
Manual build:
mkdir -p build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .
ctest --output-on-failure
Useful CMake options:
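- -DOPENKL_USE_CUDA=OFF (build without CUDA)
- -DOPENKL_BUILD_PYTHON=OFF (skip the Python bindings)
- -DOPENKL_BUILD_TESTS=OFF (skip the tests)
- -DOPENKL_BUILD_BENCHMARKS=OFF (skip the benchmarks)
- -DOPENKL_BUILD_SHARED=ON (build a shared library)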
Python tests:
PYTHONPATH=python python tests/test_all.py
or
PYTHONPATH=python python -m pytest tests/test_all.py -v
Note: build and test with the same Python interpreter, because the native extension is compiled for a specific interpreter version.
Benchmarks, run from build/:
./bench_alloc
When CUDA is not enabled, the CUDA baseline is skipped and the pool path still runs.
OpenKL/
├── include/openkl/ # Public C++ headers
├── src/ # C++ implementations
├── python/ # Python package + pybind bindings
├── tests/ # C++ and Python tests
├── benchmarks/ # Benchmarks
├── build.sh # Unix/macOS build+test helper
├── build_windows.ps1 # PowerShell build+test helper
├── build_windows.bat # CMD build+test helper
└── CMakeLists.txt
License: Apache-2.0. See LICENSE.
Developed by Aksel Aghajanyan — Aqwel AI.
GitHub: Aksel588