This document explains the internal architecture of the dynemit library.
Dynemit uses a hybrid library architecture that supports both:
- All-in-one mode: Single library containing all features
- Modular mode: Separate libraries for each feature
This is achieved through CMake's object library pattern, which allows compiling code once and linking it into multiple targets.
```
dynemit/
├── src/                  # Core library (CPU detection)
│   ├── dynemit.c             # CPU feature detection
│   └── dynemit_features.c    # Feature list (all-in-one only)
├── features/             # Individual SIMD features
│   ├── vector_add/
│   ├── vector_mul/
│   └── vector_sub/
├── include/dynemit/      # Public headers
└── tests/                # Test suite
```
Each feature is built as an object library (a `*_obj` target):

```cmake
add_library(vector_add_obj OBJECT vector_add.c)
```

Object libraries compile source files but don't create an archive. The compiled objects can then be:
- Bundled into the all-in-one library
- Packaged into individual static libraries
The `libdynemit.a` library combines all object libraries:

```cmake
add_library(dynemit STATIC
    $<TARGET_OBJECTS:dynemit_core_obj>
    $<TARGET_OBJECTS:vector_add_obj>
    $<TARGET_OBJECTS:vector_mul_obj>
    $<TARGET_OBJECTS:vector_sub_obj>
    src/dynemit_features.c
)
```

Benefits:
- Single library to link against
- All features included
- Runtime feature discovery via `dynemit_features()`
Each feature also creates its own static library:

```cmake
add_library(dynemit_vector_add STATIC
    $<TARGET_OBJECTS:vector_add_obj>
)
```

Benefits:
- Minimal binary size (link only what you need)
- Fine-grained control
- No unused code in the final binary
The core library (`libdynemit_core.a`) contains only CPU detection:

```cmake
add_library(dynemit_core STATIC
    $<TARGET_OBJECTS:dynemit_core_obj>
)
```

It is required by all feature libraries and applications.
Each feature uses the ifunc (indirect function) attribute, supported by both GCC and Clang on Linux/ELF targets:
```c
// Resolver runs once at program load
static vector_add_f32_func_t
vector_add_f32_resolver(void)
{
    simd_level_t level = detect_simd_level();
    switch (level) {
    case SIMD_AVX512F: return vector_add_f32_avx512f;
    case SIMD_AVX2:    return vector_add_f32_avx2;
    // ... etc.
    }
}

// Public function uses ifunc for dispatch
void vector_add_f32(const float *a, const float *b, float *out, size_t n)
    __attribute__((ifunc("vector_add_f32_resolver")));
```

How it works:
- At program load time, the dynamic linker calls `vector_add_f32_resolver()`
- The resolver detects CPU features and returns a pointer to the optimal implementation
- All subsequent calls to `vector_add_f32()` go directly to the selected implementation
- Zero runtime overhead after the initial resolution
The `detect_simd_level()` function in `src/dynemit.c`:

```c
simd_level_t detect_simd_level(void)
{
    // 1. Check CPU vendor and max CPUID leaf
    cpuid_x86(0, 0, &eax, &ebx, &ecx, &edx);

    // 2. Check basic features (SSE2, SSE4.2, AVX)
    cpuid_x86(1, 0, &eax, &ebx, &ecx, &edx);

    // 3. Check OS support for YMM/ZMM registers
    if (osxsave)
        xcr0 = xgetbv_x86(0);

    // 4. Check extended features (AVX2, AVX-512F)
    cpuid_x86(7, 0, &eax7, &ebx7, &ecx7, &edx7);

    // 5. Return the highest supported level
    return simd_level;
}
```

Detection criteria:
- SSE2: CPUID.01H:EDX[26]
- SSE4.2: CPUID.01H:ECX[20]
- AVX: CPUID.01H:ECX[28] + XCR0[2:1] (OS support)
- AVX2: CPUID.07H:EBX[5] + XCR0[2:1]
- AVX-512F: CPUID.07H:EBX[16] + XCR0[7:5] (ZMM state)
The `dynemit_features()` function uses weak symbols to provide different implementations.

Default (in `dynemit.c`):

```c
__attribute__((weak))
const char **
dynemit_features(void)
{
    static const char *features[] = { "core", NULL };
    return features;
}
```

All-in-one override (in `dynemit_features.c`):

```c
const char **
dynemit_features(void)
{
    static const char *features[] = {
        "core",
        "vector_add",
        "vector_mul",
        "vector_sub",
        NULL
    };
    return features;
}
```

When linking:
- `libdynemit.a`: uses the all-in-one version (includes `dynemit_features.c`)
- `libdynemit_core.a`: uses the weak default version
`include/dynemit.h` conditionally includes feature headers:

```c
#include <dynemit/core.h>

#ifdef DYNEMIT_ALL_FEATURES
#include <dynemit/vector_add.h>
#include <dynemit/vector_mul.h>
#include <dynemit/vector_sub.h>
#endif
```

The all-in-one library defines `DYNEMIT_ALL_FEATURES`:

```cmake
target_compile_definitions(dynemit PUBLIC DYNEMIT_ALL_FEATURES)
```

Each feature has a minimal header in `include/dynemit/`:
```c
#ifndef DYNEMIT_VECTOR_ADD_H
#define DYNEMIT_VECTOR_ADD_H

#include <stddef.h>

void vector_add_f32(const float *a, const float *b, float *out, size_t n);

#endif
```

Design principles:
- Self-contained (depends only on standard headers and `core.h`)
- No implementation details leaked
- Can be included independently
Every SIMD feature follows this pattern:
```c
// 1. Include intrinsics
#include <immintrin.h>
#include <stddef.h>
#include <dynemit/core.h>
#include <dynemit/compiler.h>

// 2. Scalar fallback
__attribute__((target("default")))
DYNEMIT_NO_AUTOVECTORIZE
static void feature_scalar(...)
{
    DYNEMIT_PRAGMA_NO_VECTORIZE_BEGIN
    /* ... */
}

// 3. SIMD implementations (SSE2, SSE4.2, AVX, AVX2, AVX-512F)
__attribute__((target("sse2")))
static void feature_sse2(...) { /* ... */ }
// ... more SIMD levels ...

// 4. Resolver function
typedef void (*feature_func_t)(...);

static feature_func_t
feature_resolver(void)
{
    simd_level_t level = detect_simd_level();
    switch (level) { /* ... */ }
}

// 5. Public ifunc
__attribute__((target("avx512f,avx2,avx,sse4.2,sse2")))
void feature(...) __attribute__((ifunc("feature_resolver")));
```

- Scalar fallback: ensures portability to non-x86 platforms
- Static implementations: prevent symbol conflicts
- Target attributes: enable specific CPU instructions per function
- Resolver pattern: centralizes the dispatch logic
- ifunc attribute: achieves zero-overhead dispatch
Each feature should have a correctness test:
```c
// Verify the SIMD result matches the expected scalar computation
for (int i = 0; i < N; i++) {
    if (fabsf(result[i] - expected[i]) > epsilon) {
        return FAIL;
    }
}
```

`tests/test_features.c` verifies:
- Runtime feature enumeration works
- All expected features are present
- SIMD level detection succeeds
Benchmarks serve as integration tests:
- Verify end-to-end functionality
- Measure performance
- Detect regressions
- Use unaligned loads (`_mm_loadu_ps`) to avoid imposing alignment requirements on callers
- Process data sequentially to maximize cache hits
- Consider prefetching for large datasets
```c
size_t i = 0;
const size_t step = 16; // SIMD width in elements

// Main SIMD loop
for (; i + step <= n; i += step) {
    // SIMD operations
}

// Scalar tail for remaining elements
for (; i < n; i++) {
    // Scalar operation
}
```

- Build with `-O3` for maximum performance
- Use `__restrict__` for pointer arguments where applicable
- Let the compiler auto-vectorize the scalar tail
On non-x86 systems:
- CPUID returns 0 for all features
- `detect_simd_level()` returns `SIMD_SCALAR`
- The resolver always selects the scalar implementation
- No SIMD intrinsics are used
Required:
- GCC 13+ or Clang 16+ with `ifunc` support (Linux/ELF targets)
- x86-64 target for SIMD optimizations

Not supported:
- MSVC (different intrinsics model)
- macOS/non-ELF platforms (no `ifunc` support)
Possible future extensions:
- ARM NEON support: add ARM-specific implementations
- Multi-architecture: support both x86 and ARM in the same library
- AVX-512 subsets: use AVX-512BW, AVX-512DQ, etc.
- Runtime benchmarking: choose an implementation based on measured performance
- Compile-time options: allow disabling specific SIMD levels
The current architecture scales well up to ~10-20 features. Beyond that, consider:
- Feature categorization (`features/math/`, `features/crypto/`, etc.)
- Automated CMake feature discovery
- A plugin system for dynamic feature loading