Aligned Memory Management¶

SIMD-optimized memory allocation with guaranteed alignment and cache-line awareness.

🎯 Features¶

Aligned Allocation: SSE (16), AVX (32), AVX-512 (64) byte alignment
STL-Compatible: AlignedAllocator works with std::vector, etc.
RAII Buffers: AlignedBuffer for automatic memory management
Cache Alignment: Prevent false sharing with CacheAligned<T>
Cross-Platform: Windows, Linux, macOS support
Zero Overhead: Header-only, compile-time validation

📦 Components¶

1. `aligned_allocator.hpp`¶

Low-level aligned allocation and STL-compatible allocator.

// Low-level functions
void* ptr = aligned_malloc(1024, 32);  // 1KB, 32-byte aligned
aligned_free(ptr);

// STL allocator
std::vector<float, AlignedAllocator<float, 32>> vec(1024);
// vec.data() is guaranteed 32-byte aligned (AVX)

// Convenience aliases
std::vector<float, SSEAllocator<float>> sse_vec(512);    // 16-byte
std::vector<float, AVXAllocator<float>> avx_vec(512);    // 32-byte
std::vector<float, AVX512Allocator<float>> avx512_vec(512);  // 64-byte

2. `aligned_buffer.hpp`¶

RAII-managed buffer with guaranteed alignment.

// Create aligned buffer
AlignedBuffer<float, 32> buffer(1024);  // 1024 floats, 32-byte aligned

// Access
buffer[0] = 1.0f;
buffer.fill(0.5f);
buffer.zero();

// Resize (with optional preserve)
buffer.resize(2048, true);  // Preserve existing data

// Convenience aliases
SSEBuffer<float> sse_buf(512);        // 16-byte aligned
AVXBuffer<float> avx_buf(512);        // 32-byte aligned
AVX512Buffer<float> avx512_buf(512);  // 64-byte aligned

3. `cache_aligned.hpp`¶

Cache-line alignment to prevent false sharing.

// Cache-aligned value (prevents false sharing)
CacheAligned<std::atomic<int>> counter;

// Per-thread data (no false sharing)
CacheAlignedArray<std::atomic<int>, 8> per_thread_counters;

// Manual alignment
struct Data {
    CACHE_ALIGNED std::atomic<int> head;  // Own cache line
    CACHE_ALIGNED std::atomic<int> tail;  // Own cache line
};

// Prefetch hints
for (size_t i = 0; i < size; ++i) {
    if (i + 64 < size) {
        prefetch(&data[i + 64]);  // Prefetch ahead
    }
    process(data[i]);
}

🚀 Usage Examples¶

Audio Buffer Alignment¶

// SSE-optimized audio buffer
AlignedBuffer<float, 16> audio_buffer(512);

#ifdef __SSE__
__m128 v = _mm_load_ps(audio_buffer.data());  // Aligned load (fast!)
#endif

// BAD: False sharing (slow)
struct BadData {
    std::atomic<int> counter1;  // Same cache line
    std::atomic<int> counter2;  // False sharing!
};

// GOOD: No false sharing (fast)
struct GoodData {
    CACHE_ALIGNED std::atomic<int> counter1;  // Own cache line
    CACHE_ALIGNED std::atomic<int> counter2;  // No false sharing
};

Ring Buffer (Producer-Consumer)¶

struct RingBuffer {
    // Producer hot data
    CACHE_ALIGNED std::atomic<size_t> head;
    size_t local_tail;  // Cached consumer position

    CacheLineSeparator sep1;

    // Consumer hot data
    CACHE_ALIGNED std::atomic<size_t> tail;
    size_t local_head;  // Cached producer position

    CacheLineSeparator sep2;

    // Cold data (shared, rarely accessed)
    const size_t capacity;
    float* const buffer;
};

🧪 Testing¶

mkdir build && cd build
cmake ..
cmake --build .
ctest

Or manually:

cd tests
g++ -std=c++17 -O2 -msse -mavx -pthread test_alignment.cpp -o test_alignment
./test_alignment

Test Coverage¶

✅ Aligned malloc/free (16, 32, 64, 128 bytes)
✅ AlignedAllocator with std::vector
✅ AlignedBuffer (construction, resize, copy, move)
✅ SIMD loads (SSE, AVX)
✅ Cache alignment (no false sharing)
✅ False sharing detection (performance comparison)
✅ Prefetch hints

📊 Performance¶

Alignment Impact (aligned vs unaligned loads)¶

SSE aligned:    2.5 ns/load
SSE unaligned:  5.0 ns/load   (2x slower)

AVX aligned:    4.0 ns/load
AVX unaligned:  8.5 ns/load   (2x slower)

Without cache alignment: 1500ms (10M iterations)
With cache alignment:     200ms (10M iterations)
Speedup: 7.5x

🔧 Alignment Requirements¶

SIMD Instruction	Alignment	Buffer Type
SSE	16 bytes	`SSEBuffer<T>`
AVX	32 bytes	`AVXBuffer<T>`
AVX-512	64 bytes	`AVX512Buffer<T>`
Cache line	64 bytes	`CacheAlignedBuffer<T>`

🐛 Common Pitfalls¶

1. Unaligned SIMD Loads¶

// ❌ BAD: May crash with aligned SIMD loads
float data[512];
__m128 v = _mm_load_ps(data);  // Crash if data not 16-byte aligned

// ✅ GOOD: Guaranteed aligned
AlignedBuffer<float, 16> data(512);
__m128 v = _mm_load_ps(data.data());  // Always safe

// ❌ BAD: False sharing between threads
struct PerThreadData {
    std::atomic<int> counter;  // Adjacent in memory
};
PerThreadData thread_data[8];  // False sharing!

// ✅ GOOD: Each counter in own cache line
CacheAlignedArray<std::atomic<int>, 8> thread_counters;

3. Mixing Allocators¶

// ❌ BAD: Can't mix allocators
std::vector<float, SSEAllocator<float>> vec1;
std::vector<float, AVXAllocator<float>> vec2 = vec1;  // Compile error!

// ✅ GOOD: Use same allocator
std::vector<float, SSEAllocator<float>> vec2 = vec1;  // OK

📚 References¶

Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Cache Line Size: https://en.wikipedia.org/wiki/CPU_cache#Cache_performance
False Sharing: https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html

📄 License¶

Part of AudioLab Core library.