Aligned Memory Management
SIMD-optimized memory allocation with guaranteed alignment and cache-line awareness.
Features
- Aligned Allocation: SSE (16), AVX (32), AVX-512 (64) byte alignment
- STL-Compatible: `AlignedAllocator` works with `std::vector`, etc.
- RAII Buffers: `AlignedBuffer` for automatic memory management
- Cache Alignment: Prevent false sharing with `CacheAligned<T>`
- Cross-Platform: Windows, Linux, macOS support
- Zero Overhead: Header-only, compile-time validation
Components
1. aligned_allocator.hpp
Low-level aligned allocation and STL-compatible allocator.
// Low-level functions
void* ptr = aligned_malloc(1024, 32); // 1KB, 32-byte aligned
aligned_free(ptr);
// STL allocator
std::vector<float, AlignedAllocator<float, 32>> vec(1024);
// vec.data() is guaranteed 32-byte aligned (AVX)
// Convenience aliases
std::vector<float, SSEAllocator<float>> sse_vec(512); // 16-byte
std::vector<float, AVXAllocator<float>> avx_vec(512); // 32-byte
std::vector<float, AVX512Allocator<float>> avx512_vec(512); // 64-byte
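For readers curious how such an allocator can be built, here is a minimal sketch. It is not the library's implementation: every `sketch_`/`Sketch` name is hypothetical, and the choice of `std::aligned_alloc` (POSIX-like platforms) and `_aligned_malloc` (Windows) is an assumption about one reasonable backend.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>
#if defined(_WIN32)
#include <malloc.h>
#endif

// Hypothetical stand-in for aligned_malloc; not the library's code.
inline void* sketch_aligned_malloc(std::size_t size, std::size_t alignment) {
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#else
    // std::aligned_alloc expects size to be a multiple of alignment.
    const std::size_t rounded = (size + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, rounded);
#endif
}

// Hypothetical stand-in for aligned_free.
inline void sketch_aligned_free(void* ptr) noexcept {
#if defined(_WIN32)
    _aligned_free(ptr);
#else
    std::free(ptr);
#endif
}

// Minimal STL-style allocator. The explicit rebind is needed because the
// non-type Alignment parameter defeats the default allocator_traits rebind.
template <typename T, std::size_t Alignment>
struct SketchAlignedAllocator {
    static_assert((Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");
    using value_type = T;

    template <typename U>
    struct rebind { using other = SketchAlignedAllocator<U, Alignment>; };

    SketchAlignedAllocator() noexcept = default;
    template <typename U>
    SketchAlignedAllocator(const SketchAlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 0) return nullptr;
        void* p = sketch_aligned_malloc(n * sizeof(T), Alignment);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { sketch_aligned_free(p); }
};

// Stateless allocators with the same alignment always compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const SketchAlignedAllocator<T, A>&,
                const SketchAlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const SketchAlignedAllocator<T, A>&,
                const SketchAlignedAllocator<U, A>&) noexcept { return false; }

// Example alias in the spirit of AVXAllocator (hypothetical name).
using SketchAvxVector = std::vector<float, SketchAlignedAllocator<float, 32>>;
```

The sketch omits overflow checks on `n * sizeof(T)`; a production allocator would likely want them.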
2. aligned_buffer.hpp
RAII-managed buffer with guaranteed alignment.
// Create aligned buffer
AlignedBuffer<float, 32> buffer(1024); // 1024 floats, 32-byte aligned
// Access
buffer[0] = 1.0f;
buffer.fill(0.5f);
buffer.zero();
// Resize (with optional preserve)
buffer.resize(2048, true); // Preserve existing data
// Convenience aliases
SSEBuffer<float> sse_buf(512); // 16-byte aligned
AVXBuffer<float> avx_buf(512); // 32-byte aligned
AVX512Buffer<float> avx512_buf(512); // 64-byte aligned
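The RAII idea behind such a buffer can be sketched with a `std::unique_ptr` and C++17 aligned `operator new`. This is an illustration under assumptions, not the library's code: `SketchAlignedBuffer` is a hypothetical name, the sketch zero-initializes new elements, defaults `preserve` to `false`, and only supports trivially copyable types.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical RAII buffer; memory is released automatically in the destructor.
template <typename T, std::size_t Alignment>
class SketchAlignedBuffer {
    static_assert(std::is_trivially_copyable<T>::value,
                  "this sketch only handles trivially copyable types");

    // Deleter that pairs with the aligned operator new used in allocate().
    struct Deleter {
        void operator()(T* p) const noexcept {
            ::operator delete(p, std::align_val_t(Alignment));
        }
    };
    using Storage = std::unique_ptr<T[], Deleter>;

    static Storage allocate(std::size_t count) {
        void* raw = ::operator new(count * sizeof(T), std::align_val_t(Alignment));
        T* typed = static_cast<T*>(raw);
        std::fill_n(typed, count, T{});  // assumption: new elements start zeroed
        return Storage(typed);
    }

public:
    explicit SketchAlignedBuffer(std::size_t count)
        : data_(allocate(count)), size_(count) {}

    T* data() noexcept { return data_.get(); }
    const T* data() const noexcept { return data_.get(); }
    std::size_t size() const noexcept { return size_; }
    T& operator[](std::size_t i) noexcept { return data_[i]; }

    void fill(const T& value) { std::fill_n(data_.get(), size_, value); }

    // Reallocates; optionally copies the overlapping prefix of the old data.
    void resize(std::size_t count, bool preserve = false) {
        Storage fresh = allocate(count);
        if (preserve) {
            std::copy_n(data_.get(), std::min(count, size_), fresh.get());
        }
        data_ = std::move(fresh);
        size_ = count;
    }

private:
    Storage data_;
    std::size_t size_;
};
```

Because the storage is held in a `std::unique_ptr`, this sketch is move-only; the copy support listed under Test Coverage would need an explicit copy constructor.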
3. cache_aligned.hpp
Cache-line alignment to prevent false sharing.
// Cache-aligned value (prevents false sharing)
CacheAligned<std::atomic<int>> counter;
// Per-thread data (no false sharing)
CacheAlignedArray<std::atomic<int>, 8> per_thread_counters;
// Manual alignment
struct Data {
CACHE_ALIGNED std::atomic<int> head; // Own cache line
CACHE_ALIGNED std::atomic<int> tail; // Own cache line
};
// Prefetch hints
for (size_t i = 0; i < size; ++i) {
if (i + 64 < size) {
prefetch(&data[i + 64]); // Prefetch ahead
}
process(data[i]);
}
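A cache-aligned wrapper and a prefetch helper can be approximated in a few lines. The sketch below assumes a 64-byte cache line and uses the GCC/Clang builtin `__builtin_prefetch`; the names `SketchCacheAligned`, `sketch_prefetch`, and `kSketchCacheLine` are hypothetical and do not come from the library.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64-byte cache lines, which is typical for current x86 CPUs.
constexpr std::size_t kSketchCacheLine = 64;

// Wrapper whose alignment (and therefore size) is a whole cache line,
// so adjacent instances never share a line.
template <typename T>
struct alignas(kSketchCacheLine) SketchCacheAligned {
    T value{};
};

// Read-prefetch hint; compiles to nothing on compilers without the builtin.
inline void sketch_prefetch(const void* address) noexcept {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(address, /*rw=*/0, /*locality=*/3);
#else
    (void)address;
#endif
}

static_assert(alignof(SketchCacheAligned<std::atomic<int>>) == kSketchCacheLine);
static_assert(sizeof(SketchCacheAligned<std::atomic<int>>) == kSketchCacheLine);
```

Padding each value to a full line trades memory for isolation: eight `std::atomic<int>` counters occupy 512 bytes instead of 32.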
Usage Examples
Audio Buffer Alignment
// SSE-optimized audio buffer
AlignedBuffer<float, 16> audio_buffer(512);
#ifdef __SSE__
__m128 v = _mm_load_ps(audio_buffer.data()); // Aligned load (fast!)
#endif
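To show what an aligned buffer enables beyond a single load, here is a small gain loop that processes four samples per iteration. It is a sketch: `sketch_apply_gain` and `SKETCH_HAS_SSE` are hypothetical names, and the function assumes the caller passes a 16-byte-aligned pointer (for example from `AlignedBuffer<float, 16>`). On targets without SSE it falls back to the scalar loop.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Multiplies every sample by `gain` in place.
// Precondition (assumed): `samples` is 16-byte aligned.
inline void sketch_apply_gain(float* samples, std::size_t count, float gain) {
    std::size_t i = 0;
#ifdef SKETCH_HAS_SSE
    const __m128 g = _mm_set1_ps(gain);
    // Stepping by 4 floats (16 bytes) keeps every address aligned.
    for (; i + 4 <= count; i += 4) {
        const __m128 v = _mm_load_ps(samples + i);    // aligned load
        _mm_store_ps(samples + i, _mm_mul_ps(v, g));  // aligned store
    }
#endif
    // Scalar tail, or the whole buffer when SSE is unavailable.
    for (; i < count; ++i) {
        samples[i] *= gain;
    }
}
```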
Avoiding False Sharing
// BAD: False sharing (slow)
struct BadData {
std::atomic<int> counter1; // Same cache line
std::atomic<int> counter2; // False sharing!
};
// GOOD: No false sharing (fast)
struct GoodData {
CACHE_ALIGNED std::atomic<int> counter1; // Own cache line
CACHE_ALIGNED std::atomic<int> counter2; // No false sharing
};
Ring Buffer (Producer-Consumer)
struct RingBuffer {
// Producer hot data
CACHE_ALIGNED std::atomic<size_t> head;
size_t local_tail; // Cached consumer position
CacheLineSeparator sep1;
// Consumer hot data
CACHE_ALIGNED std::atomic<size_t> tail;
size_t local_head; // Cached producer position
CacheLineSeparator sep2;
// Cold data (shared, rarely accessed)
const size_t capacity;
float* const buffer;
};
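The layout above only declares the fields. One way the push and pop operations of a single-producer, single-consumer queue could use them is sketched below. `SketchSpscRing` is a hypothetical class, not part of the library; it uses plain `alignas(64)` in place of `CACHE_ALIGNED`, owns its storage through a `std::vector`, and relies on monotonically increasing counters that are assumed never to wrap.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical SPSC ring: exactly one thread calls push(), one calls pop().
class SketchSpscRing {
public:
    explicit SketchSpscRing(std::size_t capacity)
        : capacity_(capacity), buffer_(capacity) {}

    // Producer side. Returns false when the ring is full.
    bool push(float value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head - cached_tail_ == capacity_) {
            // Looks full: refresh the cached consumer position once.
            cached_tail_ = tail_.load(std::memory_order_acquire);
            if (head - cached_tail_ == capacity_) return false;
        }
        buffer_[head % capacity_] = value;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the ring is empty.
    bool pop(float& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == cached_head_) {
            // Looks empty: refresh the cached producer position once.
            cached_head_ = head_.load(std::memory_order_acquire);
            if (tail == cached_head_) return false;
        }
        out = buffer_[tail % capacity_];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    // Producer hot data on its own cache line.
    alignas(64) std::atomic<std::size_t> head_{0};
    std::size_t cached_tail_ = 0;
    // Consumer hot data on its own cache line.
    alignas(64) std::atomic<std::size_t> tail_{0};
    std::size_t cached_head_ = 0;
    // Cold data, read by both sides but never written after construction.
    alignas(64) const std::size_t capacity_;
    std::vector<float> buffer_;
};
```

The cached positions are the reason for the layout: each side touches the other side's atomic only when its cached copy suggests the ring is full or empty, which keeps cross-core traffic low.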
Testing
To build and run the tests manually:
cd tests
g++ -std=c++17 -O2 -msse -mavx -pthread test_alignment.cpp -o test_alignment
./test_alignment
Test Coverage
- Aligned malloc/free (16, 32, 64, 128 bytes)
- AlignedAllocator with std::vector
- AlignedBuffer (construction, resize, copy, move)
- SIMD loads (SSE, AVX)
- Cache alignment (no false sharing)
- False sharing detection (performance comparison)
- Prefetch hints
Performance
Alignment Impact (aligned vs unaligned loads)
SSE aligned: 2.5 ns/load
SSE unaligned: 5.0 ns/load (2x slower)
AVX aligned: 4.0 ns/load
AVX unaligned: 8.5 ns/load (~2.1x slower)
False Sharing Impact
Without cache alignment: 1500ms (10M iterations)
With cache alignment: 200ms (10M iterations)
Speedup: 7.5x
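A comparison like the one above can be reproduced with a small two-thread micro-benchmark. The sketch below is not the library's test: `SketchPacked`, `SketchPadded`, and `sketch_time_ms` are hypothetical names, 64-byte cache lines are assumed, and the measured times will vary with CPU, compiler flags, and thread placement.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Both counters share one cache line (false sharing is likely).
struct SketchPacked {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Each counter sits on its own 64-byte line (assumed cache-line size).
struct SketchPadded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

// Runs two threads, each incrementing its own counter `iterations` times,
// and returns the elapsed wall-clock time in milliseconds.
template <typename Counters>
double sketch_time_ms(Counters& counters, long iterations) {
    auto bump = [iterations](std::atomic<long>& counter) {
        for (long i = 0; i < iterations; ++i) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };
    const auto start = std::chrono::steady_clock::now();
    std::thread first([&] { bump(counters.a); });
    std::thread second([&] { bump(counters.b); });
    first.join();
    second.join();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `sketch_time_ms` once with a `SketchPacked` instance and once with a `SketchPadded` instance, using the same iteration count, gives the two figures to compare. Build with `-pthread` as in the test command above.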
Alignment Requirements
| Target | Alignment | Buffer Type |
|---|---|---|
| SSE | 16 bytes | SSEBuffer<T> |
| AVX | 32 bytes | AVXBuffer<T> |
| AVX-512 | 64 bytes | AVX512Buffer<T> |
| Cache line | 64 bytes | CacheAlignedBuffer<T> |
Common Pitfalls
1. Unaligned SIMD Loads
// BAD: May crash with aligned SIMD loads
float data[512];
__m128 v = _mm_load_ps(data); // Crash if data not 16-byte aligned
// GOOD: Guaranteed aligned
AlignedBuffer<float, 16> data(512);
__m128 v = _mm_load_ps(data.data()); // Always safe
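When a pointer comes from code you do not control, a cheap runtime check can catch misalignment before an aligned load faults. The helper below is a sketch with a hypothetical name, `sketch_is_aligned`. If the check fails, the unaligned intrinsic `_mm_loadu_ps` is the safe alternative to `_mm_load_ps`; for stack arrays, declaring them with `alignas(16)` avoids the problem in the first place.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when `pointer` is a multiple of `alignment` bytes.
// `alignment` is assumed to be non-zero.
inline bool sketch_is_aligned(const void* pointer, std::size_t alignment) noexcept {
    return reinterpret_cast<std::uintptr_t>(pointer) % alignment == 0;
}
```

A typical use is an `assert(sketch_is_aligned(ptr, 16));` at the top of a SIMD routine in debug builds.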
2. False Sharing
// BAD: False sharing between threads
struct PerThreadData {
std::atomic<int> counter; // Adjacent in memory
};
PerThreadData thread_data[8]; // False sharing!
// GOOD: Each counter in own cache line
CacheAlignedArray<std::atomic<int>, 8> thread_counters;
3. Mixing Allocators
// BAD: Can't mix allocators
std::vector<float, SSEAllocator<float>> vec1;
std::vector<float, AVXAllocator<float>> vec2 = vec1; // Compile error!
// GOOD: Use same allocator
std::vector<float, SSEAllocator<float>> vec2 = vec1; // OK
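If the data really does need to move between vectors with different allocators, the iterator-range constructor of `std::vector` copies the elements without requiring the types to match. The sketch below is self-contained, so it defines a tiny hypothetical allocator, `SketchAlloc`, built on C++17 aligned `operator new`; with the library you would substitute `SSEAllocator` and `AVXAllocator`, assuming they meet the standard allocator requirements.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical minimal aligned allocator, used only to get two distinct
// vector types for the example.
template <typename T, std::size_t Alignment>
struct SketchAlloc {
    using value_type = T;
    template <typename U>
    struct rebind { using other = SketchAlloc<U, Alignment>; };

    SketchAlloc() noexcept = default;
    template <typename U>
    SketchAlloc(const SketchAlloc<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const SketchAlloc<T, A>&, const SketchAlloc<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const SketchAlloc<T, A>&, const SketchAlloc<U, A>&) noexcept { return false; }

// Copies elements into a container of a different type (for example one
// with a different allocator) via the iterator-range constructor.
template <typename To, typename From>
To sketch_copy_as(const From& source) {
    return To(source.begin(), source.end());
}
```

This performs an element-wise copy into freshly allocated storage, so it costs one allocation plus a copy of the data.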
References
- Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
- Cache Line Size: https://en.wikipedia.org/wiki/CPU_cache#Cache_performance
- False Sharing: https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
License
Part of the AudioLab Core library.