Cache-Optimized Audio Processing Variants
Status: NOT STARTED
Priority: HIGH
Dependencies: TAREA 0 (Variant Framework), TAREA 1 (SIMD Variants)
Estimated Effort: 2-3 weeks
Purpose
Cache variants optimize data access patterns to maximize CPU cache utilization, targeting a further 20-40% performance gain on top of SIMD optimizations. A main-memory access costs 25-50x more cycles than an L1 hit, so cache-friendly algorithms can dramatically reduce effective memory latency.
Cache Hierarchy
```text
┌──────────────────────────────────┐
│ L1 Cache: 32-64 KB per core      │
│   Latency:   ~4 cycles           │ ← Fastest
│   Bandwidth: ~100 GB/s           │
├──────────────────────────────────┤
│ L2 Cache: 256 KB - 1 MB per core │
│   Latency:   ~12 cycles          │ ← Fast
│   Bandwidth: ~50 GB/s            │
├──────────────────────────────────┤
│ L3 Cache: 8-128 MB shared        │
│   Latency:   ~40 cycles          │ ← Medium
│   Bandwidth: ~25 GB/s            │
├──────────────────────────────────┤
│ Main RAM: GBs                    │
│   Latency:   ~100-200 cycles     │ ← Slow
│   Bandwidth: ~10-20 GB/s         │
└──────────────────────────────────┘
```
Goal: Keep working set in L1/L2 cache as much as possible.
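As a quick sanity check on what "fits in L1" means in samples, here is a minimal sketch (the 32 KB figure and the two-buffer working set are assumptions; production code should use the detected cache size):

```cpp
#include <cstddef>

// Back-of-the-envelope block sizing, assuming a 32 KB L1 data cache
// shared between one input and one output buffer of floats.
constexpr std::size_t kL1DataCacheBytes = 32 * 1024; // assumed, not detected
constexpr std::size_t kBuffersInFlight  = 2;         // input + output
constexpr std::size_t kMaxBlockSamples =
    kL1DataCacheBytes / (kBuffersInFlight * sizeof(float)); // = 4096 samples
```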
Planned Features
1. Cache Blocking (Tiling)
Break large operations into cache-sized blocks:
```cpp
// Naive: stream the entire buffer through each processing stage.
// Once the buffer outgrows the cache, every stage re-fetches it from RAM.
for (size_t i = 0; i < numSamples; ++i) {
    output[i] = process(input[i]);
}

// Process in L1-cache-sized blocks instead
constexpr size_t BLOCK_SIZE = 1024; // 4 KB of floats; fits comfortably in L1
for (size_t block = 0; block < numSamples; block += BLOCK_SIZE) {
    size_t blockSize = std::min(BLOCK_SIZE, numSamples - block);
    processBlock(&input[block], &output[block], blockSize);
}
```
Expected Benefit: 15-25% faster
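Blocking pays off most when several stages touch the same data, because fusing the stages per block keeps each block hot in L1 between them. A minimal sketch, assuming two hypothetical DSP stages (`stage1` and `stage2` are placeholders, not project APIs):

```cpp
#include <algorithm>
#include <cstddef>

void stage1(const float* in, float* out, std::size_t n); // hypothetical pass 1
void stage2(const float* in, float* out, std::size_t n); // hypothetical pass 2

void processTwoStagesBlocked(const float* input, float* output,
                             std::size_t numSamples) {
    constexpr std::size_t BLOCK_SIZE = 1024;
    float scratch[BLOCK_SIZE]; // 4 KB; stays L1-resident across both stages
    for (std::size_t block = 0; block < numSamples; block += BLOCK_SIZE) {
        const std::size_t n = std::min(BLOCK_SIZE, numSamples - block);
        stage1(&input[block], scratch, n);  // block pulled into L1 here
        stage2(scratch, &output[block], n); // reads scratch while still cached
    }
}
```

Without blocking, the second full-buffer pass would re-fetch every sample from RAM once the buffer outgrows the cache.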
2. Data Prefetching
Explicitly tell the CPU to load data into cache before it is needed:
```cpp
#include <xmmintrin.h> // _mm_prefetch (x86)

class PrefetchGainVariant : public IVariant {
public:
    bool process(const float* input, float* output, size_t numSamples) override {
        constexpr size_t PREFETCH_DISTANCE = 64; // samples ahead
        for (size_t i = 0; i < numSamples; ++i) {
            // Prefetch future data into L1 while current samples are processed
            if (i + PREFETCH_DISTANCE < numSamples) {
                _mm_prefetch(reinterpret_cast<const char*>(
                                 &input[i + PREFETCH_DISTANCE]),
                             _MM_HINT_T0);
            }
            output[i] = input[i] * gain_;
        }
        return true;
    }

private:
    float gain_ = 1.0f;
};
```
Expected Benefit: 10-20% faster for large buffers
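`_mm_prefetch` is x86-only; on GCC/Clang, `__builtin_prefetch` compiles to the target's prefetch instruction (or to nothing). A sketch of a small portability shim (the name is illustrative, not an existing project API):

```cpp
#if defined(__SSE__)
  #include <xmmintrin.h> // _mm_prefetch
#endif

// Hint the CPU to pull `p` toward L1 for a read; harmless if ignored.
inline void prefetchRead(const void* p) {
#if defined(__SSE__)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#elif defined(__GNUC__)
    __builtin_prefetch(p, /*rw=*/0, /*locality=*/3); // GCC/Clang builtin
#else
    (void)p; // no-op where no prefetch hint is available
#endif
}
```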
3. Loop Tiling for Filters
Optimize IIR/FIR filters with cache blocking:
```cpp
// Standard FIR convolution: every output sample re-streams the whole IR
// plus an irLength-sample input window; for long IRs this working set
// no longer fits in cache.
for (size_t i = 0; i < numSamples; ++i) {
    float sum = 0.0f;
    for (size_t j = 0; j < irLength; ++j) {
        // Signed index: assumes irLength-1 samples of history before input[0]
        sum += input[static_cast<ptrdiff_t>(i) - static_cast<ptrdiff_t>(j)] * ir[j];
    }
    output[i] = sum;
}

// Cache-blocked FIR (better locality): the tile plus the IR stay resident
constexpr size_t TILE_SIZE = 256;
for (size_t tile = 0; tile < numSamples; tile += TILE_SIZE) {
    // Process one tile with the IR held in cache
    processFIRTile(&input[tile], &output[tile],
                   std::min(TILE_SIZE, numSamples - tile), ir, irLength);
}
```
Expected Benefit: 30-50% faster for long IRs
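A naive sketch of what `processFIRTile` could look like, matching the call above (boundary handling is simplified: the caller must guarantee `irLength - 1` samples of valid history before `input[0]`):

```cpp
#include <cstddef>

void processFIRTile(const float* input, float* output, std::size_t tileSize,
                    const float* ir, std::size_t irLength) {
    for (std::size_t i = 0; i < tileSize; ++i) {
        float sum = 0.0f;
        for (std::size_t j = 0; j < irLength; ++j) {
            // Signed index: j can exceed i when reading into the history
            sum += input[static_cast<std::ptrdiff_t>(i) -
                         static_cast<std::ptrdiff_t>(j)] * ir[j];
        }
        output[i] = sum;
    }
}
```

Because the tile is short, the IR and the input window stay cached while every output sample in the tile reuses them.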
4. Structure-of-Arrays (SoA) Layouts
Transform data layout for better vectorization and cache usage:
```cpp
// Array-of-Structures (AoS): channels interleaved, so per-channel
// access is strided
struct Frame {
    float left;
    float right;
};
Frame frames[1024]; // L R L R L R ... (interleaved)

// Structure-of-Arrays (SoA): each channel contiguous, unit-stride access
struct Frames {
    float left[1024];  // L L L L ...
    float right[1024]; // R R R R ...
};
```
Expected Benefit: 20-30% faster for stereo processing
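A sketch of how the SoA layout is used in practice, reusing the `Frame`/`Frames` types above (the deinterleave pass and gain loops are illustrative; `n` must not exceed the 1024-frame capacity):

```cpp
#include <cstddef>

// Deinterleave once, then run unit-stride per-channel loops that
// vectorize cleanly and use every byte of each fetched cache line.
void applyStereoGainSoA(const Frame* frames, Frames& soa, std::size_t n,
                        float gainL, float gainR) {
    for (std::size_t i = 0; i < n; ++i) { // one-time layout transform
        soa.left[i]  = frames[i].left;
        soa.right[i] = frames[i].right;
    }
    for (std::size_t i = 0; i < n; ++i) soa.left[i]  *= gainL; // unit stride
    for (std::size_t i = 0; i < n; ++i) soa.right[i] *= gainR; // unit stride
}
```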
5. Cache-Aware Sorting/Ordering
Optimize processing order to minimize cache misses:
```cpp
#include <algorithm>
#include <functional>

// Process voices in cache-friendly order (voices held by pointer)
void processVoices(Voice** voices, int numVoices) {
    // Sort the pointers by address so the traversal walks memory
    // forward (spatial locality; hardware-prefetcher friendly)
    std::sort(voices, voices + numVoices, std::less<Voice*>{});

    // Process in address order
    for (int i = 0; i < numVoices; ++i) {
        voices[i]->process();
    }
}
```
Expected Benefit: 10-15% faster for many voices
Proposed Architecture
Cache-Aware Variant Base Class
```cpp
class CacheOptimizedVariant : public IVariant {
protected:
    // Determine the optimal block size from the cache hierarchy
    size_t getOptimalBlockSize(size_t dataSize, CacheLevel targetCache) const;

    // Prefetch data ahead of processing
    void prefetchData(const void* address, PrefetchHint hint) const;

    // Allocate data aligned to cache line boundaries
    void* allocateCacheAligned(size_t size, size_t alignment = 64) const;

    // Cache hierarchy info
    const CPUInfo& cpuInfo_;
};
```
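A sketch of how `getOptimalBlockSize` might be derived from the detected cache sizes. The `CacheLevel` values and the `CPUInfo` field names are assumptions based on the declaration above, not a finished API:

```cpp
#include <algorithm>
#include <cstddef>

std::size_t CacheOptimizedVariant::getOptimalBlockSize(
        std::size_t dataSize, CacheLevel targetCache) const {
    // Assumed CPUInfo fields; the real detection API may differ.
    const std::size_t cacheBytes = (targetCache == CacheLevel::L1)
                                       ? cpuInfo_.l1DataCacheBytes
                                       : cpuInfo_.l2CacheBytes;
    // Target roughly half the cache so input, output, and filter
    // state can all be resident at once.
    const std::size_t maxSamples = (cacheBytes / 2) / sizeof(float);
    return std::min(dataSize, maxSamples);
}
```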
Specific Variants
```cpp
class CacheBlockedConvolutionVariant : public CacheOptimizedVariant {
public:
    bool process(const float* input, float* output, size_t numSamples) override;
    void setImpulseResponse(const float* ir, size_t irLength);

private:
    void processTile(const float* input, float* output,
                     size_t tileSize, const float* ir, size_t irLength);

    size_t tileSize_; // calculated from the L2 cache size
};

class CacheOptimizedStereoVariant : public CacheOptimizedVariant {
public:
    bool processStereo(const float* inL, const float* inR,
                       float* outL, float* outR, size_t numSamples) override;

private:
    // Uses an SoA layout internally
    AlignedBuffer<float> tempBuffers_[2];
};
```
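Hypothetical call site for the convolution variant, assuming the declarations sketched above (`irData`, the buffers, and `numSamples` are prepared elsewhere):

```cpp
CacheBlockedConvolutionVariant conv;
conv.setImpulseResponse(irData, 4096);               // tileSize_ derived from L2 size
conv.process(inputBuffer, outputBuffer, numSamples); // runs tile by tile
```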
Performance Targets
Cache Miss Reduction
| Variant | Cache Miss Rate (Baseline) | Cache Miss Rate (Optimized) | Reduction |
|---|---|---|---|
| FIR Convolution (4096) | 45% | 12% | 73% reduction |
| Stereo Processing | 30% | 8% | 73% reduction |
| Multi-Voice Synth | 55% | 20% | 64% reduction |
Performance Improvements
| Operation | SIMD Only | + Cache Optimization | Additional Gain |
|---|---|---|---|
| Long FIR (4096 taps) | 8x | 11x | +37% |
| Stereo Gain | 6.7x | 8.5x | +27% |
| Multi-Biquad (10 filters) | 2.5x | 3.5x | +40% |
Implementation Plan
Week 1: Infrastructure
Tasks:
1. Implement cache size detection (from CPUDetection)
2. Create CacheOptimizedVariant base class
3. Implement cache-aligned memory allocation (see the sketch after the deliverables list below)
4. Implement prefetch helper functions
5. Build benchmarking infrastructure for cache performance
Deliverables:
- CacheOptimizedVariant.h/cpp
- Cache detection utilities
- Alignment helpers
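For task 3, a minimal sketch of cache-aligned allocation using standard C++17 aligned `operator new` (the real `allocateCacheAligned` may wrap a platform allocator instead):

```cpp
#include <cstddef>
#include <new>

void* allocateCacheAligned(std::size_t size, std::size_t alignment = 64) {
    // Throws std::bad_alloc on failure; alignment must be a power of two.
    return ::operator new(size, std::align_val_t{alignment});
}

void freeCacheAligned(void* p, std::size_t alignment = 64) {
    // Must match the aligned allocation above.
    ::operator delete(p, std::align_val_t{alignment});
}
```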
Week 2: Basic Cache Variants
Tasks:
1. Implement cache-blocked gain variant
2. Implement prefetch gain variant
3. Implement cache-blocked convolution
4. Benchmark vs SIMD-only variants
5. Tune block sizes for different CPUs
Deliverables:
- CacheBlockedGainVariant
- CacheBlockedConvolutionVariant
- Performance comparison data
Week 3: Advanced Optimizations
Tasks:
1. SoA stereo processing variants
2. Cache-aware multi-voice processing
3. Integration with existing SIMD variants
4. Documentation and examples
5. Validation tests
Deliverables:
- SoA stereo variants
- Complete documentation
- Test suite
Technical Details
Cache Line Size Optimization
```cpp
// Align to cache line (64 bytes on x86)
struct alignas(64) AudioBuffer {
    float samples[16]; // 64 bytes = 16 floats
    // The next AudioBuffer in an array starts on a new cache line
};
```
Software Prefetching
```cpp
// Prefetch hints (x86)
_mm_prefetch(ptr, _MM_HINT_T0);  // into every cache level (L1 and up)
_mm_prefetch(ptr, _MM_HINT_T1);  // into L2 and up
_mm_prefetch(ptr, _MM_HINT_T2);  // into L3 and up
_mm_prefetch(ptr, _MM_HINT_NTA); // non-temporal (minimize cache pollution)
```
Cache-Conscious Data Structures
```cpp
// Bad: scattered data (poor spatial locality)
class ScatteredVoice {
    float* oscillator; // separate heap allocation, far from the others
    float* filter;     // separate heap allocation
    float* envelope;   // separate heap allocation
};

// Good: colocated data (good spatial locality)
class ColocatedVoice {
    float oscillatorData[64]; // inline, cache-friendly
    float filterData[64];     // adjacent in memory
    float envelopeData[64];   // adjacent in memory
};
```
Key Concepts
1. Spatial Locality
Access memory locations that are close together:
```cpp
// Good spatial locality
for (int i = 0; i < N; ++i) {
    array[i] = ...; // sequential access
}

// Poor spatial locality
for (int i = 0; i < N; i += stride) {
    array[i] = ...; // strided access wastes most of each cache line
}
```
2. Temporal Locality
Reuse recently accessed data:
```cpp
// Good temporal locality
float temp = input[i];
output1[i] = temp * gain1;
output2[i] = temp * gain2; // 'temp' still in a register or L1

// Poor temporal locality
output1[i] = input[i] * gain1;
// ... 1000 lines of code ...
output2[i] = input[i] * gain2; // 'input[i]' likely evicted by now
```
3. Cache Line Sharing
Avoid false sharing in multi-threaded code:
```cpp
// Bad: false sharing (threads contend for the same cache line)
struct SharedCounters {
    alignas(64) int thread1_counter;
    int thread2_counter; // lands on the same cache line!
};

// Good: one cache line per counter
struct PaddedCounters {
    alignas(64) int thread1_counter;
    alignas(64) int thread2_counter; // separate cache lines
};
```
Profiling Tools
Linux: perf
```bash
perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./benchmark
```
Example performance counter stats:
```text
 1,234,567  cache-references
   123,456  cache-misses            # 10.00% of all cache refs
 9,876,543  L1-dcache-loads
   987,654  L1-dcache-load-misses   # 10.00% of all L1 loads
```
Intel: VTune
- Cache Miss Analysis
- Memory Access Patterns
- Cache Hierarchy Visualization
Windows: VS Performance Profiler
- CPU Sampling
- Instrumentation
- Memory Usage
Success Criteria
- 20-40% performance improvement over SIMD-only
- Cache miss rate reduced by 50%+
- Works on Intel, AMD, and ARM CPUs
- Maintains accuracy (no quality degradation)
- Validated with profiling tools
Documentation Needed
- Cache hierarchy explanation
- Block size selection guide
- Prefetch strategy guide
- SoA conversion guide
- Profiling guide (perf, VTune)
- Integration examples
Dependencies
Required:
- TAREA 0: Variant Framework (CPUDetection for cache info)
- TAREA 1: SIMD Variants (baseline to improve upon)

Optional:
- TAREA 5: Threading Variants (cache-aware NUMA)
Challenges
- CPU-Specific Tuning: Optimal block sizes vary by CPU
- Complex Trade-offs: Sometimes cache optimization hurts SIMD efficiency
- Hard to Debug: Cache issues don't crash; they just slow things down
- Profiling Required: Must use tools like perf/VTune to verify gains
Status: Ready to start after TAREA 1 completion
Priority: HIGH - can provide 20-40% additional gains
Contact: performance@audiolab.com
"Cache Variants: Because cache misses cost 100+ cycles!" β‘πΎ