

Cache-Optimized Audio Processing Variants

Status: ⏸️ NOT STARTED
Priority: HIGH
Dependencies: TAREA 0 (Variant Framework), TAREA 1 (SIMD Variants)
Estimated Effort: 2-3 weeks


🎯 Purpose

Cache variants optimize data access patterns to maximize CPU cache utilization, targeting an additional 20-40% performance on top of SIMD optimizations. Cache capacity is limited and main-memory latency is high, so cache-friendly algorithms can dramatically reduce the time the CPU spends waiting on memory.

Cache Hierarchy

┌──────────────────────────────────┐
│ L1 Cache: 32-64 KB per core      │
│ Latency: ~4 cycles               │ ← Fastest
│ Bandwidth: ~100 GB/s             │
├──────────────────────────────────┤
│ L2 Cache: 256 KB - 1 MB per core │
│ Latency: ~12 cycles              │ ← Fast
│ Bandwidth: ~50 GB/s              │
├──────────────────────────────────┤
│ L3 Cache: 8-128 MB shared        │
│ Latency: ~40 cycles              │ ← Medium
│ Bandwidth: ~25 GB/s              │
├──────────────────────────────────┤
│ Main RAM: GBs                    │
│ Latency: ~100-200 cycles         │ ← Slow
│ Bandwidth: ~10-20 GB/s           │
└──────────────────────────────────┘

Goal: Keep working set in L1/L2 cache as much as possible.


📋 Planned Features

1. Cache Blocking (Tiling)

Break large operations into cache-sized blocks:

// Naive: each stage streams the entire buffer before the next stage runs,
// so early samples are evicted from cache before they are touched again
for (size_t i = 0; i < numSamples; ++i) {
    output[i] = process(input[i]);
}

// Process in L1-cache-sized blocks
constexpr size_t BLOCK_SIZE = 1024;  // Fits in L1
for (size_t block = 0; block < numSamples; block += BLOCK_SIZE) {
    size_t blockSize = std::min(BLOCK_SIZE, numSamples - block);
    processBlock(&input[block], &output[block], blockSize);
}
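
Blocking pays off mainly when several stages pass over the same data; a single streaming pass is already cache-friendly. Here is a minimal sketch of the difference, using two hypothetical stages (applyGain, softClip) as stand-ins for real DSP:

#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical stages, shown only to illustrate blocking.
static void applyGain(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= 0.5f;
}
static void softClip(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) data[i] = std::tanh(data[i]);
}

// Unblocked: each stage streams the whole buffer, so by the time
// softClip runs, the start of the buffer has left L1.
void processUnblocked(float* data, std::size_t numSamples) {
    applyGain(data, numSamples);
    softClip(data, numSamples);
}

// Blocked: both stages run on an L1-sized chunk while it is still hot.
void processBlocked(float* data, std::size_t numSamples) {
    constexpr std::size_t BLOCK_SIZE = 1024;  // 4 KB of floats
    for (std::size_t block = 0; block < numSamples; block += BLOCK_SIZE) {
        const std::size_t n = std::min(BLOCK_SIZE, numSamples - block);
        applyGain(data + block, n);
        softClip(data + block, n);
    }
}

In the blocked version, each 4 KB chunk goes through both stages while it is still resident in L1.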

Expected Benefit: 15-25% faster

2. Data Prefetching

Explicitly tell CPU to load data into cache before it's needed:

#include <xmmintrin.h>  // _mm_prefetch

class PrefetchGainVariant : public IVariant {
public:
    bool process(const float* input, float* output, size_t numSamples) override {
        constexpr size_t PREFETCH_DISTANCE = 64;  // samples ahead

        for (size_t i = 0; i < numSamples; ++i) {
            // Prefetch future data into cache before it is needed
            if (i + PREFETCH_DISTANCE < numSamples) {
                _mm_prefetch((const char*)&input[i + PREFETCH_DISTANCE], _MM_HINT_T0);
            }

            output[i] = input[i] * gain_;
        }
        return true;
    }

private:
    float gain_ = 1.0f;
};

Expected Benefit: 10-20% faster for large buffers. Note that hardware prefetchers already handle simple sequential streams well, so explicit prefetching pays off mainly for irregular or strided access patterns.

3. Loop Tiling for Filters

Optimize IIR/FIR filters with cache blocking:

// Standard FIR convolution (cache-unfriendly for long IRs)
// (assumes irLength - 1 samples of valid history precede input[0])
for (size_t i = 0; i < numSamples; ++i) {
    float sum = 0.0f;
    for (size_t j = 0; j < irLength; ++j) {  // IR + input window overflow L1 for long IRs
        sum += input[(ptrdiff_t)i - (ptrdiff_t)j] * ir[j];
    }
    output[i] = sum;
}

// Cache-blocked FIR (better locality)
constexpr size_t TILE_SIZE = 256;
for (size_t tile = 0; tile < numSamples; tile += TILE_SIZE) {
    size_t tileSize = std::min(TILE_SIZE, numSamples - tile);
    // Process one tile while the IR stays resident in cache
    processFIRTile(&input[tile], &output[tile], tileSize, ir, irLength);
}
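
One plausible shape for processFIRTile, as a sketch only (not the project's actual helper): it assumes the caller guarantees irLength - 1 samples of valid history immediately before the input pointer.

// Direct-form FIR over one tile. `input` points at the tile's first sample;
// input[-1] ... input[-(irLength-1)] must be valid history (assumption).
void processFIRTile(const float* input, float* output, size_t tileSize,
                    const float* ir, size_t irLength) {
    for (size_t i = 0; i < tileSize; ++i) {
        float sum = 0.0f;
        for (size_t j = 0; j < irLength; ++j) {
            sum += input[(ptrdiff_t)i - (ptrdiff_t)j] * ir[j];
        }
        output[i] = sum;
    }
}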

Expected Benefit: 30-50% faster for long IRs

4. Structure-of-Arrays (SoA) Layouts

Transform data layout for better vectorization and cache usage:

// Array-of-Structures (AoS) - interleaved; per-channel processing wastes half of each cache line
struct Frame {
    float left;
    float right;
};
Frame frames[1024];  // L R L R L R ... (interleaved)

// Structure-of-Arrays (SoA) - better cache locality
struct Frames {
    float left[1024];   // L L L L ...
    float right[1024];  // R R R R ...
};
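
As a sketch of the payoff, a per-channel gain over this SoA layout becomes two unit-stride loops that compilers can vectorize without de-interleaving shuffles (applyStereoGain is a hypothetical helper):

// Each loop walks one contiguous array, consuming full cache lines.
void applyStereoGain(Frames& f, size_t n, float gainL, float gainR) {
    for (size_t i = 0; i < n; ++i) f.left[i]  *= gainL;
    for (size_t i = 0; i < n; ++i) f.right[i] *= gainR;
}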

Expected Benefit: 20-30% faster for stereo processing

5. Cache-Aware Sorting/Ordering

Optimize processing order to minimize cache misses:

// Process voices in cache-friendly order
void processVoices(Voice** voices, int numVoices) {
    // Sort the pointer array by address so traversal walks memory
    // in one direction (std::less gives a total order over pointers)
    std::sort(voices, voices + numVoices, std::less<Voice*>());

    // Process in ascending address order (spatial locality)
    for (int i = 0; i < numVoices; ++i) {
        voices[i]->process();
    }
}

Expected Benefit: 10-15% faster for many voices


πŸ—οΈ Proposed Architecture

Cache-Aware Variant Base Class

class CacheOptimizedVariant : public IVariant {
protected:
    // Determine optimal block size based on cache hierarchy
    size_t getOptimalBlockSize(size_t dataSize, CacheLevel targetCache) const;

    // Prefetch data ahead of processing
    void prefetchData(const void* address, PrefetchHint hint) const;

    // Align data to cache line boundaries
    void* allocateCacheAligned(size_t size, size_t alignment = 64) const;

    // Cache hierarchy info
    const CPUInfo& cpuInfo_;
};
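
One plausible implementation of getOptimalBlockSize, assuming cpuInfo_ exposes per-level cache sizes in bytes (the cacheSizeBytes accessor is hypothetical):

size_t CacheOptimizedVariant::getOptimalBlockSize(size_t dataSize,
                                                  CacheLevel targetCache) const {
    // Target roughly half the chosen cache level so input, output, and
    // coefficients can coexist without evicting each other.
    const size_t budget = cpuInfo_.cacheSizeBytes(targetCache) / 2;  // hypothetical accessor
    const size_t maxSamples = budget / sizeof(float);
    return std::min(dataSize, maxSamples);
}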

Specific Variants

class CacheBlockedConvolutionVariant : public CacheOptimizedVariant {
public:
    bool process(const float* input, float* output, size_t numSamples) override;
    void setImpulseResponse(const float* ir, size_t irLength);

private:
    void processTile(const float* input, float* output,
                     size_t tileSize, const float* ir, size_t irLength);

    size_t tileSize_;  // Calculated from L2 cache size
};

class CacheOptimizedStereoVariant : public CacheOptimizedVariant {
public:
    bool processStereo(const float* inL, const float* inR,
                       float* outL, float* outR, size_t numSamples) override;

private:
    // Uses SoA layout internally
    AlignedBuffer<float> tempBuffers_[2];
};

📊 Performance Targets

Cache Miss Reduction

| Variant                 | Cache Misses (Baseline) | Cache Misses (Optimized) | Reduction |
|-------------------------|-------------------------|--------------------------|-----------|
| FIR Convolution (4096)  | 45%                     | 12%                      | 73%       |
| Stereo Processing       | 30%                     | 8%                       | 73%       |
| Multi-Voice Synth       | 55%                     | 20%                      | 64%       |

Performance Improvements

| Operation                 | SIMD Only | + Cache Optimization | Additional Gain |
|---------------------------|-----------|----------------------|-----------------|
| Long FIR (4096 taps)      | 8x        | 11x                  | +37%            |
| Stereo Gain               | 6.7x      | 8.5x                 | +27%            |
| Multi-Biquad (10 filters) | 2.5x      | 3.5x                 | +40%            |

πŸ› οΈ Implementation Plan

Week 1: Infrastructure

Tasks:
  1. Implement cache size detection (from CPUDetection; see the sketch after this week's deliverables)
  2. Create CacheOptimizedVariant base class
  3. Implement cache-aligned memory allocation
  4. Prefetch helper functions
  5. Benchmarking infrastructure for cache performance

Deliverables:
  • CacheOptimizedVariant.h/cpp
  • Cache detection utilities
  • Alignment helpers
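
For the cache size detection task, glibc-based Linux systems expose data-cache sizes through sysconf; a minimal sketch (other platforms would need CPUID or sysctl instead):

#include <unistd.h>

// Returns the L1 data cache size in bytes, with a conservative
// 32 KB fallback when the value is unavailable.
size_t l1DataCacheSize() {
    long bytes = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    return bytes > 0 ? static_cast<size_t>(bytes) : 32 * 1024;
}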

Week 2: Basic Cache Variants

Tasks:
  1. Implement cache-blocked gain variant
  2. Implement prefetch gain variant
  3. Implement cache-blocked convolution
  4. Benchmark vs SIMD-only variants
  5. Tune block sizes for different CPUs

Deliverables:
  • CacheBlockedGainVariant
  • CacheBlockedConvolutionVariant
  • Performance comparison data

Week 3: Advanced Optimizations

Tasks:
  1. SoA stereo processing variants
  2. Cache-aware multi-voice processing
  3. Integration with existing SIMD variants
  4. Documentation and examples
  5. Validation tests

Deliverables:
  • SoA stereo variants
  • Complete documentation
  • Test suite


🔧 Technical Details

Cache Line Size Optimization

// Align to cache line (64 bytes on x86)
struct alignas(64) AudioBuffer {
    float samples[16];  // 64 bytes = 16 floats
    // Next buffer starts on new cache line
};
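
A free-function sketch of cache-aligned allocation using C++17 std::aligned_alloc (which requires the size to be a multiple of the alignment; MSVC would need _aligned_malloc instead):

#include <cstdlib>

// Allocate `size` bytes aligned to a 64-byte cache line.
void* allocateCacheAligned(size_t size, size_t alignment = 64) {
    // std::aligned_alloc requires size to be a multiple of alignment
    const size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return std::aligned_alloc(alignment, rounded);  // release with std::free
}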

Software Prefetching

// Prefetch hints (x86)
_mm_prefetch(ptr, _MM_HINT_T0);  // L1 cache
_mm_prefetch(ptr, _MM_HINT_T1);  // L2 cache
_mm_prefetch(ptr, _MM_HINT_T2);  // L3 cache
_mm_prefetch(ptr, _MM_HINT_NTA); // Non-temporal (bypass cache)
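
Since _mm_prefetch is x86-specific, a portable wrapper might fall back to __builtin_prefetch on GCC/Clang, which also covers ARM; a sketch:

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
  #include <immintrin.h>
#endif

inline void prefetchRead(const void* p) {
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);  // all cache levels
#elif defined(__GNUC__)
    __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);         // GCC/Clang builtin
#else
    (void)p;                                                 // no-op elsewhere
#endif
}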

Cache-Conscious Data Structures

// Bad: scattered data (each member is a separate heap allocation,
// so one voice touches several unrelated cache lines)
class ScatteredVoice {
    float* oscillator;   // points somewhere on the heap
    float* filter;       // points somewhere else
    float* envelope;     // points somewhere else again
};

// Good: colocated data (one contiguous object, good spatial locality)
class CompactVoice {
    float oscillatorData[64];  // inline, cache-friendly
    float filterData[64];
    float envelopeData[64];
};

📚 Key Concepts

1. Spatial Locality

Access memory locations that are close together:

// Good spatial locality
for (int i = 0; i < N; ++i) {
    array[i] = ...;  // Sequential access
}

// Poor spatial locality
for (int i = 0; i < N; i += stride) {
    array[i] = ...;  // Strided access (cache line waste)
}

2. Temporal Locality

Reuse recently accessed data:

// Good temporal locality
float temp = input[i];
output1[i] = temp * gain1;
output2[i] = temp * gain2;  // 'temp' likely still in cache

// Poor temporal locality
output1[i] = input[i] * gain1;
// ... 1000 lines of code ...
output2[i] = input[i] * gain2;  // 'input[i]' likely evicted

3. Cache Line Sharing

Avoid false sharing in multi-threaded code:

// Bad: false sharing (threads fight over the same cache line)
struct SharedCounters {
    alignas(64) int thread1_counter;
    int thread2_counter;  // lands in the same cache line!
};

// Good: one cache line per counter
struct PaddedCounters {
    alignas(64) int thread1_counter;
    alignas(64) int thread2_counter;  // separate cache lines
};


🔬 Profiling Tools

Linux: perf

perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./benchmark

Performance counter stats:
  1,234,567 cache-references
    123,456 cache-misses              # 10.00% of all cache refs
  9,876,543 L1-dcache-loads
    987,654 L1-dcache-load-misses     # 10.00% of all L1 loads

Intel: VTune

  • Cache Miss Analysis
  • Memory Access Patterns
  • Cache Hierarchy Visualization

Windows: VS Performance Profiler

  • CPU Sampling
  • Instrumentation
  • Memory Usage

✅ Success Criteria

  • ✅ 20-40% performance improvement over SIMD-only
  • ✅ Cache miss rate reduced by 50%+
  • ✅ Works on Intel, AMD, and ARM CPUs
  • ✅ Maintains accuracy (no quality degradation)
  • ✅ Validated with profiling tools

πŸ“ Documentation Needed

  • Cache hierarchy explanation
  • Block size selection guide
  • Prefetch strategy guide
  • SoA conversion guide
  • Profiling guide (perf, VTune)
  • Integration examples

🔗 Dependencies

Required:
  • TAREA 0: Variant Framework (CPUDetection for cache info)
  • TAREA 1: SIMD Variants (baseline to improve upon)

Optional:
  • TAREA 5: Threading Variants (cache-aware NUMA)


⚠️ Challenges

  1. CPU-Specific Tuning: Optimal block sizes vary by CPU
  2. Complex Trade-offs: Sometimes cache optimization hurts SIMD efficiency
  3. Hard to Debug: Cache issues don't crash; they just slow things down
  4. Profiling Required: Must use tools like perf/VTune to verify gains

Status: ⏸️ Ready to start after TAREA 1 completion
Priority: HIGH - Can provide 20-40% additional gains
Contact: performance@audiolab.com


"Cache Variants: Because cache misses cost 100+ cycles!" βš‘πŸ’Ύ