

Cache-Optimized Audio Processing Variants

Status: ⏸️ NOT STARTED
Priority: HIGH
Dependencies: TAREA 0 (Variant Framework), TAREA 1 (SIMD Variants)
Estimated Effort: 2-3 weeks


🎯 Purpose

Cache variants optimize data access patterns to maximize CPU cache utilization, targeting an additional 20-40% performance on top of SIMD optimizations. Cache capacity is limited and main-memory latency is high, so cache-friendly algorithms can dramatically reduce the time the CPU spends waiting on memory.

Cache Hierarchy

┌──────────────────────────────────┐
│ L1 Cache: 32-64 KB per core      │
│ Latency: ~4 cycles               │ ← Fastest
│ Bandwidth: ~100 GB/s             │
├──────────────────────────────────┤
│ L2 Cache: 256 KB - 1 MB per core │
│ Latency: ~12 cycles              │ ← Fast
│ Bandwidth: ~50 GB/s              │
├──────────────────────────────────┤
│ L3 Cache: 8-128 MB shared        │
│ Latency: ~40 cycles              │ ← Medium
│ Bandwidth: ~25 GB/s              │
├──────────────────────────────────┤
│ Main RAM: GBs                    │
│ Latency: ~100-200 cycles         │ ← Slow
│ Bandwidth: ~10-20 GB/s           │
└──────────────────────────────────┘

Goal: Keep working set in L1/L2 cache as much as possible.


📋 Planned Features

1. Cache Blocking (Tiling)

Break large operations into cache-sized blocks:

// Naive: each stage streams the entire buffer before the next stage runs,
// so early samples are evicted from cache before they are touched again
for (size_t i = 0; i < numSamples; ++i) {
    output[i] = process(input[i]);
}

// Process in L1-cache-sized blocks
constexpr size_t BLOCK_SIZE = 1024;  // Fits in L1
for (size_t block = 0; block < numSamples; block += BLOCK_SIZE) {
    size_t blockSize = std::min(BLOCK_SIZE, numSamples - block);
    processBlock(&input[block], &output[block], blockSize);
}
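
Blocking pays off mainly when several stages pass over the same data; a single streaming pass is already cache-friendly. Here is a minimal sketch of the difference, using two hypothetical stages (applyGain, softClip) as stand-ins for real DSP:

#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical stages, shown only to illustrate blocking.
static void applyGain(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= 0.5f;
}
static void softClip(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) data[i] = std::tanh(data[i]);
}

// Unblocked: each stage streams the whole buffer, so by the time
// softClip runs, the start of the buffer has left L1.
void processUnblocked(float* data, std::size_t numSamples) {
    applyGain(data, numSamples);
    softClip(data, numSamples);
}

// Blocked: both stages run on an L1-sized chunk while it is still hot.
void processBlocked(float* data, std::size_t numSamples) {
    constexpr std::size_t BLOCK_SIZE = 1024;  // 4 KB of floats
    for (std::size_t block = 0; block < numSamples; block += BLOCK_SIZE) {
        const std::size_t n = std::min(BLOCK_SIZE, numSamples - block);
        applyGain(data + block, n);
        softClip(data + block, n);
    }
}

In the blocked version, each 4 KB chunk goes through both stages while it is still resident in L1.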

Expected Benefit: 15-25% faster

2. Data Prefetching

Explicitly tell CPU to load data into cache before it's needed:

#include <xmmintrin.h>  // _mm_prefetch

class PrefetchGainVariant : public IVariant {
public:
    bool process(const float* input, float* output, size_t numSamples) override {
        constexpr size_t PREFETCH_DISTANCE = 64;  // samples ahead

        for (size_t i = 0; i < numSamples; ++i) {
            // Prefetch future data into cache before it is needed
            if (i + PREFETCH_DISTANCE < numSamples) {
                _mm_prefetch((const char*)&input[i + PREFETCH_DISTANCE], _MM_HINT_T0);
            }

            output[i] = input[i] * gain_;
        }
        return true;
    }

private:
    float gain_ = 1.0f;
};

Expected Benefit: 10-20% faster for large buffers. Note that hardware prefetchers already handle simple sequential streams well, so explicit prefetching pays off mainly for irregular or strided access patterns.

3. Loop Tiling for Filters

Optimize IIR/FIR filters with cache blocking:

// Standard FIR convolution (cache-unfriendly for long IRs)
// (assumes irLength - 1 samples of valid history precede input[0])
for (size_t i = 0; i < numSamples; ++i) {
    float sum = 0.0f;
    for (size_t j = 0; j < irLength; ++j) {  // IR + input window overflow L1 for long IRs
        sum += input[(ptrdiff_t)i - (ptrdiff_t)j] * ir[j];
    }
    output[i] = sum;
}

// Cache-blocked FIR (better locality)
constexpr size_t TILE_SIZE = 256;
for (size_t tile = 0; tile < numSamples; tile += TILE_SIZE) {
    size_t tileSize = std::min(TILE_SIZE, numSamples - tile);
    // Process one tile while the IR stays resident in cache
    processFIRTile(&input[tile], &output[tile], tileSize, ir, irLength);
}
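
One plausible shape for processFIRTile, as a sketch only (not the project's actual helper): it assumes the caller guarantees irLength - 1 samples of valid history immediately before the input pointer.

// Direct-form FIR over one tile. `input` points at the tile's first sample;
// input[-1] ... input[-(irLength-1)] must be valid history (assumption).
void processFIRTile(const float* input, float* output, size_t tileSize,
                    const float* ir, size_t irLength) {
    for (size_t i = 0; i < tileSize; ++i) {
        float sum = 0.0f;
        for (size_t j = 0; j < irLength; ++j) {
            sum += input[(ptrdiff_t)i - (ptrdiff_t)j] * ir[j];
        }
        output[i] = sum;
    }
}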

Expected Benefit: 30-50% faster for long IRs

4. Structure-of-Arrays (SoA) Layouts

Transform data layout for better vectorization and cache usage:

// Array-of-Structures (AoS) - interleaved; per-channel processing wastes half of each cache line
struct Frame {
    float left;
    float right;
};
Frame frames[1024];  // L R L R L R ... (interleaved)

// Structure-of-Arrays (SoA) - better cache locality
struct Frames {
    float left[1024];   // L L L L ...
    float right[1024];  // R R R R ...
};
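
As a sketch of the payoff, a per-channel gain over this SoA layout becomes two unit-stride loops that compilers can vectorize without de-interleaving shuffles (applyStereoGain is a hypothetical helper):

// Each loop walks one contiguous array, consuming full cache lines.
void applyStereoGain(Frames& f, size_t n, float gainL, float gainR) {
    for (size_t i = 0; i < n; ++i) f.left[i]  *= gainL;
    for (size_t i = 0; i < n; ++i) f.right[i] *= gainR;
}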

Expected Benefit: 20-30% faster for stereo processing

5. Cache-Aware Sorting/Ordering

Optimize processing order to minimize cache misses:

// Process voices in cache-friendly order
void processVoices(Voice** voices, int numVoices) {
    // Sort the pointer array by address so traversal walks memory
    // in one direction (std::less gives a total order over pointers)
    std::sort(voices, voices + numVoices, std::less<Voice*>());

    // Process in ascending address order (spatial locality)
    for (int i = 0; i < numVoices; ++i) {
        voices[i]->process();
    }
}

Expected Benefit: 10-15% faster for many voices


πŸ—οΈ Proposed Architecture

Cache-Aware Variant Base Class

class CacheOptimizedVariant : public IVariant {
protected:
    // Determine optimal block size based on cache hierarchy
    size_t getOptimalBlockSize(size_t dataSize, CacheLevel targetCache) const;

    // Prefetch data ahead of processing
    void prefetchData(const void* address, PrefetchHint hint) const;

    // Align data to cache line boundaries
    void* allocateCacheAligned(size_t size, size_t alignment = 64) const;

    // Cache hierarchy info
    const CPUInfo& cpuInfo_;
};
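
One plausible implementation of getOptimalBlockSize, assuming cpuInfo_ exposes per-level cache sizes in bytes (the cacheSizeBytes accessor is hypothetical):

size_t CacheOptimizedVariant::getOptimalBlockSize(size_t dataSize,
                                                  CacheLevel targetCache) const {
    // Target roughly half the chosen cache level so input, output, and
    // coefficients can coexist without evicting each other.
    const size_t budget = cpuInfo_.cacheSizeBytes(targetCache) / 2;  // hypothetical accessor
    const size_t maxSamples = budget / sizeof(float);
    return std::min(dataSize, maxSamples);
}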

Specific Variants

class CacheBlockedConvolutionVariant : public CacheOptimizedVariant {
public:
    bool process(const float* input, float* output, size_t numSamples) override;
    void setImpulseResponse(const float* ir, size_t irLength);

private:
    void processTile(const float* input, float* output,
                     size_t tileSize, const float* ir, size_t irLength);

    size_t tileSize_;  // Calculated from L2 cache size
};

class CacheOptimizedStereoVariant : public CacheOptimizedVariant {
public:
    bool processStereo(const float* inL, const float* inR,
                       float* outL, float* outR, size_t numSamples) override;

private:
    // Uses SoA layout internally
    AlignedBuffer<float> tempBuffers_[2];
};

📊 Performance Targets

Cache Miss Reduction

| Variant                 | Cache Misses (Baseline) | Cache Misses (Optimized) | Reduction |
|-------------------------|-------------------------|--------------------------|-----------|
| FIR Convolution (4096)  | 45%                     | 12%                      | 73%       |
| Stereo Processing       | 30%                     | 8%                       | 73%       |
| Multi-Voice Synth       | 55%                     | 20%                      | 64%       |

Performance Improvements

| Operation                 | SIMD Only | + Cache Optimization | Additional Gain |
|---------------------------|-----------|----------------------|-----------------|
| Long FIR (4096 taps)      | 8x        | 11x                  | +37%            |
| Stereo Gain               | 6.7x      | 8.5x                 | +27%            |
| Multi-Biquad (10 filters) | 2.5x      | 3.5x                 | +40%            |

πŸ› οΈ Implementation Plan

Week 1: Infrastructure

Tasks:
  1. Implement cache size detection (from CPUDetection; see the sketch after this week's deliverables)
  2. Create CacheOptimizedVariant base class
  3. Implement cache-aligned memory allocation
  4. Prefetch helper functions
  5. Benchmarking infrastructure for cache performance

Deliverables:
  • CacheOptimizedVariant.h/cpp
  • Cache detection utilities
  • Alignment helpers
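
For the cache size detection task, glibc-based Linux systems expose data-cache sizes through sysconf; a minimal sketch (other platforms would need CPUID or sysctl instead):

#include <unistd.h>

// Returns the L1 data cache size in bytes, with a conservative
// 32 KB fallback when the value is unavailable.
size_t l1DataCacheSize() {
    long bytes = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    return bytes > 0 ? static_cast<size_t>(bytes) : 32 * 1024;
}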

Week 2: Basic Cache Variants

Tasks:
  1. Implement cache-blocked gain variant
  2. Implement prefetch gain variant
  3. Implement cache-blocked convolution
  4. Benchmark vs SIMD-only variants
  5. Tune block sizes for different CPUs

Deliverables:
  • CacheBlockedGainVariant
  • CacheBlockedConvolutionVariant
  • Performance comparison data

Week 3: Advanced Optimizations

Tasks:
  1. SoA stereo processing variants
  2. Cache-aware multi-voice processing
  3. Integration with existing SIMD variants
  4. Documentation and examples
  5. Validation tests

Deliverables:
  • SoA stereo variants
  • Complete documentation
  • Test suite


🔧 Technical Details

Cache Line Size Optimization

// Align to cache line (64 bytes on x86)
struct alignas(64) AudioBuffer {
    float samples[16];  // 64 bytes = 16 floats
    // Next buffer starts on new cache line
};
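
A free-function sketch of cache-aligned allocation using C++17 std::aligned_alloc (which requires the size to be a multiple of the alignment; MSVC would need _aligned_malloc instead):

#include <cstdlib>

// Allocate `size` bytes aligned to a 64-byte cache line.
void* allocateCacheAligned(size_t size, size_t alignment = 64) {
    // std::aligned_alloc requires size to be a multiple of alignment
    const size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return std::aligned_alloc(alignment, rounded);  // release with std::free
}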

Software Prefetching

// Prefetch hints (x86)
_mm_prefetch(ptr, _MM_HINT_T0);  // L1 cache
_mm_prefetch(ptr, _MM_HINT_T1);  // L2 cache
_mm_prefetch(ptr, _MM_HINT_T2);  // L3 cache
_mm_prefetch(ptr, _MM_HINT_NTA); // Non-temporal (bypass cache)
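
Since _mm_prefetch is x86-specific, a portable wrapper might fall back to __builtin_prefetch on GCC/Clang, which also covers ARM; a sketch:

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
  #include <immintrin.h>
#endif

inline void prefetchRead(const void* p) {
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);  // all cache levels
#elif defined(__GNUC__)
    __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);         // GCC/Clang builtin
#else
    (void)p;                                                 // no-op elsewhere
#endif
}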

Cache-Conscious Data Structures

// Bad: scattered data (each member is a separate heap allocation,
// so one voice touches several unrelated cache lines)
class ScatteredVoice {
    float* oscillator;   // points somewhere on the heap
    float* filter;       // points somewhere else
    float* envelope;     // points somewhere else again
};

// Good: colocated data (one contiguous object, good spatial locality)
class CompactVoice {
    float oscillatorData[64];  // inline, cache-friendly
    float filterData[64];
    float envelopeData[64];
};

📚 Key Concepts

1. Spatial Locality

Access memory locations that are close together:

// Good spatial locality
for (int i = 0; i < N; ++i) {
    array[i] = ...;  // Sequential access
}

// Poor spatial locality
for (int i = 0; i < N; i += stride) {
    array[i] = ...;  // Strided access (cache line waste)
}

2. Temporal Locality

Reuse recently accessed data:

// Good temporal locality
float temp = input[i];
output1[i] = temp * gain1;
output2[i] = temp * gain2;  // 'temp' likely still in cache

// Poor temporal locality
output1[i] = input[i] * gain1;
// ... 1000 lines of code ...
output2[i] = input[i] * gain2;  // 'input[i]' likely evicted

3. Cache Line Sharing

Avoid false sharing in multi-threaded code:

// Bad: false sharing (threads fight over the same cache line)
struct SharedCounters {
    alignas(64) int thread1_counter;
    int thread2_counter;  // lands in the same cache line!
};

// Good: one cache line per counter
struct PaddedCounters {
    alignas(64) int thread1_counter;
    alignas(64) int thread2_counter;  // separate cache lines
};


🔬 Profiling Tools

Linux: perf

perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./benchmark

Performance counter stats:
  1,234,567 cache-references
    123,456 cache-misses              # 10.00% of all cache refs
  9,876,543 L1-dcache-loads
    987,654 L1-dcache-load-misses     # 10.00% of all L1 loads

Intel: VTune

  • Cache Miss Analysis
  • Memory Access Patterns
  • Cache Hierarchy Visualization

Windows: VS Performance Profiler

  • CPU Sampling
  • Instrumentation
  • Memory Usage

✅ Success Criteria

  • ✅ 20-40% performance improvement over SIMD-only
  • ✅ Cache miss rate reduced by 50%+
  • ✅ Works on Intel, AMD, and ARM CPUs
  • ✅ Maintains accuracy (no quality degradation)
  • ✅ Validated with profiling tools

πŸ“ Documentation Needed

  • Cache hierarchy explanation
  • Block size selection guide
  • Prefetch strategy guide
  • SoA conversion guide
  • Profiling guide (perf, VTune)
  • Integration examples

🔗 Dependencies

Required:
  • TAREA 0: Variant Framework (CPUDetection for cache info)
  • TAREA 1: SIMD Variants (baseline to improve upon)

Optional:
  • TAREA 5: Threading Variants (cache-aware NUMA)


⚠️ Challenges

  1. CPU-Specific Tuning: Optimal block sizes vary by CPU
  2. Complex Trade-offs: Sometimes cache optimization hurts SIMD efficiency
  3. Hard to Debug: Cache issues don't crash; they just slow things down
  4. Profiling Required: Must use tools like perf/VTune to verify gains

Status: ⏸️ Ready to start after TAREA 1 completion
Priority: HIGH - Can provide 20-40% additional gains
Contact: performance@audiolab.com


"Cache Variants: Because cache misses cost 100+ cycles!" βš‘πŸ’Ύ