⚡ DSP Optimization (08_02_03)¶

Advanced optimization utilities for high-performance audio processing.

📦 Components¶

1. SIMDHelpers¶

Cross-platform SIMD operations (SSE/AVX/NEON).

#include "SIMDHelpers.hpp"
using namespace audiolab::plugins::dsp::simd;

float input[512] = {};  // fill with audio samples before processing
float output[512];
vectorGain(output, input, 0.5f, 512);  // typically 2-3x faster than scalar

Supported operations:
- Vector add/subtract/multiply
- Multiply-add (FMA when available)
- Gain scaling
- Peak/RMS calculation
- Interleaved ↔ Planar conversion

Platform support:
- x64: SSE, SSE4.1, AVX, AVX2, AVX-512
- ARM: NEON
- Fallback: Scalar (portable)
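To make the SIMD-with-fallback idea concrete, here is a minimal sketch of a gain routine that picks SSE or NEON at compile time and otherwise runs a scalar loop. The name `gainSketch` and the `SKETCH_HAS_*` macros are hypothetical; this is not the actual `vectorGain` implementation, which may use wider registers or runtime dispatch.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>   // SSE intrinsics
  #define SKETCH_HAS_SSE 1
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
  #include <arm_neon.h>    // NEON intrinsics
  #define SKETCH_HAS_NEON 1
#endif

// Hypothetical stand-in for a gain routine; not the library's implementation.
inline void gainSketch(float* out, const float* in, float gain, std::size_t n) {
    assert(out != nullptr && in != nullptr);
    std::size_t i = 0;
#if defined(SKETCH_HAS_SSE)
    const __m128 g = _mm_set1_ps(gain);
    for (; i + 4 <= n; i += 4) {
        // Unaligned load/store, so any float pointer is accepted.
        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_loadu_ps(in + i), g));
    }
#elif defined(SKETCH_HAS_NEON)
    const float32x4_t g = vdupq_n_f32(gain);
    for (; i + 4 <= n; i += 4) {
        vst1q_f32(out + i, vmulq_f32(vld1q_f32(in + i), g));
    }
#endif
    // Scalar tail, and the whole loop on platforms without SSE/NEON.
    for (; i < n; ++i) {
        out[i] = in[i] * gain;
    }
}
```

The scalar loop doubles as the tail handler, so buffer lengths that are not a multiple of four need no special casing.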

2. CacheOptimizer¶

Cache-friendly data layout utilities.

#include "CacheOptimizer.hpp"

// Bad for per-field passes: Array of Structures (each cache line mixes freq, amp, and phase)
struct Voice {
    float freq, amp, phase;
};
Voice voices[100];

// Good: Structure of Arrays (cache friendly)
struct VoiceBank {
    CACHE_ALIGNED float freqs[100];
    CACHE_ALIGNED float amps[100];
    CACHE_ALIGNED float phases[100];
};

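// Prefetch the next chunk while processing the current one.
// Hardware prefetchers already cover simple sequential access, so measure before keeping this.
for (int i = 0; i < N; i += 8) {
    if (i + 8 < N) {
        CacheOptimizer::prefetch(&data[i + 8]);  // stay inside the array
    }
    process(&data[i]);
}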
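For readers without the header at hand, the following sketch shows one way `CACHE_ALIGNED` and `prefetch` could be built on standard `alignas` and compiler prefetch intrinsics. Every name ending in `Sketch`, the `sumField` helper, and the 64-byte line size are assumptions for illustration; the real `CacheOptimizer.hpp` may differ.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <xmmintrin.h>  // _mm_prefetch on MSVC
#endif

// Assumption: 64-byte cache lines (common on x64, not universal).
#define CACHE_ALIGNED_SKETCH alignas(64)

// Hypothetical prefetch wrapper. Prefetching is only a hint and never affects results.
inline void prefetchSketch(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p);
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#else
    (void)p;  // no prefetch intrinsic available: do nothing
#endif
}

struct VoiceBankSketch {
    CACHE_ALIGNED_SKETCH float freqs[100];
    CACHE_ALIGNED_SKETCH float amps[100];
    CACHE_ALIGNED_SKETCH float phases[100];
};

// Sums one field, hinting one cache line (16 floats) ahead without leaving the array.
inline float sumField(const float* field, std::size_t n) {
    assert(field != nullptr);
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 16 == 0 && i + 16 < n) {
            prefetchSketch(field + i + 16);
        }
        total += field[i];
    }
    return total;
}
```

Because each array starts on its own cache line, a pass over `amps` never pulls in bytes belonging to `freqs` or `phases`.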

3. BranchPredictor¶

Branch prediction hints.

#include "BranchPredictor.hpp"

// Hot path optimization
if (LIKELY(buffer != nullptr)) {
    processAudio(buffer);  // Common case optimized
} else {
    return;  // Rare error path
}

// Error handling (setup code; avoid throwing on the audio thread)
if (UNLIKELY(sampleRate <= 0)) {
    throw std::invalid_argument("Invalid sample rate");
}

// Branchless selection (for unpredictable conditions)
float result = branchlessClamp(value, 0.0f, 1.0f);
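As a rough illustration of what sits behind these helpers, the sketch below defines hint macros on top of `__builtin_expect` (GCC/Clang) and writes the clamp as a min/max pair. The `_SKETCH` names and `countClipped` are hypothetical, and the real `BranchPredictor.hpp` definitions may differ. Compilers commonly lower `std::min`/`std::max` on floats to branch-free instructions, but that is not guaranteed.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical definitions; the real BranchPredictor.hpp may differ.
#if defined(__GNUC__) || defined(__clang__)
  #define LIKELY_SKETCH(x)   (__builtin_expect(!!(x), 1))
  #define UNLIKELY_SKETCH(x) (__builtin_expect(!!(x), 0))
#else
  #define LIKELY_SKETCH(x)   (x)
  #define UNLIKELY_SKETCH(x) (x)
#endif

// Clamp expressed as min/max, with no explicit if/else in the source.
inline float branchlessClampSketch(float v, float lo, float hi) {
    assert(lo <= hi);
    return std::min(std::max(v, lo), hi);
}

// Counts samples outside [-1, 1]; clipping is assumed to be the rare case.
inline int countClipped(const float* samples, int n) {
    int clipped = 0;
    for (int i = 0; i < n; ++i) {
        if (UNLIKELY_SKETCH(samples[i] > 1.0f || samples[i] < -1.0f)) {
            ++clipped;
        }
    }
    return clipped;
}
```

The hint macros change only code layout, never results, so they can be removed at any time to check whether they actually help.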

4. VectorizationHints¶

Compiler vectorization hints.

#include "VectorizationHints.hpp"

// Encourage loop vectorization
VECTORIZE_LOOP
for (int i = 0; i < count; ++i) {
    output[i] = input[i] * gain;
}

// Assume pointer alignment
ASSUME_ALIGNED(buffer, 32);
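The sketch below shows how such hints might map onto compiler-specific pragmas and builtins. The `_SKETCH` macros and `applyGainAligned` are hypothetical and may not match `VectorizationHints.hpp`. Note that `#pragma GCC ivdep` and MSVC's `loop(ivdep)` are promises that the loop has no loop-carried dependencies, and the alignment builtin is likewise a promise: breaking either is undefined behaviour. With C++20, `std::assume_aligned` from `<memory>` is a portable alternative to the builtin.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical definitions; the real VectorizationHints.hpp may differ.
#if defined(__clang__)
  #define VECTORIZE_LOOP_SKETCH _Pragma("clang loop vectorize(enable)")
#elif defined(__GNUC__)
  #define VECTORIZE_LOOP_SKETCH _Pragma("GCC ivdep")
#elif defined(_MSC_VER)
  #define VECTORIZE_LOOP_SKETCH __pragma(loop(ivdep))
#else
  #define VECTORIZE_LOOP_SKETCH
#endif

#if defined(__GNUC__) || defined(__clang__)
  // Re-assigns the pointer so the compiler can carry the alignment fact forward.
  #define ASSUME_ALIGNED_SKETCH(ptr, n) \
      ((ptr) = static_cast<decltype(ptr)>(__builtin_assume_aligned((ptr), (n))))
#else
  #define ASSUME_ALIGNED_SKETCH(ptr, n) ((void)0)
#endif

// Caller promises: out and in are 32-byte aligned and do not overlap.
inline void applyGainAligned(float* out, const float* in, float gain, std::size_t n) {
    assert(out != nullptr && in != nullptr);
    ASSUME_ALIGNED_SKETCH(out, 32);
    ASSUME_ALIGNED_SKETCH(in, 32);
    VECTORIZE_LOOP_SKETCH
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * gain;
    }
}
```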

5. AlignmentHelpers¶

Memory alignment utilities (integrated in SIMDHelpers).

// Check alignment
if (isAligned(buffer, 16)) {
    // Use SIMD path
} else {
    // Use scalar fallback
}

// Aligned storage (static or stack array, no heap allocation)
AUDIOLAB_ALIGNED(32) float buffer[512];
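An alignment check of this kind usually reduces to a bit test on the pointer value. The sketch below is one plausible shape for it; `isAlignedSketch` is a hypothetical stand-in, and the library's `isAligned` may be implemented differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for isAligned(); alignment must be a power of two.
inline bool isAlignedSketch(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // For a power of two, the low bits of an aligned address are all zero.
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

For example, given `alignas(32) float buf[16];`, the pointer `buf + 4` sits 16 bytes past a 32-byte boundary, so it passes a 16-byte check but fails a 32-byte one.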

🚀 Performance Gains¶

| Optimization | Speedup | Use Case |
|---|---|---|
| SIMD (AVX2) | 2-8x | Vector math, mixing |
| Cache prefetch | 1.5-3x | Sequential processing |
| Branch hints | 1.1-1.5x | Hot loops, error checks |
| SoA layout | 2-4x | Multi-voice processing |
| Branchless | 1.2-2x | Unpredictable conditions |

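Measured on a modern x64 CPU at a 44.1 kHz sample rate. Treat these ranges as indicative; results vary with CPU, compiler, and buffer size.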

📊 Benchmarks¶

cd benchmarks
mkdir build && cd build
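cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release   # --config matters for multi-config generators (e.g. Visual Studio)
./benchmark_optimizations.exe      # path and suffix vary: may be ./Release/benchmark_optimizations.exe, or no .exe on Linux/macOS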

Example output:

Scalar gain:    2500 ns
SIMD gain (AVX2): 320 ns  → 7.8x speedup

Branching clamp:    1200 ns
Branchless clamp:    450 ns  → 2.7x speedup

Random access:  5000 ns
Prefetched access: 1800 ns  → 2.8x speedup

📝 Example: Optimized Mixing¶

#include "SIMDHelpers.hpp"
#include "CacheOptimizer.hpp"
#include "BranchPredictor.hpp"  // LIKELY

using namespace audiolab::plugins::dsp;

// Mixes all input channels into outputs[0], scaling channel ch by gains[ch].
void mixChannels(float** outputs, const float* const* inputs,
                 const float* gains, int numChannels, int blockSize) {
    // Clear outputs
    for (int ch = 0; ch < numChannels; ++ch) {
        simd::vectorClear(outputs[ch], blockSize);
    }

    // Mix inputs with SIMD
    for (int ch = 0; ch < numChannels; ++ch) {
        simd::vectorMultiplyAdd(outputs[0], inputs[ch], gains[ch], blockSize);

        // Prefetch next channel
        if (LIKELY(ch + 1 < numChannels)) {
            CacheOptimizer::prefetch(inputs[ch + 1]);
        }
    }
}
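When validating a SIMD mixer like the one above, a plain scalar reference is handy for comparison. The sketch below assumes that `vectorMultiplyAdd(out, in, gain, n)` accumulates `out[i] += in[i] * gain`, which the source does not spell out; `mixDownReference` is a hypothetical helper, not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Scalar reference for a mono mix-down:
// out[i] = sum over channels of inputs[ch][i] * gains[ch].
inline void mixDownReference(float* out, const float* const* inputs, const float* gains,
                             int numChannels, std::size_t blockSize) {
    assert(out != nullptr && inputs != nullptr && gains != nullptr);
    // Clear the destination first, then accumulate one channel at a time.
    for (std::size_t i = 0; i < blockSize; ++i) {
        out[i] = 0.0f;
    }
    for (int ch = 0; ch < numChannels; ++ch) {
        for (std::size_t i = 0; i < blockSize; ++i) {
            out[i] += inputs[ch][i] * gains[ch];
        }
    }
}
```

Comparing the SIMD result against this reference with a small tolerance catches indexing and tail-handling mistakes; exact equality is not guaranteed in general because FMA changes rounding.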

📝 Example: Cache-Friendly Voice Pool¶

// Bad: AoS (cache unfriendly for per-component passes)
struct Voice {
    Oscillator osc;     // 64 bytes
    Filter filter;      // 128 bytes
    Envelope env;       // 32 bytes
    // ... more state
};
Voice voices[32];  // Contiguous, but consecutive env members are at least 224 bytes apart

// When processing one component across all voices:
for (int i = 0; i < 32; ++i) {
    voices[i].env.update();  // Each iteration lands on a different cache line
}

// Good: SoA (cache friendly)
struct VoicePool {
    CACHE_ALIGNED Oscillator oscillators[32];
    CACHE_ALIGNED Filter filters[32];
    CACHE_ALIGNED Envelope envelopes[32];
};
VoicePool pool;

// Now the envelopes are packed back to back
for (int i = 0; i < 32; ++i) {
    pool.envelopes[i].update();  // Two 32-byte envelopes share a typical 64-byte cache line
}
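The difference between the two layouts can be estimated by counting the cache lines an envelope pass touches. The helper below is a hypothetical illustration that assumes 64-byte lines, a line-aligned base address, and the component sizes from the comments above with no extra per-voice state (so a 224-byte `Voice`).

```cpp
#include <cassert>
#include <cstddef>

// Counts distinct cache lines touched when reading `count` fields of `fieldSize`
// bytes, the first at byte `firstOffset` and each following one `stride` bytes later.
inline std::size_t cacheLinesTouched(std::size_t firstOffset, std::size_t stride,
                                     std::size_t fieldSize, std::size_t count,
                                     std::size_t lineSize = 64) {
    assert(fieldSize > 0 && stride >= fieldSize && lineSize > 0);
    std::size_t lines = 0;
    std::size_t nextUncounted = 0;  // first line index not yet counted
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t begin = firstOffset + i * stride;
        std::size_t first = begin / lineSize;
        const std::size_t last = (begin + fieldSize - 1) / lineSize;
        if (first < nextUncounted) {
            first = nextUncounted;  // skip lines already counted for earlier fields
        }
        if (last >= first) {
            lines += last - first + 1;
            nextUncounted = last + 1;
        }
    }
    return lines;
}
```

Under those assumptions, `cacheLinesTouched(192, 224, 32, 32)` models the AoS pass (envelope at offset 192 inside each 224-byte voice) and `cacheLinesTouched(0, 32, 32, 32)` models the packed envelope array: 32 lines versus 16. The AoS pass also strides across roughly 7 KB of memory instead of 1 KB.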

🔍 Optimization Guidelines¶

When to Use SIMD¶

✅ DO use for:
- Vector math operations
- Gain scaling
- Mixing multiple buffers
- Format conversions
- Large buffer operations (> 64 samples)

❌ DON'T use for:
- Non-contiguous data
- Small buffers (< 16 samples)
- Complex branching logic
- Misaligned data when only aligned loads are available (use scalar fallback)

When to Use Cache Optimization¶

✅ DO use for:
- Multi-voice processing
- Sequential access patterns
- Large data structures
- Thread-local data

❌ DON'T use for:
- Random access patterns
- Single-instance processing
- Small data (< 1 KB)

When to Use Branch Hints¶

✅ DO use for:
- Null pointer checks (usually non-null)
- Error conditions (rarely triggered)
- Loop continuation (usually continues)
- Predictable patterns (> 95% accuracy)

❌ DON'T use for:
- 50/50 conditions
- Data-dependent branches (audio content)
- Already well-predicted code

🧪 Testing Performance¶

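#include "SIMDHelpers.hpp"
#include <chrono>
#include <cstddef>
#include <iostream>

using namespace audiolab::plugins::dsp;

void benchmarkOperation() {
    constexpr std::size_t SIZE = 4096;
    constexpr int ITERATIONS = 10000;
    float input[SIZE] = {};  // initialized so the loop never reads indeterminate values
    float output[SIZE];

    auto start = std::chrono::steady_clock::now();  // monotonic clock

    for (int iter = 0; iter < ITERATIONS; ++iter) {
        simd::vectorGain(output, input, 0.5f, SIZE);
    }

    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);

    // Print a result value too, so the compiler cannot discard the work as unused.
    std::cout << "Time: " << duration.count() / ITERATIONS << " ns/iter"
              << " (output[0] = " << output[0] << ")" << std::endl;
}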
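A single timed loop is sensitive to warm-up and scheduling noise. One common refinement, sketched below, repeats the measurement several times and keeps the best run, with a `volatile` sink that reduces the chance of the compiler eliding the work. `gainUnderTest` is a scalar placeholder for whatever routine is being measured, and `bestNsPerCall` is a hypothetical helper, not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Scalar stand-in for the routine under test (hypothetical).
inline void gainUnderTest(float* out, const float* in, float gain, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * gain;
    }
}

// Returns the best (minimum) time per call in nanoseconds over several runs.
inline double bestNsPerCall(std::size_t size, int runs, int itersPerRun) {
    assert(size > 0 && runs > 0 && itersPerRun > 0);
    std::vector<float> in(size, 1.0f), out(size, 0.0f);
    volatile float sink = 0.0f;  // observable side effect that depends on the output
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        for (int it = 0; it < itersPerRun; ++it) {
            gainUnderTest(out.data(), in.data(), 0.5f, size);
            sink = sink + out[static_cast<std::size_t>(it) % size];
        }
        const auto stop = std::chrono::steady_clock::now();
        const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
        best = std::min(best, ns / itersPerRun);
    }
    return best;
}
```

The minimum is used rather than the mean because interference from the OS only ever adds time. The buffers are heap-allocated here, which is fine for an offline benchmark but not for audio-thread code.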

📚 Further Reading¶

Intel Intrinsics Guide: https://software.intel.com/intrinsics-guide
ARM NEON Guide: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
Cache Optimization: "What Every Programmer Should Know About Memory" (Ulrich Drepper)
Branch Prediction: "Computer Architecture: A Quantitative Approach" (Hennessy & Patterson)

⚠️ Important Notes¶

Alignment: Aligned SIMD loads and stores require 16-byte (SSE) or 32-byte (AVX) alignment; unaligned variants exist but can be slower on some CPUs
Portability: Always provide scalar fallback for unsupported platforms
Compiler Flags: Enable optimization flags (-O3 for GCC/Clang, /O2 for MSVC) for best results
Measurement: Always profile before/after - don't assume optimizations help
RT-Safety: All optimizations maintain real-time safety (no allocations)

Status: ✅ Complete
Version: 1.0.0
Last Updated: 2025-10-09