Thread-Local Storage (TLS) for Audio Processing¶

Zero-contention thread-local storage for real-time audio processing.

Overview¶

Thread-local storage eliminates synchronization overhead by giving each thread its own isolated data. Perfect for:

- Real-time audio threads: No locks, no cache-line bouncing
- Per-thread scratch buffers: Avoid allocations in hot paths
- Thread-local random generators: No mutex contention
- Performance tracking: Per-thread statistics
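As a minimal illustration of the isolation described above, the sketch below gives every thread its own counter through `thread_local`. The names `tlsCounter` and `runIsolatedCounters` are hypothetical and not part of this library.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical per-thread counter: every thread sees its own copy.
thread_local int tlsCounter = 0;

// Runs `threads` workers that each bump their own counter `iterations`
// times, and returns the final value each worker observed.
std::vector<int> runIsolatedCounters(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::vector<int> results(static_cast<std::size_t>(threads), 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&results, t, iterations] {
            for (int i = 0; i < iterations; ++i) {
                ++tlsCounter;  // no lock needed: this copy is private to the thread
            }
            results[static_cast<std::size_t>(t)] = tlsCounter;  // each worker writes a distinct slot
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Each worker ends with exactly `iterations` in its own counter, and the calling thread's copy is never touched.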

Components¶

1. ThreadContext¶

Thread-local context for audio processing with:

- Thread identification
- Random number generator (thread-local, no mutex)
- Scratch buffer for temporary calculations
- Performance counters

#include <cstdint>
#include "thread_context.hpp"

// bufferSize: number of samples processed in this call
void processAudio(uint32_t bufferSize) {
    auto& ctx = ThreadContext::get();

    // Use thread-local random (no contention)
    float noise = ctx.randomFloat(-1.0f, 1.0f);

    // Use scratch buffer (pre-allocated, no allocation)
    for (size_t i = 0; i < ctx.scratchSize; ++i) {
        ctx.scratchBuffer[i] = noise * ctx.randomFloat();
    }

    // Track performance
    ctx.recordProcessCall(bufferSize);
}
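One way a lock-free, per-thread random generator could be built is sketched below. This is an assumption about the general technique, not the implementation behind `ThreadContext::randomFloat`; the names `threadRng`, `tlsRandomFloat` and `tlsRandomInt` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper: returns the calling thread's private generator.
inline std::mt19937& threadRng() {
    // Seeded once per thread, on the first call made from that thread.
    thread_local std::mt19937 rng{std::random_device{}()};
    return rng;
}

// Uniform float in [0, 1): keep 24 random bits and scale by 2^-24,
// so the largest possible result is (2^24 - 1) / 2^24.
inline float tlsRandomFloat() {
    const std::uint32_t bits = static_cast<std::uint32_t>(threadRng()()) >> 8;
    return static_cast<float>(bits) * (1.0f / 16777216.0f);
}

// Uniform float starting at min and scaled toward max (requires min < max).
// For some ranges, float rounding may land exactly on max.
inline float tlsRandomFloat(float min, float max) {
    assert(min < max);
    return min + (max - min) * tlsRandomFloat();
}

// Uniform integer in the closed range [min, max].
inline int tlsRandomInt(int min, int max) {
    assert(min <= max);
    std::uniform_int_distribution<int> dist(min, max);
    return dist(threadRng());
}
```

Because the generator lives in thread-local storage, concurrent callers never share state and no mutex is required.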

2. ThreadLocalPool¶

Zero-contention object pool where each thread has its own pool:

- No locks: Each thread accesses only its own pool
- No allocations: Objects pre-allocated and reused
- Cache-friendly: No false sharing between threads

#include "thread_local_pool.hpp"

ThreadLocalPool<Buffer> bufferPool(8);  // 8 buffers per thread

void audioCallback() {
    // Each thread acquires from its own pool (no contention)
    ScopedThreadLocalObject<Buffer> temp(bufferPool);

    // Use buffer
    processBuffer(temp.get());

    // Auto-released when temp goes out of scope
}
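To make the per-thread pool idea concrete, here is a minimal sketch under stated assumptions: the class `SketchThreadPool`, its fixed capacity of 8, and its static interface are all hypothetical and simpler than the real `ThreadLocalPool`, which is constructed as an object.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a per-thread pool; not the library's implementation.
template <typename T>
class SketchThreadPool {
public:
    // Returns a free object from the calling thread's pool, or nullptr if empty.
    static T* acquire() {
        State& s = state();
        if (s.freeList.empty()) return nullptr;  // never allocate on the hot path
        T* obj = s.freeList.back();
        s.freeList.pop_back();
        return obj;
    }

    // Returns an object to the calling thread's pool.
    static void release(T* obj) {
        assert(obj != nullptr);
        state().freeList.push_back(obj);  // capacity was reserved, so no reallocation
    }

    static std::size_t available() { return state().freeList.size(); }

private:
    struct State {
        static constexpr std::size_t kCapacity = 8;  // assumed pool size per thread
        std::vector<std::unique_ptr<T>> storage;     // owns the objects
        std::vector<T*> freeList;                    // objects currently free
        State() {
            storage.reserve(kCapacity);
            freeList.reserve(kCapacity);
            for (std::size_t i = 0; i < kCapacity; ++i) {
                storage.push_back(std::make_unique<T>());
                freeList.push_back(storage.back().get());
            }
        }
    };

    static State& state() {
        thread_local State s;  // one State per thread, built on first use
        return s;
    }
};
```

All allocation happens when a thread first touches the pool, so it is worth triggering that first access before entering the real-time path.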

Usage Examples¶

Basic ThreadContext¶

// Get thread-local context
auto& ctx = ThreadContext::get();

// Initialize for audio processing
ctx.initialize(48000.0, 512, 2048);  // SR, buffer size, scratch size

// Use random numbers
float random = ctx.randomFloat();  // [0, 1)
float bipolar = ctx.randomFloat(-1.0f, 1.0f);
int choice = ctx.randomInt(0, 10);

// Use scratch buffer
for (size_t i = 0; i < 512; ++i) {
    ctx.scratchBuffer[i] = input[i] * 0.5f;
}

// Track performance
ctx.recordProcessCall(512);
std::cout << "Avg samples/call: " << ctx.getAverageSamplesPerCall() << "\n";
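The performance counters used above can be pictured as two running totals per thread. The sketch below is an assumption about how such counters might work; `SketchCounters`, `totalSamples` and `threadCounters` are hypothetical names, and only the method names mirror the usage shown in this document.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread counters; not the library's actual layout.
struct SketchCounters {
    std::uint64_t processCallCount = 0;
    std::uint64_t totalSamples = 0;

    // Records one processing call that handled `samples` samples.
    void recordProcessCall(std::uint32_t samples) {
        ++processCallCount;
        totalSamples += samples;
    }

    // Mean samples per call, or 0 before the first call is recorded.
    double getAverageSamplesPerCall() const {
        if (processCallCount == 0) return 0.0;  // avoid dividing by zero
        return static_cast<double>(totalSamples) /
               static_cast<double>(processCallCount);
    }

    void resetCounters() {
        processCallCount = 0;
        totalSamples = 0;
    }
};

// One counter set per thread, so updates need no atomics or locks.
inline SketchCounters& threadCounters() {
    thread_local SketchCounters counters;
    return counters;
}
```

Two calls with 512 and 256 samples give an average of 384 samples per call.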

Scoped Context Initialization¶

void audioThread() {
    ScopedThreadContext ctx(48000.0, 512, 2048);

    while (running) {
        // Context initialized and ready
        float* scratch = ctx.get().scratchBuffer;
        // ... process audio ...
    }

    // Context automatically cleaned up
}

ThreadLocalPool - Manual¶

ThreadLocalPool<AudioBuffer> pool(16);  // 16 buffers per thread

void processAudio() {
    AudioBuffer* temp = pool.acquire();

    // Use buffer
    applyReverb(temp);

    pool.release(temp);
}

ThreadLocalPool - RAII (Recommended)¶

ThreadLocalPool<AudioBuffer> pool(16);

void processAudio() {
    ScopedThreadLocalObject<AudioBuffer> temp(pool);

    // Use buffer
    temp->process();

    // Automatically released
}
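The RAII pattern behind a scoped pool object can be sketched as follows. `SketchPool` and `SketchScopedObject` are hypothetical, single-threaded stand-ins written only to show the acquire-in-constructor, release-in-destructor idea; they are not the library's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal pool, present only so the guard has something to call.
template <typename T>
class SketchPool {
public:
    explicit SketchPool(std::size_t count) : storage_(count) {
        freeList_.reserve(count);
        for (auto& obj : storage_) freeList_.push_back(&obj);
    }
    T* acquire() {
        if (freeList_.empty()) return nullptr;
        T* obj = freeList_.back();
        freeList_.pop_back();
        return obj;
    }
    void release(T* obj) { freeList_.push_back(obj); }
    std::size_t available() const { return freeList_.size(); }

private:
    std::vector<T> storage_;    // owns the objects
    std::vector<T*> freeList_;  // objects currently free
};

// Acquires in the constructor and releases in the destructor.
template <typename T>
class SketchScopedObject {
public:
    explicit SketchScopedObject(SketchPool<T>& pool)
        : pool_(pool), obj_(pool.acquire()) {}
    ~SketchScopedObject() {
        if (obj_ != nullptr) pool_.release(obj_);
    }
    // Non-copyable: two guards must never release the same object.
    SketchScopedObject(const SketchScopedObject&) = delete;
    SketchScopedObject& operator=(const SketchScopedObject&) = delete;

    T* get() const { return obj_; }
    T* operator->() const { return obj_; }
    T& operator*() const {
        assert(obj_ != nullptr);
        return *obj_;
    }

private:
    SketchPool<T>& pool_;
    T* obj_;
};
```

The guard returns the object on every exit path, including early returns and exceptions, which is why the RAII form is less error-prone than pairing `acquire()` and `release()` by hand.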

Custom Allocator¶

// Aligned allocation for SIMD
auto pool = makeThreadLocalPool<SIMDBuffer>(
    []() {
        return new(std::align_val_t{64}) SIMDBuffer();  // 64-byte aligned
    },
    [](SIMDBuffer* buf) {
        buf->~SIMDBuffer();                          // Run the destructor explicitly
        operator delete(buf, std::align_val_t{64});  // Then free with the matching alignment
    },
    8  // Initial capacity
);

auto buffer = pool.acquire();
// ... use aligned buffer for SIMD ...
pool.release(buffer);
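The aligned allocate and free pair can be checked in isolation. The sketch below uses only standard C++17 facilities; `SketchSimdBuffer`, `createAligned`, `destroyAligned` and `isAligned` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical SIMD-friendly buffer used only for this sketch.
struct SketchSimdBuffer {
    float samples[16] = {};
};

constexpr std::size_t kSimdAlignment = 64;

// Allocates with 64-byte alignment, then constructs the object.
inline SketchSimdBuffer* createAligned() {
    return new (std::align_val_t{kSimdAlignment}) SketchSimdBuffer();
}

// Destroys the object, then frees the storage with the same alignment.
inline void destroyAligned(SketchSimdBuffer* buf) {
    if (buf == nullptr) return;
    buf->~SketchSimdBuffer();
    ::operator delete(buf, std::align_val_t{kSimdAlignment});
}

// True when the address is a multiple of `alignment`.
inline bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Calling `operator delete` directly only releases memory, so the explicit destructor call is what keeps a non-trivial buffer type from leaking its resources.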

Multi-threaded Audio Processing¶

ThreadLocalPool<float> bufferPool(8);

void audioThread(int threadId) {
    auto& ctx = ThreadContext::get();
    ctx.initialize(48000.0, 512);

    while (running) {
        // Each thread has its own pool (zero contention)
        ScopedThreadLocalObject<float> temp1(bufferPool);
        ScopedThreadLocalObject<float> temp2(bufferPool);

        // Use thread-local scratch
        for (size_t i = 0; i < 512; ++i) {
            ctx.scratchBuffer[i] = ctx.randomFloat();
        }

        // Process with temp buffers
        process(temp1.get(), temp2.get(), ctx.scratchBuffer);

        ctx.recordProcessCall(512);
    }

    std::cout << "Thread " << threadId
              << " peak usage: " << bufferPool.getPeakUsage() << "\n";
}

Performance Benefits¶

Without TLS (Contention)¶

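// Shared pool with mutex
std::mutex mutex;
std::vector<Buffer*> sharedPool;

Buffer* acquire() {
    std::lock_guard lock(mutex);  // ← Contention!
    if (sharedPool.empty()) return nullptr;  // Pool exhausted
    Buffer* buf = sharedPool.back();
    sharedPool.pop_back();  // Remove it so no other caller gets the same buffer
    return buf;
}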

With TLS (Zero Contention)¶

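// Thread-local pool, no mutex
thread_local std::vector<Buffer*> pool;

Buffer* acquire() {
    if (pool.empty()) return nullptr;  // This thread's pool is exhausted
    Buffer* buf = pool.back();  // ← No lock!
    pool.pop_back();  // Remove it so it is not handed out twice
    return buf;
}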

Benchmark Results (8 threads, 10K acquires each):

- Shared pool with mutex: ~250ms
- Thread-local pool: ~15ms (roughly 16x faster)
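A comparison of this kind could be measured with a harness along the following lines. This is a hypothetical sketch, not the benchmark that produced the figures above; `SketchTiming`, `timeThreads` and `comparePools` are illustrative names, and the timings it reports will vary by machine and compiler settings.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct SketchTiming {
    double sharedMs = 0.0;
    double threadLocalMs = 0.0;
    std::size_t sharedOps = 0;
    std::size_t threadLocalOps = 0;
};

// Runs `work` on `threads` threads and returns the elapsed wall time in ms.
template <typename Work>
double timeThreads(int threads, Work work) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) workers.emplace_back(work);
    for (auto& w : workers) w.join();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

SketchTiming comparePools(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    // One slot per thread, so the shared free list can never run dry.
    std::vector<int> slots(static_cast<std::size_t>(threads));
    std::vector<int*> shared;
    for (auto& s : slots) shared.push_back(&s);
    std::mutex mutex;
    std::atomic<std::size_t> sharedOps{0};
    std::atomic<std::size_t> localOps{0};

    SketchTiming result;
    result.sharedMs = timeThreads(threads, [&] {
        std::size_t ops = 0;
        for (int i = 0; i < iterations; ++i) {
            int* p = nullptr;
            {
                std::lock_guard<std::mutex> lock(mutex);  // every thread queues here
                p = shared.back();
                shared.pop_back();
            }
            *p += 1;  // stand-in for real work on the acquired object
            {
                std::lock_guard<std::mutex> lock(mutex);
                shared.push_back(p);
            }
            ++ops;
        }
        sharedOps += ops;
    });
    result.threadLocalMs = timeThreads(threads, [&] {
        thread_local int slot = 0;
        thread_local std::vector<int*> local{&slot};  // private free list
        std::size_t ops = 0;
        for (int i = 0; i < iterations; ++i) {
            int* p = local.back();
            local.pop_back();
            *p += 1;
            local.push_back(p);
            ++ops;
        }
        localOps += ops;
    });
    result.sharedOps = sharedOps.load();
    result.threadLocalOps = localOps.load();
    return result;
}
```

Calling `comparePools(8, 10000)` mirrors the thread and iteration counts quoted above; the operation counters confirm that both variants did the same amount of work before their timings are compared.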

API Reference¶

ThreadContext¶

Method	Description
`static ThreadContext& get()`	Get thread-local context
`void initialize(double sr, uint32_t bufSize, size_t scratchSize)`	Initialize context
`float randomFloat()`	Random in [0, 1)
`float randomFloat(min, max)`	Random in [min, max)
`int randomInt(min, max)`	Random int in [min, max]
`void resizeScratchBuffer(size_t)`	Resize scratch buffer
`void clearScratchBuffer()`	Zero scratch buffer
`void recordProcessCall(uint32_t samples)`	Track performance
`double getAverageSamplesPerCall()`	Get avg samples/call
`void resetCounters()`	Reset perf counters

ThreadLocalPool¶

Method	Description
`ThreadLocalPool(size_t initial, size_t max = 0)`	Constructor
`T* acquire()`	Acquire object
`void release(T*)`	Release object
`size_t getAvailableCount()`	Free objects
`size_t getTotalCount()`	Total objects
`size_t getInUseCount()`	Objects in use
`size_t getPeakUsage()`	Peak usage
`bool ownsObject(const T*)`	Check ownership
`void resetStats()`	Reset statistics

ScopedThreadLocalObject¶

Method	Description
`ScopedThreadLocalObject(pool)`	Acquire from pool
`~ScopedThreadLocalObject()`	Auto-release
`T* get()`	Get raw pointer
`T* operator->()`	Member access
`T& operator*()`	Dereference

Building and Testing¶

Compile¶

g++ -std=c++17 -pthread tests/test_tls.cpp -o test_tls
./test_tls

CMake¶

add_library(thread_local INTERFACE)
target_include_directories(thread_local INTERFACE ${CMAKE_CURRENT_SOURCE_DIR})
target_compile_features(thread_local INTERFACE cxx_std_17)

add_executable(test_tls tests/test_tls.cpp)
target_link_libraries(test_tls PRIVATE thread_local)

Best Practices¶

DO:¶

✅ Use ThreadContext::get() for per-thread state
✅ Use ThreadLocalPool for zero-contention object pooling
✅ Use ScopedThreadLocalObject for RAII
✅ Pre-allocate sufficient pool capacity
✅ Track peak usage to size pools correctly

DON'T:¶

❌ Pass thread-local pointers between threads
❌ Store thread-local pointers in global state
❌ Assume thread-local state persists after thread exit
❌ Over-allocate (TLS memory multiplied by thread count)

Thread Safety¶

✅ ThreadContext: Each thread has isolated context (thread-safe by design)
✅ ThreadLocalPool: Each thread has own pool (zero synchronization)
⚠️ Cross-thread: Never pass TLS objects between threads

Memory Considerations¶

Thread-local storage memory is per-thread:

- 4 threads × 8 buffers = 32 buffers total
- 8 threads × 8 buffers = 64 buffers total

Size pools based on:

1. Number of threads
2. Peak concurrent usage per thread
3. Available memory
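The sizing arithmetic above can be written down as two small compile-time helpers. The function names are hypothetical, and the buffer size of 512 floats in the usage note is an assumed example rather than a value defined by the library.

```cpp
#include <cassert>
#include <cstddef>

// Total pooled objects across all threads.
constexpr std::size_t totalPooledObjects(std::size_t threads,
                                         std::size_t perThread) {
    return threads * perThread;
}

// Total bytes held by all per-thread pools.
constexpr std::size_t totalPoolBytes(std::size_t threads,
                                     std::size_t perThread,
                                     std::size_t bytesPerObject) {
    return totalPooledObjects(threads, perThread) * bytesPerObject;
}

// The two cases listed above.
static_assert(totalPooledObjects(4, 8) == 32);
static_assert(totalPooledObjects(8, 8) == 64);
```

For example, 8 threads with 8 buffers each, where each buffer holds 512 floats, would pin `totalPoolBytes(8, 8, 512 * sizeof(float))` bytes, which is 131072 bytes (128 KiB) when `float` is 4 bytes.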

Diagnostics¶

// Check pool health
auto& pool = getThreadLocalPool();
std::cout << "Available: " << pool.getAvailableCount() << "\n";
std::cout << "In use: " << pool.getInUseCount() << "\n";
std::cout << "Peak: " << pool.getPeakUsage() << "\n";
std::cout << "Total acquires: " << pool.getTotalAcquires() << "\n";

// Check context
auto& ctx = ThreadContext::get();
std::cout << "Thread ID: " << ctx.threadId << "\n";
std::cout << "Process calls: " << ctx.processCallCount << "\n";
std::cout << "Avg samples: " << ctx.getAverageSamplesPerCall() << "\n";

Platform Support¶

✅ Windows (MSVC, MinGW)
✅ macOS (Clang)
✅ Linux (GCC, Clang)
✅ C++17 or later

Implementation Notes¶

Thread-Local Storage¶

Uses the C++11 thread_local keyword:

thread_local ThreadContext ctx;  // One per thread

Lazy Initialization¶

Pools are initialized on first access in each thread:

static PerThreadData& getThreadData() {
    thread_local PerThreadData data;  // Init on first call
    return data;
}
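The lazy behaviour can be observed by counting constructor calls. In the sketch below, `SketchPerThreadData`, `sketchThreadData`, `constructions` and `constructionsOnNewThread` are hypothetical names used only for the demonstration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Counts how many per-thread objects have been constructed so far.
std::atomic<int> constructions{0};

struct SketchPerThreadData {
    SketchPerThreadData() { ++constructions; }
    int value = 0;
};

// Same shape as the accessor above: constructed on the first call per thread.
SketchPerThreadData& sketchThreadData() {
    thread_local SketchPerThreadData data;
    return data;
}

// Calls the accessor `calls` times on a fresh thread and returns how many
// constructions that thread caused.
int constructionsOnNewThread(int calls) {
    assert(calls >= 0);
    const int before = constructions.load();
    std::thread worker([calls] {
        for (int i = 0; i < calls; ++i) {
            sketchThreadData().value += 1;
        }
    });
    worker.join();
    return constructions.load() - before;
}
```

A thread that calls the accessor many times constructs its data exactly once, and a thread that never calls it constructs nothing, so unused threads pay no memory cost.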

Cleanup¶

Thread-local destructors are called automatically on thread exit:

~ThreadContext() {
    delete[] scratchBuffer;  // Cleaned up per thread
}
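The cleanup guarantee can be checked by counting destructor calls around a thread's lifetime. `SketchContext`, `sketchThreadContext`, `destroyed` and `destructionsAfterThreadExit` are hypothetical names; the scratch size of 256 is an arbitrary assumption.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Counts how many per-thread contexts have been destroyed so far.
std::atomic<int> destroyed{0};

struct SketchContext {
    float* scratchBuffer = new float[256]();  // assumed scratch size

    SketchContext() = default;
    // Non-copyable: the raw buffer must have exactly one owner.
    SketchContext(const SketchContext&) = delete;
    SketchContext& operator=(const SketchContext&) = delete;

    ~SketchContext() {
        delete[] scratchBuffer;  // freed once per thread, at thread exit
        ++destroyed;
    }
};

SketchContext& sketchThreadContext() {
    thread_local SketchContext ctx;
    return ctx;
}

// Uses the context on a fresh thread and returns how many destructors ran
// by the time that thread has been joined.
int destructionsAfterThreadExit() {
    const int before = destroyed.load();
    std::thread worker([] {
        sketchThreadContext().scratchBuffer[0] = 1.0f;
    });
    worker.join();  // the worker's thread-local destructors have run by now
    return destroyed.load() - before;
}
```

Because the destructor runs when the owning thread exits, any pointer into that thread's context becomes dangling afterwards, which is the reason such pointers should not be kept in global state.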

License¶

Part of AudioLab foundation library.