
🔧 Metrics Implementation Guide

🎯 Choosing Metric Type

┌────────────────────────────────────────────────────────────────┐
│                  METRIC TYPE DECISION TREE                     │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Question: Does value only go up?                             │
│  ├─ YES → Counter                                             │
│  │         Examples: errors, requests, events                 │
│  │         Use cases: rate(), increase()                      │
│  │                                                            │
│  └─ NO → Continue...                                          │
│                                                                │
│  Question: Care about distribution/percentiles?               │
│  ├─ YES → Histogram                                           │
│  │         Examples: latency, request size, duration          │
│  │         Use cases: histogram_quantile(), bucket analysis   │
│  │                                                            │
│  └─ NO → Gauge                                                │
│            Examples: CPU, memory, temperature, connections    │
│            Use cases: current value, avg_over_time()          │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Detailed Comparison

Metric Type   When to Use                       Example                              Query Pattern
-----------   -------------------------------   ----------------------------------   -------------------------------------------------
Counter       Monotonically increasing values   error_count, requests_total          rate(metric[5m])
Gauge         Values that go up and down        cpu_usage, memory_mb, active_users   metric, avg_over_time(metric[5m])
Histogram     Distribution of values            audio_latency, request_duration      histogram_quantile(0.95, rate(metric_bucket[5m]))
Summary       Pre-aggregated percentiles        api_latency                          metric{quantile="0.95"}
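
As a quick illustration of how the three main types differ at the call site, here is a minimal sketch using the collector-style API that appears throughout this guide (the exact method names, in particular histogram().observe(), are assumptions for illustration):

// Counter: only ever increases; query with rate() / increase()
metrics.counter("requests_total").increment();

// Gauge: current value of something that moves both ways
metrics.gauge("active_connections").set(42);

// Histogram: one observation per event; query with histogram_quantile()
metrics.histogram("audio_latency_milliseconds").observe(5.2);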

📊 Sampling Strategies

When to Sample

DO sample when:
- Event frequency > 1000/sec
- Storage cost is a concern
- Exact precision is not required
- Pattern detection is sufficient

DON'T sample when:
- Critical error tracking
- SLA monitoring
- Compliance/audit requirements
- Low frequency events (< 1/sec)

Sampling Methods

1. Random Sampling

// Sample 1% of events
if (rand() % 100 == 0) {
    metrics.recordLatency(latency);
}

// Probabilistic sampling
const float SAMPLE_RATE = 0.01f;  // 1%
if ((float)rand() / RAND_MAX < SAMPLE_RATE) {
    metrics.recordEvent();
}

Pros: Unbiased, simple to implement
Cons: Can miss rare events
Use case: High-frequency events (buffer processing)

2. Time-Based Sampling

static auto last_sample = std::chrono::steady_clock::now();
const auto now = std::chrono::steady_clock::now();

// Sample every 100ms
if (now - last_sample > std::chrono::milliseconds(100)) {
    metrics.recordCpuUsage(cpu_percent);
    last_sample = now;
}

Pros: Predictable rate, smooth data
Cons: Can miss spikes between samples
Use case: Resource monitoring (CPU, memory)

3. Reservoir Sampling

// Keep N samples from stream of unknown size
class ReservoirSampler {
    std::vector<double> samples_;
    size_t count_ = 0;

public:
    ReservoirSampler(size_t capacity) : samples_(capacity) {}

    void record(double value) {
        if (count_ < samples_.size()) {
            samples_[count_] = value;
        } else {
            // Replace with decreasing probability
            size_t j = rand() % (count_ + 1);
            if (j < samples_.size()) {
                samples_[j] = value;
            }
        }
        count_++;
    }
};

Pros: Uniform distribution, handles unbounded streams
Cons: More complex, needs memory
Use case: Latency distribution over long periods
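
To make the reservoir useful you also need to read values back out, e.g. a percentile over whatever was retained. A minimal sketch, assuming a hypothetical accessor such as sampler.snapshot() that copies out the filled portion of samples_ (the class above would need to expose one):

#include <algorithm>
#include <vector>

// Sketch: percentile over the retained samples (p in [0, 1])
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const size_t idx = static_cast<size_t>(p * (samples.size() - 1));
    return samples[idx];
}

// Example (hypothetical accessor):
// double p95 = percentile(sampler.snapshot(), 0.95);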

4. Adaptive Sampling

// Sample more during high-load periods
float getSampleRate(float cpu_usage) {
    if (cpu_usage > 90.0f) return 1.0f;    // 100% when critical
    if (cpu_usage > 70.0f) return 0.1f;    // 10% when high
    if (cpu_usage > 50.0f) return 0.01f;   // 1% when medium
    return 0.001f;                          // 0.1% when low
}

if ((float)rand() / RAND_MAX < getSampleRate(cpu)) {
    metrics.recordLatency(latency);
}

Pros: Balances detail and cost
Cons: More CPU overhead, complex logic
Use case: Production systems with varying load

💰 Cost Considerations

Cost Model

╔════════════════════════════════════════════════════════════╗
║                    MONTHLY COST ESTIMATE                   ║
╠════════════════════════════════════════════════════════════╣
║                                                            ║
║  Metrics: 1000 time series                                ║
║  Scrape interval: 15 seconds                              ║
║  Retention: 30 days                                       ║
║                                                            ║
║  Data points = 1000 × (86400 / 15) × 30                   ║
║              = 1000 × 5760 × 30                           ║
║              = 172,800,000 points/month                   ║
║                                                            ║
║  Storage: ~1-2 GB compressed                              ║
║  Cost: $50-500/month (service dependent)                  ║
║                                                            ║
║  ┌──────────────────────────────────────────────┐        ║
║  │ Provider    │ Cost/Million Samples           │        ║
║  ├─────────────┼────────────────────────────────┤        ║
║  │ Prometheus  │ Self-hosted (infra cost only)  │        ║
║  │ Datadog     │ ~$5                            │        ║
║  │ New Relic   │ ~$3                            │        ║
║  │ Grafana     │ ~$2                            │        ║
║  └──────────────────────────────────────────────┘        ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝
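
The same arithmetic as a small helper, so you can plug in your own series count, scrape interval, and retention (function and variable names are illustrative):

#include <cstdint>
#include <cstdio>

// Data points per month = series × (86400 / scrape_interval_s) × days
uint64_t samplesPerMonth(uint64_t series, uint64_t scrape_interval_s, uint64_t days) {
    return series * (86400 / scrape_interval_s) * days;
}

int main() {
    // 1000 series, 15 s scrape, 30 days -> 172,800,000 points/month
    std::printf("%llu\n", static_cast<unsigned long long>(samplesPerMonth(1000, 15, 30)));
    return 0;
}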

Optimization Strategies

1. Increase Scrape Interval

# Before: 15s interval = 5760 points/day
scrape_interval: 15s

# After: 60s interval = 1440 points/day (75% reduction)
scrape_interval: 60s

Impact: 75% cost reduction
Trade-off: Lower resolution for short-term spikes

2. Reduce Cardinality

// ❌ BAD: High cardinality (millions of unique values)
metrics.counter("requests")
    .labels({{"user_id", userId}});  // Unique per user

// ✅ GOOD: Low cardinality (< 100 unique values)
metrics.counter("requests")
    .labels({{"endpoint", "/api/audio"}});  // Limited set

Cardinality calculation:

Total series = metric_count × label1_values × label2_values × ...

Example:
- 10 metrics
- 5 hosts
- 3 environments
= 10 × 5 × 3 = 150 time series ✅

Bad example:
- 10 metrics
- 1000 users
- 100 sessions
= 10 × 1000 × 100 = 1,000,000 time series ❌
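
The same multiplication as code, handy for sanity-checking a label set before shipping it (an illustrative helper, not part of any metrics library):

#include <cstddef>
#include <initializer_list>

// Total series = metric_count × product of distinct values per label
size_t totalSeries(size_t metric_count, std::initializer_list<size_t> label_cardinalities) {
    size_t series = metric_count;
    for (size_t c : label_cardinalities) series *= c;
    return series;
}

// totalSeries(10, {5, 3})      ->       150  ✅
// totalSeries(10, {1000, 100}) -> 1,000,000  ❌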

3. Sample High-Volume Metrics

// High-frequency metric (10,000 events/sec)
static size_t sample_counter = 0;
if (++sample_counter % 100 == 0) {  // 1% sampling
    metrics.recordBufferProcessed();
}

// Low-frequency metric (< 1/sec)
metrics.recordError(error_type);  // No sampling

4. Use Recording Rules

# Pre-compute expensive queries
- record: cpu_usage_avg_5m
  expr: avg_over_time(cpu_usage[5m])

# Query pre-computed metric (cheap)
cpu_usage_avg_5m

# vs raw query (expensive)
avg_over_time(cpu_usage[5m])

Impact: queries against the pre-computed series are typically 10-100x faster, which also cuts query-time compute cost

5. Retention Policies

# Short retention for high-frequency
- match: '{__name__=~"audio_.*"}'
  retention: 7d

# Long retention for important metrics
- match: '{__name__=~"error_.*|crash_.*"}'
  retention: 90d

# Default
retention: 30d

🎯 Implementation Best Practices

1. Naming Conventions

Prometheus exposition format:

metric_name{label="value"} value timestamp

Naming pattern: <namespace>_<subsystem>_<name>_<unit>_<suffix>

Examples:

audiolab_audio_latency_milliseconds_total
audiolab_dsp_cpu_usage_percent
audiolab_buffer_underruns_total
audiolab_memory_usage_bytes

Rules:
- Use snake_case
- Include unit in name
- Add _total suffix for counters
- Namespace with project name
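
A small name-builder sketch that enforces these rules (the function name and signature are assumptions for illustration, not an existing API):

#include <string>

// Builds "<namespace>_<subsystem>_<name>_<unit>[_total]"
std::string buildMetricName(const std::string& ns, const std::string& subsystem,
                            const std::string& name, const std::string& unit,
                            bool is_counter) {
    std::string full = ns + "_" + subsystem + "_" + name + "_" + unit;
    if (is_counter) full += "_total";  // counters get the _total suffix
    return full;
}

// buildMetricName("audiolab", "dsp", "cpu_usage", "percent", false)
//   -> "audiolab_dsp_cpu_usage_percent"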

2. Label Design

Good labels:

// Low cardinality (< 10 values each)
.labels({
    {"host", "ableton"},      // 5-10 DAWs
    {"platform", "windows"},  // 3 platforms
    {"version", "1.2.3"}      // 10-20 versions
})

Bad labels:

// High cardinality
.labels({
    {"user_id", "12345"},        // ❌ Millions of users
    {"timestamp", "2024-01-01"}, // ❌ Infinite values
    {"session_id", "abc123"}     // ❌ Unique per session
})

3. Error Handling

void recordMetric() {
    try {
        metrics.recordLatency(latency);
    } catch (const std::exception& e) {
        // NEVER throw from metrics collection
        // Log error but continue execution
        std::cerr << "Metrics error: " << e.what() << "\n";
    }
}

Rule: Metrics collection must NEVER break the application

4. Real-Time Safety

// ✅ SAFE: Lock-free atomic operations
void processAudioBuffer(float* buffer, size_t size) {
    const auto start = std::chrono::high_resolution_clock::now();

    // Process audio...

    const auto end = std::chrono::high_resolution_clock::now();
    const auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    // Lock-free metric recording (< 1 μs)
    metrics.recordLatency(duration);
}

// ❌ UNSAFE: Locks, allocations, I/O
void processAudioBuffer(float* buffer, size_t size) {
    // Process audio...

    // ❌ Mutex lock in RT thread
    std::lock_guard<std::mutex> lock(metrics_mutex);
    metrics.push_back(latency);

    // ❌ Heap allocation in RT thread
    std::string label = "buffer_" + std::to_string(buffer_id);

    // ❌ File I/O in RT thread
    logfile << "Latency: " << latency << "\n";
}
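
For reference, here is what a lock-free recording path might look like under the hood: a sketch built on std::atomic counters (the class and member names are assumptions, not the actual MetricsCollector internals):

#include <atomic>
#include <chrono>
#include <cstdint>

// Sketch: RT-safe latency recording with plain atomics — no locks,
// no heap allocation, no I/O on the audio thread.
class RtLatencyRecorder {
    std::atomic<uint64_t> count_{0};
    std::atomic<uint64_t> total_us_{0};
    std::atomic<uint64_t> max_us_{0};

public:
    void record(std::chrono::microseconds latency) {
        const uint64_t us = static_cast<uint64_t>(latency.count());
        count_.fetch_add(1, std::memory_order_relaxed);
        total_us_.fetch_add(us, std::memory_order_relaxed);

        // Lock-free max update via compare-exchange loop
        uint64_t prev = max_us_.load(std::memory_order_relaxed);
        while (prev < us &&
               !max_us_.compare_exchange_weak(prev, us, std::memory_order_relaxed)) {
            // prev is reloaded on failure; retry until we win or another writer stored a larger value
        }
    }
};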

5. Testing Metrics

#include <gtest/gtest.h>

TEST(MetricsTest, CounterIncrement) {
    auto& metrics = MetricsCollector::instance();
    metrics.resetAll();

    metrics.recordBufferUnderrun();
    metrics.recordBufferUnderrun();

    auto snapshot = metrics.getSnapshot();
    EXPECT_EQ(snapshot.buffer_underruns, 2);
}

TEST(MetricsTest, HistogramPercentiles) {
    auto& metrics = MetricsCollector::instance();
    metrics.resetAll();

    for (int i = 0; i < 100; ++i) {
        metrics.recordLatency(std::chrono::microseconds(i * 100));
    }

    auto snapshot = metrics.getSnapshot();
    EXPECT_NEAR(snapshot.latency_p50, 5.0, 0.5);  // ~5ms median
    EXPECT_NEAR(snapshot.latency_p99, 9.9, 0.5);  // ~10ms p99
}

TEST(MetricsTest, ThreadSafety) {
    auto& metrics = MetricsCollector::instance();
    metrics.resetAll();

    std::vector<std::thread> threads;
    for (int i = 0; i < 10; ++i) {
        threads.emplace_back([&]() {
            for (int j = 0; j < 1000; ++j) {
                metrics.recordBufferUnderrun();
            }
        });
    }

    for (auto& t : threads) t.join();

    auto snapshot = metrics.getSnapshot();
    EXPECT_EQ(snapshot.buffer_underruns, 10000);
}

📈 Metric Aggregation Patterns

Time-Based Aggregation

# 5-minute average
avg_over_time(cpu_usage[5m])

# Maximum in last hour
max_over_time(cpu_usage[1h])

# Rate (per-second average)
rate(requests_total[5m])

# Increase (total change)
increase(errors_total[1h])

Spatial Aggregation

# Sum across all instances
sum(cpu_usage)

# Average across instances
avg(cpu_usage)

# Maximum across instances
max(cpu_usage)

# Count instances above threshold
count(cpu_usage > 80)

Percentile Aggregation

# 95th percentile (rate() over the buckets so the quantile reflects a recent window)
histogram_quantile(0.95, rate(audio_latency_bucket[5m]))

# Multiple percentiles
histogram_quantile(0.50, rate(audio_latency_bucket[5m]))  # p50
histogram_quantile(0.95, rate(audio_latency_bucket[5m]))  # p95
histogram_quantile(0.99, rate(audio_latency_bucket[5m]))  # p99

🚨 Common Pitfalls

1. Label Explosion

// ❌ Creates 1M+ time series
for (int user_id = 0; user_id < 1000000; ++user_id) {
    metrics.counter("logins")
        .labels({{"user_id", std::to_string(user_id)}});
}

// ✅ Aggregate by user type instead
metrics.counter("logins")
    .labels({{"user_type", getUserType(user_id)}});  // 3-5 types

2. Missing Units

// ❌ What unit is this?
metrics.gauge("audio_latency").set(5.2);

// ✅ Clear unit
metrics.gauge("audio_latency_milliseconds").set(5.2);

3. Inconsistent Naming

// ❌ Inconsistent
metrics.counter("Requests_Total");      // PascalCase
metrics.counter("errors-count");        // kebab-case
metrics.counter("cpu_usage_percent");   // snake_case

// ✅ Consistent snake_case
metrics.counter("requests_total");
metrics.counter("errors_total");
metrics.counter("cpu_usage_percent");

4. Missing _total Suffix

// ❌ Ambiguous (is this a rate or count?)
metrics.counter("errors");

// ✅ Clear counter
metrics.counter("errors_total");

5. Gauge vs Counter Confusion

// ❌ Using gauge for monotonic value
metrics.gauge("requests").increment();  // Wrong type

// ✅ Using counter for monotonic value
metrics.counter("requests_total").increment();

// ✅ Using gauge for value that goes up/down
metrics.gauge("active_connections").set(42);