🔧 Metrics Implementation Guide¶
🎯 Choosing Metric Type¶
METRIC TYPE DECISION TREE

Question: Does the value only go up?
├─ YES → Counter
│        Examples: errors, requests, events
│        Use cases: rate(), increase()
│
└─ NO → Continue...

Question: Do you care about the distribution/percentiles?
├─ YES → Histogram
│        Examples: latency, request size, duration
│        Use cases: histogram_quantile(), bucket analysis
│
└─ NO → Gauge
         Examples: CPU, memory, temperature, connections
         Use cases: current value, avg_over_time()
Detailed Comparison¶
| Metric Type | When to Use | Example | Query Pattern |
|---|---|---|---|
| Counter | Monotonically increasing values | error_count, requests_total | rate(metric[5m]) |
| Gauge | Values that go up and down | cpu_usage, memory_mb, active_users | metric, avg_over_time(metric[5m]) |
| Histogram | Distribution of values | audio_latency, request_duration | histogram_quantile(0.95, metric_bucket) |
| Summary | Pre-aggregated percentiles | api_latency | metric{quantile="0.95"} |
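In code, the choice maps directly to which instrument you create. A minimal sketch, assuming a Prometheus-style C++ collector with counter/gauge/histogram factories (the method names are illustrative, not a specific library's API):

// Counter: only ever goes up; query with rate()/increase()
auto& requests = metrics.counter("requests_total");
requests.increment();

// Gauge: set to the current value; can go up and down
auto& cpu = metrics.gauge("cpu_usage_percent");
cpu.set(42.5);

// Histogram: observe individual values; query with histogram_quantile()
auto& latency = metrics.histogram("audio_latency_milliseconds");
latency.observe(5.2);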
📊 Sampling Strategies¶
When to Sample¶
DO sample when:
- Event frequency > 1000/sec
- Storage cost is a concern
- Exact precision is not required
- Pattern detection is sufficient

DON'T sample when:
- Critical error tracking
- SLA monitoring
- Compliance/audit requirements
- Low-frequency events (< 1/sec)
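If sampled events feed a counter, scale each recorded increment by the inverse of the sampling rate, otherwise downstream rate() queries will under-report by that factor. A minimal sketch; the add(double) usage in the comment assumes a counter API that accepts fractional increments, which not every client library exposes:

#include <cstdlib>

// Returns the weight to add to a counter when an event survives sampling,
// or 0.0 when the event is dropped.
double sampledIncrement(double sample_rate) {
    if ((double)rand() / RAND_MAX < sample_rate) {
        return 1.0 / sample_rate;  // compensate for the dropped events
    }
    return 0.0;
}

// Usage (illustrative): metrics.counter("events_total").add(sampledIncrement(0.01));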
Sampling Methods¶
1. Random Sampling¶
// Sample 1% of events
if (rand() % 100 == 0) {
metrics.recordLatency(latency);
}
// Probabilistic sampling
const float SAMPLE_RATE = 0.01f; // 1%
if ((float)rand() / RAND_MAX < SAMPLE_RATE) {
metrics.recordEvent();
}
Pros: Unbiased, simple to implement
Cons: Can miss rare events
Use case: High-frequency events (buffer processing)
2. Time-Based Sampling¶
static auto last_sample = std::chrono::steady_clock::now();
const auto now = std::chrono::steady_clock::now();
// Sample every 100ms
if (now - last_sample > std::chrono::milliseconds(100)) {
metrics.recordCpuUsage(cpu_percent);
last_sample = now;
}
Pros: Predictable rate, smooth data
Cons: Can miss spikes between samples
Use case: Resource monitoring (CPU, memory)
3. Reservoir Sampling¶
// Keep N samples from a stream of unknown size (reservoir sampling, Algorithm R)
#include <cstddef>
#include <cstdlib>
#include <vector>

class ReservoirSampler {
    std::vector<double> samples_;
    size_t count_ = 0;
public:
    explicit ReservoirSampler(size_t capacity) : samples_(capacity) {}

    void record(double value) {
        if (count_ < samples_.size()) {
            // Reservoir not full yet: keep every value
            samples_[count_] = value;
        } else {
            // Reservoir full: keep this value with probability capacity / (count_ + 1)
            size_t j = rand() % (count_ + 1);
            if (j < samples_.size()) {
                samples_[j] = value;
            }
        }
        count_++;
    }
};
Pros: Uniform distribution, handles unbounded streams
Cons: More complex, needs memory
Use case: Latency distribution over long periods
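To turn the reservoir into percentile estimates, sort the stored samples on demand (outside the hot path). A minimal sketch; the samples() accessor on the sampler above is hypothetical:

#include <algorithm>
#include <vector>

// Estimate a quantile (q in 0.0-1.0) from the reservoir's current contents.
double estimateQuantile(std::vector<double> samples, double q) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const size_t idx = static_cast<size_t>(q * (samples.size() - 1));
    return samples[idx];
}

// Usage (hypothetical accessor):
// double p95 = estimateQuantile(sampler.samples(), 0.95);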
4. Adaptive Sampling¶
// Sample more during high-load periods
float getSampleRate(float cpu_usage) {
if (cpu_usage > 90.0f) return 1.0f; // 100% when critical
if (cpu_usage > 70.0f) return 0.1f; // 10% when high
if (cpu_usage > 50.0f) return 0.01f; // 1% when medium
return 0.001f; // 0.1% when low
}
if ((float)rand() / RAND_MAX < getSampleRate(cpu)) {
metrics.recordLatency(latency);
}
Pros: Balances detail and cost
Cons: More CPU overhead, complex logic
Use case: Production systems with varying load
💰 Cost Considerations¶
Cost Model¶
MONTHLY COST ESTIMATE

Assumptions:
- Metrics: 1000 time series
- Scrape interval: 15 seconds
- Retention: 30 days

Data points = 1000 × (86400 / 15) × 30
            = 1000 × 5760 × 30
            = 172,800,000 points/month

Storage: ~1-2 GB compressed
Cost: $50-500/month (service dependent)

| Provider | Cost/Million Samples |
|---|---|
| Prometheus | Self-hosted (infra cost only) |
| Datadog | ~$5 |
| New Relic | ~$3 |
| Grafana | ~$2 |
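The same arithmetic as a quick back-of-the-envelope calculator. The bytes-per-sample figure is an assumption (it varies widely with the backend, compression, and label/index overhead), so treat the storage line as an order-of-magnitude estimate:

#include <cstdio>

int main() {
    const double series           = 1000;   // active time series
    const double scrape_interval  = 15;     // seconds
    const double retention_days   = 30;
    const double bytes_per_sample = 8;      // assumed, incl. index/label overhead

    const double samples_per_series = (86400.0 / scrape_interval) * retention_days;  // 172,800
    const double total_samples      = series * samples_per_series;                   // 172.8M
    const double storage_gb         = total_samples * bytes_per_sample / 1e9;

    std::printf("Samples/month: %.0f\n", total_samples);
    std::printf("Storage:       ~%.1f GB (at %.0f bytes/sample)\n", storage_gb, bytes_per_sample);
    return 0;
}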
Optimization Strategies¶
1. Increase Scrape Interval¶
# Before: 15s interval = 5760 points/day
scrape_interval: 15s
# After: 60s interval = 1440 points/day (75% reduction)
scrape_interval: 60s
Impact: 75% cost reduction
Trade-off: Lower resolution for short-term spikes
2. Reduce Cardinality¶
// ❌ BAD: High cardinality (millions of unique values)
metrics.counter("requests")
.labels({{"user_id", userId}}); // Unique per user
// ✅ GOOD: Low cardinality (< 100 unique values)
metrics.counter("requests")
.labels({{"endpoint", "/api/audio"}}); // Limited set
Cardinality calculation:
Total series = metric_count × label1_values × label2_values × ...
Example:
- 10 metrics
- 5 hosts
- 3 environments
= 10 × 5 × 3 = 150 time series ✅
Bad example:
- 10 metrics
- 1000 users
- 100 sessions
= 10 × 1000 × 100 = 1,000,000 time series ❌
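One practical way to keep a label bounded is to whitelist the values you expect and collapse everything else into an "other" bucket, as in this sketch (the helper is hypothetical, not part of any specific metrics library):

#include <string>
#include <unordered_set>

// Caps a label's cardinality at allowed.size() + 1 distinct values.
std::string boundedLabel(const std::string& value,
                         const std::unordered_set<std::string>& allowed) {
    return allowed.count(value) ? value : "other";
}

// Usage (illustrative):
// static const std::unordered_set<std::string> kEndpoints = {"/api/audio", "/api/midi"};
// metrics.counter("requests").labels({{"endpoint", boundedLabel(endpoint, kEndpoints)}});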
3. Sample High-Volume Metrics¶
// High-frequency metric (10,000 events/sec)
static size_t sample_counter = 0;
if (++sample_counter % 100 == 0) { // 1% sampling
metrics.recordBufferProcessed();
}
// Low-frequency metric (< 1/sec)
metrics.recordError(error_type); // No sampling
4. Use Recording Rules¶
# Pre-compute expensive queries
- record: cpu_usage_avg_5m
expr: avg_over_time(cpu_usage[5m])
# Query pre-computed metric (cheap)
cpu_usage_avg_5m
# vs raw query (expensive)
avg_over_time(cpu_usage[5m])
Impact: 10-100x faster (and correspondingly cheaper) queries against the pre-computed metric
5. Retention Policies¶
# Short retention for high-frequency
- match: '{__name__=~"audio_.*"}'
retention: 7d
# Long retention for important metrics
- match: '{__name__=~"error_.*|crash_.*"}'
retention: 90d
# Default
retention: 30d
🎯 Implementation Best Practices¶
1. Naming Conventions¶
Pattern: <namespace>_<subsystem>_<name>_<unit>_<suffix>
Examples:
audiolab_audio_latency_milliseconds_total
audiolab_dsp_cpu_usage_percent
audiolab_buffer_underruns_total
audiolab_memory_usage_bytes
Rules:
- Use snake_case
- Include unit in name
- Add _total suffix for counters
- Namespace with project name
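Applied in code, the convention looks like this (the registration calls are illustrative; only the names matter):

// Counters get the _total suffix; units are spelled out in the name
metrics.counter("audiolab_buffer_underruns_total");
metrics.counter("audiolab_audio_latency_milliseconds_total");
metrics.gauge("audiolab_dsp_cpu_usage_percent");
metrics.gauge("audiolab_memory_usage_bytes");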
2. Label Design¶
Good labels:
// Low cardinality (< 10 values each)
.labels({
{"host", "ableton"}, // 5-10 DAWs
{"platform", "windows"}, // 3 platforms
{"version", "1.2.3"} // 10-20 versions
})
Bad labels:
// High cardinality
.labels({
{"user_id", "12345"}, // ❌ Millions of users
{"timestamp", "2024-01-01"}, // ❌ Infinite values
{"session_id", "abc123"} // ❌ Unique per session
})
3. Error Handling¶
void recordMetric() {
try {
metrics.recordLatency(latency);
} catch (const std::exception& e) {
// NEVER throw from metrics collection
// Log error but continue execution
std::cerr << "Metrics error: " << e.what() << "\n";
}
}
Rule: Metrics collection must NEVER break the application
4. Real-Time Safety¶
// ✅ SAFE: Lock-free atomic operations
void processAudioBuffer(float* buffer, size_t size) {
const auto start = std::chrono::high_resolution_clock::now();
// Process audio...
const auto end = std::chrono::high_resolution_clock::now();
const auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
// Lock-free metric recording (< 1 μs)
metrics.recordLatency(duration);
}
// ❌ UNSAFE: Locks, allocations, I/O
void processAudioBuffer(float* buffer, size_t size) {
// Process audio...
// ❌ Mutex lock in RT thread
std::lock_guard<std::mutex> lock(metrics_mutex);
metrics.push_back(latency);
// ❌ Heap allocation in RT thread
std::string label = "buffer_" + std::to_string(buffer_id);
// ❌ File I/O in RT thread
logfile << "Latency: " << latency << "\n";
}
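What "lock-free metric recording" can look like underneath: a sketch of a fixed-bucket latency histogram built only from relaxed atomic increments, so the audio thread never blocks or allocates (illustrative; not the actual MetricsCollector implementation):

#include <array>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>

class RtLatencyHistogram {
    // Bucket upper bounds in microseconds; the last slot catches everything larger.
    static constexpr std::array<uint64_t, 5> kBoundsUs{100, 500, 1000, 5000, 10000};
    std::array<std::atomic<uint64_t>, kBoundsUs.size() + 1> counts_{};

public:
    // Safe to call from the RT thread: no locks, no heap allocation.
    void record(std::chrono::microseconds latency) noexcept {
        const uint64_t us = static_cast<uint64_t>(latency.count());
        size_t bucket = kBoundsUs.size();                  // overflow bucket
        for (size_t i = 0; i < kBoundsUs.size(); ++i) {
            if (us <= kBoundsUs[i]) { bucket = i; break; }
        }
        counts_[bucket].fetch_add(1, std::memory_order_relaxed);
    }

    // The exporter thread reads the counters at its own pace.
    uint64_t count(size_t bucket) const noexcept {
        return counts_[bucket].load(std::memory_order_relaxed);
    }
};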
5. Testing Metrics¶
#include <gtest/gtest.h>
#include <chrono>
#include <thread>
#include <vector>
TEST(MetricsTest, CounterIncrement) {
auto& metrics = MetricsCollector::instance();
metrics.resetAll();
metrics.recordBufferUnderrun();
metrics.recordBufferUnderrun();
auto snapshot = metrics.getSnapshot();
EXPECT_EQ(snapshot.buffer_underruns, 2);
}
TEST(MetricsTest, HistogramPercentiles) {
auto& metrics = MetricsCollector::instance();
metrics.resetAll();
for (int i = 0; i < 100; ++i) {
metrics.recordLatency(std::chrono::microseconds(i * 100));
}
auto snapshot = metrics.getSnapshot();
EXPECT_NEAR(snapshot.latency_p50, 5.0, 0.5); // ~5ms median
EXPECT_NEAR(snapshot.latency_p99, 9.9, 0.5); // ~10ms p99
}
TEST(MetricsTest, ThreadSafety) {
auto& metrics = MetricsCollector::instance();
metrics.resetAll();
std::vector<std::thread> threads;
for (int i = 0; i < 10; ++i) {
threads.emplace_back([&]() {
for (int j = 0; j < 1000; ++j) {
metrics.recordBufferUnderrun();
}
});
}
for (auto& t : threads) t.join();
auto snapshot = metrics.getSnapshot();
EXPECT_EQ(snapshot.buffer_underruns, 10000);
}
📈 Metric Aggregation Patterns¶
Time-Based Aggregation¶
# 5-minute average
avg_over_time(cpu_usage[5m])
# Maximum in last hour
max_over_time(cpu_usage[1h])
# Rate (per-second average)
rate(requests_total[5m])
# Increase (total change)
increase(errors_total[1h])
Spatial Aggregation¶
# Sum across all instances
sum(cpu_usage)
# Average across instances
avg(cpu_usage)
# Maximum across instances
max(cpu_usage)
# Count instances above threshold
count(cpu_usage > 80)
Percentile Aggregation¶
# 95th percentile (rate over the buckets, then quantile)
histogram_quantile(0.95, rate(audio_latency_bucket[5m]))
# Multiple percentiles
histogram_quantile(0.50, rate(audio_latency_bucket[5m]))  # p50
histogram_quantile(0.95, rate(audio_latency_bucket[5m]))  # p95
histogram_quantile(0.99, rate(audio_latency_bucket[5m]))  # p99
🚨 Common Pitfalls¶
1. Label Explosion¶
// ❌ Creates 1M+ time series
for (int user_id = 0; user_id < 1000000; ++user_id) {
metrics.counter("logins")
.labels({{"user_id", std::to_string(user_id)}});
}
// ✅ Aggregate by user type instead
metrics.counter("logins")
.labels({{"user_type", getUserType(user_id)}}); // 3-5 types
2. Missing Units¶
// ❌ What unit is this?
metrics.gauge("audio_latency").set(5.2);
// ✅ Clear unit
metrics.gauge("audio_latency_milliseconds").set(5.2);
3. Inconsistent Naming¶
// ❌ Inconsistent
metrics.counter("Requests_Total"); // PascalCase
metrics.counter("errors-count"); // kebab-case
metrics.counter("cpu_usage_percent"); // snake_case
// ✅ Consistent snake_case
metrics.counter("requests_total");
metrics.counter("errors_total");
metrics.counter("cpu_usage_percent");
4. Missing _total Suffix¶
// ❌ Ambiguous (is this a rate or count?)
metrics.counter("errors");
// ✅ Clear counter
metrics.counter("errors_total");