📊 AudioLab Monitoring & Observability

Complete monitoring and observability infrastructure for the AudioLab system.

📁 Structure

03_08_monitoring_observability/
├── 03_08_01_metrics_collection/         # Metrics instrumentation
│   ├── prometheus_config.yml            # Prometheus configuration
│   ├── metrics_registry.hpp             # C++ metrics SDK
│   └── METRICS_GUIDE.md                 # Metrics best practices
├── 03_08_02_dashboards_templates/       # Grafana dashboards
│   ├── DASHBOARD_GUIDELINES.md          # Dashboard design rules
│   ├── overview_dashboard.json          # System overview
│   └── performance_dashboard.json       # Performance monitoring
└── 03_08_03_alerting_rules/            # Alerting configuration
    ├── alert_rules.yaml                 # Prometheus alert rules
    ├── ONCALL_PLAYBOOKS.md             # Incident response guides
    └── ALERT_FATIGUE_PREVENTION.md     # Alert quality guidelines

🎯 Quick Start

1. Setup Metrics Collection

# Start Prometheus
docker run -d \
  -p 9090:9090 \
  -v $(pwd)/03_08_01_metrics_collection/prometheus_config.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

# Verify Prometheus is up (this endpoint returns Prometheus's own metrics)
curl http://localhost:9090/metrics

2. Import Dashboards

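# Import to Grafana
# Note: the Grafana HTTP API normally requires authentication (for example an
# "Authorization: Bearer <token>" header), and the request body must wrap the
# dashboard as {"dashboard": {...}}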
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @03_08_02_dashboards_templates/overview_dashboard.json

curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @03_08_02_dashboards_templates/performance_dashboard.json

3. Configure Alerts

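# Validate the alert rules (promtool only checks them; Prometheus loads the
# files listed under rule_files in its configuration)
promtool check rules 03_08_03_alerting_rules/alert_rules.yaml

# Reload Prometheus config (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload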

📊 Available Dashboards

1. System Overview

  • File: overview_dashboard.json
  • Purpose: High-level system health
  • Metrics:
      • System Status (UP/DOWN)
      • CPU Usage (%)
      • Memory Usage (%)
      • Error Rate
      • Active Alerts
      • Disk Usage
      • Network I/O
  • Refresh: 10s
  • Time Range: Last 15 minutes

2. Performance Monitoring

  • File: performance_dashboard.json
  • Purpose: Detailed performance analysis
  • Metrics:
      • Latency Percentiles (p50, p95, p99)
      • Latency Histogram (heatmap)
      • Buffer Underruns
      • Plugin Load Time
      • CPU by Component
      • Throughput (samples/sec)
      • Top 10 Slowest Operations
  • Refresh: 5s
  • Time Range: Last 1 hour

🚨 Alert Categories

Critical Alerts (Page Immediately)

  • ServiceDown - Service is down for > 1 minute
  • HighErrorRate - Error rate > 1% for 5 minutes
  • CrashDetected - Application crash detected
  • CriticalCPUUsage - CPU > 95% for 5 minutes
  • CriticalLatency - p99 latency > 100ms for 5 minutes
  • DiskSpaceCritical - Disk > 95% full

Warning Alerts (Page During Business Hours)

  • HighCPUUsage - CPU > 80% for 10 minutes
  • HighMemoryUsage - Memory > 80% for 10 minutes
  • HighLatency - p99 latency > 50ms for 10 minutes
  • BufferUnderrun - Audio buffer underruns detected
  • ErrorRateSpiking - Error rate increased 5x
  • SlowPluginLoad - Plugin load time > 500ms

Info Alerts (Log Only, No Page)

  • NewVersionDeployed - New version deployed
  • RateLimitHit - Rate limit frequently hit
  • Watchdog - Monitoring health check (always firing)
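
Most of the thresholds above pair a condition with a hold duration ("CPU > 95% for 5 minutes"). As a rough illustration of how such a rule behaves, the sketch below models the pending-to-firing transition that Prometheus applies to rules with a `for:` clause. The `HoldRule` type and its method names are hypothetical and belong to neither AudioLab nor Prometheus; the code needs C++17.

```cpp
#include <chrono>
#include <optional>

// Hypothetical model of an alert rule with a hold duration ("for" clause).
enum class AlertState { Inactive, Pending, Firing };

class HoldRule {
public:
    HoldRule(double threshold, std::chrono::seconds hold)
        : threshold_(threshold), hold_(hold) {}

    // Feed one sample taken at time `now` (seconds since an arbitrary epoch).
    AlertState evaluate(double value, std::chrono::seconds now) {
        if (value <= threshold_) {
            breached_since_.reset();   // condition cleared: start over
            return AlertState::Inactive;
        }
        if (!breached_since_) breached_since_ = now;   // first breaching sample
        return (now - *breached_since_ >= hold_) ? AlertState::Firing
                                                 : AlertState::Pending;
    }

private:
    double threshold_;
    std::chrono::seconds hold_;
    std::optional<std::chrono::seconds> breached_since_;
};
```

A single sample below the threshold resets the hold timer, which is why short spikes do not page anyone.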

📖 Key Metrics

Application Metrics

# Error rate
rate(audiolab_errors_total[5m])

# Latency percentiles
histogram_quantile(0.99, rate(audiolab_latency_seconds_bucket[5m]))

# Throughput
rate(audiolab_samples_processed_total[5m])

# CPU by component
sum by (component) (rate(audiolab_cpu_seconds_total[5m]))

# Buffer underruns
increase(audiolab_buffer_underruns_total[5m])
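
`histogram_quantile` estimates a percentile from cumulative bucket counts: it finds the bucket that contains the target rank and interpolates linearly inside it. The sketch below reproduces that idea in C++ to show why bucket boundaries limit precision. `Bucket` and `estimateQuantile` are illustrative names, and the handling is simplified (finite bounds only, first bucket assumed to start at 0), so it is not a faithful copy of the Prometheus implementation.

```cpp
#include <vector>

// One cumulative bucket: number of observations <= upper_bound.
struct Bucket {
    double upper_bound;
    double cumulative_count;
};

// Estimate quantile q (0..1) from buckets sorted by upper_bound.
// Simplified: assumes finite bounds and a lower bound of 0 for the first bucket.
double estimateQuantile(double q, const std::vector<Bucket>& buckets) {
    if (buckets.empty() || buckets.back().cumulative_count <= 0) return 0.0;
    const double rank = q * buckets.back().cumulative_count;
    double lower_bound = 0.0;
    double lower_count = 0.0;
    for (const Bucket& b : buckets) {
        if (b.cumulative_count >= rank) {
            const double in_bucket = b.cumulative_count - lower_count;
            if (in_bucket <= 0) return b.upper_bound;
            // Linear interpolation inside the bucket that contains the rank.
            return lower_bound + (b.upper_bound - lower_bound) *
                                     (rank - lower_count) / in_bucket;
        }
        lower_bound = b.upper_bound;
        lower_count = b.cumulative_count;
    }
    return buckets.back().upper_bound;
}
```

The estimate can never be finer than the bucket layout allows, so buckets should be dense around the latency thresholds that alerts depend on.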

System Metrics

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

🔧 Instrumentation Guide

Adding Metrics in C++

#include "metrics_registry.hpp"

// Define metrics
static auto& cpu_usage = metrics::gauge("audiolab_cpu_usage")
    .help("CPU usage percentage")
    .label("component", "audio_engine")
    .build();

static auto& latency_histogram = metrics::histogram("audiolab_latency_seconds")
    .help("Processing latency in seconds")
    .buckets({0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0})
    .build();

// Record metrics
void processAudio() {
    auto timer = latency_histogram.start_timer();

    // ... process audio ...

    cpu_usage.set(getCurrentCPUUsage());
}

Counter Pattern

// Errors
static auto& errors = metrics::counter("audiolab_errors_total")
    .label("severity", "error")
    .label("component", "reverb")
    .build();

errors.inc();  // Increment by 1
errors.inc(5); // Increment by 5
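
`metrics_registry.hpp` is not reproduced in this README, so the following is only a guess at what a counter with this interface might look like internally. It shows one common design: a relaxed atomic add, which is typically lock-free on 64-bit platforms and therefore reasonable to call from a real-time audio thread. `SketchCounter` is a hypothetical name, not the SDK's type.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a monotonically increasing counter.
class SketchCounter {
public:
    // Add `n` to the counter; relaxed ordering is enough for a pure tally.
    void inc(std::uint64_t n = 1) noexcept {
        value_.fetch_add(n, std::memory_order_relaxed);
    }

    // Read the current total, e.g. when a scrape renders the metric.
    std::uint64_t value() const noexcept {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};
```

Counters only ever increase; values that can go down (such as the CPU usage above) belong in a gauge.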

Histogram Pattern

// Latency distribution
static auto& latency = metrics::histogram("audiolab_request_duration_seconds")
    .buckets({0.001, 0.01, 0.1, 1.0})
    .build();

{
    auto timer = latency.start_timer();
    // ... timed operation ...
}  // Auto-recorded on destruction

// Or manual
latency.observe(0.042);  // 42ms
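
To make the "auto-recorded on destruction" behavior concrete, here is a minimal sketch of a histogram with an RAII timer. It is an assumption about how such a type could work, not the contents of `metrics_registry.hpp`: `SketchHistogram`, `count_in_bucket`, and `sum` are hypothetical names, the type is not thread-safe, and it stores per-bucket counts (Prometheus exposes cumulative ones). It relies on C++17 copy elision.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical histogram: one count per upper bound, plus an overflow slot.
class SketchHistogram {
public:
    explicit SketchHistogram(std::vector<double> bounds)
        : bounds_(std::move(bounds)), counts_(bounds_.size() + 1, 0) {}

    // Record one value in the first bucket whose bound is >= value.
    void observe(double value) {
        std::size_t i = 0;
        while (i < bounds_.size() && value > bounds_[i]) ++i;
        ++counts_[i];   // the last slot acts as the +Inf bucket
        sum_ += value;
    }

    // RAII timer: observes the elapsed seconds when it goes out of scope.
    class Timer {
    public:
        explicit Timer(SketchHistogram& h)
            : hist_(h), start_(std::chrono::steady_clock::now()) {}
        ~Timer() {
            const std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start_;
            hist_.observe(elapsed.count());
        }
        Timer(const Timer&) = delete;
        Timer& operator=(const Timer&) = delete;

    private:
        SketchHistogram& hist_;
        std::chrono::steady_clock::time_point start_;
    };

    Timer start_timer() { return Timer(*this); }
    std::size_t count_in_bucket(std::size_t i) const { return counts_[i]; }
    double sum() const { return sum_; }

private:
    std::vector<double> bounds_;
    std::vector<std::size_t> counts_;
    double sum_ = 0.0;
};
```

With bounds `{0.001, 0.01, 0.1, 1.0}`, the 42 ms observation above lands in the third bucket (bound 0.1).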

🎨 Dashboard Design Guidelines

Layout Rules

  • Max 9 panels per dashboard
  • 12-column grid system
  • Consistent colors:
      • Green: < 50% (good)
      • Yellow: 50-80% (warning)
      • Red: > 80% (critical)
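
The color bands leave the boundary values slightly ambiguous. The sketch below shows one possible mapping, assuming that exactly 50% counts as yellow and exactly 80% stays yellow; `thresholdColor` is a hypothetical helper, and the actual thresholds live in the dashboard JSON.

```cpp
#include <string_view>

// Map a percentage to the dashboard color bands.
// Assumption: 50 is the first yellow value and 80 is the last one.
constexpr std::string_view thresholdColor(double percent) {
    if (percent < 50.0) return "green";
    if (percent <= 80.0) return "yellow";
    return "red";
}
```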

Panel Types

  • Stat panels: Single current values
  • Time series: Trends over time
  • Tables: Detailed breakdowns
  • Heatmaps: Distribution visualization
  • Gauges: Percentage with thresholds

Naming Conventions

  • Dashboard: [System] - [Purpose]
  • Panel: [Metric] ([Unit])
  • Query: {service="name", env="prod"}

See DASHBOARD_GUIDELINES.md for details.

🚑 Incident Response

When Alert Fires

  1. Acknowledge in PagerDuty/Opsgenie
  2. Check dashboard for context
  3. Follow playbook for alert type
  4. Escalate if needed (see escalation path)
  5. Document in incident channel

Playbooks Available

  • High Error Rate
  • High CPU Usage
  • Crash Detected
  • High Latency
  • Buffer Underrun
  • Slow Plugin Load

See ONCALL_PLAYBOOKS.md for details.

📈 Alert Quality

Prevention Strategies

  • Deduplication: Group similar alerts
  • Progressive thresholds: Warning → Critical
  • Time-based routing: Business hours vs off-hours
  • Maintenance windows: Auto-silence during deploys
  • Data-driven thresholds: Based on historical data

Key Metrics

  • False positive rate: < 5%
  • Mean time to acknowledge: < 5 min
  • Alerts per day: < 10
  • Alert grouping rate: > 80%
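
Two of these targets can be computed directly from a log of past alerts. The sketch below assumes a hypothetical `AlertRecord` with an `actionable` flag and an acknowledgement delay; neither the struct nor `summarize` comes from the AudioLab tooling, and a real pipeline would likely pull this data from the paging system.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of one alert notification.
struct AlertRecord {
    bool actionable;         // false means the alert was a false positive
    double seconds_to_ack;   // time from page to acknowledgement
};

struct AlertQuality {
    double false_positive_rate;    // fraction in the range 0..1
    double mean_time_to_ack_min;   // minutes
};

// Summarize a batch of alerts; returns zeros for an empty batch.
AlertQuality summarize(const std::vector<AlertRecord>& alerts) {
    if (alerts.empty()) return {0.0, 0.0};
    std::size_t false_positives = 0;
    double total_ack_seconds = 0.0;
    for (const AlertRecord& a : alerts) {
        if (!a.actionable) ++false_positives;
        total_ack_seconds += a.seconds_to_ack;
    }
    const double n = static_cast<double>(alerts.size());
    return {static_cast<double>(false_positives) / n,
            total_ack_seconds / n / 60.0};
}
```

A batch would meet the targets above when `false_positive_rate` is below 0.05 and `mean_time_to_ack_min` is below 5.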

See ALERT_FATIGUE_PREVENTION.md for details.

🔗 Integration Points

Prometheus Exporters

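# Application metrics
# Note: this target is identical to the node job's target below (9100 is the
# node_exporter default port); point it at the port where the AudioLab
# metrics endpoint actually listens.
- job_name: 'audiolab'
  static_configs:
    - targets: ['localhost:9100']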

# Node metrics
- job_name: 'node'
  static_configs:
    - targets: ['localhost:9100']

# Plugin metrics
- job_name: 'plugins'
  static_configs:
    - targets: ['localhost:9101']

Alertmanager Receivers

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<key>'
        severity: 'critical'

  - name: 'slack'
    slack_configs:
      - api_url: '<webhook>'
        channel: '#alerts'

  - name: 'email'
    email_configs:
      - to: 'team@audiolab.com'

Grafana Data Sources

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    url: http://loki:3100

📊 SLO Tracking

Service Level Objectives

# Availability SLO: 99.9%
# (up is a gauge, so average it over the window instead of taking its rate)
- alert: AvailabilitySLOViolation
  expr: |
    avg(avg_over_time(up{job="audiolab"}[30d])) < 0.999

# Latency SLO: p99 < 100ms
- alert: LatencySLOViolation
  expr: |
    histogram_quantile(0.99,
      rate(audiolab_latency_seconds_bucket[30d])
    ) > 0.100

# Error budget
- alert: ErrorBudgetBurnRate
  expr: |
    (
      1 - (
        sum(rate(requests_total{status=~"2.."}[1h]))
        /
        sum(rate(requests_total[1h]))
      )
    ) > 0.001  # SLO: 99.9%
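
The arithmetic behind these objectives is simple to check by hand: a 99.9% SLO over 30 days leaves (1 - 0.999) × 30 × 24 × 60 ≈ 43.2 minutes of error budget. The sketch below encodes that calculation and the usual definition of burn rate (observed error ratio divided by the allowed ratio); the function names are illustrative only.

```cpp
// Minutes of allowed unavailability for an SLO over a window of days.
constexpr double errorBudgetMinutes(double slo, double window_days) {
    return (1.0 - slo) * window_days * 24.0 * 60.0;
}

// Burn rate: observed error ratio relative to the ratio the SLO allows.
// A value of 1.0 spends the budget exactly over the full window.
constexpr double burnRate(double error_ratio, double slo) {
    return error_ratio / (1.0 - slo);
}
```

Under this definition, the `ErrorBudgetBurnRate` expression above fires whenever the one-hour burn rate exceeds 1, that is, when the error ratio is above 0.1%.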

🛠️ Maintenance

Daily Tasks

  • Check alert health dashboard
  • Review overnight pages
  • Acknowledge known issues

Weekly Tasks

  • Review top 10 noisy alerts
  • Tune false positive alerts
  • Update runbooks

Monthly Tasks

  • Alert deep dive meeting
  • Review dashboard usage
  • Clean up unused metrics

Quarterly Tasks

  • Full alert audit
  • Dashboard consolidation
  • SLO review

📚 Resources

Documentation

On-Call


Last Updated: 2024-10-03
Maintainer: AudioLab SRE Team
Version: 1.0