📊 AudioLab Monitoring & Observability

Complete monitoring and observability infrastructure for the AudioLab system.

📁 Structure

03_08_monitoring_observability/
├── 03_08_01_metrics_collection/         # Metrics instrumentation
│   ├── prometheus_config.yml            # Prometheus configuration
│   ├── metrics_registry.hpp             # C++ metrics SDK
│   └── METRICS_GUIDE.md                 # Metrics best practices
│
├── 03_08_02_dashboards_templates/       # Grafana dashboards
│   ├── DASHBOARD_GUIDELINES.md          # Dashboard design rules
│   ├── overview_dashboard.json          # System overview
│   └── performance_dashboard.json       # Performance monitoring
│
└── 03_08_03_alerting_rules/             # Alerting configuration
    ├── alert_rules.yaml                 # Prometheus alert rules
    ├── ONCALL_PLAYBOOKS.md              # Incident response guides
    └── ALERT_FATIGUE_PREVENTION.md      # Alert quality guidelines

🎯 Quick Start

1. Set Up Metrics Collection

# Start Prometheus
docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/03_08_01_metrics_collection/prometheus_config.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus

# Verify Prometheus is up (this endpoint serves Prometheus's own metrics)
curl http://localhost:9090/metrics

2. Import Dashboards

# Import to Grafana (the API expects authentication and a payload of the form {"dashboard": {...}, "overwrite": true})
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @03_08_02_dashboards_templates/overview_dashboard.json

curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @03_08_02_dashboards_templates/performance_dashboard.json

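3. Configure Alerts

# Validate the alert rules (Prometheus loads rule files listed under rule_files in its configuration)
promtool check rules 03_08_03_alerting_rules/alert_rules.yaml

# Reload Prometheus config (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload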

📊 Available Dashboards

1. System Overview

File: overview_dashboard.json
- Purpose: High-level system health
- Metrics:
  - System Status (UP/DOWN)
  - CPU Usage (%)
  - Memory Usage (%)
  - Error Rate
  - Active Alerts
  - Disk Usage
  - Network I/O
- Refresh: 10s
- Time Range: Last 15 minutes

2. Performance Monitoring

File: performance_dashboard.json
- Purpose: Detailed performance analysis
- Metrics:
  - Latency Percentiles (p50, p95, p99)
  - Latency Histogram (heatmap)
  - Buffer Underruns
  - Plugin Load Time
  - CPU by Component
  - Throughput (samples/sec)
  - Top 10 Slowest Operations
- Refresh: 5s
- Time Range: Last 1 hour

🚨 Alert Categories

Critical Alerts (Page Immediately)

ServiceDown - Service is down for > 1 minute
HighErrorRate - Error rate > 1% for 5 minutes
CrashDetected - Application crash detected
CriticalCPUUsage - CPU > 95% for 5 minutes
CriticalLatency - p99 latency > 100ms for 5 minutes
DiskSpaceCritical - Disk > 95% full

Warning Alerts (Page During Business Hours)

HighCPUUsage - CPU > 80% for 10 minutes
HighMemoryUsage - Memory > 80% for 10 minutes
HighLatency - p99 latency > 50ms for 10 minutes
BufferUnderrun - Audio buffer underruns detected
ErrorRateSpiking - Error rate increased 5x
SlowPluginLoad - Plugin load time > 500ms

Info Alerts (Log Only, No Page)

NewVersionDeployed - New version deployed
RateLimitHit - Rate limit frequently hit
Watchdog - Monitoring health check (always firing)
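
Several of the alerts above only fire once a condition has held "for N minutes". The following is a minimal C++ sketch of that hold behavior, assuming samples arrive with a timestamp in seconds. `HoldRule` and its members are hypothetical names used for illustration; they are not part of Prometheus or of metrics_registry.hpp.

```cpp
// Hypothetical sketch of a "value > threshold for N seconds" rule.
class HoldRule {
public:
    HoldRule(double threshold, double hold_seconds)
        : threshold_(threshold), hold_seconds_(hold_seconds) {}

    // Feeds one sample taken at now_seconds. Returns true only when the value
    // has stayed above the threshold for at least hold_seconds.
    bool update(double value, double now_seconds) {
        if (value <= threshold_) {
            breaching_ = false;  // condition cleared, so the hold timer restarts
            return false;
        }
        if (!breaching_) {
            breaching_ = true;
            breach_start_ = now_seconds;  // first sample above the threshold
        }
        return now_seconds - breach_start_ >= hold_seconds_;
    }

private:
    double threshold_;
    double hold_seconds_;
    bool breaching_ = false;
    double breach_start_ = 0.0;
};
```

For example, `HoldRule cpu(95.0, 300.0)` would model CriticalCPUUsage: a single dip to 95% or below resets the timer, which is what keeps short spikes from paging.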

📖 Key Metrics

Application Metrics

# Error rate (errors per second)
rate(audiolab_errors_total[5m])

# Latency percentiles
histogram_quantile(0.99, rate(audiolab_latency_seconds_bucket[5m]))

# Throughput
rate(audiolab_samples_processed_total[5m])

# CPU by component
sum by (component) (rate(audiolab_cpu_seconds_total[5m]))

# Buffer underruns
increase(audiolab_buffer_underruns_total[5m])

System Metrics

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

🔧 Instrumentation Guide

Adding Metrics in C++

#include "metrics_registry.hpp"

// Define metrics
static auto& cpu_usage = metrics::gauge("audiolab_cpu_usage")
    .help("CPU usage percentage")
    .label("component", "audio_engine")
    .build();

static auto& latency_histogram = metrics::histogram("audiolab_latency_seconds")
    .help("Processing latency in seconds")
    .buckets({0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0})
    .build();

// Record metrics
void processAudio() {
    auto timer = latency_histogram.start_timer();

    // ... process audio ...

    cpu_usage.set(getCurrentCPUUsage());
}

Counter Pattern

// Errors
static auto& errors = metrics::counter("audiolab_errors_total")
    .label("severity", "error")
    .label("component", "reverb")
    .build();

errors.inc();  // Increment by 1
errors.inc(5); // Increment by 5
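
For readers curious how a counter like this can stay cheap enough to call from an audio thread, here is a minimal sketch built on `std::atomic`. It is an assumption about one reasonable design, not the actual metrics_registry.hpp implementation, and the `Counter` class below is hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a monotonic counter metric.
class Counter {
public:
    // Adds n to the counter. Relaxed ordering is sufficient because the value
    // is only read for reporting and never used to synchronize other data.
    void inc(std::uint64_t n = 1) noexcept {
        value_.fetch_add(n, std::memory_order_relaxed);
    }

    // Returns the current total, e.g. for a scrape handler to export.
    std::uint64_t value() const noexcept {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};
```

The type is unsigned and exposes no decrement, which matches the rule that counters only increase. Whether the atomic is lock-free depends on the platform, so it is worth checking before relying on it in a real-time path.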

Histogram Pattern

// Latency distribution
static auto& latency = metrics::histogram("audiolab_request_duration_seconds")
    .buckets({0.001, 0.01, 0.1, 1.0})
    .build();

{
    auto timer = latency.start_timer();
    // ... timed operation ...
}  // Auto-recorded on destruction

// Or manual
latency.observe(0.042);  // 42ms
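
The "auto-recorded on destruction" behavior above is a standard RAII pattern. The sketch below shows one way it could work, under the assumption that the timer records elapsed seconds into the histogram from its destructor. The `Histogram` and `Timer` classes here are hypothetical, are not thread-safe, and do not reflect the real metrics_registry.hpp internals.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal histogram with fixed upper bounds.
class Histogram {
public:
    explicit Histogram(std::vector<double> upper_bounds)
        : bounds_(std::move(upper_bounds)), counts_(bounds_.size() + 1, 0) {}

    // Records one observation in the first bucket whose bound is >= value;
    // values above every bound land in the final overflow (+Inf) bucket.
    void observe(double value) {
        std::size_t i = 0;
        while (i < bounds_.size() && value > bounds_[i]) ++i;
        ++counts_[i];
        ++total_;
    }

    // RAII timer: measures from construction to destruction, in seconds.
    class Timer {
    public:
        explicit Timer(Histogram& h)
            : hist_(h), start_(std::chrono::steady_clock::now()) {}
        Timer(Timer&& other) noexcept
            : hist_(other.hist_), start_(other.start_), active_(other.active_) {
            other.active_ = false;  // the moved-from timer must not record
        }
        Timer(const Timer&) = delete;
        Timer& operator=(const Timer&) = delete;
        ~Timer() {
            if (!active_) return;
            const std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start_;
            hist_.observe(elapsed.count());
        }

    private:
        Histogram& hist_;
        std::chrono::steady_clock::time_point start_;
        bool active_ = true;
    };

    Timer start_timer() { return Timer(*this); }

    std::size_t count_in_bucket(std::size_t i) const { return counts_[i]; }
    std::size_t count() const { return total_; }

private:
    std::vector<double> bounds_;
    std::vector<std::size_t> counts_;  // per-bucket counts, not cumulative
    std::size_t total_ = 0;
};
```

With bounds `{0.001, 0.01, 0.1, 1.0}`, the manual `observe(0.042)` call lands in the third bucket (index 2, upper bound 0.1). `steady_clock` is used so that wall-clock adjustments cannot produce negative durations.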

🎨 Dashboard Design Guidelines

Layout Rules

Max 9 panels per dashboard
12-column grid system
Consistent colors:
Green: < 50% (good)
Yellow: 50-80% (warning)
Red: > 80% (critical)
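
The color bands above can be written down as a small lookup, which is handy when the same thresholds need to be applied outside Grafana (for example in a status CLI). This is only an illustrative sketch; `PanelColor` and `color_for_percent` are hypothetical names. The boundary values 50 and 80 fall in the yellow band, following the ranges as listed.

```cpp
// Hypothetical mapping from a percentage to the dashboard color bands.
enum class PanelColor { Green, Yellow, Red };

PanelColor color_for_percent(double percent) {
    if (percent > 80.0) return PanelColor::Red;      // > 80%: critical
    if (percent >= 50.0) return PanelColor::Yellow;  // 50-80%: warning
    return PanelColor::Green;                        // < 50%: good
}
```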

Panel Types

Stat panels: Single current values
Time series: Trends over time
Tables: Detailed breakdowns
Heatmaps: Distribution visualization
Gauges: Percentage with thresholds

Naming Conventions

Dashboard: [System] - [Purpose]
Panel: [Metric] ([Unit])
Query: {service="name", env="prod"}

See DASHBOARD_GUIDELINES.md for details.

🚑 Incident Response

When an Alert Fires

1. Acknowledge in PagerDuty/Opsgenie
2. Check dashboard for context
3. Follow playbook for alert type
4. Escalate if needed (see escalation path)
5. Document in incident channel

Playbooks Available

High Error Rate
High CPU Usage
Crash Detected
High Latency
Buffer Underrun
Slow Plugin Load

See ONCALL_PLAYBOOKS.md for details.

📈 Alert Quality

Prevention Strategies

Deduplication: Group similar alerts
Progressive thresholds: Warning → Critical
Time-based routing: Business hours vs off-hours
Maintenance windows: Auto-silence during deploys
Data-driven thresholds: Based on historical data
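
To make the deduplication strategy concrete, the sketch below groups alerts that share a name and component so that a single notification can cover every affected instance. It assumes grouping by those two fields only; the `Alert` struct, its fields, and `group_alerts` are hypothetical and are not tied to the Alertmanager configuration.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alert record; field names are illustrative only.
struct Alert {
    std::string name;       // e.g. "HighLatency"
    std::string component;  // e.g. "reverb"
    std::string instance;   // e.g. "host-1"
};

using GroupKey = std::pair<std::string, std::string>;  // (name, component)

// Buckets alerts by (name, component), preserving arrival order per group.
std::map<GroupKey, std::vector<Alert>> group_alerts(const std::vector<Alert>& alerts) {
    std::map<GroupKey, std::vector<Alert>> groups;
    for (const Alert& a : alerts) {
        groups[GroupKey(a.name, a.component)].push_back(a);
    }
    return groups;
}
```

Three alerts across two hosts for the same component collapse into two groups here, which is the effect the grouping-rate target in this section is meant to measure.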

Key Metrics

False positive rate: < 5%
Mean time to acknowledge: < 5 min
Alerts per day: < 10
Alert grouping rate: > 80%

See ALERT_FATIGUE_PREVENTION.md for details.

🔗 Integration Points

Prometheus Exporters

# Application metrics (point this at the port where AudioLab exposes /metrics; 9100 is also used by the node job below)
- job_name: 'audiolab'
  static_configs:
    - targets: ['localhost:9100']

# Node metrics
- job_name: 'node'
  static_configs:
    - targets: ['localhost:9100']

# Plugin metrics
- job_name: 'plugins'
  static_configs:
    - targets: ['localhost:9101']

Alertmanager Receivers

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<key>'
        severity: 'critical'

  - name: 'slack'
    slack_configs:
      - api_url: '<webhook>'
        channel: '#alerts'

  - name: 'email'
    email_configs:
      - to: 'team@audiolab.com'

Grafana Data Sources

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    url: http://loki:3100

📊 SLO Tracking

Service Level Objectives

# Availability SLO: 99.9%
# `up` is a 0/1 gauge, so average it over the window instead of using rate()
- alert: AvailabilitySLOViolation
  expr: |
    avg(avg_over_time(up{job="audiolab"}[30d])) < 0.999

# Latency SLO: p99 < 100ms
- alert: LatencySLOViolation
  expr: |
    histogram_quantile(0.99,
      rate(audiolab_latency_seconds_bucket[30d])
    ) > 0.100

# Error budget
- alert: ErrorBudgetBurnRate
  expr: |
    (
      1 - (
        sum(rate(requests_total{status=~"2.."}[1h]))
        /
        sum(rate(requests_total[1h]))
      )
    ) > 0.001  # SLO: 99.9%

🛠️ Maintenance

Daily Tasks

Check alert health dashboard
Review overnight pages
Acknowledge known issues

Weekly Tasks

Review top 10 noisy alerts
Tune false positive alerts
Update runbooks

Monthly Tasks

Alert deep dive meeting
Review dashboard usage
Clean up unused metrics

Quarterly Tasks

Full alert audit
Dashboard consolidation
SLO review

📚 Resources

Documentation

Internal Links

Grafana: https://grafana.audiolab.com
Prometheus: https://prometheus.audiolab.com
Alertmanager: https://alertmanager.audiolab.com
Runbooks: https://docs.audiolab.com/runbooks

On-Call

Schedule: https://pagerduty.com/schedules/audiolab
Escalation: https://pagerduty.com/escalation-policies
Incidents: https://pagerduty.com/incidents

Last Updated: 2024-10-03
Maintainer: AudioLab SRE Team
Version: 1.0