📊 AudioLab Monitoring & Observability
Complete monitoring and observability infrastructure for the AudioLab system.
📁 Structure
03_08_monitoring_observability/
├── 03_08_01_metrics_collection/       # Metrics instrumentation
│   ├── prometheus_config.yml          # Prometheus configuration
│   ├── metrics_registry.hpp           # C++ metrics SDK
│   └── METRICS_GUIDE.md               # Metrics best practices
│
├── 03_08_02_dashboards_templates/     # Grafana dashboards
│   ├── DASHBOARD_GUIDELINES.md        # Dashboard design rules
│   ├── overview_dashboard.json        # System overview
│   └── performance_dashboard.json     # Performance monitoring
│
└── 03_08_03_alerting_rules/           # Alerting configuration
    ├── alert_rules.yaml               # Prometheus alert rules
    ├── ONCALL_PLAYBOOKS.md            # Incident response guides
    └── ALERT_FATIGUE_PREVENTION.md    # Alert quality guidelines
🎯 Quick Start
1. Setup Metrics Collection
# Start Prometheus
docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/03_08_01_metrics_collection/prometheus_config.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus
# Verify that Prometheus is up (this is Prometheus's own metrics endpoint)
curl http://localhost:9090/metrics
2. Import Dashboards
# Import to Grafana
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @03_08_02_dashboards_templates/overview_dashboard.json
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @03_08_02_dashboards_templates/performance_dashboard.json
3. Configure Alerts
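# Validate alert rules (promtool only checks them; Prometheus loads them via rule_files in its config)
promtool check rules 03_08_03_alerting_rules/alert_rules.yaml
# Reload Prometheus config (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload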
📊 Available Dashboards
1. System Overview
File: overview_dashboard.json
- Purpose: High-level system health
- Metrics:
  - System Status (UP/DOWN)
  - CPU Usage (%)
  - Memory Usage (%)
  - Error Rate
  - Active Alerts
  - Disk Usage
  - Network I/O
- Refresh: 10s
- Time Range: Last 15 minutes
2. Performance Monitoring
File: performance_dashboard.json
- Purpose: Detailed performance analysis
- Metrics:
  - Latency Percentiles (p50, p95, p99)
  - Latency Histogram (heatmap)
  - Buffer Underruns
  - Plugin Load Time
  - CPU by Component
  - Throughput (samples/sec)
  - Top 10 Slowest Operations
- Refresh: 5s
- Time Range: Last 1 hour
🚨 Alert Categories
Critical Alerts (Page Immediately)
- ServiceDown - Service is down for > 1 minute
- HighErrorRate - Error rate > 1% for 5 minutes
- CrashDetected - Application crash detected
- CriticalCPUUsage - CPU > 95% for 5 minutes
- CriticalLatency - p99 latency > 100ms for 5 minutes
- DiskSpaceCritical - Disk > 95% full
Warning Alerts (Page During Business Hours)
- HighCPUUsage - CPU > 80% for 10 minutes
- HighMemoryUsage - Memory > 80% for 10 minutes
- HighLatency - p99 latency > 50ms for 10 minutes
- BufferUnderrun - Audio buffer underruns detected
- ErrorRateSpiking - Error rate increased 5x
- SlowPluginLoad - Plugin load time > 500ms
Info Alerts (Log Only, No Page)
- NewVersionDeployed - New version deployed
- RateLimitHit - Rate limit frequently hit
- Watchdog - Monitoring health check (always firing)
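Several of the alerts above pair a threshold with a hold duration, for example "CPU > 95% for 5 minutes". The sketch below illustrates that behaviour: the condition has to stay true continuously for the whole duration, and the wait starts over whenever it clears. `HoldRule` is a hypothetical helper written for this illustration. It mirrors the idea behind the `for:` clause of a Prometheus alert rule but is not how Prometheus implements it.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: reports "firing" once a condition has held
// continuously for at least `hold`. The current time is passed in
// explicitly so the logic is deterministic and easy to test.
class HoldRule {
public:
    explicit HoldRule(std::chrono::seconds hold) : hold_(hold) {
        assert(hold.count() >= 0);
    }

    // Feed one evaluation; returns true when the alert should be firing.
    bool evaluate(std::chrono::seconds now, bool conditionTrue) {
        if (!conditionTrue) {
            pending_ = false;   // condition cleared: start over
            return false;
        }
        if (!pending_) {
            pending_ = true;
            since_ = now;       // first evaluation where the condition held
        }
        return now - since_ >= hold_;
    }

private:
    std::chrono::seconds hold_;
    std::chrono::seconds since_{0};
    bool pending_ = false;
};
```

With `HoldRule criticalCpu(std::chrono::seconds(300))`, a CPU reading above 95% at t=0 and t=240 does not fire, a reading at t=300 does, and a single reading below the threshold resets the wait.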
📖 Key Metrics
Application Metrics
# Error rate
rate(audiolab_errors_total[5m])
# Latency percentiles
histogram_quantile(0.99, rate(audiolab_latency_seconds_bucket[5m]))
# Throughput
rate(audiolab_samples_processed_total[5m])
# CPU by component
sum by (component) (rate(audiolab_cpu_seconds_total[5m]))
# Buffer underruns
increase(audiolab_buffer_underruns_total[5m])
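To make the `histogram_quantile` query above less opaque, the following sketch shows roughly how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It is a simplified illustration, not Prometheus's implementation: `Bucket` and `estimateQuantile` are hypothetical names, and edge cases such as NaN inputs or negative bucket bounds are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// One cumulative bucket: the number of observations <= upperBound.
// (Hypothetical type for illustration; not part of metrics_registry.hpp.)
struct Bucket {
    double upperBound;  // the "le" label; +infinity for the last bucket
    double cumulative;  // cumulative observation count
};

// Estimate quantile q (0..1) from buckets sorted by upperBound,
// where the last entry is the +Inf bucket.
double estimateQuantile(double q, const std::vector<Bucket>& buckets) {
    assert(q >= 0.0 && q <= 1.0);
    assert(buckets.size() >= 2);

    const double total = buckets.back().cumulative;
    if (total == 0.0) return std::numeric_limits<double>::quiet_NaN();

    // Find the first bucket whose cumulative count reaches the target rank.
    const double rank = q * total;
    std::size_t i = 0;
    while (i + 1 < buckets.size() && buckets[i].cumulative < rank) ++i;

    // Quantile falls into the +Inf bucket: report the highest finite bound.
    if (i == buckets.size() - 1) return buckets[i - 1].upperBound;

    const double lower = (i == 0) ? 0.0 : buckets[i - 1].upperBound;
    const double below = (i == 0) ? 0.0 : buckets[i - 1].cumulative;
    const double inBucket = buckets[i].cumulative - below;
    if (inBucket <= 0.0) return lower;

    // Assume observations are spread evenly inside the bucket.
    return lower + (buckets[i].upperBound - lower) * (rank - below) / inBucket;
}
```

Because the result is interpolated inside a bucket, its accuracy depends on the bucket boundaries: a p99 that lands in a wide bucket can be off by a large fraction of that bucket's width.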
System Metrics
# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
🔧 Instrumentation Guide
Adding Metrics in C++
#include "metrics_registry.hpp"
// Define metrics
static auto& cpu_usage = metrics::gauge("audiolab_cpu_usage")
    .help("CPU usage percentage")
    .label("component", "audio_engine")
    .build();
static auto& latency_histogram = metrics::histogram("audiolab_latency_seconds")
    .help("Processing latency in seconds")
    .buckets({0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0})
    .build();
// Record metrics
void processAudio() {
    auto timer = latency_histogram.start_timer();  // recorded when timer is destroyed
    // ... process audio ...
    cpu_usage.set(getCurrentCPUUsage());
}
Counter Pattern
// Errors
static auto& errors = metrics::counter("audiolab_errors_total")
    .label("severity", "error")
    .label("component", "reverb")
    .build();
errors.inc();   // Increment by 1
errors.inc(5);  // Increment by 5
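For readers curious what might sit behind a counter like this, here is a minimal, self-contained sketch of a monotonic, thread-safe counter. It is not the `metrics_registry.hpp` implementation, which is not reproduced in this document. `SketchCounter` is a hypothetical name, and the use of a relaxed `std::atomic` is an assumption chosen because it avoids locks on the audio thread.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a counter metric (illustration only).
class SketchCounter {
public:
    // Counters only ever go up, so negative increments are rejected.
    void inc(long long amount = 1) {
        assert(amount >= 0);
        value_.fetch_add(static_cast<unsigned long long>(amount),
                         std::memory_order_relaxed);
    }

    // Current total; relaxed ordering is enough for a statistic
    // that is only read by the exporter.
    unsigned long long value() const {
        return value_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<unsigned long long> value_{0};
};
```

Calling `inc()` followed by `inc(5)` on a fresh `SketchCounter` leaves `value()` at 6, matching the two calls shown in the pattern above.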
Histogram Pattern
// Latency distribution
static auto& latency = metrics::histogram("audiolab_request_duration_seconds")
    .buckets({0.001, 0.01, 0.1, 1.0})
    .build();
{
    auto timer = latency.start_timer();
    // ... timed operation ...
}  // Auto-recorded on destruction
// Or manual
latency.observe(0.042);  // 42ms
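The scoped-timer behaviour above ("auto-recorded on destruction") can be sketched with a small self-contained class. `SketchHistogram` and its nested `Timer` are hypothetical names used only for illustration. The real SDK type may differ, and this version is deliberately not thread-safe.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a histogram metric (illustration only).
class SketchHistogram {
public:
    explicit SketchHistogram(std::vector<double> upperBounds)
        : bounds_(std::move(upperBounds)), counts_(bounds_.size() + 1, 0) {}

    // Count the observation in the first bucket whose bound is >= value;
    // the extra last slot plays the role of the +Inf bucket.
    void observe(double seconds) {
        assert(seconds >= 0.0);
        std::size_t i = 0;
        while (i < bounds_.size() && seconds > bounds_[i]) ++i;
        ++counts_[i];
    }

    // Observations that fell into buckets 0..i (cumulative, the way
    // Prometheus exposes histogram buckets).
    unsigned long long cumulativeCount(std::size_t i) const {
        unsigned long long total = 0;
        for (std::size_t k = 0; k <= i && k < counts_.size(); ++k) total += counts_[k];
        return total;
    }

    // RAII timer: observes the elapsed time when it goes out of scope.
    class Timer {
    public:
        explicit Timer(SketchHistogram& h)
            : hist_(h), start_(std::chrono::steady_clock::now()) {}
        ~Timer() {
            const std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start_;
            hist_.observe(elapsed.count());
        }
        Timer(const Timer&) = delete;
        Timer& operator=(const Timer&) = delete;

    private:
        SketchHistogram& hist_;
        std::chrono::steady_clock::time_point start_;
    };

private:
    std::vector<double> bounds_;
    std::vector<unsigned long long> counts_;
};
```

With bounds `{0.001, 0.01, 0.1, 1.0}`, `observe(0.042)` lands in the `0.1` bucket, so `cumulativeCount(1)` stays 0 while `cumulativeCount(2)` becomes 1. Declaring `SketchHistogram::Timer t(h);` inside a block records one more observation when the block ends.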
🎨 Dashboard Design Guidelines
Layout Rules
- Max 9 panels per dashboard
- 12-column grid system
- Consistent colors:
  - Green: < 50% (good)
  - Yellow: 50-80% (warning)
  - Red: > 80% (critical)
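The color bands above leave the exact boundaries open to interpretation, so a small helper can make them explicit. The sketch below assumes that exactly 50% and exactly 80% both fall into the yellow band. `thresholdColor` is a hypothetical function, not part of the dashboards or the SDK.

```cpp
#include <cassert>
#include <string>

// Map a utilisation percentage to the color bands listed above.
// Assumption: the boundary values 50 and 80 both count as yellow.
std::string thresholdColor(double percent) {
    assert(percent >= 0.0);
    if (percent < 50.0) return "green";    // good
    if (percent <= 80.0) return "yellow";  // warning
    return "red";                          // critical
}
```

The same boundaries would then be entered as threshold steps in each Grafana panel so that every dashboard colors a given value identically.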
Panel Types
- Stat panels: Single current values
- Time series: Trends over time
- Tables: Detailed breakdowns
- Heatmaps: Distribution visualization
- Gauges: Percentage with thresholds
Naming Conventions
- Dashboard: [System] - [Purpose]
- Panel: [Metric] ([Unit])
- Query: {service="name", env="prod"}
See DASHBOARD_GUIDELINES.md for details.
🚑 Incident Response
When Alert Fires
- Acknowledge in PagerDuty/Opsgenie
- Check dashboard for context
- Follow playbook for alert type
- Escalate if needed (see escalation path)
- Document in incident channel
Playbooks Available
- High Error Rate
- High CPU Usage
- Crash Detected
- High Latency
- Buffer Underrun
- Slow Plugin Load
See ONCALL_PLAYBOOKS.md for details.
📈 Alert Quality
Prevention Strategies
- Deduplication: Group similar alerts
- Progressive thresholds: Warning → Critical
- Time-based routing: Business hours vs off-hours
- Maintenance windows: Auto-silence during deploys
- Data-driven thresholds: Based on historical data
Key Metrics
- False positive rate: < 5%
- Mean time to acknowledge: < 5 min
- Alerts per day: < 10
- Alert grouping rate: > 80%
See ALERT_FATIGUE_PREVENTION.md for details.
🔗 Integration Points
Prometheus Exporters
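# Application metrics
# NOTE: identical to the node target below; set this to the address
# where AudioLab actually serves /metrics
- job_name: 'audiolab'
  static_configs:
    - targets: ['localhost:9100']
# Node metrics (9100 is node_exporter's default port)
- job_name: 'node'
  static_configs:
    - targets: ['localhost:9100']
# Plugin metrics
- job_name: 'plugins'
  static_configs:
    - targets: ['localhost:9101']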
Alertmanager Receivers
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<key>'
        severity: 'critical'
  - name: 'slack'
    slack_configs:
      - api_url: '<webhook>'
        channel: '#alerts'
  - name: 'email'
    email_configs:
      - to: 'team@audiolab.com'
Grafana Data Sources
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
📊 SLO Tracking
Service Level Objectives
# Availability SLO: 99.9%
# `up` is a gauge (0 or 1), so average it over the window instead of using rate()
- alert: AvailabilitySLOViolation
  expr: |
    avg(avg_over_time(up{job="audiolab"}[30d])) < 0.999
# Latency SLO: p99 < 100ms
- alert: LatencySLOViolation
  expr: |
    histogram_quantile(0.99,
      sum by (le) (rate(audiolab_latency_seconds_bucket[30d]))
    ) > 0.100
# Error budget
- alert: ErrorBudgetBurnRate
  expr: |
    (
      1 - (
        sum(rate(requests_total{status=~"2.."}[1h]))
        /
        sum(rate(requests_total[1h]))
      )
    ) > 0.001  # SLO: 99.9%
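The arithmetic behind the error budget can be made concrete with a short sketch. A 99.9% SLO leaves 0.1% of requests (or of time) as budget, and the burn rate is the observed error ratio divided by that budget. The function names below are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Fraction of requests allowed to fail under an SLO (0.999 -> 0.001).
double errorBudget(double slo) {
    assert(slo > 0.0 && slo < 1.0);
    return 1.0 - slo;
}

// Minutes of full downtime the budget allows over a window of `days`.
double allowedDowntimeMinutes(double slo, double days) {
    return errorBudget(slo) * days * 24.0 * 60.0;
}

// Burn rate: how fast the budget is consumed relative to plan.
// 1.0 means the budget lasts exactly one SLO window; 2.0 means it is
// exhausted in half that time.
double burnRate(double observedErrorRatio, double slo) {
    return observedErrorRatio / errorBudget(slo);
}
```

For a 99.9% SLO over 30 days this gives about 43.2 minutes of allowed downtime, and an observed error ratio of 0.2% corresponds to a burn rate of about 2. The `ErrorBudgetBurnRate` expression above fires once the one-hour error ratio exceeds 0.001, which is a burn rate above 1.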
🛠️ Maintenance
Daily Tasks
- Check alert health dashboard
- Review overnight pages
- Acknowledge known issues
Weekly Tasks
- Review top 10 noisy alerts
- Tune false positive alerts
- Update runbooks
Monthly Tasks
- Alert deep dive meeting
- Review dashboard usage
- Clean up unused metrics
Quarterly Tasks
- Full alert audit
- Dashboard consolidation
- SLO review
📚 Resources
Documentation
Internal Links
- Grafana: https://grafana.audiolab.com
- Prometheus: https://prometheus.audiolab.com
- Alertmanager: https://alertmanager.audiolab.com
- Runbooks: https://docs.audiolab.com/runbooks
On-Call
- Schedule: https://pagerduty.com/schedules/audiolab
- Escalation: https://pagerduty.com/escalation-policies
- Incidents: https://pagerduty.com/incidents
Last Updated: 2024-10-03
Maintainer: AudioLab SRE Team
Version: 1.0