📞 On-Call Playbooks¶

🎯 Purpose¶

Quick reference guides for responding to alerts. Follow these steps when paged.

🚨 High Error Rate¶

Symptoms¶

Alert: HighErrorRate
Error rate > 1% for 5+ minutes
Users reporting failures
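
The 1% threshold is the ratio of failed to total requests over the alert window. A minimal sketch of that check, assuming you already have the two counts from logs or metrics; error_rate_exceeded is an illustrative name, not an existing AudioLab tool:

```shell
# Hypothetical helper: succeed when errors/total exceeds a percentage threshold.
# Usage: error_rate_exceeded <errors> <total> [threshold_percent]
error_rate_exceeded() {
  local errors="$1" total="$2" threshold="${3:-1}"
  # No traffic means no measurable error rate.
  [ "$total" -gt 0 ] || return 1
  # errors/total > threshold/100, rearranged to stay in integer math.
  [ $((errors * 100)) -gt $((threshold * total)) ]
}

if error_rate_exceeded 15 1000; then
  echo "error rate above 1%"
fi
```

Integer math avoids floating point in the shell; exactly 1% (10 errors in 1000) does not trip the check because the alert condition is strictly greater than 1%.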

Initial Response (First 5 minutes)¶

Acknowledge alert in PagerDuty/Opsgenie
Check dashboard: https://grafana.audiolab.com/d/audiolab-overview

Identify error type:

# Check recent errors
grep ERROR logs/audiolab-$(date +%Y%m%d).log | tail -100

# Count by error type (assumes the type is the 5th whitespace-separated field)
grep ERROR logs/latest.log | awk '{print $5}' | sort | uniq -c | sort -rn

Diagnosis (5-15 minutes)¶

Error classification:
Network errors: Check external service status
Database errors: Check DB connections/load
File I/O errors: Check disk space
Null pointer/crashes: Recent deployment issue
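
The classification above can be sketched as a small triage function. The keyword patterns are assumptions about what the log lines contain, not the actual AudioLab log format, so adjust them before relying on the result:

```shell
# Hypothetical triage helper: map a log line to the first check to run.
classify_error() {
  local line
  # Lowercase the line so the patterns match regardless of case.
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *timeout*|*"connection refused"*|*dns*)
      echo "network: check external service status" ;;
    *sql*|*database*|*deadlock*)
      echo "database: check DB connections and load" ;;
    *"no space left"*|*"i/o error"*|*enospc*)
      echo "file-io: check disk space" ;;
    *"null pointer"*|*segfault*|*sigsegv*)
      echo "crash: review the most recent deployment" ;;
    *)
      echo "unknown: inspect the log line manually" ;;
  esac
}

classify_error "ERROR db: Database connection pool exhausted"
```

The first matching pattern wins, so order the branches from most to least specific if your log messages overlap.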

Check recent changes:

# Current image and rollout history (latest revision listed last)
kubectl get deployments -o wide
kubectl rollout history deployment/audiolab

# Recent commits
git log --oneline -10

Check dependencies:
Database: run a test query (SELECT 1;)
Redis: redis-cli ping
External APIs: curl health endpoints
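
To run the dependency checks above in one pass, a thin wrapper can turn each probe into a one-line verdict. This is a sketch only: check_dep is a hypothetical name, and the example probes assume client tools and connection settings that your environment may not provide:

```shell
# Hypothetical wrapper: run one dependency probe and print a one-line verdict.
# Usage: check_dep <name> <command> [args...]
check_dep() {
  local name="$1"
  shift
  # Discard the probe's own output; only its exit status matters here.
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Example probes (hosts, credentials, and URLs depend on your environment):
# check_dep database psql -c 'SELECT 1;'
# check_dep redis    redis-cli ping
# check_dep api      curl -fsS --max-time 5 https://api.external.com
```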

Mitigation Strategies¶

Strategy 1: Rollback Deployment¶

When: Errors started after recent deployment

# Rollback to previous version
kubectl rollout undo deployment/audiolab

# Verify
kubectl rollout status deployment/audiolab

Strategy 2: Enable Fallback¶

When: External service failure

# Enable circuit breaker
kubectl set env deployment/audiolab ENABLE_FALLBACK=true

# Or toggle feature flag
curl -X POST https://api.audiolab.com/admin/flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "use_fallback_service", "enabled": true}'

Strategy 3: Hotfix¶

When: Critical bug identified

# Quick fix branch
git checkout -b hotfix/error-spike
# Make the fix, then stage tracked changes and commit
git commit -am "Fix: Handle null pointer in audio processor"
# Deploy
./deploy.sh hotfix

Strategy 4: Disable Feature¶

When: Specific feature causing issues

# Disable problematic feature
kubectl set env deployment/audiolab FEATURE_XYZ_ENABLED=false

Escalation Path¶

15 min: No progress → Page Engineering Lead
30 min: Still broken → Page Engineering Manager
1 hour: Major incident → Activate Incident Response Team
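
The escalation schedule above maps elapsed time to an action. A minimal sketch, assuming you track minutes since the page yourself; escalation_target is an illustrative name:

```shell
# Hypothetical helper: map minutes since the page to the escalation step.
escalation_target() {
  local minutes="$1"
  if   [ "$minutes" -ge 60 ]; then echo "Activate Incident Response Team"
  elif [ "$minutes" -ge 30 ]; then echo "Page Engineering Manager"
  elif [ "$minutes" -ge 15 ]; then echo "Page Engineering Lead"
  else                             echo "Continue investigating"
  fi
}

escalation_target 20   # prints: Page Engineering Lead
```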

Post-Incident¶

Create post-mortem: docs/postmortems/YYYY-MM-DD-error-spike.md
File bugs for root causes
Update runbook with learnings
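
The post-mortem file name follows the pattern above. A small sketch that builds it from today's date; postmortem_path is a hypothetical helper, and the slug is whatever short name you give the incident:

```shell
# Hypothetical helper: build the post-mortem path for an incident slug.
# Usage: postmortem_path <slug> [YYYY-MM-DD]
postmortem_path() {
  local slug="$1" day="${2:-$(date +%Y-%m-%d)}"
  echo "docs/postmortems/${day}-${slug}.md"
}

postmortem_path error-spike 2024-10-03   # prints: docs/postmortems/2024-10-03-error-spike.md
```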

🔥 High CPU Usage¶

Symptoms¶

Alert: HighCPUUsage or CriticalCPUUsage
CPU > 80% for 10+ minutes
System slow/unresponsive

Initial Response (First 2 minutes)¶

Check current CPU:

# Top processes
top -o %CPU

# Or on Windows
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10

Quick dashboard check: CPU graph trends

Diagnosis (2-10 minutes)¶

Identify hot component:

# CPU by component (Prometheus); -G with --data-urlencode stops curl from globbing on [5m]
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(5, rate(audiolab_cpu_seconds_total[5m]))'

# Profile running process
perf top -p $(pidof audiolab)  # Linux
dotnet-trace collect -p $(pidof audiolab)  # .NET

Check for:
Infinite loops (stuck at 100%)
Lock contention (many threads waiting)
Memory thrashing (high CPU + high memory)
External process (virus scan, backup)

Mitigation Strategies¶

Strategy 1: Restart Service¶

When: Temporary spike, no ongoing load

# Graceful restart
kubectl rollout restart deployment/audiolab

# Or force restart on a systemd host (outside Kubernetes)
systemctl restart audiolab

Strategy 2: Scale Horizontally¶

When: High legitimate load

# Increase replicas
kubectl scale deployment/audiolab --replicas=5

# Or use autoscaler
kubectl autoscale deployment/audiolab --cpu-percent=70 --min=3 --max=10
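
To choose a replica count by hand, the Kubernetes Horizontal Pod Autoscaler documentation gives the rule desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A sketch of that arithmetic; desired_replicas is an illustrative name and the inputs are whole-number percentages you read off the dashboard:

```shell
# Hypothetical helper: replicas needed to bring CPU utilization down to a target.
# Usage: desired_replicas <current_replicas> <current_cpu_percent> <target_cpu_percent>
desired_replicas() {
  local current="$1" util="$2" target="$3"
  # Integer ceiling of current * util / target.
  echo $(( (current * util + target - 1) / target ))
}

desired_replicas 3 95 70   # prints: 5
```

With 3 replicas at 95% CPU and a 70% target, the result is 5, which matches the --replicas=5 example above.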

Strategy 3: Rate Limiting¶

When: Traffic spike

# Enable aggressive rate limiting
kubectl set env deployment/audiolab RATE_LIMIT_PER_MIN=100

# Or via nginx
kubectl set env deployment/nginx RATE_LIMIT=10r/s

Strategy 4: Kill Heavy Jobs¶

When: Background job consuming CPU

# Find background jobs
ps aux | grep audiolab | grep background

# Kill specific job (try plain "kill <PID>" first so the job can clean up; use -9 only if it does not exit)
kill -9 <PID>

# Or disable job scheduler
kubectl set env deployment/audiolab SCHEDULER_ENABLED=false

Prevention¶

Set CPU limits: kubectl set resources deployment/audiolab -c=audiolab --limits=cpu=2000m
Implement circuit breakers
Add request timeouts

💥 Crash Detected¶

Symptoms¶

Alert: CrashDetected
Application crashed
Restart loop visible

CRITICAL: First 60 Seconds¶

Preserve crash dump:

# Copy core dumps before they are overwritten (the glob may match several files)
dest=/backup/crashes/$(date +%s) && mkdir -p "$dest" && cp /var/crash/core.audiolab.* "$dest"/

# Or Windows crash dump
copy C:\CrashDumps\audiolab*.dmp D:\Backup\

Get stack trace:

# From core dump (Linux); replace <core-file> with one of the core.audiolab.* files
gdb --batch -ex "bt" /usr/bin/audiolab <core-file>

# From crash log
tail -1000 /var/log/audiolab/crash.log

Check if service recovered:
```
curl http://localhost:8080/health
```
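
A single health request can hit the service mid-restart. A minimal retry sketch for polling the endpoint a few times; retry is a hypothetical helper and the attempt count and delay are arbitrary choices:

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: poll the health endpoint for up to roughly 30 seconds.
# retry 6 5 curl -fsS --max-time 2 http://localhost:8080/health
```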

Diagnosis (1-10 minutes)¶

Analyze stack trace:
Segmentation fault → Null pointer/bad memory access
Assertion failed → Logic bug
Out of memory → Memory leak
Abort signal → Unhandled exception
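
When the only evidence is an exit code, the signal can usually be recovered from it: shells and container runtimes report 128 plus the signal number for a process killed by a signal. A sketch using the standard Linux signal numbers; explain_exit_code is an illustrative name and the cause hints mirror the list above:

```shell
# Hypothetical helper: translate an exit code into the likely crash category.
# Exit codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "clean exit" ;;
    134) echo "SIGABRT (6): abort, failed assertion or unhandled exception" ;;
    137) echo "SIGKILL (9): killed, often by the OOM killer" ;;
    139) echo "SIGSEGV (11): segmentation fault, null pointer or bad memory access" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error code $code"
      fi
      ;;
  esac
}

explain_exit_code 139
```

In Kubernetes the exit code of the last crashed container is available via kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'.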

Check recent changes:

# What changed in the last commit? (compare release tags instead if a deploy spans several commits)
git diff HEAD~1 HEAD

Reproduce locally:

# Run with sanitizers (requires a build compiled with -fsanitize=address)
ASAN_OPTIONS=detect_leaks=1 ./audiolab

# Or with debugger
gdb --args ./audiolab --debug

Mitigation Strategies¶

Strategy 1: Emergency Rollback¶

When: Crash started with new deployment

# Immediate rollback
kubectl rollout undo deployment/audiolab
kubectl rollout status deployment/audiolab

# Verify no more crashes
kubectl logs -f deployment/audiolab | grep -i crash

Strategy 2: Isolate Crashing Component¶

When: Specific feature/plugin crashing

# Disable plugin system
kubectl set env deployment/audiolab PLUGINS_ENABLED=false

# Or disable specific feature
kubectl set env deployment/audiolab FEATURE_REVERB=false

Strategy 3: Increase Resources¶

When: OOM (Out of Memory) crash

# Increase memory limit
kubectl set resources deployment/audiolab --limits=memory=4Gi

# Check memory usage
kubectl top pods -l app=audiolab

Strategy 4: Add Restart Policy¶

When: Temporary crashes, service recovers

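# Deployments always use restartPolicy: Always (the only value they accept),
# so crashed containers are restarted automatically and no patch is needed.

# Add liveness probe (kubectl has no "set probe" subcommand; patch the container spec instead)
kubectl patch deployment audiolab -p '{"spec":{"template":{"spec":{"containers":[{"name":"audiolab","livenessProbe":{"httpGet":{"path":"/health","port":8080},"initialDelaySeconds":30,"periodSeconds":10}}]}}}}'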

Debug Commands¶

# Get crash logs
kubectl logs -l app=audiolab --previous  # Last crashed pod

# Attach debugger to running process
gdb -p $(pidof audiolab)

# Check limits of the running process (ulimit -a only shows the current shell's limits)
cat /proc/$(pidof audiolab)/limits

# Memory map
cat /proc/$(pidof audiolab)/maps

Escalation¶

Immediate: If production is down → Page Engineering Lead
Recurring crashes: Coordinate with Core Team
Can't reproduce: Request a QA environment setup

📈 High Latency¶

Symptoms¶

Alert: HighLatency or CriticalLatency
p99 latency > 50ms (warning) or > 100ms (critical)
Users reporting lag

Initial Response (First 3 minutes)¶

Check latency dashboard: p50, p95, p99 trends

Identify slow component:

# Slowest endpoints; -G with --data-urlencode stops curl from globbing on [5m]
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, histogram_quantile(0.99, rate(audiolab_latency_seconds_bucket[5m])))'

Diagnosis¶

Database slow?

-- PostgreSQL: Slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Network latency?

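# Time each phase of a request to an external service
curl -s -o /dev/null -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' https://api.external.com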

Lock contention?

# Thread dump
kill -QUIT $(pidof audiolab)  # JVM only: prints a thread dump; a native process would be terminated
dotnet-dump collect -p $(pidof audiolab)  # .NET

Mitigation¶

Add caching
Increase worker threads
Add database indexes
Scale horizontally

🔋 Buffer Underrun¶

Symptoms¶

Alert: BufferUnderrun
Audio glitches/dropouts
Clicks and pops in audio

Immediate Actions¶

Increase buffer size:

kubectl set env deployment/audiolab BUFFER_SIZE=2048  # Default: 512

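Raise CPU priority:

# Increase process priority (Linux; requires root or CAP_SYS_NICE)
renice -n -10 -p $(pidof audiolab)

# Or set real-time priority (99 is the maximum and can starve other tasks; use a lower value if it suffices)
chrt -f -p 99 $(pidof audiolab)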
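
A larger buffer trades dropouts for latency. A sketch of that arithmetic, assuming BUFFER_SIZE is counted in audio frames and the stream runs at 48 kHz; neither assumption is confirmed by this playbook, and buffer_latency_ms is an illustrative name:

```shell
# Hypothetical helper: latency added by one buffer, in milliseconds.
# Usage: buffer_latency_ms <frames> [sample_rate_hz]
buffer_latency_ms() {
  local frames="$1" sample_rate="${2:-48000}"
  # frames / sample_rate gives seconds; multiply by 1000 for milliseconds.
  LC_ALL=C awk -v f="$frames" -v sr="$sample_rate" 'BEGIN { printf "%.2f\n", f / sr * 1000 }'
}

buffer_latency_ms 512    # prints: 10.67
buffer_latency_ms 2048   # prints: 42.67
```

Under these assumptions, moving from 512 to 2048 frames adds about 32 ms per buffer, so check the latency dashboards after applying the change.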

Root Causes¶

CPU spike (see the High CPU Usage playbook)
Disk I/O blocking (check disk latency)
Plugin timeout (check plugin metrics)
System interrupts (check /proc/interrupts)

📊 Slow Plugin Load¶

Symptoms¶

Alert: SlowPluginLoad
Plugin load time > 500ms
Slow startup time

Quick Fixes¶

Preload plugins:

kubectl set env deployment/audiolab PRELOAD_PLUGINS=true

Parallel loading:

kubectl set env deployment/audiolab PLUGIN_LOAD_PARALLEL=true

Disable slow plugins:

# Identify slow plugin from metrics
# Disable it
kubectl set env deployment/audiolab DISABLED_PLUGINS=slow-reverb,old-eq
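
Building the DISABLED_PLUGINS value can be scripted once per-plugin load times are available. A sketch that assumes input lines of the form "<plugin-name> <load-time-ms>"; that format and the slow_plugins name are assumptions, and the compressor plugin in the example is made up:

```shell
# Hypothetical helper: list plugins whose load time exceeds a threshold.
# Input:  lines of "<plugin-name> <load-time-ms>" on stdin.
# Output: comma-separated plugin names, ready for DISABLED_PLUGINS.
slow_plugins() {
  local threshold_ms="${1:-500}"
  awk -v t="$threshold_ms" '
    $2 > t { names = names (names ? "," : "") $1 }
    END    { print names }
  '
}

printf '%s\n' 'slow-reverb 1200' 'compressor 80' 'old-eq 640' | slow_plugins
# prints: slow-reverb,old-eq
```

The default threshold of 500 ms matches the SlowPluginLoad alert; pass a different value as the first argument to tighten or relax it.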

🎯 General Response Framework¶

Severity Levels¶

| Severity | Response Time | Escalation | Example |
|----------|---------------|------------|---------|
| Critical | < 5 min | Immediate page | Service down, crashes |
| Warning | < 15 min | After 30 min | High CPU, slow queries |
| Info | < 1 hour | No escalation | Deployment, config change |

Communication Template¶

Incident Update Format:

INCIDENT: [Brief title]
STATUS: [Investigating/Identified/Mitigating/Resolved]
IMPACT: [User-facing impact]
TIMELINE:
  HH:MM - Incident started
  HH:MM - Alert fired
  HH:MM - On-call responded
  HH:MM - Root cause identified
  HH:MM - Mitigation applied
NEXT STEPS: [What's next]
ETA: [Estimated resolution time]
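
To keep updates consistent under pressure, the template can be filled by a small function. This sketch omits the TIMELINE block for brevity and rejects statuses outside the four listed above; incident_update is an illustrative name:

```shell
# Hypothetical helper: print an incident update in the template's field order.
# Usage: incident_update <title> <status> <impact> <next_steps> <eta>
incident_update() {
  local title="$1" status="$2" impact="$3" next_steps="$4" eta="$5"
  # Only the four statuses from the template are accepted.
  case "$status" in
    Investigating|Identified|Mitigating|Resolved) ;;
    *) echo "invalid status: $status" >&2; return 1 ;;
  esac
  printf 'INCIDENT: %s\nSTATUS: %s\nIMPACT: %s\nNEXT STEPS: %s\nETA: %s\n' \
    "$title" "$status" "$impact" "$next_steps" "$eta"
}

incident_update "Error spike" "Mitigating" "Some requests fail" "Rollback in progress" "15 min"
```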

Useful Commands Cheat Sheet¶

# === Logs ===
kubectl logs -f deployment/audiolab --tail=100
journalctl -u audiolab -f
tail -f /var/log/audiolab/error.log

# === Metrics ===
curl http://localhost:9090/metrics
kubectl top pods
docker stats

# === Health ===
curl http://localhost:8080/health
kubectl get pods -w
systemctl status audiolab

# === Quick fixes ===
kubectl rollout undo deployment/audiolab        # Rollback
kubectl scale deployment/audiolab --replicas=3  # Scale
kubectl rollout restart deployment/audiolab     # Restart
kubectl delete pod -l app=audiolab              # Force restart (deletes all pods at once)

# === Debug ===
kubectl exec -it audiolab-xxx -- bash           # Shell into pod
kubectl describe pod audiolab-xxx               # Pod details
kubectl get events --sort-by=.metadata.creationTimestamp  # Recent events

📚 Resources¶

Dashboards: https://grafana.audiolab.com
Logs: https://kibana.audiolab.com
Metrics: https://prometheus.audiolab.com
Runbooks: https://docs.audiolab.com/runbooks
Escalation: https://pagerduty.com/audiolab

Last Updated: 2024-10-03
Maintainer: AudioLab SRE Team
On-Call Schedule: https://pagerduty.com/schedules/audiolab