📞 On-Call Playbooks¶
🎯 Purpose¶
Quick reference guides for responding to alerts. Follow these steps when paged.
🚨 High Error Rate¶
Symptoms¶
- Alert:
HighErrorRate - Error rate > 1% for 5+ minutes
- Users reporting failures
Initial Response (First 5 minutes)¶
- Acknowledge alert in PagerDuty/Opsgenie
- Check dashboard: https://grafana.audiolab.com/d/audiolab-overview
- Identify error type:
Diagnosis (5-15 minutes)¶
- Error classification:
- Network errors: Check external service status
- Database errors: Check DB connections/load
- File I/O errors: Check disk space
-
Null pointer/crashes: Recent deployment issue
-
Check recent changes:
-
Check dependencies:
- Database:
SELECT 1;test query - Redis:
redis-cli ping - External APIs: curl health endpoints
Mitigation Strategies¶
Strategy 1: Rollback Deployment¶
When: Errors started after recent deployment
# Rollback to previous version
kubectl rollout undo deployment/audiolab
# Verify
kubectl rollout status deployment/audiolab
Strategy 2: Enable Fallback¶
When: External service failure
# Enable circuit breaker
kubectl set env deployment/audiolab ENABLE_FALLBACK=true
# Or toggle feature flag
curl -X POST https://api.audiolab.com/admin/flags \
-d '{"flag": "use_fallback_service", "enabled": true}'
Strategy 3: Hotfix¶
When: Critical bug identified
# Quick fix branch
git checkout -b hotfix/error-spike
# Make fix
git commit -m "Fix: Handle null pointer in audio processor"
# Deploy
./deploy.sh hotfix
Strategy 4: Disable Feature¶
When: Specific feature causing issues
Escalation Path¶
- 15 min: No progress → Page Engineering Lead
- 30 min: Still broken → Page Engineering Manager
- 1 hour: Major incident → Activate Incident Response Team
Post-Incident¶
- Create post-mortem:
docs/postmortems/YYYY-MM-DD-error-spike.md - File bugs for root causes
- Update runbook with learnings
🔥 High CPU Usage¶
Symptoms¶
- Alert:
HighCPUUsageorCriticalCPUUsage - CPU > 80% for 10+ minutes
- System slow/unresponsive
Initial Response (First 2 minutes)¶
-
Check current CPU:
-
Quick dashboard check: CPU graph trends
Diagnosis (2-10 minutes)¶
-
Identify hot component:
-
Check for:
- Infinite loops (stuck at 100%)
- Lock contention (many threads waiting)
- Memory thrashing (high CPU + high memory)
- External process (virus scan, backup)
Mitigation Strategies¶
Strategy 1: Restart Service¶
When: Temporary spike, no ongoing load
# Graceful restart
kubectl rollout restart deployment/audiolab
# Or force restart
systemctl restart audiolab
Strategy 2: Scale Horizontally¶
When: High legitimate load
# Increase replicas
kubectl scale deployment/audiolab --replicas=5
# Or use autoscaler
kubectl autoscale deployment/audiolab --cpu-percent=70 --min=3 --max=10
Strategy 3: Rate Limiting¶
When: Traffic spike
# Enable aggressive rate limiting
kubectl set env deployment/audiolab RATE_LIMIT_PER_MIN=100
# Or via nginx
kubectl set env deployment/nginx RATE_LIMIT=10r/s
Strategy 4: Kill Heavy Jobs¶
When: Background job consuming CPU
# Find background jobs
ps aux | grep audiolab | grep background
# Kill specific job
kill -9 <PID>
# Or disable job scheduler
kubectl set env deployment/audiolab SCHEDULER_ENABLED=false
Prevention¶
- Set CPU limits:
kubectl set resources deployment/audiolab -c=audiolab --limits=cpu=2000m - Implement circuit breakers
- Add request timeouts
💥 Crash Detected¶
Symptoms¶
- Alert:
CrashDetected - Application crashed
- Restart loop visible
CRITICAL: First 60 Seconds¶
-
Preserve crash dump:
-
Get stack trace:
-
Check if service recovered:
Diagnosis (1-10 minutes)¶
- Analyze stack trace:
- Segmentation fault → Null pointer/bad memory access
- Assertion failed → Logic bug
- Out of memory → Memory leak
-
Abort signal → Unhandled exception
-
Check recent changes:
-
Reproduce locally:
Mitigation Strategies¶
Strategy 1: Emergency Rollback¶
When: Crash started with new deployment
# Immediate rollback
kubectl rollout undo deployment/audiolab
kubectl rollout status deployment/audiolab
# Verify no more crashes
kubectl logs -f deployment/audiolab | grep -i crash
Strategy 2: Isolate Crashing Component¶
When: Specific feature/plugin crashing
# Disable plugin system
kubectl set env deployment/audiolab PLUGINS_ENABLED=false
# Or disable specific feature
kubectl set env deployment/audiolab FEATURE_REVERB=false
Strategy 3: Increase Resources¶
When: OOM (Out of Memory) crash
# Increase memory limit
kubectl set resources deployment/audiolab --limits=memory=4Gi
# Check memory usage
kubectl top pods -l app=audiolab
Strategy 4: Add Restart Policy¶
When: Temporary crashes, service recovers
# Update restart policy
kubectl patch deployment audiolab -p '{"spec":{"template":{"spec":{"restartPolicy":"Always"}}}}'
# Add liveness probe
kubectl set probe deployment/audiolab --liveness --initial-delay-seconds=30 --period-seconds=10 --get-url=http://localhost:8080/health
Debug Commands¶
# Get crash logs
kubectl logs -l app=audiolab --previous # Last crashed pod
# Attach debugger to running process
gdb -p $(pidof audiolab)
# Check system limits
ulimit -a
# Memory map
cat /proc/$(pidof audiolab)/maps
Escalation¶
- Immediate: If production is down → Page Engineering Lead
- Recurring crashes: Coordinate with Core Team
- Can't reproduce: Need QA Environment setup
📈 High Latency¶
Symptoms¶
- Alert:
HighLatencyorCriticalLatency - p99 latency > 50ms (warning) or > 100ms (critical)
- Users reporting lag
Initial Response (First 3 minutes)¶
- Check latency dashboard: p50, p95, p99 trends
- Identify slow component:
Diagnosis¶
-
Database slow?
-
Network latency?
-
Lock contention?
Mitigation¶
- Add caching
- Increase worker threads
- Add database indexes
- Scale horizontally
🔋 Buffer Underrun¶
Symptoms¶
- Alert:
BufferUnderrun - Audio glitches/dropouts
- Clicks and pops in audio
Immediate Actions¶
-
Increase buffer size:
-
Check CPU priority:
Root Causes¶
- CPU spike (check
HighCPUplaybook) - Disk I/O blocking (check disk latency)
- Plugin timeout (check plugin metrics)
- System interrupts (check
/proc/interrupts)
📊 Slow Plugin Load¶
Symptoms¶
- Alert:
SlowPluginLoad - Plugin load time > 500ms
- Slow startup time
Quick Fixes¶
-
Preload plugins:
-
Parallel loading:
-
Disable slow plugins:
🎯 General Response Framework¶
Severity Levels¶
| Severity | Response Time | Escalation | Example |
|---|---|---|---|
| Critical | < 5 min | Immediate page | Service down, crashes |
| Warning | < 15 min | After 30 min | High CPU, slow queries |
| Info | < 1 hour | No escalation | Deployment, config change |
Communication Template¶
Incident Update Format:
INCIDENT: [Brief title]
STATUS: [Investigating/Identified/Mitigating/Resolved]
IMPACT: [User-facing impact]
TIMELINE:
HH:MM - Incident started
HH:MM - Alert fired
HH:MM - On-call responded
HH:MM - Root cause identified
HH:MM - Mitigation applied
NEXT STEPS: [What's next]
ETA: [Estimated resolution time]
Useful Commands Cheat Sheet¶
# === Logs ===
kubectl logs -f deployment/audiolab --tail=100
journalctl -u audiolab -f
tail -f /var/log/audiolab/error.log
# === Metrics ===
curl http://localhost:9090/metrics
kubectl top pods
docker stats
# === Health ===
curl http://localhost:8080/health
kubectl get pods -w
systemctl status audiolab
# === Quick fixes ===
kubectl rollout undo deployment/audiolab # Rollback
kubectl scale deployment/audiolab --replicas=3 # Scale
kubectl rollout restart deployment/audiolab # Restart
kubectl delete pod -l app=audiolab # Force restart
# === Debug ===
kubectl exec -it audiolab-xxx -- bash # Shell into pod
kubectl describe pod audiolab-xxx # Pod details
kubectl get events --sort-by=.metadata.creationTimestamp # Recent events
📚 Resources¶
- Dashboards: https://grafana.audiolab.com
- Logs: https://kibana.audiolab.com
- Metrics: https://prometheus.audiolab.com
- Runbooks: https://docs.audiolab.com/runbooks
- Escalation: https://pagerduty.com/audiolab
Last Updated: 2024-10-03 Maintainer: AudioLab SRE Team On-Call Schedule: https://pagerduty.com/schedules/audiolab