📞 On-Call Playbooks

🎯 Purpose

Quick reference guides for responding to alerts. Follow these steps when paged.


🚨 High Error Rate

Symptoms

  • Alert: HighErrorRate
  • Error rate > 1% for 5+ minutes
  • Users reporting failures

Initial Response (First 5 minutes)

  1. Acknowledge alert in PagerDuty/Opsgenie
  2. Check dashboard: https://grafana.audiolab.com/d/audiolab-overview
  3. Identify error type:
    # Check recent errors
    grep ERROR logs/audiolab-$(date +%Y%m%d).log | tail -100
    
    # Count by error type
    grep ERROR logs/latest.log | awk '{print $5}' | sort | uniq -c | sort -rn
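
The per-type count above can be wrapped in a small helper so the field number is easy to change when the log layout differs. This is a minimal sketch: `count_errors_by_type` is a hypothetical name, and it assumes whitespace-separated log lines with the error type in field 5, as the awk command above does.

```shell
# Hypothetical helper: count ERROR lines per error type.
# Assumes whitespace-separated fields with the error type in field 5
# (the same assumption as the awk one-liner); pass another field number if needed.
count_errors_by_type() {
  logfile=$1
  field=${2:-5}
  awk -v f="$field" '/ERROR/ { counts[$f]++ }
    END { for (t in counts) printf "%d %s\n", counts[t], t }' "$logfile" |
    sort -k1,1nr -k2,2   # most frequent first, ties sorted by name
}
```

For example, `count_errors_by_type logs/latest.log` prints one `count type` pair per line, most frequent first.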
    

Diagnosis (5-15 minutes)

  1. Error classification:
    • Network errors: Check external service status
    • Database errors: Check DB connections/load
    • File I/O errors: Check disk space
    • Null pointer/crashes: Recent deployment issue

  2. Check recent changes:

    # Last deployment
    kubectl get deployments -o wide
    
    # Recent commits
    git log --oneline -10
    

  3. Check dependencies:
    • Database: SELECT 1; test query
    • Redis: redis-cli ping
    • External APIs: curl health endpoints
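
The dependency checks can be run in one pass with a small wrapper that reports OK or FAIL per dependency. This is a sketch, not an existing tool: `check_dep` is a hypothetical helper, and the `psql` invocation in the comments is an assumption about how the database is reached.

```shell
# Hypothetical helper: run a command quietly and report OK/FAIL for a named dependency.
check_dep() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Example calls (commands and hosts are placeholders for your environment):
#   check_dep redis    redis-cli ping
#   check_dep database psql -c 'SELECT 1;'
#   check_dep api      curl -fsS https://api.external.com
```

The non-zero return on failure lets the calls be chained in a script that stops at the first broken dependency.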

Mitigation Strategies

Strategy 1: Rollback Deployment

When: Errors started after recent deployment

# Rollback to previous version
kubectl rollout undo deployment/audiolab

# Verify
kubectl rollout status deployment/audiolab

Strategy 2: Enable Fallback

When: External service failure

# Enable circuit breaker
kubectl set env deployment/audiolab ENABLE_FALLBACK=true

# Or toggle feature flag
curl -X POST https://api.audiolab.com/admin/flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "use_fallback_service", "enabled": true}'

Strategy 3: Hotfix

When: Critical bug identified

# Quick fix branch
git checkout -b hotfix/error-spike
# Make the fix, then stage and commit the modified files
git commit -am "Fix: Handle null pointer in audio processor"
# Deploy
./deploy.sh hotfix

Strategy 4: Disable Feature

When: Specific feature causing issues

# Disable problematic feature
kubectl set env deployment/audiolab FEATURE_XYZ_ENABLED=false

Escalation Path

  • 15 min: No progress → Page Engineering Lead
  • 30 min: Still broken → Page Engineering Manager
  • 1 hour: Major incident → Activate Incident Response Team
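
The escalation thresholds can be encoded as a lookup so that an incident bot or shell prompt can suggest the next step. A minimal sketch, assuming the elapsed time is given in whole minutes; `escalation_target` is a hypothetical name.

```shell
# Hypothetical helper: map minutes since the page to the escalation step listed above.
escalation_target() {
  minutes=$1
  if [ "$minutes" -ge 60 ]; then
    echo "Activate Incident Response Team"
  elif [ "$minutes" -ge 30 ]; then
    echo "Page Engineering Manager"
  elif [ "$minutes" -ge 15 ]; then
    echo "Page Engineering Lead"
  else
    echo "Continue investigating"
  fi
}
```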

Post-Incident

  1. Create post-mortem: docs/postmortems/YYYY-MM-DD-error-spike.md
  2. File bugs for root causes
  3. Update runbook with learnings

🔥 High CPU Usage

Symptoms

  • Alert: HighCPUUsage or CriticalCPUUsage
  • CPU > 80% for 10+ minutes
  • System slow/unresponsive

Initial Response (First 2 minutes)

  1. Check current CPU:

    # Top processes
    top -o %CPU
    
    # Or on Windows
    Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
    

  2. Quick dashboard check: CPU graph trends

Diagnosis (2-10 minutes)

  1. Identify hot component:

    # CPU by component (Prometheus); -G with --data-urlencode stops curl from globbing on [5m]
    curl -G 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=topk(5, rate(audiolab_cpu_seconds_total[5m]))'
    
    # Profile running process
    perf top -p $(pidof audiolab)  # Linux
    dotnet-trace collect -p $(pidof audiolab)  # .NET
    

  2. Check for:
    • Infinite loops (stuck at 100%)
    • Lock contention (many threads waiting)
    • Memory thrashing (high CPU + high memory)
    • External process (virus scan, backup)

Mitigation Strategies

Strategy 1: Restart Service

When: Temporary spike, no ongoing load

# Graceful restart
kubectl rollout restart deployment/audiolab

# Or, on hosts where the service runs under systemd
systemctl restart audiolab

Strategy 2: Scale Horizontally

When: High legitimate load

# Increase replicas
kubectl scale deployment/audiolab --replicas=5

# Or use autoscaler
kubectl autoscale deployment/audiolab --cpu-percent=70 --min=3 --max=10
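
To pick a replica count by hand, the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler can serve as a guide: desired = ceil(current replicas × current utilization / target utilization). The sketch below applies that rule with integer arithmetic and clamps the result; `desired_replicas` is a hypothetical helper, and the real autoscaler also applies a tolerance band and stabilization windows that are not modeled here.

```shell
# Hypothetical helper: estimate replicas from the HPA rule
#   desired = ceil(current * cpu_now / cpu_target), clamped to [min, max].
# Arguments: current replicas, current CPU %, target CPU %, optional min and max.
desired_replicas() {
  current=$1
  cpu_now=$2
  cpu_target=$3
  min=${4:-1}
  max=${5:-100}
  # Integer ceiling division
  want=$(( (current * cpu_now + cpu_target - 1) / cpu_target ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}
```

With 3 replicas at 90% CPU and a 70% target, `desired_replicas 3 90 70 3 10` suggests 4 replicas.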

Strategy 3: Rate Limiting

When: Traffic spike

# Enable aggressive rate limiting
kubectl set env deployment/audiolab RATE_LIMIT_PER_MIN=100

# Or via nginx
kubectl set env deployment/nginx RATE_LIMIT=10r/s

Strategy 4: Kill Heavy Jobs

When: Background job consuming CPU

# Find background jobs
ps aux | grep audiolab | grep background

# Kill specific job
kill -9 <PID>

# Or disable job scheduler
kubectl set env deployment/audiolab SCHEDULER_ENABLED=false
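
`kill -9` gives a job no chance to flush files or release locks. A gentler alternative is to send SIGTERM first and fall back to SIGKILL only if the process is still alive after a grace period. The sketch below shows that pattern; `stop_gracefully` is a hypothetical helper and the 10-second default is an arbitrary assumption.

```shell
# Hypothetical helper: SIGTERM first, SIGKILL only after a grace period (seconds).
stop_gracefully() {
  pid=$1
  grace=${2:-10}
  # If the signal cannot be delivered, the process is already gone.
  kill -TERM "$pid" 2>/dev/null || return 0
  i=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$i" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null || true
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 0
}
```

Usage: `stop_gracefully <PID> 15` waits up to 15 seconds before forcing the kill.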

Prevention

  • Set CPU limits: kubectl set resources deployment/audiolab -c=audiolab --limits=cpu=2000m
  • Implement circuit breakers
  • Add request timeouts

💥 Crash Detected

Symptoms

  • Alert: CrashDetected
  • Application crashed
  • Restart loop visible

CRITICAL: First 60 Seconds

  1. Preserve crash dump:

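    # Copy core dumps into a timestamped directory before they're overwritten
    # (a directory target also works when the glob matches several files)
    dest=/backup/crashes/$(date +%s)
    mkdir -p "$dest" && cp /var/crash/core.audiolab.* "$dest"/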
    
    # Or Windows crash dump
    copy C:\CrashDumps\audiolab*.dmp D:\Backup\
    

  2. Get stack trace:

    # From core dump (Linux)
    gdb /usr/bin/audiolab /var/crash/core.audiolab -ex "bt" -ex "quit"
    
    # From crash log
    tail -1000 /var/log/audiolab/crash.log
    

  3. Check if service recovered:

    curl http://localhost:8080/health
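
A single health check can miss a service that is still starting up or that crashes again seconds later. One option is to poll the endpoint a few times before declaring recovery. The sketch below is a generic retry wrapper; `retry` is a hypothetical helper, and the attempt count and delay are left to the caller.

```shell
# Hypothetical helper: run a command up to N times, sleeping between failed attempts.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}
```

For example, `retry 10 3 curl -fsS http://localhost:8080/health` succeeds as soon as one of ten checks, three seconds apart, returns a successful HTTP status.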
    

Diagnosis (1-10 minutes)

  1. Analyze stack trace:
    • Segmentation fault → Null pointer/bad memory access
    • Assertion failed → Logic bug
    • Out of memory → Memory leak
    • Abort signal → Unhandled exception

  2. Check recent changes:

    # What changed in last deploy?
    git diff HEAD~1 HEAD
    

  3. Reproduce locally:

    # Run with sanitizers (requires a build compiled with -fsanitize=address)
    ASAN_OPTIONS=detect_leaks=1 ./audiolab
    
    # Or with debugger
    gdb --args ./audiolab --debug
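
When no stack trace is available yet, the exit code of the crashed process can hint at which category in the stack-trace list applies. By shell convention, a process killed by signal N reports exit code 128+N. The sketch below maps the common cases; `classify_exit_code` is a hypothetical helper, and the OOM interpretation of SIGKILL is a likely cause rather than a certainty.

```shell
# Hypothetical helper: interpret a crashed process's exit code (128 + signal number).
classify_exit_code() {
  case $1 in
    139) echo "SIGSEGV: null pointer or bad memory access" ;;
    134) echo "SIGABRT: failed assertion or unhandled exception" ;;
    137) echo "SIGKILL: possibly OOM-killed, check memory usage" ;;
    *)   echo "exit code $1: inspect logs and stack trace" ;;
  esac
}
```

On Kubernetes the code of the last crash can usually be read with `kubectl get pod audiolab-xxx -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`, where `audiolab-xxx` stands for the actual pod name.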
    

Mitigation Strategies

Strategy 1: Emergency Rollback

When: Crash started with new deployment

# Immediate rollback
kubectl rollout undo deployment/audiolab
kubectl rollout status deployment/audiolab

# Verify no more crashes
kubectl logs -f deployment/audiolab | grep -i crash

Strategy 2: Isolate Crashing Component

When: Specific feature/plugin crashing

# Disable plugin system
kubectl set env deployment/audiolab PLUGINS_ENABLED=false

# Or disable specific feature
kubectl set env deployment/audiolab FEATURE_REVERB=false

Strategy 3: Increase Resources

When: OOM (Out of Memory) crash

# Increase memory limit
kubectl set resources deployment/audiolab --limits=memory=4Gi

# Check memory usage
kubectl top pods -l app=audiolab

Strategy 4: Add Restart Policy

When: Temporary crashes, service recovers

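# Ensure restart policy is Always (the only value a Deployment accepts, and already the default)
kubectl patch deployment audiolab -p '{"spec":{"template":{"spec":{"restartPolicy":"Always"}}}}'

# Add liveness probe (kubectl has no "set probe" subcommand, so patch the container spec;
# assumes the container is named "audiolab")
kubectl patch deployment audiolab -p '{"spec":{"template":{"spec":{"containers":[{"name":"audiolab","livenessProbe":{"httpGet":{"path":"/health","port":8080},"initialDelaySeconds":30,"periodSeconds":10}}]}}}}'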

Debug Commands

# Get crash logs
kubectl logs -l app=audiolab --previous  # Last crashed pod

# Attach debugger to running process
gdb -p $(pidof audiolab)

# Check system limits
ulimit -a

# Memory map
cat /proc/$(pidof audiolab)/maps

Escalation

  • Immediate: If production is down → Page Engineering Lead
  • Recurring crashes: Coordinate with Core Team
  • Can't reproduce: Request a QA environment setup

📈 High Latency

Symptoms

  • Alert: HighLatency or CriticalLatency
  • p99 latency > 50ms (warning) or > 100ms (critical)
  • Users reporting lag

Initial Response (First 3 minutes)

  1. Check latency dashboard: p50, p95, p99 trends
  2. Identify slow component:
    # Slowest endpoints; -G with --data-urlencode stops curl from globbing on [5m]
    curl -G 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=topk(10, histogram_quantile(0.99, rate(audiolab_latency_seconds_bucket[5m])))'
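
If Prometheus is unreachable, a rough p50/p95/p99 can still be computed from raw latency samples, for example values extracted from access logs. The sketch below uses the nearest-rank method (the value at position ceil(p × N / 100) in the sorted list); `percentile` is a hypothetical helper that expects one number per line on stdin and an integer percentile.

```shell
# Hypothetical helper: nearest-rank percentile of numbers read from stdin.
# Usage: percentile <p>   where p is an integer from 1 to 100
percentile() {
  p=$1
  sort -n | awk -v p="$p" '{ v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

This differs from `histogram_quantile`, which interpolates within histogram buckets, so the two figures will not match exactly.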
    

Diagnosis

  1. Database slow?

    -- PostgreSQL: Slow queries
    SELECT query, mean_exec_time, calls
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;
    

  2. Network latency?

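    # Time a request to an external service (DNS, connect, first byte, total)
    curl -s -o /dev/null \
      -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
      https://api.external.com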
    

  3. Lock contention?

    # Thread dump
    kill -SIGQUIT $(pidof audiolab)  # Java
    dotnet-dump collect -p $(pidof audiolab)  # .NET
    

Mitigation

  • Add caching
  • Increase worker threads
  • Add database indexes
  • Scale horizontally

🔋 Buffer Underrun

Symptoms

  • Alert: BufferUnderrun
  • Audio glitches/dropouts
  • Clicks and pops in audio

Immediate Actions

  1. Increase buffer size:

    kubectl set env deployment/audiolab BUFFER_SIZE=2048  # Default: 512
    

  2. Check CPU priority:

    # Increase process priority (Linux)
    renice -n -10 -p $(pidof audiolab)
    
    # Or set real-time priority
    chrt -f -p 99 $(pidof audiolab)
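
A larger buffer trades dropouts for latency: each buffer adds roughly buffer size ÷ sample rate seconds of delay. The sketch below computes that figure so the change in step 1 can be weighed against the High Latency thresholds. It assumes `BUFFER_SIZE` is measured in sample frames and a 48 kHz sample rate, neither of which is stated in this playbook; `buffer_latency_ms` is a hypothetical helper.

```shell
# Hypothetical helper: latency added by one audio buffer, in milliseconds.
# Assumes the buffer size is in sample frames; sample rate defaults to 48000 Hz.
buffer_latency_ms() {
  awk -v n="$1" -v sr="${2:-48000}" 'BEGIN { printf "%.2f\n", n * 1000 / sr }'
}
```

Under those assumptions the default of 512 frames adds about 10.67 ms per buffer, while 2048 frames adds about 42.67 ms.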
    

Root Causes

  • CPU spike (see the High CPU Usage playbook)
  • Disk I/O blocking (check disk latency)
  • Plugin timeout (check plugin metrics)
  • System interrupts (check /proc/interrupts)

📊 Slow Plugin Load

Symptoms

  • Alert: SlowPluginLoad
  • Plugin load time > 500ms
  • Slow startup time

Quick Fixes

  1. Preload plugins:

    kubectl set env deployment/audiolab PRELOAD_PLUGINS=true
    

  2. Parallel loading:

    kubectl set env deployment/audiolab PLUGIN_LOAD_PARALLEL=true
    

  3. Disable slow plugins:

    # Identify slow plugin from metrics
    # Disable it
    kubectl set env deployment/audiolab DISABLED_PLUGINS=slow-reverb,old-eq
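
The `DISABLED_PLUGINS` value can be derived from measured load times instead of being typed by hand. The sketch below assumes the metrics have been exported as lines of `plugin-name load-time-ms`, which is a hypothetical format, and `slow_plugins` is a hypothetical helper; the 500 ms default matches the alert threshold above.

```shell
# Hypothetical helper: read "name milliseconds" lines from stdin and print a
# comma-separated list of plugins slower than the threshold (default 500 ms).
slow_plugins() {
  awk -v t="${1:-500}" '$2 > t { printf "%s%s", sep, $1; sep = "," }
    END { print "" }'
}
```

One possible use is `kubectl set env deployment/audiolab DISABLED_PLUGINS="$(slow_plugins 500 < plugin-times.txt)"`, where `plugin-times.txt` is a placeholder for wherever the timings were saved.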
    


🎯 General Response Framework

Severity Levels

Severity   Response Time   Escalation       Example
Critical   < 5 min         Immediate page   Service down, crashes
Warning    < 15 min        After 30 min     High CPU, slow queries
Info       < 1 hour        No escalation    Deployment, config change

Communication Template

Incident Update Format:

INCIDENT: [Brief title]
STATUS: [Investigating/Identified/Mitigating/Resolved]
IMPACT: [User-facing impact]
TIMELINE:
  HH:MM - Incident started
  HH:MM - Alert fired
  HH:MM - On-call responded
  HH:MM - Root cause identified
  HH:MM - Mitigation applied
NEXT STEPS: [What's next]
ETA: [Estimated resolution time]

Useful Commands Cheat Sheet

# === Logs ===
kubectl logs -f deployment/audiolab --tail=100
journalctl -u audiolab -f
tail -f /var/log/audiolab/error.log

# === Metrics ===
curl http://localhost:9090/metrics
kubectl top pods
docker stats

# === Health ===
curl http://localhost:8080/health
kubectl get pods -w
systemctl status audiolab

# === Quick fixes ===
kubectl rollout undo deployment/audiolab        # Rollback
kubectl scale deployment/audiolab --replicas=3  # Scale
kubectl rollout restart deployment/audiolab     # Restart
kubectl delete pod -l app=audiolab              # Force restart

# === Debug ===
kubectl exec -it audiolab-xxx -- bash           # Shell into pod
kubectl describe pod audiolab-xxx               # Pod details
kubectl get events --sort-by=.metadata.creationTimestamp  # Recent events

📚 Resources


Last Updated: 2024-10-03
Maintainer: AudioLab SRE Team
On-Call Schedule: https://pagerduty.com/schedules/audiolab