💰 Cost Optimization¶
📊 Where Costs Come From¶
Cost Breakdown¶
╔═══════════════════════════════════════════════════════════════╗
║ Component │ % of Total │ Monthly Cost │ Optimization ║
╠══════════════════╪════════════╪══════════════╪═══════════════╣
║ Storage │ 40% │ $200-800 │ Retention ║
║ Ingestion │ 30% │ $150-600 │ Sampling ║
║ Queries │ 20% │ $100-400 │ Caching ║
║ Alerting │ 10% │ $50-200 │ Grouping ║
╠══════════════════╪════════════╪══════════════╪═══════════════╣
║ TOTAL │ 100% │ $500-2000 │ ║
╚═══════════════════════════════════════════════════════════════╝
Cost Drivers¶
High cardinality:
- User IDs as labels → millions of time series
- Session IDs as labels → unbounded growth
- Timestamps as labels → infinite cardinality

High frequency:
- Logging every audio buffer → 1M events/sec
- Per-sample metrics → 48,000 metrics/sec @ 48kHz
- Unsampled high-volume events

Long retention:
- 90 days raw data → 3x storage cost vs 30 days
- No aggregation → storing every data point
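To get a feel for how quickly these drivers compound, here is a back-of-the-envelope calculation in Python using the illustrative numbers from the list above; none of these are measurements.

# Illustrative figures from the cost drivers above, not measurements
series_from_user_ids = 1_000_000                 # one time series per user_id label value
audio_samples_per_day = 48_000 * 60 * 60 * 24    # per-sample metrics at 48 kHz
raw_retention_days = 90

stored_samples = audio_samples_per_day * raw_retention_days
print(f"{series_from_user_ids:,} time series from a single user_id label")
print(f"{audio_samples_per_day:,} samples/day from per-sample metrics")
print(f"{stored_samples:,} samples kept at {raw_retention_days}-day raw retention")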
🎯 Optimization Strategies¶
1. Sampling (99% Cost Reduction)¶
Before:
// Log every request (1M requests/day)
void processRequest() {
logger.info("Request processed");
metrics.recordRequest();
}
// Cost: $500/month for 30M datapoints
After:
// Sample 1% of requests (10K/day)
void processRequest() {
if (rand() % 100 == 0) { // 1% sample rate
logger.info("Request processed");
metrics.recordRequest();
}
}
// Cost: $5/month for 300K datapoints
// Savings: $495/month (99%)
Smart Sampling:
// Sample based on importance
float getSampleRate(const Event& event) {
    if (event.level == ERROR) return 1.0f;   // 100% of errors
    if (event.level == WARN)  return 0.1f;   // 10% of warnings
    if (event.level == INFO)  return 0.01f;  // 1% of info
    return 0.001f;                           // 0.1% of debug
}

void logEvent(const Event& event) {
    float sample_rate = getSampleRate(event);
    if ((float)rand() / RAND_MAX < sample_rate) {
        logger.info(event.toString());
    }
}
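One refinement to consider: with rand()-based sampling each event is kept or dropped independently, so a sampled request may end up with only some of its log lines. Hashing a stable identifier instead keeps or drops all events for a request together. A minimal Python sketch of the idea, assuming a trace_id (or request ID) is available on every event:

import hashlib

def should_sample(trace_id: str, sample_rate: float) -> bool:
    # Hash the ID into [0, 1); the same ID always lands in the same bucket,
    # so every event carrying that ID gets the same keep/drop decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Example: keep ~1% of requests, but keep their events completely
if should_sample("req-7f3a9c", 0.01):
    print("log every event for this request")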
2. Retention Tuning (70% Storage Reduction)¶
Before:
# Flat retention: keep everything for 90 days
retention:
  - default: 90d
# Storage: 300 GB
# Cost: $900/month
After:
# Tiered retention
retention:
  - selector: '{__name__=~"audio_.*"}'
    time: 7d    # High-frequency audio metrics
  - selector: '{__name__=~"error_.*|crash_.*"}'
    time: 90d   # Critical errors
  - selector: '{job="aggregated"}'
    time: 365d  # Pre-aggregated summaries
  - default: 30d  # Everything else
# Storage: 90 GB
# Cost: $270/month
# Savings: $630/month (70%)
Retention Strategy:
┌─────────────────────────────────────────────┐
│ Data Type │ Raw │ Aggregated │ Total │
├──────────────┼────────┼────────────┼───────┤
│ Audio metrics│ 7 days │ 30 days │ 30d │
│ Errors │ 90 days│ 1 year │ 1y │
│ Performance │ 30 days│ 90 days │ 90d │
│ Usage │ 30 days│ 6 months │ 6m │
└─────────────────────────────────────────────┘
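The storage impact of a tiered policy can be estimated before rolling it out. A small Python sketch, using hypothetical raw-ingest rates per data type (substitute your own measurements):

# Hypothetical raw-ingest rates in GB/day per data type
gb_per_day = {"audio": 2.5, "errors": 0.3, "performance": 0.3, "usage": 0.2}
raw_days   = {"audio": 7,   "errors": 90,  "performance": 30,  "usage": 30}
FLAT_DAYS  = 90  # the old "keep everything for 90 days" policy

flat   = sum(rate * FLAT_DAYS for rate in gb_per_day.values())
tiered = sum(gb_per_day[k] * raw_days[k] for k in gb_per_day)

print(f"Flat {FLAT_DAYS}d retention: {flat:.0f} GB")
print(f"Tiered retention: {tiered:.0f} GB ({(1 - tiered / flat) * 100:.0f}% less)")

Longer-lived pre-aggregated rollups add back only a small fraction of the raw volume.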
3. Metric Cardinality Reduction (95% Cost Reduction)¶
Before:
// ❌ BAD: user_id as label (1M unique users)
metrics.counter("login_count")
.labels({{"user_id", userId}});
// Creates 1,000,000 time series
// Cost: $5000/month
After:
// ✅ GOOD: user_tier as label (3 tiers)
metrics.counter("login_count")
.labels({{"user_tier", getUserTier(userId)}});
// Creates 3 time series
// Cost: $250/month
// Savings: $4750/month (95%)
Cardinality Analysis:
// Calculate cardinality impact
struct MetricCardinality {
std::string name;
size_t unique_series;
float cost_per_month;
};
void analyzeCardinality() {
// user_id label: 1M users
// session_id label: 10M sessions
// endpoint label: 50 endpoints
//
// Total series = 1M × 10M × 50 = 500 trillion! 💥
// NEVER DO THIS
// Instead:
// user_tier: 3 (free, pro, enterprise)
// request_type: 10 (GET, POST, etc.)
// endpoint_category: 5 (api, web, admin, etc.)
//
// Total series = 3 × 10 × 5 = 150 ✅
}
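A practical way to enforce these bounds is to normalize label values where metrics are recorded, so unbounded identifiers never reach the metrics pipeline. A minimal Python sketch; the tier names and endpoint categories mirror the hypothetical ones in the comments above:

ALLOWED_TIERS = {"free", "pro", "enterprise"}

def bounded_labels(user_tier: str, endpoint: str) -> dict:
    # Collapse anything unexpected into a fixed fallback so cardinality stays bounded
    tier = user_tier if user_tier in ALLOWED_TIERS else "unknown"
    if endpoint.startswith("/api/"):
        category = "api"
    elif endpoint.startswith("/admin/"):
        category = "admin"
    else:
        category = "web"
    # Note: user_id is deliberately absent; put it in logs/traces, not metric labels
    return {"user_tier": tier, "endpoint_category": category}

print(bounded_labels("pro", "/api/v1/render"))  # {'user_tier': 'pro', 'endpoint_category': 'api'}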
4. Log Compression (90% Storage Reduction)¶
Raw logs:
2024-01-01 12:00:00 [INFO] Processing request from user 12345
2024-01-01 12:00:01 [INFO] Processing request from user 12346
...
Size: 1 GB/day
Cost: $30/day = $900/month
Compressed logs:
# Enable compression
gzip logs/audiolab.log
# Compressed size
Size: 100 MB/day
Cost: $3/day = $90/month
Savings: $810/month (90%)
Structured logging + compression:
{"t":1704110400,"l":"INFO","m":"req_proc","u":12345}
{"t":1704110401,"l":"INFO","m":"req_proc","u":12346}
// Even better compression ratio
Size: 50 MB/day
Cost: $1.50/day = $45/month
Savings: $855/month (95%)
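Writing the structured records straight into a gzip stream avoids a separate compression pass. A minimal Python sketch using the same short field names as above; the file path is illustrative:

import gzip
import json
import time

# Append compact JSON-lines records directly to a gzip-compressed file
with gzip.open("audiolab.log.jsonl.gz", "at", encoding="utf-8") as log:
    record = {"t": int(time.time()), "l": "INFO", "m": "req_proc", "u": 12345}
    log.write(json.dumps(record, separators=(",", ":")) + "\n")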
5. Query Optimization (80% Query Cost Reduction)¶
Before:
# Expensive: Full table scan
sum(rate(requests_total[5m]))
# Cost: $0.10 per query
# 10,000 queries/month = $1000/month
After:
# Pre-compute with recording rules
groups:
  - name: request_rates
    interval: 30s
    rules:
      - record: requests_rate_5m
        expr: sum(rate(requests_total[5m]))
# Then query pre-computed metric
requests_rate_5m
# Cost: $0.02 per query
# 10,000 queries/month = $200/month
# Savings: $800/month (80%)
6. Batch Processing (50% Ingestion Cost Reduction)¶
Before:
// Send each metric individually
void recordMetric(const std::string& name, double value) {
http.post("/api/metrics", {name, value});
// 1000 requests/sec = high API cost
}
After:
// Batch metrics together
class MetricsBatcher {
std::vector<Metric> batch_;
public:
void add(const std::string& name, double value) {
batch_.push_back({name, value});
if (batch_.size() >= 100) {
flush();
}
}
void flush() {
http.post("/api/metrics/batch", batch_);
batch_.clear();
}
};
// 10 requests/sec instead of 1000
// Cost reduction: 50%
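One caveat with a purely size-based batcher: during quiet periods metrics can sit in the buffer indefinitely. A Python sketch of a batcher that also flushes on age; send_batch is a placeholder for the real HTTP call:

import time

class TimedMetricsBatcher:
    def __init__(self, send_batch, max_size=100, max_age_s=5.0):
        self.send_batch = send_batch      # callable that ships one batch, e.g. an HTTP POST
        self.max_size = max_size
        self.max_age_s = max_age_s
        self.batch = []
        self.started = time.monotonic()

    def add(self, name, value):
        self.batch.append((name, value))
        too_big = len(self.batch) >= self.max_size
        too_old = time.monotonic() - self.started >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.batch:
            self.send_batch(self.batch)
        self.batch = []
        self.started = time.monotonic()

batcher = TimedMetricsBatcher(send_batch=print)
batcher.add("audio_latency_ms", 4.2)
batcher.flush()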
7. Alert Grouping (90% Alert Cost Reduction)¶
Before:
# Alert on every instance
- alert: HighCPU
  expr: cpu_usage > 80
# Fires for each of 100 instances
# = 100 alerts
# Cost: $100/month
After:
# Alert on aggregate
- alert: HighCPU
  expr: avg(cpu_usage) > 80
# Fires once for cluster
# = 1 alert
# Cost: $10/month
# Savings: $90/month (90%)
💡 Cost Targets¶
Budget Allocation¶
╔══════════════════════════════════════════════════════════╗
║ Company Size │ Revenue │ Observability Budget ║
╠══════════════════╪══════════════╪═══════════════════════╣
║ Startup │ < $1M │ $500-1000/month ║
║ Scale-up │ $1M-10M │ $1000-5000/month ║
║ Enterprise │ $10M+ │ $5000-20000/month ║
╚══════════════════════════════════════════════════════════╝
Target: < 5% of revenue on observability
ROI Calculation¶
Cost Avoidance from Observability:
1. Faster incident resolution
- 30 min → 5 min MTTR
- Savings: 25 min × $200/hr = $83/incident
- 10 incidents/month = $830/month
2. Prevent outages
- Early warning → prevent 1 outage/month
- Outage cost: $10,000
- Savings: $10,000/month
3. Optimize performance
- Find bottleneck → 20% speedup
- Save 20% infrastructure = $2000/month
Total Savings: $12,830/month
Observability Cost: $1,000/month
ROI: 12.8x
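The same arithmetic as a short script, so the ROI can be recomputed with your own figures:

incident_savings = 83 * 10      # $83 saved per incident (25 min × $200/hr), 10 incidents/month
outage_savings   = 10_000       # one prevented outage per month
infra_savings    = 2_000        # 20% infrastructure saving found via profiling
total_savings    = incident_savings + outage_savings + infra_savings

observability_cost = 1_000
print(f"Total savings: ${total_savings:,}/month")          # $12,830/month
print(f"ROI: {total_savings / observability_cost:.1f}x")   # 12.8x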
📉 Cost Reduction Checklist¶
Immediate (Do Now)¶
□ Enable sampling for high-volume logs (1-10%)
□ Compress log files (gzip)
□ Reduce metric cardinality (review labels)
□ Set retention to 30 days (was 90)
□ Enable query caching
□ Batch metric submissions
Short-term (This Week)¶
□ Implement recording rules for common queries
□ Set up tiered retention (7d/30d/90d)
□ Remove unused metrics
□ Consolidate duplicate dashboards
□ Optimize alert grouping
□ Review and remove zombie time series
Long-term (This Month)¶
□ Migrate to self-hosted Prometheus (if on SaaS)
□ Implement adaptive sampling
□ Build cost monitoring dashboard
□ Set up automated cost alerts
□ Review vendor contracts
□ Evaluate alternative providers
📊 Cost Monitoring¶
Track Spend Over Time¶
# Cost per million samples
cost_per_million = total_spend / (samples_ingested / 1000000)
# Trend analysis
increase(observability_cost[30d])
# Budget utilization
(current_spend / budget) * 100
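Two of these as plain Python helpers, handy for a quick check outside the query engine; the example values are made up:

def cost_per_million(total_spend: float, samples_ingested: int) -> float:
    return total_spend / (samples_ingested / 1_000_000)

def budget_utilization(current_spend: float, budget: float) -> float:
    return current_spend / budget * 100

# Example: $500 spent on 120M ingested samples against a $1,000 monthly budget
print(f"${cost_per_million(500, 120_000_000):.2f} per million samples")  # $4.17
print(f"{budget_utilization(500, 1_000):.0f}% of budget used")           # 50%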
Cost Dashboard¶
panels:
  - title: "Daily Spend"
    query: sum(increase(cost_total[1d]))
  - title: "Cost by Component"
    query: sum(cost_total) by (component)
  - title: "Samples per Dollar"
    query: sum(samples_ingested) / sum(cost_total)
  - title: "Top 10 Expensive Metrics"
    query: topk(10, sum(cost_total) by (metric_name))
Automated Alerts¶
- alert: CostBudgetExceeded
  expr: sum(increase(cost_total[1d])) * 30 > budget_monthly
  annotations:
    summary: "Monthly budget will be exceeded"
- alert: UnexpectedCostSpike
  expr: rate(cost_total[1h]) > rate(cost_total[1d] offset 1d) * 1.5
  annotations:
    summary: "Cost increased 50% compared to yesterday"
🔧 Cost Optimization Tools¶
1. Cardinality Explorer¶
# Find high-cardinality metrics
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
  jq -r '.data[]' | \
  while read -r metric; do
    # URL-encode the query so the braces and quotes survive the request
    count=$(curl -sG 'http://localhost:9090/api/v1/query' \
      --data-urlencode "query=count({__name__=\"$metric\"})" | \
      jq -r '.data.result[0].value[1]')
    echo "$metric: $count series"
  done | sort -t: -k2 -nr | head -20
2. Storage Analyzer¶
#!/usr/bin/env python3
import requests

def analyze_storage():
    # Get all metric names
    metrics = requests.get('http://localhost:9090/api/v1/label/__name__/values').json()['data']
    total_size = 0
    for metric in metrics:
        # Get the series count for this metric
        query = f'count({{__name__="{metric}"}})'
        result = requests.get('http://localhost:9090/api/v1/query', params={'query': query}).json()
        if result['data']['result']:
            series = int(result['data']['result'][0]['value'][1])
            size_mb = series * 0.01  # Rough estimate: ~10 KB per series
            total_size += size_mb
            print(f"{metric}: {series} series, ~{size_mb:.2f} MB")
    print(f"\nTotal estimated storage: {total_size:.2f} MB")

analyze_storage()
3. Cost Simulator¶
def simulate_cost(samples_per_day, retention_days, cost_per_million=5):
    total_samples = samples_per_day * retention_days
    total_cost = (total_samples / 1_000_000) * cost_per_million
    print(f"Samples/day: {samples_per_day:,}")
    print(f"Retention: {retention_days} days")
    print(f"Total samples: {total_samples:,}")
    print(f"Monthly cost: ${total_cost:.2f}")
    return total_cost
# Current config
current = simulate_cost(
samples_per_day=1_000_000,
retention_days=90
)
# Optimized config
optimized = simulate_cost(
samples_per_day=10_000, # 99% sampling
retention_days=30 # Reduced retention
)
print(f"\nSavings: ${current - optimized:.2f}/month ({(1 - optimized/current)*100:.1f}%)")
📋 Cost Review Process¶
Monthly Review¶
- Analyze spend breakdown
    - Storage vs ingestion vs queries
    - By team/project/environment
- Identify top consumers
    - Most expensive metrics
    - Highest cardinality labels
    - Most frequent queries
- Optimization opportunities
    - Unused metrics to remove
    - Metrics to sample more aggressively
    - Retention to reduce
- Forecast next month
    - Based on growth trends
    - Upcoming features
    - Seasonal patterns
Quarterly Review¶
- Vendor evaluation
    - Compare pricing models
    - Negotiate better rates
    - Consider alternatives
- Architecture review
    - Self-hosted vs SaaS
    - Multi-cloud optimization
    - Regional deployment
- ROI assessment
    - Value delivered
    - Incidents prevented
    - Performance improvements
🎯 Summary¶
Quick Wins (Implement Today)¶
- Enable sampling: 1% sample rate → 99% cost reduction
- Compress logs: gzip → 90% storage reduction
- Fix cardinality: Remove user_id labels → 95% cost reduction
- Reduce retention: 90d → 30d → 66% storage reduction
Target Metrics¶
✅ Cost per million samples: < $5
✅ Storage growth: < 10% month-over-month
✅ Total observability cost: < 5% of revenue
✅ ROI: > 10x (value vs cost)