💰 Cost Optimization¶
📊 Where Costs Come From¶
Cost Breakdown¶
╔═══════════════════════════════════════════════════════════════╗
║ Component │ % of Total │ Monthly Cost │ Optimization ║
╠══════════════════╪════════════╪══════════════╪═══════════════╣
║ Storage │ 40% │ $200-800 │ Retention ║
║ Ingestion │ 30% │ $150-600 │ Sampling ║
║ Queries │ 20% │ $100-400 │ Caching ║
║ Alerting │ 10% │ $50-200 │ Grouping ║
╠══════════════════╪════════════╪══════════════╪═══════════════╣
║ TOTAL │ 100% │ $500-2000 │ ║
╚═══════════════════════════════════════════════════════════════╝
Cost Drivers¶
High cardinality:
- User IDs as labels → millions of time series
- Session IDs as labels → unbounded growth
- Timestamps as labels → infinite cardinality

High frequency:
- Logging every audio buffer → 1M events/sec
- Per-sample metrics → 48,000 metrics/sec @ 48kHz
- Unsampled high-volume events

Long retention:
- 90 days raw data → 3x storage cost vs 30 days
- No aggregation → storing every data point
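To get a feel for how quickly these drivers compound, here is a back-of-the-envelope calculation in Python using the illustrative numbers from the list above; none of these are measurements.

# Illustrative figures from the cost drivers above, not measurements
series_from_user_ids = 1_000_000                 # one time series per user_id label value
audio_samples_per_day = 48_000 * 60 * 60 * 24    # per-sample metrics at 48 kHz
raw_retention_days = 90

stored_samples = audio_samples_per_day * raw_retention_days
print(f"{series_from_user_ids:,} time series from a single user_id label")
print(f"{audio_samples_per_day:,} samples/day from per-sample metrics")
print(f"{stored_samples:,} samples kept at {raw_retention_days}-day raw retention")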
🎯 Optimization Strategies¶
1. Sampling (99% Cost Reduction)¶
Before:
// Log every request (1M requests/day)
void processRequest() {
logger.info("Request processed");
metrics.recordRequest();
}
// Cost: $500/month for 30M datapoints
After:
// Sample 1% of requests (10K/day)
void processRequest() {
if (rand() % 100 == 0) { // 1% sample rate
logger.info("Request processed");
metrics.recordRequest();
}
}
// Cost: $5/month for 300K datapoints
// Savings: $495/month (99%)
Smart Sampling:
// Sample based on importance
float getSampleRate(const Event& event) {
    if (event.level == ERROR) return 1.0f;   // 100% of errors
    if (event.level == WARN)  return 0.1f;   // 10% of warnings
    if (event.level == INFO)  return 0.01f;  // 1% of info
    return 0.001f;                           // 0.1% of debug
}

void logEvent(const Event& event) {
    float sample_rate = getSampleRate(event);
    if ((float)rand() / RAND_MAX < sample_rate) {
        logger.info(event.toString());
    }
}
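One refinement to consider: with rand()-based sampling each event is kept or dropped independently, so a sampled request may end up with only some of its log lines. Hashing a stable identifier instead keeps or drops all events for a request together. A minimal Python sketch of the idea, assuming a trace_id (or request ID) is available on every event:

import hashlib

def should_sample(trace_id: str, sample_rate: float) -> bool:
    # Hash the ID into [0, 1); the same ID always lands in the same bucket,
    # so every event carrying that ID gets the same keep/drop decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Example: keep ~1% of requests, but keep their events completely
if should_sample("req-7f3a9c", 0.01):
    print("log every event for this request")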
2. Retention Tuning (70% Storage Reduction)¶
Before:
# Flat retention: keep everything for 90 days
retention:
  - default: 90d
# Storage: 300 GB
# Cost: $900/month
After:
# Tiered retention
retention:
  - selector: '{__name__=~"audio_.*"}'
    time: 7d    # High-frequency audio metrics
  - selector: '{__name__=~"error_.*|crash_.*"}'
    time: 90d   # Critical errors
  - selector: '{job="aggregated"}'
    time: 365d  # Pre-aggregated summaries
  - default: 30d  # Everything else
# Storage: 90 GB
# Cost: $270/month
# Savings: $630/month (70%)
Retention Strategy:
┌─────────────────────────────────────────────┐
│ Data Type │ Raw │ Aggregated │ Total │
├──────────────┼────────┼────────────┼───────┤
│ Audio metrics│ 7 days │ 30 days │ 30d │
│ Errors │ 90 days│ 1 year │ 1y │
│ Performance │ 30 days│ 90 days │ 90d │
│ Usage │ 30 days│ 6 months │ 6m │
└─────────────────────────────────────────────┘
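The storage impact of a tiered policy can be estimated before rolling it out. A small Python sketch, using hypothetical raw-ingest rates per data type (substitute your own measurements):

# Hypothetical raw-ingest rates in GB/day per data type
gb_per_day = {"audio": 2.5, "errors": 0.3, "performance": 0.3, "usage": 0.2}
raw_days   = {"audio": 7,   "errors": 90,  "performance": 30,  "usage": 30}
FLAT_DAYS  = 90  # the old "keep everything for 90 days" policy

flat   = sum(rate * FLAT_DAYS for rate in gb_per_day.values())
tiered = sum(gb_per_day[k] * raw_days[k] for k in gb_per_day)

print(f"Flat {FLAT_DAYS}d retention: {flat:.0f} GB")
print(f"Tiered retention: {tiered:.0f} GB ({(1 - tiered / flat) * 100:.0f}% less)")

Longer-lived pre-aggregated rollups add back only a small fraction of the raw volume.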
3. Metric Cardinality Reduction (95% Cost Reduction)¶
Before:
// ❌ BAD: user_id as label (1M unique users)
metrics.counter("login_count")
.labels({{"user_id", userId}});
// Creates 1,000,000 time series
// Cost: $5000/month
After:
// ✅ GOOD: user_tier as label (3 tiers)
metrics.counter("login_count")
.labels({{"user_tier", getUserTier(userId)}});
// Creates 3 time series
// Cost: $250/month
// Savings: $4750/month (95%)
Cardinality Analysis:
// Calculate cardinality impact
struct MetricCardinality {
std::string name;
size_t unique_series;
float cost_per_month;
};
void analyzeCardinality() {
// user_id label: 1M users
// session_id label: 10M sessions
// endpoint label: 50 endpoints
//
// Total series = 1M × 10M × 50 = 500 trillion! 💥
// NEVER DO THIS
// Instead:
// user_tier: 3 (free, pro, enterprise)
// request_type: 10 (GET, POST, etc.)
// endpoint_category: 5 (api, web, admin, etc.)
//
// Total series = 3 × 10 × 5 = 150 ✅
}
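A practical way to enforce these bounds is to normalize label values where metrics are recorded, so unbounded identifiers never reach the metrics pipeline. A minimal Python sketch; the tier names and endpoint categories mirror the hypothetical ones in the comments above:

ALLOWED_TIERS = {"free", "pro", "enterprise"}

def bounded_labels(user_tier: str, endpoint: str) -> dict:
    # Collapse anything unexpected into a fixed fallback so cardinality stays bounded
    tier = user_tier if user_tier in ALLOWED_TIERS else "unknown"
    if endpoint.startswith("/api/"):
        category = "api"
    elif endpoint.startswith("/admin/"):
        category = "admin"
    else:
        category = "web"
    # Note: user_id is deliberately absent; put it in logs/traces, not metric labels
    return {"user_tier": tier, "endpoint_category": category}

print(bounded_labels("pro", "/api/v1/render"))  # {'user_tier': 'pro', 'endpoint_category': 'api'}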
4. Log Compression (90% Storage Reduction)¶
Raw logs:
2024-01-01 12:00:00 [INFO] Processing request from user 12345
2024-01-01 12:00:01 [INFO] Processing request from user 12346
...
Size: 1 GB/day
Cost: $30/day = $900/month
Compressed logs:
# Enable compression
gzip logs/audiolab.log
# Compressed size
Size: 100 MB/day
Cost: $3/day = $90/month
Savings: $810/month (90%)
Structured logging + compression:
{"t":1704110400,"l":"INFO","m":"req_proc","u":12345}
{"t":1704110401,"l":"INFO","m":"req_proc","u":12346}
// Even better compression ratio
Size: 50 MB/day
Cost: $1.50/day = $45/month
Savings: $855/month (95%)
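Writing the structured records straight into a gzip stream avoids a separate compression pass. A minimal Python sketch using the same short field names as above; the file path is illustrative:

import gzip
import json
import time

# Append compact JSON-lines records directly to a gzip-compressed file
with gzip.open("audiolab.log.jsonl.gz", "at", encoding="utf-8") as log:
    record = {"t": int(time.time()), "l": "INFO", "m": "req_proc", "u": 12345}
    log.write(json.dumps(record, separators=(",", ":")) + "\n")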
5. Query Optimization (80% Query Cost Reduction)¶
Before:
# Expensive: Full table scan
sum(rate(requests_total[5m]))
# Cost: $0.10 per query
# 10,000 queries/month = $1000/month
After:
# Pre-compute with recording rules
groups:
  - name: request_rates
    interval: 30s
    rules:
      - record: requests_rate_5m
        expr: sum(rate(requests_total[5m]))
# Then query pre-computed metric
requests_rate_5m
# Cost: $0.02 per query
# 10,000 queries/month = $200/month
# Savings: $800/month (80%)
6. Batch Processing (50% Ingestion Cost Reduction)¶
Before:
// Send each metric individually
void recordMetric(const std::string& name, double value) {
http.post("/api/metrics", {name, value});
// 1000 requests/sec = high API cost
}
After:
// Batch metrics together
class MetricsBatcher {
std::vector<Metric> batch_;
public:
void add(const std::string& name, double value) {
batch_.push_back({name, value});
if (batch_.size() >= 100) {
flush();
}
}
void flush() {
http.post("/api/metrics/batch", batch_);
batch_.clear();
}
};
// 10 requests/sec instead of 1000
// Cost reduction: 50%
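One caveat with a purely size-based batcher: during quiet periods metrics can sit in the buffer indefinitely. A Python sketch of a batcher that also flushes on age; send_batch is a placeholder for the real HTTP call:

import time

class TimedMetricsBatcher:
    def __init__(self, send_batch, max_size=100, max_age_s=5.0):
        self.send_batch = send_batch      # callable that ships one batch, e.g. an HTTP POST
        self.max_size = max_size
        self.max_age_s = max_age_s
        self.batch = []
        self.started = time.monotonic()

    def add(self, name, value):
        self.batch.append((name, value))
        too_big = len(self.batch) >= self.max_size
        too_old = time.monotonic() - self.started >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.batch:
            self.send_batch(self.batch)
        self.batch = []
        self.started = time.monotonic()

batcher = TimedMetricsBatcher(send_batch=print)
batcher.add("audio_latency_ms", 4.2)
batcher.flush()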
7. Alert Grouping (90% Alert Cost Reduction)¶
Before:
# Alert on every instance
- alert: HighCPU
  expr: cpu_usage > 80
# Fires for each of 100 instances
# = 100 alerts
# Cost: $100/month
After:
# Alert on aggregate
- alert: HighCPU
  expr: avg(cpu_usage) > 80
# Fires once for cluster
# = 1 alert
# Cost: $10/month
# Savings: $90/month (90%)
💡 Cost Targets¶
Budget Allocation¶
╔══════════════════════════════════════════════════════════╗
║ Company Size │ Revenue │ Observability Budget ║
╠══════════════════╪══════════════╪═══════════════════════╣
║ Startup │ < $1M │ $500-1000/month ║
║ Scale-up │ $1M-10M │ $1000-5000/month ║
║ Enterprise │ $10M+ │ $5000-20000/month ║
╚══════════════════════════════════════════════════════════╝
Target: < 5% of revenue on observability
ROI Calculation¶
Cost Avoidance from Observability:
1. Faster incident resolution
- 30 min → 5 min MTTR
- Savings: 25 min × $200/hr = $83/incident
- 10 incidents/month = $830/month
2. Prevent outages
- Early warning → prevent 1 outage/month
- Outage cost: $10,000
- Savings: $10,000/month
3. Optimize performance
- Find bottleneck → 20% speedup
- Save 20% infrastructure = $2000/month
Total Savings: $12,830/month
Observability Cost: $1,000/month
ROI: 12.8x
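The same arithmetic as a short script, so the ROI can be recomputed with your own figures:

incident_savings = 83 * 10      # $83 saved per incident (25 min × $200/hr), 10 incidents/month
outage_savings   = 10_000       # one prevented outage per month
infra_savings    = 2_000        # 20% infrastructure saving found via profiling
total_savings    = incident_savings + outage_savings + infra_savings

observability_cost = 1_000
print(f"Total savings: ${total_savings:,}/month")          # $12,830/month
print(f"ROI: {total_savings / observability_cost:.1f}x")   # 12.8x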
📉 Cost Reduction Checklist¶
Immediate (Do Now)¶
□ Enable sampling for high-volume logs (1-10%)
□ Compress log files (gzip)
□ Reduce metric cardinality (review labels)
□ Set retention to 30 days (was 90)
□ Enable query caching
□ Batch metric submissions
Short-term (This Week)¶
□ Implement recording rules for common queries
□ Set up tiered retention (7d/30d/90d)
□ Remove unused metrics
□ Consolidate duplicate dashboards
□ Optimize alert grouping
□ Review and remove zombie time series
Long-term (This Month)¶
□ Migrate to self-hosted Prometheus (if on SaaS)
□ Implement adaptive sampling
□ Build cost monitoring dashboard
□ Set up automated cost alerts
□ Review vendor contracts
□ Evaluate alternative providers
📊 Cost Monitoring¶
Track Spend Over Time¶
# Cost per million samples
cost_per_million = total_spend / (samples_ingested / 1000000)
# Trend analysis
increase(observability_cost[30d])
# Budget utilization
(current_spend / budget) * 100
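Two of these as plain Python helpers, handy for a quick check outside the query engine; the example values are made up:

def cost_per_million(total_spend: float, samples_ingested: int) -> float:
    return total_spend / (samples_ingested / 1_000_000)

def budget_utilization(current_spend: float, budget: float) -> float:
    return current_spend / budget * 100

# Example: $500 spent on 120M ingested samples against a $1,000 monthly budget
print(f"${cost_per_million(500, 120_000_000):.2f} per million samples")  # $4.17
print(f"{budget_utilization(500, 1_000):.0f}% of budget used")           # 50%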
Cost Dashboard¶
panels:
  - title: "Daily Spend"
    query: sum(increase(cost_total[1d]))
  - title: "Cost by Component"
    query: sum(cost_total) by (component)
  - title: "Samples per Dollar"
    query: sum(samples_ingested) / sum(cost_total)
  - title: "Top 10 Expensive Metrics"
    query: topk(10, sum(cost_total) by (metric_name))
Automated Alerts¶
- alert: CostBudgetExceeded
  expr: sum(increase(cost_total[1d])) * 30 > budget_monthly
  annotations:
    summary: "Monthly budget will be exceeded"
- alert: UnexpectedCostSpike
  expr: rate(cost_total[1h]) > rate(cost_total[1d] offset 1d) * 1.5
  annotations:
    summary: "Cost increased 50% compared to yesterday"
🔧 Cost Optimization Tools¶
1. Cardinality Explorer¶
# Find high-cardinality metrics
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
  jq -r '.data[]' | \
  while read -r metric; do
    # URL-encode the query so the braces and quotes survive the request
    count=$(curl -sG 'http://localhost:9090/api/v1/query' \
      --data-urlencode "query=count({__name__=\"$metric\"})" | \
      jq -r '.data.result[0].value[1]')
    echo "$metric: $count series"
  done | sort -t: -k2 -nr | head -20
2. Storage Analyzer¶
#!/usr/bin/env python3
import requests

def analyze_storage():
    # Get all metric names
    metrics = requests.get('http://localhost:9090/api/v1/label/__name__/values').json()['data']
    total_size = 0
    for metric in metrics:
        # Get the series count for this metric
        query = f'count({{__name__="{metric}"}})'
        result = requests.get('http://localhost:9090/api/v1/query', params={'query': query}).json()
        if result['data']['result']:
            series = int(result['data']['result'][0]['value'][1])
            size_mb = series * 0.01  # Rough estimate: ~10 KB per series
            total_size += size_mb
            print(f"{metric}: {series} series, ~{size_mb:.2f} MB")
    print(f"\nTotal estimated storage: {total_size:.2f} MB")

analyze_storage()
3. Cost Simulator¶
def simulate_cost(samples_per_day, retention_days, cost_per_million=5):
    total_samples = samples_per_day * retention_days
    total_cost = (total_samples / 1_000_000) * cost_per_million
    print(f"Samples/day: {samples_per_day:,}")
    print(f"Retention: {retention_days} days")
    print(f"Total samples: {total_samples:,}")
    print(f"Monthly cost: ${total_cost:.2f}")
    return total_cost
# Current config
current = simulate_cost(
samples_per_day=1_000_000,
retention_days=90
)
# Optimized config
optimized = simulate_cost(
samples_per_day=10_000, # 99% sampling
retention_days=30 # Reduced retention
)
print(f"\nSavings: ${current - optimized:.2f}/month ({(1 - optimized/current)*100:.1f}%)")
📋 Cost Review Process¶
Monthly Review¶
- Analyze spend breakdown
    - Storage vs ingestion vs queries
    - By team/project/environment
- Identify top consumers
    - Most expensive metrics
    - Highest cardinality labels
    - Most frequent queries
- Optimization opportunities
    - Unused metrics to remove
    - Metrics to sample more aggressively
    - Retention to reduce
- Forecast next month
    - Based on growth trends
    - Upcoming features
    - Seasonal patterns
Quarterly Review¶
- Vendor evaluation
    - Compare pricing models
    - Negotiate better rates
    - Consider alternatives
- Architecture review
    - Self-hosted vs SaaS
    - Multi-cloud optimization
    - Regional deployment
- ROI assessment
    - Value delivered
    - Incidents prevented
    - Performance improvements
🎯 Summary¶
Quick Wins (Implement Today)¶
- Enable sampling: 1% sample rate → 99% cost reduction
- Compress logs: gzip → 90% storage reduction
- Fix cardinality: Remove user_id labels → 95% cost reduction
- Reduce retention: 90d → 30d → 66% storage reduction
Target Metrics¶
✅ Cost per million samples: < $5
✅ Storage growth: < 10% month-over-month
✅ Total observability cost: < 5% of revenue
✅ ROI: > 10x (value vs cost)