Runbook: SLO Violation

Status: Active Last Updated: 2026-02-05 Owner: Platform Team Related Runbooks: Circuit Breaker Open, High Error Rate, Slow Queries

Overview

This runbook provides response procedures for Service Level Objective (SLO) violations in CareConnect.

SLO Targets (PROVISIONAL):

Uptime: 99.5% (3h 36m downtime budget/month)
Latency: p95 < 800ms
Error Budget: 0.5% (derived from uptime)

Note: Current SLO targets are PROVISIONAL (see docs/planning/v18-0-phase-3-slo-decision-guide.md). Adjust based on production data and business requirements.

Severity Levels

Severity	Condition	Response Time	Impact
Critical	Uptime < 99.5%, Error budget exhausted, Latency p95 > 800ms	15 minutes	User-facing degradation
Warning	Error budget >50% consumed	1 hour	Potential future impact

Alert Types

1. Uptime SLO Violation

Symptom: Uptime drops below 99.5% target over 30-day window

Possible Causes:

Circuit breaker frequently opening (database failures)
Recurring health check failures
Infrastructure outages
Deployment issues

Diagnosis:

# Check circuit breaker state
curl https://your-domain.com/api/v1/health | jq '.checks.circuitBreaker'

# View observability dashboard
open https://your-domain.com/admin/observability

# Check recent uptime history
# (View SLO Compliance Card on dashboard)

Resolution:

Identify Root Cause:
Check circuit breaker status (see Circuit Breaker Runbook)
Review recent deployments (rollback if necessary)
Check infrastructure status (Vercel, Supabase dashboards)
Immediate Actions:
If circuit breaker OPEN: Follow circuit breaker runbook
If deployment-related: Rollback to last known good version
If infrastructure issue: Contact provider support
Mitigation:
Enable fallback to local JSON data (automatic)
Increase circuit breaker failure threshold (temporary)
Scale database resources if capacity-related
Recovery Validation:
Monitor uptime trending back up
Confirm health checks passing consistently
Verify error budget stabilizing

Prevention:

Improve database resilience (connection pooling, query optimization)
Implement canary deployments
Add pre-deployment smoke tests
Set up infrastructure monitoring alerts

2. Error Budget Exhausted

Symptom: Error budget reaches 0% (all downtime allowance consumed)

Possible Causes:

Multiple incidents in short time period
Sustained degraded performance
High error rate over extended period
Cascading failures

Diagnosis:

# Check error budget status
curl https://your-domain.com/api/v1/health | jq '.checks.slo.errorBudget'

# Check recent incidents
# (View dashboard for circuit breaker history, error rate trends)

# Calculate burn rate
# Error budget consumed / days elapsed = daily burn rate

Resolution:

Freeze Non-Critical Changes:
Pause feature deployments
Only deploy critical fixes
Increase code review rigor
Stabilization Actions:
Investigate recent errors (check logs, Axiom)
Fix highest-impact bugs first
Increase monitoring sensitivity
Add extra validation before deploys
Budget Recovery:
Error budget resets over 30-day rolling window
Focus on incident prevention
Monitor daily burn rate
Wait for window to roll past incident dates
Communicate:
Notify stakeholders of change freeze
Update status page (if applicable)
Document incident timeline
Plan post-incident review

Prevention:

Improve testing coverage (see docs/development/testing-guidelines.md)
Add feature flags for gradual rollouts
Implement automated rollback triggers
Conduct pre-mortem planning for high-risk changes

3. Latency SLO Violation

Symptom: p95 latency exceeds 800ms target

Possible Causes:

Database query performance degradation
Increased traffic load
Memory pressure (WebLLM)
Network latency
Embedding generation overhead

Diagnosis:

# Check current latency metrics
curl https://your-domain.com/api/v1/health | jq '.checks.performance.metrics'

# View latency trends on dashboard
open https://your-domain.com/admin/observability

# Run load test to reproduce
npm run test:load:smoke

Resolution:

Identify Slow Operations:
Check Performance Charts on dashboard (p50/p95/p99)
Review Axiom logs for slow queries
Profile search operations locally
Check database query performance
Quick Wins:
Restart server (clear in-memory caches)
Scale database compute (Supabase dashboard)
Disable WebLLM if memory-constrained
Enable aggressive caching
Optimization:
Optimize slow database queries (add indexes)
Reduce embedding dimensions (if applicable)
Implement request coalescing
Add API response caching
Load Management:
Review rate limits (increase if legitimate traffic)
Add request prioritization
Implement graceful degradation
Consider CDN for static assets

Prevention:

Regular performance testing (see npm run test:load)
Query performance monitoring
Capacity planning based on growth trends
Optimize hot paths proactively

Response Checklist

Initial Response (0-15 minutes)

[ ] Acknowledge alert (Slack thread reply)
[ ] Check dashboard: /admin/observability
[ ] Identify violation type (uptime/error-budget/latency)
[ ] Determine severity (critical/warning)
[ ] Assess user impact (check support channels)

Investigation (15-30 minutes)

[ ] Review related runbooks
[ ] Check circuit breaker state
[ ] Review recent deployments
[ ] Check infrastructure status
[ ] Identify root cause

Mitigation (30-60 minutes)

[ ] Apply immediate fixes (rollback, scaling, etc.)
[ ] Verify mitigation effectiveness
[ ] Update stakeholders
[ ] Document actions taken

Recovery (1-4 hours)

[ ] Monitor SLO trending back to compliance
[ ] Confirm error budget stabilizing
[ ] Remove temporary mitigations
[ ] Close incident

Post-Incident (1-7 days)

[ ] Schedule post-incident review
[ ] Document lessons learned
[ ] Implement preventive measures
[ ] Update runbooks if needed

Escalation

Level 1: On-Call Engineer

Initial response and diagnosis
Apply standard mitigations
Follow runbook procedures

Level 2: Platform Lead

Escalate if:
Multiple SLOs violated simultaneously
Root cause unclear after 30 minutes
Standard mitigations ineffective
User impact significant

Level 3: Infrastructure/Database Team

Escalate if:
Database performance issues
Infrastructure-level problems
Requires provider support

Metrics and Monitoring

Dashboard Location: /admin/observability

Key Metrics:

Uptime percentage (30-day window)
Error budget remaining (%)
Latency p95 (ms)
SLO compliance status

Alert Throttling:

Uptime violation: 30 minutes
Error budget exhausted: 1 hour
Latency violation: 15 minutes

Data Sources:

In-memory uptime tracking (30-day retention)
Performance metrics (in-memory)
Circuit breaker telemetry

Configuration

SLO Targets: lib/config/slo-targets.ts Tracker Logic: lib/observability/slo-tracker.ts Alert Integration: lib/integrations/slack.ts Health Check: app/api/v1/health/route.ts

Notes

In-Memory Tracking: Uptime data resets on server restart (rebuilds quickly)
Provisional Targets: Review and adjust after 2-4 weeks of production data
30-Day Window: SLOs use rolling 30-day window (not calendar month)
Alert Fatigue: Throttling prevents spam during prolonged incidents

Changelog

Date	Change	Author
2026-02-05	Initial version	Platform Team

Questions? Contact platform team or file issue in GitHub.

Runbook: SLO Violation

Overview

Severity Levels

Alert Types

1. Uptime SLO Violation

2. Error Budget Exhausted

3. Latency SLO Violation

Response Checklist

Initial Response (0-15 minutes)

Investigation (15-30 minutes)

Mitigation (30-60 minutes)

Recovery (1-4 hours)

Post-Incident (1-7 days)

Escalation

Level 1: On-Call Engineer

Level 2: Platform Lead

Level 3: Infrastructure/Database Team

Metrics and Monitoring

Related Documentation

Configuration

Notes

Changelog