Runbook: High Error Rate
Overview
Severity: 🟡 WARNING Impact: User experience degraded, may escalate to critical MTTR: 5-20 minutes
High error rate detected (>10% of requests failing). This is an early warning signal that may lead to circuit breaker opening if not addressed.
Symptoms
You'll know there's a high error rate when:
- ⚠️ Slack alert: "⚠️ High Error Rate Alert"
- ⚠️ Dashboard: Error rate >10% in Top Operations
- ⚠️ User reports: Occasional "Something went wrong" messages
- ⚠️ Logs: Increased error log volume
User Impact:
- Some requests succeed, others fail intermittently
- User experience degraded but not completely broken
- May notice slow response times
- Errors appear inconsistently
Immediate Actions (< 2 minutes)
1. Check Observability Dashboard
Navigate to:
Check:
- Current error rate percentage
- Which operations are failing
- Error rate trend (increasing/stable/decreasing)
- Circuit breaker state (still CLOSED?)
2. Identify Failing Operations
In Dashboard "Top Operations" section:
- Look for operations with high failure counts
- Note error rate percentage per operation
- Identify if errors are concentrated or widespread
3. Check Recent Deployments
Questions:
- Was there a deployment recently?
- Did error rate spike after deployment?
- Are there any ongoing releases?
Diagnosis Steps
Step 1: Identify Failing Endpoints
In Axiom:
https://app.axiom.co
Dataset: kingston-care-production
Query:
['level'] == "error"
| summarize count() by endpoint, status_code
| where count > 10
| sort by count desc
Time range: Last 15 minutes
Common Patterns:
- Single endpoint failing → Likely code bug in that endpoint
- Multiple endpoints → Infrastructure or database issue
- Specific HTTP status → See table below
Step 2: Classify Error Type
| Status Code | Type | Common Causes |
|---|---|---|
| 400 | Bad Request | Invalid input, schema changes |
| 401/403 | Auth Error | Token expiry, permission changes |
| 429 | Rate Limit | Traffic spike, DoS attack |
| 500 | Server Error | Code bug, database issue |
| 502/503 | Gateway Error | Vercel issue, cold start timeout |
| 504 | Timeout | Slow query, external API delay |
Step 3: Check Root Cause
For 4xx errors:
- Recent schema changes?
- Client app updated?
- API contract broken?
- Input validation changes?
For 5xx errors:
- Recent deployment?
- Database performance degraded?
- External service down?
- Memory/CPU exhaustion?
For timeouts (504):
- Slow query introduced?
- Cold start issue (Vercel)?
- External API latency?
- Database connection pool exhausted?
Step 4: Check System Resources
In Vercel Dashboard:
- Function execution time increasing?
- Memory usage spikes?
- Cold starts frequency?
In Supabase:
- Database CPU usage?
- Active connections count?
- Slow queries?
Resolution
Fix Deployment Issue
If error rate spiked after recent deployment:
# Option 1: Rollback bad deployment
vercel rollback PREV_DEPLOYMENT_URL
# Option 2: Hotfix and redeploy
git revert COMMIT_HASH
git push origin main
Verify:
# Wait 2-3 minutes after rollback
curl https://yourdomain.com/api/v1/health
# Check error rate dropped
# Monitor dashboard for 10 minutes
Fix Database Performance
If slow queries are the cause:
-- Find slow queries in Supabase SQL Editor
SELECT
query,
calls,
mean_exec_time as avg_ms,
max_exec_time as max_ms
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- >1s
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Add missing index (example)
CREATE INDEX CONCURRENTLY idx_services_org_id
ON services(organization_id);
Verify index created:
Fix Rate Limiting
If legitimate traffic is being rate-limited:
// Temporarily increase rate limit in affected API route
// Example: app/api/v1/search/services/route.ts
const { success } = await rateLimit({
key: identifier,
limit: 120, // Increase from 60
window: "1m",
})
Note: This is a temporary fix. Consider:
- Is the traffic legitimate?
- Should we add caching instead?
- Do we need DDoS protection?
Fix External Service Failure
If external API is down:
- Check service status page
- Implement fallback if available
- Add circuit breaker for external API
- Communicate downtime to users
Example - Add fallback:
try {
return await externalAPI.call()
} catch (error) {
logger.warn("External API failed, using fallback", { error })
return fallbackData
}
Fix Schema/Validation Issue
If 400 errors due to validation:
// Review recent schema changes
// Check Zod schema definitions
// Example fix - make field optional
const schema = z.object({
field: z.string().optional(), // Was required
})
Verification
After implementing fix:
- [ ] Check error rate: Should drop below 5% within 5 minutes
- [ ] Check dashboard: All operations showing green status
- [ ] Monitor for 15 minutes: No new error spikes
- [ ] Check Slack: No new high error rate alerts
- [ ] User reports: Complaints should stop
- [ ] Test critical paths: Manually test affected endpoints
Verification commands:
# Test endpoint that was failing
curl https://yourdomain.com/api/v1/[affected-endpoint]
# Check health endpoint
curl https://yourdomain.com/api/v1/health
# Monitor error logs
vercel logs --prod | grep -i error | tail -n 20
Escalation
When to Escalate
Escalate to CRITICAL severity if:
- Error rate >25% (quarter of requests failing)
- Error rate increasing despite fixes
- Circuit breaker opens
- User impact severe (customer complaints)
- Root cause unclear after 20 minutes
Escalation Actions
- Circuit Breaker Opens:
- Follow Circuit Breaker Open runbook
-
This becomes a CRITICAL incident
-
Root Cause Unclear:
- Escalate to senior engineer
- Gather diagnostic data (logs, metrics, traces)
-
Consider enabling debug logging
-
External Service Outage:
- Monitor external service status page
- Communicate to users via status page
-
Implement workaround if possible
-
Database Issue:
- Contact Supabase support
- Prepare for potential downtime
- Communicate to stakeholders
Prevention
Short-Term (After Incident)
- [ ] Add integration tests for error scenarios
- [ ] Set up API contract testing
- [ ] Add input validation tests
- [ ] Create alert for error rate >5% (earlier warning)
Long-Term (System Hardening)
- [ ] Implement comprehensive error monitoring
- [ ] Add distributed tracing (e.g., OpenTelemetry)
- [ ] Set up synthetic monitoring (uptime checks)
- [ ] Create load testing suite
- [ ] Add chaos engineering tests
- [ ] Implement graceful degradation patterns
- [ ] Add circuit breakers for external APIs
- [ ] Set up automated canary deployments
Monitoring Improvements
Add alerts for:
- Error rate >5% (early warning)
- Slow queries >500ms
- Rate limit triggers >10/min
- External API failures
- Database connection pool usage >80%
Add dashboards for:
- Error breakdown by endpoint
- Error breakdown by status code
- Error rate trend over time
- Top failing operations
Related Resources
- Runbooks:
- Circuit Breaker Open
-
Dashboards:
- Observability Dashboard:
/admin/observability - Axiom Logs
- Vercel Dashboard
-
Documentation:
- Performance Tracking
- Database Optimization
- Alerting Setup
Post-Incident Checklist
After resolving a high error rate incident:
- [ ] Document root cause
- [ ] Update runbook if gaps found
- [ ] Create post-mortem (if significant)
- [ ] Add preventive measures
- [ ] Update monitoring/alerting if needed
- [ ] Share learnings with team
- [ ] Schedule follow-up review
Last Updated: 2026-01-31 Reviewed By: [Pending first production incident] Next Review: After first high error rate incident or 2026-02-28