Alerting Setup Guide
Overview
CareConnect sends automated Slack alerts for critical system events, enabling proactive incident detection and rapid response.
Alert Types:
- 🚨 Circuit Breaker OPEN (Critical) - Database protection activated
- ✅ Circuit Breaker CLOSED (Info) - System recovered
- ⚠️ High Error Rate (Warning) - Early warning signal
Key Features:
- Real-time notifications (alerts sent within 30 seconds)
- Alert throttling to prevent spam
- Rich Slack formatting with dashboard + runbook links
- Production-only (no noise in development)
Prerequisites
- Slack workspace with admin access
- 5 minutes for setup
- Vercel deployment (for production alerts)
Setup Steps
Step 1: Create Slack Incoming Webhook (3 minutes)
1.1 Create Slack App
- Go to https://api.slack.com/apps
- Click "Create New App" → "From scratch"
- App Name: `Kingston Care Alerts`
- Select your workspace
- Click "Create App"
1.2 Enable Incoming Webhooks
- In the left sidebar, click "Incoming Webhooks"
- Toggle "Activate Incoming Webhooks" to ON
- Scroll down and click "Add New Webhook to Workspace"
- Select channel: `#kingston-alerts` (create the channel first if needed)
- Recommended: Create a dedicated channel for alerts
- Alternative: Use `#general` or `#engineering`
- Click "Allow"
1.3 Copy Webhook URL
- After authorization, you'll see your webhook URL
- It will look like: `https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX`
- Click the "Copy" button
- Save this URL securely (you'll need it in the next step)
Verification:
# Test webhook manually
curl -X POST "YOUR_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{"text": "🧪 Test alert from CareConnect"}'
If successful, you should see the message appear in your Slack channel within 5 seconds.
Step 2: Configure Environment Variables (1 minute)
For Local Development:
Add to .env.local:
# Slack Alerting (v18.0 Phase 2)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
Notes:
- Alerts will NOT send in development (production-only by design)
- You can still test the integration with `NODE_ENV=production`
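The production-only gating can be sketched as a simple guard. This is a minimal illustration; `shouldSendAlerts` is a hypothetical name, not CareConnect's actual function:

```typescript
// Hypothetical sketch of the production-only guard in the alerting path.
// Alerts require both a production environment and a configured webhook.
function shouldSendAlerts(env: {
  NODE_ENV?: string;
  SLACK_WEBHOOK_URL?: string;
}): boolean {
  return env.NODE_ENV === "production" && Boolean(env.SLACK_WEBHOOK_URL);
}

// Typical call site: shouldSendAlerts(process.env)
```

This is why setting `NODE_ENV=production` locally (with the webhook URL present) lets you exercise the integration during testing.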
For Production (Vercel):
- Go to Vercel dashboard → Your Project → Settings → Environment Variables
- Click "Add New"
- Fill in:
- Name: `SLACK_WEBHOOK_URL`
- Value: Your webhook URL (paste from Step 1)
- Environment: Select "Production" (and optionally "Preview" for staging)
- Click "Save"
Redeploy the project so the new environment variable takes effect.
Step 3: Verify Alerting Works (1 minute)
Option A: Wait for Real Circuit Breaker Event
Monitor your Slack channel. When the circuit breaker opens in production, you should receive an alert within 30 seconds.
Option B: Manual Test (Advanced)
If you want to test immediately, you can manually trigger the circuit breaker using the admin dashboard or by simulating database failures.
Expected Result:
When circuit breaker opens, you should see a Slack message like:
🚨 Circuit Breaker Alert
Status: OPEN
Previous: CLOSED
Failure Rate: 75.0%
Failures: 5
⚠️ Database operations are being protected. Check the dashboard for details and follow the runbook for troubleshooting steps.
Time: 2026-01-30, 2:45:30 PM
[📊 View Dashboard] [📖 View Runbook]
Alert Configuration
Alert Types & Throttling
| Alert Type | Severity | Throttle Window | Triggers When |
|---|---|---|---|
| Circuit Breaker OPEN | 🔴 Critical | 10 minutes | Circuit opens due to failures |
| Circuit Breaker CLOSED | 🟢 Info | 1 hour | Circuit recovers (optional) |
| High Error Rate | 🟡 Warning | 5 minutes | Error rate >10% |
Throttling Example:
2:00 PM - Circuit opens → Alert sent ✅
2:05 PM - Circuit opens again → Alert blocked ❌ (within 10-min window)
2:12 PM - Circuit opens again → Alert sent ✅ (window expired)
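The throttling behavior above can be sketched as a small per-instance check. This is illustrative only; the names are hypothetical and the real logic lives in `lib/observability/alert-throttle.ts`:

```typescript
// Illustrative per-instance throttle mirroring the windows described above.
const THROTTLE_WINDOWS_MS: Record<string, number> = {
  "circuit-open": 10 * 60 * 1000, // 10 minutes
  "circuit-closed": 60 * 60 * 1000, // 1 hour
  "high-error-rate": 5 * 60 * 1000, // 5 minutes
};

// Timestamp of the last *sent* alert, per alert type (in-memory, per instance).
const lastSentAt = new Map<string, number>();

function isAllowed(alertType: string, now: number): boolean {
  const windowMs = THROTTLE_WINDOWS_MS[alertType] ?? 0;
  const last = lastSentAt.get(alertType);
  // Blocked attempts do not extend the window; only sent alerts are recorded.
  if (last !== undefined && now - last < windowMs) return false;
  lastSentAt.set(alertType, now);
  return true;
}
```

Because blocked attempts do not reset the window, the 2:12 PM alert in the example goes through: it is 12 minutes after the last *sent* alert at 2:00 PM.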
Alert Contents
Circuit Breaker OPEN Alert Includes:
- Current state (OPEN)
- Previous state (CLOSED)
- Failure count (e.g., 5 failures)
- Failure rate percentage (e.g., 75%)
- Timestamp (Eastern Time)
- Dashboard link (real-time metrics)
- Runbook link (troubleshooting guide)
Message Format:
- Fallback text: Plain text for notifications
- Rich blocks: Formatted Slack message with sections and buttons
- Color coding: Red for critical, yellow for warning, green for info
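The format above can be sketched as a small payload builder. The `text`/`attachments`/`blocks` field names follow Slack's Incoming Webhooks and Block Kit conventions; the helper name and exact colors here are assumptions, not CareConnect's actual implementation:

```typescript
// Hedged sketch of a Slack webhook payload: plain-text fallback plus
// color-coded rich blocks, as described above.
type Severity = "critical" | "warning" | "info";

const SEVERITY_COLORS: Record<Severity, string> = {
  critical: "#d32f2f", // red
  warning: "#fbc02d", // yellow
  info: "#388e3c", // green
};

function buildAlertPayload(title: string, details: string, severity: Severity) {
  return {
    // Fallback text: what push notifications and unsupported clients show.
    text: `${title}: ${details}`,
    // Rich attachment: colored sidebar plus Block Kit sections.
    attachments: [
      {
        color: SEVERITY_COLORS[severity],
        blocks: [
          { type: "header", text: { type: "plain_text", text: title } },
          { type: "section", text: { type: "mrkdwn", text: details } },
        ],
      },
    ],
  };
}
```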
Troubleshooting
No Alerts Received
Check 1: Webhook URL Configured
Confirm SLACK_WEBHOOK_URL is set in your environment (see Step 2) and matches the webhook created in Step 1.
Check 2: Production Environment
Alerts only send in production (NODE_ENV=production). Verify that your deployment sets NODE_ENV to production.
Check 3: Slack Channel Membership
Ensure you're a member of the #kingston-alerts channel where alerts are sent.
Check 4: Circuit Breaker Actually Opened
Check observability dashboard at /admin/observability:
- Circuit Breaker card should show state = OPEN (red)
- If state is CLOSED, no alert will be sent
Check 5: Webhook URL Validity
Test webhook manually:
curl -X POST "$SLACK_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{"text": "Test message"}'
# Should return: "ok"
Alerts Are Duplicated
Cause: Multiple server instances or throttling not working.
Expected Behavior:
- In serverless (Vercel), each instance has its own throttle
- You may see 1-2 duplicate alerts on redeploy
- This is acceptable (Axiom logs show all events for deduplication)
Not a concern unless:
- More than 3 duplicate alerts for a single event
- Alerts spam continuously (>10 per minute)
Alerts Are Delayed
Cause: Serverless cold starts or network latency.
Expected Latency:
- Typical: 2-5 seconds
- Maximum: 30 seconds
If delays >1 minute:
- Check Vercel function logs for errors
- Check Slack API status: https://status.slack.com
- Verify network connectivity from Vercel region
Alert Format Is Broken
Cause: Slack API version mismatch or webhook configuration.
Fix:
- Recreate webhook (Steps 1.1-1.3 above)
- Ensure the webhook URL starts with `https://hooks.slack.com/services/`
- Check the Slack app has the correct permissions (Incoming Webhooks)
Throttling Not Working
Symptom: Receiving >1 alert per 10 minutes for circuit-open.
Cause: Throttle is per-instance (serverless resets on redeploy).
Expected:
- First alert after deploy: Allowed
- Subsequent alerts within 10min: Blocked
- After server restart: Throttle resets
Not a bug: Serverless architecture means throttle resets occasionally.
Best Practices
Channel Setup
Recommended:
- Create a dedicated channel: `#kingston-alerts`
- Add key team members
- Enable mobile push notifications
- Set channel topic: "Production alerts - CareConnect"
Don't:
- Mix with other app alerts (creates noise)
- Use DMs (hard to track who's on-call)
- Disable notifications (defeats purpose)
Alert Hygiene
Do:
- ✅ Acknowledge alerts in a thread (shows you're investigating)
- ✅ Post the resolution in the thread (creates an incident log)
- ✅ Use emoji reactions to show status (👀 investigating, ✅ resolved)
- ✅ Review weekly: Are alerts actionable? Tune thresholds if needed
Don't:
- ❌ Ignore alerts (creates alert fatigue)
- ❌ Archive without investigating
- ❌ Silence alerts permanently (if too noisy, tune thresholds instead)
Runbook Integration
Every alert includes a runbook link. Use it!
Workflow:
- Alert arrives in Slack
- Click "View Dashboard" → See current system state
- Click "View Runbook" → Follow troubleshooting steps
- Post resolution in Slack thread
Threshold Tuning
If alerts are too frequent or too rare, you can adjust thresholds:
Circuit Breaker Thresholds (in code):
// lib/resilience/supabase-breaker.ts
const config = {
failureThreshold: 3, // Number of failures before opening
failureRateThreshold: 0.5, // 50% error rate threshold
timeout: 30000, // Recovery timeout (30s)
}
Alert Throttle Windows (in code):
// lib/observability/alert-throttle.ts
const THROTTLE_WINDOWS = {
"circuit-open": 10 * 60 * 1000, // 10 minutes
"circuit-closed": 60 * 60 * 1000, // 1 hour
"high-error-rate": 5 * 60 * 1000, // 5 minutes
}
Tuning Guidelines:
- Start conservative (current defaults are good)
- Collect 1 week of production data
- If >10 false positives per day: Increase thresholds
- If missing real incidents: Decrease thresholds
Monitoring Alerts
Axiom Logs
All alert events are logged to Axiom for analysis:
-- View all Slack alerts sent (last 24h)
SELECT * FROM kingston-care-production
WHERE component = 'slack'
AND _time > now() - interval '24 hours'
ORDER BY _time DESC
-- Count alerts by type
SELECT
alertType,
count(*) as alert_count
FROM kingston-care-production
WHERE component = 'alert-throttle'
AND message LIKE '%Alert allowed%'
GROUP BY alertType
Health Check
Verify alerting system is working:
# Check environment variable
echo $SLACK_WEBHOOK_URL
# Check production deployment
vercel env pull
grep SLACK_WEBHOOK_URL .env.production.local
# Check logs for recent alerts
vercel logs --prod | grep -i "slack"
Advanced Configuration
Custom Alert Channels
To send different alerts to different channels:
- Create multiple webhooks in Slack (one per channel)
- Add an environment variable for each webhook URL
- Modify `lib/integrations/slack.ts` to route by severity
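A severity-based router might look like the sketch below. The `SLACK_WEBHOOK_URL_CRITICAL` variable name is an assumption for illustration; only `SLACK_WEBHOOK_URL` exists in the setup above:

```typescript
// Illustrative router: critical alerts go to a dedicated webhook (channel)
// when one is configured, everything else falls back to the default.
function webhookForSeverity(
  severity: "critical" | "warning" | "info",
  env: Record<string, string | undefined>,
): string | undefined {
  if (severity === "critical" && env.SLACK_WEBHOOK_URL_CRITICAL) {
    return env.SLACK_WEBHOOK_URL_CRITICAL; // e.g. a dedicated incidents channel
  }
  return env.SLACK_WEBHOOK_URL; // default alerts channel
}
```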
Alert Escalation
To add PagerDuty or email escalation:
- Install `@pagerduty/pdjs` or use an email API
- Create `lib/integrations/pagerduty.ts`
- Add escalation logic in `lib/resilience/telemetry.ts`
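An escalation hook might look like the sketch below, posting directly to PagerDuty's public Events API v2. The endpoint and payload shape come from PagerDuty's spec; the function names and the wiring into `telemetry.ts` are assumptions:

```typescript
// Hedged sketch of PagerDuty escalation via the Events API v2.
// Building the event is kept pure so it can be tested without a network call.
interface PagerDutyEvent {
  routing_key: string;
  event_action: "trigger";
  payload: {
    summary: string;
    source: string;
    severity: "critical" | "error" | "warning" | "info";
  };
}

function buildPagerDutyEvent(routingKey: string, summary: string): PagerDutyEvent {
  return {
    routing_key: routingKey,
    event_action: "trigger",
    payload: { summary, source: "careconnect", severity: "critical" },
  };
}

// Side-effecting sender: POSTs the event to PagerDuty's enqueue endpoint.
async function escalate(routingKey: string, summary: string): Promise<void> {
  await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildPagerDutyEvent(routingKey, summary)),
  });
}
```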
Slack App Enhancements
Future Enhancements:
- Interactive buttons ("Acknowledge", "Silence for 1h")
- Slash commands ("/kingston status")
- Alert statistics in channel topic
- Daily summary thread
Related Documentation
- Observability Access & Setup: User Setup Required
- Circuit Breaker Runbook: Circuit Breaker Open
- Axiom Setup: User Setup Required
- Architecture: ADR-016 Performance Tracking
FAQ
Q: Can I send alerts to multiple channels? A: Yes, create multiple webhooks and modify the Slack integration to send to both.
Q: Can I test alerts in development? A: Alerts are production-only by design. You can temporarily set NODE_ENV=production locally to test.
Q: How do I silence alerts temporarily? A: Temporarily remove SLACK_WEBHOOK_URL from environment variables and redeploy. Restore after maintenance.
Q: Are alerts free? A: Yes, Slack webhooks are free with unlimited messages.
Q: Can I customize the message format? A: Yes, edit formatCircuitBreakerMessage() in lib/integrations/slack.ts.
Q: What if Slack is down? A: Alerts will fail gracefully (logged but not sent). Dashboard and Axiom are independent backups.
Last Updated: 2026-01-30 Version: 1.0 Maintained By: Platform Team