Launch Monitoring Checklist

Version: 1.0
Date Created: 2026-02-09
Last Updated: 2026-03-11
Purpose: Systematic monitoring procedures for safe production launch

Overview

This checklist covers monitoring from one hour before launch through the first week in production. Follow it to detect issues early and respond to them quickly.

Launch Phases:

Pre-Launch (T-1 hour): Final verification before go-live
Launch Day - Critical Hours (Hours 0-4): Intensive monitoring
Launch Day - Extended (Hours 4-24): Continued close monitoring
Post-Launch Week (Days 2-7): Daily monitoring

Related Documents:

Rollback Procedures
Incident Response Plan
Observability Dashboard: /admin/observability

Pre-Launch Checklist (T-1 Hour)

Goal: Verify all systems are operational before announcing launch.

Timeframe: 1 hour before public announcement

1. Deployment Verification

[ ] Confirm deployment successful
Confirm /srv/apps/careconnect-web/current points at the intended release
Verify docker ps shows careconnect-web up on 127.0.0.1:3300
Check container logs for warnings or crash loops
[ ] Verify production URL accessible
Visit production URL in browser
Confirm homepage loads correctly
Confirm the site is served over HTTPS with no certificate errors

2. Health Check Validation

[ ] Test health check endpoint

curl -i https://careconnect.ing/api/v1/health

Expected: 200 OK status
Response includes: "status": "healthy"
Circuit breaker state: "CLOSED"
[ ] Check observability dashboard
Visit /admin/observability
Confirm dashboard loads without errors
Verify latest metrics are recent (<5 min old)
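
The health check expectations above can be expressed as a small validation function. This is a sketch under stated assumptions, not the project's actual tooling: the checklist confirms only that the response includes "status": "healthy" and a circuit breaker state of "CLOSED", so the `circuitBreaker` field name and the flat response shape used here are hypothetical.

```typescript
// Hypothetical shape of the /api/v1/health response body.
// "status" comes from the checklist; "circuitBreaker" is an assumed field name.
interface HealthBody {
  status?: string;
  circuitBreaker?: string;
}

// Returns a list of problems; an empty list means every expectation was met.
function evaluateHealth(httpStatus: number, rawBody: string): string[] {
  const problems: string[] = [];
  if (httpStatus !== 200) {
    problems.push("expected HTTP 200, got " + httpStatus);
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch (err) {
    problems.push("response body is not valid JSON");
    return problems;
  }
  if (parsed === null || typeof parsed !== "object") {
    problems.push("response body is not a JSON object");
    return problems;
  }

  const body = parsed as HealthBody;
  if (body.status !== "healthy") {
    problems.push('expected status "healthy", got ' + JSON.stringify(body.status));
  }
  if (body.circuitBreaker !== "CLOSED") {
    problems.push('expected circuit breaker "CLOSED", got ' + JSON.stringify(body.circuitBreaker));
  }
  return problems;
}
```

Pass in the HTTP status code and raw body from the health check request. Any non-empty result should be treated as a failed check.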

3. Critical User Journey Testing

[ ] Crisis Search Flow
Search for "suicide help"
Verify crisis banner appears
Check Distress Centre Kingston is first result
Confirm contact information is clickable
Target: Complete flow in <5 seconds
[ ] General Search Flow
Search for "food bank"
Verify results appear
Check top 3 results are relevant
Confirm service cards display correctly
Target: Results in <800ms
[ ] Mobile Responsiveness
Test on mobile device or browser mobile view
Verify search works
Check service cards are readable
Test tap targets (phone, email, website)

4. Monitoring Infrastructure

[ ] Confirm Slack alerts enabled
Check SLACK_WEBHOOK_URL is set
Test with sample alert (if safe to do so)
Verify Slack channel receives alerts
[ ] Verify Axiom metrics flowing
Check Axiom dashboard for recent data
Confirm metrics are being exported
Verify cron job is active
[ ] Check SLO dashboard status
Visit SLO Compliance Card on observability page
Confirm uptime tracking is active
Verify error budget shows correct baseline
Check latency p95 is being measured
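
For the sample Slack alert, a clearly labeled test message avoids confusion with a real incident. The sketch below only builds the payload and sanity-checks the webhook value. Slack incoming webhooks accept a JSON body with a `text` field; the function names and message wording are illustrative assumptions, not part of the project.

```typescript
// Slack incoming webhooks accept a JSON body with a "text" field.
interface SlackMessage {
  text: string;
}

// Builds a message that is obviously a test, so nobody treats it as a real alert.
function buildTestAlert(environment: string, sentAt: Date): SlackMessage {
  return {
    text:
      "[TEST] Launch monitoring alert check (" + environment + ") sent at " +
      sentAt.toISOString() + ". No action required.",
  };
}

// True if the value looks like a real webhook URL rather than a blank or a placeholder
// such as "YOUR_KEY_HERE". This is a heuristic, not proof that the webhook works.
function looksConfigured(webhookUrl: string | undefined): boolean {
  if (!webhookUrl) {
    return false;
  }
  return webhookUrl.indexOf("https://") === 0 && webhookUrl.indexOf("YOUR_") === -1;
}
```

The serialized message (JSON.stringify of the result) would be sent as an HTTP POST with Content-Type application/json to the URL stored in SLACK_WEBHOOK_URL. The check passes only once the message appears in the Slack channel.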

5. Security & Configuration

[ ] Verify environment variables
All required env vars set in production
No placeholder values (like "YOUR_KEY_HERE")
Secrets are not exposed in client
[ ] Test rate limiting
Confirm rate limiting is active (60 req/min)
Verify rate limit headers in response
Check rate limit doesn't block legitimate use
[ ] Check CSP headers
Inspect response headers in browser dev tools
Confirm Content-Security-Policy is present
Verify no CSP violations in console
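
The header checks in this section can be scripted against the response headers of any production request. The following is a minimal sketch: `Content-Security-Policy` is the standard header name, but the rate limit header names (`X-RateLimit-Limit`, `X-RateLimit-Remaining`) are a common convention assumed here and may differ from what this application actually emits.

```typescript
// Rate limit header names vary between frameworks; these two are assumptions.
const RATE_LIMIT_HEADER = "x-ratelimit-limit";
const RATE_REMAINING_HEADER = "x-ratelimit-remaining";

// HTTP header names are case-insensitive, so normalize before looking them up.
function lowerCaseKeys(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key in headers) {
    if (Object.prototype.hasOwnProperty.call(headers, key)) {
      out[key.toLowerCase()] = String(headers[key]);
    }
  }
  return out;
}

// Returns a list of problems; an empty list means all header checks passed.
function checkSecurityHeaders(headers: Record<string, string>, expectedLimit: number): string[] {
  const h = lowerCaseKeys(headers);
  const problems: string[] = [];

  if (!h["content-security-policy"]) {
    problems.push("Content-Security-Policy header is missing");
  }

  const limit = parseInt(h[RATE_LIMIT_HEADER] || "", 10);
  if (isNaN(limit)) {
    problems.push("rate limit header is missing or not numeric");
  } else if (limit !== expectedLimit) {
    problems.push("rate limit is " + limit + ", expected " + expectedLimit);
  }

  const remaining = parseInt(h[RATE_REMAINING_HEADER] || "", 10);
  if (isNaN(remaining)) {
    problems.push("remaining-requests header is missing or not numeric");
  } else if (remaining <= 0) {
    problems.push("rate limit is already exhausted for this client");
  }
  return problems;
}
```

For this checklist the expected limit would be 60, matching the 60 req/min figure above. Header presence does not replace the browser console check for CSP violations.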

6. Data Quality Spot Check

[ ] Verify top services are accurate
Check top 5 crisis services have correct phone numbers
Verify hours are up to date
Confirm addresses are correct
[ ] Test search quality
Search for "food"
Verify Partners in Mission Food Bank appears
Check verification badges are visible
Confirm descriptions are helpful

7. Error Monitoring Baseline

[ ] Review error logs
Check application logs for errors (should be minimal/none)
Review database connection logs
Verify no unexpected warnings
[ ] Confirm zero critical errors
No 500 errors in recent logs
No database connection failures
Circuit breaker has not opened

8. Team Readiness

[ ] On-call schedule confirmed
Know who is on-call for launch day
Contact information is current
Escalation path is clear
[ ] Communication channels ready
Slack channel for launch monitoring active
Key stakeholders have access
Communication templates ready (see Task 3.3)

Pre-Launch Sign-Off

If all checks pass:

✅ System is ready for launch
✅ Monitoring is active
✅ Team is prepared

If any checks fail:

🚫 DO NOT LAUNCH
Investigate and fix issues
Re-run failed checks
Document any deviations

Sign-Off:

Date/Time: ________
Checked By: ________
Status: GO / NO-GO

Launch Day: Critical Hours (0-4 Hours)

Goal: Detect and respond to issues immediately.

Monitoring Frequency: Every 30 minutes

Who: Primary on-call engineer + backup

Every 30 Minutes: Quick Check

Dashboard Review (5 min)

[ ] Visit /admin/observability
Uptime: Should be >99.5%
Error rate: Should be <0.5%
Latency p95: Should be <800ms
Circuit breaker: Should be CLOSED
[ ] Check SLO Compliance Card
All 3 metrics green (Uptime, Error Budget, Latency)
No violation alerts displayed
Error budget remaining >90%

Slack Monitoring (2 min)

[ ] Review Slack alerts
Check for any automated alerts
Review alert severity
Confirm alerts are legitimate (not false positives)
[ ] No critical alerts?
✅ Continue monitoring
If critical alert: Follow Incident Response Plan

User Feedback (3 min)

[ ] Check feedback channels
Review feedback widget submissions
Check email (feedback@careconnect.ing)
Monitor social media (if applicable)
[ ] Common issues?
Document any patterns
Escalate if multiple users report same issue

Hourly: Deep Dive (15 min)

Hours 1, 2, 3, 4 after launch

Performance Analysis

[ ] Review latency trends
Check p50, p95, p99 latencies
Compare to baseline (<800ms p95)
Look for degradation trends
[ ] Check error rate details
Review error types (4xx vs 5xx)
Identify most common errors
Verify errors are not increasing
[ ] Monitor circuit breaker
Confirm state is CLOSED
Check failure counts (should be low)
Review any state transitions
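
When the dashboard is unavailable or a figure needs to be double-checked, p50, p95, and p99 can be computed from raw latency samples. This sketch uses the nearest-rank method, which is one of several percentile definitions and may differ slightly from what the observability dashboard reports. The function names are illustrative.

```typescript
const P95_TARGET_MS = 800; // p95 baseline from this checklist

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("no latency samples");
  }
  if (p <= 0 || p > 100) {
    throw new Error("p must be in the range (0, 100]");
  }
  // Copy before sorting so the caller's array is not reordered.
  const sorted = samplesMs.slice().sort(function (a, b) { return a - b; });
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

interface LatencySummary {
  p50: number;
  p95: number;
  p99: number;
  p95WithinTarget: boolean;
}

function summarizeLatency(samplesMs: number[]): LatencySummary {
  const p95 = percentile(samplesMs, 95);
  return {
    p50: percentile(samplesMs, 50),
    p95: p95,
    p99: percentile(samplesMs, 99),
    p95WithinTarget: p95 < P95_TARGET_MS,
  };
}
```

With few samples, p95 and p99 collapse toward the maximum, so a single slow request can dominate. Compare trends across checks rather than reacting to one small sample.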

Traffic Analysis

[ ] Review search patterns
Top 10 search queries (aggregate only, privacy-safe)
Are users finding what they need?
Any unexpected search patterns?
[ ] Check traffic volume
Request rate (req/min)
Compare to load test baselines
Verify rate limiting is not blocking legitimate traffic

Data Quality

[ ] Spot-check search results
Test 3-5 common searches
Verify results are relevant
Check service cards display correctly
[ ] Review "no results" feedback
Check NotFoundFeedback submissions
Identify missing services or categories
Document for future data additions

Critical Hour Checklist

Hour 0-1 (First Hour):

[ ] 30 min check ✅
[ ] 60 min deep dive ✅
[ ] No critical issues detected ✅

Hour 1-2:

[ ] 90 min check ✅
[ ] 120 min deep dive ✅
[ ] No critical issues detected ✅

Hour 2-3:

[ ] 150 min check ✅
[ ] 180 min deep dive ✅
[ ] No critical issues detected ✅

Hour 3-4:

[ ] 210 min check ✅
[ ] 240 min deep dive ✅
[ ] No critical issues detected ✅
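
The cadence in the checklist above (a quick check on each half hour, a deep dive on each full hour) can be generated from the launch time, for example to create calendar reminders. This is an illustrative sketch; the type and function names are not part of the project.

```typescript
type CriticalCheckKind = "quick check" | "deep dive";

interface ScheduledCheck {
  minutesAfterLaunch: number;
  kind: CriticalCheckKind;
}

// Critical hours (0-4): one entry every 30 minutes, with the deep dive
// taking the slot on each full hour, as in the checklist above.
function criticalHoursSchedule(): ScheduledCheck[] {
  const schedule: ScheduledCheck[] = [];
  for (let m = 30; m <= 240; m += 30) {
    schedule.push({
      minutesAfterLaunch: m,
      kind: m % 60 === 0 ? "deep dive" : "quick check",
    });
  }
  return schedule;
}

// Wall-clock time at which a scheduled check is due.
function dueAt(launch: Date, check: ScheduledCheck): Date {
  return new Date(launch.getTime() + check.minutesAfterLaunch * 60000);
}
```

Calling criticalHoursSchedule() yields eight entries, from the 30-minute quick check through the 240-minute deep dive.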

Hour 4: First Milestone

After 4 hours of stable operation:

[ ] Review overall health
Total uptime: ____%
Total error rate: ____%
Average p95 latency: ____ ms
[ ] Document any issues
List all issues encountered (even resolved ones)
Note resolutions
Update runbooks if needed
[ ] Post status update
Internal: Update team on launch status
External: Consider posting on status page (if configured)

If all metrics green:

✅ Reduce monitoring frequency to every 2 hours
✅ Continue to hour 24

If issues detected:

⚠️ Keep the critical-hours cadence (quick check every 30 minutes, deep dive every hour)
⚠️ Investigate root causes
⚠️ Consider rollback if critical (see Rollback Procedures)

Launch Day: Extended Hours (4-24 Hours)

Goal: Ensure stability continues through first full day.

Monitoring Frequency: Every 2 hours

Who: Primary on-call engineer (backup on standby)

Every 2 Hours: Standard Check (10 min)

Hours: 6, 8, 10, 12, 14, 16, 18, 20, 22, 24

Dashboard Quick Review

[ ] Visit /admin/observability
Uptime still >99.5%?
Error rate still <0.5%?
Latency p95 still <800ms?
Circuit breaker still CLOSED?
[ ] SLO Compliance
All metrics green?
Error budget consumption normal (<10% used)?
No violation alerts?

Alert Review

[ ] Check Slack for alerts
Any new alerts since last check?
Review and triage
Acknowledge or escalate

Trend Analysis

[ ] Compare to previous check
Are metrics improving or degrading?
Any concerning trends?
Document significant changes

End of Day 1 (Hour 24)

After 24 hours of operation:

Final Day 1 Review

[ ] Calculate Day 1 metrics
Total uptime: ____%
Total error rate: ____%
Average p95 latency: ____ ms
Peak concurrent users: ____
Total searches performed: ____
[ ] Review all incidents
List all issues (resolved and ongoing)
Document resolutions
Create follow-up tasks if needed
[ ] Assess SLO compliance
Did we meet 99.5% uptime target?
Is error budget consumption acceptable?
Did we stay under 800ms p95 latency?
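
To judge whether error budget consumption is acceptable, it helps to convert the 99.5% uptime target into minutes of allowed downtime. The sketch below assumes the budget is measured as downtime against the uptime target over a fixed window. The checklist does not state the window length or whether the budget is request-based, so both are assumptions to confirm against the SLO configuration.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Downtime allowed in a window before the uptime target is missed.
function allowedDowntimeMinutes(uptimeTargetPercent: number, windowMinutes: number): number {
  return ((100 - uptimeTargetPercent) * windowMinutes) / 100;
}

// Share of the error budget consumed so far, as a percentage of the allowed downtime.
function budgetUsedPercent(
  downtimeMinutes: number,
  uptimeTargetPercent: number,
  windowMinutes: number
): number {
  const allowed = allowedDowntimeMinutes(uptimeTargetPercent, windowMinutes);
  if (allowed <= 0) {
    throw new Error("target leaves no error budget");
  }
  return (downtimeMinutes / allowed) * 100;
}

// Example windows: a single day and an assumed 30-day SLO period.
const oneDayBudget = allowedDowntimeMinutes(99.5, MINUTES_PER_DAY);         // about 7.2 minutes
const thirtyDayBudget = allowedDowntimeMinutes(99.5, 30 * MINUTES_PER_DAY); // 216 minutes
```

Under these assumptions, the "<10% used" guidance corresponds to roughly 21.6 minutes of downtime in a 30-day window, but only about 0.72 minutes in a 24-hour window, so the answer depends heavily on which window the SLO dashboard uses.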

Day 1 Status Report

Complete Day 1 summary:

Day 1 Launch Summary (24 hours)

Status: [STABLE / DEGRADED / CRITICAL]

Metrics:
- Uptime: ____%
- Error Rate: ____%
- Latency p95: _____ms
- Total Searches: _____

Issues Encountered:
1. [Issue description] - [RESOLVED / ONGOING]
2. [Issue description] - [RESOLVED / ONGOING]

Actions Taken:
- [Action 1]
- [Action 2]

Next Steps:
- [Next step 1]
- [Next step 2]

Overall Assessment: [1-2 sentence summary]

[ ] Share summary with team
Post in Slack
Email stakeholders
Update status page (if configured)

If Day 1 stable:

✅ Reduce monitoring to daily checks (see Post-Launch Week)
✅ Celebrate the successful launch! 🎉

If issues ongoing:

⚠️ Continue 2-hour monitoring
⚠️ Escalate to incident response if needed
⚠️ Consider rollback for critical issues

Post-Launch Week (Days 2-7)

Goal: Monitor for emerging issues and establish normal operations.

Monitoring Frequency: Daily

Who: On-call rotation

Daily Check (15 min)

Time: Once per day (morning recommended)

Dashboard Review

[ ] Visit /admin/observability
Review 24-hour metrics
Check SLO compliance
Verify circuit breaker stable
[ ] Compare to baseline
Is uptime meeting 99.5% target?
Is error rate under 0.5%?
Is latency p95 under 800ms?

Weekly Trends

[ ] Analyze week-to-date trends (Days 3, 5, 7)
Uptime trend: Stable / Improving / Degrading?
Error rate trend: Stable / Improving / Degrading?
Latency trend: Stable / Improving / Degrading?
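
The Stable / Improving / Degrading labels are easier to apply consistently with an explicit rule. The sketch below compares two readings and treats small changes as noise. The tolerance values are hypothetical and should be tuned to the noise actually seen during Week 1.

```typescript
type Trend = "Stable" | "Improving" | "Degrading";

// Compares two readings of the same metric.
// higherIsBetter: true for uptime, false for error rate and latency.
// tolerance: the largest change (in the metric's own unit) still counted as Stable.
function classifyTrend(
  previous: number,
  current: number,
  higherIsBetter: boolean,
  tolerance: number
): Trend {
  const delta = current - previous;
  if (Math.abs(delta) <= tolerance) {
    return "Stable";
  }
  const improved = higherIsBetter ? delta > 0 : delta < 0;
  return improved ? "Improving" : "Degrading";
}
```

For example, with an assumed tolerance of 25 ms, a p95 latency moving from 620 ms to 630 ms is Stable, while a move from 620 ms to 700 ms is Degrading.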

Search Quality

[ ] Review search patterns
Top 20 search queries (aggregate, privacy-safe)
Are users finding what they need?
Any common "no results" searches?
[ ] Check feedback
Review feedback widget submissions
Identify common themes
Prioritize improvements

Data Quality

[ ] Spot-check services
Verify hours are still accurate
Check contact information works
Test a few random service cards

Week 1 Milestones

Day 2:

[ ] Daily check complete
[ ] No critical issues

Day 3:

[ ] Daily check complete
[ ] First weekly trend analysis
[ ] No critical issues

Day 5:

[ ] Daily check complete
[ ] Mid-week trend analysis
[ ] No critical issues

Day 7:

[ ] Daily check complete
[ ] Full week trend analysis
[ ] Week 1 retrospective

Week 1 Retrospective (Day 7)

After 7 days of operation:

Metrics Summary

[ ] Calculate Week 1 averages
Average uptime: ____%
Average error rate: ____%
Average p95 latency: ____ ms
Total searches: ____
Unique users: ____ (estimate)
[ ] SLO Compliance
Met 99.5% uptime target? YES / NO
Stayed under 0.5% error rate? YES / NO
Met <800ms p95 latency? YES / NO

Issue Analysis

[ ] Review all issues
Total incidents: _
Critical (SEV-1): _
High (SEV-2): _
Medium (SEV-3): _
Low (SEV-4): _
[ ] Identify patterns
Common issue types
Time-of-day patterns
User-facing vs internal issues

Improvements Identified

[ ] Performance optimizations
List slow queries or endpoints
Identify caching opportunities
Document optimization tasks
[ ] Data quality improvements
Services to add/update
Categories needing expansion
Verification level upgrades
[ ] Feature requests
User feedback themes
Common feature requests
Prioritize for future versions

Week 1 Report

Complete comprehensive summary:

Week 1 Launch Report

Status: [SUCCESS / PARTIAL SUCCESS / NEEDS IMPROVEMENT]

Metrics:
- Uptime: ____%
- Error Rate: ____%
- Latency p95: _____ms
- Total Searches: _____
- User Engagement: [HIGH / MEDIUM / LOW]

Highlights:
- [Success 1]
- [Success 2]
- [Success 3]

Challenges:
- [Challenge 1] - [How resolved]
- [Challenge 2] - [How resolved]

SLO Compliance:
- Uptime: [MET / NOT MET]
- Error Budget: [MET / NOT MET]
- Latency: [MET / NOT MET]

Lessons Learned:
1. [Lesson 1]
2. [Lesson 2]
3. [Lesson 3]

Next Steps:
- [Action 1]
- [Action 2]
- [Action 3]

Overall Assessment:
[2-3 sentence summary of launch success]

[ ] Share Week 1 report
Email to team
Post in Slack
Archive for future reference

Transition to Normal Operations

After Week 1:

[ ] Update monitoring schedule
Transition from daily to weekly checks
Define normal monitoring cadence
Update on-call rotation
[ ] Review SLO targets
Are PROVISIONAL targets appropriate?
Should targets be adjusted based on Week 1 data?
Document any changes to lib/config/slo-targets.ts
[ ] Update runbooks
Incorporate lessons learned
Add new scenarios encountered
Update resolution procedures
[ ] Plan improvements
Create GitHub issues for identified improvements
Prioritize based on impact
Schedule for next version

Quick Reference

Critical Thresholds

Metric	Target	Warning	Critical
Uptime	>99.5%	<99.5%	<99.0%
Error Rate	<0.5%	>0.5%	>1.0%
Latency p95	<800ms	>800ms	>1500ms
Circuit Breaker	CLOSED	HALF_OPEN	OPEN
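
The thresholds in this table can be encoded so that every check applies them the same way. This is a sketch with illustrative names, not the contents of lib/config/slo-targets.ts. The table does not say how to treat a value exactly on a boundary (for example, uptime of exactly 99.5%), so the code assumes a boundary value is still OK.

```typescript
type Level = "OK" | "WARNING" | "CRITICAL";
type BreakerState = "CLOSED" | "HALF_OPEN" | "OPEN";

interface MetricsSnapshot {
  uptimePercent: number;
  errorRatePercent: number;
  latencyP95Ms: number;
  circuitBreaker: BreakerState;
}

function classifyUptime(percent: number): Level {
  if (percent < 99.0) return "CRITICAL";
  if (percent < 99.5) return "WARNING";
  return "OK"; // assumption: exactly 99.5% counts as meeting the target
}

function classifyErrorRate(percent: number): Level {
  if (percent > 1.0) return "CRITICAL";
  if (percent > 0.5) return "WARNING";
  return "OK";
}

function classifyLatency(p95Ms: number): Level {
  if (p95Ms > 1500) return "CRITICAL";
  if (p95Ms > 800) return "WARNING";
  return "OK";
}

function classifyBreaker(state: BreakerState): Level {
  if (state === "OPEN") return "CRITICAL";
  if (state === "HALF_OPEN") return "WARNING";
  return "OK";
}

// Overall level is the worst level across all four metrics.
function worstLevel(s: MetricsSnapshot): Level {
  const levels: Level[] = [
    classifyUptime(s.uptimePercent),
    classifyErrorRate(s.errorRatePercent),
    classifyLatency(s.latencyP95Ms),
    classifyBreaker(s.circuitBreaker),
  ];
  if (levels.indexOf("CRITICAL") !== -1) return "CRITICAL";
  if (levels.indexOf("WARNING") !== -1) return "WARNING";
  return "OK";
}
```

A snapshot with 99.7% uptime, 0.3% errors, 900 ms p95, and a CLOSED breaker evaluates to WARNING because of latency alone.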

Key URLs

Observability Dashboard: /admin/observability
Health Check: /api/v1/health
Metrics: /api/v1/metrics (dev/staging only)
Status Page: status.careconnect.ing (when configured)

Key Contacts

On-Call Engineer: [Name/Contact]
Backup Engineer: [Name/Contact]
Escalation: [Name/Contact]

Notes

Monitoring Philosophy:

Be proactive, not reactive
Document everything
Communicate early and often
Don't panic - follow procedures

When to Escalate:

Critical errors (SEV-1) immediately
High error rates (SEV-2) within 15 minutes
SLO violations within 30 minutes
Uncertain situations - when in doubt, escalate
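
The escalation windows above can be turned into concrete deadlines from the moment an issue is detected. This is an illustrative sketch; the trigger labels and function names are assumptions, and the "when in doubt, escalate" rule remains a human judgment that no timer replaces.

```typescript
type EscalationTrigger = "SEV-1" | "SEV-2" | "SLO_VIOLATION";

// Minutes allowed between detection and escalation, taken from the list above.
const ESCALATION_WINDOW_MINUTES: Record<EscalationTrigger, number> = {
  "SEV-1": 0, // immediately
  "SEV-2": 15,
  "SLO_VIOLATION": 30,
};

// Latest time by which the issue should have been escalated.
function escalateBy(trigger: EscalationTrigger, detectedAt: Date): Date {
  return new Date(detectedAt.getTime() + ESCALATION_WINDOW_MINUTES[trigger] * 60000);
}

// True once the escalation window has passed without the check being re-evaluated.
function isEscalationOverdue(trigger: EscalationTrigger, detectedAt: Date, now: Date): boolean {
  return now.getTime() > escalateBy(trigger, detectedAt).getTime();
}
```

For a SEV-2 detected at 09:00, escalateBy returns 09:15, and a SEV-1 is overdue as soon as any time has passed after detection.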

Remember:

Users depend on this platform in crisis situations
Fast detection and response save lives
Good monitoring enables continuous improvement

Next Review: After first launch or 2026-03-09