v18.0: Production Observability & Operational Excellence
Version: 18.0 Date: 2026-01-30 Status: PLANNING Dependencies: v17.6 (Authorization Resilience Complete) Estimated Effort: 24-32 hours (4-5 days)
Executive Summary
Current State (v17.6)
CareConnect is production-ready with comprehensive resilience, security, and testing infrastructure:
✅ Core Platform:
- 196 curated social services (validated via
npm run audit:data) - Zero-knowledge hybrid search (local + server modes)
- WCAG 2.1 AA accessibility compliance
- 7-language internationalization (EN, FR, ZH-Hans, AR, PT, ES, PA)
- Offline-ready PWA with IndexedDB fallback
✅ Resilience & Performance:
- Circuit breaker pattern protecting database operations
- Performance tracking system (p50/p95/p99 metrics)
- k6 load testing infrastructure
- Tiered authorization with fail-safe strategy (ADR-017)
- Health check endpoint (
/api/v1/health)
✅ Security & Authorization:
- Role-Based Access Control (Owner, Admin, Editor, Viewer)
- Row-Level Security policies
- Multi-layered authorization (UI → Server → Database)
- XSS prevention, CSP headers, rate limiting
✅ Testing:
- 100 test files, 537 passing tests (95%+ coverage on critical paths)
- Unit, integration, E2E (Playwright), load (k6), accessibility (Axe)
- CI/CD with automated validation
Key Unknowns & Assumptions
Unknowns:
- Production traffic patterns (no real-world baseline yet)
- Database failure frequency in production (Supabase SLA: 99.9%)
- Geographic distribution of users (impacts CDN strategy)
- Partner organization adoption rate (dashboard usage patterns unknown)
Assumptions:
- User expects production deployment within 1-2 months
- Budget constraints favor open-source/free-tier solutions initially
- Vercel remains the deployment platform (impacts observability choices)
- Manual data curation continues (no automated scraping planned)
Partially Complete Work from v17.5/v17.6:
- ⚠️ Circuit breaker integration tests (3 failing, timeout issues)
- ⚠️ API routes lack circuit breaker protection (only 40% coverage)
- ⚠️ Metrics endpoint exists but lacks authentication
- ⚠️ No production monitoring/alerting infrastructure
- ⚠️ Performance baseline metrics not documented
Goal
Complete the operational excellence foundation by:
- Finishing v17.5+ rollout (circuit breaker coverage, integration tests, baseline metrics)
- Adding production-grade observability (monitoring, alerting, dashboards)
- Establishing SLOs/SLAs (define success criteria for production)
- Creating operational runbooks (incident response, troubleshooting guides)
This prepares the platform for production launch with confidence that issues can be detected, diagnosed, and resolved quickly.
Phased Implementation Plan
Phase 1: Complete Circuit Breaker Rollout (8-10 hours)
Goal: Achieve 100% circuit breaker coverage on all database operations and validate resilience via integration tests.
Priority: HIGH Risk: MEDIUM (touches critical paths, but infrastructure exists)
1.1 Protect Remaining API Routes (3 hours)
Current State: Only search data loading, service management, and analytics have circuit breaker protection (~40% of DB operations).
Remaining Routes to Protect:
/api/v1/notifications/subscribe- POST/api/v1/notifications/unsubscribe- POST/api/v1/analytics/search- POST/api/v1/feedback- POST, GET/api/v1/feedback/[id]- PATCH, DELETE/api/v1/services/[id]/update-request- POST/api/v1/services/[id]- PATCH, DELETE/api/v1/services- POST
Implementation Pattern:
// Before
const { data, error } = await supabase.from("table").select("*")
// After
const { data, error } = await withCircuitBreaker(
async () => supabase.from("table").select("*"),
async () => null // Fallback for non-critical reads (or omit for writes)
)
Deliverables:
- [ ] 8 API routes protected with circuit breaker
- [ ] Fallback strategies documented per route (fail-closed vs fail-open)
- [ ] Unit tests for each protected route (mock circuit open state)
Validation:
- Type-check passes:
npm run type-check - Build succeeds:
npm run build - Route tests pass:
npm test -- app/api
1.2 Fix Circuit Breaker Integration Tests (4 hours)
Current State: 3 integration tests failing due to timeout/state transition issues:
should log all state transitions- CircuitOpenError not caught properly- Likely timing-dependent tests with 30s timeouts
Root Cause Analysis:
- Tests likely depend on precise timing (30s circuit timeout)
- Mock database simulator may not properly trigger state transitions
- Insufficient wait time for HALF_OPEN → CLOSED transitions
Refactoring Approach:
- Reduce timeout in test environment (5s instead of 30s via env var)
- Use fake timers (Vitest
vi.useFakeTimers()) to control time progression - Improve database simulator to track call counts and simulate recovery
- Add explicit state assertions after each transition
Implementation:
// tests/integration/circuit-breaker-db.test.ts
import { vi } from "vitest"
describe("Circuit Breaker Integration", () => {
beforeEach(() => {
vi.useFakeTimers()
process.env.CIRCUIT_BREAKER_TIMEOUT = "5000" // 5s for tests
})
afterEach(() => {
vi.useRealTimers()
})
it("should transition CLOSED → OPEN → HALF_OPEN → CLOSED", async () => {
const breaker = getSupabaseBreaker()
// 1. Start in CLOSED state
expect(breaker.getState()).toBe("CLOSED")
// 2. Trigger 3 failures to open circuit
dbSimulator.simulateFailure(3)
await expect(() => searchServices("test")).rejects.toThrow()
await expect(() => searchServices("test")).rejects.toThrow()
await expect(() => searchServices("test")).rejects.toThrow()
// Circuit should be OPEN
expect(breaker.getState()).toBe("OPEN")
// 3. Fast-forward past timeout (5s)
vi.advanceTimersByTime(5001)
// State should transition to HALF_OPEN (next request will test recovery)
const result = await searchServices("test") // Should attempt recovery
// If DB is restored, should be CLOSED
dbSimulator.restore()
await searchServices("test")
expect(breaker.getState()).toBe("CLOSED")
})
})
Deliverables:
- [ ] All 9 integration tests passing
- [ ] Test duration reduced from 60s to <10s via fake timers
- [ ] Test flakiness eliminated (run 10x without failure)
- [ ] CI integration confirmed (non-blocking per ADR-015)
Validation:
- Integration tests pass:
npm test -- tests/integration/circuit-breaker-db.test.ts - No test flakiness:
for i in {1..10}; do npm test -- tests/integration/circuit-breaker-db.test.ts || break; done
1.3 Document Performance Baselines (2 hours)
Current State: Load testing infrastructure exists (k6 scripts) but no baseline metrics documented.
Objective: Establish quantitative baseline for future regression detection.
Tasks:
- Run all k6 load tests locally
npm run test:load:smoke # 1 VU, 30s
npm run test:load # 10-50 VUs, ramp-up
npm run test:load:sustained # 20 VUs, 30min
npm run test:load:spike # 0-100 VUs spike
- Capture metrics:
- Request throughput (req/s)
- Response times (p50, p95, p99)
- Error rate (%)
- Circuit breaker activations
-
Memory usage stability
-
Document in standardized format:
- Test environment specs (CPU, RAM, network, DB tier)
- Date/time of baseline
- Expected vs actual results
- Regression thresholds (e.g., p95 > +20% = regression)
Deliverable:
- [ ]
docs/testing/performance-baselines.mdcreated - [ ] All 4 load tests executed and documented
- [ ] Regression thresholds defined
- [ ] Next review date set (3 months)
Format:
# Performance Baseline Metrics
## Environment
- Date: 2026-01-30
- Version: v18.0
- Hardware: [Record actual specs]
- Database: Supabase Free Tier, US-East-1
## Smoke Test (Basic Connectivity)
- Duration: 30s
- VUs: 1
- Results:
- Throughput: 5 req/s
- p95 latency: 150ms
- p99 latency: 200ms
- Error rate: 0%
## Search API Load Test (Realistic Traffic)
- Duration: 5min
- VUs: 10-50 (ramp-up)
- Results:
- Throughput: 45 req/s (sustained)
- p95 latency: 380ms
- p99 latency: 850ms
- Error rate: 2%
## Thresholds for Regression Alerts
- p95 latency increase: >20% = WARNING, >50% = CRITICAL
- Error rate increase: >2% absolute = WARNING
- Circuit breaker false-opens: >0 = CRITICAL
Validation:
- All metrics captured with timestamps
- Regression thresholds match project SLOs (defined in Phase 4)
1.4 Secure Metrics Endpoint (1 hour)
Current State: /api/v1/metrics exists but lacks authentication (security risk in production).
Security Requirements:
- Development/Staging: Require authentication OR whitelist localhost
- Production: Require admin authentication + rate limiting
Implementation:
// app/api/v1/metrics/route.ts
import { createClient } from "@/lib/supabase"
import { assertAdminRole } from "@/lib/auth/authorization"
export async function GET(request: NextRequest) {
// Security check
const isDev = process.env.NODE_ENV === "development"
const isLocalhost = request.headers.get("host")?.startsWith("localhost")
if (!isDev && !isLocalhost) {
// Production/staging: require admin auth
const supabase = await createClient()
const {
data: { user },
error,
} = await supabase.auth.getUser()
if (!user || error) {
return new Response("Unauthorized", { status: 401 })
}
await assertAdminRole(supabase, user.id)
}
// Return metrics...
}
Deliverables:
- [ ] Authentication enforced on
/api/v1/metrics - [ ] Rate limiting added (30 req/min per IP)
- [ ] Admin-only access in production
- [ ] Documentation updated in CLAUDE.md
Validation:
- Unauthenticated request returns 401:
curl http://localhost:3000/api/v1/metrics - Authenticated request succeeds with valid admin token
- Rate limit enforced:
for i in {1..35}; do curl http://localhost:3000/api/v1/metrics; done
Phase 2: Production Monitoring Infrastructure (10-12 hours)
Goal: Establish automated monitoring and alerting for production health.
Priority: HIGH Risk: LOW (net-new features, no existing code changes)
2.1 Integrate with External APM (Axiom) (4 hours)
Rationale: In-memory performance tracking is development-only. Production needs persistent, queryable metrics.
Tool Selection: Axiom
- Pros: Free tier (500GB/month), Vercel integration, structured logging, fast querying
- Cons: Third-party dependency, requires API key management
- Alternatives Considered: Datadog (too expensive), New Relic (complex setup), Sentry (focused on errors, not metrics)
Implementation:
// lib/observability/axiom.ts
import { Axiom } from "@axiomhq/js"
const axiom = new Axiom({
token: process.env.AXIOM_TOKEN!,
orgId: process.env.AXIOM_ORG_ID!,
})
export async function sendMetrics(dataset: string, events: any[]) {
if (process.env.NODE_ENV !== "production") return // Only in production
try {
await axiom.ingest(dataset, events)
} catch (error) {
logger.error("Axiom ingestion failed", { error })
// Don't throw - metrics are non-critical
}
}
// Hook into existing performance tracker
export function configureProductionMetrics() {
if (process.env.NODE_ENV === "production") {
// Send metrics to Axiom every 60s
setInterval(async () => {
const metrics = getAllMetrics() // From lib/performance/metrics.ts
await sendMetrics("performance", metrics)
}, 60000)
}
}
Integration Points:
- Performance tracker → Axiom (every 60s batch)
- Circuit breaker events → Axiom (real-time)
- API errors → Axiom (real-time)
- Health check results → Axiom (every 5min)
Deliverables:
- [ ] Axiom SDK integrated (
@axiomhq/js) - [ ] Environment variables added (
AXIOM_TOKEN,AXIOM_ORG_ID,AXIOM_DATASET) - [ ] Metrics batching implemented (60s intervals)
- [ ] Circuit breaker events sent to Axiom
- [ ] Documentation:
docs/observability/axiom-setup.md
Validation:
- Metrics appear in Axiom dashboard within 2min of sending
- No errors in application logs during metric ingestion
- Batch size reasonable (<10KB per batch)
2.2 Create Observability Dashboard (4 hours)
Objective: Visualize real-time system health without leaving the application.
Approach: Build internal dashboard at /admin/observability using existing metrics + Axiom data.
Features:
- Circuit Breaker Status
- Current state (CLOSED/OPEN/HALF_OPEN)
- Failure count, success count, failure rate
- Time in current state
-
Historical state transitions (last 24h)
-
Performance Metrics
- Search latency (p50, p95, p99) - last 1h, 24h, 7d
- API response times by endpoint
- Database query latency
-
Error rate trends
-
System Health
- Database connectivity status
- Uptime percentage (last 7d, 30d)
- Active sessions (if trackable)
-
Cache hit rates (IndexedDB, service worker)
-
Recent Incidents
- Circuit breaker opens (last 24h)
- High error rate periods
- Slow query alerts
Tech Stack:
- Recharts (already in use for analytics)
- Radix UI components (tables, cards, badges)
- Server component with periodic refresh (60s)
Implementation:
// app/[locale]/admin/observability/page.tsx
import { getCircuitBreakerStats } from '@/lib/resilience/supabase-breaker'
import { getMetricsSummary } from '@/lib/performance/metrics'
import { queryAxiomMetrics } from '@/lib/observability/axiom'
export default async function ObservabilityPage() {
const cbStats = getCircuitBreakerStats()
const perfMetrics = getMetricsSummary()
const recentIncidents = await queryAxiomMetrics('incidents', '24h')
return (
<div className="space-y-6">
<CircuitBreakerCard stats={cbStats} />
<PerformanceCharts metrics={perfMetrics} />
<IncidentTimeline incidents={recentIncidents} />
</div>
)
}
Deliverables:
- [ ] Dashboard page created at
/admin/observability - [ ] Circuit breaker status card
- [ ] Performance charts (p50/p95/p99 over time)
- [ ] Recent incidents timeline
- [ ] Auto-refresh every 60s
- [ ] Admin-only access enforced
Validation:
- Dashboard loads in <2s
- Metrics update automatically every 60s
- Charts render correctly with real data
- Non-admin users get 403 Forbidden
2.3 Configure Alerting (2 hours)
Objective: Proactively notify on critical issues (circuit open, high error rate, slow queries).
Alert Channels:
- Email (using Supabase Auth + Resend)
- Slack (webhook for team notifications)
- Dashboard Banners (in-app alerts for admins)
Alert Rules: | Alert Type | Condition | Severity | Channel | Throttle | |------------|-----------|----------|---------|----------| | Circuit Breaker Open | State = OPEN | CRITICAL | Email + Slack | Immediate | | High Error Rate | Error rate >5% (5min window) | WARNING | Slack | 10min | | Slow Queries | p95 latency >2000ms (5min window) | WARNING | Slack | 15min | | Database Down | Health check fails 3x | CRITICAL | Email + Slack | Immediate | | Low Uptime | Uptime <99% (24h rolling) | WARNING | Email | Daily digest |
Implementation:
// lib/observability/alerts.ts
import { sendEmail } from "@/lib/email"
import { sendSlackNotification } from "@/lib/integrations/slack"
interface Alert {
type: "circuit_open" | "high_error_rate" | "slow_queries" | "database_down"
severity: "CRITICAL" | "WARNING"
message: string
metadata: Record<string, any>
}
export async function sendAlert(alert: Alert) {
const timestamp = new Date().toISOString()
// Log alert
logger.warn(`ALERT: ${alert.type}`, { alert, timestamp })
// Send to appropriate channels
if (alert.severity === "CRITICAL") {
await sendEmail({
to: process.env.ADMIN_EMAIL!,
subject: `🚨 CRITICAL: ${alert.message}`,
body: JSON.stringify(alert, null, 2),
})
}
await sendSlackNotification({
channel: "#kingston-care-alerts",
text: `${alert.severity === "CRITICAL" ? "🚨" : "⚠️"} ${alert.message}`,
metadata: alert.metadata,
})
}
// Alert throttling to prevent spam
const alertThrottle = new Map<string, number>()
export function shouldSendAlert(alertKey: string, throttleMs: number): boolean {
const lastSent = alertThrottle.get(alertKey)
const now = Date.now()
if (!lastSent || now - lastSent > throttleMs) {
alertThrottle.set(alertKey, now)
return true
}
return false
}
Integration:
- Circuit breaker telemetry calls
sendAlert()on state transitions - Health check endpoint triggers alerts on failures
- Scheduled job (Vercel Cron) runs periodic checks
Deliverables:
- [ ] Alert system implemented with throttling
- [ ] Slack webhook integration
- [ ] Email alerts via Resend
- [ ] Alert configuration documented
- [ ] Test alerts sent successfully
Validation:
- Simulate circuit open → receive email + Slack notification within 30s
- Verify throttling prevents alert spam (max 1 per 10min for warnings)
- Check alert metadata includes useful debugging info
2.4 Create Operational Runbooks (2 hours)
Objective: Document step-by-step incident response procedures for common failure scenarios.
Runbooks to Create:
1. Circuit Breaker Open
# Runbook: Circuit Breaker Open
## Symptoms
- `/api/v1/health` returns `503 Service Unavailable`
- Circuit breaker state = OPEN
- Users see "Service temporarily unavailable" errors
- Alert: "Circuit breaker 'supabase' is OPEN"
## Diagnosis
1. Check circuit breaker stats: `GET /api/v1/health`
2. Check Supabase status: https://status.supabase.com
3. Review recent logs for database errors
4. Check database performance in Supabase dashboard
## Resolution Steps
1. **If Supabase is down:**
- Wait for Supabase to recover (automatic)
- Circuit will auto-transition to HALF_OPEN after 30s
- Monitor recovery via health check
2. **If Supabase is up but circuit is open:**
- Possible database overload or query timeout
- Review slow query logs in Supabase
- Consider optimizing problematic queries
- Manually reset circuit (not recommended): restart application
3. **If circuit repeatedly opens:**
- Increase circuit breaker timeout (env var)
- Investigate root cause (query performance, indexes)
- Consider scaling database tier
## Prevention
- Monitor database performance metrics weekly
- Optimize slow queries identified in Supabase dashboard
- Ensure database indexes are up-to-date (run `npm run db:verify`)
## Escalation
- If circuit remains open >5min, escalate to senior engineer
- If data loss suspected, engage database team immediately
2. High Error Rate 3. Slow Query Performance 4. Database Migration Failures
Deliverables:
- [ ]
docs/runbooks/circuit-breaker-open.md - [ ]
docs/runbooks/high-error-rate.md - [ ]
docs/runbooks/slow-queries.md - [ ]
docs/runbooks/database-migration.md - [ ] Runbook index added to
docs/README.md
Validation:
- Runbooks follow consistent template
- Include realistic scenarios based on actual failure modes
- Reviewed by at least one other engineer
Phase 3: Service Level Objectives (4-6 hours)
Goal: Define measurable success criteria for production reliability.
Priority: MEDIUM Risk: LOW (documentation-only, no code changes)
3.1 Define SLOs (2 hours)
Service Level Objectives (SLOs):
Availability SLO:
- Target: 99.5% uptime (monthly)
- Measurement: Health check endpoint availability
- Error Budget: 216 minutes/month (30 days × 0.5%)
- Exclusions: Scheduled maintenance (announced 24h in advance)
Latency SLO:
- Target: p95 search latency <800ms
- Measurement:
/api/v1/search/servicesresponse time - Degraded State: p95 >800ms but <2000ms
- Outage: p95 >2000ms
Error Rate SLO:
- Target: <5% error rate (5xx responses)
- Measurement: HTTP status codes from all API routes
- Warning Threshold: 5-10% error rate
- Critical Threshold: >10% error rate
Data Freshness SLO:
- Target: Search results reflect database updates within 5 minutes
- Measurement: Time between service update and appearance in search
- Exclusions: JSON fallback mode (stale by design during outages)
Deliverable:
- [ ]
docs/observability/slo-definition.mdcreated - [ ] SLO targets aligned with user expectations
- [ ] Measurement methodology documented
- [ ] Error budget tracking plan defined
3.2 SLO Monitoring Dashboard (2 hours)
Objective: Visualize SLO compliance in real-time.
Dashboard Components:
- Uptime SLO: Current month uptime percentage, error budget remaining
- Latency SLO: p95 latency trend (last 7d), threshold line at 800ms
- Error Rate SLO: Error percentage (last 24h), threshold line at 5%
- Incidents: Count of SLO violations this month
Implementation:
- Add SLO tab to
/admin/observability - Query Axiom for historical SLO data
- Calculate error budget burn rate
- Alert when error budget <20% remaining
Deliverable:
- [ ] SLO dashboard tab added to observability page
- [ ] Error budget burn rate calculator
- [ ] Historical SLO compliance (last 3 months)
Validation:
- Dashboard accurately reflects health check data
- Error budget calculation matches manual computation
- Alerts trigger at 20% error budget remaining
3.3 Public Status Page (2 hours)
Objective: Transparent communication of system status to users and partners.
Approach: Use Statuspage.io (free tier) or self-hosted Upptime (GitHub Actions).
Recommendation: Upptime
- Pros: Free, open-source, GitHub-hosted, no third-party dependency
- Cons: Requires GitHub Actions (free for public repos)
- Setup: 30 minutes
Status Page Components:
- Current Status: Operational / Degraded / Outage
- System Components:
- Search API
- Partner Dashboard
- Database
- AI Features
- Uptime Percentage: Last 30d, 90d
- Recent Incidents: Last 10 incidents with timestamps and resolutions
- Scheduled Maintenance: Upcoming maintenance windows
Implementation:
# .upptimerc.yml
owner: careconnect
repo: upptime
sites:
- name: CareConnect
url: https://careconnect.ing
- name: Search API
url: https://careconnect.ing/api/v1/health
expectedStatusCodes:
- 200
- name: Partner Dashboard
url: https://careconnect.ing/dashboard
status-website:
cname: status.careconnect.ing
name: CareConnect Status
theme: night
Deliverable:
- [ ] Upptime configured in separate GitHub repo
- [ ] Status page deployed at
status.careconnect.ing(or subdomain) - [ ] Automated incident detection via health check
- [ ] Status badge added to main README.md
Validation:
- Status page loads in <2s
- Component statuses update every 5min
- Incident timeline shows historical data
- Downtime detection works (test by failing health check)
Phase 4: Operational Documentation (2-4 hours)
Goal: Comprehensive documentation for production operations and incident management.
Priority: LOW Risk: NONE (documentation-only)
4.1 Update CLAUDE.md (1 hour)
Additions:
- Observability & Monitoring section
- Axiom integration
- Dashboard usage
- Alert configuration
- SLO definitions
- Circuit breaker best practices
- When to adjust thresholds
- Interpreting circuit states
- Debugging circuit opens
- Load testing guidelines
- Running baselines
- Interpreting results
- Regression detection
Deliverable:
- [ ] CLAUDE.md updated with new observability sections
- [ ] Links to runbooks and SLO docs
4.2 Production Deployment Checklist (1 hour)
File: docs/deployment/production-checklist.md
Checklist Items:
- [ ] Environment variables configured in Vercel
- [ ] Supabase production database provisioned
- [ ] Axiom account set up and API token added
- [ ] Slack webhook configured for alerts
- [ ] Admin email set for critical alerts
- [ ] Status page deployed and linked
- [ ] DNS configured (status.careconnect.ing)
- [ ] SSL certificates valid
- [ ] Database backups enabled (Supabase automatic)
- [ ] Rate limiting tested in production
- [ ] Circuit breaker thresholds validated
- [ ] Load test baseline documented
- [ ] SLOs published on status page
- [ ] Runbooks reviewed by team
- [ ] Incident response roles assigned
Deliverable:
- [ ] Production deployment checklist created
- [ ] All items reviewed and validated before launch
4.3 Incident Response Plan (2 hours)
File: docs/observability/incident-response-plan.md
Sections:
- Incident Severity Levels:
- SEV1 (Critical): Complete outage, data loss risk, security breach
- SEV2 (High): Degraded performance, partial outage, circuit open >5min
- SEV3 (Medium): Minor performance degradation, non-critical errors
-
SEV4 (Low): Cosmetic issues, logging errors
-
Incident Response Roles:
- Incident Commander: Coordinates response, makes decisions
- Communications Lead: Updates status page, notifies stakeholders
- Technical Lead: Investigates and implements fixes
-
Scribe: Documents timeline and actions taken
-
Response Timeline:
- 0-5min: Acknowledge alert, assess severity, assign roles
- 5-15min: Initial investigation, update status page
- 15-30min: Implement mitigation, test fix
- 30-60min: Deploy fix, verify resolution
-
Post-incident: Write postmortem within 24h
-
Communication Templates:
- Initial incident notification
- Progress update (every 30min)
- Resolution announcement
- Postmortem summary
Deliverable:
- [ ] Incident response plan documented
- [ ] Response roles assigned
- [ ] Communication templates created
- [ ] Postmortem template included
Validation:
- Plan reviewed by all stakeholders
- Dry-run incident simulation conducted
Success Criteria
Phase 1: Circuit Breaker Rollout
- ✅ 100% of API routes protected with circuit breaker
- ✅ All 9 integration tests passing (no flakiness)
- ✅ Performance baselines documented
- ✅ Metrics endpoint secured with authentication
- ✅ No production regressions introduced
Phase 2: Monitoring Infrastructure
- ✅ Axiom integration live in production
- ✅ Observability dashboard functional
- ✅ Alerts firing correctly (tested via simulation)
- ✅ 4 operational runbooks published
- ✅ Team trained on alert response
Phase 3: SLOs
- ✅ SLOs defined and approved by stakeholders
- ✅ SLO monitoring dashboard live
- ✅ Public status page deployed
- ✅ Error budget tracking active
- ✅ SLO compliance >90% in first month
Phase 4: Documentation
- ✅ CLAUDE.md updated with observability sections
- ✅ Production deployment checklist complete
- ✅ Incident response plan approved
- ✅ All documentation reviewed and validated
Timeline & Milestones
Week 1 (Jan 30 - Feb 5)
Focus: Complete Phase 1 (Circuit Breaker Rollout)
- Day 1-2: Protect remaining API routes (1.1)
- Day 3-4: Fix integration tests (1.2)
- Day 5: Document performance baselines (1.3), secure metrics endpoint (1.4)
Milestone: Circuit breaker coverage 100%, all tests passing
Week 2 (Feb 6 - Feb 12)
Focus: Phase 2 (Monitoring Infrastructure)
- Day 1-2: Axiom integration (2.1)
- Day 3-4: Observability dashboard (2.2)
- Day 5: Alerting configuration (2.3), operational runbooks (2.4)
Milestone: Production monitoring live, alerts functional
Week 3 (Feb 13 - Feb 19)
Focus: Phase 3 (SLOs) + Phase 4 (Documentation)
- Day 1: Define SLOs (3.1)
- Day 2: SLO monitoring dashboard (3.2)
- Day 3: Public status page (3.3)
- Day 4-5: Update CLAUDE.md (4.1), deployment checklist (4.2), incident plan (4.3)
Milestone: SLOs published, documentation complete
Week 4 (Feb 20 - Feb 26)
Focus: Validation, Testing, Production Readiness
- Day 1-2: End-to-end validation of all phases
- Day 3: Load testing with production config
- Day 4: Dry-run incident simulation
- Day 5: Final review and sign-off
Milestone: v18.0 production-ready, launch approved
Dependencies & Blockers
Required (Must Have)
- ✅ v17.6 complete (authorization resilience)
- ✅ Vercel account with production deployment
- ⚠️ Axiom account (free tier) - User action required
- ⚠️ Slack workspace with webhook - User action required
- ⚠️ Domain for status page (status.careconnect.ing) - User action required
Optional (Nice to Have)
- ⏳ Supabase production tier (more generous limits, better performance)
- ⏳ Dedicated Sentry account (error tracking, distinct from Axiom)
- ⏳ PagerDuty integration (advanced on-call rotation)
Known Blockers
- None identified - All infrastructure is free-tier compatible
- Slack webhook is free (no paid account required)
- Axiom free tier is sufficient (500GB/month >> estimated 10GB/month)
Rollout & Rollback Strategy
Rollout Plan
Phase 1: Canary Release (10% of traffic)
- Deploy v18.0 to staging environment
- Run full test suite including load tests
- Validate circuit breaker protection on all routes
- Deploy to production with feature flags (observability disabled)
- Monitor for 24h with existing health checks
Phase 2: Enable Monitoring (50% of traffic)
- Enable Axiom integration (server-side only)
- Monitor metric ingestion for 48h
- Verify no performance degradation (<5ms overhead)
- Enable observability dashboard for admins only
Phase 3: Full Rollout (100% of traffic)
- Enable alerting (email + Slack)
- Publish status page publicly
- Monitor SLO compliance for 7d
- Announce production readiness
Rollback Strategy
Trigger Conditions:
- Performance regression >20% (p95 latency)
- New errors introduced (>2% error rate increase)
- Circuit breaker false-opens (>3 in 24h)
- Axiom integration causes application crashes
Rollback Steps:
- Immediate: Revert deployment via Vercel (1-click rollback)
- Within 5min: Disable Axiom integration (env var toggle)
- Within 10min: Notify stakeholders via status page
- Within 1h: Identify root cause, create hotfix
- Within 24h: Publish postmortem, plan re-deployment
Rollback Testing:
- Test rollback procedure in staging before production
- Ensure previous version (v17.6) remains deployable
- Verify data migrations are backward-compatible
Risk Assessment
Low Risk
- Operational runbooks: Documentation-only, no code impact
- SLO definitions: Measurement framework, doesn't change behavior
- Status page: Separate deployment, no production dependencies
Medium Risk
- Axiom integration: Third-party dependency, potential ingestion failures
- Mitigation: Fail-safe (log error, don't throw), test in staging
- Circuit breaker coverage: Touching critical API routes
- Mitigation: Comprehensive tests, gradual rollout, feature flags
- Alerting configuration: Risk of alert fatigue or missed critical alerts
- Mitigation: Throttling, severity levels, dry-run testing
High Risk
- Integration test refactoring: Timing-dependent tests can be flaky
- Mitigation: Use fake timers, reduce timeout, 10x run validation
- Performance overhead: New monitoring could slow down requests
- Mitigation: Async ingestion, batching, <5ms overhead target
Critical Dependencies
- Axiom uptime: If Axiom is down, metrics are lost (but app unaffected)
- Mitigation: Axiom has 99.9% SLA, non-critical path
- Slack webhook: Alert delivery depends on Slack availability
- Mitigation: Dual-channel alerts (email + Slack), log all alerts
Future Enhancements (Post-v18.0)
v18.1: Advanced Observability
- Custom Grafana dashboards (if Axiom query limits exceeded)
- Distributed tracing with OpenTelemetry
- User session replay for debugging (LogRocket, FullStory)
- Client-side error tracking (Sentry browser SDK)
v18.2: Intelligent Alerting
- Anomaly detection using ML (Axiom AI features)
- Predictive alerts (circuit will open in 5min based on trends)
- Alert aggregation (summarize 10 similar alerts into 1)
- On-call rotation via PagerDuty
v18.3: Performance Optimization
- CDN integration for static assets (Vercel Edge)
- Database query caching (Redis)
- Incremental static regeneration for service pages
- Edge Functions for geo-distributed search
v19.0: Multi-Region Deployment
- Supabase read replicas in multiple regions
- Geo-routing based on user location
- Cross-region circuit breaker coordination
- Global load balancing
Validation & Testing Strategy
Pre-Production Testing
1. Staging Environment Validation:
- [ ] Deploy v18.0 to staging (
staging.careconnect.ing) - [ ] Run full test suite:
npm test && npm run test:e2e && npm run test:a11y - [ ] Execute load tests:
npm run test:load:sustained(30min) - [ ] Verify circuit breaker triggers correctly (simulate DB failure)
- [ ] Test alert delivery (Slack + email)
- [ ] Validate Axiom metric ingestion (check dashboard)
2. Regression Testing:
- [ ] Compare load test results to v17.6 baselines
- [ ] Ensure p95 latency increase <5%
- [ ] Verify error rate unchanged
- [ ] Check bundle size increase <50KB
3. Security Testing:
- [ ] Verify metrics endpoint requires authentication
- [ ] Test rate limiting on observability endpoints
- [ ] Check CORS policies on new routes
- [ ] Scan for secret leakage in logs
4. Accessibility Testing:
- [ ] Run Axe audit on observability dashboard
- [ ] Verify keyboard navigation works
- [ ] Test screen reader compatibility
- [ ] Check color contrast ratios
Post-Production Validation
First 24 Hours:
- [ ] Monitor circuit breaker state (should remain CLOSED)
- [ ] Check Axiom ingestion (metrics appearing every 60s)
- [ ] Verify alerts don't fire spuriously
- [ ] Monitor error logs for new issues
First 7 Days:
- [ ] Validate SLO compliance (uptime >99.5%, p95 <800ms)
- [ ] Review incident count (should be 0 for SEV1/SEV2)
- [ ] Analyze error budget burn rate
- [ ] Collect user feedback on any performance changes
First 30 Days:
- [ ] Publish first monthly SLO report
- [ ] Conduct retrospective on v18.0 rollout
- [ ] Identify areas for optimization
- [ ] Plan v18.1 enhancements based on production data
Stakeholder Communication
Weekly Progress Updates
Format: Slack message + email summary
Template:
📊 v18.0 Progress Update - Week [N]
Completed This Week:
- ✅ [Phase/Task completed]
- ✅ [Phase/Task completed]
In Progress:
- 🔄 [Current work]
- 🔄 [Current work]
Blockers:
- ⚠️ [Blocker description] - ETA: [date]
Next Week:
- 📅 [Planned work]
- 📅 [Planned work]
Overall Progress: [X]% complete
On Track for Launch: [Yes/No/At Risk]
Launch Announcement
Channels:
- Status page announcement
- Email to partner organizations
- Social media (if applicable)
- Internal team Slack
Key Messages:
- New observability features improve reliability
- Transparent uptime tracking via status page
- Faster incident resolution with automated monitoring
- No user-facing changes (backend enhancements)
References & Resources
Related ADRs
- ADR-016: Performance Tracking and Circuit Breaker
- ADR-017: Authorization Resilience Strategy
- ADR-015: Non-Blocking E2E Tests
External Documentation
- Axiom Documentation
- Upptime Setup Guide
- Vercel Cron Jobs
- Circuit Breaker Pattern (Martin Fowler)
- Google SRE Book: SLOs
Tools & Libraries
@axiomhq/js- Axiom JavaScript SDK@upstash/ratelimit- Rate limiting (already installed)recharts- Dashboard charts (already installed)vitest- Testing framework (already installed)
Appendix A: Environment Variables
New Variables for v18.0:
# Axiom Integration (Production Monitoring)
AXIOM_TOKEN=xaat-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AXIOM_ORG_ID=careconnect-xxxxx
AXIOM_DATASET=performance
# Alerting
SLACK_WEBHOOK_URL=<your-slack-webhook-url>
ADMIN_EMAIL=admin@careconnect.ing
# Circuit Breaker (Production Tuning)
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5 # Higher in prod (less sensitive)
CIRCUIT_BREAKER_TIMEOUT=60000 # 1min timeout (vs 30s in dev)
# Performance Tracking (Production)
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=false # Disabled (use Axiom instead)
# Status Page
NEXT_PUBLIC_STATUS_PAGE_URL=https://status.careconnect.ing
Updated .env.example:
# v18.0: Production Observability
# Axiom (Production Metrics)
AXIOM_TOKEN=
AXIOM_ORG_ID=
AXIOM_DATASET=performance
# Alerts
SLACK_WEBHOOK_URL=
ADMIN_EMAIL=
# Circuit Breaker
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_TIMEOUT=60000
Appendix B: Cost Analysis
Estimated Monthly Costs (Production)
| Service | Tier | Cost | Usage Estimate |
|---|---|---|---|
| Vercel | Pro | $20/month | Deployment platform |
| Supabase | Free | $0 | <500MB DB, <2GB bandwidth |
| Axiom | Free | $0 | <500GB ingestion (~10GB actual) |
| Slack | Free | $0 | Webhook only (no paid features) |
| Upptime | Free | $0 | GitHub Actions (public repo) |
| Resend | Free | $0 | <100 emails/month (alerts) |
Total: $20/month (Vercel only)
Scaling Considerations:
- If traffic >100k requests/month: Upgrade Supabase to Pro ($25/month)
- If metrics >500GB/month: Upgrade Axiom to Team ($25/month)
- Current estimates: 10k requests/month, 10GB metrics/month → free tier sufficient
Appendix C: Performance Budget
Target Performance Characteristics:
| Metric | Current (v17.6) | Target (v18.0) | Max Acceptable |
|---|---|---|---|
| Bundle Size (gzip) | 245KB | <260KB | 280KB |
| Initial Load Time | 1.2s | <1.4s | 1.8s |
| Search API (p95) | 380ms | <400ms | 500ms |
| Memory Usage (client) | 45MB | <50MB | 60MB |
| Memory Usage (server) | 120MB | <140MB | 160MB |
New Monitoring Overhead Targets:
- Axiom ingestion: <2ms per batch (async)
- Circuit breaker check: <0.1ms per request
- Performance tracking: <0.5ms per operation (dev only)
- Total overhead: <5ms per request
End of Implementation Plan
Version History:
- v1.0 (2026-01-30): Initial plan created
- Next review: After Phase 1 completion (est. 2026-02-05)