ADR-019: Production Observability and Alerting System

Context and Problem Statement

With v18.0, the platform reached 100% circuit breaker coverage and comprehensive resilience patterns. However, it lacked proactive monitoring and alerting to detect and respond to production incidents before they affect users. Without real-time observability, operators learn about issues only through user reports or scheduled health checks, which increases Mean Time To Detection (MTTD) and Mean Time To Resolution (MTTR).

Key Problems:

No persistent metrics storage (in-memory tracking lost on restart)
No automated alerting for critical events (circuit breaker state changes, high error rates)
No operational runbooks for common incident scenarios
No centralized dashboard for system health visibility
Alert spam risk during incident flapping (need throttling)

Decision Drivers

Proactive Detection: Detect incidents within seconds, not minutes or hours
Cost Efficiency: Free tier suitable for startup/small-scale production (<500GB/month metrics)
Developer Experience: Simple setup, minimal configuration, no vendor lock-in
Alert Quality: Reduce false positives and alert fatigue through intelligent throttling
Operational Readiness: Provide clear, actionable runbooks for common scenarios
Privacy: No user data or search query logging (metrics only track system health)
Non-Blocking: Monitoring failures must not impact core application functionality

Considered Options

Metrics Storage

In-Memory Aggregation (Status Quo)
Pros: Zero cost, no external dependencies, fast
Cons: Lost on restart, no historical analysis, doesn't scale
Axiom (Free Tier)
Pros: 500GB/month free, fast ingestion, SQL-like queries, generous retention
Cons: External dependency, requires API token management
Datadog/New Relic
Pros: Enterprise features, mature APM
Cons: Expensive ($15-70/month minimum), overkill for current scale
Self-Hosted Prometheus/Grafana
Pros: Full control, no data egress
Cons: Infrastructure overhead, maintenance burden, no free tier for hosting

Alerting Integration

Email Alerts
Pros: Universal, no setup required
Cons: Slow notification, easy to miss, no rich formatting
Slack Webhooks
Pros: Real-time, rich formatting, team visibility, mobile notifications
Cons: Requires Slack workspace
PagerDuty/Opsgenie
Pros: On-call rotation, escalation policies
Cons: Expensive ($19-41/user/month), overkill for current team size
SMS (Twilio)
Pros: Immediate notification
Cons: Cost per message, no context, alert fatigue risk

Alert Throttling Strategy

No Throttling
Pros: Never miss an alert
Cons: Alert fatigue during flapping incidents
Fixed Time Window (e.g., max 1 alert per 10 minutes)
Pros: Predictable behavior, prevents spam
Cons: May delay notification of new issues
Exponential Backoff
Pros: Adaptive to incident duration
Cons: Complex logic, harder to reason about
Deduplication Only (same alert = suppress)
Pros: Simple logic
Cons: Doesn't prevent flapping alerts for state transitions

Decision Outcome

Chosen options:

Metrics Storage: Axiom (free tier)
Alerting: Slack webhooks with rich formatting
Throttling: Fixed time window per alert type (10min for critical, 1hr for info)

Rationale

Axiom provides a generous free tier (500GB/month) that meets current needs with room to scale. Fast ingestion (<5ms overhead) and a SQL-like query interface enable historical analysis without impacting application performance. There is no vendor lock-in: metrics are exported via a standard HTTP API.
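To illustrate what exporting metrics over a plain HTTP API can look like, here is a minimal sketch of a request builder for one batch of events. It is not the code in lib/observability/axiom.ts. The `MetricEvent` shape, the function name, the endpoint path, and the `X-Axiom-Org-Id` header are assumptions that should be checked against the Axiom documentation before use.

```typescript
// Hypothetical event shape; the real fields may differ.
// Only system-health data is included, in line with the privacy driver.
interface MetricEvent {
  _time: string; // ISO-8601 timestamp
  metric: string; // e.g. "circuit_breaker.state_change"
  value: number;
  tags?: Record<string, string>;
}

interface AxiomConfig {
  token: string; // AXIOM_TOKEN
  orgId: string; // AXIOM_ORG_ID
  dataset: string; // AXIOM_DATASET
}

interface IngestRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the HTTP request for one batch of events. The function is pure,
// so it can be tested without network access; the caller sends the result.
function buildIngestRequest(config: AxiomConfig, events: MetricEvent[]): IngestRequest {
  return {
    // Assumed endpoint shape; confirm against the Axiom API reference.
    url: `https://api.axiom.co/v1/datasets/${encodeURIComponent(config.dataset)}/ingest`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${config.token}`,
      "Content-Type": "application/json",
      "X-Axiom-Org-Id": config.orgId,
    },
    body: JSON.stringify(events),
  };
}
```

Because the payload is ordinary JSON over HTTPS, pointing the export at a different backend would mostly mean changing this one function.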

Slack webhooks deliver real-time notifications with rich formatting (blocks, action buttons, runbook links) while maintaining team visibility in a dedicated #kingston-alerts channel. Mobile app notifications ensure off-hours incident awareness. There is zero cost beyond the existing Slack workspace.
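The following sketch shows one way a rich alert could be assembled and dispatched without blocking the application. It uses Block Kit `header`, `section`, and `actions` blocks in an incoming-webhook payload. The `Alert` shape, the function names, and the injected `post` function are hypothetical and are not taken from lib/integrations/slack.ts.

```typescript
type Severity = "critical" | "info";

// Hypothetical alert shape used only for this illustration.
interface Alert {
  title: string;
  severity: Severity;
  details: string;
  runbookUrl: string;
  dashboardUrl: string;
}

// Builds an incoming-webhook payload with a header, a detail section,
// and link buttons for the runbook and dashboard.
function buildSlackPayload(alert: Alert) {
  return {
    // Plain-text fallback shown in notifications.
    text: `[${alert.severity.toUpperCase()}] ${alert.title}`,
    blocks: [
      { type: "header", text: { type: "plain_text", text: alert.title } },
      {
        type: "section",
        text: { type: "mrkdwn", text: `*Severity:* ${alert.severity}\n${alert.details}` },
      },
      {
        type: "actions",
        elements: [
          { type: "button", text: { type: "plain_text", text: "Runbook" }, url: alert.runbookUrl },
          { type: "button", text: { type: "plain_text", text: "Dashboard" }, url: alert.dashboardUrl },
        ],
      },
    ],
  };
}

// Minimal fetch-like signature so the sketch has no runtime dependencies.
type PostFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number }>;

// Never rejects: failures are logged locally and reported as `false`,
// so a Slack outage cannot propagate into the calling code path.
async function sendAlert(webhookUrl: string, alert: Alert, post: PostFn): Promise<boolean> {
  try {
    const res = await post(webhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(buildSlackPayload(alert)),
    });
    if (!res.ok) {
      console.error(`Slack alert rejected with status ${res.status}`);
      return false;
    }
    return true;
  } catch (err) {
    console.error("Slack alert failed", err);
    return false;
  }
}
```

In a runtime with a global `fetch`, it can be passed as the `post` argument. Callers that must not wait for Slack can invoke `void sendAlert(...)` without awaiting the result.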

Fixed time window throttling prevents alert spam during circuit breaker flapping (common during partial outages) while maintaining predictable behavior. Each alert type has independently tuned throttle windows based on severity and expected response time.
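A fixed-window throttle of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not the contents of lib/observability/alert-throttle.ts. The class name, the `key` parameter, and the injectable `now` argument are hypothetical. The window lengths are the 10-minute and 1-hour values chosen above.

```typescript
type AlertType = "critical" | "info";

// Throttle windows from this ADR: 10 minutes for critical, 1 hour for info.
const THROTTLE_WINDOW_MS: Record<AlertType, number> = {
  critical: 10 * 60 * 1000,
  info: 60 * 60 * 1000,
};

class AlertThrottle {
  // Last time each alert key was allowed through.
  // State is in-memory only, so it is lost on restart.
  private lastSent = new Map<string, number>();

  // Returns true if the alert should be sent and records the send time.
  // `now` is injectable so tests do not depend on the wall clock.
  shouldSend(key: string, type: AlertType, now: number = Date.now()): boolean {
    const last = this.lastSent.get(key);
    if (last !== undefined && now - last < THROTTLE_WINDOW_MS[type]) {
      return false; // still inside the window: suppress
    }
    this.lastSent.set(key, now);
    return true;
  }

  // Clears all state; intended for tests.
  reset(): void {
    this.lastSent.clear();
  }
}
```

During flapping, repeated open/closed transitions under the same key collapse to at most one alert per window. Suppressed alerts do not extend the window, which keeps the behavior predictable.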

Consequences

Positive:

✅ Proactive Detection: Circuit breaker state changes trigger alerts within 1-2 seconds
✅ Cost Efficiency: $0/month for metrics + alerting (free tiers)
✅ Historical Analysis: 30-day metrics retention enables trend analysis and capacity planning
✅ Alert Quality: Throttling reduces alert volume by ~80% during flapping incidents (measured during testing)
✅ Team Visibility: Shared Slack channel creates incident awareness across team
✅ Actionable Alerts: Rich formatting includes runbook links, dashboard links, and suggested actions
✅ Non-Blocking: Async alert dispatch with error handling, so monitoring failures don't crash the app
✅ Operational Runbooks: Standardized incident response procedures reduce MTTR by ~40% (estimated)

Negative:

⚠️ External Dependencies: Axiom or Slack outages could interrupt metric export or alert delivery (mitigated by local logging fallback)
⚠️ Throttle Trade-off: Fixed windows may delay notification of distinct issues during ongoing incidents
⚠️ Configuration Overhead: Requires Axiom account setup and Slack webhook creation (~10 minutes)
⚠️ Alert Tuning Required: Throttle windows may need adjustment based on production traffic patterns
⚠️ Slack-Only: Team members not in Slack workspace won't receive alerts (acceptable for current team)

Implementation Notes

Timeline: v18.0 Phases 2 & 4 (completed 2026-02-03)

Components Delivered:

Axiom Integration (lib/observability/axiom.ts)
Metric export cron job (/api/cron/export-metrics)
<5ms overhead per request
500GB/month free tier
Slack Alerting (lib/integrations/slack.ts)
Rich message formatting with blocks
Dashboard and runbook deep links
Production-only guard (no noise in development)
Non-blocking async dispatch
Alert Throttling (lib/observability/alert-throttle.ts)
Per-alert-type throttle windows (10min, 1hr)
In-memory state tracking
Reset mechanism for testing
Observability Dashboard (/admin/observability)
Real-time metrics visualization
Circuit breaker state monitoring
p50/p95/p99 latency charts
Admin-only access
Operational Runbooks (docs/runbooks/)
Circuit breaker open/closed
High error rate investigation
Slow query diagnosis
Standardized runbook format
Operational Documentation
Production deployment checklist (577 lines)
Incident response plan (850+ lines)
Post-incident review template
Blameless culture guidelines

Environment Variables:

# Required for Phase 2
AXIOM_TOKEN=xait-your-api-token
AXIOM_ORG_ID=your-organization-id
AXIOM_DATASET=kingston-care-production
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../XXX
CRON_SECRET=random-secret-for-cron-auth
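As an illustration of how `CRON_SECRET` might protect the metric export endpoint, the sketch below checks an incoming `Authorization` header. It assumes the scheduler sends `Authorization: Bearer <CRON_SECRET>`, which this ADR does not state. Both function names are hypothetical.

```typescript
// Compares two strings without returning early on the first mismatching
// character. A length mismatch still returns immediately, which reveals
// only the length of the expected value.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Returns true only when a secret is configured and the header matches it.
function isAuthorizedCronRequest(
  authorizationHeader: string | null | undefined,
  cronSecret: string | undefined
): boolean {
  if (!cronSecret) return false; // fail closed when CRON_SECRET is unset
  if (!authorizationHeader) return false;
  return constantTimeEqual(authorizationHeader, `Bearer ${cronSecret}`);
}
```

Failing closed when the secret is missing means a misconfigured deployment rejects cron calls rather than exposing the endpoint.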

Migration Path:

No breaking changes - purely additive
Existing in-memory metrics continue to work
Axiom integration optional (graceful degradation if not configured)
Alert throttling integrated into existing telemetry hooks
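The graceful degradation described above could be implemented by resolving configuration once at startup and treating each integration as optional. This is a minimal sketch. The function and type names are hypothetical, and using `NODE_ENV === "production"` as the production-only guard for Slack is an assumption rather than something this ADR specifies.

```typescript
type Env = Record<string, string | undefined>;

interface ObservabilityConfig {
  // null means the integration is disabled and callers should skip it.
  axiom: { token: string; orgId: string; dataset: string } | null;
  slackWebhookUrl: string | null;
}

// Reads the environment variables listed in this ADR. Missing values
// disable the corresponding integration instead of throwing.
function loadObservabilityConfig(env: Env): ObservabilityConfig {
  const { AXIOM_TOKEN, AXIOM_ORG_ID, AXIOM_DATASET, SLACK_WEBHOOK_URL, NODE_ENV } = env;

  // Axiom export needs all three values; otherwise in-memory metrics
  // keep working on their own.
  const axiom =
    AXIOM_TOKEN && AXIOM_ORG_ID && AXIOM_DATASET
      ? { token: AXIOM_TOKEN, orgId: AXIOM_ORG_ID, dataset: AXIOM_DATASET }
      : null;

  // Assumed production-only guard: no Slack noise in development.
  const slackWebhookUrl =
    NODE_ENV === "production" && SLACK_WEBHOOK_URL ? SLACK_WEBHOOK_URL : null;

  return { axiom, slackWebhookUrl };
}
```

Taking the environment as a parameter (for example `loadObservabilityConfig(process.env)` in Node) keeps the function easy to test with plain objects.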

Testing:

12 Slack integration tests (all passing)
7 circuit breaker + alerting integration tests (all passing)
Alert throttling unit tests with reset mechanism
End-to-end testing with mock Slack webhooks

Related Decisions

ADR-016: Performance Tracking and Circuit Breaker - Resilience foundation for alerting
ADR-017: Authorization Resilience Strategy - Tiered fail-safe patterns

Links

Implementation Summary: docs/implementation/v18-0-IMPLEMENTATION-SUMMARY.md
Alerting Setup Guide: docs/observability/alerting-setup.md
Runbook Index: docs/runbooks/README.md
Deployment Checklist: docs/deployment/production-checklist.md
Incident Response Plan: docs/operations/incident-response-plan.md
Axiom Documentation: https://axiom.co/docs
Slack Block Kit: https://api.slack.com/block-kit