ADR 016: Performance Tracking and Circuit Breaker Pattern
Status
Accepted
Date
2026-01-25
Context
CareConnect relies on Supabase for critical data operations including search, service management, analytics, and authorization. As the platform scales, two key operational concerns have emerged:
1. Performance Visibility Gap
Problem: No systematic way to measure or track performance of critical operations
- Search latency unknown (keyword, vector, data loading)
- Database query performance not monitored
- API response times not tracked
- No baseline metrics for regression detection
- Performance degradation goes unnoticed until users complain
Impact:
- Can't identify performance bottlenecks
- No data to guide optimization efforts
- Risk of slow performance degrading user experience
- Difficult to validate that optimizations actually improve performance
2. Cascading Failure Risk
Problem: No protection against Supabase outages or degraded performance
- Every request waits for full database timeout (~30s)
- Failed operations retry indefinitely without backoff
- Single database failure can cascade across the entire application
- No graceful degradation when database is unavailable
- Offline sync hammers failing endpoints
Impact:
- Poor user experience during outages (long hangs instead of fast fails)
- Resource exhaustion from repeated failing requests
- Potential for cascading failures across dependent systems
- No fallback mechanism when database is unavailable
Prior Art
Existing Resilience Mechanisms:
- ADR-010: PWA offline fallback with IndexedDB caching (client-side only)
- Search has JSON fallback (but no automatic triggering)
- Rate limiting on API endpoints (protection from abuse, not internal failures)
Gap: No server-side resilience pattern for database failures
Decision
Implement two complementary operational improvements for v17.5+:
1. Performance Tracking System
A lightweight, opt-in performance instrumentation system:
Core Components:
lib/performance/tracker.ts- Wrapper around logger for tracking operationslib/performance/metrics.ts- In-memory aggregation (p50, p95, p99)- Environment variable gated:
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING
Instrumented Paths:
- Search operations (total, data load, keyword scoring, vector scoring)
- API routes (request-to-response latency)
- Database queries (by source: Supabase, IndexedDB, JSON fallback)
Design Principles:
- Non-invasive: <1ms overhead per operation
- Opt-in: Disabled by default, enabled per environment
- Development-first: In-memory storage only (no production overhead)
- Structured: Integrates with existing logger for consistency
2. Circuit Breaker Pattern
A resilience pattern that prevents cascading failures:
Core Components:
lib/resilience/circuit-breaker.ts- Generic circuit breaker implementationlib/resilience/supabase-breaker.ts- Supabase-specific wrapperlib/resilience/telemetry.ts- Event logging and monitoring
States:
- CLOSED (normal): Requests pass through
- OPEN (failing): Requests fast-fail, preventing cascading failures
- HALF_OPEN (testing): Allow limited requests to test recovery
Configuration (environment variables):
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3 # Failures before opening
CIRCUIT_BREAKER_TIMEOUT=30000 # ms before retry (OPEN → HALF_OPEN)
Protected Operations:
- Database queries (
lib/search/data.ts,lib/services.ts,lib/analytics.ts) - Service management (claim, update, fetch)
- Analytics tracking (non-critical, graceful degradation)
- Offline sync (respects circuit state, skips when open)
Design Principles:
- Fail-fast: Prevent long waits during outages
- Automatic recovery: Transitions to HALF_OPEN after timeout
- Fallback-aware: Integrates with existing JSON fallback in search
- Observable: Logs state transitions for debugging
Implementation
Phase 1: Core Infrastructure ✅
Performance Tracking:
// lib/performance/tracker.ts
await trackPerformance(
"search.total",
async () => {
// Operation here
},
{ queryLength: query.length }
)
Circuit Breaker:
// lib/resilience/supabase-breaker.ts
const result = await withCircuitBreaker(
async () => supabase.from("services").select("*"),
async () => loadFromJSONFallback() // Optional fallback
)
Environment Validation (Zod):
// lib/env.ts
CIRCUIT_BREAKER_ENABLED: z.string().transform(val => val !== 'false'),
CIRCUIT_BREAKER_FAILURE_THRESHOLD: z.string().pipe(z.number().int().positive()),
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING: z.string().transform(val => val === 'true'),
Phase 2: Integration ✅
Files Modified:
lib/search/index.ts- Performance tracking for search orchestrationlib/search/data.ts- Circuit breaker + performance tracking for data loadinglib/services.ts- Circuit breaker for service management (3 functions)lib/analytics.ts- Circuit breaker for analytics (2 functions, graceful degradation)lib/offline/sync.ts- Respects circuit breaker stateapp/api/v1/search/services/route.ts- Performance tracking for APIapp/api/v1/health/route.ts- NEW: Health check endpoint
Files Created:
lib/performance/tracker.ts(120 lines)lib/performance/metrics.ts(235 lines)lib/resilience/circuit-breaker.ts(265 lines)lib/resilience/supabase-breaker.ts(115 lines)lib/resilience/telemetry.ts(130 lines)app/api/v1/health/route.ts(130 lines)
Tests:
tests/lib/performance/tracker.test.ts(16 tests, 100% pass)tests/lib/resilience/circuit-breaker.test.ts(18 tests, 100% pass)
Phase 3: Load Testing ✅
k6 Infrastructure:
tests/load/smoke-test.k6.js- Basic connectivity (1 VU, 30s)tests/load/search-api.k6.js- Realistic load (10-50 VUs, ramp-up)tests/load/sustained-load.k6.js- Stability test (20 VUs, 30min)tests/load/spike-test.k6.js- Spike/stress test (0-100 VUs in 10s)
Documentation:
docs/testing/load-testing.md- Comprehensive guide with thresholds
npm Scripts:
npm run test:load # Main search API test
npm run test:load:smoke # Quick connectivity check
npm run test:load:sustained # Long-running stability
npm run test:load:spike # Spike/stress test
Consequences
Positive
Performance Tracking:
- ✅ Visibility: Clear metrics on operation latencies (p50, p95, p99)
- ✅ Regression Detection: Can establish baseline and catch degradation
- ✅ Optimization Guidance: Data-driven decisions on what to optimize
- ✅ Minimal Overhead: <1ms per operation, opt-in only
- ✅ Development-Friendly: Works locally without external services
Circuit Breaker:
- ✅ Fast Failures: Prevents 30s hangs during outages (fail in <1ms)
- ✅ Graceful Degradation: Automatic fallback to JSON in search
- ✅ Automatic Recovery: Self-healing via HALF_OPEN state
- ✅ Resource Protection: Prevents hammering failing endpoints
- ✅ Observable: State transitions logged for debugging
Operational:
- ✅ Health Check: New
/api/v1/healthendpoint for monitoring - ✅ Load Testing: k6 scripts provide baseline metrics
- ✅ Type Safety: All env vars validated via Zod
Negative
Performance Tracking:
- ⚠️ Memory Usage: In-memory metrics store (mitigated: max 1000 samples/operation, 10min retention, dev-only)
- ⚠️ Per-Process State: Metrics not shared across serverless instances (acceptable for dev/staging)
- ⚠️ Manual Analysis: No automatic alerting (future: integrate with Axiom/Sentry)
Circuit Breaker:
- ⚠️ False Positives: Could open on transient errors (mitigated: 50% failure rate threshold + 3 consecutive failures)
- ⚠️ Per-Process State: Each serverless instance has independent circuit state (acceptable trade-off)
- ⚠️ Coordination Gap: No distributed circuit state (future: could use Redis if needed)
- ⚠️ Auth Implications: Circuit breaker on auth queries could lock out users (mitigated: fail-closed security)
Operational:
- ⚠️ Configuration Complexity: 3 new environment variables to tune
- ⚠️ Incomplete Rollout: Only ~40% of Supabase calls protected initially (API routes need coverage)
Neutral
- k6 Dependency: Adds load testing tool to dev dependencies (~50MB)
- Code Complexity: +1000 lines of code, but well-tested and isolated
Alternatives Considered
1. Use Supabase Built-in Resilience
Considered: Rely on Supabase client library's retry/timeout mechanisms
Rejected Because:
- Supabase retries are for network failures, not database unavailability
- No configurable circuit breaker in Supabase client
- No fallback mechanism for read operations
- Doesn't prevent cascading failures
2. External APM Tool (Axiom, Datadog, New Relic)
Considered: Use external Application Performance Monitoring service
Rejected Because:
- Adds cost and external dependency
- Overkill for current scale
- Can be added later (performance tracking provides foundation)
- Our current logger already supports structured logging for future integration
Future: Once we have baseline metrics from our system, we can selectively send critical metrics to external APM
3. Redis-Based Circuit Breaker
Considered: Use Redis to share circuit breaker state across instances
Rejected Because:
- Adds infrastructure dependency (Redis)
- Overkill for current scale (serverless functions are short-lived)
- Per-instance circuit breaker is sufficient (fail-fast still works)
- Can be added later if coordination becomes critical
Trade-off Accepted: Each instance independently opens its circuit, but all instances benefit from fast failures
4. API Gateway Circuit Breaker (Vercel Edge Config)
Considered: Implement circuit breaker at API gateway level using Vercel Edge Config
Rejected Because:
- Tightly couples to Vercel platform
- Less flexible for fine-grained control
- Harder to test locally
- Our approach works everywhere (local, staging, any platform)
5. Prometheus + Grafana
Considered: Full metrics stack with time-series database
Rejected Because:
- Significant infrastructure overhead
- Overkill for current needs
- Can't run locally easily
- Our in-memory approach is sufficient for development
Future: Can export metrics to Prometheus if scale demands
Configuration
Environment Variables
Added to .env.example:
# Performance Tracking (v17.5+) - Development/Staging Only
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=false
# Circuit Breaker Configuration (v17.5+)
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3 # Failures before opening
CIRCUIT_BREAKER_TIMEOUT=30000 # ms before retry (30s)
Recommended Settings by Environment
Development:
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=true # Enable metrics
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
CIRCUIT_BREAKER_TIMEOUT=30000
Staging:
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=true # Enable metrics
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5 # Slightly higher threshold
CIRCUIT_BREAKER_TIMEOUT=60000 # 1min timeout
Production:
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=false # Disable in-memory metrics
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
CIRCUIT_BREAKER_TIMEOUT=30000
Monitoring and Observability
Health Check Endpoint
Endpoint: GET /api/v1/health
Response:
{
"status": "healthy" | "degraded" | "unhealthy",
"timestamp": "2026-01-25T...",
"version": "0.1.0",
"checks": {
"database": {
"status": "up" | "down" | "degraded",
"latencyMs": 45
},
"circuitBreaker": {
"enabled": true,
"state": "CLOSED" | "OPEN" | "HALF_OPEN",
"stats": {
"failureCount": 0,
"successCount": 125,
"failureRate": 0.0
}
},
"performance": { // Only when NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=true
"tracking": true,
"metrics": { /* p50, p95, p99 for all operations */ }
}
}
}
Status Codes:
200: Healthy or degraded503: Unhealthy (circuit open or database down)
Log Events
Circuit Breaker Events:
[INFO] Circuit breaker 'supabase' state transition { from: 'CLOSED', to: 'OPEN', failureCount: 3 }
[WARN] Circuit breaker 'supabase' HALF-OPEN { nextAttemptTime: ... }
[INFO] Circuit breaker 'supabase' CLOSED { successCount: 1 }
Performance Events (when tracking enabled):
[INFO] Performance: search.total { operation: 'search.total', durationMs: 245.32, queryLength: 15 }
[INFO] Performance: search.dataLoad { operation: 'search.dataLoad', durationMs: 42.18, source: 'supabase' }
Rollout Plan
Phase 1: Core Infrastructure (Completed ✅)
- [x] Implement circuit breaker core
- [x] Implement performance tracking
- [x] Add environment variable validation
- [x] Write unit tests (34 tests)
- [x] Create load testing infrastructure
Phase 2: Critical Path Integration (Completed ✅)
- [x] Protect search data loading
- [x] Protect service management functions
- [x] Protect analytics tracking
- [x] Add performance tracking to search
- [x] Add performance tracking to API routes
- [x] Create health check endpoint
Phase 3: Complete Rollout (In Progress ⚠️)
- [ ] Protect remaining API routes (POST, PUT, DELETE endpoints)
- [ ] Protect authorization checks (consider trade-offs)
- [ ] Add authentication to health check endpoint
- [ ] Add
/api/v1/metricsendpoint (dev/staging only) - [ ] Write integration tests
- [ ] Update CLAUDE.md with usage patterns
Phase 4: Validation & Monitoring (Pending)
- [ ] Run load tests and establish baseline metrics
- [ ] Monitor circuit breaker activations in staging
- [ ] Tune thresholds based on real-world data
- [ ] Document operational runbook
- [ ] Add alerting for circuit breaker open events
Success Metrics
Performance Tracking
Goal: Establish baseline performance metrics
- ✅ Track p50, p95, p99 latencies for search operations
- ⏳ Establish baseline: p95 < 800ms for search API
- ⏳ Detect 20%+ performance regressions
- ⏳ Guide optimization efforts with data
Circuit Breaker
Goal: Improve resilience and user experience during outages
- ✅ Fast-fail in <1ms when circuit is open (vs 30s timeout)
- ⏳ Automatic recovery within 30s-1min of service restoration
- ⏳ Zero cascading failures during Supabase outages
- ⏳ Graceful degradation: Search falls back to JSON when DB unavailable
Load Testing
Goal: Validate performance under load
- ✅ Infrastructure in place (k6 scripts)
- ⏳ Smoke test: 100% success rate, p95 < 1000ms
- ⏳ Search API test: <5% error rate, p95 < 800ms
- ⏳ Sustained test: Stable performance over 30min
- ⏳ Spike test: Graceful degradation, no crashes
Future Work
Short Term (v17.6)
- Complete Circuit Breaker Rollout: Protect all API routes and critical operations
- Integration Tests: Test circuit breaker with simulated database failures
- Operational Documentation: Create runbook for circuit breaker incidents
- Baseline Metrics: Run load tests and document baseline performance
Medium Term (v18.0)
- External Monitoring Integration: Send critical metrics to Axiom or Sentry
- Alerting: Configure alerts for circuit breaker open events
- Metrics API: Expose
/api/v1/metricsendpoint for dashboards - Dashboard: Build internal performance dashboard
Long Term (v19.0+)
- Distributed Circuit Breaker: Use Redis for cross-instance coordination (if needed)
- Advanced Metrics: Track business metrics (conversion rates, user journeys)
- Automatic Scaling: Use performance metrics to trigger autoscaling
- AI-Powered Anomaly Detection: Detect performance anomalies automatically
References
- Circuit Breaker Pattern: Martin Fowler
- Release It! (Michael Nygard): Patterns for production-ready software
- k6 Load Testing: k6.io/docs
- Supabase Resilience: Supabase Status
- Related ADRs:
- ADR-010: PWA Offline Fallback and Caching
- ADR-014: Database Index Optimization
- ADR-015: Non-Blocking E2E Tests
Appendix: Circuit Breaker State Machine
┌─────────┐
│ CLOSED │ (Normal operation)
└─────────┘
│
│ Failure threshold exceeded (3 failures OR 50% error rate)
▼
┌─────────┐
│ OPEN │ (Fast-fail, block requests)
└─────────┘
│
│ Timeout elapsed (30s default)
▼
┌─────────┐
│ HALF- │ (Testing recovery)
│ OPEN │
└─────────┘
│
├─ Success ───> CLOSED (Recovery confirmed)
│
└─ Failure ───> OPEN (Still failing, retry later)
Appendix: Performance Tracking Flow
User Request
│
▼
┌─────────────────────────────────────┐
│ trackPerformance('operation') START │
│ • Record start time │
│ • Execute operation │
│ • Record end time │
│ • Calculate duration │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ recordMetric() │
│ • Store in-memory (dev only) │
│ • Log with structured data │
│ • Prune old samples (>10min) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ getMetrics() │
│ • Calculate p50, p95, p99 │
│ • Aggregate by operation │
│ • Available via /api/v1/health │
└─────────────────────────────────────┘