ADR 016: Performance Tracking and Circuit Breaker Pattern

Status

Accepted

Date

2026-01-25

Context

CareConnect relies on Supabase for critical data operations including search, service management, analytics, and authorization. As the platform scales, two key operational concerns have emerged:

1. Performance Visibility Gap

Problem: No systematic way to measure or track performance of critical operations

Search latency unknown (keyword, vector, data loading)
Database query performance not monitored
API response times not tracked
No baseline metrics for regression detection
Performance degradation goes unnoticed until users complain

Impact:

Can't identify performance bottlenecks
No data to guide optimization efforts
Risk of slow performance degrading user experience
Difficult to validate that optimizations actually improve performance

2. Cascading Failure Risk

Problem: No protection against Supabase outages or degraded performance

Every request waits for full database timeout (~30s)
Failed operations retry indefinitely without backoff
Single database failure can cascade across the entire application
No graceful degradation when database is unavailable
Offline sync hammers failing endpoints

Impact:

Poor user experience during outages (long hangs instead of fast fails)
Resource exhaustion from repeated failing requests
Potential for cascading failures across dependent systems
No fallback mechanism when database is unavailable

Prior Art

Existing Resilience Mechanisms:

ADR-010: PWA offline fallback with IndexedDB caching (client-side only)
Search has JSON fallback (but no automatic triggering)
Rate limiting on API endpoints (protection from abuse, not internal failures)

Gap: No server-side resilience pattern for database failures

Decision

Implement two complementary operational improvements for v17.5+:

1. Performance Tracking System

A lightweight, opt-in performance instrumentation system:

Core Components:

lib/performance/tracker.ts - Wrapper around logger for tracking operations
lib/performance/metrics.ts - In-memory aggregation (p50, p95, p99)
Environment variable gated: NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING

Instrumented Paths:

Search operations (total, data load, keyword scoring, vector scoring)
API routes (request-to-response latency)
Database queries (by source: Supabase, IndexedDB, JSON fallback)

Design Principles:

Non-invasive: <1ms overhead per operation
Opt-in: Disabled by default, enabled per environment
Development-first: In-memory storage only (no production overhead)
Structured: Integrates with existing logger for consistency

2. Circuit Breaker Pattern

A resilience pattern that prevents cascading failures:

Core Components:

lib/resilience/circuit-breaker.ts - Generic circuit breaker implementation
lib/resilience/supabase-breaker.ts - Supabase-specific wrapper
lib/resilience/telemetry.ts - Event logging and monitoring

States:

CLOSED (normal): Requests pass through
OPEN (failing): Requests fast-fail, preventing cascading failures
HALF_OPEN (testing): Allow limited requests to test recovery

Configuration (environment variables):

CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3        # Failures before opening
CIRCUIT_BREAKER_TIMEOUT=30000              # ms before retry (OPEN → HALF_OPEN)

Protected Operations:

Database queries (lib/search/data.ts, lib/services.ts, lib/analytics.ts)
Service management (claim, update, fetch)
Analytics tracking (non-critical, graceful degradation)
Offline sync (respects circuit state, skips when open)

Design Principles:

Fail-fast: Prevent long waits during outages
Automatic recovery: Transitions to HALF_OPEN after timeout
Fallback-aware: Integrates with existing JSON fallback in search
Observable: Logs state transitions for debugging

Implementation

Phase 1: Core Infrastructure ✅

Performance Tracking:

// lib/performance/tracker.ts
await trackPerformance(
  "search.total",
  async () => {
    // Operation here
  },
  { queryLength: query.length }
)

Circuit Breaker:

// lib/resilience/supabase-breaker.ts
const result = await withCircuitBreaker(
  async () => supabase.from("services").select("*"),
  async () => loadFromJSONFallback() // Optional fallback
)

Environment Validation (Zod):

// lib/env.ts
CIRCUIT_BREAKER_ENABLED: z.string().transform(val => val !== 'false'),
CIRCUIT_BREAKER_FAILURE_THRESHOLD: z.string().pipe(z.number().int().positive()),
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING: z.string().transform(val => val === 'true'),

Phase 2: Integration ✅

Files Modified:

lib/search/index.ts - Performance tracking for search orchestration
lib/search/data.ts - Circuit breaker + performance tracking for data loading
lib/services.ts - Circuit breaker for service management (3 functions)
lib/analytics.ts - Circuit breaker for analytics (2 functions, graceful degradation)
lib/offline/sync.ts - Respects circuit breaker state
app/api/v1/search/services/route.ts - Performance tracking for API
app/api/v1/health/route.ts - NEW: Health check endpoint

Files Created:

lib/performance/tracker.ts (120 lines)
lib/performance/metrics.ts (235 lines)
lib/resilience/circuit-breaker.ts (265 lines)
lib/resilience/supabase-breaker.ts (115 lines)
lib/resilience/telemetry.ts (130 lines)
app/api/v1/health/route.ts (130 lines)

Tests:

tests/lib/performance/tracker.test.ts (16 tests, 100% pass)
tests/lib/resilience/circuit-breaker.test.ts (18 tests, 100% pass)

Phase 3: Load Testing ✅

k6 Infrastructure:

tests/load/smoke-test.k6.js - Basic connectivity (1 VU, 30s)
tests/load/search-api.k6.js - Realistic load (10-50 VUs, ramp-up)
tests/load/sustained-load.k6.js - Stability test (20 VUs, 30min)
tests/load/spike-test.k6.js - Spike/stress test (0-100 VUs in 10s)

Documentation:

docs/testing/load-testing.md - Comprehensive guide with thresholds

npm Scripts:

npm run test:load          # Main search API test
npm run test:load:smoke    # Quick connectivity check
npm run test:load:sustained # Long-running stability
npm run test:load:spike    # Spike/stress test

Consequences

Positive

Performance Tracking:

✅ Visibility: Clear metrics on operation latencies (p50, p95, p99)
✅ Regression Detection: Can establish baseline and catch degradation
✅ Optimization Guidance: Data-driven decisions on what to optimize
✅ Minimal Overhead: <1ms per operation, opt-in only
✅ Development-Friendly: Works locally without external services

Circuit Breaker:

✅ Fast Failures: Prevents 30s hangs during outages (fail in <1ms)
✅ Graceful Degradation: Automatic fallback to JSON in search
✅ Automatic Recovery: Self-healing via HALF_OPEN state
✅ Resource Protection: Prevents hammering failing endpoints
✅ Observable: State transitions logged for debugging

Operational:

✅ Health Check: New /api/v1/health endpoint for monitoring
✅ Load Testing: k6 scripts provide baseline metrics
✅ Type Safety: All env vars validated via Zod

Negative

Performance Tracking:

⚠️ Memory Usage: In-memory metrics store (mitigated: max 1000 samples/operation, 10min retention, dev-only)
⚠️ Per-Process State: Metrics not shared across serverless instances (acceptable for dev/staging)
⚠️ Manual Analysis: No automatic alerting (future: integrate with Axiom/Sentry)

Circuit Breaker:

⚠️ False Positives: Could open on transient errors (mitigated: 50% failure rate threshold + 3 consecutive failures)
⚠️ Per-Process State: Each serverless instance has independent circuit state (acceptable trade-off)
⚠️ Coordination Gap: No distributed circuit state (future: could use Redis if needed)
⚠️ Auth Implications: Circuit breaker on auth queries could lock out users (mitigated: fail-closed security)

Operational:

⚠️ Configuration Complexity: 3 new environment variables to tune
⚠️ Incomplete Rollout: Only ~40% of Supabase calls protected initially (API routes need coverage)

Neutral

k6 Dependency: Adds load testing tool to dev dependencies (~50MB)
Code Complexity: +1000 lines of code, but well-tested and isolated

Alternatives Considered

1. Use Supabase Built-in Resilience

Considered: Rely on Supabase client library's retry/timeout mechanisms

Rejected Because:

Supabase retries are for network failures, not database unavailability
No configurable circuit breaker in Supabase client
No fallback mechanism for read operations
Doesn't prevent cascading failures

2. External APM Tool (Axiom, Datadog, New Relic)

Considered: Use external Application Performance Monitoring service

Rejected Because:

Adds cost and external dependency
Overkill for current scale
Can be added later (performance tracking provides foundation)
Our current logger already supports structured logging for future integration

Future: Once we have baseline metrics from our system, we can selectively send critical metrics to external APM

3. Redis-Based Circuit Breaker

Considered: Use Redis to share circuit breaker state across instances

Rejected Because:

Adds infrastructure dependency (Redis)
Overkill for current scale (serverless functions are short-lived)
Per-instance circuit breaker is sufficient (fail-fast still works)
Can be added later if coordination becomes critical

Trade-off Accepted: Each instance independently opens its circuit, but all instances benefit from fast failures

4. API Gateway Circuit Breaker (Vercel Edge Config)

Considered: Implement circuit breaker at API gateway level using Vercel Edge Config

Rejected Because:

Tightly couples to Vercel platform
Less flexible for fine-grained control
Harder to test locally
Our approach works everywhere (local, staging, any platform)

5. Prometheus + Grafana

Considered: Full metrics stack with time-series database

Rejected Because:

Significant infrastructure overhead
Overkill for current needs
Can't run locally easily
Our in-memory approach is sufficient for development

Future: Can export metrics to Prometheus if scale demands

Configuration

Environment Variables

Added to .env.example:

# Performance Tracking (v17.5+) - Development/Staging Only
NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=false

# Circuit Breaker Configuration (v17.5+)
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3     # Failures before opening
CIRCUIT_BREAKER_TIMEOUT=30000           # ms before retry (30s)

Recommended Settings by Environment

Development:

NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=true  # Enable metrics
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
CIRCUIT_BREAKER_TIMEOUT=30000

Staging:

NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=true  # Enable metrics
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5           # Slightly higher threshold
CIRCUIT_BREAKER_TIMEOUT=60000                 # 1min timeout

Production:

NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=false # Disable in-memory metrics
CIRCUIT_BREAKER_ENABLED=true
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
CIRCUIT_BREAKER_TIMEOUT=30000

Monitoring and Observability

Health Check Endpoint

Endpoint: GET /api/v1/health

Response:

{
  "status": "healthy" | "degraded" | "unhealthy",
  "timestamp": "2026-01-25T...",
  "version": "0.1.0",
  "checks": {
    "database": {
      "status": "up" | "down" | "degraded",
      "latencyMs": 45
    },
    "circuitBreaker": {
      "enabled": true,
      "state": "CLOSED" | "OPEN" | "HALF_OPEN",
      "stats": {
        "failureCount": 0,
        "successCount": 125,
        "failureRate": 0.0
      }
    },
    "performance": {  // Only when NEXT_PUBLIC_ENABLE_SEARCH_PERF_TRACKING=true
      "tracking": true,
      "metrics": { /* p50, p95, p99 for all operations */ }
    }
  }
}

Status Codes:

200: Healthy or degraded
503: Unhealthy (circuit open or database down)

Log Events

Circuit Breaker Events:

[INFO] Circuit breaker 'supabase' state transition { from: 'CLOSED', to: 'OPEN', failureCount: 3 }
[WARN] Circuit breaker 'supabase' HALF-OPEN { nextAttemptTime: ... }
[INFO] Circuit breaker 'supabase' CLOSED { successCount: 1 }

Performance Events (when tracking enabled):

[INFO] Performance: search.total { operation: 'search.total', durationMs: 245.32, queryLength: 15 }
[INFO] Performance: search.dataLoad { operation: 'search.dataLoad', durationMs: 42.18, source: 'supabase' }

Rollout Plan

Phase 1: Core Infrastructure (Completed ✅)

[x] Implement circuit breaker core
[x] Implement performance tracking
[x] Add environment variable validation
[x] Write unit tests (34 tests)
[x] Create load testing infrastructure

Phase 2: Critical Path Integration (Completed ✅)

[x] Protect search data loading
[x] Protect service management functions
[x] Protect analytics tracking
[x] Add performance tracking to search
[x] Add performance tracking to API routes
[x] Create health check endpoint

Phase 3: Complete Rollout (In Progress ⚠️)

[ ] Protect remaining API routes (POST, PUT, DELETE endpoints)
[ ] Protect authorization checks (consider trade-offs)
[ ] Add authentication to health check endpoint
[ ] Add /api/v1/metrics endpoint (dev/staging only)
[ ] Write integration tests
[ ] Update CLAUDE.md with usage patterns

Phase 4: Validation & Monitoring (Pending)

[ ] Run load tests and establish baseline metrics
[ ] Monitor circuit breaker activations in staging
[ ] Tune thresholds based on real-world data
[ ] Document operational runbook
[ ] Add alerting for circuit breaker open events

Success Metrics

Performance Tracking

Goal: Establish baseline performance metrics

✅ Track p50, p95, p99 latencies for search operations
⏳ Establish baseline: p95 < 800ms for search API
⏳ Detect 20%+ performance regressions
⏳ Guide optimization efforts with data

Circuit Breaker

Goal: Improve resilience and user experience during outages

✅ Fast-fail in <1ms when circuit is open (vs 30s timeout)
⏳ Automatic recovery within 30s-1min of service restoration
⏳ Zero cascading failures during Supabase outages
⏳ Graceful degradation: Search falls back to JSON when DB unavailable

Load Testing

Goal: Validate performance under load

✅ Infrastructure in place (k6 scripts)
⏳ Smoke test: 100% success rate, p95 < 1000ms
⏳ Search API test: <5% error rate, p95 < 800ms
⏳ Sustained test: Stable performance over 30min
⏳ Spike test: Graceful degradation, no crashes

Future Work

Short Term (v17.6)

Complete Circuit Breaker Rollout: Protect all API routes and critical operations
Integration Tests: Test circuit breaker with simulated database failures
Operational Documentation: Create runbook for circuit breaker incidents
Baseline Metrics: Run load tests and document baseline performance

Medium Term (v18.0)

External Monitoring Integration: Send critical metrics to Axiom or Sentry
Alerting: Configure alerts for circuit breaker open events
Metrics API: Expose /api/v1/metrics endpoint for dashboards
Dashboard: Build internal performance dashboard

Long Term (v19.0+)

Distributed Circuit Breaker: Use Redis for cross-instance coordination (if needed)
Advanced Metrics: Track business metrics (conversion rates, user journeys)
Automatic Scaling: Use performance metrics to trigger autoscaling
AI-Powered Anomaly Detection: Detect performance anomalies automatically

References

Circuit Breaker Pattern: Martin Fowler
Release It! (Michael Nygard): Patterns for production-ready software
k6 Load Testing: k6.io/docs
Supabase Resilience: Supabase Status
Related ADRs:
ADR-010: PWA Offline Fallback and Caching
ADR-014: Database Index Optimization
ADR-015: Non-Blocking E2E Tests

Appendix: Circuit Breaker State Machine

┌─────────┐
│ CLOSED  │ (Normal operation)
└─────────┘
     │
     │ Failure threshold exceeded (3 failures OR 50% error rate)
     ▼
┌─────────┐
│  OPEN   │ (Fast-fail, block requests)
└─────────┘
     │
     │ Timeout elapsed (30s default)
     ▼
┌─────────┐
│ HALF-   │ (Testing recovery)
│ OPEN    │
└─────────┘
     │
     ├─ Success ───> CLOSED (Recovery confirmed)
     │
     └─ Failure ───> OPEN (Still failing, retry later)

Appendix: Performance Tracking Flow

User Request
    │
    ▼
┌─────────────────────────────────────┐
│ trackPerformance('operation') START │
│ • Record start time                 │
│ • Execute operation                 │
│ • Record end time                   │
│ • Calculate duration                │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│ recordMetric()                      │
│ • Store in-memory (dev only)        │
│ • Log with structured data          │
│ • Prune old samples (>10min)        │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│ getMetrics()                        │
│ • Calculate p50, p95, p99           │
│ • Aggregate by operation            │
│ • Available via /api/v1/health      │
└─────────────────────────────────────┘