# ADR-0006: Dead Man's Switch Scraper Monitoring

- Status: Accepted
- Date: 2026-02-05
- Deciders: Development Team
- Related: M12 Phase 2 - Research Infrastructure
## Context

Scrapers running on scheduled GitHub Actions can fail silently due to:

- API changes at data sources (Ontario Health, Quebec MSSS)
- Network issues
- Rate limiting
- Authentication problems
- Database connection failures
Silent failures mean users see stale data without knowing it, undermining the project's credibility and defeating the purpose of a "Health Systems Observatory."
**The Problem:** How do we ensure scraper failures are detected and reported quickly?
## Decision
We implemented a Dead Man's Switch monitoring system with multiple layers:
### 1. Backend: Heartbeat Tracking

- Every scraper run writes to the `scraper_status` table with a timestamp (sketched below)
- `AlertService` sends Pushover notifications for stale/error states
- `check_heartbeat` CLI command queries the database and triggers alerts
- Threshold: 60 minutes (scrapers run every 15 minutes, so 4 missed runs)
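A minimal sketch of the heartbeat write, assuming a hypothetical SQLite-backed `scraper_status` schema keyed by source (the real table layout may differ):

```python
import sqlite3
from datetime import datetime, timezone


def write_heartbeat(conn: sqlite3.Connection, source_id: str, status: str = "ok") -> None:
    """Upsert one heartbeat row per source; last_run drives staleness checks."""
    conn.execute(
        """
        INSERT INTO scraper_status (source_id, status, last_run)
        VALUES (?, ?, ?)
        ON CONFLICT(source_id) DO UPDATE
            SET status = excluded.status, last_run = excluded.last_run
        """,
        (source_id, status, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```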
### 2. GitHub Actions: Automated Monitoring

- Workflow runs every 30 minutes (`heartbeat-monitor.yml`)
- Calls the `check_heartbeat` CLI (sketched below)
- Sends Pushover alerts on failure
- Non-blocking: doesn't prevent scraper runs
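What `check_heartbeat` boils down to, under the same hypothetical schema. The exit-code convention (0 = all fresh, 1 = something stale) is an assumption, chosen so that a non-zero exit fails the workflow step and triggers the alert path:

```python
import sqlite3
import sys
from datetime import datetime, timezone

STALE_MINUTES = 60  # 4 missed runs at a 15-minute cadence


def check_heartbeat(conn: sqlite3.Connection) -> int:
    """Return a shell exit code: 0 if every source is fresh, 1 otherwise."""
    now = datetime.now(timezone.utc)
    stale = []
    for source_id, last_run in conn.execute(
        "SELECT source_id, last_run FROM scraper_status"
    ):
        age = (now - datetime.fromisoformat(last_run)).total_seconds() / 60
        if age >= STALE_MINUTES:
            stale.append((source_id, int(age)))
    for source_id, age in stale:
        # The real command also calls AlertService here; printing keeps the
        # failure visible in the GitHub Actions log.
        print(f"STALE: {source_id} ({age} min since last heartbeat)")
    return 1 if stale else 0


if __name__ == "__main__":
    sys.exit(check_heartbeat(sqlite3.connect("observatory.db")))  # path is illustrative
```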
### 3. Frontend: User-Visible Status

- `SystemStatus` component displays real-time health
- Three states: Healthy (<60 min), Degraded (60-120 min), Down (>120 min)
- Polls the `/api/health` endpoint every minute
- Shows time since last update
- Positioned in the page footer
### 4. API: Health Endpoint

- `/api/health` aggregates status across all sources
- Returns an overall health flag and per-source details (illustrated below)
- Used by both the `SystemStatus` component and external monitoring
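The aggregation is a small fold over the heartbeat rows. A sketch of the response shape (field names here are illustrative, not the actual API contract):

```python
from datetime import datetime, timezone


def build_health_payload(rows: list[dict]) -> dict:
    """Aggregate per-source heartbeats into a /api/health-style response."""
    now = datetime.now(timezone.utc)
    sources = []
    for row in rows:  # one dict per scraper_status row, last_run as aware datetime
        age_min = (now - row["last_run"]).total_seconds() / 60
        sources.append({
            "source_id": row["source_id"],
            "age_minutes": round(age_min),
            "healthy": age_min < 60 and row["status"] == "ok",
        })
    return {
        "healthy": all(s["healthy"] for s in sources),  # overall flag
        "sources": sources,                             # per-source details
    }
```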
## Architecture

```
┌─────────────────┐
│  Scraper Runs   │
│ (every 15 min)  │
└────────┬────────┘
         │ writes heartbeat
         ▼
┌─────────────────┐
│ scraper_status  │◄────┐
│      table      │     │ queries
└─────────────────┘     │
                        │
┌─────────────────┐     │
│  GitHub Action  │─────┘
│ (every 30 min)  │
└────────┬────────┘
         │ if stale
         ▼
┌─────────────────┐
│  Pushover API   │
│  (push notify)  │
└─────────────────┘

┌─────────────────┐
│  User Browser   │
└────────┬────────┘
         │ polls every 60s
         ▼
┌─────────────────┐
│   /api/health   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  SystemStatus   │
│    Component    │
└─────────────────┘
```
## Rationale

### Why Pushover?
- Simple API - Single POST request, no OAuth complexity
- Free tier - Sufficient for personal projects
- Mobile apps - iOS/Android notifications
- Reliable - No email deliverability issues
- Low latency - Instant push vs email delays
Alternatives Considered:

- Email (SMTP) - Rejected: deliverability issues, spam filters, configuration complexity
- Slack webhooks - Rejected: requires a workspace, overkill for a solo project
- Twilio SMS - Rejected: costs money for every alert
- Discord webhooks - Rejected: not mobile-friendly
### Why a 60-minute threshold?
- Scrapers run every 15 minutes (4 runs per hour)
- 60 minutes = 4 consecutive failures before alert
- Reduces noise from transient failures
- Still fast enough to detect real issues
- Users see "degraded" warning at 60 min, "down" at 120 min
### Why a visible status indicator?
- Transparency - Users know if data is fresh
- Trust building - Shows we monitor our own system
- Professional - Demonstrates operational maturity
- CanMEDS Leader - Infrastructure thinking for portfolio
## Implementation Details

### AlertService (Pushover Integration)
```python
class AlertService:
    def alert_scraper_stale(self, source_id: str, age_minutes: int):
        """Alert when scraper hasn't run recently."""
        self.send_alert(
            title=f"⚠️ Scraper Stale: {source_id}",
            message=f"No heartbeat for {age_minutes} minutes.",
            priority=1,  # High priority
            url="https://github.com/jerdaw/waittimecanada/actions",
        )
```
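The fragment above delegates to `send_alert`. A minimal sketch of that method against Pushover's message endpoint, assuming credentials live in `PUSHOVER_TOKEN`/`PUSHOVER_USER` environment variables (the real configuration may differ); note the graceful no-op when credentials are missing, per the Graceful Degradation section below:

```python
import logging
import os

import requests

logger = logging.getLogger(__name__)


class AlertService:
    PUSHOVER_URL = "https://api.pushover.net/1/messages.json"

    def send_alert(self, title: str, message: str, priority: int = 0, url: str = "") -> None:
        """POST one notification to Pushover; log and skip if unconfigured."""
        token = os.environ.get("PUSHOVER_TOKEN")  # assumed variable names
        user = os.environ.get("PUSHOVER_USER")
        if not token or not user:
            logger.warning("Pushover not configured; skipping alert: %s", title)
            return
        resp = requests.post(
            self.PUSHOVER_URL,
            data={"token": token, "user": user, "title": title,
                  "message": message, "priority": priority, "url": url},
            timeout=10,
        )
        resp.raise_for_status()
```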
### Frontend Health States

- Healthy (green): `age < 60 min && healthy == true`
- Degraded (amber): `age >= 60 min && age < 120 min`
- Down (red): `age >= 120 min || healthy == false`
- Loading (gray): initial state before the first check
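The same thresholds expressed as a classifier, shown in Python for consistency with the rest of this ADR (the actual logic lives in the frontend component):

```python
def classify_status(age_minutes: float | None, healthy: bool | None) -> str:
    """Map /api/health data to the four SystemStatus display states."""
    if age_minutes is None or healthy is None:
        return "loading"   # gray: no response yet
    if age_minutes >= 120 or not healthy:
        return "down"      # red
    if age_minutes >= 60:
        return "degraded"  # amber
    return "healthy"       # green
```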
### Graceful Degradation
- If Pushover credentials are not configured: logs to the console instead of failing
- If the database is unavailable: the health endpoint returns 500 and the frontend shows "down"
- If the GitHub Actions workflow fails: it doesn't block scraper runs
## Consequences

### Positive

- ✅ Early Detection - Failures detected within 60 minutes
- ✅ Zero Cost - Pushover free tier, GitHub Actions free
- ✅ User Transparency - Status visible without hunting for it
- ✅ Professional Image - Shows operational maturity
- ✅ Admissions Value - Demonstrates Leader competency
- ✅ Debugging Aid - Health endpoint useful for troubleshooting
### Negative

- ⚠️ External Dependency - Relies on the Pushover API
- ⚠️ Manual Setup - Requires creating a Pushover account and adding secrets
- ⚠️ Noise Risk - Could generate false alarms if the threshold is too low
- ⚠️ Privacy - Health data is public (but contains no sensitive info)
### Neutral

- 🔵 Not Real-Time - 30-minute check interval (an acceptable trade-off)
- 🔵 Single Point of Failure - If the database is down, all monitoring stops (but the database is also required for the app)
## Alternatives Not Chosen

### 1. No Monitoring
- Rejected: Silent failures undermine credibility
- Risk: Users complain about stale data, lose trust
### 2. Manual Checks
- Rejected: Not scalable, requires daily vigilance
- Risk: Failures go unnoticed for days
### 3. Only Backend Monitoring (No Frontend Indicator)
- Rejected: Users deserve transparency
- Missed opportunity: Professional image for portfolio
### 4. Real-Time WebSocket Status
- Rejected: Over-engineered for current scale
- Unnecessary: 1-minute polling sufficient
- Could revisit if needed for production at scale
## Testing Strategy

### Backend Tests

- `AlertService` sends notifications correctly
- Graceful degradation when credentials are missing (see the sketch below)
- `check_heartbeat` CLI exits with the correct codes
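For the graceful-degradation case, a test along these lines (the module path and log message match the `send_alert` sketch above and are illustrative):

```python
import logging

from alerts import AlertService  # hypothetical module path


def test_send_alert_noop_without_credentials(monkeypatch, caplog):
    """With no Pushover credentials, send_alert must log a warning, not raise."""
    monkeypatch.delenv("PUSHOVER_TOKEN", raising=False)
    monkeypatch.delenv("PUSHOVER_USER", raising=False)
    with caplog.at_level(logging.WARNING):
        AlertService().send_alert(title="test", message="test")
    assert "Pushover not configured" in caplog.text
```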
### Frontend Tests (7 tests)
- Renders all 4 status states correctly
- Calculates age correctly
- Polls health endpoint every minute
- Shows time since last update
- Accessible (ARIA attributes)
- Handles network errors gracefully
### Manual Testing
- Create Pushover account and test alert delivery
- Verify GitHub Action runs and checks heartbeat
- Confirm frontend status updates in real-time
- Test degraded/down states by stopping scrapers
## Future Considerations

### Potential Enhancements
- Multi-Channel Alerts - Add email/SMS as fallback
- Alert Deduplication - Don't spam if issue persists
- Recovery Notifications - Alert when scraper recovers
- Status Page - Dedicated `/status` page with history
- Metrics Dashboard - Uptime percentage, MTTR tracking
- SLA Tracking - Document target availability (e.g., 99% uptime)
### Production Requirements (Not Yet Implemented)
- Configure Pushover in production environment
- Set up monitoring for the monitoring (meta-monitoring)
- Document runbook for responding to alerts
- Consider adding PagerDuty for critical issues
## Related Documents

- Implementation Plan: `docs/planning/implementation/milestone-12-research.md`
- Manual Setup: `docs/planning/manual-tasks.md` (Pushover configuration)
- API Documentation: `docs/API.md` (health endpoint)
## Success Metrics

Operational:

- Alert delivery latency < 5 minutes
- False positive rate < 5%
- Detection of 100% of scraper failures

Portfolio:

- Demonstrates the "Leader" CanMEDS competency
- Shows understanding of production operations
- Evidence of professional software practices
- Last Updated: 2026-02-05
- Review Date: After production deployment