0018. Scraper Observability and Reliability Hardening
Date: 2026-02-19
Status: Accepted
Deciders: Jeremy Dawson
Technical Story: docs/planning/roadmap.md P1 ops items ("scraper reliability hardening" + "scraper failure visibility")
Context and Problem Statement
Scraper failures were previously represented as free-text errors with limited metadata, and retry/timeout behavior was inconsistent across provincial implementations. This created a risk of silent data decay and slowed triage when upstream sources changed or became unavailable.
Decision Drivers
- Prevent silent scraper failure and stale data exposure
- Standardize retry/backoff/timeout policy across ON/QC/AB/BC
- Preserve backward compatibility for existing health consumers
- Keep implementation lightweight within current infrastructure and budget limits
Considered Options
- Keep existing heartbeat fields and rely on log inspection only
- Add a dedicated per-run event table plus external observability stack
- Extend `scraper_status` with structured failure and last-success metadata, and enforce shared reliability helpers
Decision Outcome
Chosen option: "Extend `scraper_status` with structured metadata and standardize reliability behavior in shared scraper helpers", because it delivers immediate operational visibility with minimal infrastructure overhead and safe additive schema changes.
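As a rough illustration of the record shape this decision implies, the sketch below models the extended heartbeat row in Python. Every field name here is an assumption for illustration; the authoritative column set lives in migration 013.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ScraperStatus:
    """Hypothetical shape of an extended scraper_status row (names assumed)."""
    scraper_id: str                      # e.g. "on", "qc", "ab", "bc"
    last_run_at: Optional[datetime]      # legacy heartbeat field, unchanged
    last_success_at: Optional[datetime]  # additive: last-known-good timestamp
    last_error_at: Optional[datetime]    # additive: when the latest failure occurred
    last_error_category: Optional[str]   # additive: one of the failure categories below
    last_error_message: Optional[str]    # additive: truncated detail for triage
    consecutive_failures: int = 0        # additive: reset to 0 on every success
```

Because every new field is additive and nullable (or defaulted), legacy heartbeat consumers that only read `last_run_at` keep working, which is what keeps rollback low risk.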
Positive Consequences
- Operators can distinguish `upstream_unavailable` vs `parser_breakage` vs `infra_runtime` vs `persistence_failure`
- `/api/health` and the CLI now expose last-known-good and last-error state in one view
- Alerting can gate on `consecutive_failures` instead of a single noisy failure (see the sketch after this list)
- Rollback remains low risk because legacy heartbeat fields remain intact
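A minimal sketch of that alert gate, with an illustrative threshold (the real value should be tuned to each scraper's run cadence):

```python
ALERT_THRESHOLD = 3  # illustrative; tune per scraper cadence

def should_alert(consecutive_failures: int, threshold: int = ALERT_THRESHOLD) -> bool:
    """Fire only after repeated failures, suppressing one-off upstream blips."""
    return consecutive_failures >= threshold

assert not should_alert(1)  # a single transient failure stays quiet
assert should_alert(3)      # sustained breakage crosses the gate
```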
Negative Consequences
- Additional schema and contract surface area to maintain
- Classification heuristics can still misclassify edge-case failures as `unknown`
- Retry policies can increase run duration if not monitored (see the bounded-backoff sketch after this list)
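One way to contain that run-duration risk is to bound retries in both attempt count and total elapsed time. The sketch below assumes jittered exponential backoff; the constants and helper name are illustrative, not the actual contents of the shared helper module.

```python
import random
import time

MAX_ATTEMPTS = 3          # illustrative defaults; the shared helper module
BASE_DELAY_S = 2.0        # is the source of truth for real values
MAX_TOTAL_RETRY_S = 60.0  # hard cap so retries cannot balloon run duration

def fetch_with_retries(fetch, max_attempts: int = MAX_ATTEMPTS):
    """Retry with jittered exponential backoff, bounded in count and time."""
    started = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            elapsed = time.monotonic() - started
            if attempt == max_attempts or elapsed >= MAX_TOTAL_RETRY_S:
                raise  # give up; the caller records the failure category
            # Backoff grows 2s, 4s, 8s, ... with jitter to avoid thundering herd.
            time.sleep(BASE_DELAY_S * (2 ** (attempt - 1)) + random.uniform(0, 1))
```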
Pros and Cons of the Options
Keep existing heartbeat model
- Good, because no migration or contract changes required
- Bad, because triage remains manual and slow
- Bad, because silent decay risk remains high
Add external observability/event-history stack now
- Good, because deeper analytics and audit trail become possible
- Bad, because it increases cost and operational complexity
- Bad, because it exceeds current project resource constraints
Extend existing heartbeat + shared reliability helpers
- Good, because additive schema preserves compatibility
- Good, because it delivers immediate observability in existing CLI/API paths
- Bad, because detailed per-run historical analytics remain limited
Additional Information
Implemented artifacts include:
- Migration: `backend/migrations/013_add_scraper_observability_columns.sql`
- Shared classification + retry constants: `backend/src/waittime/scrapers/observability.py`
- Heartbeat metadata read/write: `backend/src/waittime/services/database.py`, `backend/src/waittime/services/heartbeat.py`
- Operator surfaces: `backend/src/waittime/cli/check_heartbeat.py`, `frontend/app/api/health/route.ts`
Failure category contract: `upstream_unavailable`, `parser_breakage`, `infra_runtime`, `persistence_failure`, `unknown`
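To make the contract concrete, here is a hedged sketch of how the categories might be encoded and applied. The enum values mirror the contract above, but `classify_failure` and its exception mapping are assumptions for illustration; the actual heuristics live in `backend/src/waittime/scrapers/observability.py` and may differ.

```python
import socket
from enum import Enum

class FailureCategory(str, Enum):
    UPSTREAM_UNAVAILABLE = "upstream_unavailable"
    PARSER_BREAKAGE = "parser_breakage"
    INFRA_RUNTIME = "infra_runtime"
    PERSISTENCE_FAILURE = "persistence_failure"
    UNKNOWN = "unknown"

def classify_failure(exc: Exception) -> FailureCategory:
    """Illustrative mapping from a raised exception to a failure category.

    Persistence failures would normally key off the database driver's
    exception types, which are not modeled in this sketch.
    """
    if isinstance(exc, (ConnectionError, TimeoutError, socket.timeout)):
        return FailureCategory.UPSTREAM_UNAVAILABLE   # source unreachable/slow
    if isinstance(exc, (KeyError, ValueError, AttributeError)):
        return FailureCategory.PARSER_BREAKAGE        # upstream shape changed
    if isinstance(exc, (MemoryError, OSError)):
        return FailureCategory.INFRA_RUNTIME          # host/runtime problem
    return FailureCategory.UNKNOWN                    # edge cases land here
```

Anything the heuristics cannot place falls through to `unknown`, which is exactly the misclassification risk called out under Negative Consequences.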