
0018. Scraper Observability and Reliability Hardening

Date: 2026-02-19

Status: Accepted

Deciders: Jeremy Dawson

Technical Story: docs/planning/roadmap.md P1 ops items ("scraper reliability hardening" + "scraper failure visibility")

Context and Problem Statement

Scraper failures were previously recorded as free-text errors with little structured metadata, and retry/timeout behavior was inconsistent across the provincial implementations. This created a risk of silent data decay and slowed triage whenever upstream sources changed or became unavailable.

Decision Drivers

  • Prevent silent scraper failure and stale data exposure
  • Standardize retry/backoff/timeout policy across ON/QC/AB/BC
  • Preserve backward compatibility for existing health consumers
  • Keep implementation lightweight within current infrastructure and budget limits

Considered Options

  • Keep existing heartbeat fields and rely on log inspection only
  • Add a dedicated per-run event table plus external observability stack
  • Extend scraper_status with structured failure and last-success metadata, and enforce shared reliability helpers

Decision Outcome

Chosen option: "Extend scraper_status with structured metadata and standardize reliability behavior in shared scraper helpers", because it delivers immediate operational visibility with minimal infrastructure overhead and safe additive schema changes.
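
For illustration, a minimal sketch of the structured metadata the extended heartbeat row could carry; the field names below are assumptions for this ADR, not the exact columns added by the migration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ScraperHeartbeat:
    """Illustrative shape of an extended scraper_status row (field names assumed)."""

    scraper_id: str                      # e.g. "on", "qc", "ab", "bc"
    last_run_at: Optional[datetime]      # existing heartbeat field, unchanged
    last_success_at: Optional[datetime]  # new: last-known-good timestamp
    last_error_at: Optional[datetime]    # new: when the most recent failure occurred
    last_error_category: Optional[str]   # new: one of the failure categories listed below
    last_error_message: Optional[str]    # new: truncated human-readable detail
    consecutive_failures: int = 0        # new: counter that alerting can gate on
```

Keeping the legacy heartbeat fields untouched is what makes the change additive and the rollback low-risk.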

Positive Consequences

  • Operators can distinguish upstream_unavailable vs parser_breakage vs infra_runtime vs persistence_failure
  • /api/health and the heartbeat CLI now expose last-known-good and last-error state in a single view
  • Alerting can gate on consecutive_failures instead of a single noisy failure (see the sketch after this list)
  • Rollback remains low risk because legacy heartbeat fields remain intact
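
A hedged sketch of that gating, assuming the consecutive_failures counter and last-success timestamp from the heartbeat row; the threshold and staleness window are illustrative values, not the configured ones.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative thresholds; the real values would live in configuration.
CONSECUTIVE_FAILURE_THRESHOLD = 3
STALENESS_WINDOW = timedelta(hours=6)


def should_alert(consecutive_failures: int, last_success_at: Optional[datetime]) -> bool:
    """Alert only when failures repeat or last-known-good data has gone stale,
    so a single transient upstream blip does not page anyone."""
    if consecutive_failures >= CONSECUTIVE_FAILURE_THRESHOLD:
        return True
    if last_success_at is None:
        return True  # never succeeded: treat as stale from the start
    return datetime.now(timezone.utc) - last_success_at > STALENESS_WINDOW
```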

Negative Consequences

  • Additional schema and contract surface area to maintain
  • Classification heuristics can still misclassify edge-case failures as unknown
  • Retry policies can increase run duration if not monitored

Pros and Cons of the Options

Keep existing heartbeat model

  • Good, because no migration or contract changes required
  • Bad, because triage remains manual and slow
  • Bad, because silent decay risk remains high

Add external observability/event-history stack now

  • Good, because deeper analytics and audit trail become possible
  • Bad, because it increases cost and operational complexity
  • Bad, because it does not fit within current project resource constraints

Extend existing heartbeat + shared reliability helpers

  • Good, because additive schema preserves compatibility
  • Good, because it delivers immediate observability in existing CLI/API paths
  • Bad, because detailed per-run historical analytics remain limited

Additional Information

Implemented artifacts include:

  • Migration: backend/migrations/013_add_scraper_observability_columns.sql
  • Shared classification + retry constants: backend/src/waittime/scrapers/observability.py (a retry/backoff sketch follows this list)
  • Heartbeat metadata read/write: backend/src/waittime/services/database.py, backend/src/waittime/services/heartbeat.py
  • Operator surfaces: backend/src/waittime/cli/check_heartbeat.py, frontend/app/api/health/route.ts
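
A hedged sketch of the retry/backoff/timeout behavior the shared helpers standardize across ON/QC/AB/BC; the constants, the run_with_retries name, and the jitter scheme are assumptions, not the contents of observability.py.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Illustrative policy constants; the shared module would own the real values.
MAX_ATTEMPTS = 3
BASE_BACKOFF_SECONDS = 2.0
REQUEST_TIMEOUT_SECONDS = 30.0  # each scraper passes this to its HTTP calls


def run_with_retries(fetch: Callable[[], T], max_attempts: int = MAX_ATTEMPTS) -> T:
    """Run a scraper fetch with bounded retries and exponential backoff plus jitter.

    The last exception is re-raised so the caller can classify it into a failure
    category and record it on the heartbeat row.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # real helpers would narrow this to retryable errors
            last_error = exc
            if attempt == max_attempts:
                break
            # Exponential backoff: 2s, 4s, 8s, ... plus up to 1s of jitter.
            time.sleep(BASE_BACKOFF_SECONDS * (2 ** (attempt - 1)) + random.uniform(0, 1))
    assert last_error is not None
    raise last_error
```

Bounding attempts keeps the worst-case added run duration predictable, which is what the negative consequence about retry policies asks operators to monitor.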

Failure category contract (a classification sketch follows the list):

  • upstream_unavailable
  • parser_breakage
  • infra_runtime
  • persistence_failure
  • unknown
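
A hedged sketch of how an exception from a scraper run could be mapped onto these categories; the exception types and the ScraperParseError/PersistenceError names are illustrative assumptions, not the heuristics implemented in observability.py.

```python
class ScraperParseError(Exception):
    """Hypothetical error raised when an expected page structure is missing."""


class PersistenceError(Exception):
    """Hypothetical error raised when writing scraped rows to the database fails."""


def classify_failure(exc: Exception) -> str:
    """Map an exception from a scraper run to a failure category string."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return "upstream_unavailable"   # source unreachable or too slow
    if isinstance(exc, ScraperParseError):
        return "parser_breakage"        # page fetched, but its structure changed
    if isinstance(exc, PersistenceError):
        return "persistence_failure"    # scraped fine, could not store results
    if isinstance(exc, (MemoryError, OSError)):
        return "infra_runtime"          # runtime/environment problems (checked after the network errors above)
    return "unknown"                    # edge cases fall through, as noted under Negative Consequences
```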