Monitoring + alerting playbook (operators)

Goal: detect user-visible outages and silent automation failures with low noise.

Canonical reference:

  • ../../monitoring-and-ci-checklist.md

External uptime monitors (required)

Ensure monitors exist for:

  • https://api.healtharchive.ca/api/health
  • https://healtharchive.ca/archive
  • https://replay.healtharchive.ca/ (only if you rely on replay)

After changes, you can smoke-test from any machine with internet access:

  • healtharchive/scripts/smoke-external-monitors.sh
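
If the repository is not checked out on the machine you are testing from, a manual check with curl can stand in. The sketch below is an illustration only and is not the contents of smoke-external-monitors.sh; the function names are hypothetical, and treating any 2xx or 3xx response as "up" is an assumption that may not match how your uptime provider is configured.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the project's smoke-external-monitors.sh.
set -u

# Assumption: any 2xx or 3xx status counts as "up". Everything else,
# including curl's 000 for connection failures, counts as "down".
is_up_status() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch a URL, print one result line, and return non-zero if it looks down.
check_url() {
  local url="$1"
  local code
  # -s: quiet; -o /dev/null: discard body; -m 15: overall timeout in seconds;
  # -w '%{http_code}': print only the HTTP status code.
  code="$(curl -s -o /dev/null -m 15 -w '%{http_code}' "$url")" || code="000"
  if is_up_status "$code"; then
    echo "OK   $code $url"
  else
    echo "FAIL $code $url"
    return 1
  fi
}

# Example (requires network access):
#   check_url https://api.healtharchive.ca/api/health
#   check_url https://healtharchive.ca/archive
```

A passing manual check only shows that the endpoints respond from your machine; it does not confirm that the external monitors exist or that their alerts are routed.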

If you want alerts when systemd timers stop running:

  1. Create checks in your Healthchecks provider.
  2. Store ping URLs only on the VPS:
    • /etc/healtharchive/healthchecks.env (root-owned)
    • This file may be shared across multiple automations; it is OK to keep both:
      • legacy HC_* variables (DB backup + disk check)
      • newer HEALTHARCHIVE_HC_PING_* variables (systemd unit templates)
  3. Keep the unit templates installed/updated on the VPS:
    • sudo ./scripts/vps-install-systemd-units.sh --apply --restart-worker
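
To make the "ping URL lives only in the env file" idea concrete, here is a minimal sketch of how a job could ping its check when a URL is configured and stay silent when it is not. The function name and the variable name in the example are hypothetical; the real variable names are whatever your unit templates and scripts expect. It also assumes the env file uses plain KEY=value lines that a shell can source.

```shell
#!/usr/bin/env bash
# Sketch only: the variable name used in the example is hypothetical.
set -u

# Ping the URL stored in the named variable. Do nothing (and succeed) if the
# variable is unset or empty, so hosts without Healthchecks are unaffected.
ping_healthcheck() {
  local var_name="$1"
  local url
  # Indirect lookup of the variable named by $var_name, defaulting to empty.
  eval "url=\${$var_name:-}"
  if [ -z "$url" ]; then
    return 0
  fi
  # -f: fail on HTTP errors; -sS: quiet but still show errors;
  # -m 10: overall timeout; --retry 3: ride out brief network blips.
  curl -fsS -m 10 --retry 3 -o /dev/null "$url"
}

# Example (on the VPS, as root, assuming KEY=value lines in the file):
#   set -a; . /etc/healtharchive/healthchecks.env; set +a
#   ping_healthcheck HEALTHARCHIVE_HC_PING_EXAMPLE   # hypothetical name
```

Because the URL is read from the environment at run time, nothing secret needs to appear in a script or unit file that is committed to git.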

What “done” means

  • External monitors are green and alert routing is confirmed.
  • If enabled, Healthchecks pings are configured without committing URLs to git.
  • If you use internal Prometheus-based alerts, Alertmanager is configured and test alerts deliver:
    • observability-guide.md#6-configure-alerting
  • If you use WARC tiering to a Storage Box, tiering metrics are enabled so you get high-signal alerts:
    • sudo systemctl enable --now healtharchive-tiering-metrics.timer