Monitoring + alerting playbook (operators)
Goal: detect user-visible outages and silent automation failures with low noise.
Canonical reference:
../../monitoring-and-ci-checklist.md
External uptime monitors (required)
Ensure monitors exist for:
- https://api.healtharchive.ca/api/health
- https://healtharchive.ca/archive
- https://replay.healtharchive.ca/ (only if you rely on replay)
After changes, you can smoke-test the monitored endpoints from any machine with internet access:
healtharchive/scripts/smoke-external-monitors.sh
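For a rough idea of what such a smoke test involves, the following is a minimal sketch; it is not the contents of smoke-external-monitors.sh. It assumes each endpoint answers a plain GET with a 2xx or 3xx status when healthy, and the function names classify_status and check_url are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the repository's smoke-external-monitors.sh.

# Map an HTTP status code to a verdict (assumption: 2xx/3xx means healthy).
classify_status() {
  case "$1" in
    2??|3??) echo "OK" ;;
    *)       echo "FAIL" ;;
  esac
}

# Fetch a URL and print "<verdict> <status> <url>".
# curl reports status 000 when the connection itself fails.
check_url() {
  local url="$1" status
  status="$(curl -sS -o /dev/null -m 15 -w '%{http_code}' "$url" 2>/dev/null)" || true
  echo "$(classify_status "${status:-000}") ${status:-000} ${url}"
}
```

Calling check_url once per monitored URL (for example, check_url https://api.healtharchive.ca/api/health) prints one verdict line per endpoint, which is easy to scan after a deploy.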
“Timer ran” monitoring (optional, recommended)
If you want alerts when systemd timers stop running:
- Create checks in your Healthchecks provider.
- Store ping URLs only on the VPS:
/etc/healtharchive/healthchecks.env (root-owned)
- This file may be shared across multiple automations; it is OK to keep both:
  - legacy HC_* variables (DB backup + disk check)
  - newer HEALTHARCHIVE_HC_PING_* variables (systemd unit templates)
- Keep the unit templates installed/updated on the VPS:
sudo ./scripts/vps-install-systemd-units.sh --apply --restart-worker
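To illustrate how an automation might use a ping URL from the env file, here is a minimal sketch of a ping helper. It assumes Healthchecks.io-style ping URLs, where a plain GET signals success and appending /fail signals failure. The function names and the variable HEALTHARCHIVE_HC_PING_EXAMPLE are hypothetical and are not taken from the unit templates.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; helper and variable names are hypothetical.

# Build the URL to ping: the base URL on success, base + /fail otherwise.
ping_target() {
  local base="$1" exit_code="$2"
  if [ "$exit_code" -eq 0 ]; then
    echo "$base"
  else
    echo "${base%/}/fail"
  fi
}

# Ping only when a URL is configured, so hosts without the env file still work.
ping_if_configured() {
  local base="$1" exit_code="${2:-0}"
  if [ -z "$base" ]; then
    echo "skip: no ping URL configured"
    return 0
  fi
  # A failed ping must not fail the job itself.
  curl -fsS -m 10 --retry 3 -o /dev/null "$(ping_target "$base" "$exit_code")" || true
}
```

A job script could source /etc/healtharchive/healthchecks.env, run its work, and then call ping_if_configured "${HEALTHARCHIVE_HC_PING_EXAMPLE:-}" "$?". Because an unset variable is skipped rather than treated as an error, the same script runs unchanged on hosts where Healthchecks is not enabled.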
What “done” means
- External monitors are green and alert routing is confirmed.
- If enabled, Healthchecks pings are configured without committing URLs to git.
- If you use internal Prometheus-based alerts, Alertmanager is configured and test alerts deliver:
- observability-guide.md#6-configure-alerting
- If you use WARC tiering to a Storage Box, tiering metrics are enabled so you get high-signal alerts:
sudo systemctl enable --now healtharchive-tiering-metrics.timer
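To confirm that the timer actually took effect, a quick check along these lines may help. This is a sketch: check_timer is a hypothetical helper, and the SYSTEMCTL override exists only so the function can be exercised on machines without systemd.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; check_timer is a hypothetical helper.

# Print whether a unit is active, using `systemctl is-active --quiet`,
# which exits 0 only when the unit is active.
check_timer() {
  local unit="$1"
  if ${SYSTEMCTL:-systemctl} is-active --quiet "$unit"; then
    echo "${unit}: active"
  else
    echo "${unit}: NOT active"
    return 1
  fi
}
```

On the VPS, check_timer healtharchive-tiering-metrics.timer reports the current state, and systemctl list-timers healtharchive-tiering-metrics.timer shows when the timer is next due to fire.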