Monitoring + alerting playbook (operators)

Goal: detect user-visible outages and silent automation failures with low noise.

Canonical reference:

../../monitoring-and-ci-checklist.md

External uptime monitors (required)

Ensure monitors exist for:

https://api.healtharchive.ca/api/health
https://healtharchive.ca/archive
https://replay.healtharchive.ca/ (only if you rely on replay)

After making changes, smoke-test the monitored endpoints from any machine with internet access:

healtharchive/scripts/smoke-external-monitors.sh
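
If the repository script is not at hand, a manual check of the same URLs could look roughly like the sketch below. It is not the contents of smoke-external-monitors.sh. The helper names (is_ok_status, fetch_status) are hypothetical, and treating any 2xx response as healthy is an assumption; the real monitors may apply stricter checks.

```shell
#!/usr/bin/env bash
# Hypothetical manual check; NOT the repository's smoke-external-monitors.sh.
# Assumption: a 2xx final status (after redirects) means the endpoint is healthy.

# Succeeds only for a 2xx HTTP status code.
is_ok_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch a URL, following redirects, and print only the final HTTP status code.
# curl prints 000 when the request fails before any response arrives.
fetch_status() {
  curl --silent --output /dev/null --location --max-time 15 \
    --write-out '%{http_code}' "$1"
}

failures=0
# Each command-line argument is treated as one URL to check.
for url in "$@"; do
  status="$(fetch_status "$url")"
  if is_ok_status "$status"; then
    echo "OK   $status $url"
  else
    echo "FAIL $status $url"
    failures=$((failures + 1))
  fi
done

# Non-zero exit status if any URL failed.
[ "$failures" -eq 0 ]
```

Saved under a name of your choice (for example check-urls.sh), it would be run with the monitored URLs as arguments: bash check-urls.sh https://api.healtharchive.ca/api/health https://healtharchive.ca/archive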

“Timer ran” monitoring (optional, recommended)

If you want alerts when systemd timers stop running:

Create checks in your Healthchecks provider.
Store ping URLs only on the VPS:
/etc/healtharchive/healthchecks.env (root-owned)
This file may be shared across multiple automations; it is OK to keep both:
- legacy HC_* variables (DB backup + disk check)
- newer HEALTHARCHIVE_HC_PING_* variables (systemd unit templates)
Keep the unit templates installed/updated on the VPS:
sudo ./scripts/vps-install-systemd-units.sh --apply --restart-worker
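
Because the env file holds ping URLs, it can be worth confirming its ownership and mode after editing it. The function below is a minimal sketch, not part of the repository: check_secret_file is a hypothetical name, it relies on GNU stat (typical on Linux VPS images), and the rule that other users should have no access is an assumption that goes beyond the "root-owned" requirement stated above.

```shell
# Succeeds if FILE exists, is owned by OWNER (default: root), and grants
# no permissions to "other" users. Hypothetical helper; requires GNU stat.
check_secret_file() {
  local file="$1" owner="${2:-root}"
  local actual_owner mode
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
  actual_owner="$(stat -c '%U' "$file")" || return 1
  mode="$(stat -c '%a' "$file")" || return 1
  if [ "$actual_owner" != "$owner" ]; then
    echo "owner is $actual_owner, expected $owner: $file" >&2
    return 1
  fi
  # The last octal digit of the mode holds the permissions for "other" users.
  case "$mode" in
    *0) return 0 ;;
    *) echo "mode $mode lets other users access: $file" >&2; return 1 ;;
  esac
}
```

On the VPS this would be called as: check_secret_file /etc/healtharchive/healthchecks.env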

What “done” means

External monitors are green and alert routing is confirmed.
If enabled, Healthchecks pings are configured without committing URLs to git.
If you use internal Prometheus-based alerts, Alertmanager is configured and test alerts are delivered:
observability-guide.md#6-configure-alerting
If you use WARC tiering to a Storage Box, tiering metrics are enabled so you get high-signal alerts:
sudo systemctl enable --now healtharchive-tiering-metrics.timer
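
For the "without committing URLs to git" item above, a rough heuristic scan of a checkout might look like the sketch below. find_committed_ping_urls is a hypothetical helper, and the pattern is an assumption: it only catches variables whose names contain HC_ and that are assigned a literal http(s) URL, so a clean result does not prove that no URL was committed. It also scans the working tree rather than git history.

```shell
# Print lines under DIR where a variable containing "HC_" is assigned a literal URL.
# Returns 0 if nothing is found, 1 if matches are found, 2 if grep itself failed.
# Hypothetical helper with a heuristic pattern; it covers both HC_* and
# HEALTHARCHIVE_HC_PING_* style names because both contain "HC_".
find_committed_ping_urls() {
  local dir="${1:-.}"
  local pattern="HC_[A-Z0-9_]*=[\"']?https?://"
  grep -rEn --exclude-dir=.git -e "$pattern" "$dir"
  case $? in
    0) return 1 ;;  # grep found at least one suspicious line
    1) return 0 ;;  # grep found nothing
    *) return 2 ;;  # grep reported an error
  esac
}
```

Run from the repository root, it would be called as: find_committed_ping_urls .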