Healthchecks.io parity (env ↔ systemd ↔ Healthchecks)
Do not enable or change production automations until the annual crawl/scrape is finished and the campaign jobs are indexed.
Goal: ensure the Healthchecks.io dashboard is a faithful reflection of what the VPS actually runs (and only that).
This playbook focuses on three sources of truth:
1) systemd timers on the VPS (what actually runs)
2) /etc/healtharchive/healthchecks.env (which pings are wired on the VPS)
3) Healthchecks.io checks (what the dashboard expects to hear from)
Key rule:
- A Healthchecks.io check should exist iff there is a corresponding ping URL in /etc/healtharchive/healthchecks.env (or a legacy HC_* URL used by the disk/backup scripts).
If you follow that rule, the dashboard cannot drift into “checks we don’t use” or “missing checks for enabled jobs”.
Current state (as of 2026-01-03)
These pings are configured in /etc/healtharchive/healthchecks.env:
- HEALTHARCHIVE_HC_PING_REPLAY_RECONCILE → healtharchive-replay-reconcile.timer (daily)
- HEALTHARCHIVE_HC_PING_SCHEDULE_ANNUAL → healtharchive-schedule-annual.timer (yearly)
- HEALTHARCHIVE_HC_PING_ANNUAL_SENTINEL → healtharchive-annual-campaign-sentinel.timer (yearly)
- HEALTHARCHIVE_HC_PING_BASELINE_DRIFT → healtharchive-baseline-drift-check.timer (weekly)
- HEALTHARCHIVE_HC_PING_PUBLIC_VERIFY → healtharchive-public-surface-verify.timer (daily)
- HEALTHARCHIVE_HC_PING_ANNUAL_SEARCH_VERIFY → healtharchive-annual-search-verify.timer (daily)
- HEALTHARCHIVE_HC_PING_CHANGE_TRACKING → healtharchive-change-tracking.timer (daily)
- HEALTHARCHIVE_HC_PING_COVERAGE_GUARDRAILS → healtharchive-coverage-guardrails.timer (daily)
- HEALTHARCHIVE_HC_PING_REPLAY_SMOKE → healtharchive-replay-smoke.timer (daily)
Legacy script checks (separate from the systemd wrapper):
- HC_DB_BACKUP_URL → healtharchive-db-backup.timer (daily)
- HC_DISK_URL + HC_DISK_THRESHOLD → healtharchive-disk-check.timer (hourly)
Known “not wired (by design) right now”:
- HEALTHARCHIVE_HC_PING_CLEANUP_AUTOMATION exists as a ping var in the installed unit, but:
  - healtharchive-cleanup-automation.timer is disabled, and
  - the /etc/healtharchive/cleanup-automation-enabled sentinel is missing.
- Result: do not create a Healthchecks.io check or env var until cleanup automation is intentionally enabled.
Audit checklist (safe; no restarts)
1) List the ping vars currently configured (VPS env file)
This prints only variable names (not URLs):
sudo awk -F= '$1 ~ /^(HEALTHARCHIVE_HC_PING_|HC_)/ {print $1}' /etc/healtharchive/healthchecks.env | sort -u
2) List what timers are actually enabled (what will run)
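One way to answer this step, as a minimal sketch: the helper below prints the timer unit files whose state is "enabled". The function name enabled_timers and the healtharchive-*.timer glob are assumptions made for illustration, not something this playbook defines.

```shell
# Print enabled healtharchive timer unit files, one per line.
# Assumes every relevant timer is named healtharchive-*.timer.
enabled_timers() {
  systemctl list-unit-files 'healtharchive-*.timer' \
    --state=enabled --no-legend --no-pager |
    awk '{print $1}' | sort -u
}
```

Run enabled_timers for the enabled set. For last and next trigger times (including the NEXT column used in step 3), systemctl list-timers --all 'healtharchive-*' --no-pager shows the same units with their schedule.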
3) Confirm Healthchecks.io check schedules match reality
Use the NEXT column from systemctl list-timers to configure Healthchecks schedules:
- Hourly timers (disk): Healthchecks “1 hour” period + ~2 hours grace.
- Daily timers: Healthchecks “1 day” period + ~6 hours grace.
- Yearly timers (schedule annual + annual sentinel): Healthchecks cron in UTC + large grace (7–14 days).
If a yearly check is configured with a small grace (hours), it will look “down” most of the year.
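If you need these values in seconds (for example when scripting check settings rather than typing them into the dashboard), the arithmetic can be sketched as below. hc_schedule_for is a hypothetical helper name, and the output format is only illustrative; the numbers come straight from the guidance above.

```shell
# Map a timer cadence to the period and grace suggested above, in seconds.
# hc_schedule_for is a hypothetical helper, not part of the repo.
hc_schedule_for() {
  case "$1" in
    hourly) echo "period=$((60 * 60)) grace=$((2 * 60 * 60))" ;;
    daily)  echo "period=$((24 * 60 * 60)) grace=$((6 * 60 * 60))" ;;
    # Yearly checks use a cron expression in UTC; only the grace range is numeric.
    yearly) echo "schedule=cron-utc grace=$((7 * 24 * 60 * 60))..$((14 * 24 * 60 * 60))" ;;
    *)      echo "unknown cadence: $1" >&2; return 1 ;;
  esac
}
```

For example, hc_schedule_for daily prints period=86400 grace=21600.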
4) Interpret Healthchecks "DOWN" / "UP" against systemd state
When triaging a timer-backed Healthchecks alert, use systemd as the source of truth:
- sudo systemctl status <unit>.timer --no-pager -l shows the timer's loaded state plus its last/next trigger times.
- sudo systemctl list-timers --all | grep <unit-prefix> is the fastest cross-check for what actually fired and when.
- sudo journalctl -u <unit>.service --no-pager -l shows the wrapped command's real output and whether a later rerun succeeded.
Important interpretation rule:
- A later Healthchecks "UP" notification only proves the wrapped service pinged success.
- It does not prove the scheduled timer fired on time.
- A manual sudo systemctl start <unit>.service rerun sends the same success ping and will clear the check.
Journal visibility caveats:
- journalctl -u <unit>.timer often needs sudo because timer logs are emitted by systemd, not by the service user.
- Even with sudo, journalctl --since ... can still show -- No entries -- if the timer's visible startup log line is older than the --since cutoff.
- In that case, trust systemctl status / systemctl list-timers --all for the timer trigger history and journalctl -u <unit>.service for the command outcome.
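To separate "the timer fired" from "someone reran the service", it can help to print both timestamps side by side. This is a sketch under assumptions: timer_vs_service is a hypothetical helper, and it relies on the LastTriggerUSec (timer) and ExecMainStartTimestamp (service) properties, which may be empty if the unit has not run since boot.

```shell
# Print when the timer last triggered and when the service last started.
# $1 is the unit name without a suffix, e.g. healtharchive-replay-smoke.
timer_vs_service() {
  echo "timer last trigger: $(systemctl show "$1.timer" -p LastTriggerUSec --value)"
  echo "service last start: $(systemctl show "$1.service" -p ExecMainStartTimestamp --value)"
}
```

If the service start is noticeably later than the timer's last trigger, the most recent success ping probably came from a manual rerun rather than from the schedule. Treat that as a hint and confirm it in the service journal.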
Reconcile: achieve 1:1 parity (what exists vs what should exist)
Rule A — If a timer is enabled and important, it should have a Healthchecks ping
For each enabled “important outcome” timer, ensure:
1) A Healthchecks.io check exists
2) Its ping URL is stored in /etc/healtharchive/healthchecks.env
Important outcome timers (recommended to monitor):
- healtharchive-replay-reconcile.timer
- healtharchive-public-surface-verify.timer
- healtharchive-change-tracking.timer
- healtharchive-coverage-guardrails.timer
- healtharchive-replay-smoke.timer
- healtharchive-annual-search-verify.timer
- healtharchive-baseline-drift-check.timer
- healtharchive-schedule-annual.timer (yearly)
- healtharchive-annual-campaign-sentinel.timer (yearly)
- legacy: healtharchive-db-backup.timer, healtharchive-disk-check.timer
High-frequency timers (recommended NOT to monitor in Healthchecks; too noisy):
- healtharchive-crawl-metrics.timer
- healtharchive-tiering-metrics.timer
- healtharchive-crawl-auto-recover.timer
- healtharchive-storage-hotpath-auto-recover.timer
Rule B — If a Healthchecks.io check exists, it must correspond to a real job you run
If a Healthchecks.io check exists but:
- there is no enabled timer for it, and
- it is not one of the legacy script checks,
then delete it in Healthchecks.io and remove its env var from /etc/healtharchive/healthchecks.env.
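Rules A and B together amount to a set comparison between the variables wired in the env file and the checks on the dashboard. A minimal sketch, assuming you keep a hand-maintained text file listing the ping variable name for each Healthchecks.io check (that file and the parity_diff helper are hypothetical):

```shell
# Compare two lists of ping variable names, one name per line.
#   $1: names wired in the env file (e.g. output of the awk command in audit step 1)
#   $2: names that have a check in Healthchecks.io (hypothetical hand-kept list)
parity_diff() {
  env_sorted="$(mktemp)"; checks_sorted="$(mktemp)"
  LC_ALL=C sort -u "$1" > "$env_sorted"
  LC_ALL=C sort -u "$2" > "$checks_sorted"
  # Wired on the VPS but no dashboard check: create the check (Rule A).
  LC_ALL=C comm -23 "$env_sorted" "$checks_sorted" | sed 's/^/MISSING_CHECK /'
  # Dashboard check with no wired ping: candidate for deletion (Rule B).
  LC_ALL=C comm -13 "$env_sorted" "$checks_sorted" | sed 's/^/ORPHAN_CHECK /'
  rm -f "$env_sorted" "$checks_sorted"
}
```

Empty output means the two lists agree. Remove non-URL settings such as HC_DISK_THRESHOLD from the first list before comparing, since they never have a check of their own.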
Cleanup automation: what remains to do (deferred until after crawl)
Cleanup automation is currently installed but intentionally disabled:
- Timer: healtharchive-cleanup-automation.timer (disabled)
- Sentinel: /etc/healtharchive/cleanup-automation-enabled (missing)
- Ping var supported by the unit: HEALTHARCHIVE_HC_PING_CLEANUP_AUTOMATION
Why we are waiting
Enabling cleanup changes production behavior (even if intended to be safe). Defer until after crawl so:
- we avoid adding churn during the annual campaign,
- we can review retention expectations and confirm cleanup boundaries.
Post-crawl enablement checklist (when you decide “yes, enable cleanup”)
1) Review the cleanup behavior and config:
- Playbook: ../crawl/cleanup-automation.md
- Config: /opt/healtharchive/ops/automation/cleanup-automation.toml
2) Decide Healthchecks schedule (from the timer):
Current schedule (template): weekly Sunday 04:45 UTC.
Recommended Healthchecks schedule for that timer:
- Cron (UTC): 45 4 * * 0
- Grace: 2 days
3) Create the Healthchecks.io check:
- Name: healtharchive-cleanup-automation
- Schedule: cron above (UTC)
- Grace: 2 days
4) Add the ping URL to /etc/healtharchive/healthchecks.env:
Add:
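The line would look roughly like this. The variable name is the one the unit supports; the URL is a placeholder and must be replaced with the ping URL that Healthchecks.io shows for the new check.

```shell
# Placeholder only: replace the URL with the ping URL shown for the new check.
HEALTHARCHIVE_HC_PING_CLEANUP_AUTOMATION="https://hc-ping.com/REPLACE-WITH-CHECK-UUID"
```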
5) Enable cleanup automation (two gates):
sudo install -d -m 0755 /etc/healtharchive
sudo touch /etc/healtharchive/cleanup-automation-enabled
sudo systemctl enable --now healtharchive-cleanup-automation.timer
6) Verify the ping wiring (safe; does not run cleanup):
sudo bash -lc 'set -a; source /etc/healtharchive/healthchecks.env; set +a; /opt/healtharchive/scripts/systemd-healthchecks-wrapper.sh --ping-var HEALTHARCHIVE_HC_PING_CLEANUP_AUTOMATION -- echo ok'
7) Verify real runs on the next scheduled window:
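A sketch of what that verification might look like, reusing the triage commands from the audit section. verify_cleanup_run is a hypothetical helper, and the 8-day window is an assumption chosen to cover one weekly run; run it as a user with journal access.

```shell
# Show the timer's last/next trigger and the service's recent log lines.
verify_cleanup_run() {
  unit=healtharchive-cleanup-automation
  systemctl list-timers --all "$unit.timer" --no-pager
  journalctl -u "$unit.service" --since "8 days ago" --no-pager
}
```

After the first scheduled window, the timer's LAST column should show the run, the journal should show the cleanup output, and the Healthchecks.io check should have received a ping.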
If you decide “no, don’t enable cleanup”, keep it disabled and do not create the Healthchecks check or env var.