
Production baseline drift checks (internal)

Goal: avoid “configuration drift”, where production stops matching what the project expects (security posture, permissions, service units, etc.).

This is implemented as:

1) Desired state (in git): production-baseline-policy.toml
2) Observed state (generated on the VPS): JSON snapshots written to /srv/healtharchive/ops/baseline/
3) Drift check: compares observed vs policy and fails on required mismatches
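The compare step can be pictured with a minimal sketch like the one below. It is not the code in check_baseline_drift.py: the policy layout (an "expected" value plus a "required" flag per key) and the key names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    key: str
    expected: object
    observed: object
    required: bool


def find_drift(policy: dict, observed: dict) -> list:
    """Return one Finding per policy key whose observed value differs."""
    findings = []
    for key, rule in policy.items():
        actual = observed.get(key)
        if actual != rule["expected"]:
            # Hypothetical convention: a rule is required unless it says otherwise.
            findings.append(
                Finding(key, rule["expected"], actual, rule.get("required", True))
            )
    return findings


def exit_code(findings: list) -> int:
    """Non-zero only when at least one required check drifted."""
    return 1 if any(f.required for f in findings) else 0


# Hypothetical keys; the real policy file defines its own names.
policy = {
    "caddy.hsts_enabled": {"expected": True, "required": True},
    "admin.endpoint_protected": {"expected": True, "required": False},
}
observed = {"caddy.hsts_enabled": True, "admin.endpoint_protected": False}
findings = find_drift(policy, observed)
```

Here the only mismatch is on a non-required key, so `exit_code(findings)` is 0 and the drift would surface as a warning rather than a failure.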

Files

  • Policy (edit in git): production-baseline-policy.toml
  • Snapshot generator: ../../scripts/baseline_snapshot.py
  • Drift checker: ../../scripts/check_baseline_drift.py

On the VPS (as haadmin):

cd /opt/healtharchive
./scripts/check_baseline_drift.py --mode live

This writes:

  • observed-<timestamp>.json (machine-readable)
  • drift-report-<timestamp>.txt (human-readable)
  • plus observed-latest.json and drift-report-latest.txt

All files live under /srv/healtharchive/ops/baseline/.
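The "timestamped file plus a -latest copy" layout can be sketched as follows. This is an illustration, not the project's writer: the timestamp format and the `write_snapshot` helper are assumptions, and the demo writes to a temporary directory instead of /srv/healtharchive/ops/baseline/.

```python
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_snapshot(out_dir: Path, observed: dict) -> Path:
    """Write observed-<timestamp>.json and refresh observed-latest.json."""
    # Assumed timestamp format; the real scripts may use a different one.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    stamped = out_dir / f"observed-{stamp}.json"
    stamped.write_text(json.dumps(observed, indent=2, sort_keys=True), encoding="utf-8")
    # Keep a stable name that always points at the most recent run.
    shutil.copyfile(stamped, out_dir / "observed-latest.json")
    return stamped


demo_dir = Path(tempfile.mkdtemp())
stamped = write_snapshot(demo_dir, {"caddy.hsts_enabled": True})
```

The stable -latest name lets other tooling read the most recent result without having to sort timestamps.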

“Local only” mode (no network dependency)

Use local-only mode when you want checks that don’t depend on DNS/TLS/external routing:

./scripts/check_baseline_drift.py --mode local

In local mode:

  • HSTS is validated by parsing /etc/caddy/Caddyfile for the API site block.
  • Admin endpoint checks are skipped (warn-only).
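As a rough illustration of the local HSTS check, the sketch below pulls one site block out of Caddyfile text and looks for a Strict-Transport-Security header in it. The hostnames and helper names are hypothetical, the brace matching is deliberately naive, and the real checker may parse the file differently.

```python
import re
from typing import Optional


def site_block(caddyfile: str, site: str) -> Optional[str]:
    """Return the body of the block whose address line names `site`."""
    lines = caddyfile.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped.endswith("{"):
            continue
        addresses = stripped[:-1].replace(",", " ").split()
        if site not in addresses:
            continue
        depth, body = 1, []
        for inner in lines[i + 1:]:
            depth += inner.count("{") - inner.count("}")
            if depth <= 0:
                return "\n".join(body)
            body.append(inner)
    return None


def hsts_max_age(caddyfile: str, site: str) -> Optional[int]:
    """Return the HSTS max-age configured for `site`, or None if absent."""
    block = site_block(caddyfile, site)
    if block is None:
        return None
    match = re.search(r'Strict-Transport-Security\s+"?max-age=(\d+)', block)
    return int(match.group(1)) if match else None


# Hypothetical Caddyfile; the production file lives at /etc/caddy/Caddyfile.
CADDYFILE = """\
api.example.org {
    header Strict-Transport-Security "max-age=31536000; includeSubDomains"
    reverse_proxy 127.0.0.1:8001
}

www.example.org {
    reverse_proxy 127.0.0.1:3000
}
"""
```

With this input, `hsts_max_age(CADDYFILE, "api.example.org")` returns 31536000, while the second site yields None because its block sets no HSTS header.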

Optional: weekly drift timer (systemd)

If you want drift checks to run automatically (not just during deploys), this repo includes a weekly systemd timer:

  • Templates: docs/deployment/systemd/healtharchive-baseline-drift-check.*
  • Installer helper (VPS): scripts/vps-install-systemd-units.sh --apply
  • Enablement steps: docs/deployment/systemd/README.md

Remediation rule of thumb

When drift is limited to a file generated by one of the VPS installer helpers, prefer re-running the matching installer instead of editing the baseline policy or patching the file by hand.

Example:

  • If drift reports /etc/grafana/provisioning/dashboards/healtharchive.yaml, re-run sudo ./scripts/vps-install-observability-dashboards.sh --apply. That installer is the canonical writer for the Grafana dashboards provisioning file and restores the expected root:root 0644 state.
  • After fixing required drift, verify with ./scripts/check_baseline_drift.py --mode local or rerun the systemd wrapper with sudo systemctl start healtharchive-baseline-drift-check.service.
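For a sense of what an ownership and mode check such as "root:root 0644" involves, here is a minimal sketch using numeric IDs (root is uid 0 and gid 0). The `ownership_drift` helper is hypothetical, and the demo runs against a temporary file owned by the current user rather than the Grafana provisioning file.

```python
import os
import stat
import tempfile


def ownership_drift(path: str, uid: int = 0, gid: int = 0, mode: int = 0o644) -> list:
    """Return human-readable mismatches; an empty list means no drift."""
    st = os.stat(path)
    problems = []
    if st.st_uid != uid:
        problems.append(f"owner uid {st.st_uid} != {uid}")
    if st.st_gid != gid:
        problems.append(f"group gid {st.st_gid} != {gid}")
    actual_mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if actual_mode != mode:
        problems.append(f"mode {actual_mode:04o} != {mode:04o}")
    return problems


# Demo: a temp file with the wrong mode, checked against its own owner and group.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
os.chmod(demo_path, 0o600)
me = os.stat(demo_path)
demo_problems = ownership_drift(demo_path, uid=me.st_uid, gid=me.st_gid, mode=0o644)
```

In the demo only the mode differs, so `demo_problems` holds a single entry. On the VPS, re-running the matching installer is still the preferred fix over a manual chown or chmod.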

Healthchecks recovery note

If healtharchive-baseline-drift-check.service is wired to Healthchecks, a manual rerun after remediation will send the same success ping as the scheduled weekly timer.

  • A later Healthchecks "UP" notification confirms the service passed on that rerun.
  • It does not prove the weekly timer fired on schedule.
  • To distinguish "scheduled timer run" from "manual recovery rerun", compare:
      • sudo systemctl status healtharchive-baseline-drift-check.timer --no-pager -l
      • sudo systemctl list-timers --all | grep healtharchive-baseline-drift-check
      • sudo journalctl -u healtharchive-baseline-drift-check.service --no-pager -l

CORS validation

The policy enforces a strict production allowlist (no extra origins) via HEALTHARCHIVE_CORS_ORIGINS.

  • --mode local validates the env file value (CSV set comparison).
  • --mode live additionally probes the API with an Origin: header and checks real Access-Control-Allow-Origin behavior.
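The "CSV set comparison" used in local mode can be sketched as below. The origins shown are placeholders and the helper names are hypothetical; the point is that order and surrounding whitespace do not matter, while any extra or missing origin counts as drift.

```python
def parse_origins(csv_value: str) -> set:
    """Split a comma-separated origins string into a set, ignoring blanks."""
    return {item.strip() for item in csv_value.split(",") if item.strip()}


def compare_origins(env_value: str, allowed: set) -> tuple:
    """Return (extra, missing) origins relative to the policy allowlist."""
    configured = parse_origins(env_value)
    return configured - allowed, allowed - configured


# Placeholder origins; the real allowlist comes from the policy file.
allowed = {"https://www.example.org", "https://example.org"}

ok_extra, ok_missing = compare_origins(
    "https://example.org, https://www.example.org", allowed
)
bad_extra, bad_missing = compare_origins(
    "https://example.org,http://localhost:3000", allowed
)
```

The first comparison yields two empty sets (no drift). The second reports `http://localhost:3000` as extra and `https://www.example.org` as missing, either of which a strict allowlist would reject.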

Replay + usage invariants

The policy can also pin “public UX” toggles that affect what users see:

  • HEALTHARCHIVE_REPLAY_BASE_URL and HEALTHARCHIVE_REPLAY_PREVIEW_DIR (replay browse URLs + previews)
  • HEALTHARCHIVE_USAGE_METRICS_ENABLED and HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS (public /status + /impact)

When to update policy

Update production-baseline-policy.toml only when you intentionally change production invariants:

  • URL strategy (adding staging, changing canonical domains)
  • security posture (HSTS policy, admin auth policy)
  • directory layout / ownership model
  • systemd service names or enablement expectations

Avoid adding “things that change often” to policy (package versions, job counts, etc.).