# Production baseline drift checks (internal)
Goal: avoid “configuration drift” where production stops matching what the project expects (security posture, perms, service units, etc.).
This is implemented as:

1. Desired state (in git): `production-baseline-policy.toml`
2. Observed state (generated on the VPS): JSON snapshots written to `/srv/healtharchive/ops/baseline/`
3. Drift check: compares observed state against the policy and fails on required mismatches
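Conceptually, the comparison step is a recursive walk of policy vs. observed values. The sketch below is a simplified illustration only; the real logic, key names, and required-vs-warn handling live in `scripts/check_baseline_drift.py`:

```python
# Illustrative drift comparison (not the real script's API).

def find_drift(policy: dict, observed: dict, prefix: str = "") -> list[str]:
    """Return dotted paths where observed values diverge from policy."""
    mismatches = []
    for key, expected in policy.items():
        path = f"{prefix}{key}"
        actual = observed.get(key)
        if isinstance(expected, dict) and isinstance(actual, dict):
            mismatches.extend(find_drift(expected, actual, prefix=f"{path}."))
        elif actual != expected:
            mismatches.append(f"{path}: expected {expected!r}, observed {actual!r}")
    return mismatches

# Hypothetical policy/observed shapes:
policy = {"hsts": {"max_age": 31536000}, "services": {"caddy": "enabled"}}
observed = {"hsts": {"max_age": 31536000}, "services": {"caddy": "disabled"}}
print(find_drift(policy, observed))
# → ["services.caddy: expected 'enabled', observed 'disabled'"]
```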
## Files

- Policy (edit in git): `production-baseline-policy.toml`
- Snapshot generator: `../../scripts/baseline_snapshot.py`
- Drift checker: `../../scripts/check_baseline_drift.py`
## One-shot usage (recommended after any production change)
On the VPS (as `haadmin`), run the snapshot generator and then the drift checker (the scripts listed above).

This writes:

- `observed-<timestamp>.json` (machine-readable)
- `drift-report-<timestamp>.txt` (human-readable)
- plus `observed-latest.json` and `drift-report-latest.txt`

All files live under `/srv/healtharchive/ops/baseline/`.
## “Local only” mode (no network dependency)
Use local-only mode when you want checks that don’t depend on DNS/TLS/external routing:
In local mode:
- HSTS is validated by parsing `/etc/caddy/Caddyfile` for the API site block.
- Admin endpoint checks are skipped (warn-only).
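The Caddyfile HSTS check could look roughly like the following. The site name, regex, and block matching are simplified assumptions; the real parsing lives in `scripts/check_baseline_drift.py`:

```python
# Sketch: does a Caddyfile site block set Strict-Transport-Security?
import re

def caddyfile_has_hsts(text: str, site: str) -> bool:
    # Naive block match: from "<site> {" to the first "\n}" (enough for a flat block).
    block = re.search(re.escape(site) + r"\s*\{(.*?)\n\}", text, re.S)
    if not block:
        return False
    return "Strict-Transport-Security" in block.group(1)

# "api.example.org" is a placeholder, not the real production domain.
caddyfile = """api.example.org {
    header Strict-Transport-Security "max-age=31536000"
}
"""
print(caddyfile_has_hsts(caddyfile, "api.example.org"))  # → True
```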
## Optional: weekly drift timer (systemd)
If you want drift checks to run automatically (not just during deploys), this repo includes a weekly systemd timer:
- Templates: `docs/deployment/systemd/healtharchive-baseline-drift-check.*`
- Installer helper (VPS): `scripts/vps-install-systemd-units.sh --apply`
- Enablement steps: `docs/deployment/systemd/README.md`
## Remediation rule of thumb
When drift is limited to a file generated by one of the VPS installer helpers, prefer re-running the matching installer instead of editing the baseline policy or patching the file by hand.
Example:

- If drift reports `/etc/grafana/provisioning/dashboards/healtharchive.yaml`, re-run `sudo ./scripts/vps-install-observability-dashboards.sh --apply`. That installer is the canonical writer for the Grafana dashboards provisioning file and restores the expected `root:root` `0644` state.
- After fixing required drift, verify with `./scripts/check_baseline_drift.py --mode local`, or rerun the systemd wrapper with `sudo systemctl start healtharchive-baseline-drift-check.service`.
## Healthchecks recovery note

If `healtharchive-baseline-drift-check.service` is wired to Healthchecks, a manual rerun after remediation will send the same success ping as the scheduled weekly timer.
- A later Healthchecks "UP" notification confirms the service passed on that rerun.
- It does not prove the weekly timer fired on schedule.
- To distinguish a "scheduled timer run" from a "manual recovery rerun", compare:
  - `sudo systemctl status healtharchive-baseline-drift-check.timer --no-pager -l`
  - `sudo systemctl list-timers --all | grep healtharchive-baseline-drift-check`
  - `sudo journalctl -u healtharchive-baseline-drift-check.service --no-pager -l`
## CORS validation

The policy enforces a strict production allowlist (no extra origins) via `HEALTHARCHIVE_CORS_ORIGINS`.

- `--mode local` validates the env file value (CSV set comparison).
- `--mode live` additionally probes the API with an `Origin:` header and checks real `Access-Control-Allow-Origin` behavior.
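The local-mode "CSV set comparison" amounts to parsing the env value into a set and diffing it against the allowlist. Function and variable names below are illustrative; only the env var name `HEALTHARCHIVE_CORS_ORIGINS` comes from this doc:

```python
# Sketch: order-insensitive, whitespace-tolerant CORS allowlist comparison.

def cors_drift(env_value: str, allowed: set[str]) -> tuple[set[str], set[str]]:
    observed = {o.strip() for o in env_value.split(",") if o.strip()}
    extra = observed - allowed      # origins that must not be present
    missing = allowed - observed    # required origins that are absent
    return extra, missing

# Placeholder origins, not the real production allowlist:
extra, missing = cors_drift(
    "https://app.example.org, https://evil.example.net",
    {"https://app.example.org"},
)
print(extra, missing)  # → {'https://evil.example.net'} set()
```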
## Replay + usage invariants
The policy can also pin “public UX” toggles that affect what users see:
- `HEALTHARCHIVE_REPLAY_BASE_URL` and `HEALTHARCHIVE_REPLAY_PREVIEW_DIR` (replay browse URLs + previews)
- `HEALTHARCHIVE_USAGE_METRICS_ENABLED` and `HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS` (public `/status` + `/impact`)
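Pinned env toggles like these can be checked by a straight value comparison against the policy. The env var names are the real ones from this doc; the helper, its shape, and the pinned values shown are assumptions:

```python
# Sketch: report env toggles that diverge from their pinned policy values.

def pinned_env_drift(env: dict[str, str], pins: dict[str, str]) -> list[str]:
    return [
        f"{key}: expected {want!r}, observed {env.get(key)!r}"
        for key, want in pins.items()
        if env.get(key) != want
    ]

# Illustrative pins; real values belong in production-baseline-policy.toml.
pins = {
    "HEALTHARCHIVE_USAGE_METRICS_ENABLED": "true",
    "HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS": "30",
}
env = {
    "HEALTHARCHIVE_USAGE_METRICS_ENABLED": "true",
    "HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS": "7",
}
print(pinned_env_drift(env, pins))
# → ["HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS: expected '30', observed '7'"]
```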
## When to update policy

Update `production-baseline-policy.toml` only when you intentionally change production invariants:
- URL strategy (adding staging, changing canonical domains)
- security posture (HSTS policy, admin auth policy)
- directory layout / ownership model
- systemd service names or enablement expectations
Avoid adding “things that change often” to policy (package versions, job counts, etc.).
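For illustration, a policy fragment pinning invariants like the ones above might look as follows. Section and key names here are hypothetical; the authoritative schema is whatever `production-baseline-policy.toml` already uses:

```toml
# Hypothetical fragment — real section/key names may differ.
[security.hsts]
required = true
min_max_age = 31536000

[services."healtharchive-baseline-drift-check.timer"]
enabled = true
```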