# Production baseline drift checks (internal)
Goal: avoid “configuration drift” where production stops matching what the project expects (security posture, perms, service units, etc.).
This is implemented as:

1. Desired state (in git): `production-baseline-policy.toml`
2. Observed state (generated on the VPS): JSON snapshots written to `/srv/healtharchive/ops/baseline/`
3. Drift check: compares observed state against the policy and fails on required mismatches
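Conceptually, the comparison step is a recursive walk of policy vs. observed values. The sketch below is a simplified illustration only; the real logic, key names, and required-vs-warn handling live in `scripts/check_baseline_drift.py`:

```python
# Illustrative drift comparison (not the real script's API).

def find_drift(policy: dict, observed: dict, prefix: str = "") -> list[str]:
    """Return dotted paths where observed values diverge from policy."""
    mismatches = []
    for key, expected in policy.items():
        path = f"{prefix}{key}"
        actual = observed.get(key)
        if isinstance(expected, dict) and isinstance(actual, dict):
            mismatches.extend(find_drift(expected, actual, prefix=f"{path}."))
        elif actual != expected:
            mismatches.append(f"{path}: expected {expected!r}, observed {actual!r}")
    return mismatches

# Hypothetical policy/observed shapes:
policy = {"hsts": {"max_age": 31536000}, "services": {"caddy": "enabled"}}
observed = {"hsts": {"max_age": 31536000}, "services": {"caddy": "disabled"}}
print(find_drift(policy, observed))
# → ["services.caddy: expected 'enabled', observed 'disabled'"]
```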
## Files

- Policy (edit in git): `production-baseline-policy.toml`
- Snapshot generator: `../../scripts/baseline_snapshot.py`
- Drift checker: `../../scripts/check_baseline_drift.py`
## One-shot usage (recommended after any production change)
On the VPS (as `haadmin`), run the snapshot generator and then the drift checker (the scripts listed above).

This writes:

- `observed-<timestamp>.json` (machine-readable)
- `drift-report-<timestamp>.txt` (human-readable)
- plus `observed-latest.json` and `drift-report-latest.txt`

All files live under `/srv/healtharchive/ops/baseline/`.
## “Local only” mode (no network dependency)
Use local-only mode when you want checks that don’t depend on DNS/TLS/external routing:
In local mode:
- HSTS is validated by parsing `/etc/caddy/Caddyfile` for the API site block.
- Admin endpoint checks are skipped (warn-only).
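The Caddyfile HSTS check could look roughly like the following. The site name, regex, and block matching are simplified assumptions; the real parsing lives in `scripts/check_baseline_drift.py`:

```python
# Sketch: does a Caddyfile site block set Strict-Transport-Security?
import re

def caddyfile_has_hsts(text: str, site: str) -> bool:
    # Naive block match: from "<site> {" to the first "\n}" (enough for a flat block).
    block = re.search(re.escape(site) + r"\s*\{(.*?)\n\}", text, re.S)
    if not block:
        return False
    return "Strict-Transport-Security" in block.group(1)

# "api.example.org" is a placeholder, not the real production domain.
caddyfile = """api.example.org {
    header Strict-Transport-Security "max-age=31536000"
}
"""
print(caddyfile_has_hsts(caddyfile, "api.example.org"))  # → True
```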
## Optional: weekly drift timer (systemd)
If you want drift checks to run automatically (not just during deploys), this repo includes a weekly systemd timer:
- Templates: `docs/deployment/systemd/healtharchive-baseline-drift-check.*`
- Installer helper (VPS): `scripts/vps-install-systemd-units.sh --apply`
- Enablement steps: `docs/deployment/systemd/README.md`
## Remediation rule of thumb
When drift is limited to a file generated by one of the VPS installer helpers, prefer re-running the matching installer instead of editing the baseline policy or patching the file by hand.
Example:

- If drift reports `/etc/grafana/provisioning/dashboards/healtharchive.yaml`, re-run `sudo ./scripts/vps-install-observability-dashboards.sh --apply`. That installer is the canonical writer for the Grafana dashboards provisioning file and restores the expected `root:root` `0644` state.
- After fixing required drift, verify with `./scripts/check_baseline_drift.py --mode local`, or rerun the systemd wrapper with `sudo systemctl start healtharchive-baseline-drift-check.service`.
## Healthchecks recovery note

If `healtharchive-baseline-drift-check.service` is wired to Healthchecks, a manual rerun after remediation will send the same success ping as the scheduled weekly timer.
- A later Healthchecks "UP" notification confirms the service passed on that rerun.
- It does not prove the weekly timer fired on schedule.
- To distinguish a "scheduled timer run" from a "manual recovery rerun", compare:
  - `sudo systemctl status healtharchive-baseline-drift-check.timer --no-pager -l`
  - `sudo systemctl list-timers --all | grep healtharchive-baseline-drift-check`
  - `sudo journalctl -u healtharchive-baseline-drift-check.service --no-pager -l`
## CORS validation

The policy enforces a strict production allowlist (no extra origins) via `HEALTHARCHIVE_CORS_ORIGINS`.

- `--mode local` validates the env file value (CSV set comparison).
- `--mode live` additionally probes the API with an `Origin:` header and checks real `Access-Control-Allow-Origin` behavior.
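The local-mode "CSV set comparison" amounts to parsing the env value into a set and diffing it against the allowlist. Function and variable names below are illustrative; only the env var name `HEALTHARCHIVE_CORS_ORIGINS` comes from this doc:

```python
# Sketch: order-insensitive, whitespace-tolerant CORS allowlist comparison.

def cors_drift(env_value: str, allowed: set[str]) -> tuple[set[str], set[str]]:
    observed = {o.strip() for o in env_value.split(",") if o.strip()}
    extra = observed - allowed      # origins that must not be present
    missing = allowed - observed    # required origins that are absent
    return extra, missing

# Placeholder origins, not the real production allowlist:
extra, missing = cors_drift(
    "https://app.example.org, https://evil.example.net",
    {"https://app.example.org"},
)
print(extra, missing)  # → {'https://evil.example.net'} set()
```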
## Replay + usage invariants
The policy can also pin “public UX” toggles that affect what users see:
- `HEALTHARCHIVE_REPLAY_BASE_URL` and `HEALTHARCHIVE_REPLAY_PREVIEW_DIR` (replay browse URLs + previews)
- `HEALTHARCHIVE_USAGE_METRICS_ENABLED` and `HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS` (public `/status` + `/impact`)
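Pinned env toggles like these can be checked by a straight value comparison against the policy. The env var names are the real ones from this doc; the helper, its shape, and the pinned values shown are assumptions:

```python
# Sketch: report env toggles that diverge from their pinned policy values.

def pinned_env_drift(env: dict[str, str], pins: dict[str, str]) -> list[str]:
    return [
        f"{key}: expected {want!r}, observed {env.get(key)!r}"
        for key, want in pins.items()
        if env.get(key) != want
    ]

# Illustrative pins; real values belong in production-baseline-policy.toml.
pins = {
    "HEALTHARCHIVE_USAGE_METRICS_ENABLED": "true",
    "HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS": "30",
}
env = {
    "HEALTHARCHIVE_USAGE_METRICS_ENABLED": "true",
    "HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS": "7",
}
print(pinned_env_drift(env, pins))
# → ["HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS: expected '30', observed '7'"]
```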
## When to update policy

Update `production-baseline-policy.toml` only when you intentionally change production invariants:
- URL strategy (adding staging, changing canonical domains)
- security posture (HSTS policy, admin auth policy)
- directory layout / ownership model
- systemd service names or enablement expectations
Avoid adding “things that change often” to policy (package versions, job counts, etc.).
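For illustration, a policy fragment pinning invariants like the ones above might look as follows. Section and key names here are hypothetical; the authoritative schema is whatever `production-baseline-policy.toml` already uses:

```toml
# Hypothetical fragment — real section/key names may differ.
[security.hsts]
required = true
min_max_age = 31536000

[services."healtharchive-baseline-drift-check.timer"]
enabled = true
```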