Deploy + verify playbook (production VPS)

Goal: deploy a known-good main and verify production matches policy.

Canonical references:

Preconditions

CI is green on the commit you intend to deploy.
You are on the production VPS and can sudo.
If an incident fix depends on repository changes, make sure those changes are already committed and pushed before you start the VPS procedure.

Identify the exact commit to deploy.
Prefer a pinned deploy ref for incident work so the VPS change is unambiguous.
Verify the commit is already present on origin/main before running the VPS deploy.
Run the deploy gate (recommended one command):
cd /opt/healtharchive && ./scripts/vps-deploy.sh --apply --baseline-mode live
Or pinned: cd /opt/healtharchive && ./scripts/vps-deploy.sh --apply --baseline-mode live --ref <GIT_SHA>

Recommended wrapper (routine use):

This includes:

If your change updates systemd unit templates or Prometheus alert rules, you can apply those as part of the deploy:

./scripts/vps-deploy.sh --apply --baseline-mode live --install-systemd-units --apply-alerting

If the public frontend is externally unavailable, you can deploy backend-only:

Optional: install the wrapper outside the repo so it never dirties /opt/healtharchive:

Notes:

Do not run git pull separately as a substitute for the deploy helper. Use the deploy helper to fetch, install dependencies, run migrations, and restart the right services coherently.
If the operator is following assistant-provided incident commands, do not continue to repo-dependent recovery steps until the deploy output confirms the intended ref is active.
Prefer a real command over an alias; aliases can accidentally persist set -euo pipefail in your interactive shell.
If hetzdeploy --mode backend-only errors with syntax error near unexpected token, you probably still have an alias named hetzdeploy.
- Check: type hetzdeploy
- Remove: unalias hetzdeploy 2>/dev/null || true and delete the alias line from your shell startup files.
--apply-alerting requires alerting to be configured on the VPS (webhook secret present at /etc/healtharchive/observability/alertmanager_webhook_url).

If you are updating the replay banner/template or replay service config on a single-VPS deployment, include replay restart + banner install:

Crawl safety:

If any jobs are status=running, the deploy helper will skip restarting healtharchive-worker by default to avoid SIGTERMing an active crawl.
When you need to force a worker restart (only when safe): ./scripts/vps-deploy.sh --apply --baseline-mode live --force-worker-restart
If you want to explicitly keep the worker untouched regardless of job status: ./scripts/vps-deploy.sh --apply --baseline-mode live --skip-worker-restart
If the deploy gate fails:
Do not retry blindly.
Read the failure output:
- drift report artifacts under /srv/healtharchive/ops/baseline/
- verifier output from verify_public_surface.py
Fix the underlying mismatch (production state vs policy) or intentionally update policy.