Runbook: VPS Backend Deployment (operators)
Purpose: This runbook describes the canonical workflow for deploying the latest healtharchive backend and enabling automated incident recovery (drift auto-reconcile).
Scope
- Environment: production (single VPS)
- Audience: operator
- Non-goals: This does not cover the initial VPS setup/provisioning (see production-single-vps.md).
Preconditions
- Required access: Tailscale access to the host, haadmin user with sudo permissions.
- Required inputs: No new secrets required for standard deploys.
- Required dependencies: git, docker, python3-venv (already on production VPS).
Architecture / topology (short)
- Backend: FastAPI (API) + Python Worker.
- Reverse Proxy: Caddy (managed separately).
- Watchdogs: vps-crawl-auto-recover.py, vps-storage-hotpath-auto-recover.py, and the new vps-drift-auto-reconcile.py.
Procedure
1) SSH to the VPS
Connect using the Tailscale IP or host alias:
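The connection step could look like the sketch below. It assumes the haadmin account named in the preconditions. VPS_HOST and the vps_ssh_target helper are hypothetical names, and the host value must be replaced with your own Tailscale IP or host alias.

```shell
# Build the SSH target for the VPS.
# VPS_HOST is a hypothetical variable; set it to your Tailscale IP or host alias.
vps_ssh_target() {
  printf '%s@%s\n' "haadmin" "${VPS_HOST:?set VPS_HOST to the Tailscale IP or host alias}"
}

# Usage (run from your workstation):
#   export VPS_HOST=<tailscale-ip-or-alias>
#   ssh "$(vps_ssh_target)"
```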
2) Run the Deployment Script
Navigate to the repo root and run the safe-by-default deploy helper:
cd /opt/healtharchive
# Dry-run (recommended first)
./scripts/vps-deploy.sh
# Apply deployment
./scripts/vps-deploy.sh --apply
What this changes:
- Pulls the latest code from main.
- Installs/updates Python dependencies in the .venv.
- Runs DB migrations (alembic upgrade head).
- Restarts the healtharchive-api service.
- Restarts healtharchive-worker (only if no jobs are actively crawling).
- If the worker restart is intentionally skipped (for example during a controlled maintenance window), the script still reports worker status but keeps the deploy gate on API health and later verification.
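The safe-by-default behaviour (dry-run unless --apply is passed) can be illustrated with the sketch below. This is not the contents of vps-deploy.sh: the run helper, the APPLY variable, and the requirements.txt file name are assumptions made for illustration, and the conditional worker restart is omitted. Only the order of steps follows the list above.

```shell
# Illustrative only: NOT the real vps-deploy.sh.
# APPLY, run(), deploy_steps(), and requirements.txt are hypothetical names.
APPLY="${APPLY:-0}"

# Execute the command when APPLY=1; otherwise only print what would run.
run() {
  if [ "$APPLY" = "1" ]; then
    "$@"
  else
    echo "DRY-RUN: $*"
  fi
}

deploy_steps() {
  run git pull --ff-only origin main
  run .venv/bin/pip install -r requirements.txt
  run .venv/bin/alembic upgrade head
  run sudo systemctl restart healtharchive-api
}

# Usage:
#   deploy_steps            # prints the four steps without running them
#   APPLY=1; deploy_steps   # actually runs them
```

Routing every mutating command through a single helper keeps the dry-run output and the applied commands from drifting apart.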
3) Enable Auto-Reconciliation Watchdog (One-time)
To prevent future "502 Bad Gateway" errors from missing dependencies, enable the new automated watchdog:
# Install the new systemd units from the repo
sudo ./scripts/vps-install-systemd-units.sh --apply
# Create the sentinel file to enable the drift watchdog
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/drift-auto-reconcile-enabled
# Enable and start the timer
sudo systemctl enable --now healtharchive-drift-auto-reconcile.timer
What this changes:
- Installs healtharchive-drift-auto-reconcile.{service,timer}.
- Configures the watchdog to run every 5 minutes.
- If it detects "baseline drift" (missing packages), it will auto-trigger a re-deployment to fix the environment.
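The sentinel file acts as an on/off switch for the watchdog. A minimal sketch of such a gate follows, assuming the watchdog only checks that the file exists. The function name and the early-exit behaviour are hypothetical and are not taken from vps-drift-auto-reconcile.py.

```shell
# Hypothetical gate: succeed only when the sentinel file exists.
# The default path is the sentinel created in step 3.
drift_watchdog_enabled() {
  [ -e "${1:-/etc/healtharchive/drift-auto-reconcile-enabled}" ]
}

# Example use at the top of a watchdog wrapper:
#   drift_watchdog_enabled || { echo "drift watchdog disabled; exiting"; exit 0; }
```

After enabling the timer, `systemctl list-timers healtharchive-drift-auto-reconcile.timer` shows when the next run is scheduled.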
Verification (“done” criteria)
- Public surface: Verify https://api.healtharchive.ca/api/health returns HTTP 200.
- Internal health: Check service status:
sudo systemctl status healtharchive-api healtharchive-worker healtharchive-drift-auto-reconcile.timer
Note: if you ran the deploy with --skip-worker-restart, an inactive worker can be expected during the maintenance window. In that case, treat API health and the post-deploy public verification as the deploy-completion gates, then resume the worker under the window procedure you are following.
- Watchdog Metrics: If Prometheus/Grafana is set up, verify healtharchive_drift_auto_reconcile_enabled 1 exists in metrics.
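The public-surface check can be scripted. A minimal sketch using curl follows. The function names and the 10-second timeout are assumptions; only the URL and the expected HTTP 200 come from this runbook.

```shell
HEALTH_URL="https://api.healtharchive.ca/api/health"

# Print only the HTTP status code returned by the given URL.
health_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Succeed only when the status code is exactly 200.
is_healthy() {
  [ "$1" = "200" ]
}

# Usage:
#   is_healthy "$(health_status "$HEALTH_URL")" && echo "API healthy"
```

When no HTTP response is received at all, curl reports the code as 000, which is_healthy treats as a failure.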
Rollback / recovery
- Fast path: Use ./scripts/vps-deploy.sh --apply --ref <PREVIOUS_SHA> to revert to a known good version.
- Watchdog Disable: To stop the auto-recovery, delete the sentinel file (/etc/healtharchive/drift-auto-reconcile-enabled).
Troubleshooting
- 502 Bad Gateway: Check if the worker/API is dead: sudo journalctl -u healtharchive-api -n 100.
- Deploy Lock: If the script says a lock is held, check for orphans: ls -l /tmp/healtharchive-deploy.lock.
References
- Full Production Runbook: production-single-vps.md
- Systemd README: docs/deployment/systemd/README.md