Skip to content

Runbook: VPS Backend Deployment (operators)

Purpose: This runbook provides the canonical workflow for deploying the latest healtharchive and enabling automated incident recovery (drift auto-reconcile).

Scope

  • Environment: production (single VPS)
  • Audience: operator
  • Non-goals: This does not cover the initial VPS setup/provisioning (see production-single-vps.md).

Preconditions

  • Required access: Tailscale access to the host, haadmin user with sudo permissions.
  • Required inputs: No new secrets required for standard deploys.
  • Required dependencies: git, docker, python3-venv (already on production VPS).

Architecture / topology (short)

  • Backend: FastAPI (API) + Python Worker.
  • Reverse Proxy: Caddy (managed separately).
  • Watchdogs: vps-crawl-auto-recover.py, vps-storage-hotpath-auto-recover.py, and the new vps-drift-auto-reconcile.py.

Procedure

1) SSH to the VPS

Connect using the Tailscale IP or host alias:

ssh haadmin@vps-tailscale-alias

2) Run the Deployment Script

Navigate to the repo root and run the safe-by-default deploy helper:

cd /opt/healtharchive

# Dry-run (recommended first)
./scripts/vps-deploy.sh

# Apply deployment
./scripts/vps-deploy.sh --apply

What this changes:

  • Pulls the latest code from main.
  • Installs/updates Python dependencies in the .venv.
  • Runs DB migrations (alembic upgrade head).
  • Restarts the healtharchive-api service.
  • Restarts healtharchive-worker (only if no jobs are actively crawling).
  • If the worker restart is intentionally skipped (for example during a controlled maintenance window), the script still reports worker status but keeps the deploy gate on API health and later verification.

3) Enable Auto-Reconciliation Watchdog (One-time)

To prevent future "502 Bad Gateway" errors from missing dependencies, enable the new automated watchdog:

# Install the new systemd units from the repo
sudo ./scripts/vps-install-systemd-units.sh --apply

# Create the sentinel file to enable the drift watchdog
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/drift-auto-reconcile-enabled

# Enable and start the timer
sudo systemctl enable --now healtharchive-drift-auto-reconcile.timer

What this changes:

  • Installs healtharchive-drift-auto-reconcile.{service,timer}.
  • Configures the watchdog to run every 5 minutes.
  • If it detects "baseline drift" (missing packages), it will auto-trigger a re-deployment to fix the environment.

Verification (“done” criteria)

  • Public surface: Verify https://api.healtharchive.ca/api/health returns HTTP 200.

  • Internal health: Check service status:

sudo systemctl status healtharchive-api healtharchive-worker healtharchive-drift-auto-reconcile.timer

Note: if you ran the deploy with --skip-worker-restart, an inactive worker can be expected during the maintenance window. In that case, treat API health and the post-deploy public verification as the deploy-completion gates, then resume the worker under the window procedure you are following.

  • Watchdog Metrics: If Prometheus/Grafana is setup, verify healtharchive_drift_auto_reconcile_enabled 1 exists in metrics.

Rollback / recovery

  • Fast path: Use ./scripts/vps-deploy.sh --apply --ref <PREVIOUS_SHA> to revert to a known good version.
  • Watchdog Disable: To stop the auto-recovery, delete the sentinel file:
    sudo rm /etc/healtharchive/drift-auto-reconcile-enabled
    

Troubleshooting

  • 502 Bad Gateway: Check if the worker/API is dead: sudo journalctl -u healtharchive-api -n 100.
  • Deploy Lock: If the script says a lock is held, check for orphans: ls -l /tmp/healtharchive-deploy.lock.

References