
HealthArchive.ca – Production on a Single VPS (Hetzner + Tailscale)

This is the record of the current production deployment. It is a single VPS that runs Postgres, the API, the worker, Caddy (TLS), and all archive storage. SSH is private-only via Tailscale; the public internet only sees ports 80/443.

Use this as the canonical runbook for rebuilding the stack, auditing it, or explaining it to new operators.

Shared VPS inventory, ingress ownership, canonical public hosts, and cross-project operations state live in /home/jer/repos/vps/platform-ops. Use /home/jer/repos/vps/platform-ops/docs/standards/PLAT-009-shared-vps-documentation-boundary.md as the default rule for what belongs in this repo versus shared ops documentation.

For recovery from total failure, see the Disaster Recovery Runbook.

Documentation boundary note:

  1. This runbook is canonical for HealthArchive backend behavior on the VPS.
  2. Shared VPS facts that are not specific to HealthArchive alone are canonical in /home/jer/repos/vps/platform-ops.
  3. That includes shared ingress ownership, cross-project host inventory, shared path conventions, and host-wide hardening posture.
  4. The explicit ownership split is documented in /home/jer/repos/vps/platform-ops/docs/standards/PLAT-009-shared-vps-documentation-boundary.md.

1) Hosting / topology

  • Provider / size: Hetzner Cloud, cx33 (Cost-Optimized, 4 vCPU / 8GB RAM / 80GB SSD)
  • Region: Nuremberg (cost-optimized not available in US-East at the time)
  • Public services: healtharchive.ca (canonical), www.healtharchive.ca (redirect alias), and api.healtharchive.ca on 80/443 via Caddy
  • Replay (optional): replay.healtharchive.ca via Caddy → pywb (see deployment/replay-service-pywb.md)
  • Private-only: SSH on Tailscale (tailscale0), no public port 22
  • Storage:
    • /srv/healtharchive/jobs – archive root (WARCs / job outputs)
    • /srv/healtharchive/backups – DB dumps
    • (Optional) StorageBox mount (cold storage / tiering; not a crawl hot-path)
  • Database: local Postgres on the VPS
  • Monitoring/alerts:
    • Healthchecks.io pings for DB backup success/failure
    • Healthchecks.io pings for disk-usage threshold
    • (External uptime checks recommended: /api/health and /archive)
  • Backups: nightly pg_dump -Fc to /srv/healtharchive/backups, retained 14 days
  • Offsite copy: Synology NAS pulls backups over Tailscale via rsync/SSH

2) Provision & OS hardening (Hetzner)

1) Create server:
  • Type: Cost-Optimized, x86, cx33
  • Region: Nuremberg
  • OS: Ubuntu 24.04 LTS
  • Attach SSH public key; no password login

2) Hetzner Cloud Firewall (final state):
  • Allow TCP 80, 443 (anywhere)
  • Allow UDP 41641 (anywhere) for Tailscale
  • No public TCP 22

3) OS setup:
  • Create haadmin (sudo), disable root SSH login, disable SSH passwords
  • Enable unattended-upgrades
  • UFW: allow 80/443, allow 22 only on tailscale0, allow 41641/udp (sketch below)
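The UFW portion of the final state translates roughly to the following sketch; confirm the live ruleset with sudo ufw status verbose:

sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 41641/udp
sudo ufw allow in on tailscale0 to any port 22 proto tcp
sudo ufw enable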


3) Runtime dependencies

On the VPS (as haadmin):

sudo apt update
sudo apt -y install docker.io \
  postgresql postgresql-contrib \
  python3 python3-venv python3-pip \
  git curl build-essential pkg-config unzip
sudo systemctl enable --now docker postgresql

Annual crawls are long-running and browser-driven; having a small swap file helps avoid OOM-driven churn and reduces time lost to restarts.

Recommended on cx33 (8GB RAM): add a 4G swapfile on the local SSD:

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
swapon --show

Notes:

  • Docker Compose is optional for this stack. On Ubuntu 24.04, the packaged Compose plugin is often docker-compose-v2 (not docker-compose-plugin):
sudo apt -y install docker-compose-v2
docker compose version

Directories:

sudo groupadd --system healtharchive 2>/dev/null || true
sudo mkdir -p /srv/healtharchive/jobs /srv/healtharchive/backups /srv/healtharchive/ops
sudo chown -R haadmin:haadmin /srv/healtharchive/jobs
sudo chown root:healtharchive /srv/healtharchive/backups
sudo chmod 2770 /srv/healtharchive/backups
sudo chown root:healtharchive /srv/healtharchive/ops
sudo chmod 2770 /srv/healtharchive/ops

Ops directories (public-safe logs + artifacts):

  • root:healtharchive ownership + 2770 perms is intentional:
    • root owns the directory tree
    • operators (e.g., haadmin) write via the healtharchive group
    • the setgid bit keeps group ownership consistent on new files/dirs

Create the standard subdirectories:

sudo mkdir -p \
  /srv/healtharchive/ops/baseline \
  /srv/healtharchive/ops/restore-tests \
  /srv/healtharchive/ops/adoption \
  /srv/healtharchive/ops/search-eval
sudo chown -R root:healtharchive /srv/healtharchive/ops
sudo chmod 2770 /srv/healtharchive/ops /srv/healtharchive/ops/*
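Sanity-check ownership and the setgid bit before moving on:

ls -ld /srv/healtharchive/backups /srv/healtharchive/ops
# expect: drwxrws--- root healtharchive ... (the "s" is the setgid bit)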

Postgres:

sudo -u postgres psql -c "CREATE USER healtharchive WITH PASSWORD '<DB_PASSWORD>';"
sudo -u postgres psql -c "CREATE DATABASE healtharchive OWNER healtharchive;"

4) Backend deploy (API + worker, systemd)

Clone + venv:

sudo mkdir -p /opt && sudo chown haadmin:haadmin /opt
git clone https://github.com/jerdaw/healtharchive.git /opt/healtharchive
cd /opt/healtharchive
python3 -m venv .venv
./.venv/bin/pip install --upgrade pip
./.venv/bin/pip install -e ".[dev]" "psycopg[binary]"

Env file (root-owned, group-readable):

sudo groupadd --system healtharchive 2>/dev/null || true
sudo usermod -aG healtharchive haadmin
sudo install -d -m 750 -o root -g healtharchive /etc/healtharchive
sudo tee /etc/healtharchive/backend.env >/dev/null <<'EOF'
HEALTHARCHIVE_ENV=production
HEALTHARCHIVE_DATABASE_URL=postgresql+psycopg://healtharchive:<DB_PASSWORD>@127.0.0.1:5432/healtharchive
# Keep the crawl hot-path on the local SSD for throughput; use the StorageBox only for cold storage/tiering.
HEALTHARCHIVE_ARCHIVE_ROOT=/srv/healtharchive/jobs
HEALTHARCHIVE_ADMIN_TOKEN=<LONG_RANDOM_TOKEN>
HEALTHARCHIVE_CORS_ORIGINS=https://healtharchive.ca,https://www.healtharchive.ca,https://replay.healtharchive.ca
HEALTHARCHIVE_LOG_LEVEL=INFO
HA_SEARCH_RANKING_VERSION=v2
HA_PAGES_FASTPATH=1

# Optional: aggregated, privacy-preserving usage metrics (daily counts only).
# Drives the public reporting pages (`/status` and `/impact`) via `GET /api/usage`.
HEALTHARCHIVE_USAGE_METRICS_ENABLED=1
HEALTHARCHIVE_USAGE_METRICS_WINDOW_DAYS=30

# Optional: change tracking + diff feeds.
HEALTHARCHIVE_CHANGE_TRACKING_ENABLED=1

# Optional: compare-live (snapshot vs current live page).
# Defaults are safe, but you can tune these if needed.
HEALTHARCHIVE_COMPARE_LIVE_ENABLED=1
HEALTHARCHIVE_COMPARE_LIVE_TIMEOUT_SECONDS=8
HEALTHARCHIVE_COMPARE_LIVE_MAX_REDIRECTS=4
HEALTHARCHIVE_COMPARE_LIVE_MAX_BYTES=2000000
HEALTHARCHIVE_COMPARE_LIVE_MAX_ARCHIVE_BYTES=2000000
HEALTHARCHIVE_COMPARE_LIVE_MAX_RENDER_LINES=5000
HEALTHARCHIVE_COMPARE_LIVE_MAX_CONCURRENCY=4
# HEALTHARCHIVE_COMPARE_LIVE_USER_AGENT=HealthArchiveCompareLive/1.0 (+https://healtharchive.ca)

# Optional: research exports.
# Controls the public metadata export endpoints under `/api/exports`.
HEALTHARCHIVE_EXPORTS_ENABLED=1
HEALTHARCHIVE_EXPORTS_DEFAULT_LIMIT=1000
HEALTHARCHIVE_EXPORTS_MAX_LIMIT=10000

# Public site base URL for RSS feed links and public compare URLs.
HEALTHARCHIVE_PUBLIC_SITE_URL=https://healtharchive.ca

# Optional: replay integration (pywb). Enables `browseUrl` fields in the public API.
# HEALTHARCHIVE_REPLAY_BASE_URL=https://replay.healtharchive.ca

# Optional: cached replay preview images (homepage thumbnails for /archive cards).
# HEALTHARCHIVE_REPLAY_PREVIEW_DIR=/srv/healtharchive/replay/previews
EOF
sudo chown root:healtharchive /etc/healtharchive/backend.env
sudo chmod 640 /etc/healtharchive/backend.env
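Because the file is root:healtharchive mode 640, operators read it via group membership; note that usermod -aG only takes effect for fresh login sessions. A quick check:

sudo -u haadmin -g healtharchive cat /etc/healtharchive/backend.env >/dev/null \
  && echo "OK: backend.env readable via healtharchive group"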

Migrate + seed:

set -a; source /etc/healtharchive/backend.env; set +a
./.venv/bin/alembic upgrade head
./.venv/bin/healtharchive seed-sources
./.venv/bin/healtharchive recompute-page-signals
./.venv/bin/healtharchive rebuild-pages --truncate

Systemd services:

  • API: /etc/systemd/system/healtharchive-api.service
    • Prefer the repo-managed template in docs/deployment/systemd/healtharchive-api.service (a sketch of its shape follows below)
    • Default ExecStart runs uvicorn on 127.0.0.1:8001 with HEALTHARCHIVE_API_WORKERS=2
    • EnvironmentFile=/etc/healtharchive/backend.env
  • Worker: /etc/systemd/system/healtharchive-worker.service
    • ExecStart=/opt/healtharchive/.venv/bin/healtharchive start-worker --poll-interval 30
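For orientation only, a sketch of the API unit's shape assembled from the bullets above. The repo-managed template is canonical; in particular, the healtharchive.api:app import path here is an illustrative assumption, not the confirmed module path:

[Unit]
Description=HealthArchive API (uvicorn)
After=network-online.target postgresql.service
Wants=network-online.target

[Service]
User=haadmin
WorkingDirectory=/opt/healtharchive
EnvironmentFile=/etc/healtharchive/backend.env
Environment=HEALTHARCHIVE_API_WORKERS=2
# Import path below is illustrative; copy ExecStart from the repo template.
ExecStart=/opt/healtharchive/.venv/bin/uvicorn healtharchive.api:app --host 127.0.0.1 --port 8001 --workers 2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target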

Optional systemd automation (recommended):

  • Install/update systemd unit templates from this repo:
    • ./scripts/vps-install-systemd-units.sh --apply
    • Templates + enablement steps: deployment/systemd/README.md
  • Baseline drift check timer (weekly; low-risk, recommended):
    • Templates + enablement steps: deployment/systemd/README.md
  • Annual scheduling timer (Jan 01 UTC) + worker priority drop-in:
    • Templates + install steps: deployment/systemd/README.md
  • Replay reconciliation timer (pywb indexing; capped, optional):
    • Templates + install steps: deployment/systemd/README.md
  • Change tracking timer (edition-aware diffs; capped):
    • Templates + install steps: deployment/systemd/README.md
  • Annual search verification capture (optional; safe):
    • Templates + enablement steps: deployment/systemd/README.md

Enable + start:

sudo systemctl daemon-reload
sudo systemctl enable --now healtharchive-api healtharchive-worker
curl -i http://127.0.0.1:8001/api/health

Routine deploys (after initial install):

cd /opt/healtharchive

# Dry-run (prints actions):
./scripts/vps-deploy.sh

# Deploy latest main (fast-forward only):
./scripts/vps-deploy.sh --apply

# Deploy pinned commit:
./scripts/vps-deploy.sh --apply --ref <GIT_SHA>

Recommended wrapper (routine use):

./scripts/vps-hetzdeploy.sh

Recommended: install hetzdeploy as a real command (avoid fragile aliases):

sudo ./scripts/vps-install-hetzdeploy.sh --apply

# Then you can run it from anywhere:
hetzdeploy

Notes:

  • The deploy script runs a baseline drift check by default to catch misconfiguration (filesystem perms, systemd enablement, env allowlists, etc.).
    • Artifacts are written to: /srv/healtharchive/ops/baseline/
    • You can skip in emergencies: ./scripts/vps-deploy.sh --apply --skip-baseline-drift
    • To include live HTTPS checks (HSTS, CORS headers, admin/metrics auth): ./scripts/vps-deploy.sh --apply --baseline-mode live
  • If you update systemd unit templates in the repo, you can apply them during deploy:
    • ./scripts/vps-deploy.sh --apply --install-systemd-units
  • If you update Prometheus alert rules, you can apply them during deploy:
    • ./scripts/vps-deploy.sh --apply --apply-alerting
    • Requires alerting to be configured (webhook secret present at /etc/healtharchive/observability/alertmanager_webhook_url).
  • The baseline policy (desired state) is versioned in git at: docs/operations/production-baseline-policy.toml
  • The deploy script runs a public-surface smoke verify by default (public API + frontend + replay + usage):
    • ./scripts/verify_public_surface.py (defaults to https://api.healtharchive.ca and https://healtharchive.ca)
    • You can skip in emergencies: ./scripts/vps-deploy.sh --apply --skip-public-surface-verify
  • If the public frontend is externally down, use:
    • ./scripts/vps-hetzdeploy.sh --mode backend-only
    • Or (if installed): hetzdeploy --mode backend-only
  • Crawl-safety: if any jobs are status=running, the deploy helper restarts healtharchive-api but skips restarting healtharchive-worker by default (to avoid SIGTERMing an active crawl); see the check after this list.
    • To force a worker restart (only when you are OK interrupting crawls): ./scripts/vps-deploy.sh --apply --force-worker-restart
    • To explicitly skip the worker restart regardless of job status: ./scripts/vps-deploy.sh --apply --skip-worker-restart
    • When the worker restart is skipped, the script reports worker status for visibility but does not treat an inactive worker as a fatal pre-health-check deploy failure; API readiness and post-deploy verification remain the completion gates.
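Before forcing a worker restart, confirm whether a crawl is actually active; the same status helper used for storage triage in §4.2 works here:

cd /opt/healtharchive
./scripts/vps-crawl-status.sh --year "$(date -u +%Y)"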

4.1) Observability (Prometheus + Grafana; operator-only)

This is the private ops stack:

  • Prometheus collects metrics (backend + host + Postgres exporters).
  • Grafana shows dashboards (“private stats”).
  • Alertmanager sends alerts to one operator channel (via the webhook relay).

The important safety rule:

  • These services bind to 127.0.0.1 on the VPS (loopback-only) and are accessed over the tailnet (Tailscale) using an SSH port-forward.
  • Do not add Caddy vhosts for Prometheus/Grafana (keep them off the public internet).

Install flow (VPS):

  • Follow the observability playbooks under docs/operations/playbooks/:
  • docs/operations/playbooks/observability/observability-guide.md
  • docs/operations/playbooks/observability/monitoring-and-alerting.md

Where things live (VPS):

  • Secrets (never commit): /etc/healtharchive/observability/
  • Prometheus config: /etc/prometheus/prometheus.yml and /etc/prometheus/rules/
  • Alertmanager config: /etc/prometheus/alertmanager.yml
  • Grafana dashboards provisioning: /etc/grafana/provisioning/dashboards/healtharchive.yaml
  • Public-safe dashboard JSON + ops artifacts: /srv/healtharchive/ops/observability/

Access from your laptop (via tailnet-only SSH):

# Tunnel Grafana + Prometheus + admin proxy to your local machine.
# Keep this terminal open.
ssh -N \
  -L 3000:127.0.0.1:3000 \
  -L 9090:127.0.0.1:9090 \
  -L 8002:127.0.0.1:8002 \
  haadmin@<vps-tailscale-ip>

Then open on your laptop:

  • Grafana: http://127.0.0.1:3000/
  • Prometheus UI (optional): http://127.0.0.1:9090/
  • Admin proxy (operator triage; browser-friendly): http://127.0.0.1:8002/

Restart services (VPS):

sudo systemctl restart \
  prometheus \
  prometheus-alertmanager \
  prometheus-node-exporter \
  prometheus-postgres-exporter \
  grafana-server \
  healtharchive-pushover-relay \
  healtharchive-admin-proxy

4.2) Storage Box / sshfs stale mount failures (Errno 107)

HealthArchive uses a Storage Box (via sshfs) as a cold tier in production (WARC tiering).

Important failure mode:

  • A mount can appear “present” but be stale/unreadable, causing:
    • OSError: [Errno 107] Transport endpoint is not connected

This can break:

  • crawl progress metrics,
  • archive job output dirs under /srv/healtharchive/jobs/**,
  • and the worker/job lifecycle.

Updating

Standard fast-forward to latest main:

cd /opt/healtharchive
./scripts/vps-deploy.sh --apply

[!WARNING] DO NOT manually run git pull on the VPS. Always use ./scripts/vps-deploy.sh or the hetzdeploy wrapper. This ensures that new dependencies (e.g., Python packages in .venv) are installed and services like the API are restarted cleanly, preventing ModuleNotFoundError crashes.

This does:

  • git fetch origin main && git merge origin/main --ff-only
  • pip install -r requirements.txt (into .venv)
  • alembic upgrade head
  • systemctl restart healtharchive-api (if not skipping)
  • systemctl restart healtharchive-worker (if not skipping/forcing)

Fast triage (Storage Box mount):

cd /opt/healtharchive
./scripts/vps-crawl-status.sh --year "$(date -u +%Y)"
ls -la /srv/healtharchive/storagebox >/dev/null && echo "OK: storagebox readable" || echo "BAD: storagebox unreadable"
mount | rg '/srv/healtharchive/jobs/|/srv/healtharchive/storagebox'

Recovery playbook:

  • ../operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
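The playbook is canonical; for orientation only, recovery is typically a lazy unmount of the stale sshfs endpoint followed by a remount. The exact mount command/fstab entry is environment-specific, so treat this as a sketch:

# Lazy-unmount the stale FUSE endpoint, then remount and re-check readability.
sudo fusermount -uz /srv/healtharchive/storagebox || sudo umount -l /srv/healtharchive/storagebox
sudo mount /srv/healtharchive/storagebox   # assumes an /etc/fstab sshfs entry
ls /srv/healtharchive/storagebox >/dev/null && echo "OK: storagebox readable"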

5) HTTPS + DNS (Caddy)

1) DNS (Namecheap): A api.healtharchive.ca -> <VPS_PUBLIC_IP>

2) Install Caddy: sudo apt -y install caddy

3) Caddyfile: /etc/caddy/Caddyfile

api.healtharchive.ca {
  header Strict-Transport-Security "max-age=31536000"
  reverse_proxy 127.0.0.1:8001
}

4) Validate + reload:

sudo caddy fmt --overwrite /etc/caddy/Caddyfile
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy

Verify:

curl -i https://api.healtharchive.ca/api/health

5.1) Optional: replay service (pywb)

Full-fidelity browsing (CSS/JS/images) requires a replay engine. If you want “click links and stay inside the archived backup”, deploy pywb behind Caddy:

  • Runbook: deployment/replay-service-pywb.md

Operational warning:

  • healtharchive cleanup-job --mode temp removes temp dirs including WARCs. Replay depends on WARCs staying on disk, so do not run cleanup for any job you intend to keep replayable. If replay is enabled globally (HEALTHARCHIVE_REPLAY_BASE_URL is set), cleanup-job --mode temp will refuse unless you pass --force.
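For example (the job selector is shown as a placeholder; check healtharchive cleanup-job --help for the exact argument):

# Refuses when HEALTHARCHIVE_REPLAY_BASE_URL is set:
./.venv/bin/healtharchive cleanup-job --mode temp <JOB_ID>
# Only when you accept losing replay for that job:
./.venv/bin/healtharchive cleanup-job --mode temp <JOB_ID> --force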

Optional UX improvement:

  • If HEALTHARCHIVE_REPLAY_PREVIEW_DIR is configured, the API can serve cached PNG “homepage previews” used by the frontend on /archive. See deployment/replay-service-pywb.md (“Cached source preview images”) for the generation command.

6) Tailscale (SSH/private access only)

  • Installed on VPS, NAS, and admin workstation.
  • VPS Tailscale IP: 100.x.y.z (example)
  • SSH only allowed on tailscale0 in UFW; public port 22 blocked at Hetzner.
  • Hetzner firewall adds UDP 41641 for better Tailscale connectivity.
  • Recommended: disable Tailscale key expiry for the VPS and NAS devices in the Tailscale admin UI so access does not silently expire.

Usage:

ssh -i ~/.ssh/healtharchive_hetzner haadmin@100.x.y.z

Public SSH:

  • Expected to fail: ssh haadmin@api.healtharchive.ca (closed).


7) Backups + NAS pull (rsync over Tailscale)

VPS backup user:

  • habackup user with NAS public key in /home/habackup/.ssh/authorized_keys

Backup script: /usr/local/bin/healtharchive-db-backup

  • pg_dump -Fc to /srv/healtharchive/backups/healtharchive_<ts>.dump
  • Nightly healtharchive_<ts>.dump series retained 14 days
  • One-off maintenance dumps such as healtharchive_pre_<change>_<ts>.dump are rollback artifacts, not part of the nightly retention set. After the maintenance window is closed and at least one newer nightly dump plus restore evidence exist, remove them from /srv/healtharchive/backups or archive them under /srv/healtharchive/ops/maintenance/... instead of leaving them in the mirrored NAS backup set indefinitely.
  • Healthchecks /start, /fail, and success pings (see §8)
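The installed script is canonical; the following is a minimal sketch of the behavior described above (dump, prune the nightly series while leaving pre_ maintenance dumps alone, ping Healthchecks):

#!/usr/bin/env bash
# Sketch of /usr/local/bin/healtharchive-db-backup (illustrative, not the installed script).
set -euo pipefail
source /etc/healtharchive/healthchecks.env
curl -fsS -m 10 "${HC_DB_BACKUP_URL}/start" >/dev/null || true
ts="$(date -u +%Y%m%dT%H%M%SZ)"
out="/srv/healtharchive/backups/healtharchive_${ts}.dump"
if sudo -u postgres pg_dump -Fc healtharchive > "$out"; then
  # Prune only the nightly series; healtharchive_pre_* dumps are handled manually.
  find /srv/healtharchive/backups -maxdepth 1 -name 'healtharchive_*.dump' \
    ! -name 'healtharchive_pre_*' -mtime +14 -delete
  curl -fsS -m 10 "${HC_DB_BACKUP_URL}" >/dev/null
else
  curl -fsS -m 10 "${HC_DB_BACKUP_URL}/fail" >/dev/null
  exit 1
fi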

Systemd:

  • /etc/systemd/system/healtharchive-db-backup.service
  • /etc/systemd/system/healtharchive-db-backup.timer (daily ~03:30 UTC, randomized delay)

NAS pull:

  • NAS key: ~/.ssh/ha_backup_nas (no passphrase)
  • SSH config alias on NAS:

Host ha-vps
  HostName 100.x.y.z
  User habackup
  IdentityFile ~/.ssh/ha_backup_nas
  IdentitiesOnly yes
  StrictHostKeyChecking accept-new

  • Rsync command (used manually + DSM scheduled task):

mkdir -p /volume1/nobak/healtharchive/backups/db
rsync -av --delete ha-vps:/srv/healtharchive/backups/ /volume1/nobak/healtharchive/backups/db/
  • Make the DSM scheduled task run both lines, not just rsync, so the NAS pull self-heals if the destination path disappears after a share rebuild or manual cleanup.
  • If /volume1/nobak/healtharchive/backups/db is missing and the task runs only rsync, Synology Task Scheduler will report rsync exit code 11 (mkdir ... failed: No such file or directory).

8) Healthchecks.io (backup + disk)

Secrets file: /etc/healtharchive/healthchecks.env (mode 600)

Notes:

  • This env file may also be used by the newer systemd unit templates under docs/deployment/systemd/ (which use HEALTHARCHIVE_HC_PING_* variable names). It is OK to keep both sets of variables in the same file.
  • Avoid placeholder values like https://hc-ping.com/<uuid> in this file if you ever source it from bash; the literal < and > characters can break shell parsing.

Example contents:

HC_DB_BACKUP_URL=https://hc-ping.com/UUID_HERE
HC_DISK_URL=https://hc-ping.com/UUID_HERE
HC_DISK_THRESHOLD=80

Disk check:

  • Script: /usr/local/bin/healtharchive-disk-check
  • Service/Timer: healtharchive-disk-check.service / healtharchive-disk-check.timer (hourly)
  • Pings success; sends /fail if / or /srv/healtharchive exceeds 80%.
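A minimal sketch of that behavior, assuming the env file above provides HC_DISK_URL and HC_DISK_THRESHOLD (the installed script is canonical):

#!/usr/bin/env bash
# Sketch of /usr/local/bin/healtharchive-disk-check (illustrative).
set -euo pipefail
source /etc/healtharchive/healthchecks.env
threshold="${HC_DISK_THRESHOLD:-80}"
worst=0
for path in / /srv/healtharchive; do
  pct="$(df --output=pcent "$path" | tail -1 | tr -dc '0-9')"
  if [ "$pct" -gt "$worst" ]; then worst="$pct"; fi
done
if [ "$worst" -gt "$threshold" ]; then
  curl -fsS -m 10 "${HC_DISK_URL}/fail" >/dev/null
else
  curl -fsS -m 10 "${HC_DISK_URL}" >/dev/null
fi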


9) Synthetic snapshot for smoke testing

Created a minimal WARC + Snapshot for smoke checks:

  • WARC: /srv/healtharchive/jobs/manual-warcs/viewer-test.warc.gz
  • Snapshot ID: 1
  • Raw: https://api.healtharchive.ca/api/snapshots/raw/1
  • Viewer: https://healtharchive.ca/snapshot/1

Use this to verify end-to-end viewer behavior after deploys.


10) Restore drill (completed)

Procedure:

latest="$(ls -t /srv/healtharchive/backups/healtharchive_*.dump | head -n 1)"
sudo -u postgres dropdb --if-exists healtharchive_restore_test
sudo -u postgres createdb healtharchive_restore_test
sudo -u postgres pg_restore --no-owner --no-acl -d healtharchive_restore_test < "$latest"
sudo -u postgres psql -d healtharchive_restore_test -c "select count(*) from snapshots;"
sudo -u postgres dropdb healtharchive_restore_test

Result: restore succeeded, snapshots contained 1 row (the synthetic test snapshot).


11) External uptime monitoring

Configure an external monitor (e.g., UptimeRobot) for:

  • https://api.healtharchive.ca/api/health
  • https://healtharchive.ca/archive
  • (Optional) https://replay.healtharchive.ca/ (if replay is enabled/in use)

Note: some providers use HEAD by default; the backend supports HEAD /api/health.
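For example, to confirm the HEAD behavior an uptime provider will see:

curl -I https://api.healtharchive.ca/api/health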

For a more detailed, step-by-step checklist (including branch protection / CI enforcement), see:

  • ../operations/monitoring-and-ci-checklist.md

12) Current known defaults/assumptions (2026-03)

  • CORS allowlist: https://healtharchive.ca, https://www.healtharchive.ca (redirect alias compatibility), https://replay.healtharchive.ca
  • Frontend runtime: direct Docker container on the VPS behind host Caddy
  • No staging backend; Preview and Production frontends point to the same API
  • Public SSH closed; Tailscale required for admin/backup access