WARC storage tiering (SSD + Storage Box)

Goal: keep HealthArchive running on a small VPS SSD by tiering large WARC job directories onto a Hetzner Storage Box (“cold” storage), while still being able to replay pages from cold storage.

This playbook assumes:

  • Production-like host paths (/srv/healtharchive/**)
  • Existing snapshots may already reference absolute paths under /srv/healtharchive/jobs/**
  • You want to keep paths stable (so replay keeps working) while relocating bytes to cheaper storage.

Architecture (what runs where)

  • VPS (hot / canonical paths)
  • Canonical archive root: /srv/healtharchive/jobs
  • Backend services read WARCs from paths recorded in the DB (often absolute paths under /srv/healtharchive/jobs/**).
  • For tiering, we keep these canonical paths intact and mount/bind cold data into them.

  • Storage Box (cold bytes)
  • Mounted on the VPS at: /srv/healtharchive/storagebox
  • Cold mirror root (suggested): /srv/healtharchive/storagebox/jobs
  • You store large job directories here and then bind-mount them into the canonical paths under /srv/healtharchive/jobs/**.
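
Concretely, the mapping between the two roots looks like this (a sketch; <job_dir_name> is a placeholder, and the full swap procedure is described below):

# Canonical (hot) path recorded in the DB and read by services:
#   /srv/healtharchive/jobs/imports/<job_dir_name>
# Cold location on the Storage Box:
#   /srv/healtharchive/storagebox/jobs/imports/<job_dir_name>
# Bind-mounting the cold directory onto the canonical path keeps DB paths valid:
sudo mount --bind \
  /srv/healtharchive/storagebox/jobs/imports/<job_dir_name> \
  /srv/healtharchive/jobs/imports/<job_dir_name>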

Create the Storage Box (Hetzner console)

Recommended choices for HealthArchive:

  • Plan: BX11 (1 TB) is a good starting tier for cold WARCs.
  • Location: same region as the VPS.
  • Access: SSH key auth (recommended).
  • Additional settings
  • Enable: SSH Support
  • Disable (not needed): SMB Support, WebDAV Support
  • External reachability: prefer disabled (you can access via the VPS; no need to expose to the public internet).
  • Labels: optional; if you use them, keep them simple:
  • project=healtharchive
  • role=warc-cold-storage
  • env=prod

Notes:

  • “Set as default key” in Hetzner means “preselect this key for future Storage Boxes by default” (it doesn’t change the key material).
  • On the VPS, ensure private keys are locked down:
  • chmod 700 ~/.ssh
  • chmod 600 ~/.ssh/hetzner_storagebox
  • chmod 644 ~/.ssh/hetzner_storagebox.pub
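
If the keypair referenced above does not exist yet, one way to create it (a sketch; the filename matches the paths used in this playbook, the comment is arbitrary) is:

ssh-keygen -t ed25519 -f ~/.ssh/hetzner_storagebox -C "healtharchive-storagebox"
# Add the public key to the Storage Box in the Hetzner console:
cat ~/.ssh/hetzner_storagebox.pub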

Mount the Storage Box on the VPS (sshfs)

Run these on the VPS:

sudo apt-get update
sudo apt-get install -y sshfs
sudo sed -i 's/^#user_allow_other/user_allow_other/' /etc/fuse.conf
sudo mkdir -p /srv/healtharchive/storagebox

Mount (SSH runs on port 23 for Storage Boxes):

GID="$(getent group healtharchive | cut -d: -f3)"
sudo sshfs -p 23 \
  -o IdentityFile=/home/haadmin/.ssh/hetzner_storagebox \
  -o allow_other,default_permissions \
  -o uid="$(id -u haadmin)",gid="${GID}",umask=0027 \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,kernel_cache \
  uNNNNNN@uNNNNNN.your-storagebox.de:/ \
  /srv/healtharchive/storagebox

Sanity check:

touch /srv/healtharchive/storagebox/_probe && rm /srv/healtharchive/storagebox/_probe
df -h /srv/healtharchive/storagebox

Create the cold mirror root:

mkdir -p /srv/healtharchive/storagebox/jobs/imports

If the Storage Box isn’t mounted (e.g., after reboot), tiered paths may fall back to empty local directories and replay will break. Add a small systemd unit to mount it on boot.

Use the repo templates under docs/deployment/systemd/ (installed via scripts/vps-install-systemd-units.sh).

1) Create /etc/healtharchive/storagebox.env (VPS):

sudo install -d -m 0755 /etc/healtharchive
sudo tee /etc/healtharchive/storagebox.env >/dev/null <<'EOF'
STORAGEBOX_HOST=uNNNNNN.your-storagebox.de
STORAGEBOX_USER=uNNNNNN
STORAGEBOX_IDENTITY=/home/haadmin/.ssh/hetzner_storagebox
STORAGEBOX_UID=1000
STORAGEBOX_GID=999
# Optional (defaults are fine for most setups):
# STORAGEBOX_REMOTE_PATH=
# STORAGEBOX_PORT=23
# STORAGEBOX_MOUNT=/srv/healtharchive/storagebox
EOF

Replace:

  • uNNNNNN values with your Storage Box username/host.
  • STORAGEBOX_UID with the output of id -u haadmin.
  • STORAGEBOX_GID with the output of getent group healtharchive | cut -d: -f3.

2) Install templates + enable the mount (VPS):

cd /opt/healtharchive
sudo ./scripts/vps-install-systemd-units.sh --apply
sudo systemctl enable --now healtharchive-storagebox-sshfs.service
systemctl status healtharchive-storagebox-sshfs.service --no-pager
mount | rg /srv/healtharchive/storagebox || true

If it fails due to host key prompts, prime root’s known_hosts once:

sudo ssh -p 23 -i /home/haadmin/.ssh/hetzner_storagebox uNNNNNN@uNNNNNN.your-storagebox.de exit
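
Regardless of how the mount is managed, scripts and manual operations can guard against the “not mounted” failure mode described above with a quick check (mountpoint is part of util-linux):

# In scripts, fail fast if the Storage Box is not mounted:
mountpoint -q /srv/healtharchive/storagebox || { echo "Storage Box not mounted" >&2; exit 1; }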

Move a job directory from SSD → Storage Box (safe swap)

This procedure keeps the canonical path stable.

1) Define paths (VPS):

HOT=/srv/healtharchive/jobs/imports/<job_dir_name>
COLD=/srv/healtharchive/storagebox/jobs/imports/<job_dir_name>

2) Copy to cold tier (VPS):

mkdir -p "$(dirname "$COLD")"
rsync -rltH --info=progress2 --no-owner --no-group --no-perms "$HOT/" "$COLD/"

3) Stop services (VPS):

sudo systemctl stop healtharchive-worker.service
sudo systemctl stop healtharchive-replay.service
sudo systemctl stop healtharchive-api.service

4) Swap the canonical path to point at the cold copy (VPS):

sudo mv "$HOT" "${HOT}.hot-backup"
sudo mkdir -p "$HOT"
sudo mount --bind "$COLD" "$HOT"

5) Start services + verify (VPS):

sudo systemctl start healtharchive-api.service
sudo systemctl start healtharchive-worker.service
sudo systemctl start healtharchive-replay.service

curl -fsS http://127.0.0.1:8001/api/health >/dev/null && echo OK
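
Optionally, before deleting the backup, confirm the bind mount is in place and the cold copy contains the same files as the original (a sketch using the HOT variable defined above):

findmnt -T "$HOT"
diff <(cd "$HOT" && find . -type f | sort) \
     <(cd "${HOT}.hot-backup" && find . -type f | sort) && echo "file lists match"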

6) If everything is OK, delete the backup (VPS):

sudo rm -rf "${HOT}.hot-backup"

Rollback (if needed):

sudo umount "$HOT"
sudo rm -rf "$HOT"
sudo mv "${HOT}.hot-backup" "$HOT"

Persist tiering bind mounts across reboots

Bind mounts created manually will not survive a reboot. Use the repo’s unit template + manifest so tiered jobs come back automatically.

1) Create the manifest /etc/healtharchive/warc-tiering.binds (VPS):

sudo tee /etc/healtharchive/warc-tiering.binds >/dev/null <<'EOF'
# cold_path hot_path
/srv/healtharchive/storagebox/jobs/imports/legacy-hc-2025-04-21 /srv/healtharchive/jobs/imports/legacy-hc-2025-04-21
/srv/healtharchive/storagebox/jobs/imports/legacy-cihr-2025-04 /srv/healtharchive/jobs/imports/legacy-cihr-2025-04
EOF

2) Enable the service (VPS):

cd /opt/healtharchive
sudo ./scripts/vps-install-systemd-units.sh --apply
sudo systemctl enable --now healtharchive-warc-tiering.service
systemctl status healtharchive-warc-tiering.service --no-pager
mount | rg /srv/healtharchive/jobs/imports/legacy- || true

Note: the template service runs vps-warc-tiering-bind-mounts.sh --apply --repair-stale-mounts so it can automatically unmount stale mountpoints (errno 107, “Transport endpoint is not connected”) and re-apply the bind mounts.

If healtharchive-warc-tiering.service is failed (or alert is firing)

First, gather a read-only diagnostic report (safe while crawls are running):

cd /opt/healtharchive
./scripts/vps-diagnose-warc-tiering.sh

If the storage hot-path auto-recover automation is enabled (healtharchive-storage-hotpath-auto-recover.timer plus its sentinel), it will also try to reconcile a tiering unit stuck in the failed state (reset-failed + start), provided no stale targets are currently eligible for repair and the base Storage Box mount is readable.

If the alert does not clear after a couple of timer cycles, run the manual reconcile steps below.

Then, during a safe window (recommended: no active replay indexing and no ongoing maintenance), clear the failed state and re-apply tiering:

sudo systemctl reset-failed healtharchive-warc-tiering.service
sudo systemctl start healtharchive-warc-tiering.service
systemctl status healtharchive-warc-tiering.service --no-pager -l

If it fails again, run the tiering script directly (shows the most actionable error output):

sudo /opt/healtharchive/scripts/vps-warc-tiering-bind-mounts.sh --apply --repair-stale-mounts

If the unit is in a failed state from a prior incident, clear it before retrying:

systemctl is-failed healtharchive-warc-tiering.service && sudo systemctl reset-failed healtharchive-warc-tiering.service || true
sudo systemctl start healtharchive-warc-tiering.service

Manual validation (safe):

sudo /opt/healtharchive/scripts/vps-warc-tiering-bind-mounts.sh

If healtharchive-warc-tiering.service repeatedly ends up in failed (e.g., after an sshfs disconnect), consider enabling the tiering health metrics timer so failures are visible quickly:

sudo systemctl enable --now healtharchive-tiering-metrics.timer
curl -s http://127.0.0.1:9100/metrics | rg '^healtharchive_tiering_' || true

Replay note (restart after tiering changes)

Replay runs in a long-lived Docker container and bind-mounts /srv/healtharchive/jobs into /warcs. After fixing stale mounts or changing tiering binds, restart replay so it sees a clean view of the mountpoints:

sudo systemctl restart healtharchive-replay.service
sudo systemctl start healtharchive-replay-smoke.service

Current annual output topology

As of 2026-05-06, the 2026 annual job output directories for HC, PHAC, and CIHR are canonical (“hot”) paths under /srv/healtharchive/jobs/** whose bytes are backed by the single Storage Box mount at /srv/healtharchive/storagebox. On sshfs-backed bind mounts, findmnt may still report fstype=fuse.sshfs, so validate the topology with vps-annual-output-tiering.py --year 2026 and, when needed, with /proc/self/mountinfo or by comparing hot/cold stat identity, rather than relying only on a visible bind option.
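
One way to check hot/cold stat identity is to compare device and inode numbers for a canonical output directory and its counterpart under the Storage Box mount; when the bind is active they refer to the same underlying directory (a sketch with illustrative 2026 paths):

HOT=/srv/healtharchive/jobs/hc/20260101T000000Z__hc-2026
COLD=/srv/healtharchive/storagebox/jobs/hc/20260101T000000Z__hc-2026
stat -c 'dev=%d ino=%i  %n' "$HOT" "$COLD"   # matching dev/ino => same underlying directory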

Annual outputs: automatically tier to Storage Box

If you use the annual scheduler timer (healtharchive-schedule-annual.timer), the systemd template now triggers healtharchive-annual-output-tiering.service on success. This bind-mounts the newly enqueued annual job output directories onto the Storage Box tier and briefly stops the worker to reduce race conditions.

To apply the updated template on the VPS:

cd /opt/healtharchive
git pull
sudo ./scripts/vps-install-systemd-units.sh --apply
sudo systemctl daemon-reload
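
To confirm the trigger is wired up after reinstalling the units, you can inspect the installed scheduler unit (this assumes the template chains the tiering service via systemd’s OnSuccess= directive; adjust if the template wires it differently):

systemctl cat healtharchive-schedule-annual.service --no-pager | rg -i 'OnSuccess' || true
systemctl list-timers 'healtharchive-schedule-annual*' --no-pager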

Rehearsal before Jan 01 (optional)

The tiering script selects annual jobs using the Jan 01–Jan 03 UTC window by default. If you want to rehearse the end-to-end scheduling + tiering workflow before Jan 01, you can override the selection window:

1) Stop the worker (prevents any queued jobs from running):

sudo systemctl stop healtharchive-worker.service

2) Enqueue annual jobs (this affects the production DB; delete them afterwards if you do not want them queued):

/opt/healtharchive/.venv/bin/healtharchive schedule-annual --apply --year 2026 --sources hc phac cihr

3) Apply tiering for the jobs you just created (use a short window around “now”):

# Ensure we target the production DB (Postgres), not a local fallback (SQLite):
set -a; source /etc/healtharchive/backend.env; set +a
systemctl is-active postgresql.service

sudo --preserve-env=HEALTHARCHIVE_DATABASE_URL,HEALTHARCHIVE_ARCHIVE_ROOT \
  /opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py \
  --apply \
  --repair-stale-mounts \
  --allow-repair-running-jobs \
  --year 2026 \
  --created-after "$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)"

4) Validate the expected output dirs are mounted (and that storagebox is still mounted):

mount | rg '/srv/healtharchive/storagebox|/srv/healtharchive/jobs/(hc|phac|cihr)/' || true

5) Decide what to do with the queued annual jobs:

  • If you want to keep them queued for Jan 01, leave the worker stopped until you are ready.
  • If you do not want them queued yet, delete them via the admin UI or CLI before restarting the worker.

Tiering metrics and Prometheus alerting

If you use Prometheus alerting, enable the tiering metrics writer:

sudo systemctl enable --now healtharchive-tiering-metrics.timer

This requires node_exporter to have the textfile collector enabled (the repo installer does this):

cd /opt/healtharchive
git pull
sudo ./scripts/vps-install-observability-exporters.sh --apply

Promote (cold → hot) later (optional)

If you decide a job should be “hot” again:

1) Stop services. 2) umount the canonical path. 3) rsync cold → hot (SSD). 4) Start services.

This is the inverse of the “safe swap” above.
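
A sketch of that inverse procedure, using the same path conventions as the swap above (<job_dir_name> is a placeholder; also remove the corresponding line from /etc/healtharchive/warc-tiering.binds so the bind is not re-applied on boot):

HOT=/srv/healtharchive/jobs/imports/<job_dir_name>
COLD=/srv/healtharchive/storagebox/jobs/imports/<job_dir_name>

sudo systemctl stop healtharchive-worker.service healtharchive-replay.service healtharchive-api.service
sudo umount "$HOT"                              # canonical path falls back to the (empty) local dir on SSD
rsync -rltH --info=progress2 "$COLD/" "$HOT/"   # copy the bytes back onto the SSD
sudo systemctl start healtharchive-api.service healtharchive-worker.service healtharchive-replay.service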

Preflight implications

  • If you intend the upcoming annual campaign outputs to land on the Storage Box, run preflight with:
  • YEAR=2026; ./scripts/vps-preflight-crawl.sh --year "$YEAR" --campaign-archive-root /srv/healtharchive/storagebox/jobs
  • Ensure the Storage Box mount is active before preflight and before the annual campaign runs.

Annual campaign with a tiny SSD (operational pattern)

If the campaign won’t fit on SSD:

1) Keep the Storage Box mounted. 2) Schedule the annual jobs (dry-run first), then create a cold output directory and bind-mount it into the canonical job output directory before the worker runs the job.

Sketch:

# After jobs exist (queued), get the job output dir:
JOB_ID=123
/opt/healtharchive/.venv/bin/healtharchive show-job --id "$JOB_ID" | rg output_dir

# Suppose output_dir is:
HOT=/srv/healtharchive/jobs/hc/20260101T000000Z__hc-2026

# Create a matching cold location and mount it into place:
COLD=/srv/healtharchive/storagebox/jobs/hc/20260101T000000Z__hc-2026
mkdir -p "$COLD"
sudo mount --bind "$COLD" "$HOT"

This keeps DB WARC paths under /srv/healtharchive/jobs/** (stable), while the bytes live on Storage Box.

Common pitfalls

  • mkdir: Permission denied under /srv/healtharchive/storagebox/**:
  • Your sshfs mount is mapped to the wrong UID/GID; remount with uid=$(id -u haadmin), gid=$(getent group healtharchive | cut -d: -f3), and a restrictive umask.
  • rsync ... chgrp failed: Permission denied:
  • Use --no-owner --no-group --no-perms for cross-filesystem copies to Storage Box.
  • Storage Box not mounted (but directories still exist):
  • Checks (and writes) against those paths will silently target the local SSD unless you validate the mount itself; keep the mount persistent and verify with mount | grep storagebox.