Systemd unit templates (single VPS)

These files are templates meant to be copied onto the production VPS under /etc/systemd/system/.

They implement:

  • API service template (uvicorn on loopback; defaults to 2 workers)
  • Replay service template (pywb on loopback; hardened docker run)
  • Worker service template (canonical healtharchive start-worker entrypoint)
  • Annual scheduling timer (Jan 01 UTC)
  • Worker priority lowering during campaign (always-on, low-risk)
  • Storage Box mount (sshfs) for cold WARC storage (optional but recommended for tiering)
  • WARC tiering bind mounts (Storage Box -> canonical paths) (optional; for tiny-SSD setups)
  • Replay reconciliation timer (pywb indexing; capped)
  • Change tracking timer (edition-aware diffs; capped)
  • Baseline drift check timer (policy vs observed; detects config drift)
  • Public surface verification timer (public API + frontend; deeper than uptime checks)
  • Optional "timer ran" pings (Healthchecks-style)
  • Annual search verification capture (optional, safe)

Assumptions (adjust paths/user if your VPS differs):

  • Repo is deployed at: /opt/healtharchive
  • Venv exists at: /opt/healtharchive/.venv
  • Backend env file: /etc/healtharchive/backend.env
  • Backend system user: haadmin

The backend uses per-job flock lock files to prevent double-running a job and to help watchdog scripts classify whether a job is still actively running.

By default, lock files live under /tmp/healtharchive-job-locks, which can be fragile on hardened systems and during cross-user incident response.

Recommended production lock directory:

  • /srv/healtharchive/ops/locks/jobs
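Because the backend uses flock, a watchdog can probe a job's state without pidfiles or signals: flock -n exits non-zero when another process holds the lock. A sketch (the job-7 lock filename is hypothetical; the lock dir is the recommended one above):

```shell
# Probe whether job 7 is still running via its flock lock file
# (lock filename is hypothetical; adjust to your actual per-job lock names)
LOCK=/srv/healtharchive/ops/locks/jobs/job-7.lock
if flock -n "$LOCK" true 2>/dev/null; then
  echo "job 7: not running (lock is free)"
else
  echo "job 7: lock held (job still running) or lock file unreadable"
fi
```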

Enablement (on the VPS):

  1. Ensure ops dirs exist:

     cd /opt/healtharchive && sudo ./scripts/vps-bootstrap-ops-dirs.sh

  2. Set the env var in /etc/healtharchive/backend.env:

     export HEALTHARCHIVE_JOB_LOCK_DIR=/srv/healtharchive/ops/locks/jobs

  3. Restart the worker and any watchdog timers/services that read backend.env during a safe window.

Hard requirement: do not restart the worker while crawls are running unless you accept interruption. Confirm healtharchive list-jobs --status running is empty before restarting.

Recommended (safe, copy/paste checklist):

  • cd /opt/healtharchive && ./scripts/vps-job-lock-dir-cutover.sh

If the script is missing on the VPS, your /opt/healtharchive checkout is behind the repo. You can either deploy/pull first, or stage the cutover manually (no restarts required until your maintenance window):

  • Back up the env file:

    sudo cp -av /etc/healtharchive/backend.env /etc/healtharchive/backend.env.bak.$(date -u +%Y%m%dT%H%M%SZ)

  • Add/update in /etc/healtharchive/backend.env:

    export HEALTHARCHIVE_JOB_LOCK_DIR=/srv/healtharchive/ops/locks/jobs

  • Ensure the lock dir exists (some older vps-bootstrap-ops-dirs.sh versions did not create it):

    sudo install -d -m 2770 -o root -g healtharchive /srv/healtharchive/ops/locks/jobs
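Optionally confirm the staged value parses before your maintenance window (read-only; no restarts involved):

```shell
# Source the env file the way the units do and echo the staged value
sudo sh -c '. /etc/healtharchive/backend.env && echo "HEALTHARCHIVE_JOB_LOCK_DIR=$HEALTHARCHIVE_JOB_LOCK_DIR"'
```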

Files

  • healtharchive-api.service
    • Repo-managed FastAPI/uvicorn service template for the public API.
    • Binds to loopback (127.0.0.1:8001) for Caddy to proxy.
    • Defaults to HEALTHARCHIVE_API_WORKERS=2; override in /etc/healtharchive/backend.env if needed.
  • healtharchive-worker.service
    • Repo-managed worker service template for the long-running crawl worker loop.
    • Uses the canonical CLI entrypoint:
      • ExecStart=/opt/healtharchive/.venv/bin/healtharchive start-worker --poll-interval 30
  • healtharchive-replay.service
    • Repo-managed pywb replay service template for replay.healtharchive.ca.
    • Binds to loopback (127.0.0.1:8090) for Caddy to proxy.
    • Resolves the host hareplay UID and healtharchive GID at startup, then runs docker with -e PYTHONPATH=/webarchive so the managed /srv/healtharchive/replay/sitecustomize.py hook can drop malformed replayed header names before Caddy parses them.
  • healtharchive-schedule-annual.service
    • Apply mode: enqueues annual jobs (--apply) for the current UTC year.
    • Gated by ConditionPathExists=/etc/healtharchive/automation-enabled.
    • RefuseManualStart=yes to prevent accidental systemctl start while the worker is running.
  • healtharchive-schedule-annual.timer
    • Runs at *-01-01 00:05:00 UTC.
    • Persistent=true (runs on next boot if missed).
  • healtharchive-schedule-annual-dry-run.service
    • Safe validation service (no DB writes).
  • healtharchive-worker.service.override.conf
    • Drop-in that lowers worker CPU/IO priority to keep the API responsive.
  • healtharchive-replay-reconcile.service
    • Apply mode: runs healtharchive replay-reconcile --apply --max-jobs 1.
    • Gated by ConditionPathExists=/etc/healtharchive/replay-automation-enabled.
    • Uses a lock file under /srv/healtharchive/replay/.locks/ to prevent concurrent runs.
    • Runs as root because it has to bridge host-side replay collection writes with docker exec into the hardened pywb container; running it as haadmin leaves new hareplay-owned collections stuck in missing_index,missing_warc_links.
  • healtharchive-replay-reconcile.timer
    • Daily at *-*-* 02:30:00 UTC.
    • Persistent=true (runs on next boot if missed).
  • healtharchive-replay-reconcile-dry-run.service
    • Safe validation service (no docker exec, no filesystem writes beyond the lock file dir).
  • healtharchive-change-tracking.service
    • Apply mode: runs healtharchive compute-changes (edition-aware diffs).
    • Gated by ConditionPathExists=/etc/healtharchive/change-tracking-enabled.
  • healtharchive-change-tracking.timer
    • Daily at *-*-* 03:40:00 UTC.
    • Persistent=true (runs on next boot if missed).
  • healtharchive-change-tracking-dry-run.service
    • Safe validation service (no DB writes; reports how many diffs would be computed).
  • scripts/systemd-healthchecks-wrapper.sh
    • Helper for optional Healthchecks-style pings without embedding ping URLs in unit files.
  • healtharchive-annual-search-verify.service
    • Runs scripts/annual-search-verify.sh daily, but captures once per year (idempotent).
  • healtharchive-annual-search-verify.timer
    • Daily timer for healtharchive-annual-search-verify.service.
  • healtharchive-coverage-guardrails.service + .timer
    • Writes coverage regression guardrails to the node_exporter textfile collector.
    • Gated by ConditionPathExists=/etc/healtharchive/coverage-guardrails-enabled.
  • healtharchive-replay-smoke.service + .timer
    • Runs replay smoke tests against the latest indexed job per source (node_exporter textfile).
    • Gated by ConditionPathExists=/etc/healtharchive/replay-smoke-enabled.
  • healtharchive-cleanup-automation.service + .timer
    • Cleans indexed jobs using safe temp-nonwarc mode (keeps WARCs).
    • Gated by ConditionPathExists=/etc/healtharchive/cleanup-automation-enabled.
  • healtharchive-disk-threshold-cleanup.service + .timer
    • Runs safe temp-nonwarc cleanup in threshold mode (no-op when disk is below threshold).
    • Gated by ConditionPathExists=/etc/healtharchive/cleanup-automation-enabled.
  • healtharchive-baseline-drift-check.service
    • Runs scripts/check_baseline_drift.py (policy vs observed; writes artifacts under /srv/healtharchive/ops/baseline/).
    • Gated by ConditionPathExists=/etc/healtharchive/baseline-drift-enabled.
  • healtharchive-baseline-drift-check.timer
    • Weekly timer for healtharchive-baseline-drift-check.service.
  • healtharchive-public-surface-verify.service
    • Runs scripts/verify_public_surface.py (public API + frontend; includes changes/RSS and partner pages).
    • Gated by ConditionPathExists=/etc/healtharchive/public-verify-enabled.
    • Intended as a deeper “synthetic check” than external uptime monitors.
  • healtharchive-public-surface-verify.timer
    • Daily timer for healtharchive-public-surface-verify.service.
  • healtharchive-tiering-metrics.service + .timer
    • Writes a small set of tiering health metrics to the node_exporter textfile collector.
    • Used to alert on Storage Box / tiering failures without needing a systemd collector.
    • Prereq: node_exporter must run with --collector.textfile.directory=/var/lib/node_exporter/textfile_collector (configured by scripts/vps-install-observability-exporters.sh).
  • healtharchive-crawl-metrics.service + .timer
    • Writes per-job crawl progress/stall metrics (based on crawlStatus logs) to the node_exporter textfile collector.
    • Used to alert on stalled crawls without manual log tailing.
    • Prereq: node_exporter textfile collector is enabled (same as tiering metrics).
  • healtharchive-crawl-auto-recover.service + .timer
    • Optional automation to recover stalled crawl jobs by marking stale running jobs as retryable (and restarting the worker when needed).
    • Gated by ConditionPathExists=/etc/healtharchive/crawl-auto-recover-enabled.
    • Disabled by default; enable only after you’re comfortable with the thresholds/caps in scripts/vps-crawl-auto-recover.py.
    • Note: automation-first alerting for crawl stalls assumes this watchdog is enabled and its textfile metrics are fresh.
  • healtharchive-worker-auto-start.service + .timer
    • Optional automation to ensure the worker is running when it should be (jobs pending + storage OK).
    • Gated by ConditionPathExists=/etc/healtharchive/worker-auto-start-enabled.
    • Conservative by default; prefers a “do nothing” skip over unsafe starts.
    • Note: automation-first worker-down alert suppression assumes this watchdog is enabled and its textfile metrics are fresh.
  • healtharchive-drift-auto-reconcile.service + .timer
    • Optional automation to recover from deployment dependency drift (calls vps-deploy.sh).
    • Read-only unless drift is found in the baseline report; triggered every 5 minutes.
    • Gated by ConditionPathExists=/etc/healtharchive/drift-auto-reconcile-enabled.
  • healtharchive-storage-hotpath-auto-recover.service + .timer
    • Optional automation to recover stale/unreadable hot paths caused by sshfs/FUSE mount failures (Errno 107).
    • Gated by ConditionPathExists=/etc/healtharchive/storage-hotpath-auto-recover-enabled.
    • Disabled by default; enable only after dry-run validation and only if you’re comfortable with the safety caps in scripts/vps-storage-hotpath-auto-recover.py.
    • Note: automation-first suppression of Errno 107 symptom alerts assumes this watchdog is enabled and its textfile metrics are fresh.
  • healtharchive-storage-watchdog-burnin-snapshot.service + .timer
    • Optional read-only daily snapshot of the storage hot-path watchdog burn-in summary.
    • Gated by ConditionPathExists=/etc/healtharchive/storage-watchdog-burnin-enabled.
    • Writes dated JSON artifacts under /srv/healtharchive/ops/burnin/storage-watchdog/ (and latest.json).
  • healtharchive-storagebox-sshfs.service
    • Mounts a Hetzner Storage Box at /srv/healtharchive/storagebox via sshfs.
    • Reads configuration from /etc/healtharchive/storagebox.env.
    • Intended for tiered WARC storage on small SSD hosts.
  • healtharchive-warc-tiering.service
    • Applies bind mounts from /etc/healtharchive/warc-tiering.binds so canonical archive paths under /srv/healtharchive/jobs/** resolve to Storage Box data.
    • Runs before the API/worker/replay services start.
  • healtharchive-annual-output-tiering.service
    • After annual jobs are enqueued, bind-mounts each annual job output_dir onto the Storage Box tier.
    • Triggered via OnSuccess= in healtharchive-schedule-annual.service (template).
  • healtharchive-annual-campaign-sentinel.service + .timer
    • Runs a “day-of” annual readiness gate automatically: preflight + annual-status + tiering checks.
    • Writes a small Prometheus textfile metric so Alertmanager can notify on failures.
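Most of these templates share the same gating shape; a condensed illustration follows (not a copy of any shipped unit — Type=, User=, and the specific sentinel path and ExecStart command are examples assembled from this doc):

```ini
[Unit]
Description=HealthArchive change tracking (apply mode)
# Sentinel gate: without this file, the unit exits as a no-op.
ConditionPathExists=/etc/healtharchive/change-tracking-enabled

[Service]
Type=oneshot
User=haadmin
# Leading "-": a missing ping-URL file is non-fatal.
EnvironmentFile=-/etc/healtharchive/healthchecks.env
EnvironmentFile=/etc/healtharchive/backend.env
ExecStart=/opt/healtharchive/.venv/bin/healtharchive compute-changes
```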

These timers are safe-by-default and gated by sentinel files. Enable only what matches your operational readiness.

  • Change tracking (healtharchive-change-tracking.timer)
    • Recommended to enable once the snapshot_changes table exists and a dry run succeeds without errors.
  • Annual scheduling (healtharchive-schedule-annual.timer)
    • Enable only after an annual dry-run succeeds and storage headroom is confirmed.
  • Replay reconcile (healtharchive-replay-reconcile.timer)
    • Enable only if replay is enabled and stable.
  • Annual search verification (healtharchive-annual-search-verify.timer)
    • Optional; safe to enable if you want a yearly search QA artifact.
  • Coverage guardrails (healtharchive-coverage-guardrails.timer)
    • Recommended once you have at least two annual editions indexed.
  • Replay smoke tests (healtharchive-replay-smoke.timer)
    • Enable only if replay is enabled and stable.
  • Cleanup automation (healtharchive-cleanup-automation.timer)
    • Optional; keep caps conservative and review the first dry-run.
  • Disk threshold cleanup (healtharchive-disk-threshold-cleanup.timer)
    • Optional; runs every 30 minutes but only applies cleanup when disk usage exceeds the configured threshold.
  • Baseline drift check (healtharchive-baseline-drift-check.timer)
    • Recommended; low-risk and catches “silent” ops drift.
  • Storage hot-path auto-recover (healtharchive-storage-hotpath-auto-recover.timer)
    • Dangerous if misconfigured; only enable after you’ve validated Phase 1 alerts and run the watchdog in dry-run mode.
    • The unit is gated by a venv presence check, and the watchdog skips runs while the deploy lock is held (to avoid flapping during active deploys).
  • Storage watchdog burn-in snapshots (healtharchive-storage-watchdog-burnin-snapshot.timer)
    • Read-only; safe to enable during rollout/burn-in weeks so evidence is captured automatically.
  • Worker auto-start watchdog (healtharchive-worker-auto-start.timer)
    • Recommended once you’re confident in the single-VPS production automation stack.
    • The unit is sentinel-gated, refuses starts when the Storage Box is unreadable, and performs conservative reconciliation of stale status=running drift before deciding whether to start the worker.
  • Drift auto-reconcile watchdog (healtharchive-drift-auto-reconcile.timer)
    • Recommended for self-healing missing pip dependencies that cause API 502s.
    • Runs every 5 minutes and invokes vps-deploy.sh if the baseline report detects virtual environment drift.

If a timer is enabled, also ensure its sentinel file exists under /etc/healtharchive/ (see the enablement sections below).
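Since each gate is just a file under /etc/healtharchive/, a quick read-only audit of which automations are currently enabled is:

```shell
# List sentinel gate files currently present (one per enabled automation)
ls -1 /etc/healtharchive/*-enabled 2>/dev/null || echo "no sentinel files present"
```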


Install / update on the VPS

Preferred (one command; installs the managed API/replay/worker templates, timer templates, and the worker priority drop-in):

cd /opt/healtharchive
sudo ./scripts/vps-install-systemd-units.sh --apply --restart-worker

Run a crawl job detached (optional)

If you need to run a specific DB-backed crawl job manually (for debugging or recovery), prefer launching it as a transient systemd unit so your SSH session doesn’t need to stay open:

sudo /opt/healtharchive/scripts/vps-run-db-job-detached.py --id 7 --retry-first

Follow the printed journalctl -u <unit>.service -f command to tail logs.

If you are using WARC tiering with a Storage Box, also create these files on the VPS:

  • /etc/healtharchive/storagebox.env
  • Configuration consumed by healtharchive-storagebox-sshfs.service.
  • /etc/healtharchive/warc-tiering.binds
  • Bind mount manifest consumed by healtharchive-warc-tiering.service.

See: docs/operations/playbooks/storage/warc-storage-tiering.md.

Before enabling timers that write artifacts under /srv/healtharchive/ops/, ensure the ops directories exist with the expected permissions:

cd /opt/healtharchive
sudo ./scripts/vps-bootstrap-ops-dirs.sh

Manual install (equivalent):

Copy unit files:

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-schedule-annual.service \
  /etc/systemd/system/healtharchive-schedule-annual.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-schedule-annual.timer \
  /etc/systemd/system/healtharchive-schedule-annual.timer

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-schedule-annual-dry-run.service \
  /etc/systemd/system/healtharchive-schedule-annual-dry-run.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-replay-reconcile.service \
  /etc/systemd/system/healtharchive-replay-reconcile.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-replay-reconcile.timer \
  /etc/systemd/system/healtharchive-replay-reconcile.timer

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-replay-reconcile-dry-run.service \
  /etc/systemd/system/healtharchive-replay-reconcile-dry-run.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-change-tracking.service \
  /etc/systemd/system/healtharchive-change-tracking.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-change-tracking.timer \
  /etc/systemd/system/healtharchive-change-tracking.timer

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-change-tracking-dry-run.service \
  /etc/systemd/system/healtharchive-change-tracking-dry-run.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-annual-search-verify.service \
  /etc/systemd/system/healtharchive-annual-search-verify.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-annual-search-verify.timer \
  /etc/systemd/system/healtharchive-annual-search-verify.timer

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-baseline-drift-check.service \
  /etc/systemd/system/healtharchive-baseline-drift-check.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-baseline-drift-check.timer \
  /etc/systemd/system/healtharchive-baseline-drift-check.timer

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-public-surface-verify.service \
  /etc/systemd/system/healtharchive-public-surface-verify.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-public-surface-verify.timer \
  /etc/systemd/system/healtharchive-public-surface-verify.timer

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-drift-auto-reconcile.service \
  /etc/systemd/system/healtharchive-drift-auto-reconcile.service

sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-drift-auto-reconcile.timer \
  /etc/systemd/system/healtharchive-drift-auto-reconcile.timer

Install the worker priority drop-in:

sudo install -d -m 0755 -o root -g root /etc/systemd/system/healtharchive-worker.service.d
sudo install -m 0644 -o root -g root \
  /opt/healtharchive/docs/deployment/systemd/healtharchive-worker.service.override.conf \
  /etc/systemd/system/healtharchive-worker.service.d/override.conf

Reload systemd:

sudo systemctl daemon-reload

Restart worker to pick up priority changes:

sudo systemctl restart healtharchive-worker

Verify the priority values:

systemctl show healtharchive-worker -p Nice -p IOSchedulingClass -p IOSchedulingPriority

Optional: "timer ran" pings (Healthchecks-style)

This repo does not commit ping URLs. If you want "did it run?" checks, create a root-owned env file on the VPS:

sudo install -d -m 0755 -o root -g root /etc/healtharchive
# Only create the file if missing (don't clobber existing values like HC_DB_BACKUP_URL).
sudo test -f /etc/healtharchive/healthchecks.env || sudo install -m 0600 -o root -g root /dev/null /etc/healtharchive/healthchecks.env
sudo chown root:root /etc/healtharchive/healthchecks.env
sudo chmod 0600 /etc/healtharchive/healthchecks.env

Edit /etc/healtharchive/healthchecks.env and set (examples):

# You do NOT need to set every variable listed here.
# Only set the variables for Healthchecks checks you have actually created.
# If a variable is missing/empty, the service still runs; it just won't ping.
#
# Note: the Healthchecks "Name" can be anything; it does not need to match the
# env var name. The env var is just how systemd finds the ping URL.
HEALTHARCHIVE_HC_PING_REPLAY_RECONCILE=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_SCHEDULE_ANNUAL=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_ANNUAL_SENTINEL=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_CHANGE_TRACKING=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_BASELINE_DRIFT=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_PUBLIC_VERIFY=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_ANNUAL_SEARCH_VERIFY=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_COVERAGE_GUARDRAILS=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_REPLAY_SMOKE=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_CLEANUP_AUTOMATION=https://hc-ping.com/UUID_HERE

Notes:

  • The unit templates use EnvironmentFile=-/etc/healtharchive/healthchecks.env so the file is optional.
  • If you also use the legacy Healthchecks-based scripts described in ../production-single-vps.md (DB backup + disk check), keep their variables in the same file too (HC_DB_BACKUP_URL, HC_DISK_URL, HC_DISK_THRESHOLD).
  • If set, services will best-effort ping:
  • <url>/start at the beginning
  • <url> on success
  • <url>/fail on failure
  • Ping failures do not fail the service.
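The start/success/fail semantics above can be sketched roughly as follows (a hedged sketch, not the actual scripts/systemd-healthchecks-wrapper.sh; the ping_hc helper name and the wrapped command are illustrative):

```shell
# Rough sketch of the ping-around-a-command pattern (helper name hypothetical)
ping_hc() {
  # $1 = base ping URL (may be empty/unset), $2 = optional suffix (/start or /fail)
  [ -n "$1" ] || return 0                              # no URL configured: silently skip
  curl -fsS -m 10 --retry 2 "$1$2" >/dev/null || true  # ping failures never fail the unit
}

url="${HEALTHARCHIVE_HC_PING_CHANGE_TRACKING:-}"
ping_hc "$url" /start                                  # signal "started"
if /opt/healtharchive/.venv/bin/healtharchive compute-changes; then
  ping_hc "$url"                                       # signal success
else
  ping_hc "$url" /fail                                 # signal failure
fi
```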

Audit Healthchecks alignment (safe)

This script compares:

  • What ping env vars are set in /etc/healtharchive/healthchecks.env
  • What ping vars are referenced by installed systemd unit files (via --ping-var ...)
  • Which timers exist (for manual cross-check with Healthchecks “last ping” timestamps)

Run on the VPS:

cd /opt/healtharchive
sudo python3 ./scripts/verify_healthchecks_alignment.py

If it reports “referenced but unset”, you either:

  • Intentionally have pings disabled for those timers (OK), or
  • Should create the missing checks in Healthchecks and add the missing env vars.

If it reports “set but unused”, you likely have a stale env var (remove it) or the unit that used to reference it was removed/renamed.


Validate the annual scheduler (safe)

This dry-run service exercises DB connectivity + scheduler output without creating jobs:

sudo systemctl start healtharchive-schedule-annual-dry-run.service
sudo journalctl -u healtharchive-schedule-annual-dry-run.service -n 200 --no-pager

Do not run healtharchive-schedule-annual.service manually in production; it enqueues jobs and the worker may start crawling immediately.


Validate replay reconciliation (safe)

This dry-run service exercises DB connectivity + filesystem drift detection without running any docker exec commands:

sudo systemctl start healtharchive-replay-reconcile-dry-run.service
sudo journalctl -u healtharchive-replay-reconcile-dry-run.service -n 200 --no-pager

Validate change tracking (safe)

This dry-run service exercises DB connectivity and reports how many diffs would be computed:

sudo systemctl start healtharchive-change-tracking-dry-run.service
sudo journalctl -u healtharchive-change-tracking-dry-run.service -n 200 --no-pager

If you see an error like relation "snapshot_changes" does not exist, apply migrations first (idempotent):

cd /opt/healtharchive
sudo -u haadmin /opt/healtharchive/.venv/bin/alembic upgrade head

Enable automation (Jan 01)

Create the automation sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/automation-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-schedule-annual.timer
systemctl list-timers | rg healtharchive-schedule-annual || systemctl list-timers | grep healtharchive-schedule-annual

Note: do not enable the .service units directly; only the .timer should be enabled.


Enable replay reconciliation automation (optional)

Create the replay automation sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/replay-automation-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-replay-reconcile.timer
systemctl list-timers | rg healtharchive-replay-reconcile || systemctl list-timers | grep healtharchive-replay-reconcile

Note: by default, the timer only reconciles replay indexing. Preview image generation is intentionally left manual/capped until you decide it’s stable enough to automate.


Enable change tracking automation (optional)

Create the change tracking sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/change-tracking-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-change-tracking.timer
systemctl list-timers | rg healtharchive-change-tracking || systemctl list-timers | grep healtharchive-change-tracking

Enable annual search verification capture (optional)

This captures golden-query /api/search JSON once per year after the annual campaign becomes search-ready.

The service is idempotent:

  • If the campaign isn't ready, it exits 0 (no failure spam).
  • If artifacts already exist for the current year/run-id, it exits 0.

Enable the timer:

sudo systemctl enable --now healtharchive-annual-search-verify.timer
systemctl list-timers | rg healtharchive-annual-search-verify || systemctl list-timers | grep healtharchive-annual-search-verify

Artifacts default to:

  • /srv/healtharchive/ops/search-eval/<year>/final/

To force a re-run for the current year, delete that directory and run the service once.
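Concretely (the year shown is an example; substitute the current one):

```shell
# Delete this year's artifacts, then run the capture service once
# (2025 is illustrative; use the actual current year)
sudo rm -rf /srv/healtharchive/ops/search-eval/2025/final/
sudo systemctl start healtharchive-annual-search-verify.service
```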


Enable coverage guardrails (optional)

This emits daily metrics comparing the latest indexed annual job to the prior year per source.

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/coverage-guardrails-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-coverage-guardrails.timer
systemctl list-timers | rg healtharchive-coverage-guardrails || systemctl list-timers | grep healtharchive-coverage-guardrails

Enable replay smoke tests (optional)

This runs lightweight replay checks against the latest indexed job per source.

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/replay-smoke-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-replay-smoke.timer
systemctl list-timers | rg healtharchive-replay-smoke || systemctl list-timers | grep healtharchive-replay-smoke

Enable cleanup automation (optional)

This runs safe temp-nonwarc cleanup for older indexed jobs (keeps WARCs).

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/cleanup-automation-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-cleanup-automation.timer
systemctl list-timers | rg healtharchive-cleanup-automation || systemctl list-timers | grep healtharchive-cleanup-automation

Enable disk threshold cleanup (optional)

This is an event-driven safety net for disk pressure:

  • it runs on a frequent timer (every 30 minutes),
  • but only applies cleanup if disk usage exceeds threshold_trigger_percent from ops/automation/cleanup-automation.toml,
  • and uses threshold_max_jobs_per_run as the cap when triggered.
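The two threshold knobs live in the repo's cleanup config; an illustrative fragment follows (key names from this doc; the values and their placement within the file are assumptions, not recommendations):

```toml
# ops/automation/cleanup-automation.toml (fragment; values illustrative)
threshold_trigger_percent = 85
threshold_max_jobs_per_run = 2
```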

It is gated by the same sentinel file as weekly cleanup:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/cleanup-automation-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-disk-threshold-cleanup.timer
systemctl list-timers | rg healtharchive-disk-threshold-cleanup || systemctl list-timers | grep healtharchive-disk-threshold-cleanup

Rollback:

sudo systemctl disable --now healtharchive-disk-threshold-cleanup.timer

Enable storage hot-path auto-recovery (optional; high impact)

This automation attempts conservative self-healing for the specific failure class:

  • OSError: [Errno 107] Transport endpoint is not connected

It can unmount stale hot paths and re-apply tiering. It will only stop the worker when either:

  • a running job output directory is detected as stale (Errno 107), or
  • there are no running jobs (to prevent races while repairing mountpoints for the next jobs).

It also probes the output dirs of the next queued/retryable jobs to prevent infra-error retry storms (a stale mountpoint for a retryable job should be repaired before the worker selects it).

If no stale targets are currently eligible but healtharchive-warc-tiering.service is stuck in failed, the watchdog will attempt a conservative reconcile (reset-failed + start) when the base Storage Box mount is readable. This helps clear persistent HealthArchiveWarcTieringFailed alerts caused by stale historical unit state.

After successful mount recovery it restarts replay (best-effort) so replay sees a clean view of /srv/healtharchive/jobs.

Keep it disabled by default and enable only after:

  • Phase 1 alerting/metrics are working (you have visibility), and
  • you have validated the watchdog in dry-run mode.

If you plan to rely on automation-first alert suppression for stale-mount symptoms, also verify the watchdog textfile metrics stay fresh in Prometheus after enablement.

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/storage-hotpath-auto-recover-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-storage-hotpath-auto-recover.timer
systemctl list-timers | rg healtharchive-storage-hotpath-auto-recover || systemctl list-timers | grep healtharchive-storage-hotpath-auto-recover

Rollback:

sudo systemctl disable --now healtharchive-storage-hotpath-auto-recover.timer
sudo rm -f /etc/healtharchive/storage-hotpath-auto-recover-enabled

The watchdog writes state under:

  • /srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json

and emits node_exporter textfile metrics via:

  • healtharchive_storage_hotpath_auto_recover.prom

Enable storage watchdog burn-in snapshots (optional; low impact)

This automation captures a daily read-only snapshot of the storage watchdog burn-in report so you have evidence artifacts even if nobody remembers to run the command manually.

Precondition: ops directories exist (idempotent):

cd /opt/healtharchive
sudo ./scripts/vps-bootstrap-ops-dirs.sh

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/storage-watchdog-burnin-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-storage-watchdog-burnin-snapshot.timer
systemctl list-timers | rg healtharchive-storage-watchdog-burnin-snapshot || systemctl list-timers | grep healtharchive-storage-watchdog-burnin-snapshot

Artifacts:

  • /srv/healtharchive/ops/burnin/storage-watchdog/latest.json
  • /srv/healtharchive/ops/burnin/storage-watchdog/storage-watchdog-burnin-YYYYMMDD.json

Rollback:

sudo systemctl disable --now healtharchive-storage-watchdog-burnin-snapshot.timer
sudo rm -f /etc/healtharchive/storage-watchdog-burnin-enabled

Enable worker auto-start watchdog (optional; conservative)

This automation exists to prevent “everything stopped” failures where the system is healthy enough to run, but healtharchive-worker.service is down and jobs are pending.

It will only start the worker when all of these are true:

  • the worker unit is inactive,
  • there are pending crawl jobs (status in (queued, retryable)),
  • the Storage Box mount is readable,
  • the deploy lock is not present (or is stale),
  • and there are no DB jobs in status=running (conservative safety gate).

If you plan to rely on automation-first suppression for worker-down notifications, also verify the watchdog textfile metrics are present/fresh in Prometheus after enablement.

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/worker-auto-start-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-worker-auto-start.timer
systemctl list-timers | rg healtharchive-worker-auto-start || systemctl list-timers | grep healtharchive-worker-auto-start

Rollback:

sudo systemctl disable --now healtharchive-worker-auto-start.timer
sudo rm -f /etc/healtharchive/worker-auto-start-enabled

The watchdog writes state under:

  • /srv/healtharchive/ops/watchdog/worker-auto-start.json

and emits node_exporter textfile metrics via:

  • healtharchive_worker_auto_start.prom
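To spot-check a textfile metric by hand, a small awk helper is enough. The textfile-collector directory and the metric name in the commented usage line are illustrative assumptions; read the shipped .prom file for the real names.

```shell
# metric_value FILE METRIC_NAME -> prints the sample value, if present.
# Works for simple node_exporter textfile lines of the form "name value".
metric_value() {
  awk -v m="$2" '$1 == m { print $2 }' "$1"
}

# Hypothetical usage (directory and metric name are assumptions):
# metric_value /var/lib/node_exporter/textfile/healtharchive_worker_auto_start.prom \
#   healtharchive_worker_auto_start_last_run_timestamp_seconds
```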

Enable drift auto-reconcile watchdog (optional)

This automation prevents the “502 Bad Gateway” API outages that occur when dependencies (such as Python packages in .venv) drift from the deployed codebase after an incomplete or manual code update.

It reads the results of healtharchive-baseline-drift-check.timer's periodic checks to decide whether an environment rebuild (via scripts/vps-deploy.sh) is needed, applying cooldowns to avoid flapping.
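The cooldown gate can be sketched as a timestamp comparison. The stamp-file path and the 6-hour window below are illustrative assumptions, not the shipped defaults:

```shell
COOLDOWN=21600  # 6 hours, illustrative only

# allow_reconcile STAMP_FILE -> exit 0 when the last rebuild (epoch seconds
# stored in STAMP_FILE, treated as 0 when absent) is older than the cooldown.
allow_reconcile() {
  local stamp=$1 now last
  now=$(date +%s)
  last=$(cat "$stamp" 2>/dev/null || echo 0)
  [ $(( now - last )) -ge "$COOLDOWN" ]
}

if allow_reconcile /srv/healtharchive/ops/watchdog/last-reconcile.stamp; then
  echo "cooldown elapsed: rebuild would be permitted"
fi
```

Whatever the real values are, the point of the stamp file is that a reconcile which fails to fix drift cannot immediately retrigger itself.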

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/drift-auto-reconcile-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-drift-auto-reconcile.timer
systemctl list-timers | rg healtharchive-drift-auto-reconcile || systemctl list-timers | grep healtharchive-drift-auto-reconcile

Rollback:

sudo systemctl disable --now healtharchive-drift-auto-reconcile.timer
sudo rm -f /etc/healtharchive/drift-auto-reconcile-enabled

The watchdog writes state under:

  • /srv/healtharchive/ops/watchdog/drift-auto-reconcile.json

and emits node_exporter textfile metrics via:

  • healtharchive_drift_auto_reconcile.prom

Enable baseline drift check timer (optional)

Baseline drift checks validate that production still matches the project’s expected invariants (security posture, file permissions, unit enablement).

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/baseline-drift-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-baseline-drift-check.timer
systemctl list-timers | rg healtharchive-baseline-drift-check || systemctl list-timers | grep healtharchive-baseline-drift-check

Artifacts are written under:

  • /srv/healtharchive/ops/baseline/

If the drift check fails, inspect:

  • /srv/healtharchive/ops/baseline/drift-report-latest.txt
  • journalctl -u healtharchive-baseline-drift-check.service --no-pager -l

If required drift is fixed and you want to confirm recovery immediately:

sudo systemctl start healtharchive-baseline-drift-check.service
sudo journalctl -u healtharchive-baseline-drift-check.service -n 200 --no-pager -l

Healthchecks interpretation note:

  • A success ping after healtharchive-baseline-drift-check.service is started manually means the service passed on that rerun.
  • It does not prove the weekly timer itself fired on schedule.
  • To confirm the actual timer window, compare:

sudo systemctl status healtharchive-baseline-drift-check.timer --no-pager -l
sudo systemctl list-timers --all | grep healtharchive-baseline-drift-check
sudo journalctl -u healtharchive-baseline-drift-check.timer --no-pager -l
  • journalctl -u ...timer --since ... can still show -- No entries -- if the visible timer startup log predates your --since cutoff.

Enable public surface verification timer (optional)

This is a deeper synthetic check than external uptime monitors. It validates:

  • public API health, sources, search, snapshot detail and raw HTML
  • replay browse URL (unless skipped)
  • exports manifest and export endpoint HEADs
  • changes feed + RSS
  • key frontend pages, including /brief, /cite, /methods, and /governance
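A hand-rolled version of this sweep is a status-code loop with curl. The base URL and the /api/health path below are assumptions for illustration (the frontend paths are the ones listed above); the shipped timer's checks are authoritative and go deeper than status codes.

```shell
# verify_status URL [EXPECTED] -> "ok URL" or "FAIL URL got CODE"
verify_status() {
  local url=$1 want=${2:-200} got
  got=$(curl -s -o /dev/null --max-time 15 -w '%{http_code}' "$url")
  if [ "$got" = "$want" ]; then echo "ok $url"; else echo "FAIL $url got $got"; fi
}

# BASE_URL is an assumption; set it to the real public origin before running.
BASE_URL=${BASE_URL:-}
if [ -n "$BASE_URL" ]; then
  for p in /api/health /brief /cite /methods /governance; do
    verify_status "$BASE_URL$p"
  done
fi
```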

Create the sentinel file:

sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/public-verify-enabled

Enable the timer:

sudo systemctl enable --now healtharchive-public-surface-verify.timer
systemctl list-timers | rg healtharchive-public-surface-verify || systemctl list-timers | grep healtharchive-public-surface-verify

Rollback / disable quickly

  • Disable the annual scheduling timer immediately:
sudo systemctl disable --now healtharchive-schedule-annual.timer
  • Disable all scheduling automation immediately:
sudo rm -f /etc/healtharchive/automation-enabled
  • Disable replay reconciliation automation immediately:
sudo systemctl disable --now healtharchive-replay-reconcile.timer
sudo rm -f /etc/healtharchive/replay-automation-enabled
  • Disable annual search verification automation immediately:
sudo systemctl disable --now healtharchive-annual-search-verify.timer
  • Remove the worker priority override:
sudo rm -f /etc/systemd/system/healtharchive-worker.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart healtharchive-worker
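For incident response, the per-feature rollbacks above can be collected into one dry-run helper that prints every disable command named in this document; review the output, then pipe it to sudo bash.

```shell
# Prints (does not run) the commands that disable every optional timer and
# remove every sentinel file documented above.
gen_disable_cmds() {
  local t s
  for t in healtharchive-schedule-annual healtharchive-replay-reconcile \
           healtharchive-annual-search-verify \
           healtharchive-storage-watchdog-burnin-snapshot \
           healtharchive-worker-auto-start healtharchive-drift-auto-reconcile \
           healtharchive-baseline-drift-check \
           healtharchive-public-surface-verify; do
    echo "systemctl disable --now $t.timer"
  done
  for s in automation-enabled replay-automation-enabled \
           storage-watchdog-burnin-enabled worker-auto-start-enabled \
           drift-auto-reconcile-enabled baseline-drift-enabled \
           public-verify-enabled; do
    echo "rm -f /etc/healtharchive/$s"
  done
}

gen_disable_cmds                 # review the commands first
# gen_disable_cmds | sudo bash   # then apply
```

Printing first keeps this safe to run anywhere; nothing is executed until you deliberately pipe the output to a privileged shell.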