Systemd unit templates (single VPS)
These files are templates meant to be copied onto the production VPS under /etc/systemd/system/.
They implement:
- API service template (uvicorn on loopback; defaults to 2 workers)
- Replay service template (pywb on loopback; hardened docker run)
- Worker service template (canonical `healtharchive start-worker` entrypoint)
- Annual scheduling timer (Jan 01 UTC)
- Worker priority lowering during campaign (always-on, low-risk)
- Storage Box mount (sshfs) for cold WARC storage (optional but recommended for tiering)
- WARC tiering bind mounts (Storage Box -> canonical paths) (optional; for tiny-SSD setups)
- Replay reconciliation timer (pywb indexing; capped)
- Change tracking timer (edition-aware diffs; capped)
- Baseline drift check timer (policy vs observed; detects config drift)
- Public surface verification timer (public API + frontend; deeper than uptime checks)
- Optional "timer ran" pings (Healthchecks-style)
- Annual search verification capture (optional, safe)
Assumptions (adjust paths/user if your VPS differs):
- Repo is deployed at: `/opt/healtharchive`
- Venv exists at: `/opt/healtharchive/.venv`
- Backend env file: `/etc/healtharchive/backend.env`
- Backend system user: `haadmin`
Job lock directory (recommended)
The backend uses per-job flock lock files to prevent double-running a job and to help watchdog scripts classify whether a job is still actively running.
By default, lock files live under /tmp/healtharchive-job-locks, which can be fragile on hardened systems and during cross-user incident response.
Recommended production lock directory:
/srv/healtharchive/ops/locks/jobs
Enablement (on the VPS):
- Ensure ops dirs exist:
  `cd /opt/healtharchive && sudo ./scripts/vps-bootstrap-ops-dirs.sh`
- Set the env var in `/etc/healtharchive/backend.env`:
  `export HEALTHARCHIVE_JOB_LOCK_DIR=/srv/healtharchive/ops/locks/jobs`
- Restart the worker and any watchdog timers/services that read `backend.env` during a safe window.
Hard requirement: do not restart the worker while crawls are running unless you accept interruption. Confirm that `healtharchive list-jobs --status running` returns no jobs before restarting.
Recommended (safe, copy/paste checklist):
cd /opt/healtharchive && ./scripts/vps-job-lock-dir-cutover.sh
If the script is missing on the VPS, your /opt/healtharchive checkout is behind the repo. You can either deploy/pull first, or stage the cutover manually (no restarts required until your maintenance window):
- `sudo cp -av /etc/healtharchive/backend.env /etc/healtharchive/backend.env.bak.$(date -u +%Y%m%dT%H%M%SZ)`
- Add/update:
  `export HEALTHARCHIVE_JOB_LOCK_DIR=/srv/healtharchive/ops/locks/jobs`
- Ensure the lock dir exists (some older `vps-bootstrap-ops-dirs.sh` versions did not create it):
  `sudo install -d -m 2770 -o root -g healtharchive /srv/healtharchive/ops/locks/jobs`
Files
- `healtharchive-api.service`
  - Repo-managed FastAPI/uvicorn service template for the public API.
  - Binds to loopback (`127.0.0.1:8001`) for Caddy to proxy.
  - Defaults to `HEALTHARCHIVE_API_WORKERS=2`; override in `/etc/healtharchive/backend.env` if needed.
- `healtharchive-worker.service`
  - Repo-managed worker service template for the long-running crawl worker loop.
  - Uses the canonical CLI entrypoint: `ExecStart=/opt/healtharchive/.venv/bin/healtharchive start-worker --poll-interval 30`
- `healtharchive-replay.service`
  - Repo-managed pywb replay service template for `replay.healtharchive.ca`.
  - Binds to loopback (`127.0.0.1:8090`) for Caddy to proxy.
  - Resolves the host `hareplay` UID and `healtharchive` GID at startup, then runs docker with `-e PYTHONPATH=/webarchive` so the managed `/srv/healtharchive/replay/sitecustomize.py` hook can drop malformed replayed header names before Caddy parses them.
- `healtharchive-schedule-annual.service`
  - Apply mode: enqueues annual jobs (`--apply`) for the current UTC year.
  - Gated by `ConditionPathExists=/etc/healtharchive/automation-enabled`.
  - `RefuseManualStart=yes` to prevent accidental `systemctl start` while the worker is running.
- `healtharchive-schedule-annual.timer`
  - Runs at `*-01-01 00:05:00 UTC`
  - `Persistent=true` (runs on next boot if missed)
- `healtharchive-schedule-annual-dry-run.service`
  - Safe validation service (no DB writes).
- `healtharchive-worker.service.override.conf`
  - Drop-in that lowers worker CPU/IO priority to keep the API responsive.
- `healtharchive-replay-reconcile.service`
  - Apply mode: runs `healtharchive replay-reconcile --apply --max-jobs 1`.
  - Gated by `ConditionPathExists=/etc/healtharchive/replay-automation-enabled`.
  - Uses a lock file under `/srv/healtharchive/replay/.locks/` to prevent concurrent runs.
  - Runs as root because it has to bridge host-side replay collection writes with `docker exec` into the hardened pywb container; running it as `haadmin` leaves new `hareplay`-owned collections stuck in `missing_index`, `missing_warc_links`.
- `healtharchive-replay-reconcile.timer`
  - Daily at `*-*-* 02:30:00 UTC`
  - `Persistent=true` (runs on next boot if missed)
- `healtharchive-replay-reconcile-dry-run.service`
  - Safe validation service (no docker exec, no filesystem writes beyond the lock file dir).
- `healtharchive-change-tracking.service`
  - Apply mode: runs `healtharchive compute-changes` (edition-aware diffs).
  - Gated by `ConditionPathExists=/etc/healtharchive/change-tracking-enabled`.
- `healtharchive-change-tracking.timer`
  - Daily at `*-*-* 03:40:00 UTC`
  - `Persistent=true` (runs on next boot if missed)
- `healtharchive-change-tracking-dry-run.service`
  - Safe validation service (no DB writes; reports how many diffs would be computed).
- `scripts/systemd-healthchecks-wrapper.sh`
  - Helper for optional Healthchecks-style pings without embedding ping URLs in unit files.
- `healtharchive-annual-search-verify.service`
  - Runs `scripts/annual-search-verify.sh` daily, but captures once per year (idempotent).
- `healtharchive-annual-search-verify.timer`
  - Daily timer for `healtharchive-annual-search-verify.service`.
- `healtharchive-coverage-guardrails.service` + `.timer`
  - Writes coverage regression guardrails to the node_exporter textfile collector.
  - Gated by `ConditionPathExists=/etc/healtharchive/coverage-guardrails-enabled`.
- `healtharchive-replay-smoke.service` + `.timer`
  - Runs replay smoke tests against the latest indexed job per source (node_exporter textfile).
  - Gated by `ConditionPathExists=/etc/healtharchive/replay-smoke-enabled`.
- `healtharchive-cleanup-automation.service` + `.timer`
  - Cleans indexed jobs using safe `temp-nonwarc` mode (keeps WARCs).
  - Gated by `ConditionPathExists=/etc/healtharchive/cleanup-automation-enabled`.
- `healtharchive-disk-threshold-cleanup.service` + `.timer`
  - Runs safe `temp-nonwarc` cleanup in threshold mode (no-op when disk is below threshold).
  - Gated by `ConditionPathExists=/etc/healtharchive/cleanup-automation-enabled`.
- `healtharchive-baseline-drift-check.service`
  - Runs `scripts/check_baseline_drift.py` (policy vs observed; writes artifacts under `/srv/healtharchive/ops/baseline/`).
  - Gated by `ConditionPathExists=/etc/healtharchive/baseline-drift-enabled`.
- `healtharchive-baseline-drift-check.timer`
  - Weekly timer for `healtharchive-baseline-drift-check.service`.
- `healtharchive-public-surface-verify.service`
  - Runs `scripts/verify_public_surface.py` (public API + frontend; includes changes/RSS and partner pages).
  - Gated by `ConditionPathExists=/etc/healtharchive/public-verify-enabled`.
  - Intended as a deeper “synthetic check” than external uptime monitors.
- `healtharchive-public-surface-verify.timer`
  - Daily timer for `healtharchive-public-surface-verify.service`.
- `healtharchive-tiering-metrics.service` + `.timer`
  - Writes a small set of tiering health metrics to the node_exporter textfile collector.
  - Used to alert on Storage Box / tiering failures without needing a systemd collector.
  - Prereq: node_exporter must run with `--collector.textfile.directory=/var/lib/node_exporter/textfile_collector` (configured by `scripts/vps-install-observability-exporters.sh`).
- `healtharchive-crawl-metrics.service` + `.timer`
  - Writes per-job crawl progress/stall metrics (based on crawlStatus logs) to the node_exporter textfile collector.
  - Used to alert on stalled crawls without manual log tailing.
  - Prereq: node_exporter textfile collector is enabled (same as tiering metrics).
- `healtharchive-crawl-auto-recover.service` + `.timer`
  - Optional automation to recover stalled crawl jobs by marking stale running jobs as retryable (and restarting the worker when needed).
  - Gated by `ConditionPathExists=/etc/healtharchive/crawl-auto-recover-enabled`.
  - Disabled by default; enable only after you’re comfortable with the thresholds/caps in `scripts/vps-crawl-auto-recover.py`.
  - Note: automation-first alerting for crawl stalls assumes this watchdog is enabled and its textfile metrics are fresh.
- `healtharchive-worker-auto-start.service` + `.timer`
  - Optional automation to ensure the worker is running when it should be (jobs pending + storage OK).
  - Gated by `ConditionPathExists=/etc/healtharchive/worker-auto-start-enabled`.
  - Conservative by default; prefers a “do nothing” skip over unsafe starts.
  - Note: automation-first worker-down alert suppression assumes this watchdog is enabled and its textfile metrics are fresh.
- `healtharchive-drift-auto-reconcile.service` + `.timer`
  - Optional automation to recover from deployment dependency drift (calls `vps-deploy.sh`).
  - Read-only unless drift is found in the baseline report; triggered every 5 minutes.
  - Gated by `ConditionPathExists=/etc/healtharchive/drift-auto-reconcile-enabled`.
- `healtharchive-storage-hotpath-auto-recover.service` + `.timer`
  - Optional automation to recover stale/unreadable hot paths caused by `sshfs`/FUSE mount failures (Errno 107).
  - Gated by `ConditionPathExists=/etc/healtharchive/storage-hotpath-auto-recover-enabled`.
  - Disabled by default; enable only after dry-run validation and only if you’re comfortable with the safety caps in `scripts/vps-storage-hotpath-auto-recover.py`.
  - Note: automation-first suppression of `Errno 107` symptom alerts assumes this watchdog is enabled and its textfile metrics are fresh.
- `healtharchive-storage-watchdog-burnin-snapshot.service` + `.timer`
  - Optional read-only daily snapshot of the storage hot-path watchdog burn-in summary.
  - Gated by `ConditionPathExists=/etc/healtharchive/storage-watchdog-burnin-enabled`.
  - Writes dated JSON artifacts under `/srv/healtharchive/ops/burnin/storage-watchdog/` (and `latest.json`).
- `healtharchive-storagebox-sshfs.service`
  - Mounts a Hetzner Storage Box at `/srv/healtharchive/storagebox` via `sshfs`.
  - Reads configuration from `/etc/healtharchive/storagebox.env`.
  - Intended for tiered WARC storage on small SSD hosts.
- `healtharchive-warc-tiering.service`
  - Applies bind mounts from `/etc/healtharchive/warc-tiering.binds` so canonical archive paths under `/srv/healtharchive/jobs/**` resolve to Storage Box data.
  - Runs before the API/worker/replay services start.
- `healtharchive-annual-output-tiering.service`
  - After annual jobs are enqueued, bind-mounts each annual job output_dir onto the Storage Box tier.
  - Triggered via `OnSuccess=` in `healtharchive-schedule-annual.service` (template).
- `healtharchive-annual-campaign-sentinel.service` + `.timer`
  - Runs a “day-of” annual readiness gate automatically: preflight + annual-status + tiering checks.
  - Writes a small Prometheus textfile metric so Alertmanager can notify on failures.
Recommended enablement guidance
These timers are safe-by-default and gated by sentinel files. Enable only what matches your operational readiness.
- Change tracking (`healtharchive-change-tracking.timer`)
  - Recommended to enable once the `snapshot_changes` table exists and a dry run succeeds without errors.
- Annual scheduling (`healtharchive-schedule-annual.timer`)
  - Enable only after an annual dry-run succeeds and storage headroom is confirmed.
- Replay reconcile (`healtharchive-replay-reconcile.timer`)
  - Enable only if replay is enabled and stable.
- Annual search verification (`healtharchive-annual-search-verify.timer`)
  - Optional; safe to enable if you want a yearly search QA artifact.
- Coverage guardrails (`healtharchive-coverage-guardrails.timer`)
  - Recommended once you have at least two annual editions indexed.
- Replay smoke tests (`healtharchive-replay-smoke.timer`)
  - Enable only if replay is enabled and stable.
- Cleanup automation (`healtharchive-cleanup-automation.timer`)
  - Optional; keep caps conservative and review the first dry-run.
- Disk threshold cleanup (`healtharchive-disk-threshold-cleanup.timer`)
  - Optional; runs every 30 minutes but only applies cleanup when disk usage exceeds the configured threshold.
- Baseline drift check (`healtharchive-baseline-drift-check.timer`)
  - Recommended; low-risk and catches “silent” ops drift.
- Storage hot-path auto-recover (`healtharchive-storage-hotpath-auto-recover.timer`)
  - Dangerous if misconfigured; only enable after you’ve validated Phase 1 alerts and run the watchdog in dry-run mode.
  - The unit is gated by a venv presence check, and the watchdog skips runs while the deploy lock is held (to avoid flapping during active deploys).
- Storage watchdog burn-in snapshots (`healtharchive-storage-watchdog-burnin-snapshot.timer`)
  - Read-only; safe to enable during rollout/burn-in weeks so evidence is captured automatically.
- Worker auto-start watchdog (`healtharchive-worker-auto-start.timer`)
  - Recommended once you’re confident in the single-VPS production automation stack.
  - The unit is sentinel-gated, refuses starts when the Storage Box is unreadable, and performs conservative stale `status=running` drift reconciliation before deciding whether to start the worker.
- Drift auto-reconcile watchdog (`healtharchive-drift-auto-reconcile.timer`)
  - Recommended for self-healing missing pip dependencies that cause API 502s.
  - Runs every 5 minutes and invokes `vps-deploy.sh` if the baseline report catches virtual environment drift.
If a timer is enabled, also ensure its sentinel file exists under /etc/healtharchive/ (see the enablement sections below).
Install / update on the VPS
Preferred (one command; installs the managed API/replay/worker templates, timer templates, and the worker priority drop-in):
Run a crawl job detached (optional)
If you need to run a specific DB-backed crawl job manually (for debugging or recovery), prefer launching it as a transient systemd unit so your SSH session doesn’t need to stay open:
Follow the printed journalctl -u <unit>.service -f command to tail logs.
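As an illustration only, a transient-unit launch might look like the following sketch. The `run-job --job-id 123` subcommand and unit name are placeholders, not a confirmed CLI surface; substitute the actual command your tooling prints.
# Hypothetical sketch: launch a one-off crawl job as a transient systemd unit.
# "run-job --job-id 123" is a placeholder; use the real healtharchive CLI invocation.
sudo systemd-run \
  --unit=healtharchive-manual-job-123 \
  --collect \
  --uid=haadmin \
  --property=EnvironmentFile=/etc/healtharchive/backend.env \
  /opt/healtharchive/.venv/bin/healtharchive run-job --job-id 123
# Tail logs for the transient unit:
journalctl -u healtharchive-manual-job-123.service -f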
If you are using WARC tiering with a Storage Box, also create these files on the VPS:
- `/etc/healtharchive/storagebox.env` - Configuration consumed by `healtharchive-storagebox-sshfs.service`.
- `/etc/healtharchive/warc-tiering.binds` - Bind mount manifest consumed by `healtharchive-warc-tiering.service`.
See: docs/operations/playbooks/storage/warc-storage-tiering.md.
Before enabling timers that write artifacts under /srv/healtharchive/ops/, ensure the ops directories exist with the expected permissions:
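If they do not exist yet, the bootstrap script referenced earlier creates them idempotently:
cd /opt/healtharchive && sudo ./scripts/vps-bootstrap-ops-dirs.sh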
Manual install (equivalent):
Copy unit files:
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-schedule-annual.service \
/etc/systemd/system/healtharchive-schedule-annual.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-schedule-annual.timer \
/etc/systemd/system/healtharchive-schedule-annual.timer
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-schedule-annual-dry-run.service \
/etc/systemd/system/healtharchive-schedule-annual-dry-run.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-replay-reconcile.service \
/etc/systemd/system/healtharchive-replay-reconcile.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-replay-reconcile.timer \
/etc/systemd/system/healtharchive-replay-reconcile.timer
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-replay-reconcile-dry-run.service \
/etc/systemd/system/healtharchive-replay-reconcile-dry-run.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-change-tracking.service \
/etc/systemd/system/healtharchive-change-tracking.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-change-tracking.timer \
/etc/systemd/system/healtharchive-change-tracking.timer
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-change-tracking-dry-run.service \
/etc/systemd/system/healtharchive-change-tracking-dry-run.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-annual-search-verify.service \
/etc/systemd/system/healtharchive-annual-search-verify.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-annual-search-verify.timer \
/etc/systemd/system/healtharchive-annual-search-verify.timer
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-baseline-drift-check.service \
/etc/systemd/system/healtharchive-baseline-drift-check.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-baseline-drift-check.timer \
/etc/systemd/system/healtharchive-baseline-drift-check.timer
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-public-surface-verify.service \
/etc/systemd/system/healtharchive-public-surface-verify.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-public-surface-verify.timer \
/etc/systemd/system/healtharchive-public-surface-verify.timer
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-drift-auto-reconcile.service \
/etc/systemd/system/healtharchive-drift-auto-reconcile.service
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-drift-auto-reconcile.timer \
/etc/systemd/system/healtharchive-drift-auto-reconcile.timer
Install the worker priority drop-in:
sudo install -d -m 0755 -o root -g root /etc/systemd/system/healtharchive-worker.service.d
sudo install -m 0644 -o root -g root \
/opt/healtharchive/docs/deployment/systemd/healtharchive-worker.service.override.conf \
/etc/systemd/system/healtharchive-worker.service.d/override.conf
Reload systemd:
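For example:
sudo systemctl daemon-reload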
Restart worker to pick up priority changes:
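For example (only during a safe window; see the job-lock cutover notes above about running crawls):
sudo systemctl restart healtharchive-worker.service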
Verify the priority values:
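One way to check; the exact properties depend on what the drop-in sets (`Nice`, `CPUWeight`, and `IOWeight` here are illustrative):
systemctl show healtharchive-worker.service -p Nice -p CPUWeight -p IOWeight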
Optional: "timer ran" pings (Healthchecks-style)
This repo does not commit ping URLs. If you want "did it run?" checks, create a root-owned env file on the VPS:
sudo install -d -m 0755 -o root -g root /etc/healtharchive
# Only create the file if missing (don't clobber existing values like HC_DB_BACKUP_URL).
sudo test -f /etc/healtharchive/healthchecks.env || sudo install -m 0600 -o root -g root /dev/null /etc/healtharchive/healthchecks.env
sudo chown root:root /etc/healtharchive/healthchecks.env
sudo chmod 0600 /etc/healtharchive/healthchecks.env
Edit /etc/healtharchive/healthchecks.env and set (examples):
# You do NOT need to set every variable listed here.
# Only set the variables for Healthchecks checks you have actually created.
# If a variable is missing/empty, the service still runs; it just won't ping.
#
# Note: the Healthchecks "Name" can be anything; it does not need to match the
# env var name. The env var is just how systemd finds the ping URL.
HEALTHARCHIVE_HC_PING_REPLAY_RECONCILE=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_SCHEDULE_ANNUAL=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_ANNUAL_SENTINEL=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_CHANGE_TRACKING=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_BASELINE_DRIFT=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_PUBLIC_VERIFY=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_ANNUAL_SEARCH_VERIFY=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_COVERAGE_GUARDRAILS=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_REPLAY_SMOKE=https://hc-ping.com/UUID_HERE
HEALTHARCHIVE_HC_PING_CLEANUP_AUTOMATION=https://hc-ping.com/UUID_HERE
Notes:
- The unit templates use `EnvironmentFile=-/etc/healtharchive/healthchecks.env` so the file is optional.
- If you also use the legacy Healthchecks-based scripts described in `../production-single-vps.md` (DB backup + disk check), keep their variables in the same file too (`HC_DB_BACKUP_URL`, `HC_DISK_URL`, `HC_DISK_THRESHOLD`).
- If set, services will best-effort ping:
  - `<url>/start` at the beginning
  - `<url>` on success
  - `<url>/fail` on failure
- Ping failures do not fail the service.
Audit Healthchecks alignment (safe)
This script compares:
- What ping env vars are set in `/etc/healtharchive/healthchecks.env`
- What ping vars are referenced by installed systemd unit files (via `--ping-var ...`)
- Which timers exist (for manual cross-check with Healthchecks “last ping” timestamps)
Run on the VPS:
If it reports “referenced but unset”, you either:
- Intentionally have pings disabled for those timers (OK), or
- Should create the missing checks in Healthchecks and add the missing env vars.
If it reports “set but unused”, you likely have a stale env var (remove it) or the unit that used to reference it was removed/renamed.
Validate the annual scheduler (safe)
This dry-run service exercises DB connectivity + scheduler output without creating jobs:
sudo systemctl start healtharchive-schedule-annual-dry-run.service
sudo journalctl -u healtharchive-schedule-annual-dry-run.service -n 200 --no-pager
Do not run healtharchive-schedule-annual.service manually in production; it enqueues jobs and the worker may start crawling immediately.
Validate replay reconciliation (safe)
This dry-run service exercises DB connectivity + filesystem drift detection without running any docker exec commands:
sudo systemctl start healtharchive-replay-reconcile-dry-run.service
sudo journalctl -u healtharchive-replay-reconcile-dry-run.service -n 200 --no-pager
Validate change tracking (safe)
This dry-run service exercises DB connectivity and reports how many diffs would be computed:
sudo systemctl start healtharchive-change-tracking-dry-run.service
sudo journalctl -u healtharchive-change-tracking-dry-run.service -n 200 --no-pager
If you see an error like relation "snapshot_changes" does not exist, apply migrations first (idempotent):
Enable automation (Jan 01)
Create the automation sentinel file:
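For example, matching the `ConditionPathExists` gate on the service:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/automation-enabled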
Enable the timer:
sudo systemctl enable --now healtharchive-schedule-annual.timer
systemctl list-timers | rg healtharchive-schedule-annual || systemctl list-timers | grep healtharchive-schedule-annual
Note: do not enable the .service units directly; only the .timer should be enabled.
Enable replay reconciliation automation (optional)
Create the replay automation sentinel file:
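For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/replay-automation-enabled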
Enable the timer:
sudo systemctl enable --now healtharchive-replay-reconcile.timer
systemctl list-timers | rg healtharchive-replay-reconcile || systemctl list-timers | grep healtharchive-replay-reconcile
Note: by default, the timer only reconciles replay indexing. Preview image generation is intentionally left manual/capped until you decide it’s stable enough to automate.
Enable change tracking automation (optional)
Create the change tracking sentinel file:
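For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/change-tracking-enabled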
Enable the timer:
sudo systemctl enable --now healtharchive-change-tracking.timer
systemctl list-timers | rg healtharchive-change-tracking || systemctl list-timers | grep healtharchive-change-tracking
Enable annual search verification capture (optional)
This captures golden-query /api/search JSON once per year after the annual campaign becomes search-ready.
The service is idempotent:
- If the campaign isn't ready, it exits 0 (no failure spam).
- If artifacts already exist for the current year/run-id, it exits 0.
Enable the timer:
sudo systemctl enable --now healtharchive-annual-search-verify.timer
systemctl list-timers | rg healtharchive-annual-search-verify || systemctl list-timers | grep healtharchive-annual-search-verify
Artifacts default to:
/srv/healtharchive/ops/search-eval/<year>/final/
To force a re-run for the current year, delete that directory and run the service once.
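For example, for the current UTC year (this assumes the default artifact location above):
sudo rm -rf "/srv/healtharchive/ops/search-eval/$(date -u +%Y)/final/"
sudo systemctl start healtharchive-annual-search-verify.service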
Enable coverage guardrails (optional)
This emits daily metrics comparing the latest indexed annual job to the prior year per source.
Create the sentinel file:
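For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/coverage-guardrails-enabled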
Enable the timer:
sudo systemctl enable --now healtharchive-coverage-guardrails.timer
systemctl list-timers | rg healtharchive-coverage-guardrails || systemctl list-timers | grep healtharchive-coverage-guardrails
Enable replay smoke tests (optional)
This runs lightweight replay checks against the latest indexed job per source.
Create the sentinel file:
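For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/replay-smoke-enabled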
Enable the timer:
sudo systemctl enable --now healtharchive-replay-smoke.timer
systemctl list-timers | rg healtharchive-replay-smoke || systemctl list-timers | grep healtharchive-replay-smoke
Enable cleanup automation (optional)
This runs safe temp-nonwarc cleanup for older indexed jobs (keeps WARCs).
Create the sentinel file:
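For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/cleanup-automation-enabled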
Enable the timer:
sudo systemctl enable --now healtharchive-cleanup-automation.timer
systemctl list-timers | rg healtharchive-cleanup-automation || systemctl list-timers | grep healtharchive-cleanup-automation
Enable disk threshold cleanup (optional)
This is an event-driven safety net for disk pressure:
- it runs on a frequent timer (every 30 minutes),
- but only applies cleanup if disk usage exceeds `threshold_trigger_percent` from `ops/automation/cleanup-automation.toml`,
- and uses `threshold_max_jobs_per_run` as the cap when triggered.
It is gated by the same sentinel file as weekly cleanup (`/etc/healtharchive/cleanup-automation-enabled`).
Enable the timer:
sudo systemctl enable --now healtharchive-disk-threshold-cleanup.timer
systemctl list-timers | rg healtharchive-disk-threshold-cleanup || systemctl list-timers | grep healtharchive-disk-threshold-cleanup
Rollback:
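A reasonable rollback, following the pattern used for the other timers (the sentinel is shared with weekly cleanup, so only remove it if you intend to disable both):
sudo systemctl disable --now healtharchive-disk-threshold-cleanup.timer
# Only if you also want to disable weekly cleanup automation:
# sudo rm -f /etc/healtharchive/cleanup-automation-enabled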
Enable storage hot-path auto-recovery (optional; high impact)
This automation attempts conservative self-healing for the specific failure class:
OSError: [Errno 107] Transport endpoint is not connected
It can unmount stale hot paths and re-apply tiering. It will only stop the worker when either:
- a running job output directory is detected as stale (Errno 107), or
- there are no running jobs (to prevent races while repairing mountpoints for the next jobs).
It also probes the output dirs of the next queued/retryable jobs to prevent infra-error retry storms (a stale mountpoint for a retryable job should be repaired before the worker selects it).
If no stale targets are currently eligible but healtharchive-warc-tiering.service is stuck in failed, the watchdog will attempt a conservative reconcile (reset-failed + start) when the base Storage Box mount is readable. This helps clear persistent HealthArchiveWarcTieringFailed alerts caused by stale historical unit state.
After successful mount recovery it restarts replay (best-effort) so replay sees a clean view of /srv/healtharchive/jobs.
Keep it disabled by default and enable only after:
- Phase 1 alerting/metrics are working (you have visibility), and
- you have validated the watchdog in dry-run mode.

If you plan to rely on automation-first alert suppression for stale-mount symptoms, also verify that the watchdog textfile metrics stay fresh in Prometheus after enablement.
Create the sentinel file:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/storage-hotpath-auto-recover-enabled
Enable the timer:
sudo systemctl enable --now healtharchive-storage-hotpath-auto-recover.timer
systemctl list-timers | rg healtharchive-storage-hotpath-auto-recover || systemctl list-timers | grep healtharchive-storage-hotpath-auto-recover
Rollback:
sudo systemctl disable --now healtharchive-storage-hotpath-auto-recover.timer
sudo rm -f /etc/healtharchive/storage-hotpath-auto-recover-enabled
The watchdog writes state under:
/srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json
and emits node_exporter textfile metrics via:
healtharchive_storage_hotpath_auto_recover.prom
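To spot-check that the watchdog metrics are present and fresh, assuming the default textfile collector directory mentioned earlier:
ls -l /var/lib/node_exporter/textfile_collector/healtharchive_storage_hotpath_auto_recover.prom
cat /var/lib/node_exporter/textfile_collector/healtharchive_storage_hotpath_auto_recover.prom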
Enable storage watchdog burn-in snapshots (optional; low impact)
This automation captures a daily read-only snapshot of the storage watchdog burn-in report so you have evidence artifacts even if nobody remembers to run the command manually.
Precondition: ops directories exist (idempotent):
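For example (the same idempotent bootstrap script used earlier in this document):
cd /opt/healtharchive && sudo ./scripts/vps-bootstrap-ops-dirs.sh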
Create the sentinel file:
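For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/storage-watchdog-burnin-enabled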
Enable the timer:
sudo systemctl enable --now healtharchive-storage-watchdog-burnin-snapshot.timer
systemctl list-timers | rg healtharchive-storage-watchdog-burnin-snapshot || systemctl list-timers | grep healtharchive-storage-watchdog-burnin-snapshot
Artifacts:
- `/srv/healtharchive/ops/burnin/storage-watchdog/latest.json`
- `/srv/healtharchive/ops/burnin/storage-watchdog/storage-watchdog-burnin-YYYYMMDD.json`
Rollback:
sudo systemctl disable --now healtharchive-storage-watchdog-burnin-snapshot.timer
sudo rm -f /etc/healtharchive/storage-watchdog-burnin-enabled
Enable worker auto-start watchdog (optional; conservative)
This automation exists to prevent “everything stopped” failures where the system is healthy enough to run, but healtharchive-worker.service is down and jobs are pending.
It will only start the worker when all of these are true:
- the worker unit is inactive,
- there are pending crawl jobs (`status in (queued, retryable)`),
- the Storage Box mount is readable,
- the deploy lock is not present (or is stale),
- and there are no DB jobs in `status=running` (conservative safety gate).
If you plan to rely on automation-first suppression for worker-down notifications, also verify the watchdog textfile metrics are present/fresh in Prometheus after enablement.
Create the sentinel file:
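For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/worker-auto-start-enabled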
Enable the timer:
sudo systemctl enable --now healtharchive-worker-auto-start.timer
systemctl list-timers | rg healtharchive-worker-auto-start || systemctl list-timers | grep healtharchive-worker-auto-start
Rollback:
sudo systemctl disable --now healtharchive-worker-auto-start.timer
sudo rm -f /etc/healtharchive/worker-auto-start-enabled
The watchdog writes state under:
/srv/healtharchive/ops/watchdog/worker-auto-start.json
and emits node_exporter textfile metrics via:
healtharchive_worker_auto_start.prom
Enable drift auto-reconcile watchdog (recommended)
This automation prevents the “502 Bad Gateway” API outages that occur when dependencies (such as Python packages in `.venv`) drift from the deployed codebase after an incomplete or manual code update.
It reads the results of `healtharchive-baseline-drift-check.timer`'s periodic checks to decide whether an environment rebuild (via `scripts/vps-deploy.sh`) is needed, applying cooldowns to avoid flapping.
Create the sentinel file:
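For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/drift-auto-reconcile-enabled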
Enable the timer:
sudo systemctl enable --now healtharchive-drift-auto-reconcile.timer
systemctl list-timers | rg healtharchive-drift-auto-reconcile || systemctl list-timers | grep healtharchive-drift-auto-reconcile
Rollback:
sudo systemctl disable --now healtharchive-drift-auto-reconcile.timer
sudo rm -f /etc/healtharchive/drift-auto-reconcile-enabled
The watchdog writes state under:
/srv/healtharchive/ops/watchdog/drift-auto-reconcile.json
and emits node_exporter textfile metrics via:
healtharchive_drift_auto_reconcile.prom
Enable baseline drift checks (recommended)
Baseline drift checks validate that production still matches the project’s expected invariants (security posture, perms, unit enablement).
Create the sentinel file:
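For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/baseline-drift-enabled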
Enable the timer:
sudo systemctl enable --now healtharchive-baseline-drift-check.timer
systemctl list-timers | rg healtharchive-baseline-drift-check || systemctl list-timers | grep healtharchive-baseline-drift-check
Artifacts are written under:
/srv/healtharchive/ops/baseline/
If the drift check fails, inspect:
- `/srv/healtharchive/ops/baseline/drift-report-latest.txt`
- `journalctl -u healtharchive-baseline-drift-check.service --no-pager -l`
If required drift is fixed and you want to confirm recovery immediately:
sudo systemctl start healtharchive-baseline-drift-check.service
sudo journalctl -u healtharchive-baseline-drift-check.service -n 200 --no-pager -l
Healthchecks interpretation note:
- A success ping after `healtharchive-baseline-drift-check.service` is started manually means the service passed on that rerun.
- It does not prove the weekly timer itself fired on schedule.
- To confirm the actual timer window, compare:
  - `sudo systemctl status healtharchive-baseline-drift-check.timer --no-pager -l`
  - `sudo systemctl list-timers --all | grep healtharchive-baseline-drift-check`
  - `sudo journalctl -u healtharchive-baseline-drift-check.timer --no-pager -l`
- Note: `journalctl -u ...timer --since ...` can still show `-- No entries --` if the visible timer startup log predates your `--since` cutoff.
Enable public surface verification (optional, recommended)
This is a deeper synthetic check than external uptime monitors. It validates:
- public API health, sources, search, snapshot detail and raw HTML
- replay browse URL (unless skipped)
- exports manifest and export endpoint HEADs
- changes feed + RSS
- key frontend pages, including `/brief`, `/cite`, `/methods`, and `/governance`
Create the sentinel file:
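For example:
sudo install -m 0644 -o root -g root /dev/null /etc/healtharchive/public-verify-enabled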
Enable the timer:
sudo systemctl enable --now healtharchive-public-surface-verify.timer
systemctl list-timers | rg healtharchive-public-surface-verify || systemctl list-timers | grep healtharchive-public-surface-verify
Rollback / disable quickly
- Disable timer immediately:
- Disable all scheduling automation immediately:
- Disable replay reconciliation automation immediately:
sudo systemctl disable --now healtharchive-replay-reconcile.timer
sudo rm -f /etc/healtharchive/replay-automation-enabled
- Disable annual search verification automation immediately:
- Remove the worker priority override:
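For the bullets above without inline commands, a plausible sequence based on the enable/install patterns earlier in this document (verify unit and sentinel names before running):
# Disable a specific timer immediately:
sudo systemctl disable --now healtharchive-<name>.timer

# Disable all scheduling automation immediately:
sudo systemctl disable --now healtharchive-schedule-annual.timer
sudo rm -f /etc/healtharchive/automation-enabled

# Disable annual search verification automation immediately:
sudo systemctl disable --now healtharchive-annual-search-verify.timer

# Remove the worker priority override:
sudo rm -f /etc/systemd/system/healtharchive-worker.service.d/override.conf
sudo systemctl daemon-reload
# Restart the worker during a safe window so default priorities apply again:
sudo systemctl restart healtharchive-worker.service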