Post-Reboot Annual Job Tiering Verification
Type: Validation Runbook
Category: Operations / Storage Tiering
Last updated: 2026-04-17
Purpose
After a VPS reboot or rescue/maintenance window, verify that annual crawl jobs are safe to resume:
- the Storage Box base mount is healthy
- annual job output dirs are tiered and readable
- the worker user can write to queued/retryable annual output dirs
- annual metadata/config drift has not broken tiering/automation assumptions
- only then should retries or worker restarts happen
When to use this: After any reboot, rescue boot, or storage maintenance during annual campaign season.
Preconditions
- You are on the VPS.
- Backend checkout is at /opt/healtharchive.
- Backend env file is /etc/healtharchive/backend.env.
- Prefer a maintenance window with healtharchive-worker.service stopped while mounts are being repaired.
1) Load Env And Capture Read-Only State
cd /opt/healtharchive
YEAR=2026
HA=/opt/healtharchive/.venv/bin/healtharchive
set -a; source /etc/healtharchive/backend.env; set +a
./scripts/vps-crawl-status.sh --year "$YEAR"
"$HA" annual-status --year "$YEAR"
"$HA" check-db
systemctl status postgresql.service --no-pager -l
Expected:
- check-db succeeds
- annual-status returns normally
- you have a fresh snapshot of job ids, statuses, and output dirs before making changes
If check-db fails, stop here and fix DB/env first.
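Optionally, capture the pre-change snapshot to disk so you can diff it after repairs (the /tmp paths below are just examples, not a project convention):
"$HA" annual-status --year "$YEAR" | tee "/tmp/annual-status-${YEAR}-pre-repair.txt"
./scripts/vps-crawl-status.sh --year "$YEAR" | tee "/tmp/crawl-status-${YEAR}-pre-repair.txt"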
2) Verify Storage Box Base Mount
findmnt /srv/healtharchive/storagebox
ls -ld /srv/healtharchive/storagebox
ls /srv/healtharchive/storagebox/jobs >/dev/null
Expected:
- findmnt shows the Storage Box mount
- directory listing works without a "Transport endpoint is not connected" error
If this fails, repair the base Storage Box mount before touching any annual job.
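The repair itself is not prescribed by this runbook; as a sketch, assuming the Storage Box is a FUSE/sshfs mount with an /etc/fstab entry (an assumption, not confirmed here), a stale base mount is typically recycled like this:
# lazy-unmount the stale FUSE endpoint, then remount from fstab (assumes an fstab entry exists)
sudo fusermount -uz /srv/healtharchive/storagebox
sudo mount /srv/healtharchive/storagebox
findmnt /srv/healtharchive/storagebox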
3) Verify Per-Job Output Dirs
For each annual job you care about:
JOB_ID=7
"$HA" show-job --id "$JOB_ID"
OUT_DIR="$("$HA" show-job --id "$JOB_ID" | awk -F': +' '/^Output dir:/ {print $2}')"
findmnt -T "$OUT_DIR" -o TARGET,SOURCE,FSTYPE,OPTIONS
ls -ld "$OUT_DIR"
Optional worker-user writability probe for queued/retryable annual jobs:
WORKER_USER="$(systemctl show -p User --value healtharchive-worker.service)"
sudo -u "$WORKER_USER" test -w "$OUT_DIR" && echo "OK: writable" || echo "BAD: not writable"
Expected:
- OUT_DIR exists and is readable
- findmnt -T "$OUT_DIR" shows the path is mounted from the Storage Box tier, not left on /dev/sda1
- the worker user can write the output dir for queued/retryable jobs
If ls or findmnt hits Errno 107, treat it as stale-mount recovery, not a retry-budget problem.
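To sweep several annual jobs in one pass, the same probes can run in a small loop (the job ids below are placeholders; substitute your own list):
# iterate the per-job mount and writability probes over a list of job ids
WORKER_USER="$(systemctl show -p User --value healtharchive-worker.service)"
for JOB_ID in 7 8 9; do
  OUT_DIR="$("$HA" show-job --id "$JOB_ID" | awk -F': +' '/^Output dir:/ {print $2}')"
  echo "== job $JOB_ID -> $OUT_DIR"
  findmnt -T "$OUT_DIR" -o TARGET,SOURCE,FSTYPE,OPTIONS || echo "BAD: not mounted"
  sudo -u "$WORKER_USER" test -w "$OUT_DIR" && echo "OK: writable" || echo "BAD: not writable"
done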
4) Verify Annual Metadata / Config Drift
Dry-run the annual reconciliation command before restarting the worker:
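The exact reconciliation subcommand is not recorded in this runbook; assuming it follows the CLI's dry-run-by-default convention (the subcommand name below is hypothetical), the dry-run would look like:
"$HA" reconcile-annual --year "$YEAR"   # hypothetical subcommand name: dry-run, prints UNCHANGED / WOULD UPDATE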
Expected:
- UNCHANGED for jobs already carrying canonical annual metadata and source profiles
- WOULD UPDATE if a job is missing annual metadata (campaign_kind/year/date/date_utc/scheduler_version) or has source-profile drift
If reconciliation reports drift, apply it before retrying the job:
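Under the same assumption as above (hypothetical subcommand name), the apply form would mirror the CLI's --apply convention:
"$HA" reconcile-annual --year "$YEAR" --apply   # hypothetical: applies the reported metadata/profile drift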
This is the preferred app-local fix for annual metadata/config drift.
5) Run Annual Output Tiering Dry-Run
sudo --preserve-env=HEALTHARCHIVE_DATABASE_URL,HEALTHARCHIVE_ARCHIVE_ROOT \
/opt/healtharchive/.venv/bin/python3 \
/opt/healtharchive/scripts/vps-annual-output-tiering.py \
--year "$YEAR"
Expected:
- annual jobs show OK
- or the script prints a bounded reason such as STALE, WARN ... unexpected_mount_type, or UNHEALTHY
Note: run the script with preserved HEALTHARCHIVE_DATABASE_URL when using sudo. Otherwise the process can fall back to local SQLite and report misleading database errors such as no such table: sources.
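A quick, generic way to confirm the variable actually survives sudo (nothing project-specific in this check):
sudo --preserve-env=HEALTHARCHIVE_DATABASE_URL env | grep '^HEALTHARCHIVE_DATABASE_URL=' \
  || echo "BAD: HEALTHARCHIVE_DATABASE_URL not visible under sudo"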
If the script reports stale or unexpected mounts, repair them before retrying.
6) Repair Tiering / Mounts If Needed
Stop the worker first:
sudo systemctl stop healtharchive-worker.service
Repair stale annual output-dir mounts:
sudo --preserve-env=HEALTHARCHIVE_DATABASE_URL,HEALTHARCHIVE_ARCHIVE_ROOT \
/opt/healtharchive/.venv/bin/python3 \
/opt/healtharchive/scripts/vps-annual-output-tiering.py \
--year "$YEAR" \
--apply \
--repair-stale-mounts \
--allow-repair-running-jobs
If the script reported unexpected_mount_type, use:
sudo --preserve-env=HEALTHARCHIVE_DATABASE_URL,HEALTHARCHIVE_ARCHIVE_ROOT \
/opt/healtharchive/.venv/bin/python3 \
/opt/healtharchive/scripts/vps-annual-output-tiering.py \
--year "$YEAR" \
--apply \
--repair-unexpected-mounts \
--allow-repair-running-jobs
Then re-run steps 3 and 5.
7) Only Then Touch Retry State
If the job is failed or has exhausted its retry budget once storage and config are healthy:
"$HA" reset-retry-count --id 7
"$HA" reset-retry-count --id 7 --apply --reason "post-reboot annual recovery"
"$HA" retry-job --id 7
Expected:
- dry-run shows the intended retry-count change
- apply sets retry_count back to 0
- retry-job moves a failed crawl back to retryable
Do not reset retries before mount/writability/config checks pass.
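To confirm the state change landed, re-check the job with the same show-job command from step 3 (the expectation in the comment is illustrative; field names in the output may differ):
"$HA" show-job --id 7   # expect the retry count reset and the job back in a retryable state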
8) Reset Crawl State Only If Resume State Is The Remaining Problem
Use this only when storage/writability/metadata are already healthy and the job still shows the known poisoned resume/temp pattern:
- repeated resume churn with no useful progress
- known empty/unprocessable-WARC tail
- stale .tmp*, .archive_state.json, or .zimit_resume.yaml files
Dry-run first:
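The reset command itself is not spelled out in this runbook; assuming a reset-crawl-state subcommand that follows the same dry-run-then---apply pattern as reset-retry-count (name and flags hypothetical):
"$HA" reset-crawl-state --id 7   # hypothetical subcommand: dry-run showing what would be cleared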
Apply only for a non-running job:
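Under the same hypothetical subcommand:
"$HA" reset-crawl-state --id 7 --apply --reason "post-reboot poisoned resume state"   # hypothetical apply form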
For current HC/PHAC annual profiles, this should be a fallback tool, not the first recovery step; their canonical execution policy already prefers fresh-only runs with automatic poisoned-state reset.
9) Restart Worker And Verify Pickup
sudo systemctl start healtharchive-worker.service
sudo journalctl -u healtharchive-worker.service -n 200 --no-pager
./scripts/vps-crawl-status.sh --year "$YEAR"
Expected:
- no root-device guardrail error
- no Errno 107 / permission-denied output-dir failure
- the intended job moves to running, or remains cleanly retryable with a bounded next action
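To watch pickup without re-running the status script by hand, a plain watch loop works (assuming watch is installed on the VPS):
watch -n 30 "./scripts/vps-crawl-status.sh --year $YEAR"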
Common Failure Modes
Database/env drift
Symptom:
- healtharchive commands fail with connection errors or fall back to SQLite
Fix:
- re-source the env file: set -a; source /etc/healtharchive/backend.env; set +a
- confirm HEALTHARCHIVE_DATABASE_URL points at Postgres, then re-run "$HA" check-db
- check systemctl status postgresql.service if connections still fail
Output dir still on root disk
Symptom:
- worker refuses to start an annual job because the output dir is still on /dev/sda1
Fix:
- verify Storage Box base mount
- re-run annual tiering apply
- confirm findmnt -T "$OUT_DIR" points at the Storage Box path
Stale annual hot path (Errno 107)
Symptom:
- ls, findmnt, or tiering probes hit Transport endpoint is not connected
Fix:
- stop the worker
- run vps-annual-output-tiering.py --apply --repair-stale-mounts
- re-check the job output dir before retrying