Incident: Annual crawl — job output dirs on root disk caused disk pressure + crawl pauses (2026-02-04)
Status: closed
Metadata
- Date (UTC): 2026-02-04
- Severity (see severity.md): sev1
- Environment: production (single VPS)
- Primary area: storage
- Owner: (unassigned)
- Start (UTC): 2026-02-04T13:23:39Z (first operator snapshot showing disk pressure)
- End (UTC): 2026-02-04T16:47:12Z (operator snapshot showing jobs resumed + root disk healthy)
Summary
The annual 2026 crawl campaign hit sustained root-disk pressure on the VPS (/dev/sda1 reached ~84–86% used), which triggered the worker’s disk safety guardrail (≥85% usage) and prevented new crawl progress. Investigation showed that annual crawl output directories for CIHR (~50GB) and PHAC (~1.2GB) were on the local root filesystem under /srv/healtharchive/jobs/** instead of being tiered to the Storage Box. We paused crawls to avoid a disk-full failure, migrated the output directories to the Storage Box, re-established the expected mounts under /srv/healtharchive/jobs/**, and resumed automation.
Impact
- User-facing impact:
- Public API/site remained healthy.
- Annual 2026 campaign was “Ready for search: NO” and made little/no crawl progress while paused.
- Internal impact (ops burden, automation failures, etc.):
- Operator time to pause long-running crawls and perform storage tiering.
- Risk of a disk-full incident avoided by pausing early.
- Data impact:
- Data loss: unknown (no evidence observed).
- Data integrity risk: low-to-medium (risk was primarily “disk full” / interrupted writes if left running).
- Recovery completeness: complete (output dirs remounted to Storage Box; jobs resumed).
- Duration:
- Disk pressure was present before 2026-02-04T13:23:39Z and resolved by ~16:xxZ; campaign was running again by 16:47Z.
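The observed window between the two operator snapshots listed under Metadata can be worked out directly. This is a lower bound, since disk pressure was already present before the first snapshot. A minimal sketch; `hms_to_seconds` is a hypothetical helper, not an existing script:

```shell
#!/usr/bin/env bash
# Convert HH:MM:SS to seconds since midnight.
# A single leading zero is stripped from each field so "08" is not parsed as octal.
hms_to_seconds() {
  old_ifs=$IFS; IFS=:
  set -- $1
  IFS=$old_ifs
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

start=$(hms_to_seconds 13:23:39)   # first snapshot showing disk pressure
end=$(hms_to_seconds 16:47:12)     # snapshot showing jobs resumed
elapsed=$(( end - start ))

printf '%dh %dm %ds\n' $(( elapsed / 3600 )) $(( elapsed % 3600 / 60 )) $(( elapsed % 60 ))
# prints: 3h 23m 33s
```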
Detection
- Operator ran ./scripts/vps-crawl-status.sh --year 2026 and saw:
- Root disk at ~84–86% used.
- Worker log warnings indicating the disk guardrail was active (“Disk usage at 85% exceeds threshold…”).
- sudo du -xhd3 /srv/healtharchive/jobs | sort -h | tail showed ~47–50GB under CIHR annual output on the local disk.
Most useful signals:
- df -h / (root usage)
- sudo du -xhd3 /srv/healtharchive/jobs | sort -h | tail -40 (who is using local disk)
- findmnt -T /srv/healtharchive/jobs/<source>/<job_dir> -o SOURCE,FSTYPE,OPTIONS (is this actually on the Storage Box?)
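The root-usage signal can be turned into a quick threshold check. A minimal sketch, assuming GNU coreutils `df` (for `--output=pcent`). The 85% default comes from the guardrail described in this report; the function names are hypothetical and not part of `vps-crawl-status.sh`:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Print the used percentage of the filesystem holding $1 as a bare integer.
fs_used_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Succeed (exit 0) when used_pct is at or above the threshold (default 85).
at_or_above_threshold() {
  used_pct="$1"
  threshold="${2:-85}"
  [ "$used_pct" -ge "$threshold" ]
}

if at_or_above_threshold "$(fs_used_pct /)"; then
  echo "root disk at or above threshold: expect the worker guardrail to block crawl starts"
fi
```

Keeping the comparison separate from the `df` call makes the threshold logic easy to exercise with fixed numbers.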
Decision log
- 2026-02-04T14:4xZ — Decision: pause all long-running annual crawls (why: avoid disk-full failure; preserve service and data integrity).
- 2026-02-04T15:3xZ — Decision: migrate annual output dirs with rsync (why: keep WARC/state artifacts; fastest path to reclaim root disk without losing crawl progress).
- 2026-02-04T16:0xZ — Decision: resume automation only after mounts validated (why: prevent immediately writing back to local disk).
Timeline (UTC)
- 2026-02-04T13:23:39Z — vps-crawl-status snapshot: root disk ~86% used; annual campaign running (HC+CIHR), PHAC retryable.
- 2026-02-04T13:52:xxZ — Worker logs show disk guardrail active (≥85%), skipping crawl starts.
- 2026-02-04T14:4xZ — Automation disabled and crawls stopped; jobs recovered to retryable in DB for safe restart later.
- 2026-02-04T14:52:45Z — Fresh Postgres DB backup taken and copied to the Storage Box (with rsync flags adjusted to avoid sshfs chown failures).
- 2026-02-04T15:32:26Z — Storage Box sshfs mount confirmed active.
- 2026-02-04T15:3xZ → 16:0xZ — CIHR (~50GB) and PHAC (~1.2GB) annual output directories copied to the Storage Box via rsync.
- 2026-02-04T16:0xZ — Annual output tiering applied; output dirs mounted back under /srv/healtharchive/jobs/**; local copies deleted; root disk returned to ~19% used.
- 2026-02-04T16:15:33Z — Services restarted; worker resumed.
- 2026-02-04T16:47:12Z — vps-crawl-status snapshot shows 3 running annual jobs and root disk healthy.
Root cause
- Immediate trigger:
- Annual crawl output for at least CIHR and PHAC was written to the VPS root filesystem under /srv/healtharchive/jobs/**, consuming ~50GB locally and pushing root above the worker’s safety threshold.
- Underlying cause(s):
- Annual output tiering/mounts were not in place for those job output dirs at the time the crawls ran (post-reboot / maintenance window).
- Manual ops workflows are easy to run in an “unsafe order” (worker running while mounts not validated).
- The annual output tiering script can be run with missing env exports / DB offline, which causes confusing failures (SQLite “no such table”) that can delay recovery.
Contributing factors
- Long-running crawls reduce opportunities for a clean maintenance window.
- Root disk is fixed size and close to the worker’s disk threshold when any large output dir lands locally.
- rsync to sshfs mountpoints can fail on ownership/permissions by default (requires explicit flags).
- CIHR job (job 8) config drift: missing annual campaign metadata made annual tooling less reliable until patched.
Resolution / Recovery
0) Pause/stop automation and crawls (make a maintenance window)
Disable the automations so they don’t immediately restart jobs while mounts are in flux:
# Disable crawl auto-recover and worker auto-start
sudo mv /etc/healtharchive/crawl-auto-recover-enabled{,.disabled} 2>/dev/null || true
sudo mv /etc/healtharchive/worker-auto-start-enabled{,.disabled} 2>/dev/null || true
sudo systemctl stop healtharchive-crawl-auto-recover.timer || true
sudo systemctl stop healtharchive-worker-auto-start.timer || true
# Stop worker (and any transient crawl units)
sudo systemctl stop healtharchive-worker.service || true
systemctl list-units --all 'healtharchive-job*' --no-pager
sudo systemctl stop <healtharchive-jobX-...>.service
Mark stopped jobs restartable:
set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs --older-than-minutes 1 --apply --source <source>
1) Ensure backups exist
sudo systemctl start healtharchive-db-backup.service
ls -lt /srv/healtharchive/backups/healtharchive_*.dump | head -n 3
2) Restore/verify Storage Box mount
sudo systemctl start healtharchive-storagebox-sshfs.service
df -h /srv/healtharchive/storagebox
findmnt -T /srv/healtharchive/storagebox -o SOURCE,FSTYPE,OPTIONS
3) Migrate large local output dirs to Storage Box
Use rsync flags that don’t try to preserve ownership/perms on sshfs:
sudo rsync -rtv --info=progress2 --partial --inplace \
--no-owner --no-group --no-perms \
/srv/healtharchive/jobs/cihr/20260101T000502Z__cihr-20260101/ \
/srv/healtharchive/storagebox/jobs/cihr/20260101T000502Z__cihr-20260101/
Optional “sanity dry-run” to see drift (but do not delete without thinking):
sudo rsync -rtvn --delete \
--no-owner --no-group --no-perms \
/srv/healtharchive/jobs/<source>/<job_dir>/ \
/srv/healtharchive/storagebox/jobs/<source>/<job_dir>/
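For a stronger check than the rsync dry-run, the two trees can be compared by content. A minimal sketch, assuming GNU coreutils `sha256sum`; `tree_manifest` and `trees_match` are hypothetical helpers, not existing scripts. Hashing re-reads every file on both sides, so expect it to be slow for a ~50GB tree on sshfs:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Print a "checksum  relative-path" manifest, sorted by path, for every regular file under $1.
tree_manifest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Succeed only when both arguments are directories holding the same files
# with identical contents.
trees_match() {
  [ -d "$1" ] && [ -d "$2" ] &&
    [ "$(tree_manifest "$1")" = "$(tree_manifest "$2")" ]
}
```

Usage would look like `trees_match <local job dir> <storagebox job dir> && echo "copy verified"`, run before any local copy is removed.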
4) Re-establish the expected mounts under /srv/healtharchive/jobs/**
Key gotcha: the tiering script must target Postgres. Make sure env vars are exported and Postgres is running, otherwise you may see SQLite errors like no such table: sources.
sudo systemctl start postgresql.service
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; \
/opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --apply --year 2026'
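The two preconditions in the gotcha above (env exported, Postgres up) can be checked before invoking the tiering script. A minimal sketch, assuming `pg_isready` from the PostgreSQL client tools is installed. The helper names are hypothetical, and the variable to check is passed in as an argument because this report does not name the setting that `backend.env` provides:

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; not part of vps-annual-output-tiering.py.

# Succeed only if the named variable is exported (visible to child processes).
require_exported() {
  printenv "$1" >/dev/null || {
    echo "missing export: $1 (try: set -a; source /etc/healtharchive/backend.env; set +a)" >&2
    return 1
  }
}

# Succeed only if the default local Postgres instance accepts connections.
require_postgres() {
  pg_isready -q || {
    echo "Postgres not accepting connections (try: sudo systemctl start postgresql.service)" >&2
    return 1
  }
}

# Run both checks; $1 is the name of the database setting to verify.
preflight() {
  require_exported "$1" && require_postgres
}
```

Calling `preflight <VAR_NAME>` first turns the confusing SQLite “no such table” failure into a one-line hint.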
Validate mountpoints:
findmnt -T /srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101 -o SOURCE,FSTYPE
findmnt -T /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101 -o SOURCE,FSTYPE
findmnt -T /srv/healtharchive/jobs/cihr/20260101T000502Z__cihr-20260101 -o SOURCE,FSTYPE
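The per-directory checks above can be folded into one pass that fails if any output dir is not backed by the Storage Box. A minimal sketch, assuming the tiered mounts report the filesystem type `fuse.sshfs` (worth confirming against real `findmnt` output on the host); the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Print the filesystem type backing $1 (-T resolves the mount containing the path).
fstype_of() {
  findmnt -n -o FSTYPE -T "$1"
}

# Succeed when the given filesystem type is an sshfs mount.
is_sshfs_type() {
  case "$1" in
    fuse.sshfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Report each directory and return non-zero if any is not on sshfs.
check_output_dirs() {
  status=0
  for dir in "$@"; do
    fstype=$(fstype_of "$dir")
    if is_sshfs_type "$fstype"; then
      echo "ok    $dir ($fstype)"
    else
      echo "LOCAL $dir (${fstype:-unknown})"
      status=1
    fi
  done
  return "$status"
}
```

Passing the three annual job directories to `check_output_dirs` gives a single exit status that can gate the resume step.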
5) Delete local copies and verify disk health
sudo rm -rf /srv/healtharchive/jobs/*/*__*.local-*
df -h /
sudo du -xhd3 /srv/healtharchive/jobs | sort -h | tail -40
6) Resume services and automation
# Re-enable sentinels
sudo mv /etc/healtharchive/crawl-auto-recover-enabled.disabled /etc/healtharchive/crawl-auto-recover-enabled 2>/dev/null || sudo touch /etc/healtharchive/crawl-auto-recover-enabled
sudo mv /etc/healtharchive/worker-auto-start-enabled.disabled /etc/healtharchive/worker-auto-start-enabled 2>/dev/null || sudo touch /etc/healtharchive/worker-auto-start-enabled
sudo systemctl enable --now healtharchive-crawl-auto-recover.timer
sudo systemctl enable --now healtharchive-worker-auto-start.timer
sudo systemctl start healtharchive-api.service healtharchive-replay.service postgresql.service
sudo systemctl start healtharchive-worker.service
Post-incident verification
- Public surface checks:
- curl -s http://127.0.0.1:8001/api/health && echo
- cd /opt/healtharchive && ./scripts/verify_public_surface.py (when appropriate)
- Worker/job health checks:
- cd /opt/healtharchive && ./scripts/vps-crawl-status.sh --year 2026
- systemctl list-units --all 'healtharchive-job*' --no-pager
- Storage/mount checks:
- df -h / /srv/healtharchive/storagebox
- findmnt -T /srv/healtharchive/jobs/<source>/<job_dir> -o SOURCE,FSTYPE,OPTIONS
Public communication
None. (No observed user-facing downtime; annual campaign internal pipeline issue.)
Open questions
- What is the “source of truth” workflow after reboot/rescue to ensure annual output tiering is restored before the worker runs?
- Should we add a boot-time (or worker-start-time) invariant check that annual output dirs are mounted to the Storage Box?
- Can we reduce the likelihood of needing a rescue-mode window for “disk mystery” investigations by improving on-host diagnostics and documentation?
Action items (TODOs)
- Add a runbook section: “Annual output tiering after reboot/rescue” (owner=ops, priority=high, due=2026-02-08)
- Add a guardrail: worker refuses to start annual crawls when output dir is on /dev/sda1 (owner=eng, priority=high, due=2026-02-15)
- Improve vps-annual-output-tiering.py UX:
- detect “Postgres not running / env not exported” and print a single-line fix. (owner=eng, priority=medium, due=2026-02-15)
- Ensure annual job configs always include campaign_kind/year metadata (owner=eng, priority=medium, due=2026-02-15)
Automation opportunities
- Safe automation:
- On boot (or before starting the worker), run an idempotent annual tiering “ensure” pass for the current campaign year.
- Alert when /srv/healtharchive/jobs/** output dirs are on the root filesystem while a campaign is active.
- What should stay manual:
- Deletion of local .local-* directories should remain manual unless it is preceded by a strong integrity check (to avoid data loss).
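The alert idea above could be sketched by comparing device IDs, which flags any job directory that shares a filesystem with `/` regardless of mount type. A minimal sketch, assuming GNU `stat` (for `-c %d`) and the `<source>/<job_dir>` layout used in this report; the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Succeed when $1 lives on the same filesystem (device) as /.
on_root_fs() {
  [ "$(stat -c %d "$1")" = "$(stat -c %d /)" ]
}

# Print every <source>/<job_dir> directory under $1 that sits on the root filesystem.
list_root_resident_dirs() {
  for dir in "$1"/*/*/; do
    [ -d "$dir" ] || continue
    if on_root_fs "$dir"; then
      echo "$dir"
    fi
  done
}

list_root_resident_dirs /srv/healtharchive/jobs
```

Any output during an active campaign would be the condition to alert on. The check only reports and never deletes.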
References / Artifacts
- Related investigation:
- ../../planning/implemented/2026-02-01-disk-usage-investigation.md
- Related playbooks:
- ../playbooks/storage/warc-storage-tiering.md
- ../playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
- Commands used during recovery:
- ./scripts/vps-crawl-status.sh --year 2026
- scripts/vps-annual-output-tiering.py
- healtharchive-storagebox-sshfs.service