
Incident: Annual crawl — job output dirs on root disk caused disk pressure + crawl pauses (2026-02-04)

Status: closed

Metadata

  • Date (UTC): 2026-02-04
  • Severity (see severity.md): sev1
  • Environment: production (single VPS)
  • Primary area: storage
  • Owner: (unassigned)
  • Start (UTC): 2026-02-04T13:23:39Z (first operator snapshot showing disk pressure)
  • End (UTC): 2026-02-04T16:47:12Z (operator snapshot showing jobs resumed + root disk healthy)

Summary

The annual 2026 crawl campaign hit sustained root-disk pressure on the VPS (/dev/sda1 reached ~84–86% used), which triggered the worker’s disk safety guardrail (≥85% usage) and caused it to skip new crawl starts. Investigation showed that the annual crawl output directories for CIHR (~50GB) and PHAC (~1.2GB) were on the local root filesystem under /srv/healtharchive/jobs/** instead of being tiered to the Storage Box. We paused crawls to avoid a disk-full failure, migrated the output directories to the Storage Box, re-established the expected mounts under /srv/healtharchive/jobs/**, and resumed automation.

Impact

  • User-facing impact:
      • Public API/site remained healthy.
      • The annual 2026 campaign stayed at “Ready for search: NO” and made little or no crawl progress while paused.
  • Internal impact (ops burden, automation failures, etc.):
      • Operator time to pause long-running crawls and perform storage tiering.
      • Risk of a disk-full incident, avoided by pausing early.
  • Data impact:
      • Data loss: unknown (no evidence observed).
      • Data integrity risk: low-to-medium (risk was primarily “disk full” / interrupted writes if left running).
      • Recovery completeness: complete (output dirs remounted to Storage Box; jobs resumed).
  • Duration:
      • Disk pressure was present before 2026-02-04T13:23:39Z and resolved by ~16:xxZ; campaign was running again by 16:47Z.

Detection

  • Operator ran ./scripts/vps-crawl-status.sh --year 2026 and saw:
      • Root disk at ~84–86% used.
      • Worker log warnings indicating the disk guardrail was active (“Disk usage at 85% exceeds threshold…”).
  • sudo du -xhd3 /srv/healtharchive/jobs | sort -h | tail showed ~47–50GB under CIHR annual output on the local disk.

Most useful signals:

  • df -h / (root usage)
  • sudo du -xhd3 /srv/healtharchive/jobs | sort -h | tail -40 (who is using local disk)
  • findmnt -T /srv/healtharchive/jobs/<source>/<job_dir> -o SOURCE,FSTYPE,OPTIONS (is this actually on the Storage Box?)
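The first signal can be turned into a quick yes/no check against the guardrail threshold. The following is a minimal sketch, not the worker's actual implementation: the function names are hypothetical, and the 85% default is taken from the guardrail value quoted in this report.

```shell
# Hypothetical helpers (not part of the healtharchive codebase).

# Read `df -P` output on stdin and print the usage percentage of the
# first filesystem row (the line after the header), without the "%" sign.
used_pct_from_df() {
  awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Return 0 (true) when usage is at or above the threshold.
# The threshold defaults to 85, the guardrail value quoted in this report.
guardrail_would_trip() {
  [ "$1" -ge "${2:-85}" ]
}

# Example (commented out so that sourcing this file has no side effects):
# used="$(df -P / | used_pct_from_df)"
# if guardrail_would_trip "$used"; then
#   echo "root disk at ${used}%: guardrail would be active"
# fi
```

Keeping the parsing separate from the comparison makes it possible to test the logic against captured df output rather than the live disk.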

Decision log

  • 2026-02-04T14:4xZ — Decision: pause all long-running annual crawls (why: avoid disk-full failure; preserve service and data integrity).
  • 2026-02-04T15:3xZ — Decision: migrate annual output dirs with rsync (why: keep WARC/state artifacts; fastest path to reclaim root disk without losing crawl progress).
  • 2026-02-04T16:0xZ — Decision: resume automation only after mounts validated (why: prevent immediately writing back to local disk).

Timeline (UTC)

  • 2026-02-04T13:23:39Z — vps-crawl-status snapshot: root disk ~86% used; annual campaign running (HC+CIHR), PHAC retryable.
  • 2026-02-04T13:52:xxZ — Worker logs show disk guardrail active (≥85%), skipping crawl starts.
  • 2026-02-04T14:4xZ — Automation disabled and crawls stopped; jobs recovered to retryable in DB for safe restart later.
  • 2026-02-04T14:52:45Z — Fresh Postgres DB backup taken and copied to the Storage Box (with rsync flags adjusted to avoid sshfs chown failures).
  • 2026-02-04T15:32:26Z — Storage Box sshfs mount confirmed active.
  • 2026-02-04T15:3xZ → 16:0xZ — CIHR (~50GB) and PHAC (~1.2GB) annual output directories copied to the Storage Box via rsync.
  • 2026-02-04T16:0xZ — Annual output tiering applied; output dirs mounted back under /srv/healtharchive/jobs/**; local copies deleted; root disk returned to ~19% used.
  • 2026-02-04T16:15:33Z — Services restarted; worker resumed.
  • 2026-02-04T16:47:12Z — vps-crawl-status snapshot shows 3 running annual jobs and root disk healthy.

Root cause

  • Immediate trigger:
      • Annual crawl output for at least CIHR and PHAC was written to the VPS root filesystem under /srv/healtharchive/jobs/**, consuming ~50GB locally and pushing root above the worker’s safety threshold.
  • Underlying cause(s):
      • Annual output tiering/mounts were not in place for those job output dirs at the time the crawls ran (post-reboot / maintenance window).
      • Manual ops workflows are easy to run in an “unsafe order” (for example, leaving the worker running while mounts have not been validated).
      • The annual output tiering script can be run with missing env exports / DB offline, which causes confusing failures (SQLite “no such table”) that can delay recovery.

Contributing factors

  • Long-running crawls reduce opportunities for a clean maintenance window.
  • The root disk is fixed-size, so any large output dir that lands locally pushes usage close to or past the worker’s disk threshold.
  • rsync to sshfs mountpoints can fail on ownership/permissions by default (requires explicit flags).
  • CIHR job (job 8) config drift: missing annual campaign metadata made annual tooling less reliable until patched.

Resolution / Recovery

0) Pause/stop automation and crawls (make a maintenance window)

Disable the automations so they don’t immediately restart jobs while mounts are in flux:

# Disable crawl auto-recover and worker auto-start
sudo mv /etc/healtharchive/crawl-auto-recover-enabled{,.disabled} 2>/dev/null || true
sudo mv /etc/healtharchive/worker-auto-start-enabled{,.disabled} 2>/dev/null || true

sudo systemctl stop healtharchive-crawl-auto-recover.timer || true
sudo systemctl stop healtharchive-worker-auto-start.timer || true

# Stop worker (and any transient crawl units)
sudo systemctl stop healtharchive-worker.service || true
systemctl list-units --all 'healtharchive-job*' --no-pager
sudo systemctl stop <healtharchive-jobX-...>.service

Mark stopped jobs restartable:

set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs --older-than-minutes 1 --apply --source <source>

1) Ensure backups exist

sudo systemctl start healtharchive-db-backup.service
ls -lt /srv/healtharchive/backups/healtharchive_*.dump | head -n 3

2) Restore/verify Storage Box mount

sudo systemctl start healtharchive-storagebox-sshfs.service
df -h /srv/healtharchive/storagebox
findmnt -T /srv/healtharchive/storagebox -o SOURCE,FSTYPE,OPTIONS
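A stale sshfs mount can still appear in findmnt output while hanging on access, so it may help to confirm the mount actually responds before copying data onto it. The sketch below is an assumption-laden illustration: the function names and the 10-second limit are hypothetical, and it relies only on the standard timeout, ls, and mountpoint utilities.

```shell
# Hypothetical helpers (not part of the healtharchive codebase).

# Return 0 when listing the path completes within the time limit.
# $1 = path, $2 = limit in seconds (defaults to 10, an arbitrary choice).
path_responsive() {
  timeout "${2:-10}" ls -A "$1" >/dev/null 2>&1
}

# Return 0 only when the path is a mountpoint and responds in time.
storagebox_ready() {
  if ! mountpoint -q "$1"; then
    echo "not a mountpoint: $1" >&2
    return 1
  fi
  if ! path_responsive "$1" 10; then
    echo "mount unresponsive (possibly stale): $1" >&2
    return 1
  fi
}

# Example (commented out so that sourcing this file has no side effects):
# storagebox_ready /srv/healtharchive/storagebox && echo "Storage Box mount OK"
```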

3) Migrate large local output dirs to Storage Box

Use rsync flags that don’t try to preserve ownership/perms on sshfs:

sudo rsync -rtv --info=progress2 --partial --inplace \
  --no-owner --no-group --no-perms \
  /srv/healtharchive/jobs/cihr/20260101T000502Z__cihr-20260101/ \
  /srv/healtharchive/storagebox/jobs/cihr/20260101T000502Z__cihr-20260101/

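Optional sanity dry-run to see drift between source and destination. The -n flag makes rsync report what it would change without changing anything; do not drop -n while --delete is present unless you have reviewed the list: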

sudo rsync -rtvn --delete \
  --no-owner --no-group --no-perms \
  /srv/healtharchive/jobs/<source>/<job_dir>/ \
  /srv/healtharchive/storagebox/jobs/<source>/<job_dir>/

4) Re-establish the expected mounts under /srv/healtharchive/jobs/**

Key gotcha: the tiering script must target Postgres. Make sure env vars are exported and Postgres is running, otherwise you may see SQLite errors like no such table: sources.

sudo systemctl start postgresql.service
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; \
  /opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --apply --year 2026'
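The gotcha above could be caught before the tiering script runs at all. The following preflight is a sketch under stated assumptions: the environment variable name HEALTHARCHIVE_DATABASE_URL is hypothetical (check backend.env for the real name), and the function names are illustrative only.

```shell
# Hypothetical preflight (not part of the healtharchive codebase).
# ASSUMPTION: the DB URL lives in an env var; the name used in the example
# below (HEALTHARCHIVE_DATABASE_URL) is a guess and must be verified.

# Return 0 when the URL looks like a Postgres URL.
is_postgres_url() {
  case "$1" in
    postgres://*|postgresql://*|postgresql+*://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a single-line fix and return 1 when a precondition is missing.
tiering_preflight() {
  if ! is_postgres_url "$1"; then
    echo "DB URL unset or not Postgres; run: set -a; source /etc/healtharchive/backend.env; set +a" >&2
    return 1
  fi
  if ! systemctl is-active --quiet postgresql.service; then
    echo "Postgres is not running; run: sudo systemctl start postgresql.service" >&2
    return 1
  fi
}

# Example (commented out so that sourcing this file has no side effects):
# tiering_preflight "${HEALTHARCHIVE_DATABASE_URL:-}" && echo "safe to run tiering"
```

The URL check runs first so that a missing env export is reported without touching systemd.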

Validate mountpoints:

findmnt -T /srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101 -o SOURCE,FSTYPE
findmnt -T /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101 -o SOURCE,FSTYPE
findmnt -T /srv/healtharchive/jobs/cihr/20260101T000502Z__cihr-20260101 -o SOURCE,FSTYPE
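The three checks above can be looped so that a directory still resolving to the root filesystem fails loudly. This is a minimal sketch with hypothetical function names. Note that NOT_ROOT only means the directory sits on some other mount, so the FSTYPE column from the commands above should still be inspected to confirm it is the Storage Box.

```shell
# Hypothetical helpers (not part of the healtharchive codebase).

# Classify a mount target as reported by: findmnt -n -o TARGET -T <path>
classify_mount_target() {
  if [ "$1" = "/" ]; then echo "ROOT"; else echo "NOT_ROOT"; fi
}

# Check each directory given as an argument.
# Returns non-zero if any resolves to the root filesystem or cannot be resolved.
check_job_dirs() {
  rc=0
  for job_dir in "$@"; do
    if ! target="$(findmnt -n -o TARGET -T "$job_dir" 2>/dev/null)" || [ -z "$target" ]; then
      echo "UNKNOWN  $job_dir"
      rc=1
      continue
    fi
    kind="$(classify_mount_target "$target")"
    echo "$kind  $job_dir (mount target: $target)"
    if [ "$kind" = "ROOT" ]; then rc=1; fi
  done
  return "$rc"
}

# Example (commented out so that sourcing this file has no side effects):
# check_job_dirs /srv/healtharchive/jobs/*/20260101T000502Z__* || echo "do NOT resume the worker yet"
```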

5) Delete local copies and verify disk health

# Remove the leftover local copies (.local-* dirs) only after the findmnt checks in step 4 pass
sudo rm -rf /srv/healtharchive/jobs/*/*__*.local-*
df -h /
sudo du -xhd3 /srv/healtharchive/jobs | sort -h | tail -40

6) Resume services and automation

# Re-enable sentinels
sudo mv /etc/healtharchive/crawl-auto-recover-enabled.disabled /etc/healtharchive/crawl-auto-recover-enabled 2>/dev/null || sudo touch /etc/healtharchive/crawl-auto-recover-enabled
sudo mv /etc/healtharchive/worker-auto-start-enabled.disabled /etc/healtharchive/worker-auto-start-enabled 2>/dev/null || sudo touch /etc/healtharchive/worker-auto-start-enabled

sudo systemctl enable --now healtharchive-crawl-auto-recover.timer
sudo systemctl enable --now healtharchive-worker-auto-start.timer

sudo systemctl start healtharchive-api.service healtharchive-replay.service postgresql.service
sudo systemctl start healtharchive-worker.service
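The sentinel moves in steps 0 and 6 follow a simple pattern that could be wrapped in idempotent helpers. The sketch below assumes, based only on the commands in this report, that a sentinel is a file named <name>-enabled and that the disabled form carries a .disabled suffix. The function names are hypothetical.

```shell
# Hypothetical helpers (not part of the healtharchive codebase).
# $1 = sentinel directory (e.g. /etc/healtharchive), $2 = sentinel name
# (e.g. worker-auto-start).

# Disable a sentinel; a no-op if it is already disabled or absent.
sentinel_disable() {
  if [ -e "$1/$2-enabled" ]; then
    mv "$1/$2-enabled" "$1/$2-enabled.disabled"
  fi
}

# Enable a sentinel; restores the .disabled file, or creates a fresh one.
sentinel_enable() {
  if [ -e "$1/$2-enabled.disabled" ]; then
    mv "$1/$2-enabled.disabled" "$1/$2-enabled"
  else
    touch "$1/$2-enabled"
  fi
}

# Print "enabled" or "disabled" for a sentinel.
sentinel_state() {
  if [ -e "$1/$2-enabled" ]; then echo enabled; else echo disabled; fi
}
```

Against /etc/healtharchive these helpers would need to run as root. Printing sentinel_state for both sentinels after step 6 gives a quick confirmation that automation is back on.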

Post-incident verification

  • Public surface checks:
      • curl -s http://127.0.0.1:8001/api/health && echo
      • cd /opt/healtharchive && ./scripts/verify_public_surface.py (when appropriate)
  • Worker/job health checks:
      • cd /opt/healtharchive && ./scripts/vps-crawl-status.sh --year 2026
      • systemctl list-units --all 'healtharchive-job*' --no-pager
  • Storage/mount checks:
      • df -h / /srv/healtharchive/storagebox
      • findmnt -T /srv/healtharchive/jobs/<source>/<job_dir> -o SOURCE,FSTYPE,OPTIONS

Public communication

None. (No observed user-facing downtime; annual campaign internal pipeline issue.)

Open questions

  • What is the “source of truth” workflow after reboot/rescue to ensure annual output tiering is restored before the worker runs?
  • Should we add a boot-time (or worker-start-time) invariant check that annual output dirs are mounted to the Storage Box?
  • Can we reduce the likelihood of needing a rescue-mode window for “disk mystery” investigations by improving on-host diagnostics and documentation?

Action items (TODOs)

  • Add a runbook section: “Annual output tiering after reboot/rescue” (owner=ops, priority=high, due=2026-02-08)
  • Add a guardrail: worker refuses to start annual crawls when output dir is on /dev/sda1 (owner=eng, priority=high, due=2026-02-15)
  • Improve vps-annual-output-tiering.py UX:
      • Detect “Postgres not running / env not exported” and print a single-line fix. (owner=eng, priority=medium, due=2026-02-15)
  • Ensure annual job configs always include campaign_kind/year metadata (owner=eng, priority=medium, due=2026-02-15)

Automation opportunities

  • Safe automation:
      • On boot (or before starting the worker), run an idempotent annual tiering “ensure” pass for the current campaign year.
      • Alert when /srv/healtharchive/jobs/** output dirs are on the root filesystem while a campaign is active.
  • What should stay manual:
      • Deletion of local .local-* directories should remain manual unless preceded by a strong integrity check (to avoid data loss).
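One possible shape for such an integrity check is a checksum manifest comparison between the local copy and the Storage Box copy. The following is a sketch only, with hypothetical function names. It reads every byte on both sides, so on a ~50GB tree over sshfs it would be slow, and a cheaper or rsync-based comparison may be preferable in practice.

```shell
# Hypothetical helpers (not part of the healtharchive codebase).

# Print a "checksum size path" line for every regular file under $1,
# sorted by path so that two manifests can be compared directly.
manifest() {
  ( cd "$1" && find . -type f -exec cksum {} + | sort -k 3 )
}

# Return 0 only when both directories exist and contain the same files
# with the same contents; return 2 when either directory is missing.
trees_match() {
  if [ ! -d "$1" ] || [ ! -d "$2" ]; then
    return 2
  fi
  [ "$(manifest "$1")" = "$(manifest "$2")" ]
}

# Example (commented out; <source>, <job_dir> and <local_copy> are
# placeholders, not real paths):
# trees_match /srv/healtharchive/jobs/<source>/<local_copy> \
#             /srv/healtharchive/storagebox/jobs/<source>/<job_dir> \
#   && echo "copies match; local copy is a deletion candidate"
```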

References / Artifacts

  • Related investigation: ../../planning/implemented/2026-02-01-disk-usage-investigation.md
  • Related playbooks:
      • ../playbooks/storage/warc-storage-tiering.md
      • ../playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
  • Commands used during recovery:
      • ./scripts/vps-crawl-status.sh --year 2026
      • scripts/vps-annual-output-tiering.py
      • healtharchive-storagebox-sshfs.service