Incident: Annual crawl — PHAC output dir not writable (2026-01-09)
Status: draft
Metadata
- Date (UTC): 2026-01-09
- Severity: sev2
- Environment: production
- Primary area: crawl + storage
- Owner: (unassigned)
- Start (UTC): 2026-01-08T20:22:04Z
- End (UTC): 2026-01-09T13:39:52Z (mitigated; awaiting successful retry)
Summary
The annual crawl job for `phac` (job 7) repeatedly failed immediately because its job `output_dir` was not writable (`PermissionError` while creating a `.writable_test_*` file). The job produced no WARCs and consumed its retry budget.
Recovery restored a writable output directory and reset the job’s retry budget (`retry_count=0`) so the worker can safely reattempt it when capacity is available.
Impact
- User-facing impact: none directly, but the annual campaign remained `Ready for search: NO` while jobs were incomplete.
- Internal impact: operator intervention required; `phac` job blocked; retry budget consumed.
- Data impact:
  - Data loss: no (no WARCs were produced).
  - Data integrity risk: low (failure-to-start; no partial WARC writes).
  - Recovery completeness: partial (job left retryable; not yet re-run at time of write-up).
- Duration: ~17 hours (first failure → operator repair + retry reset).
Detection
- `./scripts/vps-crawl-status.sh --year 2026` showed: `phac` job 7: `status=retryable`, `crawl_rc=1`, `crawl_status=failed`, `WARC files=0`.
- Worker journal showed the root symptom during job startup:
  `CRITICAL ... Output directory ... is invalid or not writable: [Errno 13] Permission denied: .../.writable_test_<pid>`
Decision log
- 2026-01-09 — Avoided interventions that stop `healtharchive-worker.service` while `cihr` was actively crawling (to reduce the risk of turning an in-progress crawl into a `failed` job at max retries).
Timeline (UTC)
- 2026-01-08T20:22:04Z — Worker picked job 7 (`phac`); job failed immediately due to `output_dir` not writable (Errno 13).
- 2026-01-09T05:21:15Z — Status snapshot: job 7 still `retryable`/`failed` with 0 WARCs.
- 2026-01-09T13:10Z — Confirmed the job output dir is an `sshfs` hot-path mountpoint (`findmnt -T <output_dir>` shows `fstype=fuse.sshfs`).
- 2026-01-09T13:10Z — Attempted `chown` of the output dir failed (Permission denied) because the path is on `sshfs`.
- 2026-01-09T13:26:21Z — `healtharchive validate-job-config --id 7` confirmed crawler command construction and output dir resolution.
- 2026-01-09T13:39:52Z — Reset `retry_count` to `0` via a Python + SQLAlchemy session so the job can be retried safely.
Root cause
- Immediate trigger: `archive_tool` refused to start because `output_dir` was not writable.
- Underlying cause(s): the job output directory's mount/permissions were not compatible with the worker/crawler runtime user (details TBD).
Contributing factors
- Direct `psql` access from the operator account failed due to missing local DB role mapping (e.g., `role "haadmin" does not exist`).
- The `output_dir` is on `sshfs`, so ownership fixes via `chown` are not available on the VPS; recovery requires “make the mount writable” rather than “change owner”.
Resolution / Recovery
- Diagnosed job output dir mount + permissions:
  - Confirmed job config and path: `healtharchive show-job --id 7` → `Output dir: /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101`
  - Confirmed it is an `sshfs` hot-path mountpoint: `findmnt -T /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101 -o TARGET,SOURCE,FSTYPE,OPTIONS`
  - Confirmed the worker user: `systemctl show -p User -p Group healtharchive-worker.service` → `User=haadmin`, `Group=haadmin`
  - An attempt to fix ownership failed (`Permission denied`) because the output dir is on `sshfs`: `sudo chown <worker_user>:<worker_group> <output_dir>`
- Ensured a writable output dir:
  - Verified writability with a host-level probe: `touch /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101/.writable_test && rm /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101/.writable_test`
  - Validated annual tiering state for `phac`: `sudo /opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --year 2026 --sources phac --apply`
- Validated job configuration: `healtharchive validate-job-config --id 7`
- Reset the job retry budget:
  - Direct `psql` access failed due to missing DB roles for the operator account (`role "haadmin" does not exist`, `role "root" does not exist`).
  - Used a small Python snippet with `ha_backend.db.get_session()` to set `retry_count=0` for `job_id=7`:

    ```bash
    /opt/healtharchive/.venv/bin/python3 - <<'PY'
    from ha_backend.db import get_session
    from ha_backend.models import ArchiveJob

    job_id = 7
    with get_session() as session:
        job = session.get(ArchiveJob, job_id)
        if job is None:
            raise SystemExit(f"job {job_id} not found")
        old = job.retry_count
        job.retry_count = 0
        session.commit()
        print(f"OK job_id={job_id} retry_count {old} -> {job.retry_count}")
    PY
    ```
Post-incident verification
- Confirmed output dir is writable on the host.
- Confirmed job config dry-run passes.
- Confirmed job shows `Status: retryable`, `Retry count: 0`.
Open questions (still unknown)
- Why did this `sshfs` hot-path mount become non-writable for the worker user?
- Is there any automation that can proactively detect “output dir not writable” before a crawl attempt consumes retries?
- Should the worker/crawler user be changed to a dedicated service account (instead of an operator user) to reduce permission drift?
Action items (TODOs)
- Identify why this job’s `output_dir` was not writable (mount type + UID/GID expectations) and document the invariant we rely on. (priority=high)
- Add an operator-safe command to reset a crawl job’s retry budget: `healtharchive reset-retry-count` (dry-run by default; `--apply` required; skips running/lock-held jobs). (implemented 2026-02-06)
- Consider treating “output dir not writable” as an `infra_error` class so it does not consume retry budget. (priority=medium)
- Add a short ops note: when `psql` roles are missing, use the DB session method (Python snippet) rather than forcing `psql` as root. (priority=low)
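A rough sketch of the proposed `infra_error` classification. The `ErrorClass` enum, marker strings, and `consume_retry` helper are assumptions for illustration, not existing `ha_backend` APIs:

```python
from enum import Enum

class ErrorClass(Enum):
    CRAWL_ERROR = "crawl_error"   # counts against the job's retry budget
    INFRA_ERROR = "infra_error"   # environment problem; retry without penalty

# Log fragments that indicate the environment, not the crawl, is broken
# (assumed markers, matching the CRITICAL line seen in this incident)
INFRA_MARKERS = (
    "is invalid or not writable",
    "Permission denied",
)

def classify_failure(log_tail: str) -> ErrorClass:
    """Classify a failed job from the tail of its worker log."""
    if any(marker in log_tail for marker in INFRA_MARKERS):
        return ErrorClass.INFRA_ERROR
    return ErrorClass.CRAWL_ERROR

def consume_retry(retry_count: int, error_class: ErrorClass) -> int:
    # infra_error failures leave the retry budget untouched
    if error_class is ErrorClass.INFRA_ERROR:
        return retry_count
    return retry_count + 1
```

Under this scheme, job 7's failure would have been classified `infra_error` and the manual `retry_count` reset would not have been needed.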
Automation opportunities
- Add a periodic “job output dir writability probe” (metrics + alert) for queued/running annual jobs.
- Expand tiering/repair automation to ensure hot-path output dirs are consistently mounted/writable before a crawl starts.
References / Artifacts
- Operator snapshot script: `scripts/vps-crawl-status.sh`
- Incident response playbook: `../playbooks/core/incident-response.md`
- Crawl stalls playbook: `../playbooks/crawl/crawl-stalls.md`
- Storage hot-path incidents: `../playbooks/storage/storagebox-sshfs-stale-mount-recovery.md`
- Note: job 7 produced no combined crawl logs because it failed before `archive_tool` started.
- Related: `2026-01-09-annual-crawl-hc-job-stalled.md`