Incident: Annual crawl — Errno 107 hot-path issue triggered infra-error thrash and worker stopped (2026-01-24)
Status: draft
Metadata
- Date (UTC): 2026-01-24
- Severity (see severity.md): sev1
- Environment: production (single VPS)
- Primary area: storage
- Owner: (unassigned)
- Start (UTC): 2026-01-24T06:17:10Z (approx; storage hot-path auto-recover began running repeatedly)
- End (UTC): 2026-01-24T12:31:03Z (approx; worker restarted and annual crawl resumed)
Summary
During the annual 2026 campaign, Storage Box “hot path” mountpoints under `/srv/healtharchive/jobs/**` intermittently became stale and returned `OSError: [Errno 107] Transport endpoint is not connected`. The worker then repeatedly picked the PHAC annual job (job 7), immediately failed with an infra error, and re-picked it in a tight loop. Shortly after, the worker service became inactive, leaving the campaign with no running jobs until manual intervention restarted the worker and re-launched the HC crawl (job 6).
Impact
- User-facing impact: none directly observed, but the annual campaign remained `Ready for search: NO` and made no progress while the worker was down.
- Internal impact (ops burden, automation failures, etc.):
  - Alert noise (repeated infra errors and hot-path auto-recover runs).
  - Worker stopped; no jobs ran until manual restart.
  - Increased risk of “new crawl phase” restarts (loss of frontier continuity) after recovery.
- Data impact:
  - Data loss: unknown (no evidence of WARC deletion in this incident record).
  - Data integrity risk: medium (stale mounts can interrupt writes; a “new crawl phase” can reduce completeness by losing the crawl frontier).
  - Recovery completeness: partial (worker restarted and crawl resumed; the underlying Errno 107 trigger is not fully understood).
- Duration: ~6h 14m (approx; 06:17Z → 12:31Z).
Detection
- Operator reported overnight warning/error notifications (approx 5 hours before 12:22Z).
- `./scripts/vps-crawl-status.sh --year 2026` at ~12:22Z showed: `FAIL worker service is not active`, `healtharchive_crawl_running_jobs 0`, and the `phac` job marked `crawl_status=infra_error`.
- Storage hot-path watchdog recorded tiering failures mentioning `Errno 107` for the PHAC/CIHR hot paths.
- Worker journal around 06:28Z showed rapid repetition of the same infra error for job 7.
- `healtharchive-crawl-auto-recover.service` was running every ~5 minutes throughout the window and consistently completed successfully (“Deactivated successfully”), which suggests it was likely not the component that stopped the worker. (See timeline and artifacts.)
Timeline (UTC)
- 2026-01-24T05:45:10Z — `healtharchive-crawl-auto-recover.service` runs (timer-driven), completes successfully.
- 2026-01-24T05:50:10Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T05:55:10Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:00:10Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:05:01Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:10:10Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:15:07Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:17:10Z — `healtharchive-storage-hotpath-auto-recover.service` begins running repeatedly (timer-driven).
- 2026-01-24T06:28:01Z — Worker picks job 7 (PHAC) and immediately raises `Errno 107` for the job output dir; the same job is re-picked repeatedly within the same second (infra-error thrash).
- 2026-01-24T06:28:02Z — `healtharchive-worker.service` becomes inactive (stopped); no running jobs remain.
- 2026-01-24T06:30:10Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:35:10Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:40:10Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:45:05Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:50:01Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T06:55:10Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T07:00:05Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T07:05:01Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T07:10:01Z — `healtharchive-crawl-auto-recover.service` runs, completes successfully.
- 2026-01-24T12:22:11Z — Operator status snapshot confirms worker inactive and no running jobs.
- 2026-01-24T12:31:03Z — Manual recovery: worker started; job 6 (HC) restarts a crawl container and begins a “new crawl phase”.
Root cause
- Immediate trigger:
  - One or more hot-path mountpoints returned `Errno 107` during job execution (observed for the PHAC output dir; the watchdog also reported PHAC/CIHR paths).
- Underlying cause(s):
  - Likely intermittent sshfs/FUSE hot-path instability (similar failure mode to `2026-01-08-storage-hotpath-sshfs-stale-mount.md`), but the root trigger remains unconfirmed.
  - Worker behavior on infra errors allowed immediate re-pick of the same job with no effective cooldown, causing log spam and increased operational risk.
  - The worker stop event at 2026-01-24T06:28:02Z was a systemd stop; based on available logs, `healtharchive-crawl-auto-recover.service` does not appear to be the direct cause (it continued running successfully before/after).
Contributing factors
- Infra error handling did not appear to enforce a cooldown/backoff for fast-failing jobs, enabling a tight loop.
- Hot-path auto-recovery ran frequently but did not resolve the condition before the worker stopped.
- Operator CLI confusion: running `healtharchive show-job --id 6` without exporting `/etc/healtharchive/backend.env` defaulted to SQLite and failed with `no such table: archive_jobs` (not causal, but slowed diagnosis).
Resolution / Recovery
Manual recovery steps performed (state-changing):
1) Confirm mounts were readable again (spot-check):
timeout 5 ls -la /srv/healtharchive/storagebox >/dev/null
timeout 5 ls -la /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101 >/dev/null
timeout 5 ls -la /srv/healtharchive/jobs/cihr/20260101T000502Z__cihr-20260101 >/dev/null
2) Restart the worker:
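# exact restart command was not captured in this note; on this host it is typically:
sudo systemctl restart healtharchive-worker.service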
3) Verify the crawl restarted and a zimit container is running:
./scripts/vps-crawl-status.sh --year 2026
sudo systemctl --no-pager --full status healtharchive-worker.service
Observed outcome: job 6 (HC) restarted at ~12:31Z and began writing new crawl temp dirs/WARCs under the existing job output directory.
Additional recovery work (2026-01-25)
- Reset the retry budget for jobs 7 (PHAC) and 8 (CIHR) by writing `retry_count=0` directly via the backend ORM so the re-runs behaved like fresh attempts (see the sketch after this list).
- Fixed the output-directory permissions (`chown -R haadmin:haadmin`, `chmod 755`) and confirmed writability by touching `.writable_test_manual` inside each job dir as `haadmin`.
- Launched the jobs through the transient `systemd-run`-based helper (`scripts/vps-run-db-job-detached.py` / `systemd-run … run-db-job --id …`) so the crawls kept running while our SSH sessions closed.
- Relaxed permissions on the existing `.tmp*` directories (`docker run --rm -v "…:/output" alpine sh -c 'chmod -R a+rX /output/.tmp*'`) so the hot-path watchdog/ops scripts could read the WARCs without manual chmods.
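For future reference, a minimal sketch of the kind of ORM update used for the retry-budget reset, assuming a SQLAlchemy-style session; the module and model names below are illustrative, not the backend's actual API:

```python
# Illustrative sketch only: assumes a SQLAlchemy-style session factory and an
# ArchiveJob model with a retry_count column; real HealthArchive names may differ.
from healtharchive.db import SessionLocal      # hypothetical session factory
from healtharchive.models import ArchiveJob    # hypothetical ORM model

with SessionLocal() as session:
    for job_id in (7, 8):                      # PHAC and CIHR annual jobs
        job = session.get(ArchiveJob, job_id)
        job.retry_count = 0                    # make the re-run look like a fresh attempt
    session.commit()
```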
Post-incident verification
- Worker/job health checks:
  - `sudo systemctl status healtharchive-worker.service --no-pager -l` (worker active)
  - `./scripts/vps-crawl-status.sh --year 2026` (running job detected; metrics OK; crawlStatus advancing)
- Storage/mount checks:
  - `findmnt -T /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101`
  - `findmnt -T /srv/healtharchive/jobs/cihr/20260101T000502Z__cihr-20260101`
- Integrity checks:
  - Not performed as part of this initial incident note; consider WARC sampling on job 6 after stabilization (a sketch of one approach follows this list).
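If that sampling is done later, a minimal sketch of one approach, assuming the `warcio` library is available on the host; the path glob is illustrative and the real job 6 output layout may differ:

```python
"""Spot-check one recent WARC from job 6: count response records and confirm it parses cleanly."""
import glob
import sys

from warcio.archiveiterator import ArchiveIterator  # assumption: warcio installed on the host

# Illustrative path pattern; adjust to the actual job 6 output directory layout.
warcs = sorted(glob.glob("/srv/healtharchive/jobs/hc/*/**/*.warc.gz", recursive=True))
if not warcs:
    sys.exit("no WARC files found under the expected job output directory")

sample = warcs[-1]  # newest file as a cheap sample
responses = 0
with open(sample, "rb") as fh:
    for record in ArchiveIterator(fh):
        if record.rec_type == "response":
            responses += 1
print(f"{sample}: {responses} response records parsed without error")
```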
Open questions (still unknown)
- What is the underlying trigger for hot-path mount staleness (network blip, Storage Box behavior, sshfs option mismatch, local FUSE behavior)?
- Why did the worker service stop shortly after the infra-error thrash (explicit stop by automation vs worker exit)?
- Does the current infra-error cooldown/backoff logic actually prevent tight re-pick loops in practice, and is it observable via metrics/logs?
Action items (TODOs)
- Implement mitigations and recovery improvements in `docs/planning/implemented/2026-01-24-infra-error-and-storage-hotpath-hardening.md`. (owner=eng, priority=high, due=2026-02-01)
- Investigate the underlying cause of the hot-path staleness (Storage Box/sshfs/network/FUSE). (owner=eng, priority=high, due=2026-02-01)
- Add a worker-side guardrail to prevent tight “pick same job instantly” loops on infra errors (cooldown + metric). (owner=eng, priority=high, due=2026-02-01)
- Add a playbook section for “CLI shows sqlite/no such table” (env export reminder) to reduce operator confusion. (owner=ops, priority=low, due=2026-02-01)
Follow-up implementation details
- Added `scripts/vps-run-db-job-detached.py` and updated `docs/deployment/systemd/README.md` to point operators at this helper so specific jobs can be re-run within transient `systemd-run` units without keeping a shell session attached.
- Extended `scripts/vps-storage-hotpath-auto-recover.py` to probe the output directories of queued/retryable jobs (not just the currently running ones) so stale hot paths can be detected before the worker picks them; recovery still strictly unmounts only the specific stale targets.
- Extended the archive tool to discover `.tmp*` temp directories immediately after the container starts and periodically throughout the crawl. When `--relax-perms` is enabled the helper now runs during the crawl (at a configurable interval) so host commands can read WARCs without manual `chmod`, not just after the job finishes.
- Added a `last_healthy_*` timestamp to the storage hot-path watchdog state and Prometheus textfile metrics, while continuing to gate actual recovery attempts via the existing `last_apply_*` cooldown/cap fields. This gives dashboards a clearer signal when the watchdog has seen no stale targets (see the textfile sketch after this list).
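For context, a minimal sketch of how such a textfile gauge could be published, assuming a node_exporter-style textfile collector; the metric name, file path, and helper are illustrative, not the watchdog's actual output:

```python
import os
import time

# Assumption: node_exporter's textfile collector reads *.prom files from this directory.
TEXTFILE_DIR = "/var/lib/node_exporter/textfile_collector"
METRIC = "healtharchive_storage_hotpath_last_healthy_timestamp_seconds"  # illustrative name

def write_last_healthy(ts: float | None = None) -> None:
    """Atomically publish the time the watchdog last saw no stale hot-path targets."""
    ts = time.time() if ts is None else ts
    path = os.path.join(TEXTFILE_DIR, "healtharchive_storage_hotpath.prom")
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(f"# HELP {METRIC} Unix time the watchdog last saw no stale targets.\n")
        fh.write(f"# TYPE {METRIC} gauge\n")
        fh.write(f"{METRIC} {ts}\n")
    os.replace(tmp, path)  # atomic rename so the scraper never reads a partial file
```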
Automation opportunities
- Safe automation:
  - Detect `Errno 107` at the job output-dir boundary and pause/park affected jobs with a cooldown rather than thrashing the worker loop (see the sketch after this list).
  - If `Errno 107` is detected with high confidence, automate a conservative recovery sequence (stop worker → unmount stale hot paths only → re-apply tiering/bind mounts → restart worker), but only when crawl integrity risk is acceptable.
- What should stay manual:
  - Any automation that unmounts paths used by an actively running crawl container should be guarded heavily (risk of interrupting writes / frontier continuity).
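A minimal sketch of the pause/park guardrail described above; the helper names, the probe, and the cooldown value are hypothetical, not the worker's real API:

```python
import errno
import os
import time

INFRA_ERROR_COOLDOWN_S = 600          # park a fast-failing job for 10 minutes (illustrative value)
_parked_until: dict[int, float] = {}  # job_id -> earliest unix time it may be picked again

def output_dir_healthy(path: str) -> bool:
    """Probe a job output dir; treat Errno 107 (stale FUSE mount) as unhealthy."""
    try:
        os.listdir(path)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:  # 107: Transport endpoint is not connected
            return False
        raise

def may_pick(job_id: int, output_dir: str) -> bool:
    """Return False for jobs whose hot path is stale or that are still in cooldown."""
    now = time.time()
    if now < _parked_until.get(job_id, 0.0):
        return False                      # still parked: prevents the tight re-pick loop
    if not output_dir_healthy(output_dir):
        _parked_until[job_id] = now + INFRA_ERROR_COOLDOWN_S
        return False
    return True
```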
References / Artifacts
- Related incident: `docs/operations/incidents/2026-01-08-storage-hotpath-sshfs-stale-mount.md`
- Follow-up roadmap: `docs/planning/implemented/2026-01-24-infra-error-and-storage-hotpath-hardening.md`
- `healtharchive-crawl-auto-recover.service` journal window: `sudo journalctl -u healtharchive-crawl-auto-recover.service --since '2026-01-24 05:45:00' --until '2026-01-24 07:15:00' --no-pager | tail -n 200`
- Most relevant worker log excerpt (redacted):
Jan 24 06:28:01 <vps> healtharchive[...]: 2026-01-24 06:28:01,050 [INFO] healtharchive.worker: Worker picked job 7 for source phac (...) with status retryable and retry_count 0
Jan 24 06:28:01 <vps> healtharchive[...]: 2026-01-24 06:28:01,056 [WARNING] healtharchive.jobs: Job 7 raised during archive_tool execution: [Errno 107] Transport endpoint is not connected: '/srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101'
Jan 24 06:28:01 <vps> healtharchive[...]: 2026-01-24 06:28:01,063 [WARNING] healtharchive.worker: Crawl for job 7 failed due to infra error (RC=1). Not consuming retry budget (retry_count=0).
Jan 24 06:28:01 <vps> healtharchive[...]: 2026-01-24 06:28:01,068 [INFO] healtharchive.worker: Worker picked job 7 for source phac (...) with status retryable and retry_count 0
Jan 24 06:28:02 <vps> systemd[1]: Stopping healtharchive-worker.service - HealthArchive Worker...
Jan 24 06:28:02 <vps> systemd[1]: healtharchive-worker.service: Deactivated successfully.