Skip to content

Infra-error retry storms + storage hot-path resilience (Implemented 2026-01-24)

Status: Implemented | Scope: Prevent single-VPS retry storms and “everything stopped” states caused by infrastructure failures (notably Errno 107 stale sshfs/FUSE mountpoints) during the 2026 annual campaign.

Outcomes

  • Worker resilience:
  • Added an infra-error cooldown so jobs that end in crawler_status=infra_error are not immediately re-selected in a tight loop.
  • Improved logging/operator signal around infra errors vs crawl failures.
  • Storage hot-path auto-recover hardening:
  • Detect stale/unreadable mountpoints (Errno 107) not just for running jobs, but also for “next jobs” (queued/retryable) to prevent retry storms.
  • Conservative recovery sequence with caps/cooldowns and deploy-lock avoidance.
  • Tiering helpers support stale-mount repair flags for safer recovery.
  • Worker auto-start safety:
  • Added a conservative watchdog to start the worker only when it should be running (jobs pending + storage OK), sentinel-gated.
  • Observability:
  • Exported watchdog metrics to node_exporter textfile collector (and documented enablement).

Canonical Docs Updated

  • docs/deployment/systemd/README.md
  • docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
  • docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-drills.md
  • docs/operations/playbooks/validation/automation-maintenance.md
  • docs/operations/thresholds-and-tuning.md

Decisions Created (if any)

  • docs/decisions/2026-01-24-single-vps-ops-automation-guardrails-for-crawl-and-storage.md

Historical Context

This plan was triggered by a 2026-01-24 production incident involving Errno 107 hot-path failures. Detailed implementation history and verification steps are preserved in git.