Infra-error retry storms + storage hot-path resilience (Implemented 2026-01-24)
Status: Implemented | Scope: Prevent single-VPS retry storms and “everything stopped” states caused by infrastructure failures (notably Errno 107 stale sshfs/FUSE mountpoints) during the 2026 annual campaign.
Outcomes
- Worker resilience:
  - Added an infra-error cooldown so jobs that end in crawler_status=infra_error are not immediately re-selected in a tight loop.
  - Improved logging/operator signal around infra errors vs crawl failures.
- Storage hot-path auto-recover hardening:
  - Detect stale/unreadable mountpoints (Errno 107) not just for running jobs, but also for “next jobs” (queued/retryable) to prevent retry storms.
  - Conservative recovery sequence with caps/cooldowns and deploy-lock avoidance.
  - Tiering helpers support stale-mount repair flags for safer recovery.
- Worker auto-start safety:
  - Added a conservative watchdog to start the worker only when it should be running (jobs pending + storage OK), sentinel-gated.
- Observability:
  - Exported watchdog metrics to node_exporter textfile collector (and documented enablement).
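
The stale-mount check for "next jobs" and the infra-error cooldown listed above, shown together as an added example (not the project's own code). The job fields other than crawler_status, the 600-second cooldown, and the injectable stat argument are assumptions made for illustration.

```python
import errno
import os
import time

INFRA_COOLDOWN_S = 600  # assumed value; the page states no number


def probe_mount(path, stat=os.stat):
    """Return 'ok', 'stale' (Errno 107) or 'unreadable' for a storage path."""
    try:
        stat(path)
        return "ok"
    except OSError as exc:
        # ENOTCONN is errno 107 on Linux: what a dead sshfs/FUSE mount returns.
        return "stale" if exc.errno == errno.ENOTCONN else "unreadable"


def select_next_job(jobs, now=None, stat=os.stat, cooldown=INFRA_COOLDOWN_S):
    """Pick the first runnable job; also return paths that need mount recovery.

    Jobs still cooling down after infra_error are skipped, and so are queued
    jobs whose output path is not readable, so neither can drive a retry loop.
    """
    now = time.time() if now is None else now
    needs_recovery = []
    for job in jobs:
        cooling = (job["crawler_status"] == "infra_error"
                   and now - job["finished_at"] < cooldown)
        if cooling:
            continue
        if probe_mount(job["output_dir"], stat) != "ok":
            needs_recovery.append(job["output_dir"])
            continue
        return job, needs_recovery
    return None, needs_recovery


def constructed_stat(path):
    """Constructed stand-in for os.stat: paths under /mnt/stale fail with ENOTCONN."""
    if path.startswith("/mnt/stale"):
        raise OSError(errno.ENOTCONN, "Transport endpoint is not connected")


SAMPLE_JOBS = [  # constructed sample data; field names are hypothetical
    {"id": 1, "crawler_status": "infra_error", "finished_at": 1000, "output_dir": "/mnt/ok/a"},
    {"id": 2, "crawler_status": "queued", "finished_at": 0, "output_dir": "/mnt/stale/b"},
    {"id": 3, "crawler_status": "queued", "finished_at": 0, "output_dir": "/mnt/ok/c"},
]

if __name__ == "__main__":
    print(select_next_job(SAMPLE_JOBS, now=1100, stat=constructed_stat))
```
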
Canonical Docs Updated
- docs/deployment/systemd/README.md
- docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
- docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-drills.md
- docs/operations/playbooks/validation/automation-maintenance.md
- docs/operations/thresholds-and-tuning.md
Decisions Created (if any)
docs/decisions/2026-01-24-single-vps-ops-automation-guardrails-for-crawl-and-storage.md
Historical Context
This plan was triggered by a 2026-01-24 production incident involving Errno 107 hot-path failures. Detailed implementation history and verification steps are preserved in git.