Operational resilience improvements (Implemented 2026-02-01)
Status: Implemented | Scope: Single-VPS operational hardening for the 2026 annual campaign: reduce disk-pressure wedges, improve crawl recovery safety, and make watchdog behavior observable and drillable.
Outcomes
- Disk pressure safeguards:
- Worker pre-crawl headroom gate (prevents starting new jobs when disk is too full).
- Safe cleanup automation for indexed jobs (
temp-nonwarc) plus a disk-threshold safety net. - Crawl recovery hardening:
- Crawl auto-recover watchdog uses a 60m stall threshold with a guard window to avoid disrupting healthy crawls.
- Safe dry-run drills for crawl auto-recover (simulate stalled jobs; verify planned actions end-to-end).
- Automated “soft recover” option: mark a stalled job retryable without stopping the worker when another job is progressing.
- Storage hot-path hardening (Errno 107):
- Storage hot-path auto-recover watchdog detects stale/unreadable mountpoints and repairs them with bounded, rate-limited actions.
- Tiering helpers run with stale-mount repair flags to reduce manual intervention during campaigns.
- Operator UX:
- Clearer operator workflows for recovering stalled jobs and validating watchdog behavior via drills and metrics.
- Follow-up captured:
- Ongoing
dfvsdudisk-usage discrepancy investigation kept as an active plan for a future maintenance window.
Canonical Docs Updated
docs/operations/thresholds-and-tuning.mddocs/deployment/systemd/README.mddocs/operations/playbooks/crawl/crawl-stalls.mddocs/operations/playbooks/crawl/crawl-auto-recover-drills.mddocs/operations/playbooks/crawl/cleanup-automation.mddocs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
Decisions Created (if any)
docs/decisions/2026-02-03-crawl-job-db-state-reconciliation.md
Historical Context
This plan was executed during the 2026 annual campaign while prioritizing safe-by-default automation (sentinel-gated, capped, observable). Detailed implementation history is preserved in git.