Skip to content

Operational resilience improvements (Implemented 2026-02-01)

Status: Implemented | Scope: Single-VPS operational hardening for the 2026 annual campaign: reduce disk-pressure wedges, improve crawl recovery safety, and make watchdog behavior observable and drillable.

Outcomes

  • Disk pressure safeguards:
  • Worker pre-crawl headroom gate (prevents starting new jobs when disk is too full).
  • Safe cleanup automation for indexed jobs (temp-nonwarc) plus a disk-threshold safety net.
  • Crawl recovery hardening:
  • Crawl auto-recover watchdog uses a 60m stall threshold with a guard window to avoid disrupting healthy crawls.
  • Safe dry-run drills for crawl auto-recover (simulate stalled jobs; verify planned actions end-to-end).
  • Automated “soft recover” option: mark a stalled job retryable without stopping the worker when another job is progressing.
  • Storage hot-path hardening (Errno 107):
  • Storage hot-path auto-recover watchdog detects stale/unreadable mountpoints and repairs them with bounded, rate-limited actions.
  • Tiering helpers run with stale-mount repair flags to reduce manual intervention during campaigns.
  • Operator UX:
  • Clearer operator workflows for recovering stalled jobs and validating watchdog behavior via drills and metrics.
  • Follow-up captured:
  • Ongoing df vs du disk-usage discrepancy investigation kept as an active plan for a future maintenance window.

Canonical Docs Updated

  • docs/operations/thresholds-and-tuning.md
  • docs/deployment/systemd/README.md
  • docs/operations/playbooks/crawl/crawl-stalls.md
  • docs/operations/playbooks/crawl/crawl-auto-recover-drills.md
  • docs/operations/playbooks/crawl/cleanup-automation.md
  • docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md

Decisions Created (if any)

  • docs/decisions/2026-02-03-crawl-job-db-state-reconciliation.md

Historical Context

This plan was executed during the 2026 annual campaign while prioritizing safe-by-default automation (sentinel-gated, capped, observable). Detailed implementation history is preserved in git.