Storage Watchdog Observability Hardening (Implemented 2026-02-06)
Status: Implemented | Scope: Improve confidence and signal quality for Storage Box hot-path auto-recovery (Errno 107).
Outcomes
- Added explicit tests for hot-path auto-recover behavior and failure modes:
tests/test_ops_storage_hotpath_auto_recover.py- Added persistent failed-apply alerting (startup-safe, low-noise default severity):
ops/observability/alerting/healtharchive-alerts.yml(HealthArchiveStorageHotpathApplyFailedPersistent)tests/test_ops_alert_rules.py- Added burn-in tooling for evidence capture and a clean gate:
scripts/vps-storage-watchdog-burnin-report.pytests/test_ops_storage_watchdog_burnin_report.py- Added optional daily snapshot scheduling for burn-in (read-only):
docs/deployment/systemd/healtharchive-storage-watchdog-burnin-snapshot.servicedocs/deployment/systemd/healtharchive-storage-watchdog-burnin-snapshot.timerscripts/vps-storage-watchdog-burnin-snapshot.sh
Canonical Docs Updated
docs/operations/monitoring-and-alerting.mddocs/operations/thresholds-and-tuning.mddocs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.mddocs/operations/playbooks/storage/storagebox-sshfs-stale-mount-drills.mddocs/operations/playbooks/validation/automation-maintenance.md
Validation
make ciremains green.- Burn-in gate command returns exit 0 when no unresolved issues exist:
python3 scripts/vps-storage-watchdog-burnin-report.py --window-hours 168 --require-clean
Historical Context
Detailed implementation narrative and tuning is preserved in git history.