Incident: Auto-recover stall detection bugs (2026-02-06)
Status: closed
Metadata
- Date (UTC): 2026-02-06
- Severity: sev2
- Environment: production
- Primary area: crawl
- Owner: jerdaw
- Start (UTC): 2026-02-05T06:07:24Z (hc job 6 last progress)
- End (UTC): 2026-02-06T02:40:07Z (job 6 recovered and restarted)
Summary
The crawl auto-recover watchdog failed to detect and recover a stalled hc job (job 6) for ~20 hours due to a chain of four bugs. The watchdog ran every 5 minutes but incorrectly reported "no_stalled_jobs" despite the metrics exporter correctly flagging the stall. Manual investigation revealed bugs in stale log detection, runner detection, and cross-user lock file access. After fixing all four bugs, the watchdog successfully recovered the job automatically.
Impact
- User-facing impact: No 2026 annual search data available during the ~20h stall period.
- Internal impact: Auto-recovery automation broken; required manual debugging.
- Data impact:
- Data loss: no (WARCs preserved on disk)
- Data integrity risk: no
- Recovery completeness: complete (job resumed from last checkpoint)
- Duration: ~20 hours of stalled crawl; auto-recovery restored in ~2 hours of debugging/fixes.
Detection
- Detected via manual crawl status check (
./scripts/vps-crawl-status.sh --year 2026) - Metrics showed disconnect:
healtharchive_crawl_running_job_stalled{job_id="6"} = 1but auto-recover reportedreason="no_stalled_jobs" - Most useful signals:
- Metrics exporter vs auto-recover discrepancy
- Auto-recover service logs showing PermissionError crashes
sysctl fs.protected_regular = 2kernel setting
Timeline (UTC)
- 2026-02-05T06:07:24Z — hc job 6 last progress (crawled 299 pages)
- 2026-02-06T00:00:00Z — Operator checks crawl status, notices 18+ hour stall
- 2026-02-06T00:15:00Z — Investigation begins: metrics show stall, auto-recover doesn't
- 2026-02-06T00:30:00Z — Root cause 1 found:
_find_job_logusing stale DB path - 2026-02-06T00:45:00Z — Fix 1 deployed (4bddb7c)
- 2026-02-06T01:00:00Z — Auto-recover now detects stall but crashes on PermissionError
- 2026-02-06T01:15:00Z — Root cause 2 found:
_detect_job_runnernot checking job locks - 2026-02-06T01:30:00Z — Fix 2 deployed (1648f74)
- 2026-02-06T01:45:00Z — Auto-recover still crashes: PermissionError on lock file
- 2026-02-06T02:00:00Z — Root cause 3 found: OSError not caught in recovery CLI
- 2026-02-06T02:10:00Z — Fix 3 deployed (e77212b)
- 2026-02-06T02:25:00Z — Still fails:
fs.protected_regular=2blocks O_CREAT - 2026-02-06T02:30:00Z — Root cause 4 found: O_CREAT on existing file in /tmp
- 2026-02-06T02:35:00Z — Fix 4 deployed (e073749), recovery succeeds
- 2026-02-06T02:40:07Z — Job 6 automatically restarted, crawling resumed
Root cause
Four cascading bugs in the auto-recovery chain:
-
Stale log detection: Auto-recover's
_find_job_logreturned the DBcombined_log_pathimmediately without checking for newer logs on disk. If the DB path pointed to an old log (from a previous attempt), the watchdog parsed stale data and skipped the job. -
Runner detection gap:
_detect_job_runnerdidn't check held job locks. A dead crawl subprocess + live worker lock = incorrectly classified as runner="none", triggering soft-recovery instead of full recovery. -
Permission error handling: Lock probes only caught
JobAlreadyRunningError, notOSError. When cross-user permission issues occurred (root auto-recover vs haadmin worker), the command crashed instead of skipping gracefully. -
fs.protected_regular kernel protection: The kernel sysctl
fs.protected_regular=2blocksO_CREATon existing files in world-writable sticky directories (/tmp) when the caller doesn't own the file. The_job_lockfunction usedO_CREATunconditionally, causing EACCES for cross-user probes.
Contributing factors
- Metrics exporter had already been fixed for bug #1 (stale log detection), but auto-recover was not updated in sync
- No integration tests covering cross-user lock file scenarios (root vs non-root)
fs.protected_regular=2is a modern security feature not documented in the job lock implementation
Resolution / Recovery
- Fixed stale log detection (commit 4bddb7c):
- Updated
_find_job_loginscripts/vps-crawl-auto-recover.pyto match metrics exporter logic -
Added tests for newest-by-mtime selection
-
Fixed runner detection (commit 1648f74):
- Added job lock probe as final fallback in
_detect_job_runner -
Classify as "worker" when lock is held and worker is running
-
Fixed permission error handling (commit e77212b):
- Catch
OSErrorincmd_recover_stale_jobslock probe -
Treat
PermissionErroras "potentially held" in auto-recover -
Fixed O_CREAT issue (commit e073749):
- Try
O_RDWRfirst in_job_lock, fall back toO_CREAT | O_RDWRonly for new files -
Bypasses
fs.protected_regularrestrictions on existing files -
Fixed
/tmp/healtharchive-job-locks/permissions: - One-time:
chmod 1777 /tmp/healtharchive-job-locks/ - One-time:
chmod 666 /tmp/healtharchive-job-locks/job-*.lock
Post-incident verification
- Auto-recover successfully detected and recovered job 6 at 02:35 UTC
- Worker automatically picked up retryable job 6 at 02:40 UTC
- New crawl log created, crawlStatus advancing normally (5 pages/3 min)
- All 238 tests pass, CI green
- Auto-recover timer running every 5 min without errors
Action items
- Fix stale log detection in auto-recover (completed: 4bddb7c)
- Fix runner detection to check job locks (completed: 1648f74)
- Handle PermissionError in lock probes (completed: e77212b)
- Avoid O_CREAT on existing lock files (completed: e073749)
- Document fs.protected_regular interaction in MEMORY.md (completed)
- Create incident record (this file)
- Consider moving job locks out of /tmp to dedicated directory (priority=low, future improvement)
Automation opportunities
- The fixes enable fully automatic recovery for future stalls
- No additional automation needed — the bugs prevented existing automation from working
- Keep metrics exporter and auto-recover
_find_job_loglogic in sync (ongoing maintenance)
References / Artifacts
- Playbook:
docs/operations/playbooks/crawl/crawl-stalls.md - Auto-recover script:
scripts/vps-crawl-auto-recover.py - Metrics exporter:
scripts/vps-crawl-metrics-textfile.py - Test suite:
tests/test_ops_crawl_auto_recover_find_job_log.py - Operator scratch notes: (local, not in repo)
- Commits: 4bddb7c, 1648f74, e77212b, e073749