Incident: Annual crawl — HC job stalled (2026-01-09)

Status: resolved

Metadata

  • Date (UTC): 2026-01-09
  • Severity: sev1
  • Environment: production
  • Primary area: crawl
  • Owner: (unassigned)
  • Start (UTC): 2026-01-09T07:34:37Z (last observed crawl progress)
  • End (UTC): 2026-01-16T02:56:12Z

Summary

The annual crawl job for hc (job 6) stalled: crawlStatus stopped advancing, and the crawl metrics exporter flagged the job as stalled. The stall correlated with repeated Navigation timeout warnings on canada.ca pages.

Manual recovery (stop worker + recover stale jobs) was intentionally deferred while cihr (job 8) was actively crawling, to avoid interrupting that in-progress crawl and turning it into a failed job at max retries. The deferred recovery was later performed successfully on 2026-01-16.

Impact

  • User-facing impact: the annual campaign remained in the “Ready for search: NO” state.
  • Internal impact: operator attention required; hc crawl not progressing.
  • Data impact:
      • Data loss: unknown (WARCs exist in temp dirs, but crawl completeness is unknown until completion).
      • Data integrity risk: low/unknown (no specific corruption signals observed; primarily a progress/stall problem).
      • Recovery completeness: recovered for the 2026-01-09 stalled attempt.

Detection

  • ./scripts/vps-crawl-status.sh --year 2026 --job-id 6 showed (a triage sketch follows this list):
      • healtharchive_crawl_running_job_stalled{job_id="6",source="hc"} 1
      • last_progress_age_seconds climbed into the multi-hour range.
      • The crawlStatus tail stopped advancing.
      • Recent timeouts showed repeated “Navigation timeout of 90000 ms exceeded”.
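
For repeat triage, the same signals can be pulled out of the snapshot in one pass. A minimal sketch, assuming the snapshot prints the exporter metrics verbatim; only the script path and flags above are confirmed, and the grep patterns are the metric names observed in this incident:

```bash
#!/usr/bin/env bash
# Triage sketch: surface the stall signals for job 6 from the operator snapshot.
# Assumes vps-crawl-status.sh echoes exporter metrics verbatim; adjust the
# patterns if the snapshot format differs.
set -euo pipefail

snapshot="$(./scripts/vps-crawl-status.sh --year 2026 --job-id 6)"

# Stalled gauge: a value of 1 means the exporter considers the job stalled.
echo "$snapshot" | grep 'healtharchive_crawl_running_job_stalled{job_id="6"' || true

# Seconds since last observed progress; multi-hour values are the red flag.
echo "$snapshot" | grep 'last_progress_age_seconds' || true
```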

Decision log

  • 2026-01-09 — Deferred the “stop worker + recover stale jobs” procedure while job 8 (cihr) was actively crawling to reduce the risk of interrupting it at max retries.

Timeline (UTC)

  • 2026-01-09T06:05:14Z — Job 6 started (latest observed start time in status snapshot).
  • 2026-01-09T07:34:37Z — Last observed crawlStatus progress for job 6 (crawled=437, total=3209, pending=1).
  • 2026-01-09T12:57:17Z — Status snapshot shows multi-hour no-progress and stalled=1.
  • 2026-01-09T13:33:23Z — Status snapshot still shows stalled=1.
  • 2026-01-16T02:56:12Z — Manual recovery performed (stop worker + recover stale jobs). Job 6 restarted and began a new crawl attempt.

Root cause

Unknown. Strong signals point to crawl progress being blocked by repeated page-load failures/timeouts and/or a crawler worker getting stuck on a specific URL.

As of 2026-01-16, the job showed many net::ERR_HTTP2_PROTOCOL_ERROR failures on canada.ca, and archive_tool applied repeated backoff delays after hitting its HTTP/network error threshold.
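
To quantify both failure signatures, the combined log can be grepped directly. A minimal sketch, assuming the error strings appear verbatim in the log; the path is the post-recovery combined log listed under References / Artifacts:

```bash
#!/usr/bin/env bash
# Count the two failure signatures observed in this incident.
set -euo pipefail

# Post-recovery combined log for job 6 (see References / Artifacts).
log=/srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101/archive_new_crawl_phase_-_attempt_1_20260116_025617.combined.log

echo "HTTP/2 protocol errors: $(grep -c 'net::ERR_HTTP2_PROTOCOL_ERROR' "$log" || true)"
echo "90s navigation timeouts: $(grep -c 'Navigation timeout of 90000 ms exceeded' "$log" || true)"
```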

Contributing factors

  • Many canada.ca pages timed out (90s navigation timeouts), increasing the chance of long “pending page” windows.
  • hc and cihr were both running; the safest recovery approach (stopping the worker) would interrupt both.

Decision: Manual Recovery (Option C)

We elected to keep the manual recover-stale-jobs procedure (documented in ../playbooks/crawl/crawl-stalls.md) rather than automate granular per-job stops. Given how rare stalls are, the risk of interrupting a healthy concurrent job is acceptable, and stopping the worker is the safest way to avoid partial state corruption.

Resolution / Recovery

Performed on 2026-01-16 (VPS):

  • Followed docs/operations/playbooks/crawl/crawl-stalls.md:
      • sudo systemctl stop healtharchive-worker.service
      • set -a; source /etc/healtharchive/backend.env; set +a
      • /opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs --older-than-minutes 5 --apply --source hc --limit 1
      • sudo systemctl start healtharchive-worker.service
  • Verified the job restarted (Started at updated) and a new combined log was created. (A scripted check is sketched below.)
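
The verification step can be scripted against the same snapshot used for detection. A minimal sketch, assuming the stalled gauge drops to 0 (or the series disappears) once the restarted attempt makes progress:

```bash
#!/usr/bin/env bash
# Post-recovery check: worker is running and job 6 is no longer flagged stalled.
set -euo pipefail

sudo systemctl is-active healtharchive-worker.service

# Expect 0 (or no matching series) once the new attempt progresses; re-run
# after a few minutes if the gauge still reads 1.
./scripts/vps-crawl-status.sh --year 2026 --job-id 6 | grep 'job_stalled' || true
```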

Post-incident verification

  • The 2026-01-16 recovery created a new combined log and advanced Started at for job 6.
  • The 2026-01-09 stalled attempt was superseded by the restarted crawl. Later annual-crawl follow-up work is tracked separately in the ops roadmap and newer incident notes.

Open questions (still unknown)

  • What exact URL/work unit is the crawler stuck on (if any), and does it repeat across retries? (See the log-mining sketch after this list.)
  • Are timeouts driven by site performance, network issues, headless browser instability, or scope rules?
  • Would changing timeouts/adaptive restart thresholds reduce repeat stalls without harming completeness?
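
For the first question, a starting point is to mine the combined log for the pages behind the timeout warnings. A minimal sketch, assuming the timeout log lines carry the page URL; log formats vary, so the URL extraction below is a placeholder:

```bash
#!/usr/bin/env bash
# Rank URLs by how often they hit the 90s navigation timeout, to spot a page
# the crawler keeps getting stuck on across retries.
# ASSUMPTION: timeout lines include the URL; adjust the extraction if not.
set -euo pipefail

log=/srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101/archive_new_crawl_phase_-_attempt_1_20260116_025617.combined.log

grep 'Navigation timeout of 90000 ms exceeded' "$log" \
  | grep -o 'https://[^" ]*' \
  | sort | uniq -c | sort -rn | head -20 \
  || echo "(no timeout lines with URLs found)"
```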

Action items (TODOs)

  • After cihr completes (or during a maintenance window), perform the planned recovery steps and update this note with outcomes. (priority=high; performed 2026-01-16, see Resolution / Recovery)
  • (Pending Operator Check) If the stall repeats, capture the specific repeated URL(s) and assess whether scope/timeout tuning is warranted. (owner=ops, priority=medium)
  • Consider tightening/clarifying automation boundaries: per-job recovery without stopping unrelated active crawls.
      • Decision: explicitly deferred/rejected in favor of the manual “stop worker + recover” procedure (Option C) to minimize complexity.

Automation opportunities

  • Improve “stalled crawl” detection to include the most recent pending URL and age as part of operator output (snapshot script) and/or alert annotations. (A sketch follows this list.)
  • Investigate whether recovery can be scoped to a single crawl process/container without stopping the entire worker loop (risk: false positives and partial state).
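
For the first bullet, the snapshot output could grow a “most recent pending URL + age” line. A minimal sketch that uses the latest timeout line as a stand-in for the pending URL, since the crawlStatus format is not captured in this note; a real implementation should read crawlStatus directly:

```bash
#!/usr/bin/env bash
# Operator-output sketch: approximate "what is the crawler stuck on" from the
# combined log. ASSUMPTION: the newest timeout line is a usable proxy for the
# currently pending page; a crawlStatus-based version would be more precise.
set -euo pipefail

log=/srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101/archive_new_crawl_phase_-_attempt_1_20260116_025617.combined.log

last_timeout="$(grep 'Navigation timeout' "$log" | tail -1 || true)"
age_seconds=$(( $(date +%s) - $(stat -c %Y "$log") ))   # GNU stat (Linux VPS)

echo "last timeout line: ${last_timeout:-none}"
echo "seconds since log last written: ${age_seconds}"
```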

References / Artifacts

  • Operator snapshot script: scripts/vps-crawl-status.sh
  • Latest combined log (as of 2026-01-09 12:57Z snapshot): /srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101/archive_new_crawl_phase_-_attempt_1_20260109_060517.combined.log
  • Latest combined log after 2026-01-16 recovery: /srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101/archive_new_crawl_phase_-_attempt_1_20260116_025617.combined.log
  • Playbook: ../playbooks/crawl/crawl-stalls.md
  • Playbook: ../playbooks/core/incident-response.md
  • Related: 2026-01-09-annual-crawl-phac-output-dir-permission-denied.md