Crawl stalls (monitoring + recovery)
Use this playbook when a crawl job is running but appears stalled (no progress for an extended period), or when you receive the HealthArchiveCrawlStalled alert.
Quick triage (recommended first):
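For the fastest read on overall crawl health, start with the status script on its own (a sketch: this is the same script used in the notes below, with the year set to the crawl you're investigating):
./scripts/vps-crawl-status.sh --year 2026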
Notes:
- The status script is read-only (no restarts, no DB writes); it’s safe mid-crawl.
- If the combined log is very large and you only want recent timeout signals, use:
./scripts/vps-crawl-status.sh --year 2026 --recent-lines 20000
- If you need to distinguish HTML/runtime friction from download/media frontier churn, run:
./scripts/vps-crawl-content-report.py --job-id JOB_ID
- On very large live jobs, bound the scan explicitly:
timeout 120 ./scripts/vps-crawl-content-report.py --job-id JOB_ID --max-log-files 1 --max-log-bytes 262144 --max-warc-files 3
1) Identify the stalled job
On the VPS:
set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive list-jobs --status running --limit 10
Then inspect the specific job:
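For example (a hypothetical sketch: "show-job" is an illustrative subcommand name, not a verified one — check the CLI's help output for the real job-detail command):
# JOB_ID comes from the list-jobs output above; "show-job" is hypothetical.
/opt/healtharchive/.venv/bin/healtharchive show-job JOB_ID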
2) Confirm “no progress”
Find the newest combined log for the job’s output directory:
JOBDIR="/srv/healtharchive/jobs/SOURCE/YYYYMMDDTHHMMSSZ__name"
ls -lt "${JOBDIR}"/archive_*.combined.log | head -n 5
LOG="$(ls -t "${JOBDIR}"/archive_*.combined.log | head -n 1)"
Check the most recent crawlStatus line(s):
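One way to pull them (a sketch that assumes crawlStatus lines appear verbatim in the combined log; grep -a guards against the log being treated as binary):
grep -a crawlStatus "${LOG}" | tail -n 5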
If the crawled count is not increasing for an extended period (often with repeated Navigation timeout warnings), treat the job as stalled; for reference, the recovery command below uses 3600 seconds of no progress as its threshold.
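If you would rather compare numbers than eyeball the log, the sketch below samples the counter twice, ten minutes apart; it assumes crawlStatus lines carry a numeric crawled field (adjust the extraction to your actual log format):
# Sample the crawled counter, wait, sample again, and compare.
extract_crawled() {
  LOG="$(ls -t "${JOBDIR}"/archive_*.combined.log | head -n 1)"
  grep -a crawlStatus "${LOG}" | tail -n 1 | grep -oE 'crawled[^0-9]*[0-9]+' | grep -oE '[0-9]+$'
}
C1="$(extract_crawled)"; sleep 600; C2="$(extract_crawled)"
echo "crawled: ${C1} -> ${C2}"
[ -n "${C1}" ] && [ "${C1}" = "${C2}" ] && echo "No progress in 10 minutes."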
3) Recovery (safe-by-default)
Before running recovery commands, answer this first:
- Is the proposed fix just operational recovery, or does it depend on a repo change such as source scope updates, source-profile tuning, watchdog logic, or reconcile behavior?
- If it depends on a repo change, stop here. Commit, push, and deploy that fix first, then verify the VPS checkout contains it before restarting the crawl.
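One way to sanity-check the deployed code (a sketch that assumes the VPS checkout sits at /opt/healtharchive — adjust to your deploy layout):
# Confirm the fix commit is present in the deployed checkout before restarting.
git -C /opt/healtharchive log --oneline -n 3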
If you confirm the crawl is stalled and you want to restart it, do:
# Stop the worker (interrupts the current crawl process).
sudo systemctl stop healtharchive-worker.service
# Mark the running job retryable so the worker can pick it up again.
set -a; source /etc/healtharchive/backend.env; set +a
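# Optional preview first. Assumption (not verified here): omitting --apply
# makes recover-stale-jobs report candidates without changing them; confirm
# with the CLI's --help before relying on this.
/opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs \
--older-than-minutes 5 \
--require-no-progress-seconds 3600 \
--source SOURCE \
--limit 5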
/opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs \
--older-than-minutes 5 \
--require-no-progress-seconds 3600 \
--apply \
--source SOURCE \
--limit 5
# Start the worker again.
sudo systemctl start healtharchive-worker.service
Then confirm the worker picked the job up again and crawlStatus is moving:
sudo systemctl status healtharchive-worker.service --no-pager
sudo journalctl -u healtharchive-worker.service -n 50 --no-pager
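To confirm forward progress, re-run the progress check from step 2 against the newest combined log (same caveat as above about the crawlStatus line format):
LOG="$(ls -t "${JOBDIR}"/archive_*.combined.log | head -n 1)"
grep -a crawlStatus "${LOG}" | tail -n 3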
Notes:
- archive_tool has built-in monitoring/adaptation; most stalls should self-heal, but this recovery is the "break glass" operator workflow.
- Prefer one controlled restart after a deployed fix over repeated blind retries against the same live config.
- Optional: you can enable the healtharchive-crawl-auto-recover.timer watchdog (sentinel: /etc/healtharchive/crawl-auto-recover-enabled) once you're confident in the thresholds/caps; see the sketch after this list.
- To periodically validate the watchdog logic safely on production, run the drills in crawl-auto-recover-drills.md.
- The watchdog is designed to avoid interrupting a healthy crawl; when another job is actively making progress, it may "soft recover" zombie status=running jobs by marking them retryable without restarting the worker.
- If enabled via systemd, the watchdog can also auto-start underfilled annual jobs (--ensure-min-running-jobs) to maintain concurrency.
- See docs/operations/thresholds-and-tuning.md and the "queue fill / auto-start" drills in crawl-auto-recover-drills.md.
- If the watchdog is enabled but prints "SKIP ... max recoveries reached", you can still do the manual recovery above, or (carefully) run the watchdog script once with a higher cap.
- If stalls repeat for the same URL(s), consider narrowing scope rules or adjusting crawler timeouts in the source's job configuration.
- For recurring source-specific failures, treat job_registry.py and annual reconciliation as the canonical fix path, not one-off VPS-only tweaks.
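A sketch of enabling the watchdog, assuming the timer unit is already installed on the host (the sentinel path and unit name are the ones referenced above):
# Create the sentinel file the watchdog checks for, then start its timer.
sudo touch /etc/healtharchive/crawl-auto-recover-enabled
sudo systemctl enable --now healtharchive-crawl-auto-recover.timer
# To disable later, remove the sentinel and run:
# sudo systemctl disable --now healtharchive-crawl-auto-recover.timer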