Crawl stalls (monitoring + recovery)
Use this playbook when a crawl job is running but appears stalled (no progress for an extended period), or when you receive the HealthArchiveCrawlStalled alert.
Quick triage (recommended first):
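For the fastest read on overall crawl health, start with the status script on its own (a sketch: this is the same script used in the notes below, with the year set to the crawl you're investigating):
./scripts/vps-crawl-status.sh --year 2026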
Notes:
- The status script is read-only (no restarts, no DB writes); it’s safe mid-crawl.
- If the combined log is very large and you only want recent timeout signals, use:
./scripts/vps-crawl-status.sh --year 2026 --recent-lines 20000
- If you need to distinguish HTML/runtime friction from download/media frontier churn, run:
./scripts/vps-crawl-content-report.py --job-id JOB_ID
- On very large live jobs, bound the scan explicitly:
timeout 120 ./scripts/vps-crawl-content-report.py --job-id JOB_ID --max-log-files 1 --max-log-bytes 262144 --max-warc-files 3
1) Identify the stalled job
On the VPS:
set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive list-jobs --status running --limit 10
Then inspect the specific job:
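For example (a hypothetical sketch: "show-job" is an illustrative subcommand name, not a verified one — check the CLI's help output for the real job-detail command):
# JOB_ID comes from the list-jobs output above; "show-job" is hypothetical.
/opt/healtharchive/.venv/bin/healtharchive show-job JOB_ID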
2) Confirm “no progress”
Find the newest combined log for the job’s output directory:
JOBDIR="/srv/healtharchive/jobs/SOURCE/YYYYMMDDTHHMMSSZ__name"
ls -lt "${JOBDIR}"/archive_*.combined.log | head -n 5
LOG="$(ls -t "${JOBDIR}"/archive_*.combined.log | head -n 1)"
Check the most recent crawlStatus line(s):
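One way to pull them (a sketch that assumes crawlStatus lines appear verbatim in the combined log; grep -a guards against the log being treated as binary):
grep -a crawlStatus "${LOG}" | tail -n 5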
If the crawled count is not increasing for an extended period (often with repeated Navigation timeout warnings), treat the job as stalled; for reference, the recovery command below uses 3600 seconds of no progress as its threshold.
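If you would rather compare numbers than eyeball the log, the sketch below samples the counter twice, ten minutes apart; it assumes crawlStatus lines carry a numeric crawled field (adjust the extraction to your actual log format):
# Sample the crawled counter, wait, sample again, and compare.
extract_crawled() {
  LOG="$(ls -t "${JOBDIR}"/archive_*.combined.log | head -n 1)"
  grep -a crawlStatus "${LOG}" | tail -n 1 | grep -oE 'crawled[^0-9]*[0-9]+' | grep -oE '[0-9]+$'
}
C1="$(extract_crawled)"; sleep 600; C2="$(extract_crawled)"
echo "crawled: ${C1} -> ${C2}"
[ -n "${C1}" ] && [ "${C1}" = "${C2}" ] && echo "No progress in 10 minutes."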
3) Recovery (safe-by-default)
Before running recovery commands, answer this first:
- Is the proposed fix just operational recovery, or does it depend on a repo change such as source scope updates, source-profile tuning, watchdog logic, or reconcile behavior?
- If it depends on a repo change, stop here. Commit, push, and deploy that fix first, then verify the VPS checkout contains it before restarting the crawl.
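One way to sanity-check the deployed code (a sketch that assumes the VPS checkout sits at /opt/healtharchive — adjust to your deploy layout):
# Confirm the fix commit is present in the deployed checkout before restarting.
git -C /opt/healtharchive log --oneline -n 3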
If you confirm the crawl is stalled and you want to restart it, do:
# Stop the worker (interrupts the current crawl process).
sudo systemctl stop healtharchive-worker.service
# Mark the running job retryable so the worker can pick it up again.
set -a; source /etc/healtharchive/backend.env; set +a
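# Optional preview first. Assumption (not verified here): omitting --apply
# makes recover-stale-jobs report candidates without changing them; confirm
# with the CLI's --help before relying on this.
/opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs \
--older-than-minutes 5 \
--require-no-progress-seconds 3600 \
--source SOURCE \
--limit 5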
/opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs \
--older-than-minutes 5 \
--require-no-progress-seconds 3600 \
--apply \
--source SOURCE \
--limit 5
# Start the worker again.
sudo systemctl start healtharchive-worker.service
Then confirm the worker picked the job up again and crawlStatus is moving:
sudo systemctl status healtharchive-worker.service --no-pager
sudo journalctl -u healtharchive-worker.service -n 50 --no-pager
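To confirm forward progress, re-run the progress check from step 2 against the newest combined log (same caveat as above about the crawlStatus line format):
LOG="$(ls -t "${JOBDIR}"/archive_*.combined.log | head -n 1)"
grep -a crawlStatus "${LOG}" | tail -n 3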
Notes:
- archive_tool has built-in monitoring/adaptation; most stalls should self-heal, but this recovery is the "break glass" operator workflow.
- Prefer one controlled restart after a deployed fix over repeated blind retries against the same live config.
- Optional: you can enable the healtharchive-crawl-auto-recover.timer watchdog (sentinel: /etc/healtharchive/crawl-auto-recover-enabled) once you're confident in the thresholds/caps; see the sketch after this list.
- To periodically validate the watchdog logic safely on production, run the drills in crawl-auto-recover-drills.md.
- The watchdog is designed to avoid interrupting a healthy crawl; when another job is actively making progress, it may "soft recover" zombie status=running jobs by marking them retryable without restarting the worker.
- If enabled via systemd, the watchdog can also auto-start underfilled annual jobs (--ensure-min-running-jobs) to maintain concurrency.
- See docs/operations/thresholds-and-tuning.md and the "queue fill / auto-start" drills in crawl-auto-recover-drills.md.
- If the watchdog is enabled but prints "SKIP ... max recoveries reached", you can still do the manual recovery above, or (carefully) run the watchdog script once with a higher cap.
- If stalls repeat for the same URL(s), consider narrowing scope rules or adjusting crawler timeouts in the source's job configuration.
- For recurring source-specific failures, treat job_registry.py and annual reconciliation as the canonical fix path, not one-off VPS-only tweaks.
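A sketch of enabling the watchdog, assuming the timer unit is already installed on the host (the sentinel path and unit name are the ones referenced above):
# Create the sentinel file the watchdog checks for, then start its timer.
sudo touch /etc/healtharchive/crawl-auto-recover-enabled
sudo systemctl enable --now healtharchive-crawl-auto-recover.timer
# To disable later, remove the sentinel and run:
# sudo systemctl disable --now healtharchive-crawl-auto-recover.timer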