Runbook: Crawl Temp Dirs High
Alert Name: HealthArchiveCrawlTempDirsHigh
Severity: Warning
Trigger
This alert fires when a running crawl has more than 100 tracked .tmp* directories for at least 1 hour and that count has grown by at least 5 over the last 12 hours:
healtharchive_crawl_running_job_temp_dirs_count > 100
max_over_time(temp_dirs_count[12h]) - min_over_time(temp_dirs_count[12h]) >= 5
The metric comes from temp_dirs_host_paths in the job's .archive_state.json, so this is real crawl-stage churn, not just a loose filesystem glob. The growth guard is there to suppress stale warnings when a long-lived crawl has a historically high temp-dir count that is no longer climbing.
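To sanity-check the gauge against its source of truth, you can count the tracked paths in the state file directly. A minimal read-only sketch, assuming .archive_state.json sits at the top of the job directory and jq is installed:
JOBDIR="/srv/healtharchive/jobs/<source>/<JOB_DIR>"
# Count the tracked crawl temp dirs the exporter reads (temp_dirs_host_paths).
jq '.temp_dirs_host_paths | length' "${JOBDIR}/.archive_state.json"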
Meaning
The job has created an unusually large number of crawl-phase temp directories. Common causes are:
- repeated adaptive container restarts
- repeated resume or new-crawl phases after failures
- storage or permission instability on the output path
- a long-running crawl that is still making progress but is thrashing on a subset of URLs
Do not treat this alert as an instruction to delete .tmp* during the crawl. Those directories may still contain WARCs and resume state needed by the active job.
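If you want evidence that the temp dirs are still load-bearing before deciding anything, a read-only peek is safe. A sketch, using the JOBDIR convention from the diagnosis section below:
# Read-only: list WARC data still sitting inside crawl temp dirs.
find "${JOBDIR}"/.tmp* -name '*.warc*' 2>/dev/null | head -n 20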
Quick Diagnosis
Start with a read-only snapshot on the VPS:
cd /opt/healtharchive
./scripts/vps-crawl-status.sh --year 2026 --job-id <JOB_ID> --recent-lines 20000
./scripts/vps-crawl-content-report.py --job-id <JOB_ID>
curl -s http://127.0.0.1:9100/metrics | rg 'healtharchive_crawl_running_job_(temp_dirs_count|container_restarts_done|last_progress_age_seconds|stalled|crawl_rate_ppm|output_dir_ok|output_dir_errno|log_probe_ok|log_probe_errno|state_file_ok|state_parse_ok|new_crawl_phase_count|resume_crawl_count)\{job_id="<JOB_ID>"'
set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive show-job --id <JOB_ID>
sudo journalctl -u healtharchive-worker.service -n 400 --no-pager
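A single snapshot cannot tell you whether the count is still climbing. One option is to sample the gauge for a few minutes; a sketch (the five-sample, one-minute cadence is arbitrary):
# Sample the temp-dir gauge once a minute; a flat value suggests a stale
# high-water mark, a rising one is live churn.
for i in $(seq 1 5); do
  date '+%H:%M:%S'
  curl -s http://127.0.0.1:9100/metrics \
    | rg 'healtharchive_crawl_running_job_temp_dirs_count\{job_id="<JOB_ID>"'
  sleep 60
done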
If the content report is too expensive to run unbounded on a large live job, keep it bounded:
timeout 120 ./scripts/vps-crawl-content-report.py \
--job-id <JOB_ID> \
--max-log-files 1 \
--max-log-bytes 262144 \
--max-warc-files 3
Then inspect the job directory directly:
JOBDIR="/srv/healtharchive/jobs/<source>/<JOB_DIR>"
find "${JOBDIR}" -maxdepth 1 -type d -name '.tmp*' | wc -l
LOG="$(ls -t "${JOBDIR}"/archive_*.combined.log | head -n 1)"
rg -n '"context":"crawlStatus"' "${LOG}" | tail -n 10
rg -n 'Attempting adaptive container restart|Resume Crawl|New Crawl Phase|Navigation timeout|Page load timed out|Transport endpoint is not connected|Permission denied|No space left on device' "${LOG}" | tail -n 200
find "${JOBDIR}" -name '*.warc.gz' -printf '%TY-%Tm-%Td %TT %p\n' 2>/dev/null | sort | tail -n 10
Classify the incident:
- Still progressing: crawlStatus.crawled is moving and recent WARCs are still appearing. The alert is an early warning; do not clean up yet.
- Storage / permission instability: output_dir_errno=107, unreadable logs/state, Permission denied, or stale-mount symptoms.
- Restart / timeout thrash: restarts, resume phases, or repeated timeout errors dominate the recent log window.
- State drift: the DB still says running, but there is no active crawl container and no recent progress (a container check is sketched below).
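For the state-drift case, checking for a live crawl container directly is more reliable than the DB status. A sketch, assuming the crawler runs under Docker and its container name embeds the job id (both assumptions; adjust to your container runtime and naming):
# Hypothetical: look for a running crawl container that mentions the job id.
docker ps --format '{{.Names}} {{.Status}}' | rg '<JOB_ID>' || echo 'no active crawl container'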
Safety Checks
Before any recovery command:
- If the proposed remediation depends on a repo change, commit it, push it, deploy it on the VPS, and verify the live checkout contains that change before recovering the job (a verification sketch follows this list).
- Do not run healtharchive cleanup-job against a running job.
- For terminal jobs, prefer cleanup-job --mode temp-nonwarc. Use legacy --mode temp only when you intentionally want to discard WARCs/replay data.
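For the deploy check in the first bullet, comparing commits is usually sufficient. A sketch, assuming the live checkout at /opt/healtharchive is a git clone and the deploy branch is origin/main (both assumptions):
cd /opt/healtharchive
git fetch origin
git rev-parse HEAD          # commit actually deployed on the VPS
git rev-parse origin/main   # commit you expect; the two must match before recovering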
Recovery
Running job branch
If the job is still running, do not delete .tmp*.
- If the evidence shows Errno 107 or stale mount symptoms, follow: docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
- If the job is no longer making progress, follow: docs/operations/playbooks/crawl/crawl-stalls.md
- If the restart budget is nearly exhausted, also review: docs/operations/runbooks/crawl-restart-budget-low.md
This alert is a classifier. The action is usually recovery of the underlying storage or crawl-thrash issue, not temp-dir deletion.
Terminal job branch
If the job is indexed or index_failed, reclaim space safely with temp-nonwarc:
set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive cleanup-job --id <JOB_ID> --mode temp-nonwarc --dry-run
/opt/healtharchive/.venv/bin/healtharchive cleanup-job --id <JOB_ID> --mode temp-nonwarc
If you intentionally do not need replay retention and want destructive cleanup, use --mode temp only after confirming that is acceptable for the job.
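To quantify what cleanup reclaimed, take a cheap before/after measurement around the real run. A sketch reusing the JOBDIR variable from the diagnosis section:
# Run once before cleanup-job and once after; the delta is the reclaimed space.
du -sh "${JOBDIR}"
find "${JOBDIR}" -maxdepth 1 -type d -name '.tmp*' | wc -l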
Verification
For a recovered running job:
- the worker is active and recent crawlStatus lines show forward progress
- the per-job metrics no longer show flat progress or storage errors
- temp-dir count stops climbing abnormally fast
For a cleaned terminal job:
- healtharchive show-job --id <JOB_ID> shows cleanupStatus updated
- the job directory no longer contains the old .tmp* directories
- stable WARCs remain under <output_dir>/warcs/ when temp-nonwarc was used
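The terminal-job checks can be run mechanically against the job directory. A sketch, reusing JOBDIR and assuming it is the job's output_dir (adjust if they differ):
find "${JOBDIR}" -maxdepth 1 -type d -name '.tmp*' | wc -l   # expect 0 after cleanup
ls "${JOBDIR}/warcs" | head -n 5                             # WARCs must survive a temp-nonwarc cleanup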
References
- docs/operations/playbooks/crawl/crawl-stalls.md
- docs/operations/runbooks/crawl-restart-budget-low.md
- docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
- docs/operations/playbooks/crawl/cleanup-automation.md