Monitoring & Alerting Strategy - Annual Crawl Campaign
Last Updated: 2026-03-23
Overview
This document defines the monitoring strategy for the HealthArchive annual crawl campaign. We use a combination of systemd timers, Python scripts, and Prometheus node_exporter textfile collectors to expose custom metrics about crawl health, restart stability, and progress.
Metric Sources
Custom metrics are written to the node_exporter textfile collector directory:
/var/lib/node_exporter/textfile_collector/
Primary files (single-VPS annual campaign):
- `healtharchive_crawl.prom`: written by `scripts/vps-crawl-metrics-textfile.py`, triggered every 1 minute by `healtharchive-crawl-metrics.timer`.
- `healtharchive_storage_hotpath_auto_recover.prom`: written by `scripts/vps-storage-hotpath-auto-recover.py`, triggered every 1 minute by `healtharchive-storage-hotpath-auto-recover.timer` (sentinel-gated).
- `healtharchive_worker_auto_start.prom`: written by `scripts/vps-worker-auto-start.py`, triggered every 2 minutes by `healtharchive-worker-auto-start.timer` (sentinel-gated).
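Each file is plain Prometheus exposition format that node_exporter merges into its scrape output. A minimal illustrative snippet (metric names from the families documented below; values invented):

```
# HELP healtharchive_worker_active 1 = worker systemd unit is active.
# TYPE healtharchive_worker_active gauge
healtharchive_worker_active 1
# HELP healtharchive_crawl_metrics_timestamp_seconds Unix timestamp when metrics were last written.
# TYPE healtharchive_crawl_metrics_timestamp_seconds gauge
healtharchive_crawl_metrics_timestamp_seconds 1774200000
```

The usual textfile-collector caveat applies: scripts should write to a temporary file and rename it into place so node_exporter never reads a half-written file.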
Key Metric Families
| Metric Name | Type | Description |
|---|---|---|
| `healtharchive_crawl_running_jobs` | Gauge | Count of currently active jobs in the DB. |
| `healtharchive_worker_active` | Gauge | 1 = worker systemd unit is active. |
| `healtharchive_jobs_pending_crawl` | Gauge | Count of jobs with status in (queued, retryable). |
| `healtharchive_worker_should_be_running` | Gauge | 1 = pending crawl jobs exist and the Storage Box mount is readable. |
| `healtharchive_worker_auto_start_last_run_timestamp_seconds` | Gauge | Last worker auto-start watchdog run time (freshness signal when enabled). |
| `healtharchive_worker_auto_start_last_result{result,reason}` | Gauge | Last worker auto-start watchdog outcome (one-hot by labels). |
| `healtharchive_worker_auto_start_start_attempts_total` | Counter | Total worker auto-start attempts (with success/fail companion counters). |
| `healtharchive_worker_auto_start_reconciled_running_jobs` | Gauge | Stale status=running rows reconciled to retryable in the latest watchdog run. |
| `healtharchive_crawl_auto_recover_last_run_timestamp_seconds` | Gauge | Last crawl auto-recover watchdog run time (freshness signal when enabled). |
| `healtharchive_crawl_auto_recover_scope_drift_jobs` | Gauge | Number of running jobs where scope filter drift was detected in the latest watchdog run. |
| `healtharchive_crawl_auto_recover_scope_rewrites_total` | Counter | Total scope filter rewrites applied by crawl auto-recover. |
| `healtharchive_crawl_auto_recover_degraded_jobs` | Gauge | Number of running jobs currently classified as degraded (slow but progressing). |
| `healtharchive_crawl_auto_recover_degraded_streak{job_id,source}` | Gauge | Consecutive watchdog runs in which a job has remained degraded. |
| `healtharchive_crawl_running_job_state_file_ok` | Gauge | 1 = `.archive_state.json` is readable and valid; 0 = probe failed (SSHFS/permissions issue). |
| `healtharchive_crawl_running_job_container_restarts_done` | Gauge | Cumulative count of Zimit container restarts for the current job. |
| `healtharchive_crawl_running_job_last_progress_age_seconds` | Gauge | Time since the last "pages crawled" increment in the logs (or, when no increment appears in the inspected log window, a lower bound based on the oldest visible crawlStatus event). |
| `healtharchive_crawl_running_job_stalled` | Gauge | 1 = progress stalled for more than 1 hour. |
| `healtharchive_crawl_running_job_output_dir_ok` | Gauge | 1 = output directory is accessible. |
| `healtharchive_crawl_annual_pending_job_output_dir_writable{source,job_id,status,year}` | Gauge | 1 = a queued/retryable annual job's output dir would be writable by the worker user (permission drift detection). |
| `healtharchive_crawl_running_job_log_probe_ok` | Gauge | 1 = combined log file is readable. |
| `healtharchive_crawl_running_job_crawl_rate_ppm` | Gauge | Pages-per-minute crawl rate (from the crawlStatus log window). |
| `healtharchive_crawl_running_job_new_crawl_phase_count` | Gauge | Count of New Crawl Phase stage starts seen in the current combined-log tail window. |
| `healtharchive_crawl_running_job_resume_crawl_count` | Gauge | Count of Resume Crawl stage starts seen in the current combined-log tail window. |
| `healtharchive_crawl_running_job_progress_known` | Gauge | 1 = progress metrics were parsed from crawlStatus logs. |
| `healtharchive_crawl_metrics_timestamp_seconds` | Gauge | Unix timestamp when metrics were last written. |
| `healtharchive_jobs_infra_error_recent_total{window="10m"}` | Gauge | Count of jobs failing with infrastructure errors within the rolling 10-minute window. |
Alerting Thresholds
Alerts are defined in:
ops/observability/alerting/healtharchive-alerts.yml
Alerting Policy (automation-first)
The annual crawl alerting policy is now automation-first:
- Built-in watchdogs (worker auto-start, crawl auto-recover, storage hot-path auto-recover) should get a chance to self-heal first.
- Alerts should page/notify primarily when:
- automation is disabled or unavailable,
- automation telemetry is stale (you can no longer trust suppression), or
- automation failed / the condition persisted after the watchdog had multiple runs.
- Crawl throughput and crawl-phase churn are treated as dashboard signals (trend analysis), not direct notification signals.
In practice this means:
- `Errno 107` stale-mount symptoms are escalated via storage watchdog alerts, not duplicated per-job output-dir alerts.
- `HealthArchiveWorkerDownWhileJobsPending` waits longer and suppresses while the deploy lock is active when worker auto-start automation is enabled and healthy.
- Separate watchdog-freshness alerts protect against "silent" suppression caused by stopped timers/scripts.
1) Worker availability (high-signal, post-auto-start)
Alert: HealthArchiveWorkerDownWhileJobsPending
- Threshold (effective):
  - Base condition: `healtharchive_worker_should_be_running == 1 and healtharchive_worker_active == 0`.
  - With worker auto-start enabled and fresh metrics: alerts only after the condition persists for 20m and the deploy lock is not active.
  - Fallback: if worker auto-start automation is disabled/absent, the same rule still alerts on the base symptom.
- Meaning: There is pending crawl work and storage appears usable, but the worker service remains down after the automation-first window (or automation is unavailable).
- Action: Check `healtharchive-worker.service` logs, recent deploy activity, and the worker auto-start watchdog state/metrics. A minimal base-rule sketch follows this list.
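For orientation, a minimal sketch of the base rule in standard Prometheus rule-file syntax. The production rule in `ops/observability/alerting/healtharchive-alerts.yml` additionally layers deploy-lock and automation-freshness suppression; the group name, labels, and annotations here are illustrative:

```yaml
groups:
  - name: healtharchive-annual-crawl
    rules:
      - alert: HealthArchiveWorkerDownWhileJobsPending
        # Base symptom only: pending work and readable storage, but the
        # worker unit is down. The 20m hold gives auto-start several runs.
        expr: healtharchive_worker_should_be_running == 1 and healtharchive_worker_active == 0
        for: 20m
        labels:
          severity: critical
        annotations:
          summary: "Worker down while crawl jobs are pending"
```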
2) SSHFS/Mount Stability
Alert: HealthArchiveCrawlOutputDirUnreadable (and related probe alerts)
- Threshold: `healtharchive_crawl_running_job_output_dir_ok == 0` and `output_dir_errno != 107` for 2m (sketch below).
- Meaning: A running crawl job cannot access its output directory for a non-stale-mount reason (permissions/path/etc.).
- Action: Investigate the specific non-107 error. `Errno 107` stale-mount cases should escalate via the storage hot-path watchdog alerts below.
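A sketch of the errno-gated shape of this rule, assuming the probe errno is exported as a companion gauge (the `healtharchive_crawl_running_job_output_dir_errno` name is illustrative, not confirmed by the metric table above):

```yaml
- alert: HealthArchiveCrawlOutputDirUnreadable
  # Only alert on non-stale-mount failures; errno 107 cases route to the
  # storage hot-path watchdog alerts instead of paging here.
  expr: |
    healtharchive_crawl_running_job_output_dir_ok == 0
    and healtharchive_crawl_running_job_output_dir_errno != 107
  for: 2m
```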
Alert: HealthArchiveAnnualOutputDirNotWritable
- Threshold: probe-user OK, writable probe = 0, and `writable_errno != 107` for 10m.
- Meaning: A queued/retryable annual job output dir is not writable for a non-stale-mount reason (commonly permission drift / `Errno 13`).
- Action: Run crawl preflight checks for the specific job's output dir mount and writability; re-apply annual output tiering if needed.
Alert: HealthArchiveStorageHotpathStaleUnrecovered
- Threshold: `healtharchive_storage_hotpath_auto_recover_detected_targets > 0` for 10m (when the automation is enabled).
- Meaning: Hot-path auto-recover still sees stale/unreadable paths after 10 minutes.
- Action: Inspect `/srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json` and consider a manual unmount plus tiering re-apply.
Alert: HealthArchiveStorageHotpathApplyFailedPersistent
- Threshold: watchdog enabled, at least one apply attempt, `last_apply_ok == 0`, and last apply timestamp older than 24h (for 30m).
- Meaning: Hot-path auto-recover apply mode has remained in a failed terminal state for over a day.
- Action: Inspect `/srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json` (`last_apply_errors`, `last_apply_warnings`), then follow the stale mount recovery playbook and re-run a controlled dry-run/apply verification.
3) Restart stability
Alert: HealthArchiveCrawlContainerRestartsHigh
- Threshold: restart budget near exhaustion and still increasing within the last 12 hours (for 30m; sketch below):
  - HC: `healtharchive_crawl_running_job_container_restarts_done{source="hc"} >= 19` (budget 24)
  - PHAC: `healtharchive_crawl_running_job_container_restarts_done{source="phac"} >= 24` (budget 30)
  - CIHR: `healtharchive_crawl_running_job_container_restarts_done{source="cihr"} >= 16` (budget 20)
- Meaning: The crawler has consumed most of its adaptive restart budget and recent telemetry still shows new restart churn. The 12-hour recency requirement suppresses frozen-history warnings after a crawl stabilizes.
- Action: Review worker logs and combined logs around restarts; check for repeated timeouts on the same URL or storage errors before the job exhausts its restart budget.
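One way to express "near budget and still increasing" in PromQL is to pair the absolute threshold with a 12h `delta()` over the gauge. A sketch for the HC series (PHAC/CIHR would differ only in label matcher and threshold; the production rule may encode recency differently):

```yaml
- alert: HealthArchiveCrawlContainerRestartsHigh
  # Absolute budget check plus recency: delta() > 0 means the restart
  # counter moved within the last 12h, suppressing alerts on frozen
  # history after the crawl stabilizes.
  expr: |
    healtharchive_crawl_running_job_container_restarts_done{source="hc"} >= 19
    and delta(healtharchive_crawl_running_job_container_restarts_done{source="hc"}[12h]) > 0
  for: 30m
```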
4) Progress Stalls
Alert: HealthArchiveCrawlStalled
- Threshold: `healtharchive_crawl_running_job_stalled == 1` (for 30m).
- Meaning: The crawler is running but hasn't archived a new page in over an hour, and the stall persisted long enough to warrant manual review even if crawl auto-recover is enabled.
- Action: Check whether the crawler is stuck on a massive PDF or a looped trap. If crawl auto-recover is enabled, also inspect its watchdog state/metrics to confirm whether automation attempted recovery.
Note on state file mtime: The `.archive_state.json` mtime may appear stale even during healthy crawls, because the state file is only written on certain lifecycle events (container restarts, phase changes), not on every progress update. Use `last_progress_age_seconds` (derived from crawlStatus log entries) for stall detection, not `state_mtime_age_seconds`. When the inspected log window shows repeated crawlStatus events with a flat crawled count, `last_progress_age_seconds` ages from the oldest visible crawlStatus event instead of resetting on every failed-page churn event.
4.1) Degraded Throughput (slow but progressing)
Alert: HealthArchiveCrawlRateDegraded
- Threshold: For HC/PHAC, `crawl_rate_ppm < 2` while `last_progress_age_seconds <= 300` and `stalled == 0` (for 45m).
- Meaning: The crawl is alive but underperforming for an extended period, usually due to queue composition (binary-heavy links), repeated retries, or long-tail page slowness.
- Action: Check combined logs for timeout/binary pressure and inspect the auto-recover scope/degraded metrics (a rule sketch follows this list):
  - `healtharchive_crawl_auto_recover_scope_drift_jobs`
  - `healtharchive_crawl_auto_recover_scope_rewrites_total`
  - `healtharchive_crawl_auto_recover_degraded_streak`
- For HC/PHAC, the watchdog can now run bounded degraded recoveries when `degraded-action=recover` is enabled:
  - it only acts after the degraded streak threshold is met
  - it preserves the canonical annual execution policy before recovery
  - it skips worker-level recovery when another healthy crawl would be interrupted by the guard window
  - it marks the degraded job `retryable` after stopping the safe runner target, instead of allowing open-ended low-rate churn
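A sketch of how the three threshold conditions combine, assuming the per-job gauges carry matching label sets including `source` (as the restart-budget metrics do):

```yaml
- alert: HealthArchiveCrawlRateDegraded
  # Alive (recent progress, not stalled) but persistently slow.
  expr: |
    healtharchive_crawl_running_job_crawl_rate_ppm{source=~"hc|phac"} < 2
    and healtharchive_crawl_running_job_last_progress_age_seconds <= 300
    and healtharchive_crawl_running_job_stalled == 0
  for: 45m
```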
4.2) Stale Running Rows
- The worker now performs a preflight stale-running reconciliation before selecting new work.
- If a job is still marked `running` in the DB but has no held job lock and no recent crawl progress, it is demoted back to `retryable` automatically.
- This is meant to prevent dead PHAC/HC attempts from blocking fresh retries behind a zombie DB row.
5) Infrastructure Errors
Alert: HealthArchiveInfraErrorsHigh
- Threshold: `healtharchive_jobs_infra_error_recent_total{window="10m"} >= 3` (for 5m; sketch below).
- Meaning: Multiple jobs are failing due to infrastructure errors (errno 107 stale mount, permission denied, etc.) in a short window.
- Action: Check Storage Box mount health, run hot-path recovery, and verify output directory permissions.
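The corresponding rule is a direct threshold; a minimal sketch:

```yaml
- alert: HealthArchiveInfraErrorsHigh
  # Several jobs hitting infra errors in one 10m window points at shared
  # infrastructure (mount, permissions) rather than per-site problems.
  expr: healtharchive_jobs_infra_error_recent_total{window="10m"} >= 3
  for: 5m
```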
6) Metrics Freshness
Alert: HealthArchiveCrawlMetricsStale
- Threshold: `(time() - healtharchive_crawl_metrics_timestamp_seconds) > 600` (for 5m).
- Meaning: The crawl metrics textfile hasn't been updated in over 10 minutes.
- Action: Check whether `healtharchive-crawl-metrics.timer` is running and `vps-crawl-metrics-textfile.py` is succeeding. A sketch of this freshness pattern follows.
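The freshness alerts in this section all share the same `time() - <timestamp gauge>` shape; a sketch for the crawl-metrics variant (the watchdog variants below swap in their last-run timestamp gauges plus enabled-gating):

```yaml
- alert: HealthArchiveCrawlMetricsStale
  # The textfile refreshes every minute, so 600s of silence means the
  # timer or script has stopped producing metrics.
  expr: (time() - healtharchive_crawl_metrics_timestamp_seconds) > 600
  for: 5m
```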
Alert: HealthArchiveWorkerAutoStartMetricsStale
- Threshold: worker auto-start enabled and `(time() - last_run_timestamp) > 600` (for 5m).
- Meaning: Worker auto-start automation is enabled, but its metrics are stale. You should not assume worker-down alerts are still automation-aware until this is fixed.
- Action: Check `healtharchive-worker-auto-start.timer` and `healtharchive-worker-auto-start.service` logs/state.
Alert: HealthArchiveCrawlAutoRecoverMetricsStale
- Threshold: crawl auto-recover enabled and `(time() - last_run_timestamp) > 900` (for 10m).
- Meaning: Crawl auto-recover automation is enabled, but its metrics are stale. Automation-first stall recovery may not be running.
- Action: Check `healtharchive-crawl-auto-recover.timer` and `healtharchive-crawl-auto-recover.service` logs/state.
7) Deploy Lock Persistence
Alert: HealthArchiveDeployLockPersistent
- Threshold: `healtharchive_crawl_auto_recover_deploy_lock_present == 1` (for 4h).
- Meaning: The deploy lock file has been held for over 4 hours, preventing crawl auto-recover from taking any recovery actions.
- Action: Check whether a deploy is genuinely in progress. If the lock is stale (leftover from a failed deploy), remove it manually: `rm /tmp/healtharchive-deploy.lock`.
8) Temp Directory Accumulation
Alert: HealthArchiveCrawlTempDirsHigh
- Threshold: `healtharchive_crawl_running_job_temp_dirs_count > 100` and the tracked temp-dir count has grown by at least 5 over the last 12 hours (for 1h; sketch below).
- Meaning: A running crawl job has accumulated over 100 tracked `.tmp*` directories and the count is still climbing, usually from repeated resume/new-crawl phases, adaptive restarts, or storage/permission churn. This count comes from `.archive_state.json`, so it reflects real crawl-state accumulation rather than a filesystem glob.
- Action: If the job is still running, do not run `cleanup-job`; first classify the incident using `vps-crawl-status.sh`, per-job crawl metrics, and the combined log, then follow the storage/stall/restart-budget runbooks as appropriate. If the job is already `indexed` or `index_failed`, reclaim space with `healtharchive cleanup-job --id <ID> --mode temp-nonwarc` (prefer `--dry-run` first). Use the legacy `--mode temp` only when you explicitly intend to discard WARCs/replay data.
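One way to express "grown by at least 5 over the last 12 hours" is against `min_over_time`; a sketch (the growth encoding is an assumption, the production rule may differ):

```yaml
- alert: HealthArchiveCrawlTempDirsHigh
  # High absolute count AND still climbing over the trailing 12h window.
  expr: |
    healtharchive_crawl_running_job_temp_dirs_count > 100
    and (healtharchive_crawl_running_job_temp_dirs_count
          - min_over_time(healtharchive_crawl_running_job_temp_dirs_count[12h])) >= 5
  for: 1h
```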
Dashboard-heavy Crawl Performance Signals
These remain monitored via Grafana trend panels. The only direct throughput alert is sustained degraded throughput for HC/PHAC (HealthArchiveCrawlRateDegraded).
- `healtharchive_crawl_running_job_crawl_rate_ppm`
- `healtharchive_crawl_running_job_new_crawl_phase_count`
- `healtharchive_crawl_running_job_resume_crawl_count`
- `healtharchive_crawl_running_job_last_progress_age_seconds`
- `healtharchive_crawl_running_job_container_restarts_done`
If `progress_known == 0`, `crawl_rate_ppm == -1`, and `resume_crawl_count` keeps rising while the state file still updates, treat that as a likely no-progress resume loop even if the job still appears running in the DB.
At the crawler level, `stall_timeout_minutes` now also covers the case where a stage never emits any crawlStatus at all. That turns silent "running but no stats ever arrive" hangs into a monitored intervention path instead of letting them sit indefinitely.
For HC/PHAC annual jobs, the execution policy now also provides hard stop conditions outside the watchdog:
- `resume_policy=fresh_only` prevents repeat Resume Crawl loops
- poisoned resume state can be auto-reset before the next attempt
- repeated fresh Browsertrix failures are bounded and can auto-promote the job to the `http_warc` fallback backend
Use:
- Dashboard: `ops/observability/dashboards/healtharchive-pipeline-health.json`
- Grafana access quickstart (SSH port-forward preferred): `observability-and-private-stats.md`
- Full observability setup/runbook: `playbooks/observability/observability-guide.md`
The dashboard includes longitudinal crawl-rate panels (raw + 30m average) and watchdog activity/freshness panels to support investigation without alert spam.
Indexing Monitoring
Indexing runs after the crawl completes.
- Active Indexing: Check worker logs for `Indexing for job <ID> completed successfully`.
- Failure Detection: `healtharchive_job_crawl_status{status="completed"}` AND `healtharchive_job_indexed_pages == 0` for more than 1 hour indicates a broken pipeline (illustrative sketch below).
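No alert rule for this condition is named above; if one were added, a sketch under the assumption that both metrics are one-hot gauges sharing a `job_id` label (the alert name and label join are illustrative):

```yaml
- alert: HealthArchiveIndexingPipelineBroken
  # Crawl reported completed, but nothing has been indexed for over an hour.
  expr: |
    healtharchive_job_crawl_status{status="completed"} == 1
    and on(job_id) healtharchive_job_indexed_pages == 0
  for: 1h
```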