
Monitoring & Alerting Strategy - Annual Crawl Campaign

Last Updated: 2026-03-23

Overview

This document defines the monitoring strategy for the HealthArchive annual crawl campaign. We use a combination of systemd timers, Python scripts, and Prometheus node_exporter textfile collectors to expose custom metrics about crawl health, restart stability, and progress.

Metric Sources

Custom metrics are written to the node_exporter textfile collector directory:

  • /var/lib/node_exporter/textfile_collector/

Primary files (single-VPS annual campaign):

  • healtharchive_crawl.prom
      • Written by scripts/vps-crawl-metrics-textfile.py
      • Triggered every 1 minute by healtharchive-crawl-metrics.timer
  • healtharchive_storage_hotpath_auto_recover.prom
      • Written by scripts/vps-storage-hotpath-auto-recover.py
      • Triggered every 1 minute by healtharchive-storage-hotpath-auto-recover.timer (sentinel-gated)
  • healtharchive_worker_auto_start.prom
      • Written by scripts/vps-worker-auto-start.py
      • Triggered every 2 minutes by healtharchive-worker-auto-start.timer (sentinel-gated)
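
These .prom files only become queryable once node_exporter merges them into its /metrics endpoint and Prometheus scrapes that endpoint. A minimal wiring sketch, assuming node_exporter's default port and a job name chosen here purely for illustration:

    # node_exporter is assumed to be started with
    #   --collector.textfile.directory=/var/lib/node_exporter/textfile_collector/
    # so the *.prom files above are merged into its /metrics output.
    # Hypothetical Prometheus scrape config; job name and target are illustrative.
    scrape_configs:
      - job_name: healtharchive-node
        static_configs:
          - targets: ['127.0.0.1:9100']   # default node_exporter port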

Key Metric Families

  • healtharchive_crawl_running_jobs (Gauge): Count of currently active jobs in the DB.
  • healtharchive_worker_active (Gauge): 1 = worker systemd unit is active.
  • healtharchive_jobs_pending_crawl (Gauge): Count of jobs with status in (queued, retryable).
  • healtharchive_jobs_infra_error_recent_total{minutes="10"} (Gauge): Count of jobs recently failing due to infra errors (windowed).
  • healtharchive_worker_should_be_running (Gauge): 1 = pending crawl jobs exist and Storage Box mount is readable.
  • healtharchive_worker_auto_start_last_run_timestamp_seconds (Gauge): Last worker auto-start watchdog run time (freshness signal when enabled).
  • healtharchive_worker_auto_start_last_result{result,reason} (Gauge): Last worker auto-start watchdog outcome (one-hot by labels).
  • healtharchive_worker_auto_start_start_attempts_total (Counter): Total worker auto-start attempts (with success/fail companion counters).
  • healtharchive_worker_auto_start_reconciled_running_jobs (Gauge): Stale status=running rows reconciled to retryable in the latest watchdog run.
  • healtharchive_crawl_auto_recover_last_run_timestamp_seconds (Gauge): Last crawl auto-recover watchdog run time (freshness signal when enabled).
  • healtharchive_crawl_auto_recover_scope_drift_jobs (Gauge): Number of running jobs where scope filter drift was detected in the latest watchdog run.
  • healtharchive_crawl_auto_recover_scope_rewrites_total (Counter): Total scope filter rewrites applied by crawl auto-recover.
  • healtharchive_crawl_auto_recover_degraded_jobs (Gauge): Number of running jobs currently classified as degraded (slow but progressing).
  • healtharchive_crawl_auto_recover_degraded_streak{job_id,source} (Gauge): Consecutive watchdog runs where a job has remained degraded.
  • healtharchive_crawl_running_job_state_file_ok (Gauge): 1 = .archive_state.json is readable and valid; 0 = probe failed (SSHFS/permissions issue).
  • healtharchive_crawl_running_job_container_restarts_done (Gauge): Cumulative count of Zimit container restarts for the current job.
  • healtharchive_crawl_running_job_last_progress_age_seconds (Gauge): Time since the last "pages crawled" increment in the logs (or, when no increment appears in the inspected log window, a lower bound based on the oldest visible crawlStatus event).
  • healtharchive_crawl_running_job_stalled (Gauge): 1 = progress stalled for more than 1 hour.
  • healtharchive_crawl_running_job_output_dir_ok (Gauge): 1 = output directory is accessible.
  • healtharchive_crawl_annual_pending_job_output_dir_writable{source,job_id,status,year} (Gauge): 1 = queued/retryable annual job output dir would be writable by the worker user (permission drift detection).
  • healtharchive_crawl_running_job_log_probe_ok (Gauge): 1 = combined log file is readable.
  • healtharchive_crawl_running_job_crawl_rate_ppm (Gauge): Pages per minute crawl rate (from crawlStatus log window).
  • healtharchive_crawl_running_job_new_crawl_phase_count (Gauge): Count of New Crawl Phase stage starts seen in the current combined-log tail window.
  • healtharchive_crawl_running_job_resume_crawl_count (Gauge): Count of Resume Crawl stage starts seen in the current combined-log tail window.
  • healtharchive_crawl_running_job_progress_known (Gauge): 1 = progress metrics parsed from crawlStatus logs.
  • healtharchive_crawl_metrics_timestamp_seconds (Gauge): Unix timestamp when metrics were last written.
  • healtharchive_jobs_infra_error_recent_total{window="10m"} (Gauge): Count of jobs with infra errors in rolling window.

Alerting Thresholds

Alerts are defined in:

  • ops/observability/alerting/healtharchive-alerts.yml

Alerting Policy (automation-first)

The annual crawl alerting policy is now automation-first:

  • Built-in watchdogs (worker auto-start, crawl auto-recover, storage hot-path auto-recover) should get a chance to self-heal first.
  • Alerts should page/notify primarily when:
      • automation is disabled or unavailable,
      • automation telemetry is stale (you can no longer trust suppression), or
      • automation failed / the condition persisted after the watchdog had multiple runs.
  • Crawl throughput and crawl-phase churn are treated as dashboard signals (trend analysis), not direct notification signals.

In practice this means:

  • Errno 107 stale-mount symptoms are escalated via storage watchdog alerts, not duplicate per-job output-dir alerts.
  • HealthArchiveWorkerDownWhileJobsPending waits longer and suppresses while the deploy lock is active when worker auto-start automation is enabled and healthy.
  • Separate watchdog-freshness alerts protect against “silent” suppression caused by stopped timers/scripts.

1) Worker availability (high-signal, post-auto-start)

Alert: HealthArchiveWorkerDownWhileJobsPending

  • Threshold (effective):
      • Base condition: healtharchive_worker_should_be_running == 1 and healtharchive_worker_active == 0
      • With worker auto-start enabled and fresh metrics: only alerts after the condition persists for 20m and the deploy lock is not active.
      • Fallback: if worker auto-start automation is disabled/absent, still alerts on the same base symptom (with the same alert rule).
  • Meaning: There is pending crawl work and storage appears usable, but the worker service remains down after the automation-first window (or automation is unavailable).
  • Action: Check healtharchive-worker.service logs, recent deploy activity, and worker auto-start watchdog state/metrics.
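
A minimal sketch of how this rule might be expressed in healtharchive-alerts.yml. The unless-based gating, the 600s freshness cutoff, and the severity/annotation values are assumptions for illustration; the actual rule may combine the deploy-lock and automation-freshness conditions differently:

    groups:
      - name: healtharchive-worker
        rules:
          - alert: HealthArchiveWorkerDownWhileJobsPending
            # Base symptom: pending crawl work and usable storage, but the worker
            # unit is down. The unless clause suppresses the alert while the deploy
            # lock is held, provided auto-start telemetry is fresh enough to trust.
            expr: |
              (healtharchive_worker_should_be_running == 1)
              and (healtharchive_worker_active == 0)
              unless (
                healtharchive_crawl_auto_recover_deploy_lock_present == 1
                and (time() - healtharchive_worker_auto_start_last_run_timestamp_seconds) < 600
              )
            for: 20m
            labels:
              severity: warning
            annotations:
              summary: "Worker down while crawl jobs are pending"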

2) SSHFS/Mount Stability

Alert: HealthArchiveCrawlOutputDirUnreadable (and related probe alerts)

  • Threshold: healtharchive_crawl_running_job_output_dir_ok == 0 and output_dir_errno != 107 for 2m.
  • Meaning: A running crawl job cannot access its output directory for a non-stale-mount reason (permissions/path/etc.).
  • Action: Investigate the specific non-107 error. Errno 107 stale mount cases should escalate via the storage hot-path watchdog alerts below.
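
A minimal rule sketch for this threshold. The errno gauge name below is an assumption (the metric families list above does not include it; the threshold refers to it as output_dir_errno):

    groups:
      - name: healtharchive-storage-probes
        rules:
          - alert: HealthArchiveCrawlOutputDirUnreadable
            # Assumed errno gauge name; only non-stale-mount failures (errno != 107) alert here.
            expr: |
              (healtharchive_crawl_running_job_output_dir_ok == 0)
              and (healtharchive_crawl_running_job_output_dir_errno != 107)
            for: 2m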

Alert: HealthArchiveAnnualOutputDirNotWritable

  • Threshold: probe-user OK, writable probe = 0, and writable_errno != 107 for 10m.
  • Meaning: A queued/retryable annual job output dir is not writable for a non-stale-mount reason (commonly permission drift / Errno 13).
  • Action: Run crawl preflight checks for the specific job output dir mount and writability; re-apply annual output tiering if needed.

Alert: HealthArchiveStorageHotpathStaleUnrecovered

  • Threshold: healtharchive_storage_hotpath_auto_recover_detected_targets > 0 for 10m (when the automation is enabled).
  • Meaning: Hot-path auto-recover still sees stale/unreadable paths after 10 minutes.
  • Action: Inspect /srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json and consider manual unmount + tiering re-apply.
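
A minimal rule sketch for this threshold. The "automation enabled" gate is omitted because its metric name is not listed above; the real rule presumably adds it:

    groups:
      - name: healtharchive-storage-hotpath
        rules:
          - alert: HealthArchiveStorageHotpathStaleUnrecovered
            # Fires only while the watchdog still reports unresolved stale/unreadable targets.
            expr: healtharchive_storage_hotpath_auto_recover_detected_targets > 0
            for: 10m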

Alert: HealthArchiveStorageHotpathApplyFailedPersistent

  • Threshold: watchdog enabled, at least one apply attempt, last_apply_ok == 0, and last apply timestamp older than 24h (for 30m).
  • Meaning: Hot-path auto-recover apply mode has remained in a failed terminal state for over a day.
  • Action: Inspect /srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json (last_apply_errors, last_apply_warnings), then follow stale mount recovery playbook steps and re-run a controlled dry-run/apply verification.

3) Restart stability

Alert: HealthArchiveCrawlContainerRestartsHigh

  • Threshold: restart budget near exhaustion and still increasing within the last 12 hours (for 30m):
      • HC: healtharchive_crawl_running_job_container_restarts_done{source="hc"} >= 19 (budget 24)
      • PHAC: healtharchive_crawl_running_job_container_restarts_done{source="phac"} >= 24 (budget 30)
      • CIHR: healtharchive_crawl_running_job_container_restarts_done{source="cihr"} >= 16 (budget 20)
  • Meaning: The crawler has consumed most of its adaptive restart budget and recent telemetry still shows new restart churn. The recent-growth requirement is there to suppress frozen-history warnings once a crawl stabilizes.
  • Action: Review worker logs and combined logs around restarts; check for repeated timeouts on the same URL or storage errors before the job exhausts its restart budget.
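
A minimal sketch combining the per-source budget thresholds with the "still increasing within the last 12 hours" condition. The offset-based growth check is one way to express that condition; the repo rule may use a different formulation:

    groups:
      - name: healtharchive-restarts
        rules:
          - alert: HealthArchiveCrawlContainerRestartsHigh
            # Per-source budget thresholds, gated on the restart count still rising
            # over the last 12h so a stabilized crawl stops re-firing on old history.
            expr: |
              (
                (healtharchive_crawl_running_job_container_restarts_done{source="hc"} >= 19)
                or (healtharchive_crawl_running_job_container_restarts_done{source="phac"} >= 24)
                or (healtharchive_crawl_running_job_container_restarts_done{source="cihr"} >= 16)
              )
              and (
                healtharchive_crawl_running_job_container_restarts_done
                - healtharchive_crawl_running_job_container_restarts_done offset 12h
                > 0
              )
            for: 30m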

4) Progress Stalls

Alert: HealthArchiveCrawlStalled

  • Threshold: healtharchive_crawl_running_job_stalled == 1 (for 30m).
  • Meaning: The crawler is running but hasn't archived a new page in over an hour, and the stall persisted long enough to warrant manual review even if crawl auto-recover is enabled.
  • Action: Check if the crawler is stuck on a massive PDF or looped trap. If crawl auto-recover is enabled, also inspect its watchdog state/metrics to confirm whether automation attempted recovery.

Note on state file mtime: The .archive_state.json mtime may appear stale even during healthy crawls, because the state file is only written on certain lifecycle events (container restarts, phase changes), not on every progress update. Use last_progress_age_seconds (derived from crawlStatus log entries) for stall detection, not state_mtime_age_seconds. When the inspected log window shows repeated crawlStatus events with a flat crawled count, last_progress_age_seconds ages from the oldest visible crawlStatus event instead of resetting on every failed-page churn event.
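
A minimal rule sketch for the stall alert. As the note above explains, the stall decision comes from progress-age telemetry, not the state-file mtime; a raw-age variant is shown only as a comment:

    groups:
      - name: healtharchive-stall
        rules:
          - alert: HealthArchiveCrawlStalled
            # The exporter already folds the 1-hour stall threshold into the
            # stalled gauge; the rule only adds the 30m persistence window.
            expr: healtharchive_crawl_running_job_stalled == 1
            for: 30m
            # A raw-age variant would compare
            # healtharchive_crawl_running_job_last_progress_age_seconds > 3600
            # rather than the state-file mtime.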

4.1) Degraded Throughput (slow but progressing)

Alert: HealthArchiveCrawlRateDegraded

  • Threshold: For HC/PHAC, crawl_rate_ppm < 2 while last_progress_age_seconds <= 300 and stalled == 0 (for 45m).
  • Meaning: The crawl is alive but underperforming for an extended period, usually due to queue composition (binary-heavy links), repeated retries, or long-tail page slowness.
  • Action: Check combined logs for timeout/binary pressure and inspect the auto-recover scope/degraded metrics:
      • healtharchive_crawl_auto_recover_scope_drift_jobs
      • healtharchive_crawl_auto_recover_scope_rewrites_total
      • healtharchive_crawl_auto_recover_degraded_streak

For HC/PHAC, the watchdog can now run bounded degraded recoveries when degraded-action=recover is enabled:

  • it only acts after the degraded streak threshold is met
  • it preserves the canonical annual execution policy before recovery
  • it skips worker-level recovery when another healthy crawl would be interrupted by the guard window
  • it marks the degraded job retryable after stopping the safe runner target instead of allowing open-ended low-rate churn
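
A minimal sketch of the HealthArchiveCrawlRateDegraded threshold described above, assuming the three per-job gauges share identical label sets so plain and-matching works; the real rule may also need to exclude the -1 "rate unknown" sentinel noted later in this document:

    groups:
      - name: healtharchive-throughput
        rules:
          - alert: HealthArchiveCrawlRateDegraded
            # Low rate while the crawl is demonstrably alive: recent progress and not stalled.
            expr: |
              (healtharchive_crawl_running_job_crawl_rate_ppm{source=~"hc|phac"} < 2)
              and (healtharchive_crawl_running_job_last_progress_age_seconds <= 300)
              and (healtharchive_crawl_running_job_stalled == 0)
            for: 45m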

4.2) Stale Running Rows

  • The worker now performs a preflight stale-running reconciliation before selecting new work.
  • If a job is still marked running in the DB but has no held job lock and no recent crawl progress, it is demoted back to retryable automatically.
  • This is meant to prevent dead PHAC/HC attempts from blocking fresh retries behind a zombie DB row.

5) Infrastructure Errors

Alert: HealthArchiveInfraErrorsHigh

  • Threshold: healtharchive_jobs_infra_error_recent_total{window="10m"} >= 3 (for 5m).
  • Meaning: Multiple jobs are failing due to infrastructure errors (errno 107 stale mount, permission denied, etc.) in a short window.
  • Action: Check Storage Box mount health, run hot-path recovery, verify output directory permissions.
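
A minimal rule sketch for this threshold:

    groups:
      - name: healtharchive-infra
        rules:
          - alert: HealthArchiveInfraErrorsHigh
            # The windowed infra-error count is computed by the exporter; the rule only thresholds it.
            expr: healtharchive_jobs_infra_error_recent_total{window="10m"} >= 3
            for: 5m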

6) Metrics Freshness

Alert: HealthArchiveCrawlMetricsStale

  • Threshold: (time() - healtharchive_crawl_metrics_timestamp_seconds) > 600 (for 5m).
  • Meaning: The crawl metrics textfile hasn't been updated in over 10 minutes.
  • Action: Check if healtharchive-crawl-metrics.timer is running and vps-crawl-metrics-textfile.py is succeeding.

Alert: HealthArchiveWorkerAutoStartMetricsStale

  • Threshold: worker auto-start enabled and (time() - last_run_timestamp) > 600 (for 5m).
  • Meaning: Worker auto-start automation is enabled, but its metrics are stale. You should not assume worker-down alerts are still automation-aware until this is fixed.
  • Action: Check healtharchive-worker-auto-start.timer and healtharchive-worker-auto-start.service logs/state.

Alert: HealthArchiveCrawlAutoRecoverMetricsStale

  • Threshold: crawl auto-recover enabled and (time() - last_run_timestamp) > 900 (for 10m).
  • Meaning: Crawl auto-recover automation is enabled, but its metrics are stale. Automation-first stall recovery may not be running.
  • Action: Check healtharchive-crawl-auto-recover.timer and healtharchive-crawl-auto-recover.service logs/state.
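
The three freshness alerts above share one pattern: compare a last-run/last-write timestamp against time(). A minimal sketch for the crawl-metrics variant; the auto-start and auto-recover variants differ only in the timestamp metric, the threshold, and the enablement gate (omitted here because its metric name is not listed above):

    groups:
      - name: healtharchive-freshness
        rules:
          - alert: HealthArchiveCrawlMetricsStale
            # Stale if the textfile has not been rewritten within 10 minutes.
            expr: (time() - healtharchive_crawl_metrics_timestamp_seconds) > 600
            for: 5m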

7) Deploy Lock Persistence

Alert: HealthArchiveDeployLockPersistent

  • Threshold: healtharchive_crawl_auto_recover_deploy_lock_present == 1 (for 4h).
  • Meaning: The deploy lock file has been held for over 4 hours, preventing crawl auto-recover from taking any recovery actions.
  • Action: Check whether a deploy is genuinely in progress. If the lock is stale (leftover from a failed deploy), remove it manually: rm /tmp/healtharchive-deploy.lock.

8) Temp Directory Accumulation

Alert: HealthArchiveCrawlTempDirsHigh

  • Threshold: healtharchive_crawl_running_job_temp_dirs_count > 100 and the tracked temp-dir count has grown by at least 5 over the last 12 hours (for 1h).
  • Meaning: A running crawl job has accumulated over 100 tracked .tmp* directories and the count is still climbing, usually from repeated resume/new-crawl phases, adaptive restarts, or storage/permission churn. This count comes from .archive_state.json, so it reflects real crawl-state accumulation rather than a filesystem glob.
  • Action: If the job is still running, do not run cleanup-job; first classify the incident using vps-crawl-status.sh, per-job crawl metrics, and the combined log, then follow the storage/stall/restart-budget runbooks as appropriate. If the job is already indexed or index_failed, reclaim space with healtharchive cleanup-job --id <ID> --mode temp-nonwarc (prefer --dry-run first). Use legacy --mode temp only when you explicitly intend to discard WARCs/replay data.
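
A minimal sketch of this threshold, expressing "still climbing" with an offset comparison; the gauge name is taken from the threshold above, and the real rule may compute growth differently:

    groups:
      - name: healtharchive-tempdirs
        rules:
          - alert: HealthArchiveCrawlTempDirsHigh
            # Absolute level plus a growth check over the last 12h.
            expr: |
              (healtharchive_crawl_running_job_temp_dirs_count > 100)
              and (
                healtharchive_crawl_running_job_temp_dirs_count
                - healtharchive_crawl_running_job_temp_dirs_count offset 12h
                >= 5
              )
            for: 1h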

Dashboard-heavy Crawl Performance Signals

These remain monitored via Grafana trend panels. The only direct throughput alert is sustained degraded throughput for HC/PHAC (HealthArchiveCrawlRateDegraded).

  • healtharchive_crawl_running_job_crawl_rate_ppm
  • healtharchive_crawl_running_job_new_crawl_phase_count
  • healtharchive_crawl_running_job_resume_crawl_count
  • healtharchive_crawl_running_job_last_progress_age_seconds
  • healtharchive_crawl_running_job_container_restarts_done

If progress_known == 0, crawl_rate_ppm == -1, and resume_crawl_count keeps rising while the state file still updates, treat that as a likely no-progress resume loop even if the job still appears running in the DB.

At the crawler level, stall_timeout_minutes now also covers the case where a stage never emits any crawlStatus at all. That turns silent "running but no stats ever arrive" hangs into a monitored intervention path instead of letting them sit indefinitely.

For HC/PHAC annual jobs, the execution policy now also provides hard stop conditions outside the watchdog:

  • resume_policy=fresh_only prevents repeat Resume Crawl loops
  • poisoned resume state can be auto-reset before the next attempt
  • repeated fresh Browsertrix failures are bounded and can auto-promote the job to the http_warc fallback backend

Use:

  • ops/observability/dashboards/healtharchive-pipeline-health.json
  • Grafana access quickstart (SSH port-forward preferred): observability-and-private-stats.md
  • Full observability setup/runbook: playbooks/observability/observability-guide.md

The dashboard includes longitudinal crawl-rate panels (raw + 30m average) and watchdog activity/freshness panels to support investigation without alert spam.

Indexing Monitoring

Indexing runs after the crawl completes.

  • Active Indexing: Check worker logs for Indexing for job <ID> completed successfully.
  • Failure Detection: healtharchive_job_crawl_status{status="completed"} AND healtharchive_job_indexed_pages == 0 for > 1 hour indicates a broken pipeline.
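
A hedged sketch of how this failure-detection condition could be expressed as a rule, assuming healtharchive_job_crawl_status is a one-hot gauge and that both series carry a matching job identifier label (the on(job_id) clause and the rule name are assumptions):

    groups:
      - name: healtharchive-indexing
        rules:
          - alert: HealthArchiveIndexingSilentFailure
            # Completed crawl with zero indexed pages for over an hour.
            expr: |
              (healtharchive_job_crawl_status{status="completed"} == 1)
              and on (job_id) (healtharchive_job_indexed_pages == 0)
            for: 1h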