Operational Thresholds and Tuning Guide
This document centralizes all operational thresholds used in HealthArchive automation, monitoring, and safeguards.
Last updated: 2026-02-06
Overview
HealthArchive uses conservative thresholds to prevent runaway automation and ensure system stability. All automation is opt-in via sentinel files and includes multiple safety caps.
General principles:
- Automation is safe-by-default (dry-run unless explicitly enabled)
- Rate limits prevent flapping (cooldowns + per-hour/day caps)
- Deploy locks prevent conflicts during maintenance
- All thresholds are tunable but have sensible defaults
Disk Management Thresholds
Worker Pre-Crawl Disk Check
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Threshold | 85% | src/ha_backend/worker/main.py (DISK_HEADROOM_THRESHOLD_PERCENT) | Allows ~11GB buffer for multi-GB annual crawls |
| Check frequency | Every job selection | Worker loop | Prevents mid-crawl disk-full failures |
| Action | Skip job selection | Worker logs warning | Jobs remain queued until space is freed |
Tuning guidance:
- Lower to 80% if crawls are consistently small (e.g., test jobs)
- Raise to 88% if the disk is oversized and the buffer is excessive
- Don't raise above 90%; that leaves too little margin for error
Related alerts: HealthArchiveDiskUsageHigh (warning), HealthArchiveDiskUsageCritical
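For orientation, a minimal sketch of the headroom check using shutil.disk_usage; the constant mirrors DISK_HEADROOM_THRESHOLD_PERCENT from the table, but the helper names and probed path are illustrative, not the worker's actual code:

```python
import shutil

DISK_HEADROOM_THRESHOLD_PERCENT = 85  # mirrors the worker constant above

def disk_usage_percent(path: str = "/") -> float:
    """Return used disk space for `path` as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def should_skip_job_selection(path: str = "/") -> bool:
    """Skip picking up new jobs when usage exceeds the headroom threshold.

    Jobs stay queued; the worker just logs a warning and retries later.
    """
    return disk_usage_percent(path) >= DISK_HEADROOM_THRESHOLD_PERCENT
```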
Alerting Thresholds
| Severity | Threshold | Duration | Location | Action |
|---|---|---|---|---|
| Warning | >85% | 30 minutes | ops/observability/alerting/healtharchive-alerts.yml (HealthArchiveDiskUsageHigh) | Page on-call during business hours |
| Critical | >92% | 10 minutes | ops/observability/alerting/healtharchive-alerts.yml (HealthArchiveDiskUsageCritical) | Page on-call immediately |
Tuning guidance:
- The warning duration (30 min) gives time to react without false positives
- The critical threshold (92%) leaves ~6GB for emergency response
- Don't raise critical above 95%; that risks a sudden disk-full
See: docs/operations/disk-baseline-and-cleanup.md (current baseline + cleanup posture)
Cleanup Automation
| Parameter | Value | Location | Purpose |
|---|---|---|---|
| Min age | 14 days | ops/automation/cleanup-automation.toml (min_age_days) | Avoid cleaning recent jobs |
| Keep latest per source | 2 | ops/automation/cleanup-automation.toml (keep_latest_per_source) | Preserve recent snapshots |
| Max jobs per run (weekly) | 1 | ops/automation/cleanup-automation.toml (max_jobs_per_run) | Conservative incremental cleanup |
| Threshold trigger | 80% | ops/automation/cleanup-automation.toml (threshold_trigger_percent) | Only run threshold cleanup when disk exceeds this |
| Max jobs per run (threshold) | 5 | ops/automation/cleanup-automation.toml (threshold_max_jobs_per_run) | More aggressive cleanup under disk pressure |
| Cleanup mode | temp-nonwarc | scripts/vps-cleanup-automation.py (healtharchive cleanup-job --mode temp-nonwarc) | Preserves WARCs (safe) |
Tuning guidance:
- Increase threshold_max_jobs_per_run to 7-10 only if disk pressure is chronic and the cleanup is consistently safe
- Decrease min_age_days to 7 if disk pressure is chronic
- Increase keep_latest_per_source to 3+ if operators need more history
Implementation notes:
- Weekly cleanup: docs/deployment/systemd/healtharchive-cleanup-automation.service + docs/deployment/systemd/healtharchive-cleanup-automation.timer
- Disk threshold cleanup: docs/deployment/systemd/healtharchive-disk-threshold-cleanup.service + docs/deployment/systemd/healtharchive-disk-threshold-cleanup.timer (runs every 30 min; no-op when below threshold)
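A sketch of how the cleanup parameters compose into candidate selection; the Job shape and helper name are hypothetical, only the min-age/keep-latest/max-per-run semantics come from the table above:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Job:                      # hypothetical shape, for illustration only
    id: int
    source: str
    finished_at: datetime

def select_cleanup_candidates(jobs, *, min_age_days=14,
                              keep_latest_per_source=2,
                              max_jobs_per_run=1):
    """Pick jobs eligible for temp-nonwarc cleanup, oldest first."""
    cutoff = datetime.utcnow() - timedelta(days=min_age_days)
    by_source: dict[str, list[Job]] = {}
    for job in sorted(jobs, key=lambda j: j.finished_at, reverse=True):
        by_source.setdefault(job.source, []).append(job)

    candidates = []
    for source_jobs in by_source.values():
        # Never touch the N most recent snapshots per source.
        for job in source_jobs[keep_latest_per_source:]:
            if job.finished_at < cutoff:
                candidates.append(job)

    candidates.sort(key=lambda j: j.finished_at)  # oldest first
    return candidates[:max_jobs_per_run]          # 1 weekly, 5 under pressure
```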
Crawl Recovery Thresholds
Stall Detection
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Stall threshold | 3600s (60 min) | scripts/vps-crawl-auto-recover.py (--stall-threshold-seconds, default: 3600) | Balance between false positives and timely recovery |
| Progress metric | crawled count unchanged | Parsed from combined log | Reliable indicator of actual progress |
| Guard window | 600s (10 min) | scripts/vps-crawl-auto-recover.py (--skip-if-any-job-progress-within-seconds, default: 600) | Avoid interrupting healthy crawls |
Tuning guidance:
- Lower to 1800s (30 min) for fast sites (e.g., small test crawls)
- Raise to 5400s (90 min) or 7200s (120 min) for very slow sites or flaky networks
- Don't lower below 1800s (30 min); that risks false positives during normal slow periods
- Automation-first alerting note: treat HealthArchiveCrawlStalled as a post-watchdog/manual-review signal only if the crawl auto-recover watchdog is enabled and its metrics are fresh in Prometheus.
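A sketch of the two time windows working together; the inputs (last-progress timestamps) are assumed to come from the combined-log parsing described above:

```python
import time

STALL_THRESHOLD_SECONDS = 3600   # --stall-threshold-seconds default
GUARD_WINDOW_SECONDS = 600       # --skip-if-any-job-progress-within-seconds default

def is_stalled(last_crawled_change_ts: float, now: float | None = None) -> bool:
    """A job is stalled when its crawled count has not moved for an hour."""
    now = now if now is not None else time.time()
    return (now - last_crawled_change_ts) >= STALL_THRESHOLD_SECONDS

def recovery_guarded(all_jobs_last_progress_ts: list[float],
                     now: float | None = None) -> bool:
    """Skip recovery entirely if any job made progress in the last 10 minutes."""
    now = now if now is not None else time.time()
    return any((now - ts) < GUARD_WINDOW_SECONDS
               for ts in all_jobs_last_progress_ts)
```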
Recovery Rate Limits
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Per-job daily cap | 3 | scripts/vps-crawl-auto-recover.py (--max-recoveries-per-job-per-day, default: 3) | Prevents restart loops for fundamentally broken jobs |
| Soft recovery enabled | True | scripts/vps-crawl-auto-recover.py (--soft-recover-when-guarded, default: true) | Mark stalled jobs retryable without stopping healthy crawls |
Tuning guidance:
- Increase the per-job cap to 5 for known-flaky sources (e.g., sites with frequent timeouts)
- Disable soft recovery (--no-soft-recover-when-guarded) only for debugging
Recovery enhancements (auto-applied):
- enable_adaptive_restart=True
- max_container_restarts floor from the source profile (hc=24, phac=30, cihr=20)
- See: scripts/vps-crawl-auto-recover.py (_ensure_recovery_tool_options)
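The per-job daily cap reduces to a counter keyed by job and day; this sketch keeps the counter in memory, whereas the real watchdog presumably persists state between runs:

```python
from collections import Counter
from datetime import date

MAX_RECOVERIES_PER_JOB_PER_DAY = 3  # --max-recoveries-per-job-per-day default

_recoveries: Counter = Counter()    # illustrative in-memory state

def may_recover(job_id: int, today: date | None = None) -> bool:
    """Allow a recovery only while the job is under its daily cap."""
    key = (job_id, today or date.today())
    return _recoveries[key] < MAX_RECOVERIES_PER_JOB_PER_DAY

def record_recovery(job_id: int, today: date | None = None) -> None:
    """Count an attempted recovery against the job's daily budget."""
    key = (job_id, today or date.today())
    _recoveries[key] += 1
```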
Degraded Throughput Detection (observe-first)
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Degraded sources | hc,phac | docs/deployment/systemd/healtharchive-crawl-auto-recover.service (--degraded-sources) | Focus on broad canada.ca crawls with known binary-link pressure |
| Rate threshold | <2.0 ppm | scripts/vps-crawl-auto-recover.py (--degraded-rate-threshold-ppm, default: 2.0) | Detect sustained low throughput before hard stalls |
| Recent progress cap | <=300s | scripts/vps-crawl-auto-recover.py (--degraded-max-progress-age-seconds, default: 300) | Distinguish degraded from stalled jobs |
| Consecutive runs | 6 | scripts/vps-crawl-auto-recover.py (--degraded-min-consecutive-runs, default: 6) | Suppress transient dips |
| Action mode | observe | docs/deployment/systemd/healtharchive-crawl-auto-recover.service (--degraded-action observe) | Alert-first rollout to avoid restart churn |
Tuning guidance:
- Keep --degraded-action observe until the 6–24h long-window reassessment confirms persistent degradation.
- If false positives occur, increase consecutive runs before lowering threshold strictness.
- If persistent low throughput remains after scope reconciliation, use controlled restart playbooks instead of immediate automatic recoveries.
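Putting the four degraded-detection parameters together as a sketch; the run-classification labels are invented for illustration, the thresholds are the defaults above:

```python
DEGRADED_RATE_THRESHOLD_PPM = 2.0      # --degraded-rate-threshold-ppm
DEGRADED_MAX_PROGRESS_AGE_S = 300      # --degraded-max-progress-age-seconds
DEGRADED_MIN_CONSECUTIVE_RUNS = 6      # --degraded-min-consecutive-runs

def classify_run(rate_ppm: float, progress_age_seconds: float) -> str:
    """Degraded = still progressing recently, but at a sustained low rate.

    A job with *old* progress is a stall candidate, not a degraded one.
    """
    if progress_age_seconds > DEGRADED_MAX_PROGRESS_AGE_S:
        return "possibly-stalled"
    if rate_ppm < DEGRADED_RATE_THRESHOLD_PPM:
        return "degraded-candidate"
    return "healthy"

def is_degraded(recent_classifications: list[str]) -> bool:
    """Only flag after N consecutive degraded runs, to suppress transient dips."""
    tail = recent_classifications[-DEGRADED_MIN_CONSECUTIVE_RUNS:]
    return (len(tail) == DEGRADED_MIN_CONSECUTIVE_RUNS
            and all(c == "degraded-candidate" for c in tail))
```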
Crawl Auto-Start (Queue Fill)
When enabled, the crawl auto-recover watchdog can also act as a queue fill mechanism: if there are no stalled jobs, but the annual campaign is running fewer than N jobs, it can auto-start one queued/retryable annual job via systemd-run.
This is designed to avoid the operational failure mode where a stalled job gets marked retryable but never returns to running because the worker is already busy with another crawl.
Auto-Start Thresholds
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Min running jobs | 3 | docs/deployment/systemd/healtharchive-crawl-auto-recover.service (--ensure-min-running-jobs) | Keep annual campaign concurrency stable |
| Per-job daily cap | 3 | scripts/vps-crawl-auto-recover.py (--max-starts-per-job-per-day, default: 3) | Prevent auto-start loops |
| Disk safety limit | 88% | docs/deployment/systemd/healtharchive-crawl-auto-recover.service (--start-max-disk-usage-percent) | Avoid starting new crawls when disk is near full |
Implementation notes:
- Auto-start only considers jobs with config.campaign_kind="annual" and a matching config.campaign_year.
- Auto-start runs the job using systemd-run (detached) and applies Docker caps via env vars:
  - HEALTHARCHIVE_DOCKER_CPU_LIMIT (default: 1.0; configurable via --start-docker-cpu-limit)
  - HEALTHARCHIVE_DOCKER_MEMORY_LIMIT (default: 3g; configurable via --start-docker-memory-limit)
- Alerting integration note: worker-down notifications can be delayed/suppressed based on worker auto-start automation only when healtharchive-worker-auto-start.timer is enabled and healtharchive_worker_auto_start.prom metrics are fresh.
- The worker auto-start watchdog now includes stale running-row reconciliation:
  - --reconcile-running-drift --reconcile-older-than-minutes 10 --reconcile-limit 10
  - It only reconciles rows with no active crawl process detected for their output dir.
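The auto-start decision is a chain of independent gates; this sketch shows one plausible gate order under the defaults in the table (function and argument names are illustrative):

```python
MIN_RUNNING_JOBS = 3            # --ensure-min-running-jobs
MAX_STARTS_PER_JOB_PER_DAY = 3  # --max-starts-per-job-per-day
START_MAX_DISK_USAGE = 88       # --start-max-disk-usage-percent

def may_auto_start(running_annual_jobs: int,
                   disk_usage_percent: float,
                   starts_today_for_candidate: int,
                   stalled_jobs_present: bool) -> bool:
    """Auto-start one queued/retryable annual job only when all gates pass."""
    if stalled_jobs_present:            # recovery takes priority over queue fill
        return False
    if running_annual_jobs >= MIN_RUNNING_JOBS:
        return False                    # concurrency target already met
    if disk_usage_percent >= START_MAX_DISK_USAGE:
        return False                    # don't start new crawls near disk-full
    if starts_today_for_candidate >= MAX_STARTS_PER_JOB_PER_DAY:
        return False                    # prevent auto-start loops
    return True
```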
Storage Hot-Path Recovery Thresholds
Stale Mount Detection
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Min failure age | 120s (2 min) | scripts/vps-storage-hotpath-auto-recover.py (--min-failure-age-seconds, default: 120) | Avoid acting on transient failures |
| Confirm runs | 2 consecutive | scripts/vps-storage-hotpath-auto-recover.py (--confirm-runs, default: 2) | Require persistence before acting |
| Detection signal | Errno 107 | Probed via os.stat() | "Transport endpoint is not connected" |
Probed locations:
1. Running job output dirs
2. Next queued/retryable job output dirs (prevents retry storms)
3. Manifest hot paths (tiering bind mounts)
Tuning guidance:
- Don't lower min_failure_age; transient failures are common
- Don't reduce confirm_runs; single observations may be false positives
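A sketch of the Errno 107 probe plus the confirm-runs requirement; how consecutive_failures and the first-failure age are persisted between runs is an assumption here:

```python
import errno
import os

MIN_FAILURE_AGE_SECONDS = 120  # --min-failure-age-seconds default
CONFIRM_RUNS = 2               # --confirm-runs default

def probe_stale_mount(path: str) -> bool:
    """Return True when stat() raises ENOTCONN (Errno 107).

    "Transport endpoint is not connected" is the classic stale-sshfs signal.
    """
    try:
        os.stat(path)
        return False
    except OSError as exc:
        return exc.errno == errno.ENOTCONN

def confirmed_stale(consecutive_failures: int,
                    first_failure_age_seconds: float) -> bool:
    """Act only after repeated observations of a sufficiently old failure."""
    return (consecutive_failures >= CONFIRM_RUNS
            and first_failure_age_seconds >= MIN_FAILURE_AGE_SECONDS)
```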
Recovery Rate Limits
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Cooldown | 15 minutes | scripts/vps-storage-hotpath-auto-recover.py (--cooldown-seconds, default: 900) | Prevent flapping after recovery |
| Hourly cap | 2 | scripts/vps-storage-hotpath-auto-recover.py (--max-recoveries-per-hour, default: 2) | Global safety limit |
| Daily cap | 6 global, 3/job | scripts/vps-storage-hotpath-auto-recover.py (--max-recoveries-per-day, default: 6; --max-recoveries-per-job-per-day, default: 3) | Prevent runaway automation |
Tuning guidance:
- Increase the cooldown to 30 min if recovery attempts fail repeatedly
- Increase hourly/daily caps cautiously; investigate the root cause instead
- Don't bypass caps in automation; they prevent pathological loops
- Alerting integration note: job-level Errno 107 unreadable/writability symptom alerts can be demoted/suppressed only if the storage hot-path watchdog is enabled and healtharchive_storage_hotpath_auto_recover.prom metrics are fresh.
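The cooldown and the hourly/daily caps compose into a single gate; a sketch over a list of past recovery timestamps (how that history is stored is an assumption):

```python
import time

COOLDOWN_SECONDS = 900          # --cooldown-seconds
MAX_PER_HOUR = 2                # --max-recoveries-per-hour
MAX_PER_DAY = 6                 # --max-recoveries-per-day (global)

def may_attempt_recovery(recovery_timestamps: list[float],
                         now: float | None = None) -> bool:
    """Check cooldown, hourly cap, and daily cap against past attempts."""
    now = now if now is not None else time.time()
    if recovery_timestamps and (now - max(recovery_timestamps)) < COOLDOWN_SECONDS:
        return False  # still cooling down after the last recovery
    last_hour = [t for t in recovery_timestamps if now - t < 3600]
    last_day = [t for t in recovery_timestamps if now - t < 86400]
    return len(last_hour) < MAX_PER_HOUR and len(last_day) < MAX_PER_DAY
```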
Persistent Failed-Apply Alert
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Failed-apply age threshold | >24h | ops/observability/alerting/healtharchive-alerts.yml (HealthArchiveStorageHotpathApplyFailedPersistent) | Catch long-lived failed recovery state |
| Startup guard | apply_total > 0 | Same alert expr | Avoid first-run/startup false positives |
| Alert duration | 30m | Same alert rule (for: 30m) | Avoid transient signal noise |
| Initial severity | warning | Same alert rule | Tune in burn-in before considering escalation |
Tuning guidance:
- Keep the age threshold at 24h until burn-in confirms clear signal quality.
- If the alert is too noisy, investigate watchdog state churn first; do not hide failures by raising the threshold to multiple days.
- Escalate to critical only after at least one week of clean behavior and a verified operator response path.
SSHFS Mount Options
| Option | Value | Location | Purpose |
|---|---|---|---|
| reconnect | Enabled | docs/deployment/systemd/healtharchive-storagebox-sshfs.service | Auto-reconnect on connection loss |
| ServerAliveInterval | 15s | systemd service | Send keepalive every 15 seconds |
| ServerAliveCountMax | 3 | systemd service | Disconnect after 3 missed keepalives (45s total) |
| kernel_cache | Enabled | systemd service | Performance optimization |
Tuning guidance:
- Lower ServerAliveInterval to 10s if mounts go stale frequently
- Don't raise ServerAliveCountMax; it delays detection of stale connections
- reconnect should always be enabled
Known issue: Stale mounts still occur despite hardened options (root cause under investigation).
See: docs/planning/implemented/2026-02-01-operational-resilience-improvements.md
Deploy Lock Protection
| Parameter | Value | Location | Purpose |
|---|---|---|---|
| Max age | 2 hours | scripts/vps-crawl-auto-recover.py + scripts/vps-storage-hotpath-auto-recover.py (--deploy-lock-max-age-seconds, default: 2h) | Stale lock detection |
| Lock file | /tmp/healtharchive-deploy.lock | Deploy script + watchdogs | Prevent watchdog/deploy conflicts |
| Lock mechanism | flock | scripts/vps-deploy.sh | Atomic lock acquisition |
Tuning guidance:
- Increase the max age if deploys routinely take >2 hours (and investigate why)
- Don't decrease below 1 hour; normal deploys can take 30-45 minutes
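From the watchdog side, the lock check reduces to "back off while a sufficiently fresh lock file exists"; a sketch under the defaults above:

```python
import os
import time

LOCK_FILE = "/tmp/healtharchive-deploy.lock"
MAX_AGE_SECONDS = 2 * 60 * 60   # --deploy-lock-max-age-seconds default

def deploy_in_progress() -> bool:
    """True while a fresh deploy lock exists; stale locks are ignored."""
    try:
        age = time.time() - os.stat(LOCK_FILE).st_mtime
    except FileNotFoundError:
        return False              # no lock, watchdogs may act
    return age < MAX_AGE_SECONDS  # older locks are treated as stale leftovers
```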
Infra Error Cooldown
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Cooldown | 10 minutes | src/ha_backend/worker/main.py (INFRA_ERROR_RETRY_COOLDOWN_MINUTES) | Prevent retry storms when infra is unhealthy |
| Infra errors | Errno 107, Errno 5, OSError during job launch | src/ha_backend/infra_errors.py | Infrastructure failures (not crawl failures) |
Tuning guidance:
- Increase to 20 min if infrastructure is persistently unstable
- Decrease to 5 min if false positives are common (careful!)
See: docs/planning/implemented/2026-01-24-infra-error-and-storage-hotpath-hardening.md
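The cooldown pattern, sketched with a single module-level timestamp for the last infra-classified error (Errno 107/5, OSError at launch); how the worker actually tracks this is an assumption:

```python
import time

INFRA_ERROR_RETRY_COOLDOWN_MINUTES = 10  # mirrors the worker constant

_last_infra_error_ts: float | None = None  # illustrative module state

def note_infra_error() -> None:
    """Record the time of the most recent infrastructure error."""
    global _last_infra_error_ts
    _last_infra_error_ts = time.time()

def retries_allowed(now: float | None = None) -> bool:
    """Hold all retries while the infra-error cooldown is active."""
    if _last_infra_error_ts is None:
        return True
    now = now if now is not None else time.time()
    return (now - _last_infra_error_ts) >= INFRA_ERROR_RETRY_COOLDOWN_MINUTES * 60
```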
Annual Output Dir Writability Probe
These checks detect permission drift for queued/retryable annual jobs before a crawl attempt consumes retries.
| Parameter | Value | Location | Rationale |
|---|---|---|---|
| Probe cadence | every 1 minute | docs/deployment/systemd/healtharchive-crawl-metrics.timer | Early warning without paging storms |
| Probe target | queued/retryable annual jobs | scripts/vps-crawl-metrics-textfile.py | Bounded cardinality (annual jobs only) |
| Probe identity | haadmin | scripts/vps-crawl-metrics-textfile.py (--annual-writability-probe-user) | Matches worker runtime user |
| Alert duration | 10m | ops/observability/alerting/healtharchive-alerts.yml (HealthArchiveAnnualOutputDirNotWritable) | Avoid transient noise |
| Severity | warning | same alert rule | Fix before retries are consumed |
Triage signals:
- ..._errno == 13: permission drift (output dir not writable for the worker user)
- ..._errno == 107: stale sshfs hot path (follow storage hot-path recovery)
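A sketch of a writability probe that yields these triage errnos; the marker-file mechanism and return shape are illustrative, not necessarily what scripts/vps-crawl-metrics-textfile.py does:

```python
import errno
import os

def probe_output_dir(path: str) -> int | None:
    """Return the errno of a failed write probe, or None when writable.

    errno 13 (EACCES)    -> permission drift for the worker user
    errno 107 (ENOTCONN) -> stale sshfs hot path
    """
    marker = os.path.join(path, ".writability-probe")
    try:
        with open(marker, "w") as fh:
            fh.write("probe")
        os.unlink(marker)
        return None
    except OSError as exc:
        return exc.errno
```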
Archive Tool (Crawler) Adaptive Thresholds
Annual Per-Source Profiles
Annual jobs are source-tuned (not one-size-fits-all). Canonical values live in src/ha_backend/job_registry.py and are reconciled by scripts/vps-crawl-auto-recover.py during recovery/auto-start flows.
| Source | Initial workers | Stall timeout | Timeout/HTTP threshold | Backoff | Max restarts | Rationale |
|---|---|---|---|---|---|---|
| hc | 2 | 75 min | 55 / 55 | 15 min | 24 | Moderate tolerance for canada.ca long-tail behavior. |
| phac | 2 | 90 min | 65 / 65 | 3 min | 30 | Highest tolerance due to historically high restart churn. |
| cihr | 3 | 45 min | 35 / 35 | 1 min | 20 | Faster/cleaner profile to improve throughput and fault detection. |
Tuning guidance:
- Change source profiles in job_registry first; keep watchdog reconciliation aligned.
- For a completeness-first posture, prefer targeted, versioned source-profile or scope changes in the repo over repeated manual recoveries.
- Only increase tolerance (stall/restart budget) when evidence shows the crawl is still making useful progress and the issue is intermittent rather than continuous thrash.
- Only reduce thresholds when repeated evidence shows low false-positive restart risk.
- For recurring canada.ca transport errors, prefer a source-profile browser compatibility change (for example, Browsertrix --extraChromeArgs) before expanding exclusions or repeatedly recycling the same job config.
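The per-source table expressed as data, for illustration; the field names and dataclass shape are hypothetical (the canonical definitions live in src/ha_backend/job_registry.py), the values match the table:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceProfile:            # illustrative shape, not the registry's
    initial_workers: int
    stall_timeout_min: int
    timeout_threshold: int      # timeout / HTTP-error threshold (same value)
    backoff_min: int
    max_container_restarts: int

SOURCE_PROFILES = {
    "hc":   SourceProfile(2, 75, 55, 15, 24),
    "phac": SourceProfile(2, 90, 65, 3, 30),
    "cihr": SourceProfile(3, 45, 35, 1, 20),
}
```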
One-Time Annual Backfill/Reconciliation
When migrating an existing campaign from shared defaults to per-source tuning, reconcile existing annual jobs in-place:
```bash
# Review changes first (dry-run)
healtharchive reconcile-annual-tool-options --year 2026

# Apply changes
healtharchive reconcile-annual-tool-options --year 2026 --apply
```
What this command does:
- Reconciles baseline annual values to source profile values (hc, phac, cihr)
- Preserves explicit non-baseline overrides (except restart floor enforcement)
- Ensures annual safety defaults (enable_monitoring, enable_adaptive_restart, skip_final_build, docker_shm_size=1g)
See: src/archive_tool/constants.py, scripts/vps-crawl-auto-recover.py
Summary Table: All Thresholds
| Category | Threshold | Value | Priority | Location |
|---|---|---|---|---|
| Disk | Worker headroom | 85% | P0 | worker/main.py |
| Disk | Alert warning | 85% for 30m | P1 | alerting YAML |
| Disk | Alert critical | 92% for 10m | P0 | alerting YAML |
| Crawl | Stall threshold | 60 min | P1 | vps-crawl-auto-recover.py |
| Crawl | Recovery cap | 3/job/day | P1 | vps-crawl-auto-recover.py |
| Crawl | New-crawl-phase churn | >=3 (30m) | P1 | alerting YAML |
| Crawl | Slow-rate alert (HC) | <1.5 ppm (30m) | P1 | alerting YAML |
| Crawl | Slow-rate alert (PHAC) | <1.5 ppm (30m) | P1 | alerting YAML |
| Crawl | Slow-rate alert (CIHR) | <3 ppm (30m) | P1 | alerting YAML |
| Storage | Stale mount age | 120s | P1 | vps-storage-hotpath-auto-recover.py |
| Storage | Recovery cooldown | 15 min | P1 | vps-storage-hotpath-auto-recover.py |
| Storage | Recovery cap | 6/day global | P1 | vps-storage-hotpath-auto-recover.py |
| Storage | Failed-apply persistence alert | >24h + 30m | P1 | alerting YAML |
| Infra | Retry cooldown | 10 min | P1 | worker/main.py |
| SSHFS | Keepalive interval | 15s | P1 | systemd service |
Tuning Workflow
When adjusting thresholds:
- Document the change: Update this file with new values and rationale
- Test in staging (if available): Validate behavior before production
- Monitor metrics: Watch Prometheus/Grafana for impact
- Iterate conservatively: Small adjustments, measure, repeat
- Update automation: Adjust watchdog caps if needed
Anti-patterns:
- Disabling safety caps to "fix" underlying issues
- Tuning based on single incidents without trend analysis
- Raising thresholds indefinitely instead of fixing the root cause
Related Documentation
- Disk baseline: docs/operations/disk-baseline-and-cleanup.md
- Alerting strategy: docs/operations/monitoring-and-alerting.md
- Stale mount playbook: docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
- Crawl stall playbook: docs/operations/playbooks/crawl/crawl-stalls.md
- Operational resilience improvements: docs/planning/implemented/2026-02-01-operational-resilience-improvements.md