Crawl Operability - Locks, Writability, and Retry Controls (Implemented 2026-04-14)
Status: Implemented | Scope: Hardened job locking, annual output-dir health visibility, and retry-budget recovery UX in repo, then completed the production lock-dir cutover during the 2026-04-14 maintenance window.
Outcomes
- Job locks no longer force /tmp-style 1777 semantics on dedicated lock directories; production now uses HEALTHARCHIVE_JOB_LOCK_DIR=/srv/healtharchive/ops/locks/jobs.
- Added annual queued/retryable output-dir writability probes to scripts/vps-crawl-metrics-textfile.py, plus HealthArchiveAnnualOutputDirNotWritable alert coverage.
- Added audited healtharchive reset-retry-count CLI support for operator-safe retry-budget resets.
- Added scripts/vps-job-lock-dir-cutover.sh and systemd deployment guidance for staged rollout and rollback.
- Completed the production lock-dir cutover on 2026-04-14 by restarting the API and worker with /etc/healtharchive/backend.env already pointing at /srv/healtharchive/ops/locks/jobs.
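
For readers unfamiliar with the pattern, an output-dir writability probe like the one mentioned above can be sketched as follows. This is a minimal illustration, not the code in scripts/vps-crawl-metrics-textfile.py: the function names and the metric name `example_output_dir_writable` are hypothetical, and the real script may probe and label its metrics differently.

```python
import tempfile


def probe_dir_writable(path: str) -> bool:
    """Return True if a temporary file can be created and removed inside path."""
    try:
        # NamedTemporaryFile creates the file in `path` and deletes it on close,
        # so a successful probe leaves nothing behind.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".probe-"):
            pass
        return True
    except OSError:
        # Covers a missing directory, permission errors, and read-only mounts.
        return False


def render_metric(path: str, writable: bool) -> str:
    """Render one Prometheus text-format sample line.

    The metric name is a hypothetical placeholder. Paths containing quotes or
    backslashes would need label escaping, which this sketch omits.
    """
    value = 1 if writable else 0
    return f'example_output_dir_writable{{path="{path}"}} {value}'


if __name__ == "__main__":
    # Hypothetical usage: probe one directory and print the sample line.
    target = tempfile.gettempdir()
    print(render_metric(target, probe_dir_writable(target)))
```

Creating and removing a real file, rather than checking permission bits alone, also catches read-only mounts and stale mount points, which is usually the failure an alert on this kind of metric is meant to surface.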
Canonical Docs Updated
- docs/deployment/systemd/README.md
- docs/reference/cli-commands.md
- docs/operations/monitoring-and-alerting.md
- docs/operations/thresholds-and-tuning.md
- docs/operations/healtharchive-ops-roadmap.md
Historical Context
Full implementation detail is preserved in git history. The remaining storage mount-topology work lives in ../2026-02-06-hotpath-staleness-root-cause-investigation.md; this plan's lock-dir cutover is complete.