Crawl auto-recover drills (safe on production)
Goal: periodically prove that:
- the crawl auto-recover watchdog is installed and runnable, and
- the watchdog would take sensible actions for a stalled job,
…without actually stopping services or writing to the production watchdog state/metrics.
0) Safety rules
- Never run the crawl auto-recover watchdog with `--apply` as part of a drill.
- For drills, always override:
  - `--state-file` (use a `/tmp/...` path)
  - `--lock-file` (use a `/tmp/...` path)
  - `--textfile-out-dir` (use `/tmp`)
  - `--textfile-out-file` (use a drill filename)
The watchdog enforces this automatically when you use drill flags.
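The enforcement lives inside `vps-crawl-auto-recover.py` itself. Purely as an illustration of the idea (function and attribute names here are hypothetical, not the script's real internals), such a guard could look like:

```python
import argparse
from pathlib import Path

def enforce_drill_safety(args: argparse.Namespace) -> None:
    """Hypothetical guard: when a drill flag is set, refuse --apply and
    require every writable path to live under /tmp."""
    if args.simulate_stalled_job_id is None:
        return  # not a drill run; normal rules apply
    if args.apply:
        raise SystemExit("drill flags are incompatible with --apply")
    for path in (args.state_file, args.lock_file, args.textfile_out_dir):
        if not Path(path).is_relative_to("/tmp"):
            raise SystemExit(f"drill runs must use /tmp paths, got: {path}")
```

The point is that production state, lock, and metrics paths are rejected outright once any simulate flag is present, so a drill cannot clobber the real watchdog's bookkeeping even if you forget an override.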
1) Pick a job ID to simulate
Pick a real job ID from the database (it does not need to be stalled).
For the “guard window” drill below, it helps if at least one other job is currently running and making progress.
set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive list-jobs --status running --limit 10
Also pick a job ID that is not currently running (queued/retryable is fine):
set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive list-jobs --limit 20
Pick one job_id from the output, for example 7.
2) Drill: simulate a stalled job (soft recovery path)
This exercises the common “safe” path where another job is still making progress, so the watchdog would avoid worker restarts.
Important: soft recovery is only allowed when the watchdog can confirm the stalled job has no active runner (i.e., it's a zombie `status=running` DB row). In drill mode we force that classification with:
`--simulate-stalled-job-runner none`
cd /opt/healtharchive
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; \
/opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-crawl-auto-recover.py \
--simulate-stalled-job-id 7 \
--simulate-stalled-job-runner none \
--state-file /tmp/healtharchive-crawl-auto-recover.drill.state.json \
--lock-file /tmp/healtharchive-crawl-auto-recover.drill.lock \
--textfile-out-dir /tmp \
--textfile-out-file healtharchive_crawl_auto_recover.drill.prom'
Expected output includes:
- `DRILL: simulate-stalled-job-id active`
- `Planned actions (dry-run):`
- `recover-stale-jobs ... --apply --source ...`
Confirm the drill metrics were written:
3) Drill: simulate a stalled job (full recovery path)
This forces the watchdog to show the “full recovery” plan by disabling the guard window.
3a) Full recovery (job is running under the worker)
cd /opt/healtharchive
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; \
/opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-crawl-auto-recover.py \
--skip-if-any-job-progress-within-seconds 0 \
--simulate-stalled-job-id 7 \
--simulate-stalled-job-runner worker \
--state-file /tmp/healtharchive-crawl-auto-recover.full-drill.state.json \
--lock-file /tmp/healtharchive-crawl-auto-recover.full-drill.lock \
--textfile-out-dir /tmp \
--textfile-out-file healtharchive_crawl_auto_recover.full-drill.prom'
Expected output includes:
- `systemctl stop healtharchive-worker.service`
- `recover-stale-jobs ... --apply --source ...`
- `systemctl start healtharchive-worker.service`
3b) Full recovery (job is running in a systemd-run transient unit)
Use any realistic transient unit name (this is a drill-only override):
cd /opt/healtharchive
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; \
/opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-crawl-auto-recover.py \
--skip-if-any-job-progress-within-seconds 0 \
--simulate-stalled-job-id 7 \
--simulate-stalled-job-runner systemd_unit \
--simulate-stalled-job-runner-unit healtharchive-job7-phac-3way.service \
--state-file /tmp/healtharchive-crawl-auto-recover.full-drill.state.json \
--lock-file /tmp/healtharchive-crawl-auto-recover.full-drill.lock \
--textfile-out-dir /tmp \
--textfile-out-file healtharchive_crawl_auto_recover.full-drill.prom'
Expected output includes:
- `systemctl stop healtharchive-job7-phac-3way.service`
- `recover-stale-jobs ... --apply --source ...`
- `systemctl start healtharchive-job7-phac-3way.service`
Notes
- In all drill cases above, the watchdog remains in dry-run mode and does not actually stop services.
- If you omit the `--simulate-stalled-job-runner ...` override, the watchdog will attempt best-effort runner detection (worker vs transient unit) from the live system. For example:
cd /opt/healtharchive
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; \
/opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-crawl-auto-recover.py \
--skip-if-any-job-progress-within-seconds 0 \
--simulate-stalled-job-id 7 \
--state-file /tmp/healtharchive-crawl-auto-recover.full-drill.state.json \
--lock-file /tmp/healtharchive-crawl-auto-recover.full-drill.lock \
--textfile-out-dir /tmp \
--textfile-out-file healtharchive_crawl_auto_recover.full-drill.prom'
Expected output includes:
- `Planned actions (dry-run):`
- `systemctl stop ...` (either `healtharchive-worker.service` or a `healtharchive-job<id>-*.service` transient unit)
- `recover-stale-jobs ... --apply --source ...`
- `systemctl start ...` (matching the stop target above)
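This section does not specify how the live detection works; a plausible sketch (function name hypothetical; unit names follow the `healtharchive-job<id>-*.service` pattern from the examples above) is to look for a matching transient unit first and fall back to the shared worker:

```python
import subprocess
from typing import Optional, Tuple

def detect_runner(job_id: int) -> Tuple[str, Optional[str]]:
    """Hypothetical best-effort runner detection for a stalled job.

    Returns ("systemd_unit", unit_name) when a per-job transient unit is
    loaded, ("worker", None) when the shared worker service is active,
    or ("none", None) when neither is found (the zombie-row case).
    """
    # Per-job transient units follow the healtharchive-job<id>-*.service pattern.
    out = subprocess.run(
        ["systemctl", "list-units", "--plain", "--no-legend",
         f"healtharchive-job{job_id}-*.service"],
        capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        return ("systemd_unit", line.split()[0])
    # Otherwise, check whether the shared worker service is active.
    rc = subprocess.run(
        ["systemctl", "is-active", "--quiet", "healtharchive-worker.service"]
    ).returncode
    return ("worker", None) if rc == 0 else ("none", None)
```

Whatever the script actually does, this is why the drill overrides exist: `--simulate-stalled-job-runner` pins the classification so the drill's planned actions are deterministic instead of depending on live system state.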
4) Drill: queue fill / auto-start (safe on production)
Goal: prove the watchdog would auto-start a queued/retryable annual job when the annual campaign is underfilled (fewer than N running jobs), without actually starting anything.
Safety rules:
- Do not pass `--apply`.
- Always override `--state-file`, `--lock-file`, `--textfile-out-dir`, and `--textfile-out-file` (use `/tmp` paths).
This drill does not require `--simulate-stalled-job-id`; it exercises the "no stalled jobs" path.
Steps
1) Confirm how many jobs are currently status=running:
set -a; source /etc/healtharchive/backend.env; set +a
/opt/healtharchive/.venv/bin/healtharchive list-jobs --status running --limit 10
2) Run a dry-run with `--ensure-min-running-jobs` set above the current count (example uses 3):
cd /opt/healtharchive
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; \
/opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-crawl-auto-recover.py \
--ensure-min-running-jobs 3 \
--state-file /tmp/healtharchive-crawl-auto-recover.start-drill.state.json \
--lock-file /tmp/healtharchive-crawl-auto-recover.start-drill.lock \
--textfile-out-dir /tmp \
--textfile-out-file healtharchive_crawl_auto_recover.start-drill.prom'
Expected output includes:
- `DRY-RUN: would auto-start annual job_id=...`
- `Planned actions (dry-run):`
- `systemd-run ... healtharchive run-db-job --id ...`
Notes:
- Queue fill only targets annual jobs for the selected campaign year.
- For legacy annual jobs missing `campaign_kind`/`campaign_year`, the watchdog infers annual jobs from the canonical `-YYYY0101` suffix (for example `phac-20260101`). In `--apply` mode, it will also backfill missing campaign metadata before starting the job.
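The suffix rule above can be captured as a tiny helper (illustrative only; the watchdog's actual implementation may differ):

```python
import re

# Canonical annual suffix: -YYYY0101, e.g. phac-20260101 (see note above).
_ANNUAL_SUFFIX = re.compile(r"-(\d{4})0101$")

def infer_campaign_year(job_name):
    """Return the campaign year implied by a canonical -YYYY0101 suffix,
    or None when the name does not look like an annual job."""
    m = _ANNUAL_SUFFIX.search(job_name)
    return int(m.group(1)) if m else None
```

So `phac-20260101` would be treated as the 2026 annual campaign, while a name like `phac-3way` would not be considered annual at all.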
Confirm the drill metrics were written:
5) Cleanup
Drill artifacts are safe to delete:
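For example, removing every state, lock, and metrics file used by the drills above:

```shell
rm -f \
  /tmp/healtharchive-crawl-auto-recover.drill.state.json \
  /tmp/healtharchive-crawl-auto-recover.drill.lock \
  /tmp/healtharchive_crawl_auto_recover.drill.prom \
  /tmp/healtharchive-crawl-auto-recover.full-drill.state.json \
  /tmp/healtharchive-crawl-auto-recover.full-drill.lock \
  /tmp/healtharchive_crawl_auto_recover.full-drill.prom \
  /tmp/healtharchive-crawl-auto-recover.start-drill.state.json \
  /tmp/healtharchive-crawl-auto-recover.start-drill.lock \
  /tmp/healtharchive_crawl_auto_recover.start-drill.prom
```

Because the drills never touch the production state file, lock file, or textfile directory, deleting these has no effect on the real watchdog.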