Cleanup automation (safe temp cleanup)
Goal: remove .tmp* crawl directories from older indexed jobs without breaking replay.
Canonical refs:
- cleanup command: healtharchive cleanup-job --mode temp-nonwarc
- systemd unit templates: ../../../deployment/systemd/README.md
- replay retention note: ../../growth-constraints.md
What this does
- Picks indexed jobs older than a minimum age.
- Keeps the latest N per source.
- Runs safe cleanup (temp-nonwarc) to preserve WARCs.
- Emits node_exporter metrics: healtharchive_cleanup_applied_total
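The selection rules above (minimum age, keep the latest N per source) can be sketched as follows. This is a minimal illustration, not the code in vps-cleanup-automation.py: the Job record, its field names, the "indexed" status string, and the function name are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Job:
    # Hypothetical record; the real job model is not shown in this document.
    job_id: int
    source: str
    status: str
    finished_at: datetime


def select_cleanup_candidates(jobs, min_age_days, keep_latest, now=None):
    """Return indexed jobs older than min_age_days, always skipping the
    newest keep_latest indexed jobs of each source."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)

    # Group indexed jobs by source; other statuses are never touched.
    by_source = defaultdict(list)
    for job in jobs:
        if job.status == "indexed":
            by_source[job.source].append(job)

    candidates = []
    for source_jobs in by_source.values():
        # Newest first, so the first keep_latest entries are the protected ones.
        source_jobs.sort(key=lambda j: j.finished_at, reverse=True)
        for job in source_jobs[keep_latest:]:
            if job.finished_at < cutoff:
                candidates.append(job)

    # Oldest first, so the oldest jobs are cleaned before newer ones.
    return sorted(candidates, key=lambda j: j.finished_at)
```

In this sketch the retain count is applied before the age filter, so a source with few jobs keeps all of them no matter how old they are.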
Enablement (VPS)
sudo touch /etc/healtharchive/cleanup-automation-enabled
sudo systemctl enable --now healtharchive-cleanup-automation.timer
Manual dry-run
Warning: starting healtharchive-cleanup-automation.service will apply cleanup (it is the automation entrypoint). Use the script directly for a dry-run preview.
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; /opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-cleanup-automation.py --config /opt/healtharchive/ops/automation/cleanup-automation.toml --out-dir /tmp --out-file healtharchive_cleanup_dryrun.prom'
cat /tmp/healtharchive_cleanup_dryrun.prom
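To inspect the dry-run output programmatically instead of reading it with cat, a small parser along these lines may help. It assumes the file uses the Prometheus text exposition format, as the .prom extension and the node_exporter mention suggest. The function name is hypothetical, and whether the dry-run file actually contains healtharchive_cleanup_applied_total is not confirmed here.

```python
def read_metric_samples(text, metric_name):
    """Return (labels_text, value) pairs for metric_name from text in
    Prometheus exposition format. Comment and blank lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; textfile-collector
        # files carry no trailing timestamp.
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        if name == metric_name:
            samples.append((labels.rstrip("}"), float(value)))
    return samples


if __name__ == "__main__":
    # Path taken from the dry-run command above.
    with open("/tmp/healtharchive_cleanup_dryrun.prom", encoding="utf-8") as fh:
        for labels, value in read_metric_samples(
            fh.read(), "healtharchive_cleanup_applied_total"
        ):
            print(labels or "(no labels)", value)
```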
If cleanup fails
- Check that the job output directory exists and is readable.
- Run the cleanup command (healtharchive cleanup-job --mode temp-nonwarc) manually against the affected job.
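The directory check in the first step can be sketched as below. The helper names are hypothetical, and the job output path must come from your own job records because it is not given in this document. The second helper lists the .tmp* directories that the cleanup is meant to remove, which is a quick way to see whether anything is left to clean.

```python
import os
from pathlib import Path


def check_output_dir(path):
    """Return a list of problems with a job output directory (empty if none)."""
    p = Path(path)
    problems = []
    if not p.exists():
        problems.append(f"{p} does not exist")
    elif not p.is_dir():
        problems.append(f"{p} is not a directory")
    elif not os.access(p, os.R_OK | os.X_OK):
        # Listing a directory needs read permission; entering it needs execute.
        problems.append(f"{p} is not readable by the current user")
    return problems


def list_temp_dirs(path):
    """Return the names of the .tmp* subdirectories directly under path."""
    return sorted(
        child.name
        for child in Path(path).iterdir()
        if child.is_dir() and child.name.startswith(".tmp")
    )
```

Run the check as the same user the automation runs as. A directory that is readable by root can still be unreadable to a service account.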
Config
Edit ops/automation/cleanup-automation.toml to adjust age, caps, and retain count.