Skip to content

Cleanup automation (safe temp cleanup)

Goal: remove .tmp* crawl directories from older indexed jobs without breaking replay.

Canonical refs:

  • cleanup command: healtharchive cleanup-job --mode temp-nonwarc
  • systemd unit templates: ../../../deployment/systemd/README.md
  • replay retention note: ../../growth-constraints.md

What this does

  • Picks indexed jobs older than a minimum age.
  • Keeps the latest N per source.
  • Runs safe cleanup (temp-nonwarc) to preserve WARCs.
  • Emits node_exporter metrics:
  • healtharchive_cleanup_applied_total

Enablement (VPS)

sudo touch /etc/healtharchive/cleanup-automation-enabled
sudo systemctl enable --now healtharchive-cleanup-automation.timer

Manual dry-run

Warning: starting healtharchive-cleanup-automation.service will apply cleanup (it is the automation entrypoint). Use the script directly for a dry-run preview.

sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; /opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-cleanup-automation.py --config /opt/healtharchive/ops/automation/cleanup-automation.toml --out-dir /tmp --out-file healtharchive_cleanup_dryrun.prom'
cat /tmp/healtharchive_cleanup_dryrun.prom

If cleanup fails

  1. Check the job output directory exists and is readable:
    /opt/healtharchive/.venv/bin/healtharchive show-job --id <JOB_ID>
    
  2. Run the cleanup command manually:
    /opt/healtharchive/.venv/bin/healtharchive cleanup-job --id <JOB_ID> --mode temp-nonwarc --dry-run
    

Config

Edit ops/automation/cleanup-automation.toml to adjust age, caps, and retain count.