Disk Baseline and Automated Cleanup

Last Updated: 2026-02-01
VPS: Hetzner 75GB single-VPS production

Current Baseline

Normal operating disk usage: ~82%
Available space: ~14GB
Alert thresholds:
- Warning: >85% for 30m
- Critical: >92% for 10m
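The thresholds above can be expressed as a small classifier. This is a minimal sketch, not the alerting implementation: the function names are hypothetical, it assumes GNU df, and it looks only at the instantaneous value, so the "for 30m" / "for 10m" hold times are not modeled.

```bash
# Hypothetical helper: map a usage percentage to the alert tiers above.
# The 85 and 92 thresholds come from this document.
classify_disk_usage() {
    local pct="$1"
    if [ "$pct" -gt 92 ]; then
        echo "critical"
    elif [ "$pct" -gt 85 ]; then
        echo "warning"
    else
        echo "ok"
    fi
}

# Current usage of the root filesystem as a bare integer (GNU df).
root_usage_pct() {
    df --output=pcent / | tail -n 1 | tr -dc '0-9'
}
```

Usage: classify_disk_usage "$(root_usage_pct)" prints ok, warning, or critical for the root filesystem.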

Why 82% Baseline?

The VPS uses a tiered storage architecture:
- Local disk (75GB): System, Docker, logs, temp crawl data
- Storagebox (1TB): Final WARCs, ZIMs, large job data via SSHFS mounts

Local disk breakdown (~61GB used):
- System/packages: ~3.1GB (/usr)
- Docker: ~7GB (/var/lib/docker)
- Logs: ~2GB (/var/log)
- Ephemeral data: ~1GB (/srv local, temp crawl dirs)
- OS/kernel: ~48GB (includes filesystem metadata, journal, reserves)
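As a sanity check, the breakdown can be reconciled with the 82% baseline. The sketch below sums the approximate figures in tenths of a GB and applies the rule GNU df uses for Use%, namely used / (used + available) rounded up. The function names are illustrative, and the ~14GB available figure is the one quoted in the baseline.

```bash
# Sum the breakdown in tenths of a GB to stay in integer arithmetic.
# Figures are the approximate ones listed above.
breakdown_total_tenths() {
    # /usr 3.1, Docker 7, logs 2, ephemeral 1, OS/kernel 48
    echo $(( 31 + 70 + 20 + 10 + 480 ))
}

# Usage percent as used / (used + available), rounded up.
usage_pct_ceil() {
    local used="$1" avail="$2"
    local total=$(( used + avail ))
    echo $(( (used * 100 + total - 1) / total ))
}
```

With these numbers, breakdown_total_tenths prints 611 (61.1GB), and usage_pct_ceil 611 140 prints 82.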

Automated Cleanup

1. Docker Cleanup (Weekly)

Timer: docker-cleanup.timer (weekly)
Script: /usr/local/bin/docker-cleanup.sh
Actions:

docker image prune -a -f  # Remove all images not used by any container
docker system prune -f    # Remove stopped containers, unused networks, dangling images, build cache

Expected impact: Frees 2-4GB per week
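The contents of /usr/local/bin/docker-cleanup.sh are not reproduced in this document. One way such a script could wrap the two prune commands is sketched below. The DRY_RUN switch and the function names are hypothetical additions for illustration, and the before/after lines simply echo df output so the freed space is visible in the timer's journal.

```bash
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Prune unused Docker data and report root filesystem usage around it.
docker_cleanup() {
    echo "before: $(df -h / | tail -n 1)"
    run docker image prune -a -f   # all images not used by any container
    run docker system prune -f     # stopped containers, unused networks, dangling images, build cache
    echo "after:  $(df -h / | tail -n 1)"
}
```

Calling docker_cleanup with DRY_RUN=1 exported prints the commands without touching Docker, which is a cheap way to check the script before enabling the timer.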

2. Log Rotation

Journald (/etc/systemd/journald.conf):
- SystemMaxUse=500M - Cap journal size at 500MB
- SystemKeepFree=2G - Journald leaves at least 2GB of disk free for other uses
- MaxFileSec=1week - Rotate journal files weekly

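Docker container logs (/etc/docker/daemon.json):
- max-size: 10m - Max 10MB per log file
- max-file: 3 - Keep 3 rotations (30MB total per container)

These limits apply only to containers created after the Docker daemon picks up the setting; existing containers keep their previous log configuration until they are recreated.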

Expected impact: Prevents runaway log growth, keeps logs <2GB
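To confirm the journald settings are in place and to estimate the worst-case log footprint, something like the following could be used. It is a sketch under assumptions: the helper names are hypothetical, the settings are assumed to live in the main journald.conf rather than a drop-in, and the container count passed to log_budget_mb is whatever applies on the host.

```bash
# Read a KEY=value setting from a systemd-style config file,
# ignoring commented-out lines. Prints nothing if the key is absent.
conf_value() {
    local key="$1" file="$2"
    sed -n "s/^[[:space:]]*${key}=//p" "$file" | tail -n 1
}

# Worst-case log footprint in MB: journal cap plus
# (max-size * max-file) for each container.
log_budget_mb() {
    local journal_mb="$1" file_mb="$2" files="$3" containers="$4"
    echo $(( journal_mb + file_mb * files * containers ))
}
```

For example, conf_value SystemMaxUse /etc/systemd/journald.conf should print 500M, and with a hypothetical 10 containers log_budget_mb 500 10 3 10 prints 800, comfortably under the 2GB target. journalctl --disk-usage reports what the journal actually occupies.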

3. Manual Cleanup Commands

When disk >85%, run these manually:

# Clean Docker
docker image prune -a -f
docker system prune -f

# Rotate logs
sudo journalctl --vacuum-size=500M

# Truncate large container logs (files are named <container-id>-json.log;
# the glob must expand as root, hence sh -c)
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

# Check what's consuming space
sudo du -xsh /* 2>/dev/null | sort -hr | head -10

Worker Pre-Crawl Disk Check

Threshold: 85%
Behavior: Worker skips job selection if disk >85%

This prevents starting crawls that would fail mid-flight due to disk pressure.
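The worker's own implementation is not shown here. A shell equivalent of the gate might look like the sketch below, which assumes the check is made against the root filesystem with GNU df; the function names are hypothetical.

```bash
# Returns 0 (proceed) when usage is at or below the threshold,
# non-zero (skip job selection) when it is above. Default threshold: 85.
disk_allows_crawl() {
    local pct="$1" threshold="${2:-85}"
    [ "$pct" -le "$threshold" ]
}

# Example gate for the root filesystem (GNU df).
precrawl_gate() {
    local pct
    pct="$(df --output=pcent / | tail -n 1 | tr -dc '0-9')"
    if disk_allows_crawl "$pct"; then
        echo "disk at ${pct}%: selecting next job"
    else
        echo "disk at ${pct}%: skipping job selection" >&2
        return 1
    fi
}
```

Keeping the comparison in its own function makes the threshold easy to test without depending on the state of the disk.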

Monitoring

Metrics: node_filesystem_avail_bytes, node_filesystem_size_bytes
Dashboard: Grafana "HealthArchive - Infrastructure"
Status command: healtharchive status (shows disk usage with color coding)
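The dashboard query itself is not recorded here. A common way to turn these two metrics into a usage percentage is 100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes). The same arithmetic is sketched below in shell with a hypothetical function name; it uses integer division, so it rounds down, whereas df rounds its Use% up.

```bash
# Used percent from the two byte counts, integer arithmetic (rounds down).
# Mirrors: 100 * (1 - avail / size)
used_pct_from_bytes() {
    local avail="$1" size="$2"
    echo $(( (size - avail) * 100 / size ))
}
```

For example, used_pct_from_bytes 14000000000 75000000000 prints 81, so a value computed this way can read a point below what df reports for the same disk.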

Troubleshooting

Disk >85% Sustained

Check Docker images: docker system df
Check logs: sudo du -sh /var/log
Check temp crawl dirs: du -xsh /srv/healtharchive/jobs/*/
Run manual cleanup (see above)

Disk >92% (Critical)

Stop active crawls if necessary: docker ps → docker stop <id>
Run all cleanup commands
Consider truncating container logs
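If still critical, investigate filesystem accounting by comparing sudo du -xsh / against df -h /; a large gap means space is held outside the visible tree (for example by deleted files that are still open)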

False Alarm: du Reports >100GB

If du -sh /srv/healtharchive/jobs/* reports huge sizes (>100GB), it's traversing SSHFS mounts and reporting remote storagebox data.

Fix: Use du -xsh to stay on local filesystem only:

sudo du -xsh /srv/healtharchive/jobs/*

Or just use df -h / for filesystem truth.
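One caveat: du -x stays on the filesystem of each argument it is given, so if a job directory is itself an SSHFS mount point, du -xsh on that directory still walks the remote side. The sketch below filters job directories by device number before measuring them. It assumes GNU stat, and the function names are hypothetical.

```bash
# True when PATH lives on the same filesystem (device) as REF.
# Uses GNU stat; %d prints the device number.
same_filesystem() {
    local path="$1" ref="${2:-/}"
    [ "$(stat -c %d "$path" 2>/dev/null)" = "$(stat -c %d "$ref")" ]
}

# Print only the subdirectories of BASE that share BASE's filesystem,
# which leaves out any that are mount points for remote storage.
local_job_dirs() {
    local base="$1" dir
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue
        if same_filesystem "$dir" "$base"; then
            echo "$dir"
        fi
    done
}
```

Usage: local_job_dirs /srv/healtharchive/jobs | xargs -r sudo du -xsh measures only the directories that actually consume local disk. findmnt -t fuse.sshfs lists the SSHFS mounts directly.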

History

2026-02-01: Established 82% baseline after Docker/log cleanup freed 5.4GB
2026-01-31: Disk pressure incident (89% → cleanup → 82%)
2026-01-24: Automated tiering for annual jobs deployed