WARC integrity verification (post-incident + pre-index)

Use this playbook when you suspect WARC corruption or replay integrity risk, especially after:

  • sshfs/FUSE mount instability (Errno 107: Transport endpoint is not connected)
  • unexpected crawler/container termination during WARC writes
  • manual intervention on job output directories

This playbook is intentionally procedural; for background see:

  • Roadmap/incident context: docs/planning/implemented/2026-01-08-storagebox-sshfs-stale-mount-recovery-and-integrity.md
  • Storage infra recovery: storagebox-sshfs-stale-mount-recovery.md

0) Safety rules (do not skip)

  • Never quarantine while a job is running.
  • Never quarantine after a job has been indexed (i.e., when Snapshot rows exist): moving WARCs breaks replay because Snapshot.warc_path must remain valid.
  • If verification fails with infra_error, treat it as a storage incident first (recover the mounts), not as corruption.

The CLI enforces the most important guards and will refuse unsafe operations.
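
If you need to confirm the "no Snapshot rows" precondition by hand, a direct query works. This is a sketch only: it assumes backend.env exports a Postgres DATABASE_URL and that snapshots live in a snapshots table keyed by job_id; adjust to the real schema.

sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; psql "$DATABASE_URL" -c "SELECT count(*) FROM snapshots WHERE job_id = <JOB_ID>;"'

A count of 0 means the job has not been indexed and quarantine is permitted.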

1) Pick a verification level (cost vs confidence)

The healtharchive verify-warcs command supports three levels:

  • Level 0 (cheap): file exists, is readable, size > 0
  • Level 1 (moderate, default): gzip stream integrity (detect truncation/CRC issues)
  • Level 2 (heavier): WARC parseability (iterates records and streams bodies)

Recommended posture on a single VPS:

  • Post-incident: Level 1 for WARCs touched during the incident window.
  • “Always on” before indexing: Level 0 (built into the indexing pipeline; optional deeper checks via env).
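
For intuition (or ad-hoc spot checks), the levels roughly correspond to standard tooling. The commands below are a hand-rolled approximation, not what verify-warcs actually runs; the Level 2 check assumes the warcio Python package is installed:

WARC=/path/to/file.warc.gz
test -r "$WARC" && test -s "$WARC" && echo "level 0 ok"   # exists, readable, non-empty
gzip -t "$WARC" && echo "level 1 ok"                      # gzip stream integrity
python3 - "$WARC" <<'EOF'                                 # level 2: iterate records
import sys
from warcio.archiveiterator import ArchiveIterator
with open(sys.argv[1], "rb") as f:
    print(sum(1 for _ in ArchiveIterator(f)), "records parsed")
EOF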

2) Verify WARCs for a job (report-only)

Run a report-only verification:

cd /opt/healtharchive
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; healtharchive verify-warcs --job-id <JOB_ID> --level 1'

Bound the work if you’re validating an incident window:

healtharchive verify-warcs --job-id <JOB_ID> --level 1 --since-minutes 180 --limit-warcs 50

Optional: write a Prometheus node_exporter textfile metric:

healtharchive verify-warcs --job-id <JOB_ID> --level 1 --metrics-file /var/lib/node_exporter/textfile_collector/healtharchive_warc_verify.prom
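
The metric names in that file are defined by the CLI. To confirm node_exporter is picking the file up (assuming its default port 9100; adjust the grep if the metric prefix differs):

cat /var/lib/node_exporter/textfile_collector/healtharchive_warc_verify.prom
curl -s localhost:9100/metrics | grep healtharchive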

3) If verification fails with infra_error

This is usually mount instability, not corruption.

  • Follow storagebox-sshfs-stale-mount-recovery.md.
  • After recovery, re-run verify-warcs.
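
A quick way to tell a dead mount from bad data, assuming the Storage Box is mounted at /mnt/storagebox (substitute the real mount point):

mountpoint -q /mnt/storagebox && echo mounted || echo "stale or unmounted"
ls /mnt/storagebox >/dev/null    # a stale FUSE endpoint fails here with Errno 107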

4) If verification fails with corrupt_or_unreadable (pre-index only)

If the job has no Snapshot rows (not indexed), quarantine the corrupt WARCs:

healtharchive verify-warcs --job-id <JOB_ID> --level 1 --apply-quarantine

This will:

  • move corrupt WARCs under <output_dir>/warcs_quarantine/<timestamp>/...
  • write <output_dir>/WARCS_QUARANTINED.txt with provenance + sha256
  • set the job back to retryable and reset retry_count so the worker can re-run it

Then let the worker pick it up (or restart the worker if it’s not running).
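
Before handing the job back to the worker, it is worth eyeballing what was moved. The worker unit name below is hypothetical; substitute the real one:

cat <output_dir>/WARCS_QUARANTINED.txt
find <output_dir>/warcs_quarantine -type f -exec sha256sum {} +    # compare against the provenance file
sudo systemctl restart healtharchive-worker    # hypothetical unit name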

5) If verification fails after indexing (snapshots exist)

Do not quarantine: this breaks replay.

Treat it as a critical integrity incident:

  • stop automated cleanup for the affected job
  • preserve the job output directory as-is
  • capture a verification report (--json-out recommended; see the command after this list)
  • decide whether to rebuild the dataset / replay from backups, or to re-crawl the affected source
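
A minimal capture step using the documented --json-out flag (Level 2 gives the fullest picture; the report path is arbitrary):

healtharchive verify-warcs --job-id <JOB_ID> --level 2 --json-out /root/warc-verify-<JOB_ID>.json
sha256sum /root/warc-verify-<JOB_ID>.json    # record the hash alongside the incident notes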

Escalate via ../core/incident-response.md and record the outcome in docs/operations/mentions-log.md.