WARC integrity verification (post-incident + pre-index)
Use this playbook when you suspect WARC corruption or replay integrity risk, especially after:
- sshfs/FUSE mount instability (`Errno 107: Transport endpoint is not connected`)
- unexpected crawler/container termination during WARC writes
- manual intervention on job output directories
This playbook is intentionally procedural; for background see:
- Roadmap/incident context: docs/planning/implemented/2026-01-08-storagebox-sshfs-stale-mount-recovery-and-integrity.md
- Storage infra recovery: storagebox-sshfs-stale-mount-recovery.md
0) Safety rules (do not skip)
- Never quarantine while a job is running.
- Never quarantine after a job has been indexed (i.e., when `Snapshot` rows exist): moving WARCs breaks replay because `Snapshot.warc_path` must remain valid.
- If verification failures are `infra_error`, treat it as a storage incident first (recover mounts), not corruption.
The CLI enforces the most important guards and will refuse unsafe operations.
1) Pick a verification level (cost vs confidence)
The `healtharchive verify-warcs` command supports three levels:
- Level 0 (cheap): file exists, is readable, size > 0
- Level 1 (moderate, default): gzip stream integrity (detect truncation/CRC issues)
- Level 2 (heavier): WARC parseability (iterate records; streams bodies)
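Approximately, Levels 0 and 1 correspond to checks you can run with standard tools. The functions below are a sketch of the assumed semantics, not the CLI's implementation; Level 2 needs a WARC-aware parser and is omitted:

```shell
# Sketch of Level 0/1 semantics using standard tools (assumed behavior).

# Level 0: file exists, is readable, and is non-empty.
check_level0() {
  [ -r "$1" ] && [ -s "$1" ]
}

# Level 1: Level 0 plus gzip stream integrity (catches truncation and
# CRC mismatches without parsing WARC records).
check_level1() {
  check_level0 "$1" && gzip -t "$1" 2>/dev/null
}
```

Note that `gzip -t` decompresses the whole stream, so Level 1 cost scales with file size.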
Recommended posture on a single VPS:
- Post-incident window: Level 1 for WARCs touched during the incident window.
- “Always on” before indexing: Level 0 (built into the indexing pipeline; optional deeper checks via env).
2) Verify WARCs for a job (report-only)
Run a report-only verification:
```shell
cd /opt/healtharchive
sudo bash -lc 'set -a; source /etc/healtharchive/backend.env; set +a; healtharchive verify-warcs --job-id <JOB_ID> --level 1'
```
If you’re validating an incident window, bound the work to the files touched during that window rather than re-verifying the whole job.
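The exact CLI flags for bounding a run aren’t covered here; as a rough equivalent, you can enumerate just the WARCs whose mtime falls inside the window with GNU `find` (paths and timestamps are placeholders):

```shell
# Print WARCs modified in [START, END) under DIR (GNU find).
# Usage: list_window_warcs DIR 'YYYY-MM-DD HH:MM' 'YYYY-MM-DD HH:MM'
list_window_warcs() {
  find "$1" -name '*.warc.gz' -newermt "$2" ! -newermt "$3" -print
}
```

Pipe the result into `gzip -t` (or your verification tooling) so that only incident-window files pay the Level 1 cost.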
Optional: write a Prometheus node_exporter textfile metric:
```shell
healtharchive verify-warcs --job-id <JOB_ID> --level 1 --metrics-file /var/lib/node_exporter/textfile_collector/healtharchive_warc_verify.prom
```
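The textfile collector expects one sample per line in the Prometheus exposition format; a hypothetical example of what such a file might contain (metric names are illustrative, not the CLI’s actual output):

```
# HELP healtharchive_warc_verify_failures WARCs failing verification (illustrative metric name)
# TYPE healtharchive_warc_verify_failures gauge
healtharchive_warc_verify_failures{job_id="42",level="1"} 0
```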
3) If verification fails with infra_error
This is usually mount instability, not corruption.
- Follow storagebox-sshfs-stale-mount-recovery.md.
- After recovery, re-run `verify-warcs`.
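A quick triage check: `stat` the mountpoint itself. On a stale sshfs/FUSE mount this typically fails (the kernel returns `Errno 107`), whereas genuinely corrupt files still answer `stat` normally. A sketch, with the mountpoint path as an assumption:

```shell
# Return 0 if PATH answers stat() calls, non-zero if the mount is stale.
mount_healthy() {
  stat "$1" >/dev/null 2>&1
}

# Example (assumed mountpoint):
#   mount_healthy /mnt/storagebox || echo "stale mount - recover before re-verifying"
```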
4) If verification fails with corrupt_or_unreadable (pre-index only)
If the job has no `Snapshot` rows (i.e., it has not been indexed), quarantine the corrupt WARCs. Quarantining will:
- move corrupt WARCs under `<output_dir>/warcs_quarantine/<timestamp>/...`
- write `<output_dir>/WARCS_QUARANTINED.txt` with provenance + sha256
- set the job back to `retryable` and reset `retry_count` so the worker can re-run it
Then let the worker pick it up (or restart the worker if it’s not running).
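For reference, the file-moving part of the quarantine is roughly equivalent to the manual sketch below (assumed behavior inferred from the effects listed above; always prefer the CLI, which enforces the safety guards):

```shell
# Move FILE under OUTPUT_DIR/warcs_quarantine/<timestamp>/ and append its
# sha256 to OUTPUT_DIR/WARCS_QUARANTINED.txt as a provenance record.
quarantine_warc() {
  local out_dir=$1 file=$2 ts qdir
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  qdir="$out_dir/warcs_quarantine/$ts"
  mkdir -p "$qdir"
  sha256sum "$file" >> "$out_dir/WARCS_QUARANTINED.txt"
  mv "$file" "$qdir/"
}
```

Resetting the job to `retryable` and clearing `retry_count` happens on the database side and is not shown here.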
5) If verification fails after indexing (snapshots exist)
Do not quarantine: this breaks replay.
Treat it as a critical integrity incident:
- stop automated cleanup for the affected job
- preserve the job output directory as-is
- capture a verification report (`--json-out` recommended)
- decide whether to rebuild the dataset / replay from backups, or to re-crawl the affected source
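Before deciding between rebuild and re-crawl, it also helps to freeze an evidence record: a sha256 manifest of every WARC in the affected output directory (a sketch; the manifest location is an assumption):

```shell
# Write a sha256 manifest of all WARCs under DIR to MANIFEST.
evidence_manifest() {
  find "$1" -name '*.warc.gz' -exec sha256sum {} + > "$2"
}
```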
Escalate via ../core/incident-response.md and record the outcome in docs/operations/mentions-log.md.