Incident: PHAC post-reboot tiering loss and fallback recovery (2026-04-20)

Status: closed

Metadata

  • Date (UTC): 2026-04-20
  • Severity (see operations/incidents/severity.md): sev2
  • Environment: production
  • Primary area: crawl
  • Owner: Jeremy Dawson
  • Start (UTC): 2026-04-20T13:20:03Z
  • End (UTC): 2026-04-20T14:35:09Z

Summary

After the VPS recovery window, PHAC annual job 7 restarted into a broken output-dir state: its hot path had drifted off the Storage Box tier and onto a local placeholder that the worker could not write to. Once tiering and writability were restored, a fresh Browsertrix attempt still failed on both PHAC seed documents with net::ERR_HTTP2_PROTOCOL_ERROR. The bounded recovery was to validate the playwright_warc fallback against the live seeds, deploy the fallback-WARC numbering fix, and resume PHAC under the fallback backend.

Impact

  • User-facing impact:
      • the 2026 annual campaign stayed not-search-ready longer than expected
      • PHAC was unavailable for new indexed search content while job 7 was down
  • Internal impact (ops burden, automation failures, etc):
      • manual VPS triage, storage repair, and job-state intervention were required
      • the current ops tracker and planning docs drifted behind the live PHAC state
  • Data impact:
      • Data loss: no
      • Data integrity risk: yes, before the fallback-WARC numbering fix was deployed
      • Recovery completeness: complete
  • Duration:
      • about 75 minutes from the failed Browsertrix restart to the healthy fallback restart
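
The duration figure can be re-derived from the Start and End timestamps in the Metadata section. The following is a minimal sketch; the helper name `elapsed_minutes_seconds` is illustrative and not part of the healtharchive tooling.

```python
from datetime import datetime


def elapsed_minutes_seconds(start_utc: str, end_utc: str) -> tuple[int, int]:
    """Return (minutes, seconds) between two ISO 8601 UTC timestamps."""
    # The report writes UTC with a trailing "Z"; swap it for an explicit
    # offset so datetime.fromisoformat accepts it on older Python versions.
    start = datetime.fromisoformat(start_utc.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_utc.replace("Z", "+00:00"))
    total_seconds = int((end - start).total_seconds())
    return divmod(total_seconds, 60)


if __name__ == "__main__":
    minutes, seconds = elapsed_minutes_seconds(
        "2026-04-20T13:20:03Z", "2026-04-20T14:35:09Z"
    )
    print(f"{minutes} min {seconds} s")
```

With the Metadata values this gives 75 minutes and 6 seconds, which matches the rounded figure above.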

Detection

  • Detected during operator-led post-reboot annual-crawl verification with:
      • ./scripts/vps-crawl-status.sh --year 2026 --job-id 7
      • healtharchive show-job --id 7
      • findmnt -T "$OUT_DIR"
      • worker-user writability probes on the PHAC output dir
  • Most useful signals:
      • show-job / combined-log tails showing .archive_state.json permission failures and then fresh Browsertrix ERR_HTTP2_PROTOCOL_ERROR seed failures
      • probe-browser-fetch demonstrating that the pinned playwright_warc runtime could fetch both PHAC seeds successfully
      • crawl metrics confirming fallback activation and sustained progress once the fallback run was live
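
The report does not record the exact writability probe that was used. As a hedged illustration, a probe of this kind can be sketched as below: it tries to create and delete a small file in the output dir, which exercises the same kind of write that failed for .archive_state.json. The function name `probe_writable` and the probe file prefix are assumptions, and the script would have to be run as the worker user to be meaningful.

```python
import os
import sys
import tempfile


def probe_writable(out_dir: str) -> tuple[bool, str]:
    """Try to create, write, and remove a small file inside out_dir."""
    try:
        # mkstemp fails with an OSError subclass if the directory is
        # missing, read-only, or not writable by the current user.
        fd, path = tempfile.mkstemp(prefix=".write_probe_", dir=out_dir)
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    try:
        os.write(fd, b"probe\n")
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    finally:
        # Always clean up the probe file, even when the write failed.
        os.close(fd)
        os.unlink(path)
    return True, "ok"


if __name__ == "__main__":
    ok, detail = probe_writable(sys.argv[1])
    print(f"writable={ok} detail={detail}")
    sys.exit(0 if ok else 1)
```

A check based only on permission bits can be misleading on network mounts, which is why the sketch performs a real write instead.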

Decision log

  • 2026-04-20T14:25:13Z — Decision: use probe-browser-fetch before any PHAC backend promotion (why: confirm the fallback runtime is viable on the actual seed URLs, risks: adds a bounded operator step but avoids blind restarts)
  • 2026-04-20T14:30:00Z — Decision: deploy the fallback-WARC numbering fix before re-running PHAC under playwright_warc (why: prevent fallback reruns from overwriting warc-000001.warc.gz, risks: delays restart until the repo fix is deployed)
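
To illustrate the numbering concern behind the second decision, here is a minimal sketch of "append to the next free slot" logic. It is not the deployed repo fix. It assumes stable WARCs follow the warc-NNNNNN.warc.gz pattern seen in this report and that "next free" means one above the highest existing number; the function name `next_warc_name` is hypothetical.

```python
import re

# Matches stable WARC names such as warc-000001.warc.gz (six-digit counter).
WARC_NAME_RE = re.compile(r"^warc-(\d{6})\.warc\.gz$")


def next_warc_name(existing_names) -> str:
    """Return the stable WARC name one above the highest existing number."""
    highest = 0
    for name in existing_names:
        match = WARC_NAME_RE.match(name)
        if match:
            highest = max(highest, int(match.group(1)))
    # Start at 1 for an empty dir; otherwise never reuse an existing slot.
    return f"warc-{highest + 1:06d}.warc.gz"


if __name__ == "__main__":
    existing = ["warc-000001.warc.gz", "warc-000002.warc.gz", "notes.txt"]
    print(next_warc_name(existing))
```

A backend that always starts at warc-000001.warc.gz would overwrite earlier captures on a rerun, whereas deriving the name from the directory contents avoids that.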

Timeline (UTC)

  • 2026-04-20T13:20:03Z — PHAC job 7 starts again after the earlier recovery window.
  • 2026-04-20T13:21:00Z — Fresh Browsertrix attempt logs net::ERR_HTTP2_PROTOCOL_ERROR on https://www.canada.ca/en/public-health.html.
  • 2026-04-20T13:21:01Z — Fresh Browsertrix attempt logs the same error on https://www.canada.ca/fr/sante-publique.html.
  • 2026-04-20T13:21:43Z — Post-reboot verification snapshot shows PHAC running with WARC files (discovered): 273 but WARC files: 0, and no indexed pages.
  • 2026-04-20T14:24:51Z — healtharchive probe-browser-fetch starts against the two PHAC seed URLs.
  • 2026-04-20T14:25:13Z — Probe confirms both seed URLs are fetchable via playwright_warc with 200.
  • 2026-04-20T14:30:03Z — PHAC is recovered from stale-running state to retryable, then patched to capture_backend=playwright_warc.
  • 2026-04-20T14:35:09Z — PHAC starts cleanly under playwright_warc.
  • 2026-04-20T15:00:13Z — Live crawl verification shows PHAC progressing under fallback with crawled=191, failed=0, and new stable WARCs appended above warc-000273.warc.gz.

Root cause

  • Immediate trigger:
      • post-reboot annual output-dir tiering drift left the PHAC hot path on an unwritable local placeholder, so worker state writes failed
  • Underlying cause(s):
      • the annual output-dir topology still relied on direct sshfs mounts rather than the intended bind-mount layout, making post-reboot drift harder to reason about
      • PHAC seed fetches still reproduce Browsertrix/Canada.ca HTTP/2 document failures even after the managed Browsertrix config and --disable-http2 propagation fixes

Contributing factors

  • The production run started with older fallback-backend code that would have reused warc-000001.warc.gz on reruns, so the safe fallback relaunch had to wait for a repo deployment.
  • PHAC and CIHR were both active, which raised the bar for any maintenance that could interrupt live output dirs.
  • The current ops roadmap still described PHAC as failed/parked, so the documentation disagreed with the live state during the recovery session.

Resolution / Recovery

  1. Verified the Storage Box tier was healthy and isolated the PHAC output-dir problem to the hot-path drift / writability failure.
  2. Re-ran annual output-dir tiering with the production backend environment loaded so the helper talked to PostgreSQL instead of falling back to SQLite.
  3. Confirmed PHAC output-dir writability for the worker user again.
  4. Let the fresh Browsertrix retry prove the deeper failure mode: both PHAC seeds failed with net::ERR_HTTP2_PROTOCOL_ERROR.
  5. Ran healtharchive probe-browser-fetch for the PHAC seed URLs and verified playwright_warc succeeded on both.
  6. Patched PHAC job 7 to capture_backend=playwright_warc.
  7. Deployed the repo fix that makes fallback backends append to the next free stable WARC slot.
  8. Let the worker restart PHAC under fallback and verified healthy crawl progress plus appended stable WARC numbering.
  9. Repaired HC replay collection ownership/readiness for job 6 and redeployed the replay-reconcile systemd template so future replay automation no longer runs as haadmin.
  10. Traced the remaining HC public replay 502 to a malformed archived cookie header line (AWSALBCORS=...) that pywb surfaced and Caddy rejected, then deployed replay header sanitization plus the replay unit fix that loads /srv/healtharchive/replay/sitecustomize.py via PYTHONPATH=/webarchive.
  11. Optimized raw snapshot WARC lookup, updated the public verifier to report transport timeouts cleanly, and split the verifier timeout budget so the raw HTML probe can tolerate slower WARC reads without masking other checks.
  12. Re-ran ./scripts/verify_public_surface.py successfully on 2026-04-23 with replay and raw snapshot checks both passing.
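
Step 10 describes sanitizing replayed headers so a malformed archived line no longer reaches Caddy. The actual contents of sitecustomize.py are not shown in this report, so the following is only a sketch of the general idea under one assumption: that the offending line was malformed because it lacked a valid `name:` prefix. The function name `sanitize_header_lines` is hypothetical.

```python
import re

# A header line must start with a field name made of HTTP token characters,
# followed by a colon. Anything else is treated as malformed.
FIELD_LINE_RE = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+:")


def sanitize_header_lines(lines):
    """Split raw header lines into (kept, dropped) by basic well-formedness."""
    kept, dropped = [], []
    for line in lines:
        if FIELD_LINE_RE.match(line):
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped


if __name__ == "__main__":
    # The cookie value is elided in the incident record and is kept as-is here.
    raw = ["Content-Type: text/html", "AWSALBCORS=...", "Cache-Control: no-cache"]
    kept, dropped = sanitize_header_lines(raw)
    print("kept:", kept)
    print("dropped:", dropped)
```

Returning the dropped lines separately makes it possible to log what was removed instead of silently discarding archived data.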

Post-incident verification

What we did to confirm we’re actually healthy (and not just “running”).

  • Public surface checks:
      • probe-browser-fetch returned 200 for both PHAC seed pages
      • ./scripts/verify_public_surface.py passed on 2026-04-23 after the replay and raw-snapshot follow-through was complete
  • Worker/job health checks:
      • healtharchive show-job --id 7
      • ./scripts/vps-crawl-status.sh --year 2026 --job-id 7
      • crawl metrics showed configured_backend=playwright_warc, fallback_active=1, failed=0, and fresh progress
  • Storage/mount checks (if relevant):
      • findmnt -T "$OUT_DIR"
      • worker-user writability probe on the PHAC output dir
  • Integrity checks (if relevant):
      • stable WARC numbering advanced to warc-000275.warc.gz and higher instead of reusing warc-000001.warc.gz
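
The crawl-metrics check above could be scripted roughly as follows. This is a sketch under stated assumptions: the metrics are exposed in the Prometheus text format, each sample carries a `job_id` label, and 600 seconds is an acceptable progress age. The label name, the threshold, and the function names are all hypothetical; only the metric names come from this report.

```python
import re

SAMPLE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)"  # metric name
    r"(?:\{(?P<labels>[^}]*)\})?"           # optional {label="value",...}
    r"\s+(?P<value>\S+)"                    # sample value
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

FALLBACK_ACTIVE = "healtharchive_crawl_running_job_fallback_active"
PROGRESS_AGE = "healtharchive_crawl_running_job_last_progress_age_seconds"


def job_samples(text: str, job_id: str) -> dict:
    """Collect metric name -> value for samples labelled with job_id."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        labels = dict(LABEL_RE.findall(match["labels"] or ""))
        if labels.get("job_id") == job_id:
            values[match["name"]] = float(match["value"])
    return values


def fallback_looks_healthy(text: str, job_id: str, max_progress_age_s: float = 600.0) -> bool:
    """True when fallback is active and progress is recent for job_id."""
    values = job_samples(text, job_id)
    return (
        values.get(FALLBACK_ACTIVE) == 1.0
        and values.get(PROGRESS_AGE, float("inf")) <= max_progress_age_s
    )
```

Filtering by job matters here because PHAC and CIHR were both running, so an unfiltered check could pass on the wrong job's samples.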

Open questions (still unknown)

  • Should PHAC remain Browsertrix-first in future annual campaigns, or should the source default change now that the 2026 fallback run has been indexed?
  • Is the temporary PHAC public-health-notices exclusion still necessary after the fallback run has been indexed and coverage is reviewed?

Action items (TODOs)

  • PHAC completion/indexing outcome: the fallback crawl completed, reconcile-completed-indexing --source phac --limit 1 was rerun under nohup, and job 7 indexed successfully on 2026-04-29 with 121940 snapshots; the PHAC annual edition report now shows research_ready with 20723 captured URLs. (completed 2026-04-29) (owner=Jeremy Dawson, priority=high, due=2026-04-21)
  • Index HC job 6 once the annual run window allows it (completed 2026-04-23; 262567 snapshots indexed) (owner=Jeremy Dawson, priority=high, due=2026-04-21)
  • Finish the HC replay indexing/ownership repair: replay-reconcile --apply --job-id 6 succeeded once rerun as root, and c9600341 redeployed the replay-reconcile systemd template so future automation no longer runs as haadmin (completed 2026-04-23) (owner=Jeremy Dawson, priority=high, due=2026-04-24)
  • Deploy the API-side browse-URL suppression patch so public browseUrl fields are omitted whenever a job’s replay collection is missing or incomplete (completed 2026-04-23 via c9600341) (owner=Jeremy Dawson, priority=high, due=2026-04-24)
  • Fix the remaining HC public replay 502: repaired replay ownership/readiness, then fixed the malformed archived cookie-header path by loading replay header sanitization through PYTHONPATH=/webarchive; public replay and browseUrl verification now pass (completed 2026-04-23 via c9600341, 8f9558d6, and ca085c58) (owner=Jeremy Dawson, priority=high, due=2026-04-24)
  • Stabilize raw snapshot public verification: optimized WARC lookup and updated the verifier to handle transport timeouts cleanly with a dedicated raw-snapshot timeout budget; ./scripts/verify_public_surface.py now passes with default settings on production (completed 2026-04-23 via 2b0b4001, 88e97736, and a27a0d05) (owner=Jeremy Dawson, priority=high, due=2026-04-24)
  • Revisit PHAC’s long-term Browsertrix/default-backend strategy after reviewing the indexed fallback coverage (owner=Jeremy Dawson, priority=medium, due=2026-04-30)
  • Restart the worker during the next safe maintenance window after PHAC/CIHR are idle so the deployed a3e0dece worker-side rowcount/logging fix becomes active in production (owner=Jeremy Dawson, priority=medium, due=2026-05-15)
  • Review the preserved VPS branch prod-pre-a3e0dece and decide whether its detached pre-deploy commits (d8e2534e, 607df02b, 48cfe3f9) need cherry-pick, replacement, or explicit retirement (owner=Jeremy Dawson, priority=medium, due=2026-05-01)
  • Convert annual output dirs from direct sshfs mounts to bind mounts during the next acceptable maintenance window after the annual crawl is idle (owner=Jeremy Dawson, priority=medium, due=2026-05-15)

Automation opportunities

  • Keep the post-reboot verification path centered on annual-status, show-job, vps-crawl-status.sh, and probe-browser-fetch so future PHAC recoveries do not rely on ad hoc log archaeology.
  • The direct-sshfs annual output-dir topology should still be retired in a later maintenance window; that remains the long-term reduction in post-reboot tiering drift risk.

References / Artifacts

  • ./scripts/vps-crawl-status.sh snapshot(s):
      • timestamp_utc=2026-04-20T13:21:43Z
      • timestamp_utc=2026-04-20T15:08:00Z
  • Relevant log path(s):
      • /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101/archive_initial_crawl_-_attempt_1_20260420_132017.combined.log
      • /srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101/archive_playwright_warc_capture_20260420_143518.combined.log
  • Metric names:
      • healtharchive_crawl_running_job_configured_backend_info
      • healtharchive_crawl_running_job_fallback_active
      • healtharchive_crawl_running_job_crawl_rate_ppm
      • healtharchive_crawl_running_job_last_progress_age_seconds
  • Related playbooks/runbooks:
      • docs/operations/playbooks/validation/post-reboot-tiering-verify.md
      • docs/operations/playbooks/crawl/annual-campaign.md
      • docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
      • docs/operations/healtharchive-ops-roadmap.md