Incident: PHAC post-reboot tiering loss and fallback recovery (2026-04-20)
Status: closed
Metadata
- Date (UTC): 2026-04-20
- Severity (see `operations/incidents/severity.md`): sev2
- Environment: production
- Primary area: crawl
- Owner: Jeremy Dawson
- Start (UTC): 2026-04-20T13:20:03Z
- End (UTC): 2026-04-20T14:35:09Z
Summary
After the VPS recovery window, PHAC annual job `7` restarted into a broken output-dir state: its hot path had drifted off the Storage Box tier and onto a local placeholder that the worker could not write to. Once tiering and writability were restored, a fresh Browsertrix attempt still failed both PHAC seed documents with `net::ERR_HTTP2_PROTOCOL_ERROR`. The bounded recovery was to validate the `playwright_warc` fallback on the live seeds, deploy the fallback-WARC numbering fix, and resume PHAC under the fallback backend.
Impact
- User-facing impact:
- the 2026 annual campaign remained not search-ready for longer than expected
- PHAC was unavailable for new indexed search content while job `7` was down
- Internal impact (ops burden, automation failures, etc.):
- manual VPS triage, storage repair, and job-state intervention were required
- the current ops tracker and planning docs drifted behind the live PHAC state
- Data impact:
- Data loss: no
- Data integrity risk: yes, before the fallback-WARC numbering fix was deployed
- Recovery completeness: complete
- Duration:
- about 75 minutes from the failed Browsertrix restart to the healthy fallback restart
Detection
- Detected during operator-led post-reboot annual-crawl verification with:
  - `./scripts/vps-crawl-status.sh --year 2026 --job-id 7`
  - `healtharchive show-job --id 7`
  - `findmnt -T "$OUT_DIR"`
  - worker-user writability probes on the PHAC output dir
- Most useful signals:
  - `show-job` / combined-log tails showing `.archive_state.json` permission failures and then fresh Browsertrix `ERR_HTTP2_PROTOCOL_ERROR` seed failures
  - `probe-browser-fetch` demonstrating that the pinned `playwright_warc` runtime could fetch both PHAC seeds successfully
  - crawl metrics confirming fallback activation and sustained progress once the fallback run was live
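The worker-user writability probe listed above can be sketched as a minimal check; this is a hypothetical helper, not the actual healtharchive code. It probes by actually creating and removing a file, since mode bits alone can be misleading on `sshfs`/FUSE mounts:

```python
import os
import tempfile


def dir_writable(path: str) -> bool:
    """Probe effective writability of a directory by creating (and
    immediately removing) a temp file inside it. Returns False on any
    OS-level failure, including a missing or read-only mount point."""
    try:
        fd, tmp = tempfile.mkstemp(dir=path, prefix=".writability-probe-")
        os.close(fd)
        os.unlink(tmp)
        return True
    except OSError:
        return False
```

In the incident, a probe like this (run as the worker user) is what separated "mount looks present" from "worker can actually persist `.archive_state.json`".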
Decision log
- 2026-04-20T14:25:13Z — Decision: use `probe-browser-fetch` before any PHAC backend promotion (why: confirm the fallback runtime is viable on the actual seed URLs; risks: adds a bounded operator step but avoids blind restarts)
- 2026-04-20T14:30:00Z — Decision: deploy the fallback-WARC numbering fix before re-running PHAC under `playwright_warc` (why: prevent fallback reruns from overwriting `warc-000001.warc.gz`; risks: delays restart until the repo fix is deployed)
Timeline (UTC)
- 2026-04-20T13:20:03Z — PHAC job `7` starts again after the earlier recovery window.
- 2026-04-20T13:21:00Z — Fresh Browsertrix attempt logs `net::ERR_HTTP2_PROTOCOL_ERROR` on `https://www.canada.ca/en/public-health.html`.
- 2026-04-20T13:21:01Z — Fresh Browsertrix attempt logs the same error on `https://www.canada.ca/fr/sante-publique.html`.
- 2026-04-20T13:21:43Z — Post-reboot verification snapshot shows PHAC running with `WARC files (discovered): 273` but `WARC files: 0`, and no indexed pages.
- 2026-04-20T14:24:51Z — `healtharchive probe-browser-fetch` starts against the two PHAC seed URLs.
- 2026-04-20T14:25:13Z — Probe confirms both seed URLs are fetchable via `playwright_warc` with `200`.
- 2026-04-20T14:30:03Z — PHAC is recovered from stale-running state to `retryable`, then patched to `capture_backend=playwright_warc`.
- 2026-04-20T14:35:09Z — PHAC starts cleanly under `playwright_warc`.
- 2026-04-20T15:00:13Z — Live crawl verification shows PHAC progressing under fallback with `crawled=191`, `failed=0`, and new stable WARCs appended above `warc-000273.warc.gz`.
Root cause
- Immediate trigger:
- post-reboot annual output-dir tiering drift left the PHAC hot path on an unwritable local placeholder, so worker state writes failed
- Underlying cause(s):
- the annual output-dir topology still relied on direct `sshfs` mounts rather than the intended bind-mount layout, making post-reboot drift harder to reason about
- PHAC seed fetches still reproduce Browsertrix/Canada.ca HTTP/2 document failures even after the managed Browsertrix config and `--disable-http2` propagation fixes
Contributing factors
- The production run started with the older fallback backends that would have reused `warc-000001.warc.gz` on reruns, so the safe fallback relaunch had to wait for a repo deployment.
- PHAC and CIHR were both active, which raised the bar for any maintenance that could interrupt live output dirs.
- The current ops roadmap still described PHAC as failed/parked, which added documentation drift during the live recovery session.
Resolution / Recovery
- Verified the Storage Box tier was healthy and isolated the PHAC output-dir problem to the hot-path drift / writability failure.
- Re-ran annual output-dir tiering with the production backend environment loaded so the helper talked to PostgreSQL instead of falling back to SQLite.
- Confirmed PHAC output-dir writability for the worker user again.
- Let the fresh Browsertrix retry prove the deeper failure mode: both PHAC seeds failed with `net::ERR_HTTP2_PROTOCOL_ERROR`.
- Ran `healtharchive probe-browser-fetch` for the PHAC seed URLs and verified `playwright_warc` succeeded on both.
- Patched PHAC job `7` to `capture_backend=playwright_warc`.
- Deployed the repo fix that makes fallback backends append to the next free stable WARC slot.
- Let the worker restart PHAC under fallback and verified healthy crawl progress plus appended stable WARC numbering.
- Repaired HC replay collection ownership/readiness for job `6` and redeployed the replay-reconcile systemd template so future replay automation no longer runs as `haadmin`.
- Traced the remaining HC public replay `502` to a malformed archived cookie header line (`AWSALBCORS=...`) that pywb surfaced and Caddy rejected, then deployed replay header sanitization plus the replay unit fix that loads `/srv/healtharchive/replay/sitecustomize.py` via `PYTHONPATH=/webarchive`.
- Optimized raw snapshot WARC lookup, updated the public verifier to report transport timeouts cleanly, and split the verifier timeout budget so the raw HTML probe can tolerate slower WARC reads without masking other checks.
- Re-ran `./scripts/verify_public_surface.py` successfully on 2026-04-23 with replay and raw snapshot checks both passing.
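The fallback-WARC numbering fix deployed during this recovery enforces a simple invariant: fallback reruns must append after the highest existing stable WARC instead of reusing `warc-000001.warc.gz`. A minimal sketch of that slot selection (hypothetical helper name, not the actual repo code):

```python
import re
from pathlib import Path


def next_warc_slot(out_dir: Path, prefix: str = "warc-") -> Path:
    """Return the next free stable WARC path in out_dir.

    With warc-000273.warc.gz as the highest existing file, this yields
    warc-000274.warc.gz; in an empty dir it yields warc-000001.warc.gz.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d{{6}})\.warc\.gz$")
    highest = 0
    for entry in out_dir.iterdir():
        match = pattern.match(entry.name)
        if match:
            highest = max(highest, int(match.group(1)))
    return out_dir / f"{prefix}{highest + 1:06d}.warc.gz"
```

Scanning for the highest existing number (rather than counting files) keeps the invariant correct even when earlier slots have gaps.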
Post-incident verification
What we did to confirm we’re actually healthy (and not just “running”).
- Public surface checks:
  - `probe-browser-fetch` returned `200` for both PHAC seed pages
  - `./scripts/verify_public_surface.py` passed on 2026-04-23 after the replay and raw-snapshot follow-through was complete
- Worker/job health checks:
  - `healtharchive show-job --id 7`
  - `./scripts/vps-crawl-status.sh --year 2026 --job-id 7`
  - crawl metrics showed `configured_backend=playwright_warc`, `fallback_active=1`, `failed=0`, and fresh progress
- Storage/mount checks (if relevant):
  - `findmnt -T "$OUT_DIR"`
  - worker-user writability probe on the PHAC output dir
- Integrity checks (if relevant):
  - stable WARC numbering advanced to `warc-000275.warc.gz` and higher instead of reusing `warc-000001.warc.gz`
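The crawl-metric conditions above amount to one invariant that future post-incident checks could script directly. A sketch against a parsed metrics mapping (field names assumed from the metrics quoted in this report, not a real healtharchive API):

```python
def fallback_run_healthy(metrics: dict) -> bool:
    """True only when the job is configured for playwright_warc, the
    fallback is actually active, and no page has failed. Missing keys
    are treated as unhealthy rather than silently passing."""
    return (
        metrics.get("configured_backend") == "playwright_warc"
        and metrics.get("fallback_active") == 1
        and metrics.get("failed") == 0
    )
```

Treating absent fields as unhealthy matters here: the incident's "running but not writing" state is exactly the kind of half-alive condition a permissive check would miss.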
Open questions (still unknown)
- Should PHAC remain Browsertrix-first in future annual campaigns, or should the source default change now that the 2026 fallback run has been indexed?
- Is the temporary PHAC `public-health-notices` exclusion still necessary after the fallback run has been indexed and coverage is reviewed?
Action items (TODOs)
- PHAC completion/indexing outcome: the fallback crawl completed, `reconcile-completed-indexing --source phac --limit 1` was rerun under `nohup`, and job `7` indexed successfully on 2026-04-29 with `121940` snapshots; the PHAC annual edition report now shows `research_ready` with `20723` captured URLs. (completed 2026-04-29) (owner=Jeremy Dawson, priority=high, due=2026-04-21)
- Index HC job `6` once the annual run window allows it (completed 2026-04-23; `262567` snapshots indexed) (owner=Jeremy Dawson, priority=high, due=2026-04-21)
- Finish the HC replay indexing/ownership repair: `replay-reconcile --apply --job-id 6` succeeded once rerun as root, and `c9600341` redeployed the replay-reconcile systemd template so future automation no longer runs as `haadmin` (completed 2026-04-23) (owner=Jeremy Dawson, priority=high, due=2026-04-24)
- Deploy the API-side browse-URL suppression patch so public `browseUrl` fields are omitted whenever a job's replay collection is missing or incomplete (completed 2026-04-23 via `c9600341`) (owner=Jeremy Dawson, priority=high, due=2026-04-24)
- Fix the remaining HC public replay `502`: repaired replay ownership/readiness, then fixed the malformed archived cookie-header path by loading replay header sanitization through `PYTHONPATH=/webarchive`; public replay and `browseUrl` verification now pass (completed 2026-04-23 via `c9600341`, `8f9558d6`, and `ca085c58`) (owner=Jeremy Dawson, priority=high, due=2026-04-24)
- Stabilize raw snapshot public verification: optimized WARC lookup and updated the verifier to handle transport timeouts cleanly with a dedicated raw-snapshot timeout budget; `./scripts/verify_public_surface.py` now passes with default settings on production (completed 2026-04-23 via `2b0b4001`, `88e97736`, and `a27a0d05`) (owner=Jeremy Dawson, priority=high, due=2026-04-24)
- Revisit PHAC's long-term Browsertrix/default-backend strategy after reviewing the indexed fallback coverage (owner=Jeremy Dawson, priority=medium, due=2026-04-30)
- Restart the worker during the next safe maintenance window after PHAC/CIHR are idle so the deployed `a3e0dece` worker-side rowcount/logging fix becomes active in production (owner=Jeremy Dawson, priority=medium, due=2026-05-15)
- Review the preserved VPS branch `prod-pre-a3e0dece` and decide whether its detached pre-deploy commits (`d8e2534e`, `607df02b`, `48cfe3f9`) need cherry-pick, replacement, or explicit retirement (owner=Jeremy Dawson, priority=medium, due=2026-05-01)
- Convert annual output dirs from direct `sshfs` mounts to bind mounts during the next acceptable maintenance window after the annual crawl is idle (owner=Jeremy Dawson, priority=medium, due=2026-05-15)
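The replay header sanitization referenced in the items above can be illustrated with a minimal filter; this is a sketch only, and does not reproduce the deployed `sitecustomize.py` hook. The idea: drop archived header lines whose names are not valid HTTP tokens, or whose values carry control characters, since a strict fronting proxy such as Caddy will reject the whole response otherwise.

```python
import re

# RFC 7230 "token" characters permitted in an HTTP header field name.
_TOKEN = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")


def sanitize_headers(headers):
    """Filter (name, value) header pairs, dropping any pair a strict
    proxy would reject: non-token names, or values containing control
    characters other than horizontal tab."""
    clean = []
    for name, value in headers:
        if not _TOKEN.match(name):
            continue
        if any((ord(c) < 0x20 and c != "\t") or ord(c) == 0x7F for c in value):
            continue
        clean.append((name, value))
    return clean
```

Dropping the offending line (rather than attempting to repair it) matches the recovery's priority: a replayed page missing one archived cookie header beats a public `502`.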
Automation opportunities
- Keep the post-reboot verification path centered on `annual-status`, `show-job`, `vps-crawl-status.sh`, and `probe-browser-fetch` so future PHAC recoveries do not rely on ad hoc log archaeology.
- The direct-`sshfs` annual output-dir topology should still be retired in a later maintenance window; that remains the long-term reduction in post-reboot tiering drift risk.
References / Artifacts
- `./scripts/vps-crawl-status.sh` snapshot(s):
  - `timestamp_utc=2026-04-20T13:21:43Z`
  - `timestamp_utc=2026-04-20T15:08:00Z`
- Relevant log path(s):
  - `/srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101/archive_initial_crawl_-_attempt_1_20260420_132017.combined.log`
  - `/srv/healtharchive/jobs/phac/20260101T000502Z__phac-20260101/archive_playwright_warc_capture_20260420_143518.combined.log`
- Metric names:
  - `healtharchive_crawl_running_job_configured_backend_info`
  - `healtharchive_crawl_running_job_fallback_active`
  - `healtharchive_crawl_running_job_crawl_rate_ppm`
  - `healtharchive_crawl_running_job_last_progress_age_seconds`
- Related playbooks/runbooks:
  - `docs/operations/playbooks/validation/post-reboot-tiering-verify.md`
  - `docs/operations/playbooks/crawl/annual-campaign.md`
  - `docs/operations/playbooks/storage/storagebox-sshfs-stale-mount-recovery.md`
  - `docs/operations/healtharchive-ops-roadmap.md`