# Incident notes (internal)
This folder contains incident notes / lightweight postmortems for production and operations issues.
Goals:
- Capture what happened (timeline + impact) while it’s fresh.
- Record the root cause and recovery steps (so we can repeat them safely).
- Track “still to do” work that reduces repeat incidents (docs, guardrails, automation).
Related:
- Operator recovery steps: ../playbooks/core/incident-response.md
- Ops playbooks index: ../playbooks/README.md
- Severity rubric: severity.md
## What goes here
Use an incident note when any of the following are true:
- Public site/API degraded or down.
- Crawl/indexing is stuck, repeatedly failing, or risking data integrity.
- Storage/mount issues (e.g. Errno 107) or “hot path” problems.
- You had to intervene manually beyond routine operations.
Do not use incident notes for planned maintenance; record that in the changelog (and/or a runbook update) instead.
## Naming
One file per incident:
- `YYYY-MM-DD-<short-slug>.md` (use the UTC date of incident start).
- If multiple incidents share a date, add a suffix: `...-a`, `...-b`.

Examples:
- `2026-01-09-annual-crawl-phac-output-dir-permission-denied.md`
- `2026-01-16-replay-smoke-503-and-warctieringfailed.md`
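The convention above is easy to script. A minimal sketch (the slug here is a made-up example; pick one that describes the incident):

```shell
# Build an incident-note filename from the UTC start date and a short slug.
slug="api-502-bad-gateway"                 # hypothetical example slug
fname="$(date -u +%Y-%m-%d)-${slug}.md"    # UTC date of incident start
echo "$fname"
```

For a second incident on the same date, append `-a`/`-b` before `.md`.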
## Incident notes index
- CIHR WARC-complete crawl resumed after ZIM build failure (2026-05-03)
- PHAC post-reboot tiering loss and fallback recovery (2026-04-20)
- Annual crawl PHAC canada.ca HTTP/2 thrash (2026-03-23)
- API 502 Bad Gateway (2026-02-20)
- API search/changes 500 due to missing dedupe migration (2026-02-06)
- Auto-recover stall detection bugs (2026-02-06)
- Annual crawl output dirs on root disk (2026-02-04)
- Storage watchdog unmount filter bug (2026-02-02)
- Infra error 107 + hot-path thrash + worker stop (2026-01-24)
- Replay smoke 503 + tiering failures (2026-01-16)
- Annual crawl stalled (HC) (2026-01-09)
- PHAC output dir permission denied (2026-01-09)
- Storage hot-path sshfs stale mount (2026-01-08)
## How to write one
1) Copy the template: ../../_templates/incident-template.md
2) Fill in the top metadata and a short summary immediately.
3) Add a timeline (UTC) as you work.
4) After recovery, fill in the root cause + follow-ups.
5) Link any follow-up playbooks/runbooks/roadmaps you touched.
6) If this incident changes user expectations (outage/degradation, integrity risk, security posture, policy change), add a public-safe note in /changelog and/or /status (no sensitive details; changelog process: https://github.com/jerdaw/healtharchive/blob/main/frontend/docs/changelog-process.md).
The template includes an Action items (TODOs) section; use checkboxes so it’s obvious what work remains.
If the incident requires engineering work (automation, new scripts, behavior changes), capture it as a follow-up and create a focused implementation plan under docs/planning/ (then link it from the incident note).
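The first few steps can be sketched as shell commands. This runs in a scratch directory with a stand-in template body so it is self-contained; in practice, copy ../../_templates/incident-template.md into this folder instead.

```shell
# Steps 1-3 in miniature: copy the template, then append UTC timeline
# entries as you work. The template body below is a stand-in.
work="$(mktemp -d)"
printf '# Incident: <title>\n\n## Timeline (UTC)\n' > "$work/incident-template.md"
note="$work/$(date -u +%Y-%m-%d)-example-outage.md"    # hypothetical slug
cp "$work/incident-template.md" "$note"
printf -- '- %sZ incident opened, summary filled in\n' "$(date -u +%H:%M)" >> "$note"
```

Appending timeline entries as you go is the important habit; reconstructing the order after the fact is error-prone.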
## What not to include
- Secrets (tokens, passwords, Healthchecks URLs).
- Private emails or non-public IPs/hostnames.
- Full logs. Prefer:
  - the exact log path(s),
  - the most relevant ~20–50 lines, and
  - one or two `vps-*.sh` snapshots.
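One way to produce such an excerpt is to locate the error line and keep a small window around it. A sketch, using generated stand-in log contents (point `grep` at the real error string and log path):

```shell
# Keep ~20 lines around the first match instead of attaching the full log.
log="$(mktemp)"
excerpt="$log.excerpt"
seq 1 100 | sed 's/^/line /' > "$log"               # stand-in for a real log
n="$(grep -n 'line 50$' "$log" | cut -d: -f1)"      # line number of the error
sed -n "$((n-10)),$((n+10))p" "$log" > "$excerpt"   # 10 lines of context each side
```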
## Style
- Keep it blameless: focus on systems, invariants, and guardrails (not individuals).
- Prefer concrete facts over speculation; if something is unknown, label it as such.
- Record commands that changed state (DB writes, mounts, restarts) and what they affected.