
Documentation process audit (2026-01-09)

Scope: HealthArchive project documentation processes and subprocesses across:

  • healtharchive (ops, runbooks, incident notes, canonical internal docs)
  • the frontend app in healtharchive/frontend/ (public policy/reporting surfaces, UX copy, changelog)
  • healtharchive-datasets (dataset release documentation and integrity posture)
  • the older “workspace of sibling repos” convention used in /home/jer/LocalSync/healtharchive/

Goal: assess whether the documentation system is well-designed, maintainable, and aligned with modern best practices (docs-as-code + operational excellence), and identify concrete upgrades.


Executive summary

Overall: the project’s documentation system is already unusually strong for its size. It is structured around a clear “docs-as-code” posture, high-signal operational procedures, and drift-resistant separation of backlog vs implementation plans vs canonical docs.

Key strengths (high confidence):

  • Drift prevention by design: single canonical sources + pointer docs, and explicit backlog/plan/canonical separation (docs/planning/** vs docs/**).
  • Operations maturity: production runbook, playbooks, cadence checklists, monitoring/CI setup guidance, and safety posture are explicit and actionable.
  • Incident management: severity rubric + incident template + operator response playbook + at least one real incident note showing good practice.
  • Public vs private boundaries: explicit contracts for privacy-preserving usage metrics, issue-report retention, and non-public admin/metrics access.
  • Reproducibility: dataset release integrity rules (checksums + manifest invariants) documented and operationalized.

Primary remaining gaps (fixable, low risk):

  • Templates / consistency: strong examples existed, but there were no standard templates for new runbooks/playbooks, and changelog updates were not documented as an SOP.
  • Lifecycle + review cadence: docs avoided duplication well, but “how we keep docs correct over time” could be more explicit (lightweight review + deprecation pattern).
  • Public communication integration: incident notes were solid, but the “when do we update /changelog and/or /status?” expectation wasn’t explicit enough.

Inventory of documentation “process surfaces”

This is the set of documents that define how documentation is produced, maintained, and used (not every domain-specific doc).

Governance / doc architecture

  • Canonical documentation policy and source-of-truth rules:
      • docs/documentation-guidelines.md
  • Index structure (discoverability):
      • docs/README.md
      • docs/operations/README.md
      • docs/operations/playbooks/README.md
      • docs/planning/README.md
      • docs/frontend/README.md

Planning / change management

  • Backlog and implementation plan workflow:
      • docs/roadmap-process.md
      • docs/planning/roadmap.md
      • docs/planning/implemented/

Incidents and post-incident learning

  • Incident SOP and artifacts:
      • docs/operations/incidents/README.md
      • docs/_templates/incident-template.md
      • docs/operations/incidents/severity.md
      • docs/operations/playbooks/core/incident-response.md

Operations and reliability subprocesses (repeatable routines)

  • Cadence and routines:
      • docs/operations/ops-cadence-checklist.md
  • Monitoring/CI and deploy gating:
      • docs/operations/monitoring-and-ci-checklist.md
      • docs/operations/playbooks/core/deploy-and-verify.md
      • docs/operations/baseline-drift.md
  • Backup/restore validation:
      • docs/operations/restore-test-procedure.md
      • docs/_templates/restore-test-log-template.md
  • Data handling and privacy posture:
      • docs/operations/data-handling-retention.md
      • docs/operations/observability-and-private-stats.md

Public-facing reporting surfaces (documentation for users)

  • Changelog content lives in code, but is effectively “public documentation”:
      • https://github.com/jerdaw/healtharchive/blob/main/frontend/src/content/changelog.ts
      • (process) https://github.com/jerdaw/healtharchive/blob/main/frontend/docs/changelog-process.md
  • Status/impact pages are public reporting surfaces (operational transparency):
      • https://github.com/jerdaw/healtharchive/blob/main/frontend/src/app/%5Blocale%5D/status/page.tsx
      • https://github.com/jerdaw/healtharchive/blob/main/frontend/src/app/%5Blocale%5D/impact/page.tsx

Release + reproducibility subprocesses

  • Dataset releases and integrity expectations:
      • https://github.com/jerdaw/healtharchive-datasets/blob/main/README.md
      • docs/operations/export-integrity-contract.md
      • docs/operations/dataset-release-runbook.md
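The checksum side of this integrity posture can be illustrated with a minimal verification sketch. This is not the project’s actual tooling, and the real manifest format is whatever the export-integrity contract defines; the sketch assumes a hypothetical `SHA256SUMS`-style file (one `<hex digest>  <filename>` line per artifact) sitting next to the release files:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large release artifacts fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(release_dir: Path, sums_file: str = "SHA256SUMS") -> list[str]:
    """Return a list of problems found; an empty list means the release verifies."""
    problems: list[str] = []
    for line in (release_dir / sums_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip()
        target = release_dir / name
        if not target.exists():
            problems.append(f"missing: {name}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {name}")
    return problems
```

The point of the shape, not the code: verification is a pure read-only check against an immutable artifact set, so it can run anywhere the release lands, not just on the machine that produced it.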

Evaluation against modern best practices

This section uses a simple “Green / Yellow / Red” maturity signal.

1) Discoverability & information architecture — Green

Evidence:

  • Dedicated indices exist for backend docs, ops docs, playbooks, and roadmaps.
  • File naming is descriptive and stable (runbook, checklist, playbook).
  • Cross-repo “canonical doc” pointers exist (env wiring, partner kit, data dictionary).

Residual risks:

  • Some “project-level” navigation still relies on GitHub blob links for in-tree frontend code/docs that live outside MkDocs navigation.

2) Single source of truth / drift control — Green

Evidence:

  • Explicit canonical sources + pointer strategy.
  • Explicit separation:
  • backlog (planning/roadmap.md)
  • active plans (docs/planning/*.md)
  • canonical docs (deployment/ops/dev)

Residual risks:

  • Any duplicated non-git copies (e.g., ops roadmap) are inherently drift-prone; currently mitigated by explicit “keep synced” guidance.

3) Operational excellence (runbooks, playbooks, verification) — Green

Evidence:

  • Production runbook is explicit about topology, security posture, and setup steps:
  • docs/deployment/production-single-vps.md
  • Deploy is treated as a verified procedure with a defined gate (“green main” + VPS verification).
  • Baseline drift is operationalized as policy+observed+diff.
  • Restore tests and dataset verification have explicit SOPs and templates.

Residual risks:

  • As more workflows accumulate, templates become important to prevent playbooks/runbooks diverging in structure/quality.

4) Incident response & learning system — Green (with small upgrades)

Evidence:

  • Incident notes have a clear SOP, a severity rubric, and a good template.
  • The template includes: impact, detection, timeline, root cause, recovery, verification, action items.
  • The repo has at least one high-quality real incident note, with follow-ups tied to a roadmap.

Residual risks:

  • Public communication expectations (status/changelog) were implicit; explicit guidance now exists, but the project may still want to adopt a stance:
      • “We always publish a public-safe note for sev0/sev1” vs. “only when it changes user expectations”.

5) Public transparency & user-facing documentation — Yellow

Evidence:

  • The site includes /governance, /terms, /privacy, /changelog, /report, /status, /impact.
  • Copy inventory and disclaimer matrices exist to keep safety posture coherent.

Gaps:

  • The changelog is a core public accountability surface, but without an explicit SOP it risks becoming stale or inconsistent (especially across EN/FR).

6) Security + privacy documentation posture — Green

Evidence:

  • Clear “no secrets in git” posture across docs.
  • Admin/metrics are explicitly private-only; tailnet access model is documented.
  • Data retention and PHI risk are explicitly addressed for issue reports and logs.

Residual risks:

  • If the project ever adds more operators, formalize “who has access to what” and credential rotation as explicit operator subprocesses.

7) Reproducibility and research integrity — Green

Evidence:

  • Export endpoints have defined ordering/pagination invariants.
  • Dataset releases are immutable objects with checksum verification and manifest invariants.
  • Corrections are expected to be documented rather than silently rewriting history.
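As an illustration of what “defined ordering/pagination invariants” can mean in practice, here is a hedged sketch of a consumer-side check. The `id` sort key and the page shape are assumptions for the example, not the project’s actual export schema; the real invariants are defined in docs/operations/export-integrity-contract.md:

```python
from typing import Iterable


def check_export_invariants(pages: Iterable[list[dict]], key: str = "id") -> None:
    """Verify that records are strictly ordered by a stable key across all
    pages of an export. Strict ordering also rules out duplicated records
    and overlapping pages, so one check covers both invariants."""
    last = None
    for page_number, page in enumerate(pages, start=1):
        for record in page:
            current = record[key]
            if last is not None and current <= last:
                raise AssertionError(
                    f"ordering violated on page {page_number}: "
                    f"{current!r} follows {last!r}"
                )
            last = current
```

A check like this is cheap enough to run on every export pull, which is what turns a documented invariant into an enforced one.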

Improvements implemented in this audit (2026-01-09)

These are low-risk upgrades that make doc creation and maintenance more consistent:

  • Docs reference sanity checks (broken links/path refs):
      • Backend: scripts/check_docs_references.py (wired into Makefile)
      • Frontend: frontend/scripts/check-doc-references.mjs (wired into frontend/package.json)
      • Datasets: https://github.com/jerdaw/healtharchive-datasets/blob/main/scripts/check_docs_references.py (wired into Makefile)
  • Standardized templates:
      • docs/_templates/runbook-template.md
      • docs/_templates/playbook-template.md
  • Decision records mechanism:
      • docs/decisions/README.md
      • docs/_templates/decision-template.md
  • Clearer doc taxonomy, quality bar, and lifecycle guidance:
      • docs/documentation-guidelines.md
  • Public changelog SOP (source of truth, format, localization rules):
      • https://github.com/jerdaw/healtharchive/blob/main/frontend/docs/changelog-process.md
  • Stronger “incident → public-safe note” expectation (optional but recommended for sev0/sev1):
      • docs/operations/incidents/README.md
      • docs/_templates/incident-template.md
      • docs/operations/incidents/severity.md
      • docs/operations/ops-cadence-checklist.md
  • Process nudges in PR templates:
      • .github/pull_request_template.md
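For context, a docs reference sanity check of the kind listed above can be quite small. This is not the contents of scripts/check_docs_references.py; it is an illustrative minimal version that scans Markdown files for relative links whose targets do not exist, skipping external URLs:

```python
import re
from pathlib import Path

# Matches [text](target) and [text](target#anchor); group 1 is the path part.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)#]+)(?:#[^)]*)?\)")


def find_broken_references(docs_root: Path) -> list[str]:
    """Return '<file>: <target>' entries for relative links whose target
    path does not exist. http(s)/mailto links are out of scope."""
    broken: list[str] = []
    for md_file in sorted(docs_root.rglob("*.md")):
        for match in LINK_PATTERN.finditer(md_file.read_text(encoding="utf-8")):
            target = match.group(1).strip()
            if target.startswith(("http://", "https://", "mailto:")):
                continue
            if not (md_file.parent / target).resolve().exists():
                broken.append(f"{md_file}: {target}")
    return broken
```

Wiring the real script into the Makefile (as done here) is the part that matters: a link checker only prevents drift if it runs on every change, not when someone remembers it.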

Recommendations (next steps)

P0 (high value, low effort)

  • Decide an explicit public incident disclosure posture:
      • Option A: always add a public-safe /changelog entry for sev0/sev1 incidents.
      • Option B: only add a public-safe entry when it changes user expectations (outage, integrity risk, policy change).
  • Make doc maintenance part of normal ops:
      • During the quarterly cadence, skim the production runbook and the incident-response playbook, and fix any drift discovered during real operations.

P1 (medium value, moderate effort)

  • (Implemented) Docs link/path sanity checks + decision records are now in place.

P2 (later / if team grows)

  • If/when there are multiple regular committers:
      • switch to PR-only merges (branch protection with required checks),
      • introduce CODEOWNERS for high-risk areas (deployment/ops/policy pages),
      • require review for public-policy copy changes.
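If CODEOWNERS is adopted, the file can stay very small. The entries below are an illustrative sketch: the paths and the `@jerdaw` handle are assumptions to be adjusted to the repo’s actual layout, not a prescribed configuration:

```
# Illustrative CODEOWNERS entries for high-risk areas.
# Paths and the @jerdaw handle are assumptions; adjust to the actual layout.
/docs/deployment/   @jerdaw
/docs/operations/   @jerdaw
/frontend/src/app/  @jerdaw
```

With branch protection’s “require review from code owners” enabled, these entries make review mandatory only where mistakes are costly, keeping friction low everywhere else.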

“Top notch” principles to keep

  • Prefer stable, scripted entrypoints over fragile shell snippets.
  • Keep internal docs public-safe by default (assume they may be shared).
  • Separate “what exists and how to operate it” from “how we got here” (planning/implemented plans).
  • Treat verification as first-class: every operational procedure should define what “done” means.