HealthArchive ops roadmap (internal)

This file tracks the current ops roadmap/todo items only. Keep it short and current.

For historical roadmaps and upgrade context, see:

docs/planning/README.md (backend repo)

Keep the two synced copies of this file aligned:

Backend repo: docs/operations/healtharchive-ops-roadmap.md
Optional local working copy (non-git): if you keep a separate ops checklist outside the repo, keep it in sync with this canonical file.

Recurring ops (non-IRL, ongoing)

Quarterly: run a restore test and record a public-safe log entry in /srv/healtharchive/ops/restore-tests/.
Quarterly: add an adoption signals entry in /srv/healtharchive/ops/adoption/ (links + aggregates only).
Quarterly: confirm dataset release exists and passes checksum verification (sha256sum -c SHA256SUMS).
Quarterly: confirm core timers are enabled and succeeding (recommended: on the VPS run cd /opt/healtharchive && ./scripts/verify_ops_automation.sh; then spot-check journalctl -u <service>).
Quarterly: docs drift skim: re-read the production runbook + incident response and fix any drift you notice (keep docs matching reality).
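
The quarterly checksum check can also be scripted when the result needs to go into a log entry. A minimal Python sketch, assuming SHA256SUMS uses the standard sha256sum line format (hex digest, whitespace, optional `*` binary marker, file name); the helper names `sha256_of` and `verify_sha256sums` are illustrative and not part of the HealthArchive tooling:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large release files are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sums(release_dir: Path, manifest_name: str = "SHA256SUMS") -> dict[str, str]:
    """Return {file name: "OK" | "FAILED" | "MISSING"} for every manifest entry."""
    results: dict[str, str] = {}
    manifest = release_dir / manifest_name
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the manifest
        expected, name = line.split(maxsplit=1)
        if name.startswith("*"):
            name = name[1:]  # sha256sum marks binary-mode entries with '*'
        target = release_dir / name
        if not target.is_file():
            results[name] = "MISSING"
        elif sha256_of(target) == expected.lower():
            results[name] = "OK"
        else:
            results[name] = "FAILED"
    return results
```

For routine checks, `sha256sum -c SHA256SUMS` remains the simpler option; a wrapper like this is mainly useful when the per-file outcome should be recorded in a structured form.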

Current status (as of 2026-05-06)

Live facts below come from operator-provided VPS output, not direct assistant production access.

2026 annual campaign status:
hc job 6 is indexed, search-ready, and research-ready.
- Indexed pages: 262567.
- Backend: playwright_warc fallback, labeled through annual-edition provenance.
phac job 7 is indexed, search-ready, and research-ready.
- Indexed pages: 121940.
- Backend: playwright_warc fallback, labeled through annual-edition provenance.
- Manual reindex evidence: /srv/healtharchive/ops/manual-runs/phac-reindex-20260429T051607Z.log shows "Indexing for job 7 completed successfully with 121940 snapshot(s).", followed by "Indexed: 1, Failed: 0, Jobs: 7". Completion timestamp in the log: 2026-04-29 14:45:29 UTC.
cihr job 8 is indexed, search-ready, and research-ready.
- Indexed pages: 557972.
- Backend: browsertrix.
- Crawler stage: operator_accepted_warcs_after_zim_build_failure.
- Final WARC details: 689 stable WARC files, 689 discovered WARC files, stable WARC source, manifest valid, total WARC size 709.83 GB.
- Annual edition report: /srv/healtharchive/jobs/editions/cihr/2026/coverage-report.json reported Status=research_ready, Search ready=True, and Research ready=True.
Annual search readiness is restored: annual-status --year 2026 reports Ready for search: YES, and production deploy/public-surface verification after the search follow-through passed on 2026-05-06.
Job lock-dir cutover remains complete:
/etc/healtharchive/backend.env points at /srv/healtharchive/ops/locks/jobs.
API and worker were both restarted during the 2026-04-14 maintenance window, so the env change is live in production.
Annual output-dir mount topology conversion is complete for the 2026 annual output dirs:
on 2026-05-06, jobs 6, 7, and 8 were converted from direct per-job sshfs mounts to hot paths rooted in the single Storage Box mount.
verification details are recorded under the current ops tasks below.
Worker/watchdog posture was restored after CIHR indexing:
healtharchive-worker.service active
healtharchive-crawl-auto-recover.timer active
/etc/healtharchive/crawl-auto-recover-enabled present
healtharchive-storage-hotpath-auto-recover.timer active
Public-surface verification is no longer blocked by search latency:
Deploys through e9129c4eda31ce8a2b6072454e2ae48f484ecbad passed the production deploy helper, baseline drift check, and public-surface verifier.
Public verifier now reaches API health/stats/sources/exports/search, snapshot detail, raw HTML, replay URL, usage/changes/RSS, frontend English and French pages, snapshot pages, and report forwarder checks.
Final warm-up timing samples after the search-performance deploys:
- q=covid&pageSize=1: 3.252s, 5.476s, 2.487s, 2.389s, 1.959s
- q=covid&pageSize=1&view=pages: 8.959s, 6.742s, 4.787s, 4.566s, 4.285s
- pageSize=1: 6.793s, 1.885s, 3.678s, 2.339s, 2.067s
- pageSize=1&source=cihr: 5.919s, 2.329s, 2.502s, 3.070s, 2.491s
Remaining search work is future DB/index-plan tuning for broad q=...&view=pages if repeated warm-cache samples exceed the desired response target.
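
To decide whether the broad `view=pages` query needs tuning, the samples above can be reduced to a median per query and compared against a target. A small sketch using the recorded values; the 3.0 s target is a placeholder assumption, since this roadmap does not state the desired response target, and `summarize` is an illustrative helper:

```python
from statistics import median

# Warm-up timing samples in seconds, copied from the roadmap entry above.
SAMPLES = {
    "q=covid&pageSize=1": [3.252, 5.476, 2.487, 2.389, 1.959],
    "q=covid&pageSize=1&view=pages": [8.959, 6.742, 4.787, 4.566, 4.285],
    "pageSize=1": [6.793, 1.885, 3.678, 2.339, 2.067],
    "pageSize=1&source=cihr": [5.919, 2.329, 2.502, 3.070, 2.491],
}

# Assumption: placeholder target only, not a documented HealthArchive threshold.
TARGET_SECONDS = 3.0


def summarize(samples: dict[str, list[float]], target: float) -> dict[str, tuple[float, bool]]:
    """Return {query: (median seconds, True if the median exceeds the target)}."""
    return {
        query: (median(values), median(values) > target)
        for query, values in samples.items()
    }


if __name__ == "__main__":
    for query, (mid, over) in summarize(SAMPLES, TARGET_SECONDS).items():
        flag = "over target" if over else "within target"
        print(f"{query}: median {mid:.3f}s ({flag})")
```

The median is used rather than the mean so that the slow first request in each series does not dominate the result.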
Repo-side WARC-complete/ZIM-finalization recurrence prevention is deployed: WARC-complete Browsertrix runs with final crawlStatus pending=0 and discoverable WARCs are eligible for indexing instead of automatically starting another resume crawl.
CIHR failed-URL review is complete:
final crawlStatus reported failed=26, but the failure increments were final retry exhaustion events;
25 page/route URLs already had exact job 8 snapshot coverage;
the lone uncovered URL was a render-asset image /images/ipph_launch_may_2024-1.jpg, accepted as a non-page gap.
Alerting/report hygiene from the recent crawl work is deployed:
bounded content reporting is the preferred operator diagnostic for live crawl cost/failure classification.
stale historical crawl warnings are reduced; investigate throughput/churn trends in Grafana rather than relying on direct throughput notifications.

Current priority order

Treat the following as the current ops execution order:

Optional: investigate broad q=...&view=pages DB/index-plan tuning if repeated warm-cache samples stay above the desired response target.
Routine quarterly ops and evidence collection.

Current ops tasks (implementation already exists; enable/verify)

PHAC 2026 salvage/indexing is complete.
Job 7 is indexed and its annual edition report is regenerated.
PHAC policy follow-through is closed for the next annual cycle: keep the labeled playwright_warc fallback posture and keep the temporary high-churn exclusions unless a separate live verification proves those Browsertrix paths are stable.
2026 PHAC coverage is research-ready with missingUrlCount=0; no targeted recrawl is needed.
Annual output-dir mount topology conversion is complete.
On 2026-05-06, jobs 6, 7, and 8 were converted from direct per-job sshfs mounts to hot paths rooted in the single Storage Box mount.
Verification showed one sshfs process for /srv/healtharchive/storagebox, matching hot/cold directory identity, annual status indexed=3, and replay smoke 200 for HC, PHAC, and CIHR.
CIHR incident follow-through:
Job 8 is indexed and annual search-ready; do not start additional indexing unless later checks prove the indexed rows are unusable.
Public search performance, raw snapshot checks, replay checks, and WARC-complete/ZIM-finalization recurrence prevention are deployed and verified.
Failed-URL review is complete; no targeted follow-up capture is needed for this incident.
The incident note is incidents/2026-05-03-cihr-warc-complete-zim-build-resume-loop.md.
Large indexing hygiene for manual production runs:
Always load production env first: cd /opt/healtharchive && set -a; source /etc/healtharchive/backend.env; set +a.
Use nohup or tmux for multi-hour indexing, capture logs under /srv/healtharchive/ops/manual-runs/, and consider renice +10 -p <pid>.
Monitor ps plus /proc/<pid>/io; an increasing rchar with high CPU means indexing is still making progress even if DB status has not committed.
Do not start duplicate reconcile-completed-indexing commands for the same source/job.
If a client process exits but PostgreSQL still shows a long-lived "idle in transaction" session, confirm the job did not commit and inspect blockers first, then terminate only the stale backend.
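
The rchar progress check described above can be sampled rather than eyeballed. A minimal sketch for Linux, assuming the caller has permission to read /proc/<pid>/io (same user or root); `parse_proc_io` and `rchar_delta` are illustrative names, and the 30-second interval is an arbitrary choice:

```python
import time
from pathlib import Path


def parse_proc_io(text: str) -> dict[str, int]:
    """Parse the 'key: value' lines of /proc/<pid>/io into integer counters."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def rchar_delta(pid: int, interval_seconds: float = 30.0) -> int:
    """Return how many bytes the process read (rchar) during one interval."""
    io_path = Path(f"/proc/{pid}/io")
    before = parse_proc_io(io_path.read_text())["rchar"]
    time.sleep(interval_seconds)
    after = parse_proc_io(io_path.read_text())["rchar"]
    return after - before
```

A positive delta together with high CPU in `ps` matches the "still making progress" case; a delta of zero over several intervals is a prompt to look at PostgreSQL session state before taking any action.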
Preserved VPS branch review is complete.
prod-pre-a3e0dece forked at 37a48988 and contains three old pre-deploy commits.
The deployed HEAD includes the later synchronized follow-up PR 58cefc5a plus newer annual edition, replay, public-search, and incident closeout work.
Do not merge or cherry-pick prod-pre-a3e0dece; its diff against deployed HEAD would delete newer production state. It can be deleted on the VPS after this roadmap update is deployed.
Next steps:
- compare prod-pre-a3e0dece against main
- decide whether each preserved commit needs cherry-pick, replacement, or explicit retirement
- do not delete the branch until that review is documented
Maintenance window (after 2026 annual crawl is idle): convert annual output dirs from direct sshfs mounts to bind mounts.
Why defer: unmount/re-mount of a live job output dir can interrupt in-progress crawls; benefit is reduced Errno 107 blast radius, but not worth forced interruption mid-campaign.
Detection (crawl-safe): python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --year 2026
Repair (maintenance only): stop the worker and ensure crawl containers are stopped, then:
- sudo python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --year 2026 --apply --repair-unexpected-mounts --allow-repair-running-jobs
After any reboot/rescue/maintenance where mounts may drift:
Verify Storage Box mount is active (healtharchive-storagebox-sshfs.service).
Re-apply annual output tiering for the active campaign year and confirm job output dirs are on Storage Box (see incident: incidents/2026-02-04-annual-crawl-output-dirs-on-root-disk.md).
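
A quick first check after a reboot is whether the Storage Box path is actually an sshfs mount. A minimal sketch, assuming Linux /proc/mounts and the /srv/healtharchive/storagebox mount point mentioned elsewhere in this file; the helpers are illustrative and not part of the HealthArchive scripts. Bind mounts of an sshfs subtree also report the fuse.sshfs type, so this does not validate the tiering topology by itself and does not replace the tiering script or the systemd unit status:

```python
from pathlib import Path

# Mount point taken from this roadmap; adjust if the deployment differs.
STORAGEBOX_MOUNTPOINT = "/srv/healtharchive/storagebox"


def sshfs_mountpoints(mounts_text: str) -> list[str]:
    """Return mount points whose filesystem type is fuse.sshfs."""
    points: list[str] = []
    for line in mounts_text.splitlines():
        fields = line.split()  # device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "fuse.sshfs":
            points.append(fields[1])
    return points


def is_sshfs_mounted(mounts_text: str, mountpoint: str = STORAGEBOX_MOUNTPOINT) -> bool:
    """True if the given path appears as an sshfs mount in the mounts table."""
    return mountpoint in sshfs_mountpoints(mounts_text)


if __name__ == "__main__":
    table = Path("/proc/mounts").read_text()
    print("storagebox mounted:", is_sshfs_mounted(table))
    print("sshfs mount points:", sshfs_mountpoints(table))
```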
After deploying new crawl tuning defaults (or if an annual campaign was started before the change):
Reconcile already-created annual job configs so retries/restarts adopt the new per-source profiles:
- Dry-run: healtharchive reconcile-annual-tool-options --year <YEAR>
- Apply: healtharchive reconcile-annual-tool-options --year <YEAR> --apply
Verify the new Docker resource limit environment variables are set appropriately on VPS if defaults need adjustment:
HEALTHARCHIVE_DOCKER_MEMORY_LIMIT (default: 4g)
HEALTHARCHIVE_DOCKER_CPU_LIMIT (default: 1.5)
Post-deploy follow-through (alerting):
Review notification volume and alert outcomes after 7 days (firing + resolved counts by alertname/severity).
Confirm crawl throughput/churn investigations are being done via Grafana (HealthArchive - Pipeline Health) and not missed due to notification removal.
Consider a future composite crawl-degradation alert only if dashboard review repeatedly reveals actionable issues that are not otherwise alerted.
On the next relevant page-group rebuild, verify that logs show unknown instead of negative counts when PostgreSQL rowcount is indeterminate.

IRL / external validation (active; runs in parallel with ops)

External validation work is not blocked by the remaining ops items. HC, PHAC, and CIHR are all indexed and research-ready, and the annual output-dir mount conversion is recorded as complete in the ops tasks above. Outreach and scholarly output can proceed independently on any day.

The active plan is:

../planning/2026-02-admissions-strengthening-plan.md: phases, effort, and sequence for all external/IRL work.

Current status as of 2026-04-14:

Phase 1 items (outreach, uptime monitoring, portfolio page, ethics/governance update) are not yet started.
The plan was created 2026-02-25; about 7 weeks (48 days) have elapsed as of 2026-04-14, placing the timeline in Phase 1–2 territory.
The mentions log remains empty (zero confirmed partners, verifiers, or citations).
The single highest-leverage unblocking action is: send the first outreach batch (5–10 contacts, using existing templates at ../operations/outreach-templates.md and the playbook at playbooks/external/outreach-and-verification.md).

Treat external outreach as a parallel track to daily ops, not something to start "once ops settles." Ops will not fully settle before application deadlines.