HealthArchive ops roadmap (internal)
This file tracks the current ops roadmap/todo items only. Keep it short and current.
For historical roadmaps and upgrade context, see:
- docs/planning/README.md (backend repo)
Keep the two synced copies of this file aligned:
- Backend repo: docs/operations/healtharchive-ops-roadmap.md
- Optional local working copy (non-git): if you keep a separate ops checklist outside the repo, keep it in sync with this canonical file.
Recurring ops (non-IRL, ongoing)
- Quarterly: run a restore test and record a public-safe log entry in /srv/healtharchive/ops/restore-tests/.
- Quarterly: add an adoption signals entry in /srv/healtharchive/ops/adoption/ (links + aggregates only).
- Quarterly: confirm dataset release exists and passes checksum verification (sha256sum -c SHA256SUMS).
- Quarterly: confirm core timers are enabled and succeeding (recommended: on the VPS run cd /opt/healtharchive && ./scripts/verify_ops_automation.sh; then spot-check journalctl -u <service>).
- Quarterly: docs drift skim: re-read the production runbook + incident response and fix any drift you notice (keep docs matching reality).
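The quarterly checksum check can be wrapped so it fails loudly when the checksum file is missing. This is a minimal sketch, assuming the release directory holds a SHA256SUMS file next to the files it lists. The helper name verify_release_checksums and the directory argument are hypothetical; this roadmap does not say where dataset releases are stored.

```shell
#!/bin/sh
# Sketch only: verify_release_checksums is a hypothetical helper, not a repo script.
verify_release_checksums() {
    release_dir=$1
    # The subshell keeps the caller's working directory unchanged.
    (
        cd "$release_dir" || exit 2
        if [ ! -f SHA256SUMS ]; then
            echo "missing SHA256SUMS in $release_dir" >&2
            exit 2
        fi
        # Same command the checklist names; exits non-zero on any mismatch.
        sha256sum -c SHA256SUMS
    )
}

# Usage (the path is a placeholder):
#   verify_release_checksums /path/to/dataset-release && echo "release checksums OK"
```

An exit status of 2 separates "could not check" from a real checksum mismatch, which may be useful when recording the quarterly evidence.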
Current status (as of 2026-05-06)
Live facts below come from operator-provided VPS output, not direct assistant production access.
- 2026 annual campaign status:
  - hc job 6 is indexed, search-ready, and research-ready.
    - Indexed pages: 262567.
    - Backend: playwright_warc fallback, labeled through annual-edition provenance.
  - phac job 7 is indexed, search-ready, and research-ready.
    - Indexed pages: 121940.
    - Backend: playwright_warc fallback, labeled through annual-edition provenance.
    - Manual reindex evidence: /srv/healtharchive/ops/manual-runs/phac-reindex-20260429T051607Z.log shows "Indexing for job 7 completed successfully with 121940 snapshot(s).", followed by Indexed: 1, Failed: 0, Jobs: 7. Completion timestamp in the log: 2026-04-29 14:45:29 UTC.
  - cihr job 8 is indexed, search-ready, and research-ready.
    - Indexed pages: 557972.
    - Backend: browsertrix.
    - Crawler stage: operator_accepted_warcs_after_zim_build_failure.
    - Final WARC details: 689 stable WARC files, 689 discovered WARC files, stable WARC source, manifest valid, total WARC size 709.83 GB.
    - Annual edition report: /srv/healtharchive/jobs/editions/cihr/2026/coverage-report.json reported Status=research_ready, Search ready=True, and Research ready=True.
- Annual search readiness is restored: annual-status --year 2026 reports Ready for search: YES, and production deploy/public-surface verification after the search follow-through passed on 2026-05-06.
- Job lock-dir cutover remains complete:
  - /etc/healtharchive/backend.env points at /srv/healtharchive/ops/locks/jobs.
  - API and worker were both restarted during the 2026-04-14 maintenance window, so the env change is live in production.
- Annual output-dir mount topology is still unexpected for 2026 annual output dirs:
  - direct sshfs mounts remain in place instead of bind mounts.
  - conversion remains intentionally deferred until a future maintenance window after the annual crawl is idle or during an explicitly accepted interruption.
- Worker/watchdog posture was restored after CIHR indexing:
  - healtharchive-worker.service active
  - healtharchive-crawl-auto-recover.timer active
  - /etc/healtharchive/crawl-auto-recover-enabled present
  - healtharchive-storage-hotpath-auto-recover.timer active
- Public-surface verification is no longer blocked by search latency:
  - Deploys through e9129c4eda31ce8a2b6072454e2ae48f484ecbad passed the production deploy helper, baseline drift check, and public-surface verifier.
  - Public verifier now reaches API health/stats/sources/exports/search, snapshot detail, raw HTML, replay URL, usage/changes/RSS, frontend English and French pages, snapshot pages, and report forwarder checks.
  - Final warm-up timing samples after the search-performance deploys:
    - q=covid&pageSize=1: 3.252s, 5.476s, 2.487s, 2.389s, 1.959s
    - q=covid&pageSize=1&view=pages: 8.959s, 6.742s, 4.787s, 4.566s, 4.285s
    - pageSize=1: 6.793s, 1.885s, 3.678s, 2.339s, 2.067s
    - pageSize=1&source=cihr: 5.919s, 2.329s, 2.502s, 3.070s, 2.491s
- Remaining search work is future DB/index-plan tuning for broad q=...&view=pages if repeated warm-cache samples exceed the desired response target.
- Repo-side WARC-complete/ZIM-finalization recurrence prevention is deployed: WARC-complete Browsertrix runs with final crawlStatus pending=0 and discoverable WARCs are eligible for indexing instead of automatically starting another resume crawl.
- CIHR failed-URL review is complete:
  - final crawlStatus reported failed=26, but the failure increments were final retry exhaustion events;
  - 25 page/route URLs already had exact job 8 snapshot coverage;
  - the lone uncovered URL was a render-asset image /images/ipph_launch_may_2024-1.jpg, accepted as a non-page gap.
- Alerting/report hygiene from the recent crawl work is deployed:
  - bounded content reporting is the preferred operator diagnostic for live crawl cost/failure classification.
  - stale historical crawl warnings are reduced; investigate throughput/churn trends in Grafana rather than via direct throughput pages.
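To decide whether the broad view=pages query needs tuning, the warm-up samples above can be reduced to a median and compared against a response target. The sketch below is illustrative only: the helper names are made up, and TARGET_SECONDS is a placeholder because this roadmap does not state the actual target.

```shell
#!/bin/sh
# Sketch only: helper names and TARGET_SECONDS are illustrative assumptions.

# Print the median of the numeric arguments (seconds, without the "s" suffix).
median() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | LC_ALL=C sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Succeed (exit 0) when the first number is greater than the second.
exceeds() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value > limit) }'
}

TARGET_SECONDS=5   # placeholder; the roadmap does not state the real target

# Samples copied from the warm-up timings listed in the status section.
covid_median=$(median 3.252 5.476 2.487 2.389 1.959)
pages_median=$(median 8.959 6.742 4.787 4.566 4.285)
echo "q=covid&pageSize=1 median: ${covid_median}s"
echo "q=covid&pageSize=1&view=pages median: ${pages_median}s"

if exceeds "$pages_median" "$TARGET_SECONDS"; then
    echo "view=pages median is above the placeholder target"
fi
```

The median is used rather than the mean so that the slower first request in each series does not dominate the comparison.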
Current priority order
Treat the following as the current ops execution order:
- Optional: investigate broad q=...&view=pages DB/index-plan tuning if repeated warm-cache samples stay above the desired response target.
- Routine quarterly ops and evidence collection.
Current ops tasks (implementation already exists; enable/verify)
- PHAC 2026 salvage/indexing is complete.
  - Job 7 is indexed and its annual edition report is regenerated.
  - PHAC policy follow-through is closed for the next annual cycle: keep the labeled playwright_warc fallback posture and keep the temporary high-churn exclusions unless a separate live verification proves those Browsertrix paths are stable.
  - 2026 PHAC coverage is research-ready with missingUrlCount=0; no targeted recrawl is needed.
- Annual output-dir mount topology conversion is complete.
  - On 2026-05-06, jobs 6, 7, and 8 were converted from direct per-job sshfs mounts to hot paths rooted in the single Storage Box mount.
  - Verification showed one sshfs process for /srv/healtharchive/storagebox, matching hot/cold directory identity, annual status indexed=3, and replay smoke 200 for HC, PHAC, and CIHR.
- CIHR incident follow-through:
  - Job 8 is indexed and annual search-ready; do not start additional indexing unless later checks prove the indexed rows are unusable.
  - Public search performance, raw snapshot checks, replay checks, and WARC-complete/ZIM-finalization recurrence prevention are deployed and verified.
  - Failed-URL review is complete; no targeted follow-up capture is needed for this incident.
  - The incident note is incidents/2026-05-03-cihr-warc-complete-zim-build-resume-loop.md.
- Large indexing hygiene for manual production runs:
  - Always load production env first: cd /opt/healtharchive && set -a; source /etc/healtharchive/backend.env; set +a.
  - Use nohup or tmux for multi-hour indexing, capture logs under /srv/healtharchive/ops/manual-runs/, and consider renice +10 -p <pid>.
  - Monitor ps plus /proc/<pid>/io; an increasing rchar with high CPU means indexing is still making progress even if DB status has not committed.
  - Do not start duplicate reconcile-completed-indexing commands for the same source/job.
  - If a client process exits but PostgreSQL shows a long-lived idle in transaction, confirm the job did not commit and inspect blockers before terminating only the stale backend.
- Preserved VPS branch review is complete.
  - prod-pre-a3e0dece forked at 37a48988 and contains three old pre-deploy commits.
  - The deployed HEAD includes the later synchronized follow-up PR 58cefc5a plus newer annual edition, replay, public-search, and incident closeout work.
  - Do not merge or cherry-pick prod-pre-a3e0dece; its diff against deployed HEAD would delete newer production state. It can be deleted on the VPS after this roadmap update is deployed.
  - Next steps:
    - compare prod-pre-a3e0dece against main
    - decide whether each preserved commit needs cherry-pick, replacement, or explicit retirement
    - do not delete the branch until that review is documented
- Maintenance window (after 2026 annual crawl is idle): convert annual output dirs from direct sshfs mounts to bind mounts.
  - Why defer: unmount/re-mount of a live job output dir can interrupt in-progress crawls; benefit is reduced Errno 107 blast radius, but not worth forced interruption mid-campaign.
  - Detection (crawl-safe): python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --year 2026
  - Repair (maintenance only): stop the worker and ensure crawl containers are stopped, then: sudo python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --year 2026 --apply --repair-unexpected-mounts --allow-repair-running-jobs
- After any reboot/rescue/maintenance where mounts may drift:
  - Verify Storage Box mount is active (healtharchive-storagebox-sshfs.service).
  - Re-apply annual output tiering for the active campaign year and confirm job output dirs are on Storage Box (see incident: incidents/2026-02-04-annual-crawl-output-dirs-on-root-disk.md).
- After deploying new crawl tuning defaults (or if an annual campaign was started before the change):
  - Reconcile already-created annual job configs so retries/restarts adopt the new per-source profiles:
    - Dry-run: healtharchive reconcile-annual-tool-options --year <YEAR>
    - Apply: healtharchive reconcile-annual-tool-options --year <YEAR> --apply
  - Verify the new Docker resource limit environment variables are set appropriately on VPS if defaults need adjustment:
    - HEALTHARCHIVE_DOCKER_MEMORY_LIMIT (default: 4g)
    - HEALTHARCHIVE_DOCKER_CPU_LIMIT (default: 1.5)
- Post-deploy follow-through (alerting):
  - Review notification volume and alert outcomes after 7 days (firing + resolved counts by alertname/severity).
  - Confirm crawl throughput/churn investigations are being done via Grafana (HealthArchive - Pipeline Health) and not missed due to notification removal.
  - Consider a future composite crawl-degradation alert only if dashboard review repeatedly reveals actionable issues that are not otherwise alerted.
- On the next relevant page-group rebuild, verify that logs show unknown instead of negative counts when PostgreSQL rowcount is indeterminate.
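For the large-indexing hygiene item above, progress can be judged by sampling the rchar counter in /proc/<pid>/io twice and checking that it advanced. This is a rough sketch under stated assumptions: parse_rchar, counter_delta, and watch_rchar are hypothetical helper names, not existing repo scripts, and the 60-second default interval is arbitrary.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not existing repo scripts.

# Read /proc/<pid>/io-style text on stdin and print the rchar counter.
parse_rchar() {
    awk '$1 == "rchar:" { print $2 }'
}

# Print how far a counter advanced between two samples.
counter_delta() {
    echo $(( $2 - $1 ))
}

# Sample rchar for a PID twice, INTERVAL seconds apart, and report the change.
# Reading another user's /proc/<pid>/io usually needs the same user or root.
watch_rchar() {
    pid=$1
    interval=${2:-60}
    first=$(parse_rchar < "/proc/$pid/io") || return 1
    sleep "$interval"
    second=$(parse_rchar < "/proc/$pid/io") || return 1
    echo "pid $pid rchar advanced by $(counter_delta "$first" "$second") bytes in ${interval}s"
}

# Usage: watch_rchar <pid> 60
```

A non-zero delta together with high CPU in ps matches the "still making progress" signal described in the list; a zero delta over several intervals is a cue to inspect the database side before starting anything new.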
IRL / external validation (active; runs in parallel with ops)
External validation work is not blocked by the active CIHR monitoring or the remaining maintenance-window items. HC, PHAC, and CIHR are all indexed and research-ready, and the bind-mount conversion remains deferred to a later maintenance window. Outreach and scholarly output can proceed independently on any day.
The active plan is:
- ../planning/2026-02-admissions-strengthening-plan.md: phases, effort, and sequence for all external/IRL work.
Current status as of 2026-04-14:
- Phase 1 items (outreach, uptime monitoring, portfolio page, ethics/governance update) are not yet started.
- The plan was created 2026-02-25; about 7 weeks have elapsed as of 2026-04-14, placing the timeline in Phase 1–2 territory.
- The mentions log remains empty (zero confirmed partners, verifiers, or citations).
- The single highest-leverage unblocking action is: send the first outreach batch (5–10 contacts, using existing templates at ../operations/outreach-templates.md and the playbook at playbooks/external/outreach-and-verification.md).
Treat external outreach as a parallel track to daily ops — not something to start "once ops settles." Ops will not fully settle before application deadlines.