Monitoring, uptime, and CI checklist
This file pulls together the ongoing operations aspects of the project:
- Uptime and health monitoring.
- Metrics and alerting.
- CI enforcement and branch protection.
It is meant to complement:
- ../deployment/hosting-and-live-server-to-dos.md
- ../deployment/staging-rollout-checklist.md
- ../deployment/production-rollout-checklist.md
- service-levels.md — for SLO targets and commitments
0. Implementation steps (CI + external monitoring)
This section is a practical, sequential setup plan for enforcing CI and configuring external monitoring in the real world (GitHub + UptimeRobot, etc.).
Important: most of this is not “code you deploy” — it is configuration in:
- GitHub repository settings (branch protection)
- Your monitoring provider (UptimeRobot, Healthchecks, etc.)
Step 0 — Baseline audit + decisions (operator)
Objective: avoid duplicate monitors, avoid alert noise, and avoid “unknown settings drift”.
- Inventory current external monitors (UptimeRobot, etc.; a CLI sketch for dumping this inventory appears at the end of this step):
- Monitor name
- URL
- Interval + timeout
- Alert contacts/routes
- Any keyword/body assertions
- Decide alert routing:
- Which alerts should page you vs. just email (recommended: only “site down” pages; everything else emails).
- Recommended ops rule: only notify on conditions that usually require human action.
- Prefer dashboard-only for crawl throughput/churn trends.
- Prefer post-auto-recovery alerts (notify after watchdog attempts fail/persist), not first-symptom alerts.
- Decide the `main` branch policy:
  - Mode A — Solo-fast (recommended for this project right now): direct pushes to `main`.
    - CI still runs on every push to `main`.
    - Deploys are gated by “green main” + VPS verification steps (below).
  - Mode B — Multi-committer (defer until needed): PR-only merges into `main` with required status checks + code owner review (track in ../planning/roadmap.md).
    - Switch to Mode B when there is more than one regular committer, or when you want stricter enforcement than “green main + local hooks”.
Verification:
- You can point to a quick note (even in a personal doc) listing current monitors + what each covers.
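If your monitors live in UptimeRobot, one quick way to produce the inventory note above is the provider's v2 API. This is a sketch, assuming a read-only API key exported as UPTIMEROBOT_API_KEY and `jq` installed; adapt it to whatever provider you actually use.

```bash
# Dump current UptimeRobot monitors (name, URL, interval) for the inventory note.
# Assumes a read-only API key in UPTIMEROBOT_API_KEY.
curl -sS -X POST "https://api.uptimerobot.com/v2/getMonitors" \
  -d "api_key=${UPTIMEROBOT_API_KEY}" \
  -d "format=json" |
  jq -r '.monitors[] | "\(.friendly_name)\t\(.url)\t\(.interval)s"'
```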
Step 1 — Verify CI runs on main pushes (operator)
Objective: ensure CI runs on pushes to main so you can treat “green main” as the deploy gate.
- Confirm GitHub Actions workflows are enabled:
- Repo → Actions → ensure workflows are enabled (not disabled by org/fork policy).
- Push a trivial commit to `main` (e.g. a doc tweak).
- Confirm the workflow runs and passes on that commit.
Verification:
- GitHub Actions shows the backend CI workflow completing successfully on `main`.
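If you prefer checking from a shell instead of the Actions tab, the GitHub CLI can confirm the same thing (this assumes `gh` is installed and authenticated for this repo):

```bash
# List recent workflow runs on main.
gh run list --branch main --limit 5
# Watch the newest run on main and exit non-zero if it fails.
gh run watch "$(gh run list --branch main --limit 1 --json databaseId --jq '.[0].databaseId')" --exit-status
```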
Check name inventory (branch protection)
Use stable workflow/job check names shown in GitHub’s UI. Avoid renaming workflow/job IDs after you start requiring them.
As of the monorepo migration branch, the checks are used as follows:
- Monorepo required on `main`: `Backend CI / test`, `Backend CI / api-health`, `Frontend CI / contract-sync`, `Frontend CI / lint-and-test`
- Monorepo non-required, but useful: `Backend CI / e2e-smoke`, `Frontend CI / docker-build-smoke`
- Backend-only optional broader gate: `Backend CI (Full) / test-full` (nightly/manual)
- Workflow-file hygiene when enabled: `actionlint`
- Datasets repo (when protecting datasets `main`): `Datasets CI / lint` (required)
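If you need to cross-check what GitHub actually reports for a given commit (names drift when workflows or jobs are renamed), the checks API is a quick reference. Note it returns job-level check names, while the branch-protection UI displays them as `Workflow / job`; this assumes the `gh` CLI is authenticated for the repo.

```bash
# List the check names and conclusions reported for the current HEAD commit.
gh api "repos/{owner}/{repo}/commits/$(git rev-parse HEAD)/check-runs" \
  --jq '.check_runs[] | "\(.name)\t\(.conclusion)"'
```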
Step 1a — Dependabot review/merge policy (operator + repo config)
Objective: reduce weekly maintenance overhead while preserving CI gating and human-only authorship in repository history.
Current repo policy:
- Dependabot PRs stay open for human review and human-authored landing commits.
- `semver-patch` and `semver-minor` updates may still be accepted quickly once CI passes, but they should be recreated in a human-authored commit or branch instead of merging the bot-authored branch directly (see the sketch at the end of this step).
- `semver-major` updates remain manual review by default.
- Close superseded or declined bot PRs after the human-authored replacement lands (or after you decide not to take the update).
- Monorepo required checks on `main` should include `Backend CI / test`, `Backend CI / api-health`, `Frontend CI / contract-sync`, and `Frontend CI / lint-and-test`. `Backend CI / e2e-smoke` and `Frontend CI / docker-build-smoke` remain useful non-required signals unless branch-protection appetite changes later.
- The main branch no longer relies on broad CODEOWNERS review requests for dependency PRs, which avoids reviewer-notification noise while still keeping merge control with a human.
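A minimal sketch of recreating a patch/minor bump as a human-authored commit; the Dependabot branch name and package here are illustrative, and re-running the package manager's own update command works just as well:

```bash
# Re-apply a Dependabot bump with human authorship (branch/package names are examples).
git fetch origin
git checkout -b chore/bump-example-package origin/main
# Take the bot branch's diff without merging the bot-authored commits:
git diff origin/main...origin/dependabot/npm_and_yarn/example-package-1.2.3 | git apply
git commit -am "chore(deps): bump example-package to 1.2.3"
git push -u origin chore/bump-example-package
```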
Step 1b — End-to-end smoke checks (CI)
Objective: catch regressions where the apps “build” but user‑critical paths fail at runtime.
What the smoke does:
- Starts the backend locally (uvicorn) with a tiny seeded SQLite + WARC dataset.
- Builds and starts the frontend locally (`next start`) pointing at that backend.
- Runs `healtharchive/scripts/verify_public_surface.py` against:
  - Frontend: `/archive`, `/fr/archive`, `/snapshot/{id}`, `/fr/snapshot/{id}`, and other key pages
  - API: `/api/health`, `/api/sources`, `/api/search`, `/api/snapshot/{id}`, `/api/usage`, `/api/exports`, `/api/changes`
- Replay (pywb) is intentionally skipped in CI (`--skip-replay`).
- The verifier includes minimal “not just 200” assertions:
  - `/archive` pages must include a stable Next.js marker (`/_next/static/`).
  - `/snapshot/{id}` pages must include the snapshot title returned by the API.
Where it runs:
- Monorepo backend workflow: `.github/workflows/backend-ci.yml`, job `e2e-smoke`
  - Tests backend changes against the in-tree `frontend/`.
  - Runs on pushes, pull requests, and manual runs from one checkout.
- Monorepo frontend workflow: `.github/workflows/frontend-ci.yml`
  - Verifies contract sync, `make frontend-ci`, and a frontend Docker build smoke.
Local reproduction (from the monorepo root):
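The repo's own entry point is the authoritative way to run this locally. As a rough sketch, once a local backend and frontend are running, the same verifier CI uses can be invoked directly; any flags beyond `--skip-replay` (for example base-URL overrides) are assumptions.

```bash
# Rough local equivalent of the CI smoke: run the public-surface verifier,
# skipping replay as CI does. Start the backend (uvicorn) and frontend
# (next start) first; their exact start commands are project-specific.
python healtharchive/scripts/verify_public_surface.py --skip-replay
```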
On failure, the script prints the tail of the backend/frontend logs that it writes under:
healtharchive/.tmp/ci-e2e-smoke/
CI uploads the backend smoke logs as a GitHub Actions artifact on failure:
- Backend repo: `backend-e2e-smoke-artifacts`
Step 2 — Solo-fast deploy gate (operator; recommended)
Objective: prevent broken deploys by only deploying when main is green.
Workflow (recommended):
- Local guardrails (recommended while branch protections are relaxed):
  - Run checks before you push:
    - From the monorepo root: `make backend-ci`, `make contract-check`, `make frontend-ci`
    - Optional broader gate: `make monorepo-ci`
    - Datasets repo still runs its own checks separately.
  - Optional before deploys: `healtharchive`: `make check-full`
  - Optional but recommended: install pre-push hooks so you can't forget (see the sketch after this list):
    - Backend: `scripts/install-pre-push-hook.sh` (set `HA_PRE_PUSH_FULL=1` for `make check-full`)
    - Frontend: `frontend/scripts/install-pre-push-hook.sh`
    - Datasets: https://github.com/jerdaw/healtharchive-datasets/blob/main/scripts/install-pre-push-hook.sh
- Push to `main`.
- Wait for GitHub Actions to go green on that commit.
- Deploy on the VPS:
  - Recommended (one command): `./scripts/vps-deploy.sh --apply --baseline-mode live`
    - Includes baseline drift + public-surface verify by default.
  - If you use a local alias like `dodeploy`, ensure you still run:
    - `./scripts/check_baseline_drift.py --mode live`
    - `./scripts/verify_public_surface.py`
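The install scripts in the repo are the source of truth for the pre-push hooks; the following is only a rough sketch of what such a hook does, built from the make targets documented above:

```bash
#!/usr/bin/env bash
# Illustrative .git/hooks/pre-push along the lines of scripts/install-pre-push-hook.sh.
# The repo's install script is authoritative; targets below are the ones documented above.
set -euo pipefail
if [ "${HA_PRE_PUSH_FULL:-0}" = "1" ]; then
  make check-full          # broader backend gate
else
  make backend-ci
  make contract-check
  make frontend-ci
fi
```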
Verification:
- The VPS deploy completes and both verification scripts pass.
Future (tighten later):
- When there are multiple committers or when you want stricter enforcement, switch to PR-only merges and require the backend/frontend checks in branch protection (track in ../planning/roadmap.md).
Step 3 — External uptime monitoring (operator; UptimeRobot settings)
Objective: catch real, user-visible failures with minimal noise.
Recommended minimal monitor set (HTTP(s) checks):
- API health
  - URL: https://api.healtharchive.ca/api/health
  - Expected: HTTP 200
  - Interval: 1–5 minutes
  - Note: backend supports `HEAD /api/health` for providers that default to `HEAD`.
- Frontend integration
  - URL: https://healtharchive.ca/archive
  - Expected: HTTP 200
  - Interval: 5 minutes
  - Optional: keyword assertion (stable string that should always appear).
- Replay base URL (optional but recommended if you rely on replay)
  - URL: https://replay.healtharchive.ca/
  - Expected: HTTP 200
  - Interval: 5–10 minutes
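To spot-check these URLs by hand before wiring up the provider, a minimal curl loop is enough; the repo's `scripts/smoke-external-monitors.sh` is the more complete version.

```bash
# Spot-check the same URLs the external monitors watch.
set -euo pipefail
for url in \
  "https://api.healtharchive.ca/api/health" \
  "https://healtharchive.ca/archive" \
  "https://replay.healtharchive.ca/"; do
  code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "$code $url"
  [ "$code" = "200" ] || exit 1
done
```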
Optional, higher-signal replay monitoring (recommended later):
- After annual jobs exist and are replay-indexed, add 1–3 “known-good replay URL” monitors (one per source or one total) pointing at a stable capture inside a `job-<id>` collection. Update them annually.
Verification:
- Optional pre-flight from the VPS (or any machine with internet + `curl`): `./scripts/smoke-external-monitors.sh`
- All monitors show “Up”.
- Alerting routes work (optional test: intentionally break a monitor briefly).
- Crawl-performance investigations use Grafana dashboard trends (crawl rate / progress age / restarts) rather than direct throughput alerts.
- Grafana access quickstart (SSH port-forward preferred): observability-and-private-stats.md
- Full observability setup/runbook: playbooks/observability/observability-guide.md
Step 4 — "Timer ran" monitoring (Healthchecks-style; optional but recommended)
Objective: get alerted if systemd timers stop running (even when the site is up).
This is intentionally optional: you already have high-value uptime checks in Step 3, but "timer ran" alerts are useful for catching silent failures (timer disabled, unit failing, disk low refusal, etc.).
Recommended checks to monitor:
- Baseline drift check (weekly): `healtharchive-baseline-drift-check.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_BASELINE_DRIFT`
- Public surface verification (daily): `healtharchive-public-surface-verify.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_PUBLIC_VERIFY`
- Replay reconcile (daily): `healtharchive-replay-reconcile.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_REPLAY_RECONCILE`
- Change tracking (daily): `healtharchive-change-tracking.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_CHANGE_TRACKING`
- Annual scheduler (yearly): `healtharchive-schedule-annual.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_SCHEDULE_ANNUAL`
- Annual search verify (daily, idempotent once per year): `healtharchive-annual-search-verify.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_ANNUAL_SEARCH_VERIFY`
- Coverage guardrails (daily): `healtharchive-coverage-guardrails.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_COVERAGE_GUARDRAILS`
- Replay smoke tests (daily): `healtharchive-replay-smoke.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_REPLAY_SMOKE`
- Cleanup automation (weekly): `healtharchive-cleanup-automation.timer`
  - Ping variable: `HEALTHARCHIVE_HC_PING_CLEANUP_AUTOMATION`
Note: avoid pinging high-frequency timers (e.g., crawl metrics, crawl auto-recover) to reduce noise. Prefer Prometheus watchdog-freshness alerts for those timers instead of per-run external pings.
Implementation approach (VPS):
- Create a check in your Healthchecks provider for each timer you care about.
- Store ping URLs only on the VPS in a root-owned env file: `/etc/healtharchive/healthchecks.env` (mode 0600, root:root)
  - Note: this file may be shared across multiple automations; it is OK to keep both:
    - legacy `HC_*` variables (DB backup + disk check)
    - newer `HEALTHARCHIVE_HC_PING_*` variables (systemd unit templates)
- Ensure the installed systemd units source that env file: `EnvironmentFile=-/etc/healtharchive/healthchecks.env`
- Ensure the unit uses the wrapper so ping URLs never appear in unit files: `/opt/healtharchive/scripts/systemd-healthchecks-wrapper.sh`
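An illustrative shape for that env file; the URLs are placeholders and the legacy variable names below are assumptions, so keep whatever names the backup and disk-check scripts already expect:

```bash
# /etc/healtharchive/healthchecks.env (illustrative only; root:root, mode 0600)
# Newer per-timer ping URLs consumed by the systemd unit templates:
HEALTHARCHIVE_HC_PING_BASELINE_DRIFT="https://hc-ping.com/REPLACE-WITH-UUID"
HEALTHARCHIVE_HC_PING_PUBLIC_VERIFY="https://hc-ping.com/REPLACE-WITH-UUID"
HEALTHARCHIVE_HC_PING_REPLAY_RECONCILE="https://hc-ping.com/REPLACE-WITH-UUID"
# Legacy HC_* variables (names here are hypothetical) used by the DB backup + disk check scripts:
HC_PING_DB_BACKUP="https://hc-ping.com/REPLACE-WITH-UUID"
```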
Safety posture:
- Pinging is best-effort; ping failures do not fail jobs.
- Removing `/etc/healtharchive/healthchecks.env` disables pings immediately.
Verification (VPS):
- Add a temporary ping URL for one service, then run: `sudo systemctl start healtharchive-replay-reconcile-dry-run.service`
- Confirm the check receives a ping.
Step 5 — Automated post-campaign search verification capture (optional)
Objective: once the annual campaign becomes search-ready, automatically capture golden-query /api/search JSON into a year-tagged directory for later diffing and audits.
What gets captured (recommended minimal set):
- `annual-status.json` and a human-readable `annual-status.txt`
- `meta.txt` (capture metadata)
- `<query>.pages.json` + `<query>.snapshots.json` for your golden query list
Implementation approach (VPS, systemd):
- This repo provides an optional daily timer that is idempotent:
- If the campaign is not ready, it exits 0 (no alert noise).
- If artifacts already exist for the current year/run-id, it exits 0.
Install and enable:
- Copy templates onto the VPS (see ../deployment/systemd/README.md): `healtharchive-annual-search-verify.service`, `healtharchive-annual-search-verify.timer`
- Reload systemd: `sudo systemctl daemon-reload`
- Enable the timer: `sudo systemctl enable --now healtharchive-annual-search-verify.timer`
Artifacts:
- Default location: /srv/healtharchive/ops/search-eval/<year>/final/
- To force re-run for a year: delete that directory and re-run the service.
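Once two years of captures exist, a quick way to diff a golden query between years; the query slug below is hypothetical, and `jq -S` just normalizes key order before diffing:

```bash
# Compare a golden query's captured search results across years (illustrative slug).
year_a=2025; year_b=2026; q="example-golden-query"
diff \
  <(jq -S . "/srv/healtharchive/ops/search-eval/${year_a}/final/${q}.pages.json") \
  <(jq -S . "/srv/healtharchive/ops/search-eval/${year_b}/final/${q}.pages.json")
```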
Verification (VPS):
- Force-run once: `sudo systemctl start healtharchive-annual-search-verify.service`
- Confirm it either:
  - exits 0 quickly (not ready), or
  - creates artifacts under /srv/healtharchive/ops/search-eval/<year>/final/.
Step 6 — Optional GitHub-driven deploys (CD) (infrastructure project)
Objective: reduce deploy mistakes without expanding the production attack surface.
Recommended posture for this project (single VPS, no staging backend):
- Keep deployments manual on the VPS.
- Use the deploy helper script: `scripts/vps-deploy.sh` (dry-run default; `--apply` to deploy)
Rationale:
- Avoids storing production access secrets in GitHub.
- Avoids granting passwordless sudo/SSH access to GitHub Actions.
- Keeps the operational path “boring” and easy to reason about.
Verification (VPS):
- Dry-run: `cd /opt/healtharchive && ./scripts/vps-deploy.sh`
- Apply: `cd /opt/healtharchive && ./scripts/vps-deploy.sh --apply`
1. Uptime and health checks
1.1 Backend health endpoint
Primary health endpoint:
- `GET https://api.healtharchive.ca/api/health`
- `HEAD https://api.healtharchive.ca/api/health` (supported; some uptime tools use `HEAD`)
Some uptime providers issue HEAD requests by default. The backend supports HEAD /api/health so monitors may use either method.
Expected behavior:
- HTTP 200
- A small JSON status body.
Optional detailed check:
- `GET https://api.healtharchive.ca/api/health?details=1`
- Includes jobs status counts and snapshots.total
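For a quick manual check from any shell (the exact JSON fields are not reproduced here; the `details=1` variant adds the job and snapshot counts noted above):

```bash
# Plain and detailed health checks, plus the HEAD variant some providers use.
curl -sS https://api.healtharchive.ca/api/health
curl -sS "https://api.healtharchive.ca/api/health?details=1"
curl -sS -I https://api.healtharchive.ca/api/health   # HEAD request; expect a 200 status line
```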
Suggested uptime monitor:
- Configure an external service (UptimeRobot, healthchecks.io, your cloud provider) to poll: https://api.healtharchive.ca/api/health
- If you later add a separate staging API, also poll: https://api-staging.healtharchive.ca/api/health
- Alert on:
  - 5xx responses.
  - Timeouts.
  - Repeated failures over a short window.
1.2 Frontend + integration check
To verify the frontend and backend integration:
GET https://healtharchive.ca/archive
Expected behavior:
- HTTP 200.
- Page renders with:
  - Filters header showing `Filters (live API)` when backend is up.
  - Real search results when snapshots exist.
Suggested uptime monitor:
- Configure a separate check that:
  - Downloads https://healtharchive.ca/archive.
  - Optionally asserts presence of a known string in the body (e.g. “HealthArchive.ca” or “Browse & search demo snapshots”).
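A one-line local equivalent of that keyword assertion (the string is one of the examples above; pick whichever is most stable on the page):

```bash
# Fetch the archive page and assert a known string appears in the body.
curl -sSf https://healtharchive.ca/archive | grep -q "HealthArchive.ca" && echo "keyword OK"
```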
1.3 Replay uptime check (optional but recommended if replay is in use)
To ensure replay is reachable:
GET https://replay.healtharchive.ca/
If you want a higher-signal check (recommended once you have stable annual jobs):
- Monitor a known-good replay URL inside a specific `job-<id>` collection: https://replay.healtharchive.ca/job-<id>/<original_url>
- Choose an original URL that is stable and low-cost to serve.
2. Metrics and alerting
2.1 Metrics endpoint
Metrics are exposed at:
- `GET https://api.healtharchive.ca/metrics`
- If you later add a separate staging API: `GET https://api-staging.healtharchive.ca/metrics`
This endpoint is protected by HEALTHARCHIVE_ADMIN_TOKEN. In Prometheus or a similar system, you will typically:
- Store the token in a secure place (e.g. Prometheus config / secret).
- Pass it via `Authorization: Bearer <token>` or an `X-Admin-Token` header.
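For a manual check, either supported header works (this assumes the token is exported in your shell as `HEALTHARCHIVE_ADMIN_TOKEN`):

```bash
# Scrape the metrics endpoint by hand using either supported header.
curl -sSf -H "Authorization: Bearer ${HEALTHARCHIVE_ADMIN_TOKEN}" \
  https://api.healtharchive.ca/metrics | head -n 20
curl -sSf -H "X-Admin-Token: ${HEALTHARCHIVE_ADMIN_TOKEN}" \
  https://api.healtharchive.ca/metrics >/dev/null && echo "metrics reachable"
```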
2.2 Key metrics
The metrics endpoint exposes, among others:
- Job counts:
healtharchive_jobs_total{status="queued"} 1
healtharchive_jobs_total{status="indexed"} 5
healtharchive_jobs_total{status="failed"} 0
- Cleanup status:
healtharchive_jobs_cleanup_status_total{cleanup_status="none"} 10
healtharchive_jobs_cleanup_status_total{cleanup_status="temp_cleaned"} 3
- Snapshot counts:
healtharchive_snapshots_total 123
healtharchive_snapshots_total{source="hc"} 80
healtharchive_snapshots_total{source="phac"} 43
- Page‑level crawl metrics:
healtharchive_jobs_pages_crawled_total 45678
healtharchive_jobs_pages_crawled_total{source="hc"} 30000
healtharchive_jobs_pages_failed_total 120
healtharchive_jobs_pages_failed_total{source="hc"} 30
- Search metrics (per-process; reset on restart):
healtharchive_search_requests_total 123
healtharchive_search_errors_total 0
healtharchive_search_duration_seconds_bucket{le="0.3"} 100
healtharchive_search_mode_total{mode="relevance_fts"} 80
healtharchive_search_mode_total{mode="relevance_fallback"} 25
healtharchive_search_mode_total{mode="relevance_fuzzy"} 5
healtharchive_search_mode_total{mode="boolean"} 2
healtharchive_search_mode_total{mode="url"} 3
healtharchive_search_mode_total{mode="newest"} 8
2.3 Example alert ideas (Prometheus‑style)
These are examples, not full rules, but can guide what you set up:
- Crawl state file unhealthy:
  - Alert if `healtharchive_crawl_running_job_state_file_ok == 1` but `healtharchive_crawl_running_job_state_parse_ok == 0` for >10m.
- Crawl stalled:
  - Alert if `healtharchive_crawl_running_job_stalled == 1` for >30m.
- Crawl degraded (slow but progressing):
  - Alert if `healtharchive_crawl_running_job_crawl_rate_ppm{source=~"hc|phac"} < 2` while `healtharchive_crawl_running_job_last_progress_age_seconds{source=~"hc|phac"} <= 300` and `healtharchive_crawl_running_job_stalled{source=~"hc|phac"} == 0` for >45m.
- Crawl completed but indexing not starting:
  - Alert if `healtharchive_indexing_pending_job_max_age_seconds` exceeds your SLA (e.g., >1h) and `healtharchive_crawl_running_jobs == 0`.
- High job failure rate:
  - Alert if `healtharchive_jobs_total{status="failed"}` jumps unexpectedly over a sliding window.
- No new snapshots over time:
  - Alert if `increase(healtharchive_snapshots_total[24h]) == 0` while jobs are being created, indicating indexing or crawl issues.
- Cleanup not happening:
  - Alert if `healtharchive_jobs_cleanup_status_total{cleanup_status="none"}` grows without bound while `temp_cleaned` remains flat.
Tune these based on actual volumes and acceptable thresholds.
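As one concrete example, the “Crawl stalled” idea above could be turned into a rule file like the sketch below. The rule-file path, severity label, and reload command are assumptions about your Prometheus installation; the observability guide referenced earlier is authoritative for the real rules.

```bash
# Write a single Prometheus alerting rule based on the "Crawl stalled" idea above.
# Rule-file path and reload mechanism are assumptions; adjust to your Prometheus setup.
sudo tee /etc/prometheus/rules/healtharchive-crawl.yml >/dev/null <<'EOF'
groups:
  - name: healtharchive-crawl
    rules:
      - alert: HealthArchiveCrawlStalled
        expr: healtharchive_crawl_running_job_stalled == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "HealthArchive crawl has been stalled for more than 30 minutes"
EOF
promtool check rules /etc/prometheus/rules/healtharchive-crawl.yml   # validate before reloading
sudo systemctl reload prometheus                                     # or send SIGHUP, depending on the install
```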
3. CI and branch protection
3.1 GitHub Actions workflows
Workflows live at:
- Backend: `.github/workflows/backend-ci.yml`
- Frontend: `.github/workflows/frontend-ci.yml`
Each should:
- Backend:
  - Run `make check`.
- Frontend:
  - Verify `make contract-check`.
  - Install deps via `npm ci`.
  - Run `make frontend-ci`.
  - Build the frontend Docker image.
Checklist:
- Ensure workflows are enabled in GitHub:
- Open the repo on GitHub.
- Go to Actions.
- If you see “Workflows are disabled for this fork”, click Enable.
- Verify that pushing to `main` or opening a PR triggers the workflows.
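The same checklist can be run from a shell with the GitHub CLI (assumes `gh` is authenticated for this repo):

```bash
# Show each workflow and whether it is active or disabled.
gh workflow list --all
# Re-enable a workflow that shows as disabled (filename form works).
gh workflow enable backend-ci.yml
```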
3.2 Branch protection on main (backend repo, solo-dev profile)
Backend enforcement is currently maintained as a GitHub ruleset.
For the monorepo target state:
- Ruleset name: `main-protection`
- Target branch: `main`
- Enforcement status: Active
- Bypass list: `Repository admin Role` (always allow)
- Enabled rules:
  - Restrict deletions
  - Require status checks to pass, with required checks `Backend CI / test`, `Backend CI / api-health`, `Frontend CI / contract-sync`, and `Frontend CI / lint-and-test`
  - Block force pushes
- Disabled rules (intentional for solo-dev speed):
  - Require a pull request before merging
  - Review/approval requirements
  - Code owner requirements
  - Extra code scanning/code quality gates
Important check-selection notes:
- Do not require `Backend CI / e2e-smoke` in branch protection (it does not run on pull requests).
- Do not require `Backend CI (Full) / test-full` (nightly/manual workflow).
- Do not require `actionlint`; it only runs when workflow files change.
Verification ritual (operator, monthly or after workflow edits):
- Open GitHub → Repository → Settings → Rules → Rulesets → `main-protection`.
- Confirm the required check list contains `Backend CI / test` and `Backend CI / api-health`.
- Confirm Block force pushes and Restrict deletions remain enabled.
- Open Actions and confirm the latest Backend CI run on `main` is green.
- Log any changes in an ops report note (date + what changed + why).
- Review `.github/migration-guard-exceptions.txt`:
  - remove expired rules,
  - ensure any active rule has a short-lived expiry and clear reason.
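For a scriptable version of the first checks above, the rulesets API reports each ruleset's name and enforcement state (assumes the `gh` CLI):

```bash
# Confirm the main-protection ruleset is still present and active.
gh api "repos/{owner}/{repo}/rulesets" \
  --jq '.[] | "\(.name)\t\(.enforcement)"'
```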
Latest evidence snapshot:
- 2026-03-21: branch protection verified active with required checks `Backend CI / test` and `Backend CI / api-health`, with Restrict deletions and Block force pushes enabled.
Future tighten-up trigger:
- If a second regular committer is added, enable PR-required merges and approval rules, then update this section.
4. Periodic operations review
On a regular cadence (e.g. monthly or quarterly), review:
- Uptime logs:
- Are there recurring outages at specific times?
- Metrics:
- Are job failures spiking?
- Are snapshots growing at the expected rate?
- Is cleanup keeping up with new jobs?
- CI:
- Are workflows still running on all relevant branches?
- Do new checks or tooling need to be added?
Recording a short “ops state” note alongside these reviews will make future debugging and capacity planning much easier.