Monitoring, uptime, and CI checklist

This file pulls together the ongoing operations aspects of the project:

  • Uptime and health monitoring.
  • Metrics and alerting.
  • CI enforcement and branch protection.

It is meant to complement:

  • ../deployment/hosting-and-live-server-to-dos.md
  • ../deployment/staging-rollout-checklist.md
  • ../deployment/production-rollout-checklist.md
  • service-levels.md — for SLO targets and commitments

0. Implementation steps (CI + external monitoring)

This section is a practical, sequential setup plan for enforcing CI and configuring external monitoring in the real world (GitHub + UptimeRobot, etc.).

Important: most of this is not “code you deploy” — it is configuration in:

  • GitHub repository settings (branch protection)
  • Your monitoring provider (UptimeRobot, Healthchecks, etc.)

Step 0 — Baseline audit + decisions (operator)

Objective: avoid duplicate monitors, avoid alert noise, and avoid “unknown settings drift”.

  1. Inventory current external monitors (UptimeRobot, etc.):
    • Monitor name
    • URL
    • Interval + timeout
    • Alert contacts/routes
    • Any keyword/body assertions
  2. Decide alert routing:
    • Which alerts should page you vs. just email (recommended: only “site down” pages; everything else emails).
    • Recommended ops rule: only notify on conditions that usually require human action.
      • Prefer dashboard-only for crawl throughput/churn trends.
      • Prefer post-auto-recovery alerts (notify after watchdog attempts fail/persist), not first-symptom alerts.
  3. Decide the main branch policy:
    • Mode A — Solo-fast (recommended for this project right now): direct pushes to main.
      • CI still runs on every push to main.
      • Deploys are gated by “green main” + VPS verification steps (below).
    • Mode B — Multi-committer (defer until needed): PR-only merges into main with required status checks + code owner review (track in ../planning/roadmap.md).
    • Switch to Mode B when there is more than one regular committer, or when you want stricter enforcement than “green main + local hooks”.

Verification:

  • You can point to a quick note (even in a personal doc) listing current monitors + what each covers.
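
If your provider has an API, the inventory can be scripted instead of copied by hand. A minimal sketch against the UptimeRobot v2 API (assumes curl and jq are available and UPTIMEROBOT_API_KEY is set; other providers will differ):

# List each monitor's name, URL, and interval from UptimeRobot.
curl -s -X POST https://api.uptimerobot.com/v2/getMonitors \
  -d "api_key=$UPTIMEROBOT_API_KEY" -d "format=json" \
  | jq -r '.monitors[] | "\(.friendly_name)\t\(.url)\t\(.interval)s"'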

Step 1 — Verify CI runs on main pushes (operator)

Objective: ensure CI runs on pushes to main so you can treat “green main” as the deploy gate.

  1. Confirm GitHub Actions workflows are enabled:
    • Repo → Actions → ensure workflows are enabled (not disabled by org/fork policy).
  2. Push a trivial commit to main (e.g. a doc tweak).
  3. Confirm the workflow runs and passes on that commit.

Verification:

  • GitHub Actions shows the backend CI workflow completing successfully on main.
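
If you prefer the terminal to the web UI, the GitHub CLI can confirm the same thing (a sketch, assuming gh is installed and authenticated against this repo):

# Show the most recent workflow runs on main and their conclusions.
gh run list --branch main --limit 5
# Watch a specific run until it finishes; --exit-status makes the command fail if the run failed.
gh run watch <run-id> --exit-status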

Check name inventory (branch protection)

Use stable workflow/job check names shown in GitHub’s UI. Avoid renaming workflow/job IDs after you start requiring them.

As of the monorepo migration branch, the checks are used as follows:

  • Monorepo required on main: Backend CI / test, Backend CI / api-health, Frontend CI / contract-sync, Frontend CI / lint-and-test
  • Monorepo non-required, but useful: Backend CI / e2e-smoke, Frontend CI / docker-build-smoke
  • Backend-only optional broader gate: Backend CI (Full) / test-full (nightly/manual)
  • Workflow-file hygiene when enabled: actionlint
  • Datasets repo (when protecting datasets main): Datasets CI / lint (required)
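
Before requiring any of these, it can help to confirm the exact names GitHub has recorded. A sketch using the GitHub CLI against the branch-rules endpoint (replace OWNER/REPO; assumes gh is authenticated with sufficient access):

# Show which rules (including required status checks) currently apply to main.
gh api repos/OWNER/REPO/rules/branches/main \
  --jq '.[] | select(.type == "required_status_checks")'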

Step 1a — Dependabot review/merge policy (operator + repo config)

Objective: reduce weekly maintenance overhead while preserving CI gating and human-only authorship in repository history.

Current repo policy:

  • Dependabot PRs stay open for human review and human-authored landing commits.
  • semver-patch and semver-minor updates may still be accepted quickly once CI passes, but they should be recreated in a human-authored commit or branch instead of merging the bot-authored branch directly.
  • semver-major updates remain manual review by default.
  • Close superseded or declined bot PRs after the human-authored replacement lands (or after you decide not to take the update).
  • Monorepo required checks on main should include Backend CI / test, Backend CI / api-health, Frontend CI / contract-sync, and Frontend CI / lint-and-test.
  • Backend CI / e2e-smoke and Frontend CI / docker-build-smoke remain useful non-required signals unless branch-protection appetite changes later.
  • The main branch no longer relies on broad CODEOWNERS review requests for dependency PRs, which avoids reviewer-notification noise while still keeping merge control with a human.
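
To see what is queued under this policy at a glance, the GitHub CLI is convenient (a sketch, assuming gh is installed and authenticated):

# List open Dependabot PRs, then inspect CI status for one of them.
gh pr list --author app/dependabot --state open
gh pr checks <pr-number>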

Step 1b — End-to-end smoke checks (CI)

Objective: catch regressions where the apps “build” but user‑critical paths fail at runtime.

What the smoke does:

  • Starts the backend locally (uvicorn) with a tiny seeded SQLite + WARC dataset.
  • Builds and starts the frontend locally (next start) pointing at that backend.
  • Runs healtharchive/scripts/verify_public_surface.py against:
    • Frontend: /archive, /fr/archive, /snapshot/{id}, /fr/snapshot/{id}, and other key pages
    • API: /api/health, /api/sources, /api/search, /api/snapshot/{id}, /api/usage, /api/exports, /api/changes
  • Replay (pywb) is intentionally skipped in CI (--skip-replay).
  • The verifier includes minimal “not just 200” assertions:
    • /archive pages must include a stable Next.js marker (/_next/static/).
    • /snapshot/{id} pages must include the snapshot title returned by the API.

Where it runs:

  • Monorepo backend workflow: .github/workflows/backend-ci.yml job e2e-smoke
    • Tests backend changes against the in-tree frontend/.
    • Runs on pushes, pull requests, and manual runs from one checkout.
  • Monorepo frontend workflow: .github/workflows/frontend-ci.yml
    • Verifies contract sync, make frontend-ci, and a frontend Docker build smoke.

Local reproduction (from the monorepo root):

make venv
make frontend-install
make integration-e2e

On failure, the script prints the tail of the backend/frontend logs that it writes under:

  • healtharchive/.tmp/ci-e2e-smoke/
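
For example, to look at those logs after a local failure (a sketch; the exact file names under that directory may differ):

ls healtharchive/.tmp/ci-e2e-smoke/
tail -n 50 healtharchive/.tmp/ci-e2e-smoke/*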

CI uploads the backend smoke logs as a GitHub Actions artifact on failure:

  • Backend repo: backend-e2e-smoke-artifacts

Step 2 — Deploy only when main is green (operator + VPS)

Objective: prevent broken deploys by only deploying when main is green.

Workflow (recommended):

  1. Local guardrails (recommended while branch protections are relaxed). Run checks before you push:
    • From the monorepo root:
      • make backend-ci
      • make contract-check
      • make frontend-ci
      • optional broader gate: make monorepo-ci
    • Datasets repo still runs its own checks separately.
    • Optional before deploys: healtharchive: make check-full
  2. Optional but recommended: install pre-push hooks so you can't forget:
    • Backend: scripts/install-pre-push-hook.sh (set HA_PRE_PUSH_FULL=1 for make check-full)
    • Frontend: frontend/scripts/install-pre-push-hook.sh
    • Datasets: https://github.com/jerdaw/healtharchive-datasets/blob/main/scripts/install-pre-push-hook.sh
  3. Push to main.
  4. Wait for GitHub Actions to go green on that commit (see the guard sketch after this list).
  5. Deploy on the VPS:
    • Recommended (one command): ./scripts/vps-deploy.sh --apply --baseline-mode live
      • Includes baseline drift + public-surface verify by default.
    • If you use a local alias like dodeploy, ensure you still run:
      • ./scripts/check_baseline_drift.py --mode live
      • ./scripts/verify_public_surface.py
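
If you want step 4 above to be mechanical rather than a manual check, a small pre-deploy guard can ask GitHub for the latest conclusion (a sketch, assuming the GitHub CLI on the machine you deploy from, and that the workflow name matches what Actions displays):

# Abort unless the newest "Backend CI" run on main succeeded.
conclusion=$(gh run list --branch main --workflow "Backend CI" --limit 1 \
  --json conclusion --jq '.[0].conclusion')
if [ "$conclusion" != "success" ]; then
  echo "main is not green (latest Backend CI: ${conclusion:-none}); aborting" >&2
  exit 1
fi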

Verification:

  • The VPS deploy completes and both verification scripts pass.

Future (tighten later):

  • When there are multiple committers or when you want stricter enforcement, switch to PR-only merges and require the backend/frontend checks in branch protection (track in ../planning/roadmap.md).

Step 3 — External uptime monitoring (operator; UptimeRobot settings)

Objective: catch real, user-visible failures with minimal noise.

Recommended minimal monitor set (HTTP(s) checks):

  1. API health
    • URL: https://api.healtharchive.ca/api/health
    • Expected: HTTP 200
    • Interval: 1–5 minutes
    • Note: backend supports HEAD /api/health for providers that default to HEAD.
  2. Frontend integration
    • URL: https://healtharchive.ca/archive
    • Expected: HTTP 200
    • Interval: 5 minutes
    • Optional: keyword assertion (stable string that should always appear).
  3. Replay base URL (optional but recommended if you rely on replay)
    • URL: https://replay.healtharchive.ca/
    • Expected: HTTP 200
    • Interval: 5–10 minutes
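
Before wiring these into the provider, the same URLs can be probed once with curl (a minimal sketch approximating what the monitors will do; not a substitute for ./scripts/smoke-external-monitors.sh below):

# Expect HTTP 200 from each monitored URL.
for url in \
  https://api.healtharchive.ca/api/health \
  https://healtharchive.ca/archive \
  https://replay.healtharchive.ca/; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "$url -> $code"
done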

Optional, higher-signal replay monitoring (recommended later):

  • After annual jobs exist and are replay-indexed, add 1–3 “known-good replay URL” monitors (one per source or one total) pointing at a stable capture inside a job-<id> collection. Update them annually.

Verification:

  • Optional pre-flight from the VPS (or any machine with internet + curl):
    • ./scripts/smoke-external-monitors.sh
  • All monitors show “Up”.
  • Alerting routes work (optional test: intentionally break a monitor briefly).
  • Crawl-performance investigations use Grafana dashboard trends (crawl rate / progress age / restarts) rather than direct throughput alerts.
  • Grafana access quickstart (SSH port-forward preferred): observability-and-private-stats.md
  • Full observability setup/runbook: playbooks/observability/observability-guide.md

Step 4 — Healthchecks pings for systemd timers (optional; operator + VPS)

Objective: get alerted if systemd timers stop running (even when the site is up).

This is intentionally optional: you already have high-value uptime checks in Step 3, but "timer ran" alerts are useful for catching silent failures (timer disabled, unit failing, disk low refusal, etc.).

Recommended checks to monitor:

  • Baseline drift check (weekly):
    • healtharchive-baseline-drift-check.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_BASELINE_DRIFT
  • Public surface verification (daily):
    • healtharchive-public-surface-verify.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_PUBLIC_VERIFY
  • Replay reconcile (daily):
    • healtharchive-replay-reconcile.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_REPLAY_RECONCILE
  • Change tracking (daily):
    • healtharchive-change-tracking.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_CHANGE_TRACKING
  • Annual scheduler (yearly):
    • healtharchive-schedule-annual.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_SCHEDULE_ANNUAL
  • Annual search verify (daily, idempotent once per year):
    • healtharchive-annual-search-verify.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_ANNUAL_SEARCH_VERIFY
  • Coverage guardrails (daily):
    • healtharchive-coverage-guardrails.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_COVERAGE_GUARDRAILS
  • Replay smoke tests (daily):
    • healtharchive-replay-smoke.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_REPLAY_SMOKE
  • Cleanup automation (weekly):
    • healtharchive-cleanup-automation.timer
    • Ping variable: HEALTHARCHIVE_HC_PING_CLEANUP_AUTOMATION

Note: avoid pinging high-frequency timers (e.g., crawl metrics, crawl auto-recover) to reduce noise. Prefer Prometheus watchdog-freshness alerts for those timers instead of per-run external pings.

Implementation approach (VPS):

  1. Create a check in your Healthchecks provider for each timer you care about.
  2. Store ping URLs only on the VPS in a root-owned env file:
    • /etc/healtharchive/healthchecks.env (mode 0600, root:root)
    • Note: this file may be shared across multiple automations; it is OK to keep both:
      • legacy HC_* variables (DB backup + disk check)
      • newer HEALTHARCHIVE_HC_PING_* variables (systemd unit templates)
  3. Ensure the installed systemd units source that env file:
    • EnvironmentFile=-/etc/healtharchive/healthchecks.env
  4. Ensure the unit uses the wrapper so ping URLs never appear in unit files (a simplified sketch of the wrapper pattern follows this list):
    • /opt/healtharchive/scripts/systemd-healthchecks-wrapper.sh
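
For orientation, the wrapper boils down to: source the env file, run the unit's command, then best-effort ping with the exit status appended. A simplified sketch of that shape (illustrative only; the real wrapper is the script above, and the argument convention here is hypothetical):

# Sketch only. Hypothetical usage: wrapper.sh PING_VAR_NAME command [args...]
var_name="$1"; shift
[ -r /etc/healtharchive/healthchecks.env ] && . /etc/healtharchive/healthchecks.env
"$@"; rc=$?
ping_url=$(eval "printf '%s' \"\${$var_name:-}\"")
if [ -n "$ping_url" ]; then
  # Healthchecks-style providers treat an appended /0 as success, other codes as failure.
  curl -fsS -m 10 --retry 2 "$ping_url/$rc" >/dev/null 2>&1 || true
fi
exit $rc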

Safety posture:

  • Pinging is best-effort; ping failures do not fail jobs.
  • Removing /etc/healtharchive/healthchecks.env disables pings immediately.

Verification (VPS):

  • Add a temporary ping URL for one service, then run:
    • sudo systemctl start healtharchive-replay-reconcile-dry-run.service
  • Confirm the check receives a ping.

Step 5 — Automated post-campaign search verification capture (optional)

Objective: once the annual campaign becomes search-ready, automatically capture golden-query /api/search JSON into a year-tagged directory for later diffing and audits.

What gets captured (recommended minimal set):

  • annual-status.json and a human-readable annual-status.txt
  • meta.txt (capture metadata)
  • <query>.pages.json + <query>.snapshots.json for your golden query list

Implementation approach (VPS, systemd):

  • This repo provides an optional daily timer that is idempotent:
    • If the campaign is not ready, it exits 0 (no alert noise).
    • If artifacts already exist for the current year/run-id, it exits 0.

Install and enable:

  1. Copy templates onto the VPS (see ../deployment/systemd/README.md):
    • healtharchive-annual-search-verify.service
    • healtharchive-annual-search-verify.timer
  2. Reload systemd:
    • sudo systemctl daemon-reload
  3. Enable the timer:
    • sudo systemctl enable --now healtharchive-annual-search-verify.timer

Artifacts:

  • Default location: /srv/healtharchive/ops/search-eval/<year>/final/
  • To force re-run for a year: delete that directory and re-run the service.
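
For example, to force a re-run for one year (a sketch; substitute the year you want to regenerate):

sudo rm -rf /srv/healtharchive/ops/search-eval/2026/final
sudo systemctl start healtharchive-annual-search-verify.service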

Verification (VPS):

  • Force-run once:
    • sudo systemctl start healtharchive-annual-search-verify.service
  • Confirm it either:
    • exits 0 quickly (not ready), or
    • creates artifacts under /srv/healtharchive/ops/search-eval/<year>/final/.

Step 6 — Optional GitHub-driven deploys (CD) (infrastructure project)

Objective: reduce deploy mistakes without expanding the production attack surface.

Recommended posture for this project (single VPS, no staging backend):

  • Keep deployments manual on the VPS.
  • Use the deploy helper script:
    • scripts/vps-deploy.sh (dry-run default; --apply to deploy)

Rationale:

  • Avoids storing production access secrets in GitHub.
  • Avoids granting passwordless sudo/SSH access to GitHub Actions.
  • Keeps the operational path “boring” and easy to reason about.

Verification (VPS):

  • Dry-run: cd /opt/healtharchive && ./scripts/vps-deploy.sh
  • Apply: cd /opt/healtharchive && ./scripts/vps-deploy.sh --apply

1. Uptime and health checks

1.1 Backend health endpoint

Primary health endpoint:

  • GET https://api.healtharchive.ca/api/health
  • HEAD https://api.healtharchive.ca/api/health

Some uptime providers issue HEAD requests by default; the backend supports HEAD /api/health, so monitors may use either method.

Expected behavior:

  • HTTP 200
  • JSON body like:

{
  "status": "ok",
  "checks": {
    "db": "ok"
  }
}
Optional operator detail view:

  • GET https://api.healtharchive.ca/api/health?details=1
  • Includes jobs status counts and snapshots.total
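
A quick body-level check from any shell (assumes curl and jq; jq -e sets a non-zero exit code when the expression is false):

curl -fsS https://api.healtharchive.ca/api/health \
  | jq -e '.status == "ok" and .checks.db == "ok"'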

Suggested uptime monitor:

  • Configure an external service (UptimeRobot, healthchecks.io, your cloud provider) to poll:
    • https://api.healtharchive.ca/api/health
  • If you later add a separate staging API, also poll:
    • https://api-staging.healtharchive.ca/api/health
  • Alert on:
    • 5xx responses.
    • Timeouts.
    • Repeated failures over a short window.

1.2 Frontend + integration check

To verify the frontend and backend integration:

  • GET https://healtharchive.ca/archive

Expected behavior:

  • HTTP 200.
  • Page renders with:
    • Filters header showing Filters (live API) when backend is up.
    • Real search results when snapshots exist.

Suggested uptime monitor:

  • Configure a separate check that:
    • Downloads https://healtharchive.ca/archive.
    • Optionally asserts presence of a known string in the body (e.g. “HealthArchive.ca” or “Browse & search demo snapshots”).
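
The keyword assertion is easy to rehearse with curl before configuring it in the provider (a sketch; use whichever marker string you standardize on):

curl -fsS https://healtharchive.ca/archive | grep -q 'HealthArchive.ca' \
  && echo 'marker found' || echo 'marker missing'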

To ensure replay is reachable:

  • GET https://replay.healtharchive.ca/

If you want a higher-signal check (recommended once you have stable annual jobs):

  • Monitor a known-good replay URL inside a specific job-<id> collection:
    • https://replay.healtharchive.ca/job-<id>/<original_url>
  • Choose an original URL that is stable and low-cost to serve.

2. Metrics and alerting

2.1 Metrics endpoint

Metrics are exposed at:

  • GET https://api.healtharchive.ca/metrics
  • If you later add a separate staging API:
    • GET https://api-staging.healtharchive.ca/metrics

This endpoint is protected by HEALTHARCHIVE_ADMIN_TOKEN. In Prometheus or a similar system, you will typically:

  • Store the token in a secure place (e.g. Prometheus config / secret).
  • Pass it via Authorization: Bearer <token> or X-Admin-Token header.
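
A manual scrape looks like this (a sketch, assuming the token is exported as HEALTHARCHIVE_ADMIN_TOKEN in your shell):

curl -fsS -H "Authorization: Bearer $HEALTHARCHIVE_ADMIN_TOKEN" \
  https://api.healtharchive.ca/metrics | head -n 20
# Alternatively: -H "X-Admin-Token: $HEALTHARCHIVE_ADMIN_TOKEN"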

2.2 Key metrics

The metrics endpoint exposes, among others:

  • Job counts:
    healtharchive_jobs_total{status="queued"} 1
    healtharchive_jobs_total{status="indexed"} 5
    healtharchive_jobs_total{status="failed"} 0
  • Cleanup status:
    healtharchive_jobs_cleanup_status_total{cleanup_status="none"} 10
    healtharchive_jobs_cleanup_status_total{cleanup_status="temp_cleaned"} 3
  • Snapshot counts:
    healtharchive_snapshots_total 123
    healtharchive_snapshots_total{source="hc"} 80
    healtharchive_snapshots_total{source="phac"} 43
  • Page‑level crawl metrics:
    healtharchive_jobs_pages_crawled_total 45678
    healtharchive_jobs_pages_crawled_total{source="hc"} 30000
    healtharchive_jobs_pages_failed_total 120
    healtharchive_jobs_pages_failed_total{source="hc"} 30
  • Search metrics (per-process; reset on restart):
    healtharchive_search_requests_total 123
    healtharchive_search_errors_total 0
    healtharchive_search_duration_seconds_bucket{le="0.3"} 100
    healtharchive_search_mode_total{mode="relevance_fts"} 80
    healtharchive_search_mode_total{mode="relevance_fallback"} 25
    healtharchive_search_mode_total{mode="relevance_fuzzy"} 5
    healtharchive_search_mode_total{mode="boolean"} 2
    healtharchive_search_mode_total{mode="url"} 3
    healtharchive_search_mode_total{mode="newest"} 8
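
To pull a single value out of the exposition text, e.g. when spot-checking from the VPS, plain awk is enough (a sketch, reusing the authenticated scrape from 2.1):

curl -fsS -H "Authorization: Bearer $HEALTHARCHIVE_ADMIN_TOKEN" \
  https://api.healtharchive.ca/metrics \
  | awk '$1 == "healtharchive_jobs_total{status=\"failed\"}" {print $2}'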

2.3 Example alert ideas (Prometheus‑style)

These are examples, not full rules, but they can guide what you set up:

  • Crawl state file unhealthy:
    • Alert if healtharchive_crawl_running_job_state_file_ok == 1 but healtharchive_crawl_running_job_state_parse_ok == 0 for >10m.
  • Crawl stalled:
    • Alert if healtharchive_crawl_running_job_stalled == 1 for >30m.
  • Crawl degraded (slow but progressing):
    • Alert if healtharchive_crawl_running_job_crawl_rate_ppm{source=~"hc|phac"} < 2 while healtharchive_crawl_running_job_last_progress_age_seconds{source=~"hc|phac"} <= 300 and healtharchive_crawl_running_job_stalled{source=~"hc|phac"} == 0 for >45m.
  • Crawl completed but indexing not starting:
    • Alert if healtharchive_indexing_pending_job_max_age_seconds exceeds your SLA (e.g., >1h) and healtharchive_crawl_running_jobs == 0.
  • High job failure rate:
    • Alert if healtharchive_jobs_total{status="failed"} jumps unexpectedly over a sliding window.
  • No new snapshots over time:
    • Alert if increase(healtharchive_snapshots_total[24h]) == 0 while jobs are being created, indicating indexing or crawl issues.
  • Cleanup not happening:
    • Alert if healtharchive_jobs_cleanup_status_total{cleanup_status="none"} grows without bound while temp_cleaned remains flat.

Tune these based on actual volumes and acceptable thresholds.
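
Before committing any of these to alert rules, the expressions can be evaluated ad hoc against the Prometheus HTTP query API (a sketch; assumes Prometheus is reachable on localhost:9090, e.g. over an SSH port-forward, so adjust the host to your setup):

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=healtharchive_crawl_running_job_stalled == 1' \
  | jq '.data.result'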


3. CI and branch protection

3.1 GitHub Actions workflows

Workflows live at:

  • Backend: .github/workflows/backend-ci.yml
  • Frontend: .github/workflows/frontend-ci.yml

The two workflows should cover, respectively:

  • Backend:
    • Run make check.
  • Frontend:
    • Verify make contract-check.
    • Install deps via npm ci.
    • Run make frontend-ci.
    • Build the frontend Docker image.

Checklist:

  • Ensure workflows are enabled in GitHub:
    • Open the repo on GitHub.
    • Go to Actions.
    • If you see “Workflows are disabled for this fork”, click Enable.
  • Verify that pushing to main or opening a PR triggers the workflows (a CLI sketch follows).
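
The same checks are scriptable with the GitHub CLI (a sketch, assuming gh is installed and authenticated):

# Confirm workflows exist and are active, then look at recent runs for one of them.
gh workflow list
gh run list --workflow "Frontend CI" --limit 5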

3.2 Branch protection on main (backend repo, solo-dev profile)

Backend enforcement is currently maintained as a GitHub ruleset.

For the monorepo target state:

  • Ruleset name: main-protection
  • Target branch: main
  • Enforcement status: Active
  • Bypass list: Repository admin Role (always allow)
  • Enabled rules:
    • Restrict deletions
    • Require status checks to pass with required checks Backend CI / test, Backend CI / api-health, Frontend CI / contract-sync, and Frontend CI / lint-and-test
    • Block force pushes
  • Disabled rules (intentional for solo-dev speed):
    • Require a pull request before merging
    • Review/approval requirements
    • Code owner requirements
    • Extra code scanning/code quality gates

Important check-selection notes:

  • Do not require Backend CI / e2e-smoke in branch protection (it does not run on pull requests).
  • Do not require Backend CI (Full) / test-full (nightly/manual workflow).
  • Do not require actionlint; it only runs when workflow files change.

Verification ritual (operator, monthly or after workflow edits):

  1. Open GitHub → Repository → Settings → Rules → Rulesets → main-protection.
  2. Confirm required check list contains Backend CI / test and Backend CI / api-health.
  3. Confirm Block force pushes and Restrict deletions remain enabled.
  4. Open Actions and confirm the latest Backend CI run on main is green.
  5. Log any changes in an ops report note (date + what changed + why).
  6. Review .github/migration-guard-exceptions.txt:
    • remove expired rules,
    • ensure any active rule has a short-lived expiry and clear reason.

Latest evidence snapshot:

  • 2026-03-21: branch protection verified active with required checks Backend CI / test and Backend CI / api-health, with Restrict deletions and Block force pushes enabled.

Future tighten-up trigger:

  • If a second regular committer is added, enable PR-required merges and approval rules, then update this section.

4. Periodic operations review

On a regular cadence (e.g. monthly or quarterly), review:

  • Uptime logs:
    • Are there recurring outages at specific times?
  • Metrics:
    • Are job failures spiking?
    • Are snapshots growing at the expected rate?
    • Is cleanup keeping up with new jobs?
  • CI:
    • Are workflows still running on all relevant branches?
    • Do new checks or tooling need to be added?

Recording a short “ops state” note alongside these reviews will make future debugging and capacity planning much easier.