
HealthArchive Documentation Hub

HealthArchive archives Canadian government health websites for research and accountability. The application code lives in a single monorepo, and dataset releases live in a separate repository. This page helps you navigate the documentation and the boundaries between the two codebases.


🚀 Quick Start by Role

Choose your entry point based on what you want to do:

👤 I'm an Operator

Goal: Deploy, monitor, and maintain the production system

Start Here:

  1. Production Runbook - Complete production setup guide
  2. Operator Responsibilities - Must-do checklist
  3. Deploy & Verify - Safe deployment process
  4. Incident Response - Emergency procedures

Key Resources:

  - Ops Cadence Checklist - Daily/weekly/quarterly tasks
  - Monitoring Checklist - Set up alerts and checks
  - All Playbooks - 30+ operational procedures


💻 I'm a Developer

Goal: Contribute code, fix bugs, add features

Start Here:

  1. Quick Start Guide - Get running in 5 minutes
  2. Your First Contribution - Step-by-step tutorial
  3. Dev Environment Setup - Detailed local setup
  4. Live Testing Guide - Run the full pipeline locally

Key Resources:

  - Architecture Walkthrough - Visual guide to how it all works
  - Architecture Deep Dive - Complete technical reference
  - Testing Guidelines - How to write and run tests
  - Contributing Guide - Code standards and workflow


🔧 I'm an API Consumer / Researcher

Goal: Search the archive and retrieve historical snapshots

Start Here:

  1. API Consumer Guide - Complete API walkthrough with examples
  2. Interactive API Docs - Try the API in your browser
  3. API Reference - Full OpenAPI specification

Quick API Test:

curl "https://api.healtharchive.ca/api/search?q=vaccines&source=hc"
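The same request can be issued from Python. The sketch below uses only the standard library. The host, the /api/search path, and the q and source parameters are taken from the curl example above; the function names are illustrative, and the shape of the JSON response is deliberately not assumed (see the API Reference for the actual schema).

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and path come from the curl example; everything else is illustrative.
API_BASE = "https://api.healtharchive.ca"


def build_search_url(query: str, source: Optional[str] = None) -> str:
    """Build a /api/search URL with URL-encoded query parameters."""
    params = {"q": query}
    if source:
        params["source"] = source
    return f"{API_BASE}/api/search?{urlencode(params)}"


def search(query: str, source: Optional[str] = None, timeout: float = 10.0):
    """Fetch search results and decode the JSON body as-is."""
    with urlopen(build_search_url(query, source), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the URL only; no network access is needed for this line.
    print(build_search_url("vaccines", "hc"))
    # Uncomment to call the live API (requires network access):
    # print(json.dumps(search("vaccines", "hc"), indent=2))
```

Keeping URL construction separate from the network call makes the encoding logic easy to check offline.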

Key Resources:

  - Dataset Downloads: healtharchive-datasets - Bulk metadata exports
  - Data Handling Policy - Retention and privacy
  - Live Site: healtharchive.ca - Web interface


📚 I'm a Student / New to the Project

Goal: Learn how HealthArchive works

Recommended Reading Order:

  1. Quick Start - High-level overview
  2. Architecture Walkthrough - Follow a page from crawl to search
  3. Architecture Reference - Deep technical details
  4. Documentation Guidelines - How docs stay organized

Tutorials:

  - Your First Contribution - Hands-on coding tutorial
  - Debugging a Failed Crawl - Practical troubleshooting
  - Live Testing - Run it yourself locally


📦 Monorepo + Dataset Architecture

HealthArchive keeps app code in one repository and dataset releases in a separate repository:

🔙 Backend (This Repo Root)

Purpose: API, crawler, database, operations, and all internal infrastructure

Location: github.com/jerdaw/healtharchive

Documentation: docs.healtharchive.ca (you are here)

What Lives Here:

  - ✅ Crawler (archive_tool) and job orchestration
  - ✅ Database models and indexing pipeline
  - ✅ RESTful JSON API (FastAPI)
  - ✅ Operations runbooks and playbooks
  - ✅ Deployment guides and systemd units
  - ✅ Architecture and developer docs
  - ✅ Decision records and incident notes

Tech Stack: Python, FastAPI, SQLAlchemy, PostgreSQL, Docker, systemd


🌐 Frontend (frontend/)

Purpose: Public-facing website and user interface

Location: frontend/ inside the app monorepo

Live Site: healtharchive.ca

What Lives Here:

  - ✅ Next.js web application (search UI, snapshot viewer)
  - ✅ Public content (status page, impact statement, changelog)
  - ✅ Internationalization (i18n) - English and French
  - ✅ UI/UX documentation

Tech Stack: Next.js 16, React, TypeScript, Tailwind CSS

Canonical Frontend Docs: frontend/docs/


📊 Datasets

Purpose: Versioned, citable metadata-only dataset releases

Location: github.com/jerdaw/healtharchive-datasets

What Lives Here:

  - ✅ Snapshot metadata exports (JSON/CSV)
  - ✅ Checksums and integrity manifests
  - ✅ Dataset release documentation
  - ✅ Dataset integrity policies

Why Separate?: Enables versioned, citable releases independent of code changes

Datasets Docs in This Repo: datasets-external/README.md (pointer only)

Canonical Datasets Docs: datasets/README.md


🗺️ Where Things Live (Source of Truth Map)

| Content Type | Lives In | Link |
| --- | --- | --- |
| Operations & Runbooks | Backend repo | docs.healtharchive.ca/operations |
| Architecture & Dev Guides | Backend repo | docs.healtharchive.ca/architecture |
| API Documentation | Backend repo | docs.healtharchive.ca/api |
| Public Changelog | Monorepo frontend docs | github.com/jerdaw/healtharchive/.../frontend/docs/changelog-process.md |
| Status Page | Monorepo frontend code | github.com/jerdaw/healtharchive/.../frontend/src/app/%5Blocale%5D/status/page.tsx |
| Impact Statement | Monorepo frontend code | github.com/jerdaw/healtharchive/.../frontend/src/app/%5Blocale%5D/impact/page.tsx |
| Dataset Releases | Datasets repo | github.com/jerdaw/healtharchive-datasets |
| I18n Guidelines | Monorepo frontend docs | github.com/jerdaw/healtharchive/.../frontend/docs/i18n.md |

Principle: Each doc has one canonical source. Other repos link to it.


🔗 Linking Conventions

In GitHub Issues/PRs

Use full GitHub URLs:

See the [production runbook](https://github.com/jerdaw/healtharchive/blob/main/docs/deployment/production-single-vps.md)

In Documentation

For docs users: Use the docs site URLs:

See the [Production Runbook](https://docs.healtharchive.ca/deployment/production-single-vps/)

For frontend paths in the monorepo or cross-repo references: Use full GitHub URLs:

Frontend changelog process: [changelog-process.md](https://github.com/jerdaw/healtharchive/blob/main/frontend/docs/changelog-process.md)
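The two URL styles above can be derived mechanically from a repository path. The following is a minimal sketch, assuming the docs site serves each docs/<path>.md file at /<path>/ (as the Production Runbook example suggests) and that links target the main branch. The helper names are hypothetical and not part of the project, and index pages are not handled.

```python
from pathlib import PurePosixPath

# Both prefixes are copied from the examples above.
GITHUB_BLOB = "https://github.com/jerdaw/healtharchive/blob/main"
DOCS_SITE = "https://docs.healtharchive.ca"


def github_url(repo_path: str) -> str:
    """Full GitHub URL for any file in the monorepo (issues, PRs, frontend paths)."""
    return f"{GITHUB_BLOB}/{repo_path.lstrip('/')}"


def docs_site_url(repo_path: str) -> str:
    """Docs-site URL for a Markdown file under docs/.

    Assumes directory-style URLs: docs/a/b.md -> /a/b/
    """
    path = PurePosixPath(repo_path.lstrip("/"))
    if path.parts[:1] != ("docs",) or path.suffix != ".md":
        raise ValueError(f"not a docs/ Markdown file: {repo_path}")
    # Drop the leading docs/ component and the .md suffix.
    page = PurePosixPath(*path.parts[1:]).with_suffix("")
    return f"{DOCS_SITE}/{page}/"


if __name__ == "__main__":
    runbook = "docs/deployment/production-single-vps.md"
    print(github_url(runbook))     # for GitHub issues and PRs
    print(docs_site_url(runbook))  # for docs readers
    print(github_url("frontend/docs/changelog-process.md"))
```

Raising on paths outside docs/ mirrors the convention that frontend and cross-repo references should always use full GitHub URLs.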

Local Development (Monorepo Workspace)

Recommended local layout:

/home/user/healtharchive/
├── healtharchive/
└── healtharchive-datasets/

Some docs use workspace paths like frontend/... or healtharchive-datasets/... for convenience.

Note: These paths only work in local workspaces, not on GitHub.


🏗️ System Architecture (High Level)

graph TB
    subgraph "HealthArchive Backend"
        CLI[CLI Commands] -->|Create| Jobs[(Database)]
        Worker[Worker Process] -->|Poll| Jobs
        Worker -->|Execute| Crawler[Archive Tool]
        Crawler -->|Docker| Zimit[Zimit Crawler]
        Zimit -->|Write| WARC[WARC Files]
        Worker -->|Index| WARC
        WARC -->|Extract| Snapshots[(Snapshots)]
        API[FastAPI API] -->|Query| Snapshots
    end

    subgraph "HealthArchive Frontend"
        UI[Next.js UI] -->|API Calls| API
        API -->|JSON| UI
        UI -->|Display| Users[End Users]
    end

    subgraph "HealthArchive Datasets"
        Export[Export Script] -->|Read| Snapshots
        Export -->|Write| Datasets[Dataset Files]
        Researchers -->|Download| Datasets
    end

Data Flow:

  1. Crawl: CLI creates job → Worker runs crawler → Docker writes WARCs
  2. Index: Worker parses WARCs → Extracts text → Stores snapshots in DB
  3. Serve: API queries DB → Returns JSON → Frontend displays results
  4. Export: Scripts export metadata → Version as datasets → Researchers download

See: Architecture Walkthrough for detailed data flow
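To make the four stages concrete, here is a toy in-memory model of the flow. It is not the project's code: every class, method, and field name is hypothetical, a plain dict stands in for WARC files, lists stand in for database tables, and the crawler is replaced by a caller-supplied fetch function. The real system uses a database, Docker, and Zimit as described above.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    url: str
    status: str = "queued"  # queued -> crawled -> indexed


@dataclass
class ToyArchive:
    jobs: list = field(default_factory=list)       # stands in for the jobs table
    warcs: dict = field(default_factory=dict)      # url -> raw bytes, stands in for WARC files
    snapshots: list = field(default_factory=list)  # stands in for the snapshots table

    def create_job(self, url: str) -> Job:
        """Stage 1 (Crawl): the CLI creates a job."""
        job = Job(url)
        self.jobs.append(job)
        return job

    def run_worker(self, fetch) -> None:
        """Stages 1-2: the worker picks up jobs, crawls them, then indexes them."""
        for job in self.jobs:
            if job.status == "queued":
                self.warcs[job.url] = fetch(job.url)  # crawler writes the capture
                job.status = "crawled"
            if job.status == "crawled":
                text = self.warcs[job.url].decode("utf-8", errors="replace")
                self.snapshots.append({"url": job.url, "text": text})
                job.status = "indexed"

    def search(self, query: str) -> list:
        """Stage 3 (Serve): the API queries snapshots."""
        q = query.lower()
        return [s for s in self.snapshots if q in s["text"].lower()]

    def export_metadata(self) -> list:
        """Stage 4 (Export): metadata only, never the page text."""
        return [{"url": s["url"], "length": len(s["text"])} for s in self.snapshots]


if __name__ == "__main__":
    archive = ToyArchive()
    archive.create_job("https://example.org/vaccines")
    archive.run_worker(lambda url: b"Vaccines information page")  # fake crawler
    print(archive.search("vaccines"))
    print(archive.export_metadata())
```

The point of the sketch is the hand-off between stages: each one reads only what the previous stage wrote, which is why the dataset export can be metadata-only.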


📖 Documentation Structure

This documentation follows the Diátaxis framework for clarity:

| Type | Purpose | Where |
| --- | --- | --- |
| Tutorials | Learning-oriented, step-by-step | tutorials/ |
| How-To Guides | Task-oriented, problem-solving | operations/playbooks/, development/ |
| Reference | Information-oriented, lookup | api.md, architecture.md, reference/ |
| Explanation | Understanding-oriented, concepts | documentation-guidelines.md, decisions/, operations/ |

Navigation: Use the sidebar to explore by category


🆘 Getting Help

By Issue Type

| Issue | Where to Go |
| --- | --- |
| API questions | API Consumer Guide → API Docs |
| Deployment problems | Production Runbook → Playbooks |
| Code questions | Architecture Guide → GitHub Discussions |
| Bugs or feature requests | GitHub Issues |
| Operational incidents | Incident Response |


🔄 Documentation Updates

Found something wrong? The documentation lives in git, and pull requests are welcome.

  1. Backend/docs hub: Edit files in docs/
  2. Frontend docs: Edit files in frontend/docs/
  3. Datasets docs: Edit datasets README

Guidelines: Documentation Guidelines


📊 Project Status

| Metric | Value | Details |
| --- | --- | --- |
| Snapshots Archived | Check /api/stats | Live count |
| Sources | 2 (Health Canada, PHAC) | /api/sources |
| Crawl Frequency | Annual + ad-hoc | Ops Roadmap |
| API Status | Production | Health Check |
| Frontend Status | Production | healtharchive.ca |

Latest Incidents: See operations/incidents/


🎯 Next Steps

Based on your role, here's what to do next:

Operators

  1. βœ… Review Production Runbook
  2. βœ… Complete Monitoring Checklist
  3. βœ… Bookmark Incident Response

Developers

  1. βœ… Complete Quick Start
  2. βœ… Follow Your First Contribution
  3. βœ… Read Architecture Walkthrough

Researchers

  1. βœ… Read API Consumer Guide
  2. βœ… Try Interactive API Docs
  3. βœ… Explore Datasets

📚 Essential Documentation Index

Getting Started:

  - Quick Start
  - Project Overview (you are here)

For Operators:

  - Production Runbook
  - All Playbooks
  - Ops Cadence

For Developers:

  - First Contribution
  - Architecture Guide
  - Dev Setup

For Researchers:

  - API Guide
  - API Reference
  - Datasets

Reference:

  - Documentation Guidelines
  - Decision Records
  - Roadmaps


💡 About This Documentation

This documentation portal is currently built with MkDocs Material and deployed to docs.healtharchive.ca.

Source: docs/
Build: make docs-build
Serve Locally: make docs-serve

Last Updated: Auto-generated on every push to main