HealthArchive Documentation Hub
HealthArchive archives Canadian government health websites for research and accountability. The application code lives in a single monorepo, and dataset releases live in a separate repository. This page helps you navigate the documentation and the boundaries between the codebases.
Quick Start by Role
Choose your entry point based on what you want to do:
I'm an Operator
Goal: Deploy, monitor, and maintain the production system
Start Here:
1. Production Runbook - Complete production setup guide
2. Operator Responsibilities - Must-do checklist
3. Deploy & Verify - Safe deployment process
4. Incident Response - Emergency procedures
Key Resources:
- Ops Cadence Checklist - Daily/weekly/quarterly tasks
- Monitoring Checklist - Set up alerts and checks
- All Playbooks - 30+ operational procedures
I'm a Developer
Goal: Contribute code, fix bugs, add features
Start Here:
1. Quick Start Guide - Get running in 5 minutes
2. Your First Contribution - Step-by-step tutorial
3. Dev Environment Setup - Detailed local setup
4. Live Testing Guide - Run the full pipeline locally
Key Resources:
- Architecture Walkthrough - Visual guide to how it all works
- Architecture Deep Dive - Complete technical reference
- Testing Guidelines - How to write and run tests
- Contributing Guide - Code standards and workflow
I'm an API Consumer / Researcher
Goal: Search the archive and retrieve historical snapshots
Start Here:
1. API Consumer Guide - Complete API walkthrough with examples
2. Interactive API Docs - Try the API in your browser
3. API Reference - Full OpenAPI specification
Quick API Test:
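A quick test could look like the minimal Python sketch below. It assumes the API is reachable at a base URL you supply (`API_BASE` is a placeholder, not a confirmed hostname) and that `/api/stats`, the endpoint named in the Project Status table, returns a JSON object. The helper names `build_url`, `fetch_json`, and `quick_test` are illustrative only.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Placeholder (assumption): replace with the real API base URL.
API_BASE = "https://api.example.invalid"


def build_url(base: str, path: str) -> str:
    """Join a base URL and an API path such as '/api/stats'."""
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON response body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def quick_test(base: str = API_BASE):
    """Fetch archive statistics from /api/stats (endpoint named on this page)."""
    return fetch_json(build_url(base, "/api/stats"))
```

Calling `quick_test("<your API base URL>")` performs one GET request and returns the decoded response. The Interactive API Docs remain the authoritative place to check endpoint names and response shapes.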
Key Resources:
- Dataset Downloads: healtharchive-datasets - Bulk metadata exports
- Data Handling Policy - Retention and privacy
- Live Site: healtharchive.ca - Web interface
I'm a Student / New to the Project
Goal: Learn how HealthArchive works
Recommended Reading Order:
1. Quick Start - High-level overview
2. Architecture Walkthrough - Follow a page from crawl to search
3. Architecture Reference - Deep technical details
4. Documentation Guidelines - How docs stay organized
Tutorials:
- Your First Contribution - Hands-on coding tutorial
- Debugging a Failed Crawl - Practical troubleshooting
- Live Testing - Run it yourself locally
Monorepo + Dataset Architecture
HealthArchive keeps app code in one repository and dataset releases in a separate repository:
Backend (This Repo Root)
Purpose: API, crawler, database, operations, and all internal infrastructure
Location: github.com/jerdaw/healtharchive
Documentation: docs.healtharchive.ca (you are here)
What Lives Here:
- Crawler (archive_tool) and job orchestration
- Database models and indexing pipeline
- RESTful JSON API (FastAPI)
- Operations runbooks and playbooks
- Deployment guides and systemd units
- Architecture and developer docs
- Decision records and incident notes
Tech Stack: Python, FastAPI, SQLAlchemy, PostgreSQL, Docker, systemd
Frontend (frontend/)
Purpose: Public-facing website and user interface
Location: frontend/ inside the app monorepo
Live Site: healtharchive.ca
What Lives Here:
- Next.js web application (search UI, snapshot viewer)
- Public content (status page, impact statement, changelog)
- Internationalization (i18n) - English and French
- UI/UX documentation
Tech Stack: Next.js 16, React, TypeScript, Tailwind CSS
Canonical Frontend Docs: frontend/docs/
Datasets
Purpose: Versioned, citable metadata-only dataset releases
Location: github.com/jerdaw/healtharchive-datasets
What Lives Here:
- Snapshot metadata exports (JSON/CSV)
- Checksums and integrity manifests
- Dataset release documentation
- Dataset integrity policies
Why Separate?: Enables versioned, citable releases independent of code changes
Datasets Docs in This Repo: datasets-external/README.md (pointer only)
Canonical Datasets Docs: datasets/README.md
Where Things Live (Source of Truth Map)
| Content Type | Lives In | Link |
|---|---|---|
| Operations & Runbooks | Backend repo | docs.healtharchive.ca/operations |
| Architecture & Dev Guides | Backend repo | docs.healtharchive.ca/architecture |
| API Documentation | Backend repo | docs.healtharchive.ca/api |
| Public Changelog | Monorepo frontend docs | github.com/jerdaw/healtharchive/.../frontend/docs/changelog-process.md |
| Status Page | Monorepo frontend code | github.com/jerdaw/healtharchive/.../frontend/src/app/%5Blocale%5D/status/page.tsx |
| Impact Statement | Monorepo frontend code | github.com/jerdaw/healtharchive/.../frontend/src/app/%5Blocale%5D/impact/page.tsx |
| Dataset Releases | Datasets repo | github.com/jerdaw/healtharchive-datasets |
| I18n Guidelines | Monorepo frontend docs | github.com/jerdaw/healtharchive/.../frontend/docs/i18n.md |
Principle: Each doc has one canonical source. Other repos link to it.
Linking Conventions
In GitHub Issues/PRs
Use full GitHub URLs:
See the [production runbook](https://github.com/jerdaw/healtharchive/blob/main/docs/deployment/production-single-vps.md)
In Documentation
For docs users: Use the docs site URLs:
For frontend paths in the monorepo or cross-repo references: Use full GitHub URLs:
Frontend changelog process: [changelog-process.md](https://github.com/jerdaw/healtharchive/blob/main/frontend/docs/changelog-process.md)
Local Development (Monorepo Workspace)
Recommended local layout:
Some docs use workspace paths like frontend/... or healtharchive-datasets/... for convenience.
Note: These paths only work in local workspaces, not on GitHub.
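Because workspace paths do not resolve on GitHub, it can help to convert them to full URLs before pasting them into issues or cross-repo docs. The sketch below is one possible helper, not part of the project tooling. It assumes that paths starting with `healtharchive-datasets/` belong to the datasets repo, that every other path is relative to the app monorepo root, and that both repos use `main` as the default branch (this page only shows `main` for the monorepo). The function name is hypothetical.

```python
from urllib.parse import quote

# Repository URLs as given on this page.
APP_REPO = "https://github.com/jerdaw/healtharchive"
DATASETS_REPO = "https://github.com/jerdaw/healtharchive-datasets"

# Assumption: both repositories use "main" as the default branch.
DEFAULT_BRANCH = "main"

DATASETS_PREFIX = "healtharchive-datasets/"


def workspace_path_to_github_url(path: str, branch: str = DEFAULT_BRANCH) -> str:
    """Map a local workspace path to a full GitHub blob URL."""
    clean = path.strip()
    if clean.startswith("./"):
        clean = clean[2:]
    if clean.startswith(DATASETS_PREFIX):
        repo, relative = DATASETS_REPO, clean[len(DATASETS_PREFIX):]
    else:
        repo, relative = APP_REPO, clean
    # Percent-encode characters such as "[" and "]" while keeping "/" separators.
    return f"{repo}/blob/{branch}/{quote(relative)}"
```

For example, `workspace_path_to_github_url("frontend/docs/changelog-process.md")` yields the same URL used in the changelog link above, and a path containing `[locale]` comes out with the `%5Blocale%5D` encoding seen in the Source of Truth Map.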
System Architecture (High Level)
graph TB
subgraph "HealthArchive Backend"
CLI[CLI Commands] -->|Create| Jobs[(Database)]
Worker[Worker Process] -->|Poll| Jobs
Worker -->|Execute| Crawler[Archive Tool]
Crawler -->|Docker| Zimit[Zimit Crawler]
Zimit -->|Write| WARC[WARC Files]
Worker -->|Index| WARC
WARC -->|Extract| Snapshots[(Snapshots)]
API[FastAPI API] -->|Query| Snapshots
end
subgraph "HealthArchive Frontend"
UI[Next.js UI] -->|API Calls| API
API -->|JSON| UI
UI -->|Display| Users[End Users]
end
subgraph "HealthArchive Datasets"
Export[Export Script] -->|Read| Snapshots
Export -->|Write| Datasets[Dataset Files]
Researchers -->|Download| Datasets
end
Data Flow:
1. Crawl: CLI creates job → Worker runs crawler → Docker writes WARCs
2. Index: Worker parses WARCs → Extracts text → Stores snapshots in DB
3. Serve: API queries DB → Returns JSON → Frontend displays results
4. Export: Scripts export metadata → Version as datasets → Researchers download
See: Architecture Walkthrough for detailed data flow
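The four stages in the data flow can be pictured with a toy, in-memory model. This is only an illustrative sketch: the real system uses a database, Docker, and WARC files, and every class and function name below (`Job`, `Archive`, `create_job`, `run_worker`, `search`, `export_metadata`) is hypothetical rather than taken from the codebase.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Job:
    source: str
    status: str = "queued"  # queued -> crawled -> indexed


@dataclass
class Archive:
    """In-memory stand-in for the jobs table, WARC storage, and snapshots."""
    jobs: list = field(default_factory=list)
    warcs: dict = field(default_factory=dict)      # source -> list of raw pages
    snapshots: list = field(default_factory=list)  # indexed records


def create_job(archive: Archive, source: str) -> Job:
    """Crawl, step 1: the CLI records a queued job."""
    job = Job(source)
    archive.jobs.append(job)
    return job


def run_worker(archive: Archive, crawl: Callable[[str], list]) -> None:
    """Worker: run the crawler for queued jobs, then index the output."""
    for job in archive.jobs:
        if job.status != "queued":
            continue
        archive.warcs[job.source] = crawl(job.source)  # crawler output (WARC stand-in)
        job.status = "crawled"
        for page in archive.warcs[job.source]:         # index: extract text
            archive.snapshots.append({"source": job.source, "text": page.strip()})
        job.status = "indexed"


def search(archive: Archive, query: str) -> list:
    """Serve: query snapshots and return JSON-like dicts."""
    return [s for s in archive.snapshots if query.lower() in s["text"].lower()]


def export_metadata(archive: Archive) -> list:
    """Export: metadata only, without the page text."""
    return [{"source": s["source"], "length": len(s["text"])} for s in archive.snapshots]
```

A run might create a job with `create_job(archive, "demo-source")`, pass a stub crawler such as `lambda source: ["Vaccine schedule"]` to `run_worker`, and then call `search(archive, "vaccine")`. Keeping export separate from search mirrors the diagram, where the export script reads snapshots independently of the API.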
Documentation Structure
This documentation follows the Diátaxis framework for clarity:
| Type | Purpose | Where |
|---|---|---|
| Tutorials | Learning-oriented, step-by-step | tutorials/ |
| How-To Guides | Task-oriented, problem-solving | operations/playbooks/, development/ |
| Reference | Information-oriented, lookup | api.md, architecture.md, reference/ |
| Explanation | Understanding-oriented, concepts | documentation-guidelines.md, decisions/, operations/ |
Navigation: Use the sidebar to explore by category
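The table lists `operations/playbooks/` under How-To Guides and `operations/` under Explanation, so a path can match more than one row. One way to resolve that, sketched below, is to let the longest matching prefix win. The rule and the `classify_doc` helper are assumptions for illustration, not documented project behavior.

```python
from typing import Optional

# Path prefixes from the table above, mapped to Diátaxis types.
DOC_TYPES = {
    "tutorials/": "Tutorials",
    "operations/playbooks/": "How-To Guides",
    "development/": "How-To Guides",
    "api.md": "Reference",
    "architecture.md": "Reference",
    "reference/": "Reference",
    "documentation-guidelines.md": "Explanation",
    "decisions/": "Explanation",
    "operations/": "Explanation",
}


def classify_doc(path: str) -> Optional[str]:
    """Return the Diátaxis type for a docs-relative path, or None if unmapped."""
    matches = [prefix for prefix in DOC_TYPES if path.startswith(prefix)]
    if not matches:
        return None
    # Longest prefix wins, so operations/playbooks/ beats operations/.
    return DOC_TYPES[max(matches, key=len)]
```

Under this rule a file in `operations/playbooks/` is classified as a How-To Guide, while other files under `operations/` fall back to Explanation.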
Getting Help
By Issue Type
| Issue | Where to Go |
|---|---|
| API questions | API Consumer Guide → API Docs |
| Deployment problems | Production Runbook → Playbooks |
| Code questions | Architecture Guide → GitHub Discussions |
| Bugs or feature requests | GitHub Issues |
| Operational incidents | Incident Response |
Community
- GitHub Discussions: app monorepo
- Issues: app monorepo | datasets
- Contributor Guide: contributing.md
Documentation Updates
Found something wrong? Documentation lives in git and accepts pull requests!
- Backend/docs hub: Edit files in docs/
- Frontend docs: Edit files in frontend/docs/
- Datasets docs: Edit datasets README
Guidelines: Documentation Guidelines
Project Status
| Metric | Value | Details |
|---|---|---|
| Snapshots Archived | Check /api/stats | Live count |
| Sources | 2 (Health Canada, PHAC) | /api/sources |
| Crawl Frequency | Annual + ad-hoc | Ops Roadmap |
| API Status | Production | Health Check |
| Frontend Status | Production | healtharchive.ca |
Latest Incidents: See operations/incidents/
Next Steps
Based on your role, here's what to do next:
Operators
- Review Production Runbook
- Complete Monitoring Checklist
- Bookmark Incident Response
Developers
- Complete Quick Start
- Follow Your First Contribution
- Read Architecture Walkthrough
Researchers
- Read API Consumer Guide
- Try Interactive API Docs
- Explore Datasets
Essential Documentation Index
Getting Started:
- Quick Start
- Project Overview (you are here)
For Operators:
- Production Runbook
- All Playbooks
- Ops Cadence
For Developers:
- First Contribution
- Architecture Guide
- Dev Setup
For Researchers:
- API Guide
- API Reference
- Datasets
Reference:
- Documentation Guidelines
- Decision Records
- Roadmaps
About This Documentation
This documentation portal is currently built with MkDocs Material and deployed to docs.healtharchive.ca.
Source: docs/
Build: make docs-build
Serve Locally: make docs-serve
Last Updated: Auto-generated on every push to main