Data Model Reference

Quick reference for HealthArchive database models.

Full details: see the Architecture Guide.


Entity Relationship

```mermaid
erDiagram
    Source ||--o{ ArchiveJob : has
    Source ||--o{ Snapshot : has
    Source ||--o{ AnnualEdition : has
    AnnualEdition ||--o{ ArchiveJob : contains
    ArchiveJob ||--o{ Snapshot : produces

    Source {
        int id PK
        string code UK
        string name
        string base_url
        bool enabled
    }

    ArchiveJob {
        int id PK
        int source_id FK
        int edition_id FK
        string name
        string status
        string output_dir
        string shard_key
        string shard_kind
        string acceptance_state
        json config
        int warc_file_count
        int indexed_page_count
    }

    AnnualEdition {
        int id PK
        int source_id FK
        int year
        string status
        bool search_ready
        bool research_ready
        int intended_url_count
        int captured_url_count
    }

    Snapshot {
        int id PK
        int job_id FK
        int source_id FK
        string url
        string normalized_url_group
        datetime capture_timestamp
        string title
        string snippet
        string language
        string capture_backend
        string capture_fidelity
    }
```

Source

Represents a content origin (e.g., Health Canada, PHAC).

Table: sources

| Field | Type | Nullable | Description |
|---|---|---|---|
| id | Integer | No | Primary key |
| code | String(50) | No | Unique short code ("hc", "phac") |
| name | String(200) | No | Human-readable name |
| base_url | String(500) | Yes | Base URL of source |
| description | Text | Yes | Optional description |
| enabled | Boolean | No | Whether source is active (default: true) |
| created_at | DateTime | No | Creation timestamp |
| updated_at | DateTime | No | Last update timestamp |

Indexes:

- Unique on code

Relationships:

- jobs: One-to-many → ArchiveJob
- snapshots: One-to-many → Snapshot
- annual_editions: One-to-many → AnnualEdition
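
For orientation, here is a minimal sketch of how the sources table could map to a SQLAlchemy 2.0 declarative model. The authoritative definitions live in src/ha_backend/models.py; the column options and defaults below are inferred from the field table above and may differ in detail.

```python
# Minimal sketch of the Source mapping (SQLAlchemy 2.0 declarative style).
# Inferred from the field table above; the real model in
# src/ha_backend/models.py may differ in detail.
from datetime import datetime

from sqlalchemy import Boolean, DateTime, Integer, String, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Source(Base):
    __tablename__ = "sources"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    code: Mapped[str] = mapped_column(String(50), unique=True)  # "hc", "phac"
    name: Mapped[str] = mapped_column(String(200))
    base_url: Mapped[str | None] = mapped_column(String(500))
    description: Mapped[str | None] = mapped_column(Text)
    enabled: Mapped[bool] = mapped_column(Boolean, default=True)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow)
    updated_at: Mapped[datetime] = mapped_column(
        DateTime, default=datetime.utcnow, onupdate=datetime.utcnow
    )

    # The jobs, snapshots, and annual_editions relationships (to the models
    # described below) would be declared here with relationship().
```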


AnnualEdition

Represents one researcher-facing annual archive for {source, year}. An edition can be built from many shard jobs and may include legacy full-site salvage jobs.

Table: annual_editions

| Field | Type | Nullable | Description |
|---|---|---|---|
| id | Integer | No | Primary key |
| source_id | Integer | No | Foreign key → sources.id |
| year | Integer | No | Annual edition year |
| status | String(50) | No | Edition lifecycle (planned, in_progress, search_ready, research_ready, needs_review) |
| search_ready | Boolean | No | Search/indexing readiness flag |
| research_ready | Boolean | No | Coverage/provenance acceptance flag |
| intended_url_count | Integer | No | URLs in the current target ledger |
| captured_url_count | Integer | No | URLs represented by indexed snapshots |
| failed_url_count | Integer | No | URLs recorded as failed after retry budget |
| excluded_url_count | Integer | No | URLs deliberately excluded with rules/reasons |
| backend_counts | JSON | Yes | Capture backend mix summary |
| coverage_summary | JSON | Yes | Public-safe coverage summary |
| target_ledger_path | String(500) | Yes | Durable target-ledger.jsonl artifact |
| capture_manifest_path | String(500) | Yes | Durable capture-manifest.jsonl artifact |
| coverage_report_json_path | String(500) | Yes | Durable JSON coverage report |
| coverage_report_md_path | String(500) | Yes | Durable Markdown coverage report |
| created_at | DateTime | No | Creation timestamp |
| updated_at | DateTime | No | Last update timestamp |

Indexes:

- Unique on (source_id, year)
- Index on status
- Index on year

Relationships:

- source: Many-to-one → Source
- jobs: One-to-many → ArchiveJob
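
The counter columns make edition coverage checks cheap. Here is a sketch in the same style as the Common Queries section below; the source_id and year values are placeholders.

```python
# Illustrative coverage check for one edition. get_session() and the model
# are from the codebase; the source_id/year values are placeholders.
from ha_backend.db import get_session
from ha_backend.models import AnnualEdition

session = get_session()
edition = session.query(AnnualEdition).filter_by(source_id=1, year=2024).one()

accounted = (
    edition.captured_url_count
    + edition.failed_url_count
    + edition.excluded_url_count
)
coverage = edition.captured_url_count / max(edition.intended_url_count, 1)
print(
    f"{edition.year}: {coverage:.1%} captured, "
    f"{accounted}/{edition.intended_url_count} URLs accounted for"
)
```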


ArchiveJob

Represents a single crawl job execution.

Table: archive_jobs

| Field | Type | Nullable | Description |
|---|---|---|---|
| **Identity** |  |  |  |
| id | Integer | No | Primary key |
| source_id | Integer | Yes | Foreign key → sources.id |
| edition_id | Integer | Yes | Foreign key → annual_editions.id |
| name | String(200) | No | Job name (used in ZIM naming) |
| output_dir | String(500) | No | Absolute path to job directory |
| shard_key | String(200) | Yes | Deterministic shard key within an annual edition |
| shard_kind | String(50) | Yes | Shard type (path-language, legacy-full-site, etc.) |
| acceptance_state | String(50) | Yes | Review state (pending, needs_review, accepted, accepted_gap, excluded) |
| **Lifecycle** |  |  |  |
| status | String(50) | No | queued, running, completed, failed, indexing, indexed, etc. |
| queued_at | DateTime | Yes | When job was queued |
| started_at | DateTime | Yes | When crawl started |
| finished_at | DateTime | Yes | When crawl finished |
| retry_count | Integer | No | Number of retry attempts (default: 0) |
| **Configuration** |  |  |  |
| config | JSON | Yes | Job configuration (seeds, tool_options, zimit args) |
| **Crawl Metrics** |  |  |  |
| crawler_exit_code | Integer | Yes | Exit code from archive_tool process |
| crawler_status | String(50) | Yes | Summarized status ("success", "failed") |
| crawler_stage | String(50) | Yes | Last known stage |
| last_stats_json | JSON | Yes | Parsed crawl stats from logs |
| pages_crawled | Integer | Yes | Pages successfully crawled |
| pages_total | Integer | Yes | Total pages discovered |
| pages_failed | Integer | Yes | Pages that failed to crawl |
| **Indexing** |  |  |  |
| warc_file_count | Integer | No | Number of WARC files discovered (default: 0) |
| indexed_page_count | Integer | No | Number of snapshots created (default: 0) |
| **File Paths** |  |  |  |
| final_zim_path | String(500) | Yes | Path to ZIM file (if built) |
| combined_log_path | String(500) | Yes | Path to combined crawl log |
| state_file_path | String(500) | Yes | Path to .archive_state.json |
| coverage_report_path | String(500) | Yes | Shard/edition report artifact path |
| **Cleanup** |  |  |  |
| cleanup_status | String(50) | No | "none", "temp_cleaned" (default: "none") |
| cleaned_at | DateTime | Yes | When cleanup was performed |
| **Timestamps** |  |  |  |
| created_at | DateTime | No | Record creation |
| updated_at | DateTime | No | Last update |

Indexes:

- Index on source_id
- Index on edition_id
- Index on shard_key
- Index on acceptance_state
- Index on status
- Index on queued_at

Relationships:

- source: Many-to-one → Source
- edition: Many-to-one → AnnualEdition
- snapshots: One-to-many → Snapshot
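
The edition_id and acceptance_state indexes support review workflows such as listing the shards of an edition that still need a decision. A sketch (the edition_id value is a placeholder):

```python
# Sketch: shards of one edition that still need a review decision.
# The edition_id value is a placeholder.
from ha_backend.db import get_session
from ha_backend.models import ArchiveJob

session = get_session()
pending = (
    session.query(ArchiveJob)
    .filter(
        ArchiveJob.edition_id == 1,
        ArchiveJob.acceptance_state.in_(["pending", "needs_review"]),
    )
    .order_by(ArchiveJob.shard_key)
    .all()
)
for job in pending:
    print(job.shard_key, job.status, job.acceptance_state)
```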


Snapshot

Represents a single captured web page.

Table: snapshots

| Field | Type | Nullable | Description |
|---|---|---|---|
| **Identity** |  |  |  |
| id | Integer | No | Primary key |
| job_id | Integer | Yes | Foreign key → archive_jobs.id |
| source_id | Integer | Yes | Foreign key → sources.id |
| **URL & Grouping** |  |  |  |
| url | String(2000) | No | Full URL of captured page |
| normalized_url_group | String(2000) | Yes | Canonical URL for grouping |
| **Timing** |  |  |  |
| capture_timestamp | DateTime | No | When page was captured (from WARC) |
| **HTTP & Content** |  |  |  |
| mime_type | String(100) | Yes | MIME type (usually "text/html") |
| status_code | Integer | Yes | HTTP status code |
| title | String(500) | Yes | Extracted page title |
| snippet | Text | Yes | Short text preview |
| language | String(10) | Yes | ISO language code ("en", "fr") |
| capture_backend | String(50) | Yes | Backend that captured the page (browsertrix, playwright_warc, etc.) |
| capture_fidelity | String(50) | Yes | Capture fidelity label (high, fallback, unknown) |
| provenance_json | JSON | Yes | Structured provenance metadata for reports/exports |
| **Storage/Replay** |  |  |  |
| warc_path | String(500) | No | Path to WARC file |
| warc_record_id | String(200) | Yes | WARC record identifier |
| raw_snapshot_path | String(500) | Yes | Optional static HTML export path |
| content_hash | String(64) | Yes | Hash of HTML body (for deduplication) |
| **Timestamps** |  |  |  |
| created_at | DateTime | No | Record creation |
| updated_at | DateTime | No | Last update |

Indexes:

- Index on job_id
- Index on source_id
- Index on url
- Index on normalized_url_group
- Index on capture_timestamp
- Index on status_code
- Index on capture_backend
- Index on capture_fidelity

Relationships:

- job: Many-to-one → ArchiveJob
- source: Many-to-one → Source
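
content_hash enables duplicate detection across captures. Here is a sketch that finds snapshots within one job sharing the same body hash; the job_id value is a placeholder.

```python
# Sketch: content hashes shared by more than one snapshot in a job,
# i.e. candidate duplicates. The job_id value is a placeholder.
from sqlalchemy import func

from ha_backend.db import get_session
from ha_backend.models import Snapshot

session = get_session()
dupes = (
    session.query(Snapshot.content_hash, func.count(Snapshot.id))
    .filter(Snapshot.job_id == 1, Snapshot.content_hash.isnot(None))
    .group_by(Snapshot.content_hash)
    .having(func.count(Snapshot.id) > 1)
    .all()
)
for digest, n in dupes:
    print(f"{n} snapshots share body hash {digest}")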


Job Status Lifecycle

```mermaid
stateDiagram-v2
    [*] --> queued: create-job
    queued --> running: worker starts
    running --> completed: crawl succeeds
    running --> failed: crawl fails
    failed --> retryable: retry if count < MAX
    retryable --> running: worker retries
    completed --> indexing: index-job starts
    indexing --> indexed: indexing succeeds
    indexing --> index_failed: indexing fails
    indexed --> [*]
    failed --> [*]
    index_failed --> completed: retry-job (reindex)
```

Common status values:

- queued - Job created, waiting for worker
- running - Crawl in progress
- completed - Crawl succeeded
- failed - Crawl failed (terminal if retries exhausted)
- retryable - Failed but can retry
- indexing - WARC indexing in progress
- indexed - Fully indexed and ready to serve
- index_failed - Indexing failed
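
The lifecycle above can be summarized as an explicit transition map. The following is an illustrative sketch derived from the diagram, not the enforcement code the worker actually uses.

```python
# Sketch: the lifecycle above as an explicit transition map. Illustrative
# only; this is not the enforcement code the worker actually uses.
ALLOWED_TRANSITIONS = {
    "queued": {"running"},
    "running": {"completed", "failed"},
    "failed": {"retryable"},        # only while retry_count < MAX
    "retryable": {"running"},
    "completed": {"indexing"},
    "indexing": {"indexed", "index_failed"},
    "index_failed": {"completed"},  # retry-job resets for reindexing
    "indexed": set(),               # terminal
}


def check_transition(current: str, new: str) -> None:
    """Raise if a status change is not allowed by the lifecycle diagram."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal status transition: {current} -> {new}")
```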


Config JSON Schema

ArchiveJob.config structure:

```json
{
  "seeds": [
    "https://www.canada.ca/en/health-canada.html"
  ],
  "zimit_passthrough_args": [
    "--profile", "social-media"
  ],
  "tool_options": {
    "cleanup": false,
    "overwrite": false,
    "skip_final_build": false,
    "enable_monitoring": false,
    "enable_adaptive_workers": false,
    "enable_adaptive_restart": false,
    "enable_vpn_rotation": false,
    "initial_workers": 2,
    "log_level": "INFO",
    "relax_perms": true,
    "docker_shm_size": "1g",
    "monitor_interval_seconds": 30,
    "stall_timeout_minutes": 30,
    "error_threshold_timeout": 10,
    "error_threshold_http": 10,
    "min_workers": 1,
    "max_worker_reductions": 2,
    "vpn_connect_command": "vpn connect ca",
    "max_vpn_rotations": 3,
    "vpn_rotation_frequency_minutes": 60,
    "backoff_delay_minutes": 15
  }
}
```

See the Job Registry for per-source defaults.
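
Because config is a plain JSON column, a job row can be constructed with an inline dict. A sketch, assuming the fields above are constructor arguments; required arguments and defaults may differ from what the real model enforces, and all literal values here are placeholders.

```python
# Sketch: constructing a job row with an inline config dict. Constructor
# arguments may differ from what the real model enforces; all literal
# values here are placeholders.
from ha_backend.db import get_session
from ha_backend.models import ArchiveJob

session = get_session()
job = ArchiveJob(
    source_id=1,
    name="hc-2024-en-shard",
    output_dir="/data/jobs/hc-2024-en-shard",
    status="queued",
    config={
        "seeds": ["https://www.canada.ca/en/health-canada.html"],
        "tool_options": {"initial_workers": 2, "log_level": "INFO"},
    },
)
session.add(job)
session.commit()
```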


Database Configuration

Location: src/ha_backend/models.py

ORM: SQLAlchemy 2.0

Migrations: Alembic (in alembic/ directory)

Running Migrations

```bash
# Upgrade to latest
alembic upgrade head

# Downgrade one revision
alembic downgrade -1

# Show current revision
alembic current

# Show migration history
alembic history
```

Supported Databases

- SQLite (default for dev): sqlite:///healtharchive.db
- PostgreSQL (recommended for production): postgresql://user:pass@host/dbname

Environment variable: HEALTHARCHIVE_DATABASE_URL
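
For reference, a sketch of how the URL could be resolved in Python. The SQLite fallback mirrors the dev default above; the actual resolution logic in ha_backend.db is not shown here and may differ.

```python
# Sketch: resolving the database URL from the environment. The fallback
# matches the dev default above; ha_backend.db's actual logic may differ.
import os

from sqlalchemy import create_engine

db_url = os.environ.get("HEALTHARCHIVE_DATABASE_URL", "sqlite:///healtharchive.db")
engine = create_engine(db_url)
```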


Common Queries

Find jobs by status

```python
from ha_backend.db import get_session
from ha_backend.models import ArchiveJob

session = get_session()
jobs = session.query(ArchiveJob).filter_by(status="queued").all()
```

Get source with all jobs

```python
from ha_backend.db import get_session
from ha_backend.models import Source

session = get_session()
source = session.query(Source).filter_by(code="hc").one()
print(f"{source.name}: {len(source.jobs)} jobs")
```

Find snapshots by URL

```python
from ha_backend.db import get_session
from ha_backend.models import Snapshot

session = get_session()
snapshots = session.query(Snapshot).filter(
    Snapshot.url.like("%health-canada%")
).limit(10).all()
```
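
Get latest capture per URL group

To pick one representative capture per page, snapshots can be grouped on normalized_url_group and reduced to the most recent capture_timestamp. A sketch in the same style; the source_id value is a placeholder.

```python
# Sketch: most recent capture per normalized URL group for one source.
# The source_id value is a placeholder.
from sqlalchemy import func

from ha_backend.db import get_session
from ha_backend.models import Snapshot

session = get_session()
latest = (
    session.query(
        Snapshot.normalized_url_group,
        func.max(Snapshot.capture_timestamp).label("latest_capture"),
    )
    .filter(Snapshot.source_id == 1)
    .group_by(Snapshot.normalized_url_group)
    .all()
)
```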