Data Model Reference
Quick reference for HealthArchive database models.
Full details: see the Architecture Guide (architecture.md).
Entity Relationship
```mermaid
erDiagram
    Source ||--o{ ArchiveJob : has
    Source ||--o{ Snapshot : has
    Source ||--o{ AnnualEdition : has
    AnnualEdition ||--o{ ArchiveJob : contains
    ArchiveJob ||--o{ Snapshot : produces

    Source {
        int id PK
        string code UK
        string name
        string base_url
        bool enabled
    }

    ArchiveJob {
        int id PK
        int source_id FK
        int edition_id FK
        string name
        string status
        string output_dir
        string shard_key
        string shard_kind
        string acceptance_state
        json config
        int warc_file_count
        int indexed_page_count
    }

    AnnualEdition {
        int id PK
        int source_id FK
        int year
        string status
        bool search_ready
        bool research_ready
        int intended_url_count
        int captured_url_count
    }

    Snapshot {
        int id PK
        int job_id FK
        int source_id FK
        string url
        string normalized_url_group
        datetime capture_timestamp
        string title
        string snippet
        string language
        string capture_backend
        string capture_fidelity
    }
```

Source
Represents a content origin (e.g., Health Canada, PHAC).
Table: sources
| Field | Type | Nullable | Description |
|---|---|---|---|
| id | Integer | No | Primary key |
| code | String(50) | No | Unique short code ("hc", "phac") |
| name | String(200) | No | Human-readable name |
| base_url | String(500) | Yes | Base URL of the source |
| description | Text | Yes | Optional description |
| enabled | Boolean | No | Whether the source is active (default: true) |
| created_at | DateTime | No | Creation timestamp |
| updated_at | DateTime | No | Last update timestamp |
Indexes:
- Unique on code

Relationships:
- jobs: One-to-many → ArchiveJob
- snapshots: One-to-many → Snapshot
- annual_editions: One-to-many → AnnualEdition
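As a rough illustration of how these columns map onto the ORM, here is a minimal sketch of the Source model in SQLAlchemy 2.0 declarative style. The real definition lives in src/ha_backend/models.py and may differ in naming and options; treat this as an assumption, not the actual code.

```python
# Illustrative sketch only -- the real model lives in src/ha_backend/models.py.
from datetime import datetime

from sqlalchemy import String, Text, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Source(Base):
    __tablename__ = "sources"

    id: Mapped[int] = mapped_column(primary_key=True)
    code: Mapped[str] = mapped_column(String(50), unique=True)  # e.g. "hc", "phac"
    name: Mapped[str] = mapped_column(String(200))
    base_url: Mapped[str | None] = mapped_column(String(500))
    description: Mapped[str | None] = mapped_column(Text)
    enabled: Mapped[bool] = mapped_column(default=True)
    created_at: Mapped[datetime] = mapped_column(server_default=func.now())
    updated_at: Mapped[datetime] = mapped_column(
        server_default=func.now(), onupdate=func.now()
    )

    # The jobs / snapshots / annual_editions relationships listed above are
    # declared alongside the other models in the same module.
```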
AnnualEdition
Represents one researcher-facing annual archive for {source, year}. An edition can be built from many shard jobs and may include legacy full-site salvage jobs.
Table: annual_editions
| Field | Type | Nullable | Description |
|---|---|---|---|
| id | Integer | No | Primary key |
| source_id | Integer | No | Foreign key → sources.id |
| year | Integer | No | Annual edition year |
| status | String(50) | No | Edition lifecycle (planned, in_progress, search_ready, research_ready, needs_review) |
| search_ready | Boolean | No | Search/indexing readiness flag |
| research_ready | Boolean | No | Coverage/provenance acceptance flag |
| intended_url_count | Integer | No | URLs in the current target ledger |
| captured_url_count | Integer | No | URLs represented by indexed snapshots |
| failed_url_count | Integer | No | URLs recorded as failed after the retry budget |
| excluded_url_count | Integer | No | URLs deliberately excluded, with rules/reasons |
| backend_counts | JSON | Yes | Summary of the capture-backend mix |
| coverage_summary | JSON | Yes | Public-safe coverage summary |
| target_ledger_path | String(500) | Yes | Durable target-ledger.jsonl artifact |
| capture_manifest_path | String(500) | Yes | Durable capture-manifest.jsonl artifact |
| coverage_report_json_path | String(500) | Yes | Durable JSON coverage report |
| coverage_report_md_path | String(500) | Yes | Durable Markdown coverage report |
| created_at | DateTime | No | Creation timestamp |
| updated_at | DateTime | No | Last update timestamp |
Indexes:
- Unique on source_id, year
- Index on status
- Index on year

Relationships:
- source: Many-to-one → Source
- jobs: One-to-many → ArchiveJob
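For example, a readiness check for one edition might look like the sketch below. It assumes the query style used in the Common Queries section and the column names from the table above; the year is a hypothetical value.

```python
# Sketch: look up one edition and summarize its coverage counters.
from ha_backend.models import AnnualEdition, Source
from ha_backend.db import get_session

session = get_session()
edition = (
    session.query(AnnualEdition)
    .join(Source, AnnualEdition.source_id == Source.id)
    .filter(Source.code == "hc", AnnualEdition.year == 2024)  # hypothetical year
    .one_or_none()
)
if edition is not None:
    print(
        f"{edition.year}: {edition.captured_url_count}/{edition.intended_url_count} "
        f"URLs captured, status={edition.status}, research_ready={edition.research_ready}"
    )
```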
ArchiveJob
Represents a single crawl job execution.
Table: archive_jobs
| Field | Type | Nullable | Description |
|---|---|---|---|
| Identity | |||
id | Integer | No | Primary key |
source_id | Integer | Yes | Foreign key → sources.id |
edition_id | Integer | Yes | Foreign key → annual_editions.id |
name | String(200) | No | Job name (used in ZIM naming) |
output_dir | String(500) | No | Absolute path to job directory |
shard_key | String(200) | Yes | Deterministic shard key within an annual edition |
shard_kind | String(50) | Yes | Shard type (path-language, legacy-full-site, etc.) |
acceptance_state | String(50) | Yes | Review state (pending, needs_review, accepted, accepted_gap, excluded) |
| Lifecycle | |||
status | String(50) | No | queued, running, completed, failed, indexing, indexed, etc. |
queued_at | DateTime | Yes | When job was queued |
started_at | DateTime | Yes | When crawl started |
finished_at | DateTime | Yes | When crawl finished |
retry_count | Integer | No | Number of retry attempts (default: 0) |
| Configuration | |||
config | JSON | Yes | Job configuration (seeds, tool_options, zimit args) |
| Crawl Metrics | |||
crawler_exit_code | Integer | Yes | Exit code from archive_tool process |
crawler_status | String(50) | Yes | Summarized status ("success", "failed") |
crawler_stage | String(50) | Yes | Last known stage |
last_stats_json | JSON | Yes | Parsed crawl stats from logs |
pages_crawled | Integer | Yes | Pages successfully crawled |
pages_total | Integer | Yes | Total pages discovered |
pages_failed | Integer | Yes | Pages that failed to crawl |
| Indexing | |||
warc_file_count | Integer | No | Number of WARC files discovered (default: 0) |
indexed_page_count | Integer | No | Number of snapshots created (default: 0) |
| File Paths | |||
final_zim_path | String(500) | Yes | Path to ZIM file (if built) |
combined_log_path | String(500) | Yes | Path to combined crawl log |
state_file_path | String(500) | Yes | Path to .archive_state.json |
coverage_report_path | String(500) | Yes | Shard/edition report artifact path |
| Cleanup | |||
cleanup_status | String(50) | No | "none", "temp_cleaned" (default: "none") |
cleaned_at | DateTime | Yes | When cleanup was performed |
| Timestamps | |||
created_at | DateTime | No | Record creation |
updated_at | DateTime | No | Last update |
Indexes:
- Index on source_id
- Index on edition_id
- Index on shard_key
- Index on acceptance_state
- Index on status
- Index on queued_at

Relationships:
- source: Many-to-one → Source
- edition: Many-to-one → AnnualEdition
- snapshots: One-to-many → Snapshot
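A typical review query over these columns might look like the sketch below; the edition id is hypothetical, the acceptance_state value comes from the table above, and the session helper is the same one used in the Common Queries section.

```python
# Sketch: list shards of an edition that still need manual review.
from ha_backend.models import ArchiveJob
from ha_backend.db import get_session

session = get_session()
pending = (
    session.query(ArchiveJob)
    .filter(
        ArchiveJob.edition_id == 42,                 # hypothetical edition id
        ArchiveJob.acceptance_state == "needs_review",
    )
    .order_by(ArchiveJob.shard_key)
    .all()
)
for job in pending:
    print(job.shard_key, job.status, job.indexed_page_count)
```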
Snapshot
Represents a single captured web page.
Table: snapshots
| Field | Type | Nullable | Description |
|---|---|---|---|
| Identity | |||
id | Integer | No | Primary key |
job_id | Integer | Yes | Foreign key → archive_jobs.id |
source_id | Integer | Yes | Foreign key → sources.id |
| URL & Grouping | |||
url | String(2000) | No | Full URL of captured page |
normalized_url_group | String(2000) | Yes | Canonical URL for grouping |
| Timing | |||
capture_timestamp | DateTime | No | When page was captured (from WARC) |
| HTTP & Content | |||
mime_type | String(100) | Yes | MIME type (usually "text/html") |
status_code | Integer | Yes | HTTP status code |
title | String(500) | Yes | Extracted page title |
snippet | Text | Yes | Short text preview |
language | String(10) | Yes | ISO language code ("en", "fr") |
capture_backend | String(50) | Yes | Backend that captured the page (browsertrix, playwright_warc, etc.) |
capture_fidelity | String(50) | Yes | Capture fidelity label (high, fallback, unknown) |
provenance_json | JSON | Yes | Structured provenance metadata for reports/exports |
| Storage/Replay | |||
warc_path | String(500) | No | Path to WARC file |
warc_record_id | String(200) | Yes | WARC record identifier |
raw_snapshot_path | String(500) | Yes | Optional static HTML export path |
content_hash | String(64) | Yes | Hash of HTML body (for deduplication) |
| Timestamps | |||
created_at | DateTime | No | Record creation |
updated_at | DateTime | No | Last update |
Indexes:
- Index on job_id
- Index on source_id
- Index on url
- Index on normalized_url_group
- Index on capture_timestamp
- Index on status_code
- Index on capture_backend
- Index on capture_fidelity

Relationships:
- job: Many-to-one → ArchiveJob
- source: Many-to-one → Source
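Because normalized_url_group exists specifically to group captures of the same page, a coverage-style aggregation can be written as in the sketch below; the job id is hypothetical and the column names follow the table above.

```python
# Sketch: number of captures and latest capture time per URL group in one job.
from sqlalchemy import func

from ha_backend.models import Snapshot
from ha_backend.db import get_session

session = get_session()
groups = (
    session.query(
        Snapshot.normalized_url_group,
        func.count(Snapshot.id).label("captures"),
        func.max(Snapshot.capture_timestamp).label("last_captured"),
    )
    .filter(Snapshot.job_id == 7)                    # hypothetical job id
    .group_by(Snapshot.normalized_url_group)
    .all()
)
for group, captures, last_captured in groups:
    print(group, captures, last_captured)
```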
Job Status Lifecycle
```mermaid
stateDiagram-v2
    [*] --> queued: create-job
    queued --> running: worker starts
    running --> completed: crawl succeeds
    running --> failed: crawl fails
    failed --> retryable: retry if count < MAX
    retryable --> running: worker retries
    completed --> indexing: index-job starts
    indexing --> indexed: indexing succeeds
    indexing --> index_failed: indexing fails
    indexed --> [*]
    failed --> [*]
    index_failed --> completed: retry-job (reindex)
```

Common status values:
- queued - Job created, waiting for a worker
- running - Crawl in progress
- completed - Crawl succeeded
- failed - Crawl failed (terminal if retries are exhausted)
- retryable - Failed but can be retried
- indexing - WARC indexing in progress
- indexed - Fully indexed and ready to serve
- index_failed - Indexing failed
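If scripts need to branch on these statuses, a small helper like the following sketch keeps the grouping in one place. The status strings are the ones listed above; the terminal/attention split is an illustration, not part of the backend API.

```python
# Sketch: convenience predicates over the documented job statuses.
TERMINAL_STATUSES = {"indexed", "failed"}


def is_terminal(status: str) -> bool:
    """True when a job will not progress further without manual intervention."""
    return status in TERMINAL_STATUSES


def needs_attention(status: str) -> bool:
    """True for states that usually call for an operator action."""
    return status in {"failed", "index_failed", "retryable"}
```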
Config JSON Schema
ArchiveJob.config structure:
```json
{
  "seeds": [
    "https://www.canada.ca/en/health-canada.html"
  ],
  "zimit_passthrough_args": [
    "--profile", "social-media"
  ],
  "tool_options": {
    "cleanup": false,
    "overwrite": false,
    "skip_final_build": false,
    "enable_monitoring": false,
    "enable_adaptive_workers": false,
    "enable_adaptive_restart": false,
    "enable_vpn_rotation": false,
    "initial_workers": 2,
    "log_level": "INFO",
    "relax_perms": true,
    "docker_shm_size": "1g",
    "monitor_interval_seconds": 30,
    "stall_timeout_minutes": 30,
    "error_threshold_timeout": 10,
    "error_threshold_http": 10,
    "min_workers": 1,
    "max_worker_reductions": 2,
    "vpn_connect_command": "vpn connect ca",
    "max_vpn_rotations": 3,
    "vpn_rotation_frequency_minutes": 60,
    "backoff_delay_minutes": 15
  }
}
```
See the Job Registry for per-source defaults.
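As a sketch of how this payload is used from Python, the snippet below builds a minimal config dict with the same keys and attaches it to a queued job. The values are illustrative, the Job Registry supplies the defaults in practice, and the exact write path may differ from what is shown here.

```python
# Sketch: assign a minimal config payload to a queued job.
from ha_backend.models import ArchiveJob
from ha_backend.db import get_session

config = {
    "seeds": ["https://www.canada.ca/en/health-canada.html"],
    "zimit_passthrough_args": ["--profile", "social-media"],
    "tool_options": {"initial_workers": 2, "log_level": "INFO"},
}

session = get_session()
job = session.query(ArchiveJob).filter_by(status="queued").first()
if job is not None:
    job.config = config
    session.commit()
```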
Database Configuration
- Location: src/ha_backend/models.py
- ORM: SQLAlchemy 2.0
- Migrations: Alembic (in the alembic/ directory)
Running Migrations
```bash
# Upgrade to latest
alembic upgrade head

# Downgrade one revision
alembic downgrade -1

# Show current revision
alembic current

# Show migration history
alembic history
```
Supported Databases
- SQLite (default for dev): sqlite:///healtharchive.db
- PostgreSQL (recommended for production): postgresql://user:pass@host/dbname
Environment variable: HEALTHARCHIVE_DATABASE_URL
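The variable is presumably resolved inside ha_backend.db; the sketch below only illustrates the idea of falling back to the SQLite default above when it is unset, and is not the backend's actual startup code.

```python
# Sketch: resolve the database URL from the environment with a SQLite fallback.
import os

from sqlalchemy import create_engine

database_url = os.environ.get("HEALTHARCHIVE_DATABASE_URL", "sqlite:///healtharchive.db")
engine = create_engine(database_url)
```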
Common Queries
Find jobs by status
```python
from ha_backend.models import ArchiveJob
from ha_backend.db import get_session

session = get_session()
jobs = session.query(ArchiveJob).filter_by(status="queued").all()
```
Get source with all jobs
```python
from ha_backend.models import Source
from ha_backend.db import get_session

session = get_session()
source = session.query(Source).filter_by(code="hc").one()
print(f"{source.name}: {len(source.jobs)} jobs")
```
Find snapshots by URL
```python
from ha_backend.models import Snapshot
from ha_backend.db import get_session

session = get_session()
snapshots = session.query(Snapshot).filter(
    Snapshot.url.like("%health-canada%")
).limit(10).all()
```
Related Documentation
- Full Architecture Guide: architecture.md
- Job Creation: architecture.md#4-job-registry--creation
- Indexing Pipeline: architecture.md#6-indexing-pipeline
- CLI Commands: cli-commands.md