HealthArchive Backend – Architecture & Implementation Guide
This document is an in‑depth walkthrough of the HealthArchive.ca backend (healtharchive repo). It covers:
- How the backend is structured.
- How it integrates with the `archive_tool` crawler subpackage.
- The data model and job lifecycle.
- The indexing pipeline (WARCs → snapshots).
- HTTP APIs (public + admin) and metrics.
- Worker loop, retries, and cleanup/retention (future).
For archive_tool internals (log parsing, Docker orchestration, run modes), see src/archive_tool/docs/documentation.md. For a shorter, task‑oriented overview of common commands and local testing flows, see development/live-testing.md. For deployment‑oriented configuration (staging/prod env vars, DNS, and the historical preview path), see deployment/hosting-and-live-server-to-dos.md. For the implemented VPS deployment runbook, see deployment/production-single-vps.md.
1. High‑level architecture
1.1 Components
- `archive_tool` (internal subpackage under `src/archive_tool/`):
  - CLI wrapper around `zimit` + Docker.
  - Manages temporary output dirs, WARCs, and the final ZIM build.
  - Tracks persistent state in `.archive_state.json` + `.tmp*` directories.
  - Implements stall/error detection, adaptive worker reductions, and VPN rotation (when enabled).
- Backend package (`src/ha_backend/`):
  - Orchestrates crawl jobs using `archive_tool` as a subprocess.
  - Groups annual jobs into `AnnualEdition` records for `{source, year}`.
  - Stores job and snapshot metadata in a relational database via SQLAlchemy.
  - Indexes WARCs into `Snapshot` rows.
  - Exposes HTTP APIs via FastAPI.
  - Provides a worker loop to process queued jobs.
  - Offers CLI commands for admins (job creation, status, retry, cleanup).
- External dependencies:
  - Docker & the `ghcr.io/openzim/zimit` image.
  - Database (SQLite by default; Postgres recommended in production).
  - Optional VPN client/command for rotation (e.g., `nordvpn`).
1.2 Data flow overview
- Job creation:
  - Admin runs `healtharchive create-job --source hc`.
  - Backend:
    - Ensures a `Source` row exists.
    - Uses `SourceJobConfig` to build seeds, tool options, and `output_dir`.
    - Inserts an `ArchiveJob` with `status="queued"`.
- Crawl (`archive_tool`):
  - Worker or CLI runs `run_persistent_job(job_id)`:
    - Builds `archive_tool` CLI args from `ArchiveJob.config` and `output_dir`.
    - Runs `archive_tool` as a subprocess (no in‑process calls).
    - Marks the job `running` → `completed` or `failed`, recording `crawler_exit_code` and `crawler_status`.
    - Treats annual search readiness as WARC-first: if a Browsertrix/Zimit run reaches a WARC-complete crawl state but optional ZIM finalization fails, the backend can accept the job for WARC indexing with crawler stage `warc_complete_finalization_failed`, provided the final crawlStatus has `pending=0` and discoverable indexable WARCs exist.
  - `archive_tool`:
    - Validates Docker.
    - Determines run mode (Fresh / Resume / New‑with‑Consolidation / Overwrite).
    - Spawns `docker run ghcr.io/openzim/zimit zimit ...`.
    - Tracks temp dirs and state, discovers WARCs, and optionally runs a final ZIM build (depending on its configuration).
- Indexing (WARCs → `Snapshot`):
  - Worker calls `index_job(job_id)` when the crawl succeeds, and also reconciles `completed` jobs that were started outside the worker.
  - Backend:
    - Consolidates readable temp WARCs into stable storage where possible.
    - Uses union WARC discovery across stable, temp, and fallback outputs.
    - Streams WARC records, extracts HTML, text, language, etc.
    - Writes `Snapshot` rows for each captured page.
    - Marks the job `indexed` with `indexed_page_count`.
  - ZIM output is optional for the backend search/replay pipeline; WARCs are the durable source of truth for `Snapshot` rows and raw/replay lookups.
- Annual coverage reporting:
  - Annual edition services attach legacy/full-site jobs as salvage shards or create deterministic shard jobs from configured source seeds.
  - Coverage reports write durable JSON/Markdown artifacts next to crawl outputs and summarize intended, captured, excluded, and review-needed URLs.
  - Public APIs expose the researcher-safe summary. Admin APIs expose shard diagnostics and acceptance state.
- Change tracking (`Snapshot` → change events):
  - A background task (`healtharchive compute-changes`) precomputes change events between adjacent captures of the same `normalized_url_group`.
  - Outputs `SnapshotChange` rows with:
    - provenance (from/to snapshot IDs, timestamps),
    - summary stats (sections/lines changed),
    - and a renderable diff artifact when available.
  - This work is intentionally off the request path to keep APIs fast.
- Serving:
  - FastAPI app:
    - `GET /api/search` queries `Snapshot` for search results.
    - `GET /api/stats` provides lightweight public archive totals for frontend metrics.
    - `GET /api/sources` summarises captures per `Source`.
    - `GET /api/snapshot/{id}` returns metadata for a single snapshot.
    - `GET /api/snapshots/raw/{id}` replays archived HTML from a WARC.
    - `GET /api/changes` and `GET /api/changes/compare` expose change feeds and diffs.
    - `GET /api/snapshots/{id}/timeline` returns a capture timeline for a page group.
- Admin & cleanup:
  - Admin API:
    - `GET /api/admin/jobs/{id}` for job status and config.
    - `GET /metrics` for Prometheus‑style metrics.
  - CLI:
    - `healtharchive retry-job` to reattempt failed jobs.
    - `healtharchive cleanup-job` to delete temp dirs/state for indexed jobs, updating `cleanup_status`.
2. Configuration & environment
2.1 Config module (ha_backend/config.py)
Key roles:
- Locate the archive root (the `--output-dir` base) and the `archive_tool` command.
- Read the database URL.
Admin‑related configuration is handled separately in ha_backend/api/deps.py, which reads HEALTHARCHIVE_ADMIN_TOKEN from the environment. When this token is unset, admin and metrics endpoints are effectively open and should only be used in local development. In staging and production you should always set HEALTHARCHIVE_ADMIN_TOKEN to a long, random value and treat it as a secret.
ArchiveToolConfig
@dataclass
class ArchiveToolConfig:
archive_root: Path = DEFAULT_ARCHIVE_ROOT
archive_tool_cmd: str = DEFAULT_ARCHIVE_TOOL_CMD
def ensure_archive_root(self) -> None:
self.archive_root.mkdir(parents=True, exist_ok=True)
Defaults:
- `DEFAULT_ARCHIVE_ROOT = /mnt/nasd/nobak/healtharchive/jobs`
- `DEFAULT_ARCHIVE_TOOL_CMD = "archive-tool"`
Env overrides:
- `HEALTHARCHIVE_ARCHIVE_ROOT` → archive root.
- `HEALTHARCHIVE_TOOL_CMD` → CLI to call (e.g., `archive-tool`, `python run_archive.py`).
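For illustration, a minimal sketch of how these defaults and overrides could be resolved; the helper name `load_archive_tool_config` is hypothetical and the real loader in `ha_backend/config.py` may differ:

```python
import os
from pathlib import Path

from ha_backend.config import ArchiveToolConfig  # dataclass shown above


def load_archive_tool_config() -> ArchiveToolConfig:
    """Hypothetical loader: env vars take precedence over the defaults."""
    archive_root = Path(
        os.environ.get("HEALTHARCHIVE_ARCHIVE_ROOT", "/mnt/nasd/nobak/healtharchive/jobs")
    )
    tool_cmd = os.environ.get("HEALTHARCHIVE_TOOL_CMD", "archive-tool")
    cfg = ArchiveToolConfig(archive_root=archive_root, archive_tool_cmd=tool_cmd)
    cfg.ensure_archive_root()  # mkdir -p the archive root if it does not exist
    return cfg
```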
DatabaseConfig
Defaults:
- `DEFAULT_DATABASE_URL = "sqlite:///healtharchive.db"` in the repo root.
Env override:
- `HEALTHARCHIVE_DATABASE_URL`.
2.2 Logging (ha_backend/logging_config.py)
Centralized logging configuration:
- Reads `HEALTHARCHIVE_LOG_LEVEL` (default `INFO`).
- On first call, uses `logging.basicConfig(...)` with:
  - Format: `"%(asctime)s [%(levelname)s] %(name)s: %(message)s"`.
- Adjusts noisy loggers:
  - `sqlalchemy.engine` → `WARNING`.
  - `uvicorn.access` → `INFO`.
Used in:
- `ha_backend.api.__init__` (API startup).
- `ha_backend.cli.main` (CLI entrypoint).
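A minimal sketch of this centralized setup, assuming a module-level "configure once" guard; the real `ha_backend/logging_config.py` may differ in detail:

```python
import logging
import os

_configured = False


def setup_logging() -> None:
    """Configure root logging once, honouring HEALTHARCHIVE_LOG_LEVEL."""
    global _configured
    if _configured:
        return
    level = os.environ.get("HEALTHARCHIVE_LOG_LEVEL", "INFO").upper()
    logging.basicConfig(
        level=level,
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    )
    # Quieten noisy third-party loggers.
    logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)
    logging.getLogger("uvicorn.access").setLevel(logging.INFO)
    _configured = True
```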
3. Data model (SQLAlchemy ORM)
Defined in src/ha_backend/models.py, with Base from ha_backend.db.
3.1 Source
Represents a logical content origin (e.g., Health Canada, PHAC).
Important fields:
- `id: int` (PK)
- `code: str` – short code (`"hc"`, `"phac"`) – unique, indexed.
- `name: str` – human‑readable name.
- `base_url: str | None`
- `description: str | None`
- `enabled: bool`
- Timestamps: `created_at`, `updated_at`
Relationships:
- `jobs: List[ArchiveJob]` – all jobs for this source.
- `snapshots: List[Snapshot]` – all snapshots for this source.
- `annual_editions: List[AnnualEdition]` – one row per source/year annual edition.
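For orientation, a hedged sketch of the `Source` model in SQLAlchemy 2.0 style; column types/lengths and relationship options are assumptions and the real `models.py` may differ:

```python
from datetime import datetime, timezone

from sqlalchemy import Boolean, DateTime, String
from sqlalchemy.orm import Mapped, mapped_column, relationship

from ha_backend.db import Base  # declarative base used by the real models


class Source(Base):
    """Sketch of the Source model described above."""
    __tablename__ = "sources"

    id: Mapped[int] = mapped_column(primary_key=True)
    code: Mapped[str] = mapped_column(String(32), unique=True, index=True)
    name: Mapped[str] = mapped_column(String(255))
    base_url: Mapped[str | None] = mapped_column(String(1024), nullable=True)
    description: Mapped[str | None] = mapped_column(String(2048), nullable=True)
    enabled: Mapped[bool] = mapped_column(Boolean, default=True)
    created_at: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), default=lambda: datetime.now(timezone.utc)
    )
    updated_at: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), default=lambda: datetime.now(timezone.utc)
    )

    jobs = relationship("ArchiveJob", back_populates="source")
    snapshots = relationship("Snapshot", back_populates="source")
    annual_editions = relationship("AnnualEdition", back_populates="source")
```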
3.2 AnnualEdition
Represents the researcher-facing annual archive for one source/year. It is usually built from multiple ArchiveJob shards plus any legacy full-site salvage jobs.
Key fields:
- Identity:
  - `id: int` (PK)
  - `source_id: int` → FK to `sources.id`
  - `year: int`
- Readiness:
  - `status: str` – edition lifecycle (`planned`, `in_progress`, `search_ready`, `research_ready`, `needs_review`, etc.).
  - `search_ready: bool` – all blocking shard jobs have indexed searchable snapshots.
  - `research_ready: bool` – coverage/provenance review has accepted the documented result.
- Coverage summary:
  - `intended_url_count`, `captured_url_count`, `failed_url_count`, `excluded_url_count`
  - `backend_counts: JSON | None`
  - `coverage_summary: JSON | None`
- Artifacts:
  - `target_ledger_path`
  - `capture_manifest_path`
  - `coverage_report_json_path`
  - `coverage_report_md_path`
Relationships:
- `source: Source`
- `jobs: List[ArchiveJob]`
3.3 ArchiveJob
Represents a single archive_tool run for a source. In annual campaigns it is also a shard belonging to an AnnualEdition.
Key fields:
- Identity:
  - `id: int` (PK)
  - `source_id: int | None` → FK to `sources.id`
  - `edition_id: int | None` → FK to `annual_editions.id`
  - `name: str` – must match `--name` for `archive_tool`; used in ZIM naming.
  - `output_dir: str` – host path used as `--output-dir` for `archive_tool`.
  - `shard_key: str | None` – deterministic shard identifier within an edition.
  - `shard_kind: str | None` – e.g. `path-language`, `legacy-full-site`, `fallback-fill`.
  - `acceptance_state: str | None` – `pending`, `needs_review`, `accepted`, `accepted_gap`, or `excluded`.
- Lifecycle/status:
  - `status: str` – high‑level state; typical values:
    - `queued`
    - `running`
    - `retryable`
    - `failed`
    - `completed` (crawl succeeded)
    - `indexing`
    - `indexed`
    - `index_failed`
  - `queued_at`, `started_at`, `finished_at`: timestamps.
  - `retry_count: int` – number of times the worker retried the crawl.
- Configuration:
  - `config: JSON | None` – “opaque” config used to reconstruct the CLI:

        {
          "seeds": ["https://..."],
          "zimit_passthrough_args": ["--profile", "foo"],
          "tool_options": {
            "cleanup": false,
            "overwrite": false,
            "skip_final_build": false,
            "enable_monitoring": false,
            "enable_adaptive_workers": false,
            "enable_vpn_rotation": false,
            "initial_workers": 2,
            "log_level": "INFO",
            "...": "..."
          }
        }

- Crawl metrics:
  - `crawler_exit_code: int | None` – exit code from the `archive_tool` process.
  - `crawler_status: str | None` – summarised status (e.g. `"success"`, `"failed"`).
  - `crawler_stage: str | None` – last known stage (not heavily used yet).
  - `last_stats_json: JSON | None` – parsed crawl stats from the latest combined log, when available.
  - `pages_crawled`, `pages_total`, `pages_failed`: simple integer metrics derived from `last_stats_json` (best-effort).
- WARC/ZIM counts:
  - `warc_file_count: int` – number of WARCs discovered for this job.
  - `indexed_page_count: int` – number of `Snapshot`s created during indexing.
- Filesystem paths:
  - `final_zim_path: str | None` – if a ZIM is produced by `archive_tool` or manual `warc2zim`.
  - `combined_log_path: str | None` – path to the latest combined log, used for stats/debugging.
  - `state_file_path: str | None` – path to `.archive_state.json` within `output_dir` (may be `None` after cleanup).
  - `coverage_report_path: str | None` – shard or edition report artifact.
- Cleanup state (future):
  - `cleanup_status: str` – describes whether any cleanup has occurred:
    - `"none"` (default) – temp dirs & state still present (or never existed).
    - `"temp_cleaned"` – `cleanup-job` or an equivalent operation removed temp dirs/state.
    - Future values could represent more aggressive cleanup.
  - `cleaned_at: datetime | None` – when cleanup was performed.
Relationships:
- `source: Source | None` – parent source.
- `edition: AnnualEdition | None` – annual edition this job contributes to.
- `snapshots: List[Snapshot]` – all snapshots produced by this job.
3.4 Snapshot
Represents a single captured web page (an HTML response) extracted from a WARC.
Key fields:
- Identity:
  - `id: int` (PK)
  - `job_id: int | None` → FK to `archive_jobs.id`
  - `source_id: int | None` → FK to `sources.id`
- URL & grouping:
  - `url: str` – full URL of the capture (including query string).
  - `normalized_url_group: str | None` – optional canonicalised URL for grouping (e.g., removing query or anchors).
- Timing:
  - `capture_timestamp: datetime` – from `WARC-Date` or HTTP headers.
- HTTP & content:
  - `mime_type: str | None`
  - `status_code: int | None`
  - `title: str | None` – extracted from `<title>` or headings.
  - `snippet: str | None` – short preview text.
  - `language: str | None` – ISO language (e.g. `"en"`, `"fr"`).
  - `capture_backend: str | None` – backend that produced the capture (`browsertrix`, `playwright_warc`, etc.).
  - `capture_fidelity: str | None` – fidelity label used in reports and public exports (`high`, `fallback`, `unknown`).
  - `provenance_json: JSON | None` – structured capture provenance, including job/shard metadata.
- Storage / replay:
  - `warc_path: str` – path to the `.warc.gz` file on disk.
  - `warc_record_id: str | None` – WARC record identifier or offset (see `indexing.viewer`).
  - `raw_snapshot_path: str | None` – optional path to a static HTML export, if you create such stubs.
  - `content_hash: str | None` – hash of the HTML body for deduplication.
Relationships:
- `job: ArchiveJob | None`
- `source: Source | None`
4. Job registry & creation (ha_backend/job_registry.py)
The job registry defines default behavior and seeds for each source code ("hc", "phac").
4.1 SourceJobConfig
@dataclass
class SourceJobConfig:
source_code: str
name_template: str
default_seeds: List[str]
default_zimit_passthrough_args: List[str]
default_tool_options: Dict[str, Any]
schedule_hint: Optional[str] = None
Examples:
- `hc` (Health Canada):
  - `name_template = "hc-{date:%Y%m%d}"`
  - `default_seeds = ["https://www.canada.ca/en/health-canada.html"]`
  - `default_tool_options`:
    - `cleanup = False`
    - `overwrite = False`
    - `enable_monitoring = True` (required for adaptive strategies)
    - `enable_adaptive_workers = True`
    - `enable_adaptive_restart = True`
    - `enable_vpn_rotation = False` (disabled by default)
    - `initial_workers = 2`
    - `stall_timeout_minutes = 60`
    - `docker_shm_size = "1g"`
    - `skip_final_build = True` (annual campaign: search/indexing uses WARCs)
    - `error_threshold_timeout = 50`
    - `error_threshold_http = 50`
    - `backoff_delay_minutes = 2`
    - `max_container_restarts = 20`
    - `log_level = "INFO"`
- `phac` (Public Health Agency of Canada) is similar, with a PHAC home page seed.
4.2 Job name and output dir
- `generate_job_name(source_cfg, now)`:
  - Renders `name_template` using `{date:%Y%m%d}` from the UTC timestamp.
  - E.g. `hc-20251209`.
- `build_output_dir_for_job(source_code, job_name, archive_root, now)`:
  - Builds the job's output directory under the archive root (see the hedged sketch below).
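A minimal sketch of these two helpers, for illustration only; the exact directory layout produced by `build_output_dir_for_job` is an assumption and may differ from the real implementation:

```python
from datetime import datetime, timezone
from pathlib import Path


def generate_job_name(source_cfg, now: datetime | None = None) -> str:
    # Renders e.g. "hc-{date:%Y%m%d}" -> "hc-20251209" from a UTC timestamp.
    now = now or datetime.now(timezone.utc)
    return source_cfg.name_template.format(date=now)


def build_output_dir_for_job(
    source_code: str, job_name: str, archive_root: Path, now: datetime
) -> Path:
    # Hypothetical layout: one directory per job, grouped by source code.
    return archive_root / source_code / job_name
```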
4.3 Job config JSON
`build_job_config(source_cfg, extra_seeds=None, overrides=None)`:
- Merges `default_seeds` + extra seeds.
- Copies `default_zimit_passthrough_args`.
- Copies and updates `default_tool_options` with any `overrides`.
- Performs basic validation of `tool_options` to fail fast on misconfiguration:
  - If `enable_adaptive_workers=True` but `enable_monitoring` is not `True`, a `ValueError` is raised.
  - If `enable_vpn_rotation=True` but `enable_monitoring` is not `True`, a `ValueError` is raised.
  - If `enable_vpn_rotation=True` but `vpn_connect_command` is missing or empty, a `ValueError` is raised.
Result structure:
{
"seeds": ["https://...", "..."],
"zimit_passthrough_args": [],
"tool_options": {
"cleanup": false,
"overwrite": false,
"skip_final_build": true,
"enable_monitoring": true,
"enable_adaptive_workers": true,
"enable_adaptive_restart": true,
"enable_vpn_rotation": false,
"initial_workers": 2,
"stall_timeout_minutes": 60,
"docker_shm_size": "1g",
"error_threshold_timeout": 50,
"error_threshold_http": 50,
"backoff_delay_minutes": 2,
"max_container_restarts": 20,
"log_level": "INFO"
}
}
4.4 create_job_for_source
def create_job_for_source(
source_code: str,
*,
session: Session,
overrides: Optional[Dict[str, Any]] = None,
) -> ORMArchiveJob:
Steps:
- Look up the `SourceJobConfig` for `source_code`.
- Ensure a `Source` row with that code exists (or raise).
- Resolve `archive_root` from config.
- Generate `job_name` and `output_dir`.
- Build `job_config`.
- Insert an `ArchiveJob`: `status="queued"`, `queued_at=now`, `config=job_config`.
The CLI command healtharchive create-job --source hc is a thin wrapper around this.
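A hedged usage sketch, assuming `get_session` is importable from `ha_backend.db` and usable as a context manager (the CLI wrapper may handle sessions differently):

```python
from ha_backend.db import get_session
from ha_backend.job_registry import create_job_for_source

with get_session() as session:
    job = create_job_for_source(
        "hc",
        session=session,
        overrides={"initial_workers": 4},  # merged into default_tool_options
    )
    session.commit()
    print(job.id, job.status)  # e.g. "42 queued"
```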
5. archive_tool integration & job runner (ha_backend/jobs.py)
5.1 RuntimeArchiveJob
RuntimeArchiveJob is a small helper for ad‑hoc runs (healtharchive run-job) that:
- Holds just a `name` and `seeds: list[str]`.
- Creates a timestamped job directory under the archive root (unless overridden).
- Builds the `archive_tool` CLI command.
- Executes it via `subprocess.run(...)`.
This path is used by:
- `healtharchive run-job` – direct, non‑persistent jobs.
5.2 run_persistent_job – DB‑backed jobs
Responsibilities:
- Load job and mark running:
  - Using `get_session()`:
    - Fetch the `ArchiveJob` by ID.
    - Validate `status in ("queued", "retryable")`.
    - Extract `config`, splitting it into `tool_options`, `zimit_passthrough_args`, and `seeds`.
    - Validate that `seeds` is non‑empty.
    - Record `output_dir` and `name`.
    - Set `status = "running"` and `started_at = now`.
- Build CLI options from `tool_options`:
  - Core flags derived directly from `tool_options` (e.g. `--initial-workers`).
  - Monitoring options: only if `enable_monitoring` is `True`:
    - Adds `--enable-monitoring`.
    - Optionally:
      - `monitor_interval_seconds` → `--monitor-interval-seconds N`
      - `stall_timeout_minutes` → `--stall-timeout-minutes N`
      - `error_threshold_timeout` → `--error-threshold-timeout N`
      - `error_threshold_http` → `--error-threshold-http N`
  - Adaptive workers: only if both `enable_monitoring` and `enable_adaptive_workers` are `True`:
    - Adds `--enable-adaptive-workers`.
    - Optionally:
      - `min_workers` → `--min-workers N`
      - `max_worker_reductions` → `--max-worker-reductions N`
  - VPN rotation: only if `enable_monitoring`, `enable_vpn_rotation`, and `vpn_connect_command` are all present:
    - Adds the VPN rotation flags together with the configured `vpn_connect_command`.
    - Optionally:
      - `max_vpn_rotations` → `--max-vpn-rotations N`
      - `vpn_rotation_frequency_minutes` → `--vpn-rotation-frequency-minutes N`
  - Backoff: only when monitoring is enabled and `backoff_delay_minutes` is set: `--backoff-delay-minutes N`.
  - Zimit passthrough:
    - `zimit_passthrough_args` are appended directly (no explicit `"--"` separator is required): `archive_tool` uses `argparse.parse_known_args()` and passes unknown args through to `zimit`.
    - For `healtharchive run-job`, a leading `"--"` is accepted and stripped for convenience when passing flags through interactively.
  - The final `extra_args` passed to `RuntimeArchiveJob.run(...)` combine these flags with the zimit passthrough args (see the hedged sketch after this section).
- Execute `archive_tool`:
  - Instantiates `RuntimeArchiveJob(name, seeds)`.
  - Calls its `run(...)` method with the assembled `extra_args`.
  - `output_dir_override` ensures a specific job directory under the archive root (matching the DB record) is used, and created if needed.
- Update job status:
  - After the subprocess returns:
    - `crawler_exit_code = rc`
    - `finished_at = now`
    - `combined_log_path` is recorded best-effort (newest `archive_*.combined.log`)
    - `status = "completed"` and `crawler_status = "success"` if `rc == 0`
    - Otherwise:
      - `status = "retryable"`, `crawler_status = "infra_error"` for storage/mount failures
      - `status = "failed"`, `crawler_status = "infra_error_config"` for CLI/config/runtime errors (e.g., invalid `zimit_passthrough_args`)
      - `status = "failed"`, `crawler_status = "failed"` for normal crawl failures
The worker uses run_persistent_job(job_id) for each queued job.
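For illustration, a simplified and incomplete sketch of this flag mapping; the real `run_persistent_job` covers many more options and the error classification described above:

```python
def build_extra_args(tool_options: dict, zimit_passthrough_args: list[str]) -> list[str]:
    """Simplified sketch of the tool_options -> CLI flag mapping."""
    args: list[str] = []
    if tool_options.get("initial_workers") is not None:
        args += ["--initial-workers", str(tool_options["initial_workers"])]
    if tool_options.get("enable_monitoring"):
        args.append("--enable-monitoring")
        if tool_options.get("stall_timeout_minutes") is not None:
            args += ["--stall-timeout-minutes", str(tool_options["stall_timeout_minutes"])]
        if tool_options.get("enable_adaptive_workers"):
            args.append("--enable-adaptive-workers")
        if tool_options.get("backoff_delay_minutes") is not None:
            args += ["--backoff-delay-minutes", str(tool_options["backoff_delay_minutes"])]
    # Unknown zimit flags are appended as-is; archive_tool forwards them to zimit.
    return args + list(zimit_passthrough_args)
```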
5.3 Maintaining the archive_tool integration
The backend and archive_tool share a small but important contract:
- Configuration JSON:
  - `ArchiveJob.config` stores a dict that is the serialised form of `ArchiveJobConfig` from `ha_backend.archive_contract`:

        {
          "seeds": ["https://...", "..."],
          "zimit_passthrough_args": ["--scopeType", "host"],
          "tool_options": {
            "cleanup": false,
            "overwrite": false,
            "skip_final_build": true,
            "enable_monitoring": true,
            "enable_adaptive_workers": true,
            "enable_adaptive_restart": true,
            "enable_vpn_rotation": false,
            "initial_workers": 2,
            "log_level": "INFO",
            "relax_perms": true,
            "stall_timeout_minutes": 60,
            "docker_shm_size": "1g",
            "error_threshold_timeout": 50,
            "error_threshold_http": 50,
            "max_container_restarts": 20,
            "backoff_delay_minutes": 2
          }
        }

  - `SourceJobConfig.default_tool_options` in `ha_backend.job_registry` is the source of truth for defaults; overrides are merged via `build_job_config(...)`, which uses `ArchiveToolOptions` + `validate_tool_options(...)` to enforce invariants that mirror `archive_tool.cli` (e.g. monitoring required for adaptive/VPN).
- CLI construction:
  - `ha_backend.jobs.run_persistent_job` is the only place that maps `tool_options` fields to `archive_tool` CLI flags. It expects the argument model described in `src/archive_tool/docs/documentation.md` and `archive_tool/cli.py`.
  - If you add or rename CLI options in `archive_tool`:
    - Extend `ArchiveToolOptions` and `ArchiveJobConfig` to carry the new fields.
    - Update `run_persistent_job` to add/remove the corresponding flags.
    - Adjust tests under `tests/test_job_registry.py`, `tests/test_archive_contract.py`, and `tests/test_jobs_persistent.py` that assert config and CLI behaviour.
- Stats and logs:
  - `archive_tool` writes combined logs `archive_<stage_name>_*.combined.log` under each job's `output_dir` and emits `"Crawl statistics"` JSON lines that `archive_tool.utils.parse_last_stats_from_log` can parse.
  - `ha_backend.crawl_stats.update_job_stats_from_logs`:
    - Locates the latest combined log for a job.
    - Calls `parse_last_stats_from_log(log_path)` to obtain a stats dict.
    - Stores it in `ArchiveJob.last_stats_json`.
    - Updates `pages_crawled`, `pages_total`, `pages_failed`, and `combined_log_path` as a best-effort summary.
  - `/metrics` exposes these page counters via:
    - `healtharchive_jobs_pages_crawled_total`
    - `healtharchive_jobs_pages_failed_total`
    - per-source variants, backed by the `pages_*` fields on `ArchiveJob`.
- WARC discovery and cleanup:
  - `ha_backend.indexing.warc_discovery.discover_warcs_for_job` relies on `archive_tool.state.CrawlState` and `archive_tool.utils.find_all_warc_files` / `find_latest_temp_dir_fallback` for WARC discovery and temp dir tracking.
  - `ha_backend.cli.cmd_cleanup_job` uses `CrawlState` and `archive_tool.utils.cleanup_temp_dirs` to remove `.tmp*` directories and `.archive_state.json` safely once jobs are indexed.
If you change log formats, state layout, or directory structure in archive_tool, update the corresponding backend helpers (ArchiveJobConfig, run_persistent_job, update_job_stats_from_logs, WARC discovery, and cleanup) and their tests to keep the contract in sync.
6. Indexing pipeline (ha_backend/indexing/*)
The indexing pipeline converts the WARCs produced by archive_tool into structured Snapshot rows.
6.1 WARC discovery (warc_discovery.py)
from archive_tool.state import CrawlState
from archive_tool.utils import find_all_warc_files, find_latest_temp_dir_fallback
Steps:
- Resolve `host_output_dir = Path(job.output_dir).resolve()`.
- Instantiate `CrawlState(host_output_dir, initial_workers=1)`:
  - This loads `.archive_state.json` if present.
- Get `temp_dirs = state.get_temp_dir_paths()`:
  - Returns only existing directories and prunes missing ones from state.
- If `temp_dirs` is empty and `allow_fallback`:
  - Use `find_latest_temp_dir_fallback(host_output_dir)` to scan for `.tmp*` directories.
  - If still empty → return `[]`.
- Call `find_all_warc_files(temp_dirs)`:
  - Returns a de‑duplicated list of `*.warc.gz` files under each `collections/crawl-*/archive` directory.
This ensures the backend uses exactly the same WARC discovery logic as archive_tool itself.
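A condensed sketch of these steps (the return shape of `find_latest_temp_dir_fallback` is an assumption; the real function signature may differ):

```python
from pathlib import Path

from archive_tool.state import CrawlState
from archive_tool.utils import find_all_warc_files, find_latest_temp_dir_fallback


def discover_warcs_for_job(job, allow_fallback: bool = True) -> list[Path]:
    """Condensed sketch of the discovery steps listed above."""
    host_output_dir = Path(job.output_dir).resolve()
    state = CrawlState(host_output_dir, initial_workers=1)  # loads .archive_state.json if present
    temp_dirs = state.get_temp_dir_paths()                   # existing dirs only
    if not temp_dirs and allow_fallback:
        fallback = find_latest_temp_dir_fallback(host_output_dir)
        temp_dirs = [fallback] if fallback else []
    if not temp_dirs:
        return []
    return find_all_warc_files(temp_dirs)                    # de-duplicated *.warc.gz list
```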
6.2 WARC reading (warc_reader.py)
Wraps warcio to stream HTML response records from a .warc.gz file.
Exports a generator, `iter_html_records(warc_path)`, that yields `ArchiveRecord` objects (a hedged sketch follows the field list below).
`ArchiveRecord` provides:
- `url: str`
- `capture_timestamp: datetime`
- `headers: dict[str, str]`
- `body_bytes: bytes`
- `warc_path: Path`
- `warc_record_id: str | None`
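A hedged, simplified sketch of what such a reader might look like using `warcio`; the real `warc_reader.py` likely differs in detail (timestamp parsing, filtering, record ID handling):

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Iterator

from warcio.archiveiterator import ArchiveIterator


@dataclass
class ArchiveRecord:
    url: str
    capture_timestamp: datetime
    headers: dict[str, str]
    body_bytes: bytes
    warc_path: Path
    warc_record_id: str | None


def iter_html_records(warc_path: Path) -> Iterator[ArchiveRecord]:
    """Yield one ArchiveRecord per HTML 'response' record in the WARC."""
    with open(warc_path, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "html" not in ctype.lower():
                continue
            warc_date = record.rec_headers.get_header("WARC-Date") or ""
            yield ArchiveRecord(
                url=record.rec_headers.get_header("WARC-Target-URI") or "",
                capture_timestamp=datetime.fromisoformat(warc_date.replace("Z", "+00:00")),
                headers=dict(record.http_headers.headers),
                body_bytes=record.content_stream().read(),
                warc_path=warc_path,
                warc_record_id=record.rec_headers.get_header("WARC-Record-ID"),
            )
```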
6.3 Text extraction (text_extraction.py)
Helpers:
- `extract_title(html: str) -> str` – heuristics over `<title>` / headings.
- `extract_text(html: str) -> str` – uses BeautifulSoup to pull visible text.
- `make_snippet(text: str) -> str` – short preview (~N chars/words).
- `detect_language(text: str, headers: dict) -> str` – simple language detection, leveraging headers or heuristics (kept basic for now).
6.4 Mapping records to Snapshot (mapping.py)
record_to_snapshot(job, source, rec, title, snippet, language):
- Takes:
  - `ArchiveJob`
  - `Source`
  - `ArchiveRecord` from `iter_html_records`
  - `title`, `snippet`, `language` from text extraction
- Produces a new `Snapshot` instance with:
  - `job_id`, `source_id`
  - `url`, `normalized_url_group`
  - `capture_timestamp`
  - `mime_type`, `status_code`
  - `title`, `snippet`, `language`
  - `warc_path`, `warc_record_id`
  - `content_hash` (if computed)
6.5 Orchestration (pipeline.py)
Steps:
- Load the `ArchiveJob` by ID, ensure:
  - `job.source` is not `None`.
  - `job.status in ("completed", "index_failed", "indexed")`.
- Validate `output_dir` exists.
- Discover WARCs:
  - `warc_paths = discover_warcs_for_job(job)`.
  - Sets `job.warc_file_count = len(warc_paths)`.
  - If no WARCs are found:
    - Logs a warning.
    - Sets `job.status = "index_failed"` and returns `1`.
- Clear previous snapshots for this job:
  - `DELETE FROM snapshots WHERE job_id = :job_id`.
- Mark the job as indexing:
  - `job.indexed_page_count = 0`, `job.status = "indexing"`.
- For each WARC path:
  - Iterate `iter_html_records(warc_path)`.
  - Decode `html = rec.body_bytes.decode("utf-8", errors="replace")`.
  - Use the text extraction functions to get `title`, `text`, `snippet`, `language`.
  - Call `record_to_snapshot(...)` to construct a `Snapshot`.
  - `session.add(snapshot)`; flush every 500 additions.
  - Count snapshots in `n_snapshots`.
  - On per‑record errors, log and continue.
- On success:
  - Set `job.indexed_page_count = n_snapshots`.
  - Set `job.status = "indexed"`.
  - Return `0`.
- On unexpected error:
  - Log at error level.
  - Set `job.status = "index_failed"`.
  - Return `1`.
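Putting the steps together, a condensed sketch of `index_job`; batching, logging, and error handling are simplified, and the helper import paths are assumptions based on the module names above:

```python
from ha_backend.db import get_session
from ha_backend.models import ArchiveJob, Snapshot
from ha_backend.indexing.warc_discovery import discover_warcs_for_job
from ha_backend.indexing.warc_reader import iter_html_records
from ha_backend.indexing.text_extraction import (
    detect_language, extract_text, extract_title, make_snippet,
)
from ha_backend.indexing.mapping import record_to_snapshot


def index_job(job_id: int) -> int:
    """Condensed sketch of the pipeline.py orchestration."""
    with get_session() as session:
        job = session.get(ArchiveJob, job_id)
        warc_paths = discover_warcs_for_job(job)
        job.warc_file_count = len(warc_paths)
        if not warc_paths:
            job.status = "index_failed"
            return 1
        session.query(Snapshot).filter(Snapshot.job_id == job.id).delete()
        job.indexed_page_count, job.status = 0, "indexing"
        n_snapshots = 0
        for warc_path in warc_paths:
            for rec in iter_html_records(warc_path):
                html = rec.body_bytes.decode("utf-8", errors="replace")
                text = extract_text(html)
                snapshot = record_to_snapshot(
                    job, job.source, rec,
                    title=extract_title(html),
                    snippet=make_snippet(text),
                    language=detect_language(text, rec.headers),
                )
                session.add(snapshot)
                n_snapshots += 1
                if n_snapshots % 500 == 0:
                    session.flush()
        job.indexed_page_count, job.status = n_snapshots, "indexed"
        return 0
```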
7. Viewer helper (ha_backend/indexing/viewer.py)
The viewer helper is used by GET /api/snapshots/raw/{id} to reconstruct the HTML for a snapshot from its WARC.
Design:
- Either:
  - Use `warc_record_id` to seek directly to a known record, or
  - Fall back to scanning `warc_path` for the first matching URL + timestamp.
The API route:
- Validates that the `Snapshot` and its `warc_path` exist.
- Calls `find_record_for_snapshot(snapshot)`:
  - Returns an `ArchiveRecord` or `None`.
- Decodes `record.body_bytes` as UTF‑8 with replacement.
- Writes `HTMLResponse(content=html, media_type="text/html")`.
This is used by the Next.js frontend for the embedded snapshot viewer.
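A hedged sketch of the fallback scan path using `warcio`; the real helper also matches on timestamp/record ID and returns an `ArchiveRecord` rather than raw bytes:

```python
from warcio.archiveiterator import ArchiveIterator


def find_record_by_scan(snapshot) -> bytes | None:
    """Fallback path: scan the WARC for the first response matching the snapshot URL."""
    with open(snapshot.warc_path, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type != "response":
                continue
            if record.rec_headers.get_header("WARC-Target-URI") == snapshot.url:
                return record.content_stream().read()  # raw HTTP payload bytes
    return None
```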
8. HTTP API (ha_backend/api/*)
8.1 Public schemas (schemas.py)
Public Pydantic models:
- `SourceSummarySchema` – used by `/api/sources`:
sourceCode: str
sourceName: str
recordCount: int
firstCapture: str
lastCapture: str
latestRecordId: Optional[int]
- `SnapshotSummarySchema` – used by `/api/search`:
  - `id`, `title`, `sourceCode`, `sourceName`, `language`, `captureDate`, `originalUrl`, `snippet`, `rawSnapshotUrl`.
- `SearchResponseSchema`:
  - `results: List[SnapshotSummarySchema]`, `total`, `page`, `pageSize`.
- `ArchiveStatsSchema` – used by `/api/stats`:
  - `snapshotsTotal`, `pagesTotal`, `sourcesTotal`, `latestCaptureDate`, `latestCaptureAgeDays`.
- `SnapshotDetailSchema` – used by `/api/snapshot/{id}`:
  - Contains metadata for a single snapshot including `mimeType` and `statusCode`, plus `rawSnapshotUrl`.
8.2 Public routes (routes_public.py)
- `GET /api/health`:
  - Returns lightweight JSON with a status field and basic checks.
  - `GET /api/health?details=1` adds summary counts.
  - If the DB connectivity check fails, returns HTTP 500 with `{"status": "error", "checks": {"db": "error"}}`.
- `GET /api/stats`:
  - Returns lightweight, cacheable archive totals used by the frontend.
- `GET /api/sources`:
  - Aggregates `Snapshot` by `source_id`:
    - Counts, first/last capture dates, latest snapshot ID.
- `GET /api/search`:
  - Query params:
    - `q: str | None` – keyword.
    - `source: str | None` – source code (e.g. `"hc"`).
    - `sort: "relevance" | "newest" | None` – ordering mode.
    - `view: "snapshots" | "pages" | None` – results grouping mode.
    - `includeNon2xx: bool` – include non‑2xx HTTP status captures (defaults to `false`).
    - `from: YYYY-MM-DD | None` – filter captures from this UTC date, inclusive.
    - `to: YYYY-MM-DD | None` – filter captures up to this UTC date, inclusive.
    - `page: int` – 1‑based page index (default `1`, must be `>= 1`).
    - `pageSize: int` – results per page (default `20`, minimum `1`, maximum `100`).
  - Filters:
    - `Source.code == source.lower()` when `source` is set.
    - By default (`includeNon2xx=false`), filters out snapshots with a known non‑2xx `status_code` (keeps `status_code IS NULL` and `200`–`299`).
    - Keyword filter / query intent:
      - URL lookup: when `q` looks like a URL (or starts with `url:`), treat it as a page lookup and filter by the normalized URL group (with a small set of common scheme/`www.` variants).
      - Boolean/field syntax: when `q` contains `AND`/`OR`/`NOT`, parentheses, `-term`, or `title:`/`snippet:`/`url:` prefixes, parse it and apply a boolean filter using case-insensitive substring matching.
      - Plain text:
        - On Postgres with `sort="relevance"`: full‑text search (FTS) against `snapshots.search_vector`.
          - If FTS yields no results, fall back to tokenized substring matching.
          - If that still yields no results and `pg_trgm` is available, fall back to pg_trgm word-level trigram similarity for fuzzy matching (misspellings).
        - Otherwise: tokenized substring matching on `title`, `snippet`, and `url`.
  - Ordering:
    - Default sort:
      - When `q` is present: `sort="relevance"`.
      - When `q` is absent: `sort="newest"`.
    - `sort="relevance"` (when `q` is present):
      - On Postgres: uses FTS (`websearch_to_tsquery` + `ts_rank_cd`) against `snapshots.search_vector`, with small heuristics (phrase-in-title boost, URL depth/querystring penalties) and an optional authority boost from `page_signals.inlink_count` (when available).
      - On SQLite/other DBs: uses a DB‑agnostic match score (title > URL > snippet), then (when available) a small authority tie-break from `page_signals`, then recency.
    - `sort="newest"`: orders by recency.
    - When `includeNon2xx=true`, 2xx snapshots are still prioritised ahead of 3xx, unknown, and 4xx/5xx captures.
  - Grouping:
    - Default view: `view="snapshots"` (returns individual captures; `total` counts snapshots).
      - For broad newest-snapshot browsing without query/date/URL filters and with `includeDuplicates=false`, the API can use stored `Snapshot.deduplicated` flags instead of rebuilding same-day content de-duplication with a runtime window function. Query, date, URL, and relevance searches keep the stricter runtime de-duplication path.
    - `view="pages"` returns only the latest snapshot for each page group (`normalized_url_group`, falling back to `url` with query/fragment stripped), and `total` counts page groups.
      - When `view="pages"` is used for browse (no `q` and no date range), the API can optionally use the `pages` table as a fast path (controlled by `HA_PAGES_FASTPATH`). This is a metadata-only optimization and does not affect replay fidelity.
      - When available, `pageSnapshotsCount` is included on `view="pages"` results to show the number of captures for that page group.
  - Pagination semantics:
    - `total` is the total number of matching items across all pages (snapshots for `view="snapshots"`, page groups for `view="pages"`).
    - `results` contains at most `pageSize` snapshots for the requested `page` (in `view="pages"`, these are the latest snapshots for each page group).
    - Requesting a page past the end of the result set returns `200 OK` with `results: []` and `total` unchanged.
    - Supplying an invalid `page` (`< 1`) or `pageSize` (`< 1` or `> 100`) yields `422 Unprocessable Entity` from FastAPI's validation.
- `GET /api/snapshot/{id}`:
  - Loads `Snapshot` + `Source`.
  - Returns `SnapshotDetailSchema`.
  - 404 if snapshot or source missing.
- `GET /api/snapshots/raw/{id}`:
  - Validates the `Snapshot` exists and `warc_path` points to an existing file.
  - Uses `find_record_for_snapshot(snapshot)` to get a WARC record.
  - Returns an HTML page via `HTMLResponse` that includes the reconstructed archived HTML plus a lightweight HealthArchive top bar (navigation links + disclaimer), so it can be viewed standalone.
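For illustration, calling the search endpoint from Python; the base URL and port are assumptions for a local dev server:

```python
import requests

resp = requests.get(
    "http://localhost:8000/api/search",
    params={"q": "covid", "source": "hc", "view": "pages", "page": 1, "pageSize": 20},
    timeout=10,
)
resp.raise_for_status()
data = resp.json()
print(data["total"], "matches,", len(data["results"]), "on this page")
for item in data["results"]:
    print(item["captureDate"], item["title"], item["originalUrl"])
```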
8.3 Admin auth (deps.py)
require_admin is a FastAPI dependency used to protect admin and metrics endpoints.
Behavior:
- Reads `HEALTHARCHIVE_ENV` and `HEALTHARCHIVE_ADMIN_TOKEN` from the environment.
- If `HEALTHARCHIVE_ENV` is `"production"` or `"staging"` and `HEALTHARCHIVE_ADMIN_TOKEN` is unset:
  - Admin and metrics endpoints fail closed with HTTP 500 and a clear error detail (`"Admin token not configured for this environment"`).
- In other environments (or when `HEALTHARCHIVE_ENV` is unset) and the admin token is unset:
  - Admin endpoints are open (dev mode convenience).
- When `HEALTHARCHIVE_ADMIN_TOKEN` is set:
  - Requires the same token via either:
    - the `Authorization: Bearer <token>` header, or
    - the `X-Admin-Token: <token>` header.
  - On mismatch/missing token → HTTP 403.
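A minimal sketch of such a dependency; the real `require_admin` in `ha_backend/api/deps.py` may differ in parameter handling and error details:

```python
import os

from fastapi import Header, HTTPException


def require_admin(
    authorization: str | None = Header(default=None),
    x_admin_token: str | None = Header(default=None),
) -> None:
    """Sketch of the admin/metrics auth dependency described above."""
    env = os.environ.get("HEALTHARCHIVE_ENV", "")
    token = os.environ.get("HEALTHARCHIVE_ADMIN_TOKEN")
    if not token:
        if env in ("production", "staging"):
            raise HTTPException(
                status_code=500,
                detail="Admin token not configured for this environment",
            )
        return  # dev convenience: open when no token is configured
    presented = x_admin_token
    if authorization and authorization.startswith("Bearer "):
        presented = authorization.removeprefix("Bearer ")
    if presented != token:
        raise HTTPException(status_code=403, detail="Invalid admin token")
```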
8.4 Admin schemas (schemas_admin.py)
Key models:
- `JobSummarySchema` – used for lists:
  - Contains the key job fields plus the cleanup fields described in section 10.1.
- `JobDetailSchema` – extended view for a single job:
  - Includes status, worker counters, pages, WARC counts, ZIM/log/state paths, `config` (JSON), and `lastStats` (JSON, reserved).
  - Also includes `cleanupStatus` and `cleanedAt`.
- `JobSnapshotSummarySchema` – minimal `Snapshot` view in a job context.
- `JobListResponseSchema` – wrapper for job list results.
- `JobStatusCountsSchema` – dictionary of `{status: count}`.
8.5 Admin routes (routes_admin.py)
All routes are under /api/admin and use require_admin for auth. They are intended for internal operator tooling (CLI or a future admin console), not for the public web UI.
- `GET /api/admin/jobs` → `JobListResponseSchema`:
  - Filters:
    - `source: str | None` – by source code.
    - `status: str | None` – by job status.
    - `limit` (1–500, default 50), `offset` (≥ 0).
  - Joins `ArchiveJob` with `Source` (outer join).
- `GET /api/admin/jobs/{job_id}` → `JobDetailSchema`:
  - Joins `ArchiveJob` with `Source`.
  - 404 if job not found.
- `GET /api/admin/jobs/status-counts` → `JobStatusCountsSchema`:
  - SQL: `SELECT status, COUNT(*) FROM archive_jobs GROUP BY status`.
- `GET /api/admin/jobs/{job_id}/snapshots` → `List[JobSnapshotSummarySchema]`:
  - Lists snapshots for a given job with pagination (`limit`, `offset`).
8.6 Metrics (Prometheus‑style)
Defined directly in ha_backend.api.__init__:
- `GET /metrics`:
  - Protected by `require_admin` (same token behavior) and intended for scrape‑only use by monitoring systems (e.g., Prometheus) and internal tooling.
  - Computes:
    - `healtharchive_jobs_total{status="..."}`
    - `healtharchive_jobs_cleanup_status_total{cleanup_status="..."}`
    - `healtharchive_snapshots_total`
    - `healtharchive_snapshots_total{source="hc"}`, etc.
8.7 CORS
- CORS is enabled on the public API routes. Allowed origins are derived from `HEALTHARCHIVE_CORS_ORIGINS` (comma-separated). Defaults cover local dev and production (`http://localhost:3000`, `http://localhost:5173`, `https://healtharchive.ca`, `https://www.healtharchive.ca`).
- Admin and metrics routes remain token-gated even when CORS allows browser access to public routes.
Typical environment setups:
- Local development:
# often no override needed; defaults already include localhost:3000/5173
export HEALTHARCHIVE_DATABASE_URL=sqlite:///$(pwd)/.dev-healtharchive.db
export HEALTHARCHIVE_ARCHIVE_ROOT=$(pwd)/.dev-archive-root
# Optional CORS override if your frontend runs on a different origin:
# export HEALTHARCHIVE_CORS_ORIGINS=http://localhost:3000
- Optional preview/staging (example only; not an active production path):
# If you intentionally add a separate preview/staging frontend later,
# allow only its exact origin.
export HEALTHARCHIVE_CORS_ORIGINS=https://preview.example.invalid
- Production (example):
# healtharchive.ca is canonical; www may remain in the allowlist as a redirect alias.
export HEALTHARCHIVE_CORS_ORIGINS=https://healtharchive.ca,https://www.healtharchive.ca
In all cases, CORS affects only the browser’s ability to call public routes; admin and metrics endpoints still require the admin token when configured.
9. Worker loop (ha_backend/worker/main.py)
The worker processes jobs end‑to‑end: crawl and index.
9.1 Selection
_select_next_crawl_job(session):
- Query:
session.query(ArchiveJob) \
.join(Source) \
.filter(ArchiveJob.status.in_(["queued", "retryable"])) \
.order_by(ArchiveJob.queued_at.asc().nullsfirst(),
ArchiveJob.created_at.asc()) \
.first()
- Chooses the oldest queued/retryable job, preferring jobs with the earliest
queued_at.
9.2 Processing a single job
_process_single_job():
- Select a job → get `job_id`.
- Run `run_persistent_job(job_id)`:
  - Executes `archive_tool` and returns a process exit code.
- Reload the job in a new session and apply retry semantics:
  - If `crawl_rc != 0` or `job.status == "failed"`:
    - If `job.retry_count < MAX_CRAWL_RETRIES`:
      - Increment `job.retry_count`.
      - Set `job.status = "retryable"`.
    - Else:
      - Log an error; the job remains `failed`.
  - Else (crawl succeeded):
    - Log that indexing will start.
- If the crawl succeeded:
  - Run `index_job(job_id)`.
  - Log success/failure for indexing.
Returns True if a job was processed, False if no jobs were found.
9.3 Main loop
run_worker_loop(poll_interval=30, run_once=False):
- Logs startup with the given interval and `run_once`.
- In a loop:
  - Calls `_process_single_job()`.
  - If `run_once` → break after the first iteration.
  - If no job was processed:
    - Logs and sleeps for `poll_interval` seconds.
- Handles `KeyboardInterrupt` gracefully.
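A condensed sketch of that loop (`_process_single_job` is the selection/processing helper from 9.2, defined alongside it in the worker module):

```python
import logging
import time

logger = logging.getLogger(__name__)


def run_worker_loop(poll_interval: int = 30, run_once: bool = False) -> None:
    """Sketch of the worker loop; the real version lives in ha_backend/worker/main.py."""
    logger.info("Worker starting (poll_interval=%s, run_once=%s)", poll_interval, run_once)
    try:
        while True:
            processed = _process_single_job()
            if run_once:
                break
            if not processed:
                logger.info("No queued jobs; sleeping %ss", poll_interval)
                time.sleep(poll_interval)
    except KeyboardInterrupt:
        logger.info("Worker interrupted; shutting down")
```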
10. Cleanup & retention (future)
Job‑level cleanup is focused on removing temporary crawl artifacts (.tmp* dirs and .archive_state.json) after indexing is complete.
10.1 Cleanup flags on ArchiveJob
New fields:
- `cleanup_status: str`:
  - `"none"` – no cleanup performed (default).
  - `"temp_cleaned"` – temporary dirs and state file have been deleted.
  - Future values could represent more aggressive cleanup modes.
- `cleaned_at: datetime | None` – when cleanup occurred.
These fields are exposed through:
- Admin schemas (`JobSummarySchema`, `JobDetailSchema`).
- Metrics (`healtharchive_jobs_cleanup_status_total`).
10.2 CLI command: cleanup-job
healtharchive cleanup-job --id JOB_ID [--mode temp] [--force]
Implementation notes:
- Currently supports only `--mode temp`:
  - Any other mode → error.
- Behavior:
  - Load the `ArchiveJob` by ID.
  - If the job is missing → error, exit 1.
  - If replay is enabled globally (`HEALTHARCHIVE_REPLAY_BASE_URL` is set) and `--force` is not provided:
    - Refuse cleanup and exit 1.
    - Rationale: `--mode temp` can delete WARCs required for replay.
  - If `job.status` is not one of:
    - `"indexed"` – indexing completed successfully, or
    - `"index_failed"` – indexing failed and you have decided not to retry,
    then refuse cleanup and exit 1. This ensures we don't delete temp dirs while a job might still be resumed or indexing is in progress.
  - Validate `output_dir` exists and is a directory.
  - Use `archive_tool.state.CrawlState(output_dir, initial_workers=1)` to instantiate state and locate the state file.
  - Use `state.get_temp_dir_paths()` to get known temp dirs; fall back to `find_latest_temp_dir_fallback` if none are tracked.
  - If neither temp dirs nor the state file exist:
    - Print a message that there is nothing to clean up and do not change `cleanup_status` or `cleaned_at`.
  - Otherwise (if temp dirs and/or a state file exist):
    - Call `cleanup_temp_dirs(temp_dirs, state.state_file_path)`:
      - Deletes `.tmp*` directories and the `.archive_state.json`.
    - Update the job:
      - `cleanup_status = "temp_cleaned"`
      - `cleaned_at = now`
      - `state_file_path = None`
Operational warning:
`cleanup-job --mode temp` will delete WARCs if they live under the job's `.tmp*` directory (common for legacy imports and some crawl layouts). If you intend to serve the job via replay (pywb), do not run cleanup for that job — replay depends on WARCs remaining on disk. If replay is enabled globally, you must pass `--force` to run cleanup; treat this as an emergency override.
Caution: this cleanup removes WARCs stored under `.tmp*` directories, consistent with `archive_tool`'s own `--cleanup` behavior. In v1 you should only run it once you have:
- Indexed the job successfully (`status="indexed"`), and
- Verified any desired ZIM or exports derived from these WARCs.
10.3 Metrics for cleanup
/metrics includes:
- `healtharchive_jobs_cleanup_status_total{cleanup_status="none"}`
- `healtharchive_jobs_cleanup_status_total{cleanup_status="temp_cleaned"}`
This gives a quick overview of how many jobs still have temp artifacts versus those that have been cleaned.
11. CLI commands summary
All commands are available via the healtharchive entrypoint.
- Environment / connectivity:
  - `check-env` – show the archive root and ensure it exists.
  - `check-archive-tool` – run `archive-tool --help`.
  - `check-db` – simple DB connectivity check.
- Direct, non‑persistent job:
  - `run-job` – run `archive_tool` immediately with explicit `--name`, `--seeds`, `--initial-workers`, etc.
- Persistent jobs (DB‑backed):
  - `create-job --source CODE` – create an `ArchiveJob` using registry defaults.
  - `run-db-job --id ID` – run `archive_tool` for an existing job, then index it on crawl success unless `--no-index` is used.
  - `index-job --id ID` – index an existing job's WARCs into snapshots.
  - `reconcile-completed-indexing` – idempotently index completed jobs that were started outside the worker.
  - `register-job-dir --source CODE --output-dir PATH [--name NAME]` – attach a DB `ArchiveJob` to an existing archive_tool output directory (useful when a crawl has already been run and you want to index its WARCs).
  - Job configs default to `relax_perms=True` for dev (adds `--relax-perms` so temp WARCs are chmod'd readable on the host after a crawl).
- Seeding:
  - `seed-sources` – insert baseline `Source` rows for `hc`, `phac`.
- Admin / introspection:
  - `list-jobs` – list recent jobs with basic fields.
  - `show-job --id ID` – detailed job info including config.
  - `retry-job --id ID` – mark:
    - `failed` jobs as `retryable` (for another crawl).
    - `index_failed` jobs as `completed` (for re-indexing).
  - `cleanup-job --id ID [--mode temp] [--force]` – clean up temp dirs/state for jobs in status `indexed` or `index_failed`.
  - `replay-index-job --id ID` – create/refresh the pywb collection + CDX index for a job (so snapshots can be browsed via replay).
  - `start-worker [--poll-interval N] [--once]` – start the worker loop.
- Annual editions and shards:
  - `salvage-annual-edition --year YEAR` – attach existing annual jobs as legacy full-site salvage shards and optionally regenerate reports.
  - `plan-annual-shards --year YEAR [--apply]` – plan or create deterministic edition shard jobs from source seeds.
  - `annual-edition-report` – generate or display edition coverage/provenance reports.
  - `accept-annual-shard-gap --job-id ID --reason TEXT` – mark a reviewed shard as accepted with a documented gap.
12. Testing & development
- Tests are written with `pytest` and live under `tests/`.
- To run checks, run the test suite with `pytest`.
- Many tests configure a temporary SQLite DB by:
  - Setting `HEALTHARCHIVE_DATABASE_URL` to a temp file.
  - Resetting `db_module._engine` and `_SessionLocal`.
  - Calling `Base.metadata.drop_all()` / `create_all()` to fully reset the schema.
This allows development and CI to run in isolated environments without touching real data.
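A hedged sketch of such a fixture; the module attribute names follow the description above, and the real test helpers may differ:

```python
import pytest

from ha_backend import db as db_module


@pytest.fixture
def temp_db(tmp_path, monkeypatch):
    """Point the backend at a throwaway SQLite file for one test."""
    db_file = tmp_path / "test-healtharchive.db"
    monkeypatch.setenv("HEALTHARCHIVE_DATABASE_URL", f"sqlite:///{db_file}")
    # Force the lazily created engine/session factory to be rebuilt against the temp URL.
    db_module._engine = None
    db_module._SessionLocal = None
    # A real fixture would then call Base.metadata.drop_all()/create_all() on the new
    # engine to fully reset the schema before yielding to the test.
    yield
```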
13. Relationship to archive_tool and the frontend
- archive_tool:
- Lives under `src/archive_tool/` and is maintained as part of this repo. It originated as an earlier standalone crawler project but is now the in-tree crawler/orchestrator subpackage for the backend.
- The backend calls it strictly via the CLI (`archive-tool`) as a subprocess.
- Its internal behavior (Docker orchestration, run modes, monitoring, adaptive strategies) is documented in `src/archive_tool/docs/documentation.md`.
- Frontend (`frontend/` in this repo):
  - In-tree Next.js 16 app using the backend's HTTP APIs:
    - `/api/health`
    - `/api/sources`
    - `/api/search`
    - `/api/snapshot/{id}`
    - `/api/snapshots/raw/{id}`
  - The frontend currently still supports a demo dataset, but is gradually being wired to these real APIs.
Together, the backend + archive_tool + frontend form a pipeline from:
Web → crawl (Docker + `zimit`) → WARCs → `Snapshot` rows in the DB → searchable archive UI at HealthArchive.ca.