
CLI Commands Reference

Complete reference for the healtharchive command-line interface.


Installation

The healtharchive command is installed when you install the package:

pip install -e .
# or
make venv

Verify installation:

healtharchive --help


Command Categories

Category          Commands
Environment       check-env, check-archive-tool, check-db
Job Management    create-job, run-db-job, index-job, reconcile-completed-indexing, register-job-dir
Direct Execution  run-job
Inspection        list-jobs, show-job
Maintenance       retry-job, reset-retry-count, cleanup-job, reset-crawl-state, replay-index-job
Annual Campaign   schedule-annual, annual-status, salvage-annual-edition, plan-annual-shards, annual-edition-report, accept-annual-shard-gap, reconcile-annual-tool-options, probe-browser-fetch
Seeding           seed-sources
Worker            start-worker
Change Tracking   compute-changes

Environment Commands

check-env

Check environment configuration and ensure archive root exists.

Usage:

healtharchive check-env

Output:

Archive root: /mnt/nasd/nobak/healtharchive/jobs
Archive root exists: True
Archive tool command: archive-tool

Exit codes:

- 0 - Success
- 1 - Archive root missing


check-archive-tool

Verify archive-tool is available and functional.

Usage:

healtharchive check-archive-tool

What it does:

- Runs archive-tool --help
- Validates command is available

Exit codes:

- 0 - archive-tool available
- 1 - archive-tool not found or failed


check-db

Test database connectivity.

Usage:

healtharchive check-db

Output:

Database connection successful

Exit codes:

- 0 - Database reachable
- 1 - Connection failed


Job Management Commands

create-job

Create a new archive job using source defaults.

Usage:

healtharchive create-job --source SOURCE_CODE [--override JSON]

Arguments:

- --source, -s (required) - Source code (hc, phac)
- --override (optional) - JSON string with config overrides

Examples:

# Create Health Canada job with defaults
healtharchive create-job --source hc

# Create with custom worker count
healtharchive create-job --source hc --override '{"tool_options": {"initial_workers": 2}}'

# Create a "search-first" crawl (skip optional .zim build) with a larger Docker /dev/shm
healtharchive create-job --source hc --override '{"tool_options": {"initial_workers": 2, "skip_final_build": true, "docker_shm_size": "1g"}}'

# Enable monitoring and stall detection
healtharchive create-job --source phac --override '{
  "tool_options": {
    "enable_monitoring": true,
    "stall_timeout_minutes": 60
  }
}'

Output:

Created job ID: 42
Name: hc-20260118
Output directory: /mnt/nasd/nobak/healtharchive/jobs/hc/20260118T210911Z__hc-20260118
Status: queued

Exit codes:

- 0 - Job created successfully
- 1 - Failed (invalid source, config validation error)
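When scripting job creation, it is safer to build the --override payload with a JSON serializer than to hand-quote it in the shell. A minimal Python sketch; the option names (tool_options, initial_workers, skip_final_build, docker_shm_size) are the ones shown in the examples above:

```python
import json

# Build the override dict in Python, then serialize it for --override.
# This avoids shell-quoting mistakes with nested quotes and booleans.
override = {
    "tool_options": {
        "initial_workers": 2,
        "skip_final_build": True,
        "docker_shm_size": "1g",
    }
}

payload = json.dumps(override)
print(payload)
# Pass the result as: healtharchive create-job --source hc --override "$payload"
```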


run-db-job

Execute a queued job by ID.

Usage:

healtharchive run-db-job --id JOB_ID [--no-index]

Arguments:

- --id (required) - Job ID to run
- --no-index (optional) - Leave a successful crawl in completed instead of indexing immediately

Example:

healtharchive run-db-job --id 42

What it does:

1. Validates job status is queued or retryable
2. Sets status to running
3. Executes archive-tool subprocess
4. Updates status to completed or failed
5. On success, indexes the job unless --no-index is used

Exit codes:

- 0 - Crawl and indexing succeeded, or crawl succeeded with --no-index
- 1 - Crawl failed or job invalid
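The lifecycle above can be sketched as a small state function. This is an illustration of the documented transitions, not the actual implementation:

```python
# Illustrative status transitions for run-db-job, following the numbered
# steps above. Status names mirror this page; the logic is a sketch.
RUNNABLE = {"queued", "retryable"}

def final_status(start: str, crawl_ok: bool, no_index: bool) -> str:
    if start not in RUNNABLE:
        raise ValueError(f"job not runnable from status {start!r}")
    if not crawl_ok:
        return "failed"
    # A successful crawl lands in "completed"; indexing then moves it
    # to "indexed" unless --no-index was passed.
    return "completed" if no_index else "indexed"
```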


reconcile-completed-indexing

Index completed jobs that were started outside the worker.

Usage:

healtharchive reconcile-completed-indexing [--source SOURCE_CODE] [--limit N]

What it does:

1. Finds jobs in status="completed"
2. Skips running/queued jobs
3. Runs the normal WARC indexing pipeline for each job
4. Leaves already indexed jobs untouched

This command is idempotent and is the preferred remediation when a watchdog or manual run-db-job path leaves a crawl finished but not searchable.


index-job

Index WARCs from a completed job into the database.

Usage:

healtharchive index-job --id JOB_ID

Arguments: - --id (required) - Job ID to index

Example:

healtharchive index-job --id 42

What it does:

1. Discovers WARC files in job output directory
2. Parses WARC records
3. Extracts text, title, snippet
4. Creates Snapshot rows
5. Sets job status to indexed

Output:

Indexing job 42...
Found 245 WARC files
Indexed 12,347 snapshots
Job status: indexed

Exit codes:

- 0 - Indexing succeeded
- 1 - Failed (no WARCs, parsing error)
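Step 1 of the pipeline (WARC discovery) amounts to a recursive file search under the job output directory. A minimal sketch, assuming the common *.warc.gz naming; the indexer's actual matching rules may be broader:

```python
from pathlib import Path

def discover_warcs(output_dir: str) -> list[Path]:
    """Recursively find WARC files under a job output directory.

    The *.warc.gz pattern is an assumption for illustration; the real
    discovery logic may also match other WARC naming variants.
    """
    return sorted(Path(output_dir).rglob("*.warc.gz"))
```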


register-job-dir

Attach an existing archive_tool output directory to a new database job.

Usage:

healtharchive register-job-dir --source SOURCE --output-dir PATH [--name NAME]

Arguments:

- --source (required) - Source code
- --output-dir (required) - Existing directory path
- --name (optional) - Job name (default: derived from directory)

Example:

healtharchive register-job-dir \
  --source hc \
  --output-dir /mnt/nasd/nobak/healtharchive/jobs/hc/20260101T120000Z__hc-20260101

Use case: Import externally-run crawls into the database.

Exit codes:

- 0 - Job registered
- 1 - Directory doesn't exist or validation failed


Direct Execution

run-job

Run archive-tool directly without database persistence.

Usage:

healtharchive run-job \
  --name NAME \
  --seeds URL [URL...] \
  [--initial-workers N] \
  [--output-dir DIR]

Arguments:

- --name (required) - Job name
- --seeds (required) - One or more seed URLs
- --initial-workers (optional) - Worker count (default: 1)
- --output-dir (optional) - Output directory (default: auto-generated)

Example:

healtharchive run-job \
  --name test-crawl \
  --seeds https://www.canada.ca/en/health-canada.html \
  --initial-workers 2

Use case: Quick testing without database overhead

Exit codes:

- 0 - Crawl succeeded
- Non-zero - archive-tool exit code


Inspection Commands

list-jobs

List recent jobs with summary information.

Usage:

healtharchive list-jobs [--limit N] [--status STATUS [STATUS ...]] [--source SOURCE]

Arguments:

- --limit (optional) - Number of jobs to show (default: 20)
- --status (optional) - Filter by one or more statuses
- --source (optional) - Filter by source code

Examples:

# List 20 most recent jobs
healtharchive list-jobs

# Show only failed jobs
healtharchive list-jobs --status failed

# Show Health Canada jobs
healtharchive list-jobs --source hc

# Show last 50 jobs
healtharchive list-jobs --limit 50

Output:

ID  Source  Status   Backend          Rescue           Retries  Created_at           Started_at           Finished_at          Indexed  Name
6   hc      running  playwright_warc  fallback-active  0        2026-01-01 00:05:02  2026-04-10 16:15:18  None                 0        hc-20260101
7   phac    failed   browsertrix      fresh-failed     1        2026-01-01 00:05:02  2026-04-03 01:50:12  2026-04-03 02:18:57  0        phac-20260101

The Backend and Rescue columns are intended to make annual rescue state visible from the standard operator path:

  • Backend shows the current effective backend inferred from job config and live crawl state.
  • Rescue shows a compact rescue summary such as normal, fresh-failed, fallback-active, fallback-retry, or fallback-exhausted.
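As an illustration only, a plausible mapping from job flags to these compact labels might look like the following. The label names are from this page, but the decision logic here is an assumption, not the actual implementation:

```python
def rescue_label(fresh_failures: int, fallback_active: bool,
                 fallback_retries: int, fallback_budget: int) -> str:
    """Hypothetical derivation of the compact Rescue column value."""
    if fallback_active:
        if fallback_retries >= fallback_budget:
            return "fallback-exhausted"
        return "fallback-retry" if fallback_retries > 0 else "fallback-active"
    if fresh_failures > 0:
        return "fresh-failed"
    return "normal"
```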

show-job

Display detailed information about a specific job.

Usage:

healtharchive show-job --id JOB_ID [--warc-details]

Arguments:

- --id (required) - Job ID
- --warc-details (optional) - Include detailed WARC discovery information

Examples:

# Human-readable output
healtharchive show-job --id 42

# Include WARC discovery details
healtharchive show-job --id 42 --warc-details

Output (text format):

ID:              6
Source:          hc (Health Canada)
Name:            hc-20260101
Status:          running
Retry count:     0
Created at:      2026-01-01 00:05:02.537667+00:00
Queued at:       2026-01-01 00:05:02.331347+00:00
Started at:      2026-04-10 16:15:18.050361+00:00
Finished at:     None
Output dir:      /srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101
Crawler RC:      None
Crawler status:  None
Crawler stage:   promoted_to_playwright_warc
WARC files:      0
WARC files (discovered): 300
Indexed pages:   0
Rescue:
  Primary backend:      browsertrix
  Configured backend:   playwright_warc
  Effective backend:    playwright_warc
  Fallback backend:     playwright_warc
  Resume policy:        fresh_only
  Fresh failure budget: 2
  Fallback active:      yes
  Promoted to fallback: yes
  Rescue note:          promoted from browsertrix to playwright_warc after fresh-failure budget exhaustion

The Rescue block is designed to answer the common annual-crawl operator questions without requiring immediate combined-log inspection:

  • which backend is primary for the job
  • which backend is configured now
  • which backend is effectively active
  • whether fallback promotion already happened
  • whether the job is still in a fresh Browsertrix failure state or has moved to a healthy fallback path

Maintenance Commands

retry-job

Retry a failed or index-failed job.

Usage:

healtharchive retry-job --id JOB_ID

Arguments: - --id (required) - Job ID to retry

Example:

healtharchive retry-job --id 42

What it does:

- If job status is failed: Sets to retryable (for re-crawl)
- If job status is index_failed: Sets to completed (for re-index)

Exit codes:

- 0 - Job marked for retry
- 1 - Job not in retryable state


reset-retry-count

Reset a crawl job's retry budget by setting retry_count to a lower value.

Safe-by-default: dry-run unless --apply is passed.

Usage:

healtharchive reset-retry-count --id JOB_ID [--apply] [--reason "note"]

Arguments:

- --id - One or more Job IDs to modify
- --apply - Persist changes (default: dry-run)
- --reason - Optional note printed in output (required for multi-job apply)
- --new-count - New value for retry_count (default: 0)
- --min-retry-count - Only match jobs with retry_count >= this (default: 1)

Examples:

# Dry-run (prints what would change)
healtharchive reset-retry-count --id 42

# Apply for one job
healtharchive reset-retry-count --id 42 --apply --reason "storage recovered; re-attempt crawl"

# Bulk mode (requires --source, --status, and --limit)
healtharchive reset-retry-count --source hc --status failed retryable --limit 25 --apply --reason "post-incident retry budget reset"

Safety guardrails:

- Skips jobs in running status.
- Skips jobs whose lock file appears held (job runner likely still active).
- Only supports statuses: queued, retryable, failed.
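The guardrails amount to a pre-filter over candidate jobs before any reset is applied. A sketch under assumed field names (status, lock_held, retry_count); the real checks live in the command itself:

```python
ALLOWED_STATUSES = {"queued", "retryable", "failed"}

def eligible_for_reset(job: dict, min_retry_count: int = 1) -> bool:
    """Mirror the documented guardrails; field names are illustrative."""
    if job.get("status") == "running":
        return False                      # never touch running jobs
    if job.get("lock_held"):
        return False                      # job runner likely still active
    if job.get("status") not in ALLOWED_STATUSES:
        return False
    return job.get("retry_count", 0) >= min_retry_count
```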


cleanup-job

Clean up temporary crawl artifacts.

Usage:

healtharchive cleanup-job --id JOB_ID [--mode MODE] [--force] [--dry-run]

Arguments:

- --id (required) - Job ID
- --mode (optional) - Cleanup mode (default: temp; supported: temp, temp-nonwarc)
- --force (optional) - Force cleanup even if replay is enabled
- --dry-run (optional) - Print the cleanup plan without changing files or the DB

Example:

# Safe cleanup for an indexed job (preserves WARCs / replayability)
healtharchive cleanup-job --id 42 --mode temp-nonwarc --dry-run
healtharchive cleanup-job --id 42 --mode temp-nonwarc

# Legacy destructive cleanup (use with caution)
healtharchive cleanup-job --id 42 --mode temp --force

What it does:

- temp-nonwarc:
  - consolidates WARCs into warcs/
  - preserves provenance under provenance/
  - rewrites Snapshot.warc_path away from .tmp* locations
  - removes .tmp* directories and the live .archive_state.json
  - updates job: cleanup_status = "temp_nonwarc_cleaned", cleaned_at = now
- temp:
  - removes .tmp* directories
  - removes .archive_state.json
  - updates job: cleanup_status = "temp_cleaned", cleaned_at = now

⚠️ Warning:

- temp-nonwarc is the preferred cleanup mode for terminal jobs because it preserves WARCs and replayability.
- temp deletes WARCs if they're in .tmp* directories. Only use it when you explicitly do not need replay retention.

Exit codes:

- 0 - Cleanup succeeded
- 1 - Failed (job not indexed, replay enabled without --force)


replay-index-job

Create/refresh pywb collection index for a job.

Usage:

healtharchive replay-index-job --id JOB_ID

Arguments: - --id (required) - Job ID

Example:

healtharchive replay-index-job --id 42

What it does:

- Creates pywb collection for job WARCs
- Generates CDX index for fast replay
- Enables browsing via pywb

Prerequisites:

- HEALTHARCHIVE_REPLAY_BASE_URL set
- pywb installed and configured

Exit codes:

- 0 - Index created
- 1 - Failed or replay not configured


Seeding

seed-sources

Initialize source records in the database.

Usage:

healtharchive seed-sources

What it does:

- Inserts Source rows for hc, phac, and cihr
- Idempotent (safe to run multiple times)

Example:

healtharchive seed-sources

Output:

Seeded source: hc (Health Canada)
Seeded source: phac (Public Health Agency of Canada)
Seeded source: cihr (Canadian Institutes of Health Research)

Exit codes: - 0 - Sources seeded or already exist
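The idempotency boils down to insert-if-missing keyed on the source code. A toy sketch over an in-memory mapping (the real command writes Source rows to the database):

```python
SOURCES = {
    "hc": "Health Canada",
    "phac": "Public Health Agency of Canada",
    "cihr": "Canadian Institutes of Health Research",
}

def seed_sources(existing: dict) -> list[str]:
    """Insert missing sources only; return the codes actually added."""
    added = []
    for code, name in SOURCES.items():
        if code not in existing:
            existing[code] = name
            added.append(code)
    return added
```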


Annual Campaign

schedule-annual

Plan or enqueue Jan 01 (UTC) annual campaign jobs for hc, phac, and cihr.

Usage:

healtharchive schedule-annual --year YEAR [--sources hc phac cihr] [--apply]

Examples:

# Show what would be created
healtharchive schedule-annual --year 2026

# Actually create jobs
healtharchive schedule-annual --year 2026 --apply

Notes:

- Dry-run by default
- Idempotent for annual campaign metadata/name matches
- Refuses to enqueue when a source already has an active non-indexed job

annual-status

Report annual campaign progress and search-readiness for a given year.

Usage:

healtharchive annual-status --year YEAR [--json] [--sources hc phac cihr]

Examples:

healtharchive annual-status --year 2026
healtharchive annual-status --year 2026 --json

Text output:

Annual campaign status — 2026-01-01 (Jan 01 UTC)
Ready for search: NO
Summary: total=3 indexed=1 in_progress=2 failed=0 missing=0 errors=0
Rescue states: fallback-active=1 fresh-failed=1 normal=1
Operator states: running-fallback=1 search-ready=1 waiting-fresh-retry=1

hc: job_id=6 status=running operator_state=running-fallback backend=playwright_warc rescue=fallback-active indexed_pages=0 retries=0 crawl_rc=None crawl_status=None name=hc-20260101
     note: promoted from browsertrix to playwright_warc after fresh-failure budget exhaustion
phac: job_id=7 status=retryable operator_state=waiting-fresh-retry backend=browsertrix rescue=fresh-failed indexed_pages=0 retries=1 crawl_rc=1 crawl_status=failed name=phac-20260101
     note: awaiting next fresh browsertrix retry within the configured rescue budget
cihr: job_id=8 status=indexed operator_state=search-ready backend=browsertrix rescue=normal indexed_pages=4123 retries=0 crawl_rc=0 crawl_status=success name=cihr-20260101

annual-status is the compact annual rescue summary surface:

  • backend shows the current effective backend for each annual job.
  • rescue shows the compact rescue status (normal, fresh-failed, fallback-active, fallback-retry, fallback-exhausted).
  • operator_state distinguishes active work from intentional waiting states, so retryable jobs in backoff do not read like terminal failures.
  • --json includes per-job rescue details plus summary-level rescueStates and operatorStates counts for downstream tooling.
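For downstream tooling, summary-level counts can be consumed directly from the --json output. A sketch against a hypothetical payload shape; only the rescueStates and operatorStates field names are documented here, so verify the actual structure before relying on it:

```python
import json

# Hypothetical payload shape for illustration; only the rescueStates /
# operatorStates field names come from this page.
raw = """{
  "summary": {"total": 3, "indexed": 1},
  "rescueStates": {"fallback-active": 1, "fresh-failed": 1, "normal": 1},
  "operatorStates": {"running-fallback": 1, "search-ready": 1, "waiting-fresh-retry": 1}
}"""

status = json.loads(raw)

# Count annual jobs in any non-normal rescue state.
needs_attention = sum(
    n for state, n in status["rescueStates"].items() if state != "normal"
)
print(needs_attention)
```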

reconcile-annual-tool-options

Reconcile existing annual jobs to source-specific crawl profiles.

Usage:

healtharchive reconcile-annual-tool-options --year YEAR [--sources ...] [--limit N] [--apply]

Examples:

# Dry-run reconciliation for all annual sources
healtharchive reconcile-annual-tool-options --year 2026

# Apply only to HC annual jobs
healtharchive reconcile-annual-tool-options --year 2026 --sources hc --apply

What it does:

- Reconciles legacy baseline tool options to per-source profiles
- Reconciles annual execution_policy defaults (for example HC/PHAC fresh_only resume policy and playwright_warc fallback settings)
- Reconciles canonical HC/PHAC scope filters on existing annual jobs
- Backfills canonical annual metadata on matching jobs: campaign_kind, campaign_year, campaign_date, campaign_date_utc, and scheduler_version
- Preserves explicit non-baseline overrides
- Enforces restart-budget floor and annual safety defaults

salvage-annual-edition

Attach existing annual jobs/WARCs to annual edition records as legacy full-site salvage shards.

Usage:

healtharchive salvage-annual-edition --year YEAR [--sources hc phac cihr] [--report]

What it does:

- Creates missing {source, year} annual edition rows
- Attaches matching annual ArchiveJob rows to those editions
- Marks attached jobs as legacy-full-site shards
- With --report, regenerates coverage/provenance artifacts

plan-annual-shards

Plan or create deterministic shard jobs for annual editions.

Usage:

healtharchive plan-annual-shards --year YEAR [--sources hc phac cihr] [--apply]

Dry-run output lists the shard keys and seed URLs. --apply creates queued ArchiveJob rows tied to the annual edition.

annual-edition-report

Generate or display a coverage/provenance report for one annual edition.

Usage:

healtharchive annual-edition-report --source SOURCE_CODE --year YEAR [--generate] [--json]
healtharchive annual-edition-report --id EDITION_ID [--generate] [--json]

The generated artifacts are:

  • target-ledger.jsonl
  • capture-manifest.jsonl
  • coverage-report.json
  • coverage-report.md

accept-annual-shard-gap

Mark a reviewed shard gap as accepted with an operator-supplied reason.

Usage:

healtharchive accept-annual-shard-gap --job-id JOB_ID --reason "documented reason"

Use this only after the retry budget has been exhausted and the remaining gap is acceptable for the edition’s research/provenance report.

probe-browser-fetch

Run one or more URLs through the pinned Playwright browser path used by the server-side playwright_warc fallback backend.

Usage:

healtharchive probe-browser-fetch URL [URL ...]

Examples:

healtharchive probe-browser-fetch https://www.canada.ca/en/public-health.html
healtharchive probe-browser-fetch \
  https://www.canada.ca/en/public-health.html \
  https://www.canada.ca/en/health-canada.html

What it does:

- Launches the same pinned Playwright Docker image used by the browser fallback
- Reports final URL, status code, cookie count, body source, and HTML byte size
- Helps operators confirm whether the server-side browser path works before rerunning a failed annual job

reset-crawl-state

Reset poisoned crawl temp/resume state for a non-running job while preserving stable WARCs.

Usage:

healtharchive reset-crawl-state --id JOB_ID [--apply]

Examples:

# Show what would be removed/preserved
healtharchive reset-crawl-state --id 7

# Consolidate temp WARCs, remove stale .tmp*/state/resume files
healtharchive reset-crawl-state --id 7 --apply

What it does:

- Refuses to run if the job is still running or its job lock is held
- Consolidates temp-dir WARCs into stable warcs/
- Removes stale .tmp* dirs
- Removes .archive_state.json
- Removes .zimit_resume.yaml
- Marks the job crawler_stage=state_reset

Use this when an annual job has accumulated poisoned resume state and should be forced back to a fresh crawl phase without losing already captured WARCs.


Worker

start-worker

Start the job processing worker loop.

Usage:

healtharchive start-worker [--poll-interval SECONDS] [--once]

Arguments:

- --poll-interval (optional) - Seconds between polls (default: 30)
- --once (optional) - Process one job then exit

Examples:

# Run continuously with 30s polling
healtharchive start-worker

# Poll every 60 seconds
healtharchive start-worker --poll-interval 60

# Process one job and exit (for testing)
healtharchive start-worker --once

What it does:

1. Polls for jobs with status queued or retryable
2. Runs oldest job first
3. Crawls → Indexes → Repeats
4. Sleeps if no jobs found

Exit: Press Ctrl+C to stop gracefully
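The loop is roughly the following; fetch_oldest and run_one_job are stand-ins for the real worker internals, shown here only to illustrate the poll/run/sleep cycle:

```python
import time

def worker_loop(fetch_oldest, run_one_job, poll_interval=30, once=False):
    """Schematic worker loop: oldest runnable job first, sleep when idle."""
    while True:
        job = fetch_oldest(statuses=("queued", "retryable"))
        if job is not None:
            run_one_job(job)              # crawl, then index
        elif not once:
            time.sleep(poll_interval)     # nothing to do; wait and re-poll
        if once:
            return
```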


Change Tracking

compute-changes

Compute change events between adjacent snapshots.

Usage:

healtharchive compute-changes [--limit N] [--source SOURCE]

Arguments:

- --limit (optional) - Max snapshot groups to process
- --source (optional) - Limit to specific source

Example:

# Compute changes for all snapshots
healtharchive compute-changes

# Process 100 page groups
healtharchive compute-changes --limit 100

# Only Health Canada changes
healtharchive compute-changes --source hc

What it does:

- Groups snapshots by normalized_url_group
- Compares adjacent captures (by timestamp)
- Generates SnapshotChange rows with diff metadata

Exit codes:

- 0 - Changes computed
- 1 - Error
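The grouping-and-pairing step can be sketched as follows; the field names (group, ts) are illustrative stand-ins for normalized_url_group and the capture timestamp:

```python
from itertools import groupby

def adjacent_pairs(snapshots):
    """Yield (older, newer) capture pairs within each URL group."""
    keyed = sorted(snapshots, key=lambda s: (s["group"], s["ts"]))
    for _, members in groupby(keyed, key=lambda s: s["group"]):
        members = list(members)
        # Compare each capture with the next one in time order.
        for older, newer in zip(members, members[1:]):
            yield older, newer
```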


Global Options

All commands support:

healtharchive COMMAND --help  # Show command help

Environment Variables

Commands respect these environment variables:

Variable                     Purpose                  Default
HEALTHARCHIVE_DATABASE_URL   Database connection      sqlite:///healtharchive.db
HEALTHARCHIVE_ARCHIVE_ROOT   Base directory for jobs  /mnt/nasd/nobak/healtharchive/jobs
HEALTHARCHIVE_TOOL_CMD       archive-tool command     archive-tool
HEALTHARCHIVE_LOG_LEVEL      Logging level            INFO

Set in .env file:

HEALTHARCHIVE_DATABASE_URL=postgresql://user:pass@localhost/healtharchive
HEALTHARCHIVE_ARCHIVE_ROOT=/data/healtharchive/jobs
HEALTHARCHIVE_LOG_LEVEL=DEBUG


Exit Codes

Standard exit codes:

- 0 - Success
- 1 - General error
- 2 - Command-line usage error


Scripting Examples

Process a job end-to-end

#!/bin/bash
set -e

# Create job
JOB_ID=$(healtharchive create-job --source hc | grep "Created job ID:" | awk '{print $4}')
echo "Created job $JOB_ID"

# Run crawl, leaving indexing to an explicit step below
healtharchive run-db-job --id $JOB_ID --no-index

# Index WARCs
healtharchive index-job --id $JOB_ID

# Clean up (temp-nonwarc preserves WARCs and replayability)
healtharchive cleanup-job --id $JOB_ID --mode temp-nonwarc

echo "Job $JOB_ID complete"

Monitor worker

#!/bin/bash

while true; do
  clear
  echo "=== Job Status ==="
  healtharchive list-jobs --limit 10
  sleep 10
done

Retry all failed jobs

#!/bin/bash

# Assumes list-jobs supports --format json; that flag is not listed in the
# list-jobs reference above, so verify it on your build first.
healtharchive list-jobs --status failed --limit 100 --format json | \
  jq -r '.[].id' | \
  while read job_id; do
    echo "Retrying job $job_id"
    healtharchive retry-job --id $job_id
  done