CLI Commands Reference
Complete reference for the healtharchive command-line interface.
Installation
The healtharchive command is installed when you install the package:
Verify installation:
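The exact verification output depends on the environment; a minimal check is to confirm the entry point resolves on PATH (a generic shell sketch, not tied to a specific installer):

```shell
# Print where the healtharchive entry point resolves, or warn if it is missing.
if command -v healtharchive > /dev/null 2>&1; then
    echo "healtharchive installed at: $(command -v healtharchive)"
else
    echo "healtharchive not found on PATH" >&2
fi
```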
Command Categories
| Category | Commands |
|---|---|
| Environment | check-env, check-archive-tool, check-db |
| Job Management | create-job, run-db-job, index-job, reconcile-completed-indexing, register-job-dir |
| Direct Execution | run-job |
| Inspection | list-jobs, show-job |
| Maintenance | retry-job, reset-retry-count, cleanup-job, reset-crawl-state, replay-index-job |
| Annual Campaign | schedule-annual, annual-status, salvage-annual-edition, plan-annual-shards, annual-edition-report, accept-annual-shard-gap, reconcile-annual-tool-options, probe-browser-fetch |
| Seeding | seed-sources |
| Worker | start-worker |
| Change Tracking | compute-changes |
Environment Commands
check-env
Check environment configuration and ensure archive root exists.
Usage:
Output:
Archive root: /mnt/nasd/nobak/healtharchive/jobs
Archive root exists: True
Archive tool command: archive-tool
Exit codes:
- 0 - Success
- 1 - Archive root missing
check-archive-tool
Verify archive-tool is available and functional.
Usage:
What it does:
- Runs archive-tool --help
- Validates command is available

Exit codes:
- 0 - archive-tool available
- 1 - archive-tool not found or failed
check-db
Test database connectivity.
Usage:
Output:
Exit codes:
- 0 - Database reachable
- 1 - Connection failed
Job Management Commands
create-job
Create a new archive job using source defaults.
Usage:
Arguments:
- --source, -s (required) - Source code (hc, phac)
- --override (optional) - JSON string with config overrides
Examples:
# Create Health Canada job with defaults
healtharchive create-job --source hc
# Create with custom worker count
healtharchive create-job --source hc --override '{"tool_options": {"initial_workers": 2}}'
# Create a "search-first" crawl (skip optional .zim build) with a larger Docker /dev/shm
healtharchive create-job --source hc --override '{"tool_options": {"initial_workers": 2, "skip_final_build": true, "docker_shm_size": "1g"}}'
# Enable monitoring and stall detection
healtharchive create-job --source phac --override '{
"tool_options": {
"enable_monitoring": true,
"stall_timeout_minutes": 60
}
}'
Output:
Created job ID: 42
Name: hc-20260118
Output directory: /mnt/nasd/nobak/healtharchive/jobs/hc/20260118T210911Z__hc-20260118
Status: queued
Exit codes:
- 0 - Job created successfully
- 1 - Failed (invalid source, config validation error)
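Because --override is parsed as JSON, a malformed string fails at job-creation time with a config validation error. One way to pre-validate an override before passing it along is Python's standard json.tool module (a sketch; the pre-validation step is not part of the CLI itself):

```shell
# Validate the override JSON locally before handing it to create-job.
OVERRIDE='{"tool_options": {"initial_workers": 2, "skip_final_build": true}}'
if echo "$OVERRIDE" | python3 -m json.tool > /dev/null; then
    healtharchive create-job --source hc --override "$OVERRIDE"
else
    echo "invalid override JSON" >&2
fi
```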
run-db-job
Execute a queued job by ID.
Usage:
Arguments:
- --id (required) - Job ID to run
- --no-index (optional) - Leave a successful crawl in completed instead of indexing immediately
Example:
What it does:
1. Validates job status is queued or retryable
2. Sets status to running
3. Executes archive-tool subprocess
4. Updates status to completed or failed
5. On success, indexes the job unless --no-index is used

Exit codes:
- 0 - Crawl and indexing succeeded, or crawl succeeded with --no-index
- 1 - Crawl failed or job invalid
reconcile-completed-indexing
Index completed jobs that were started outside the worker.
Usage:
What it does:
1. Finds jobs in status="completed"
2. Skips running/queued jobs
3. Runs the normal WARC indexing pipeline for each job
4. Leaves already indexed jobs untouched
This command is idempotent and is the preferred remediation when a watchdog or manual run-db-job path leaves a crawl finished but not searchable.
index-job
Index WARCs from a completed job into the database.
Usage:
Arguments:
- --id (required) - Job ID to index
Example:
What it does:
1. Discovers WARC files in job output directory
2. Parses WARC records
3. Extracts text, title, snippet
4. Creates Snapshot rows
5. Sets job status to indexed
Output:
Exit codes:
- 0 - Indexing succeeded
- 1 - Failed (no WARCs, parsing error)
register-job-dir
Attach an existing archive_tool output directory to a new database job.
Usage:
Arguments:
- --source (required) - Source code
- --output-dir (required) - Existing directory path
- --name (optional) - Job name (default: derived from directory)
Example:
healtharchive register-job-dir \
--source hc \
--output-dir /mnt/nasd/nobak/healtharchive/jobs/hc/20260101T120000Z__hc-20260101
Use case: Import externally-run crawls into database
Exit codes:
- 0 - Job registered
- 1 - Directory doesn't exist or validation failed
Direct Execution
run-job
Run archive-tool directly without database persistence.
Usage:
healtharchive run-job \
--name NAME \
--seeds URL [URL...] \
[--initial-workers N] \
[--output-dir DIR]
Arguments:
- --name (required) - Job name
- --seeds (required) - One or more seed URLs
- --initial-workers (optional) - Worker count (default: 1)
- --output-dir (optional) - Output directory (default: auto-generated)
Example:
healtharchive run-job \
--name test-crawl \
--seeds https://www.canada.ca/en/health-canada.html \
--initial-workers 2
Use case: Quick testing without database overhead
Exit codes:
- 0 - Crawl succeeded
- Non-zero - archive-tool exit code
Inspection Commands
list-jobs
List recent jobs with summary information.
Usage:
Arguments:
- --limit (optional) - Number of jobs to show (default: 20)
- --status (optional) - Filter by one or more statuses
- --source (optional) - Filter by source code
- --format (optional) - Output format (json is used in the scripting examples below)
Examples:
# List 20 most recent jobs
healtharchive list-jobs
# Show only failed jobs
healtharchive list-jobs --status failed
# Show Health Canada jobs
healtharchive list-jobs --source hc
# Show last 50 jobs
healtharchive list-jobs --limit 50
Output:
ID Source Status Backend Rescue Retries Created_at Started_at Finished_at Indexed Name
6 hc running playwright_warc fallback-active 0 2026-01-01 00:05:02 2026-04-10 16:15:18 None 0 hc-20260101
7 phac failed browsertrix fresh-failed 1 2026-01-01 00:05:02 2026-04-03 01:50:12 2026-04-03 02:18:57 0 phac-20260101
The Backend and Rescue columns are intended to make annual rescue state visible from the standard operator path:
- Backend shows the current effective backend inferred from job config and live crawl state.
- Rescue shows a compact rescue summary such as normal, fresh-failed, fallback-active, fallback-retry, or fallback-exhausted.
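For scripting, the Rescue column can be filtered straight out of the tabular output. A minimal sketch, assuming the column layout shown above (Rescue is the fifth whitespace-separated field):

```shell
# Print the IDs of jobs whose rescue state is fallback-active.
# Assumes the default tabular list-jobs layout where Rescue is field 5.
healtharchive list-jobs --limit 100 | awk '$5 == "fallback-active" { print $1 }'
```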
show-job
Display detailed information about a specific job.
Usage:
Arguments:
- --id (required) - Job ID
- --warc-details (optional) - Include detailed WARC discovery information
Examples:
# Human-readable output
healtharchive show-job --id 42
# Include WARC discovery details
healtharchive show-job --id 42 --warc-details
Output (text format):
ID: 6
Source: hc (Health Canada)
Name: hc-20260101
Status: running
Retry count: 0
Created at: 2026-01-01 00:05:02.537667+00:00
Queued at: 2026-01-01 00:05:02.331347+00:00
Started at: 2026-04-10 16:15:18.050361+00:00
Finished at: None
Output dir: /srv/healtharchive/jobs/hc/20260101T000502Z__hc-20260101
Crawler RC: None
Crawler status: None
Crawler stage: promoted_to_playwright_warc
WARC files: 0
WARC files (discovered): 300
Indexed pages: 0
Rescue:
Primary backend: browsertrix
Configured backend: playwright_warc
Effective backend: playwright_warc
Fallback backend: playwright_warc
Resume policy: fresh_only
Fresh failure budget: 2
Fallback active: yes
Promoted to fallback: yes
Rescue note: promoted from browsertrix to playwright_warc after fresh-failure budget exhaustion
The Rescue block is designed to answer the common annual-crawl operator questions without requiring immediate combined-log inspection:
- which backend is primary for the job
- which backend is configured now
- which backend is effectively active
- whether fallback promotion already happened
- whether the job is still in a fresh Browsertrix failure state or has moved to a healthy fallback path
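In scripts, the fallback-promotion question can be answered with a plain grep against the text output. A hedged sketch, assuming the "Fallback active:" line appears as in the sample output above:

```shell
# Exit 0 and print a note if job 6 has been promoted to its fallback backend.
if healtharchive show-job --id 6 | grep -q 'Fallback active: yes'; then
    echo "job 6 is running on the fallback backend"
fi
```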
Maintenance Commands
retry-job
Retry a failed or index-failed job.
Usage:
Arguments:
- --id (required) - Job ID to retry
Example:
What it does:
- If job status is failed: Sets to retryable (for re-crawl)
- If job status is index_failed: Sets to completed (for re-index)

Exit codes:
- 0 - Job marked for retry
- 1 - Job not in retryable state
reset-retry-count
Reset a crawl job's retry budget by setting retry_count to a lower value.
Safe-by-default: dry-run unless --apply is passed.
Usage:
Arguments:
- --id - One or more Job IDs to modify
- --apply - Persist changes (default: dry-run)
- --reason - Optional note printed in output (required for multi-job apply)
- --new-count - New value for retry_count (default: 0)
- --min-retry-count - Only match jobs with retry_count >= this (default: 1)
- --source, --status, --limit - Bulk-mode selectors (all three required for bulk mode, as in the example below)
Examples:
# Dry-run (prints what would change)
healtharchive reset-retry-count --id 42
# Apply for one job
healtharchive reset-retry-count --id 42 --apply --reason "storage recovered; re-attempt crawl"
# Bulk mode (requires --source, --status, and --limit)
healtharchive reset-retry-count --source hc --status failed retryable --limit 25 --apply --reason "post-incident retry budget reset"
Safety guardrails:
- Skips jobs in running status.
- Skips jobs whose lock file appears held (job runner likely still active).
- Only supports statuses: queued, retryable, failed.
cleanup-job
Clean up temporary crawl artifacts.
Usage:
Arguments:
- --id (required) - Job ID
- --mode (optional) - Cleanup mode (default: temp; supported: temp, temp-nonwarc)
- --force (optional) - Force cleanup even if replay is enabled
- --dry-run (optional) - Print the cleanup plan without changing files or the DB
Example:
# Safe cleanup for an indexed job (preserves WARCs / replayability)
healtharchive cleanup-job --id 42 --mode temp-nonwarc --dry-run
healtharchive cleanup-job --id 42 --mode temp-nonwarc
# Legacy destructive cleanup (use with caution)
healtharchive cleanup-job --id 42 --mode temp --force
What it does:
- temp-nonwarc:
  - consolidates WARCs into warcs/
  - preserves provenance under provenance/
  - rewrites Snapshot.warc_path away from .tmp* locations
  - removes .tmp* directories and the live .archive_state.json
  - updates job: cleanup_status = "temp_nonwarc_cleaned", cleaned_at = now
- temp:
  - removes .tmp* directories
  - removes .archive_state.json
  - updates job: cleanup_status = "temp_cleaned", cleaned_at = now

⚠️ Warning:
- temp-nonwarc is the preferred cleanup mode for terminal jobs because it preserves WARCs and replayability.
- temp deletes WARCs if they're in .tmp* directories. Only use it when you explicitly do not need replay retention.

Exit codes:
- 0 - Cleanup succeeded
- 1 - Failed (job not indexed, replay enabled without --force)
replay-index-job
Create/refresh pywb collection index for a job.
Usage:
Arguments:
- --id (required) - Job ID
Example:
What it does:
- Creates pywb collection for job WARCs
- Generates CDX index for fast replay
- Enables browsing via pywb

Prerequisites:
- HEALTHARCHIVE_REPLAY_BASE_URL set
- pywb installed and configured

Exit codes:
- 0 - Index created
- 1 - Failed or replay not configured
Seeding
seed-sources
Initialize source records in the database.
Usage:
What it does:
- Inserts Source rows for hc, phac, and cihr
- Idempotent (safe to run multiple times)
Example:
Output:
Seeded source: hc (Health Canada)
Seeded source: phac (Public Health Agency of Canada)
Seeded source: cihr (Canadian Institutes of Health Research)
Exit codes:
- 0 - Sources seeded or already exist
Annual Campaign
schedule-annual
Plan or enqueue Jan 01 (UTC) annual campaign jobs for hc, phac, and cihr.
Usage:
Examples:
# Show what would be created
healtharchive schedule-annual --year 2026
# Actually create jobs
healtharchive schedule-annual --year 2026 --apply
Notes:
- Dry-run by default
- Idempotent for annual campaign metadata/name matches
- Refuses to enqueue when a source already has an active non-indexed job
annual-status
Report annual campaign progress and search-readiness for a given year.
Usage:
Examples:
Text output:
Annual campaign status — 2026-01-01 (Jan 01 UTC)
Ready for search: NO
Summary: total=3 indexed=1 in_progress=2 failed=0 missing=0 errors=0
Rescue states: fallback-active=1 fresh-failed=1 normal=1
Operator states: running-fallback=1 search-ready=1 waiting-fresh-retry=1
hc: job_id=6 status=running operator_state=running-fallback backend=playwright_warc rescue=fallback-active indexed_pages=0 retries=0 crawl_rc=None crawl_status=None name=hc-20260101
note: promoted from browsertrix to playwright_warc after fresh-failure budget exhaustion
phac: job_id=7 status=retryable operator_state=waiting-fresh-retry backend=browsertrix rescue=fresh-failed indexed_pages=0 retries=1 crawl_rc=1 crawl_status=failed name=phac-20260101
note: awaiting next fresh browsertrix retry within the configured rescue budget
cihr: job_id=8 status=indexed operator_state=search-ready backend=browsertrix rescue=normal indexed_pages=4123 retries=0 crawl_rc=0 crawl_status=success name=cihr-20260101
annual-status is now the compact annual rescue summary surface:
- backend shows the current effective backend for each annual job.
- rescue shows the compact rescue status (normal, fresh-failed, fallback-active, fallback-retry, fallback-exhausted).
- operator_state distinguishes active work from intentional waiting states, so retryable jobs in backoff do not read like terminal failures.
- --json includes per-job rescue details plus summary-level rescueStates and operatorStates counts for downstream tooling.
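The text output is also easy to summarize in cron-driven monitoring. A sketch that tallies operator states with standard tools, assuming the operator_state=... key/value layout shown above:

```shell
# Count annual jobs per operator state (e.g. running-fallback, search-ready).
healtharchive annual-status --year 2026 \
  | grep -o 'operator_state=[a-z-]*' \
  | sed 's/^operator_state=//' \
  | sort | uniq -c
```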
reconcile-annual-tool-options
Reconcile existing annual jobs to source-specific crawl profiles.
Usage:
Examples:
# Dry-run reconciliation for all annual sources
healtharchive reconcile-annual-tool-options --year 2026
# Apply only to HC annual jobs
healtharchive reconcile-annual-tool-options --year 2026 --sources hc --apply
What it does:
- Reconciles legacy baseline tool options to per-source profiles
- Reconciles annual execution_policy defaults (for example HC/PHAC fresh_only resume policy and playwright_warc fallback settings)
- Reconciles canonical HC/PHAC scope filters on existing annual jobs
- Backfills canonical annual metadata on matching jobs: campaign_kind, campaign_year, campaign_date, campaign_date_utc, and scheduler_version
- Preserves explicit non-baseline overrides
- Enforces restart-budget floor and annual safety defaults
salvage-annual-edition
Attach existing annual jobs/WARCs to annual edition records as legacy full-site salvage shards.
Usage:
What it does:
- Creates missing {source, year} annual edition rows
- Attaches matching annual ArchiveJob rows to those editions
- Marks attached jobs as legacy-full-site shards
- With --report, regenerates coverage/provenance artifacts
plan-annual-shards
Plan or create deterministic shard jobs for annual editions.
Usage:
Dry-run output lists the shard keys and seed URLs. --apply creates queued ArchiveJob rows tied to the annual edition.
annual-edition-report
Generate or display a coverage/provenance report for one annual edition.
Usage:
healtharchive annual-edition-report --source SOURCE_CODE --year YEAR [--generate] [--json]
healtharchive annual-edition-report --id EDITION_ID [--generate] [--json]
The generated artifacts are:
- target-ledger.jsonl
- capture-manifest.jsonl
- coverage-report.json
- coverage-report.md
accept-annual-shard-gap
Mark a reviewed shard gap as accepted with an operator-supplied reason.
Usage:
Use this only after the retry budget has been exhausted and the remaining gap is acceptable for the edition’s research/provenance report.
probe-browser-fetch
Run one or more URLs through the pinned Playwright browser path used by the server-side playwright_warc fallback backend.
Usage:
Examples:
healtharchive probe-browser-fetch https://www.canada.ca/en/public-health.html
healtharchive probe-browser-fetch \
https://www.canada.ca/en/public-health.html \
https://www.canada.ca/en/health-canada.html
What it does:
- Launches the same pinned Playwright Docker image used by the browser fallback
- Reports final URL, status code, cookie count, body source, and HTML byte size
- Helps operators confirm whether the server-side browser path works before rerunning a failed annual job
reset-crawl-state
Reset poisoned crawl temp/resume state for a non-running job while preserving stable WARCs.
Usage:
Examples:
# Show what would be removed/preserved
healtharchive reset-crawl-state --id 7
# Consolidate temp WARCs, remove stale .tmp*/state/resume files
healtharchive reset-crawl-state --id 7 --apply
What it does:
- Refuses to run if the job is still running or its job lock is held
- Consolidates temp-dir WARCs into stable warcs/
- Removes stale .tmp* dirs
- Removes .archive_state.json
- Removes .zimit_resume.yaml
- Marks the job crawler_stage=state_reset
Use this when an annual job has accumulated poisoned resume state and should be forced back to a fresh crawl phase without losing already captured WARCs.
Worker
start-worker
Start the job processing worker loop.
Usage:
Arguments:
- --poll-interval (optional) - Seconds between polls (default: 30)
- --once (optional) - Process one job then exit
Examples:
# Run continuously with 30s polling
healtharchive start-worker
# Poll every 60 seconds
healtharchive start-worker --poll-interval 60
# Process one job and exit (for testing)
healtharchive start-worker --once
What it does:
1. Polls for jobs with status queued or retryable
2. Runs oldest job first
3. Crawls → Indexes → Repeats
4. Sleeps if no jobs found
Exit: Press Ctrl+C to stop gracefully
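For unattended operation, the worker loop is typically run under a process supervisor. A minimal systemd unit sketch; the install path, service account, and EnvironmentFile location are assumptions, not part of the documented CLI:

```ini
[Unit]
Description=HealthArchive job worker
After=network-online.target

[Service]
# Assumed deployment details; adjust user, paths, and env file to your setup.
User=healtharchive
EnvironmentFile=/etc/healtharchive/.env
ExecStart=/usr/local/bin/healtharchive start-worker --poll-interval 30
# The worker exits gracefully on SIGINT/SIGTERM; restart it if it crashes.
Restart=on-failure

[Install]
WantedBy=multi-user.target
```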
Change Tracking
compute-changes
Compute change events between adjacent snapshots.
Usage:
Arguments:
- --limit (optional) - Max snapshot groups to process
- --source (optional) - Limit to specific source
Example:
# Compute changes for all snapshots
healtharchive compute-changes
# Process 100 page groups
healtharchive compute-changes --limit 100
# Only Health Canada changes
healtharchive compute-changes --source hc
What it does:
- Groups snapshots by normalized_url_group
- Compares adjacent captures (by timestamp)
- Generates SnapshotChange rows with diff metadata

Exit codes:
- 0 - Changes computed
- 1 - Error
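Change computation lends itself to periodic scheduling. A hedged crontab sketch; the schedule and log path are assumptions:

```
# Recompute change events nightly at 02:30; append output to a log.
30 2 * * * healtharchive compute-changes >> /var/log/healtharchive/compute-changes.log 2>&1
```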
Global Options
All commands support:
Environment Variables
Commands respect these environment variables:
| Variable | Purpose | Default |
|---|---|---|
| HEALTHARCHIVE_DATABASE_URL | Database connection | sqlite:///healtharchive.db |
| HEALTHARCHIVE_ARCHIVE_ROOT | Base directory for jobs | /mnt/nasd/nobak/healtharchive/jobs |
| HEALTHARCHIVE_TOOL_CMD | archive-tool command | archive-tool |
| HEALTHARCHIVE_LOG_LEVEL | Logging level | INFO |
Set in .env file:
HEALTHARCHIVE_DATABASE_URL=postgresql://user:pass@localhost/healtharchive
HEALTHARCHIVE_ARCHIVE_ROOT=/data/healtharchive/jobs
HEALTHARCHIVE_LOG_LEVEL=DEBUG
Exit Codes
Standard exit codes:
- 0 - Success
- 1 - General error
- 2 - Command-line usage error
Scripting Examples
Process a job end-to-end
#!/bin/bash
set -e
# Create job
JOB_ID=$(healtharchive create-job --source hc | grep "Created job ID:" | awk '{print $4}')
echo "Created job $JOB_ID"
# Run crawl only (run-db-job auto-indexes on success unless --no-index is passed)
healtharchive run-db-job --id $JOB_ID --no-index
# Index WARCs
healtharchive index-job --id $JOB_ID
# Clean up (temp-nonwarc preserves WARCs and replayability)
healtharchive cleanup-job --id $JOB_ID --mode temp-nonwarc
echo "Job $JOB_ID complete"
Monitor worker
#!/bin/bash
while true; do
clear
echo "=== Job Status ==="
healtharchive list-jobs --limit 10
sleep 10
done
Retry all failed jobs
#!/bin/bash
healtharchive list-jobs --status failed --limit 100 --format json | \
jq -r '.[].id' | \
while read job_id; do
echo "Retrying job $job_id"
healtharchive retry-job --id $job_id
done
Related Documentation
- Architecture Guide: ../architecture.md
- Job Registry: ../architecture.md#4-job-registry--creation
- Worker Loop: ../architecture.md#9-worker-loop
- Data Model: data-model.md
- Live Testing: ../development/live-testing.md