Archive Tool Reference

archive_tool is HealthArchive's internal crawler and orchestrator subpackage.


Quick Overview

archive_tool is a Docker-based web crawler that:

  • Wraps the zimit crawler (from OpenZIM)
  • Manages crawl state and resumption
  • Monitors crawl health (stall detection, error thresholds)
  • Supports adaptive worker scaling
  • Optionally rotates VPN connections

Location: src/archive_tool/

Technology:

  • Python 3.11+
  • Docker (runs the ghcr.io/openzim/zimit container)
  • State persistence (.archive_state.json)

Environment Variables:

  • HEALTHARCHIVE_ZIMIT_DOCKER_IMAGE: Override Docker image (default: ghcr.io/openzim/zimit)
  • HEALTHARCHIVE_DOCKER_MEMORY_LIMIT: Container memory limit (default: 4g)
  • HEALTHARCHIVE_DOCKER_CPU_LIMIT: Container CPU limit (default: 1.5)
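
A minimal sketch of how these overrides might be consumed, using the documented defaults (illustrative only, not the tool's actual code):

import os

# Hypothetical resolution of the documented overrides and defaults.
ZIMIT_IMAGE = os.environ.get("HEALTHARCHIVE_ZIMIT_DOCKER_IMAGE", "ghcr.io/openzim/zimit")
MEMORY_LIMIT = os.environ.get("HEALTHARCHIVE_DOCKER_MEMORY_LIMIT", "4g")
CPU_LIMIT = float(os.environ.get("HEALTHARCHIVE_DOCKER_CPU_LIMIT", "1.5"))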

Note: the HealthArchive backend indexes WARCs into Snapshot rows; it does not read .zim files. ZIM output is an optional artifact and can be skipped with --skip-final-build.


Architecture

┌─────────────────────────────────────────────────┐
│ HealthArchive Backend                           │
│                                                 │
│  ha_backend.jobs.run_persistent_job()           │
│         │                                       │
│         ├──> Builds CLI args from job config   │
│         │                                       │
│         └──> subprocess.run()                   │
│                     │                           │
└─────────────────────┼───────────────────────────┘
         ┌────────────────────────────┐
         │   archive-tool CLI         │
         │   (archive_tool/cli.py)    │
         └────────────┬───────────────┘
                      ├──> Validates Docker
                      ├──> Determines run mode
                      ├──> Spawns zimit in Docker
                      ├──> Monitors progress
                      ├──> Writes WARCs to .tmp_N/
                      └──> Builds ZIM (optional)

Canonical Documentation

Full technical reference: src/archive_tool/docs/documentation.md

1,508 lines covering:

  • CLI interface and all flags
  • Run modes (Fresh, Resume, New-with-Consolidation, Overwrite)
  • State management (.archive_state.json)
  • Docker orchestration details
  • Monitoring and adaptive workers
  • VPN rotation mechanism
  • WARC discovery and consolidation
  • Error handling and recovery
  • Testing and development

Read the full docs for:

  • Detailed CLI flag reference
  • State machine diagrams
  • Docker volume mapping
  • Log parsing internals
  • Adding new features to archive_tool


Quick Reference

CLI Usage

archive-tool \
  --name CRAWL_NAME \
  --output-dir /path/to/output \
  --initial-workers N \
  [--enable-monitoring] \
  [--enable-adaptive-workers] \
  [--enable-vpn-rotation --vpn-connect-command "..."] \
  SEED_URL [SEED_URL...]

Common Flags

Flag                         Purpose
--name                       Crawl name (used in output naming; ZIM is optional)
--output-dir                 Output directory path
--initial-workers            Number of parallel workers (default: 1)
--enable-monitoring          Enable stall/error detection
--stall-timeout-minutes      Abort if no progress (requires monitoring)
--enable-adaptive-workers    Reduce workers on errors (requires monitoring)
--enable-vpn-rotation        Rotate VPN on stalls (requires monitoring)
--docker-shm-size            Increase container /dev/shm (can improve stability)
--skip-final-build           Skip the final .zim build stage (WARCs still produced)
--cleanup                    Delete temp dirs after successful crawl
--overwrite                  Delete existing output before starting

Run Modes

archive-tool automatically determines the run mode based on state:

  1. Fresh - No prior state, start new crawl
  2. Resume - State exists and incomplete, resume from checkpoint
  3. New-with-Consolidation - State complete, start new crawl but consolidate WARCs
  4. Overwrite - --overwrite flag set, delete everything and start fresh

See: src/archive_tool/docs/documentation.md (Run Modes) for decision tree
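
A minimal sketch of that decision tree, assuming the state file exposes a simple "previous crawl complete" flag (names are illustrative, not the actual implementation):

def determine_run_mode(overwrite: bool, state_exists: bool, state_complete: bool) -> str:
    """Illustrative run-mode selection mirroring the four modes above."""
    if overwrite:
        return "overwrite"               # delete everything, start fresh
    if not state_exists:
        return "fresh"                   # no prior state
    if not state_complete:
        return "resume"                  # resume from checkpoint
    return "new-with-consolidation"      # new crawl, consolidate prior WARCs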

Output Structure

output_dir/
├── .archive_state.json              # Persistent state
├── .tmp_1/                          # First crawl attempt
│   └── collections/
│       └── crawl-YYYYMMDD.../
│           ├── archive/             # WARCs here
│           │   ├── rec-00000-....warc.gz
│           │   └── rec-00001-....warc.gz
│           └── logs/
├── .tmp_2/                          # Second attempt (if restarted)
├── archive_STAGE_TIMESTAMP.log      # Individual stage logs
├── archive_STAGE_TIMESTAMP.combined.log  # Aggregated logs
└── zim/
    └── NAME_DATE.zim                # Optional ZIM file

State File Format

.archive_state.json:

{
  "current_workers": 4,
  "initial_workers": 4,
  "temp_dirs_host_paths": ["/some/output/.tmp123", "..."],
  "vpn_rotations_done": 1,
  "worker_reductions_done": 1,
  "container_restarts_done": 1
}
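
A minimal sketch of reading and updating this file with the standard library (the tool itself manages it through archive_tool.state.CrawlState; field names follow the example above):

import json
from pathlib import Path

state_path = Path("output_dir") / ".archive_state.json"
state = json.loads(state_path.read_text())

# Example update: record one more worker reduction.
state["worker_reductions_done"] += 1
state["current_workers"] = max(1, state["current_workers"] - 1)

state_path.write_text(json.dumps(state, indent=2))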


Backend Integration

The backend calls archive-tool via subprocess. Key files:

Job Execution

ha_backend/jobs.py:run_persistent_job() (lines 439-560):

  • Loads ArchiveJob.config from database
  • Translates tool_options to CLI flags
  • Builds command: archive-tool --flag1 val1 --flag2 val2 ... SEEDS
  • Executes with subprocess.run()
  • Updates job status based on exit code

Config → CLI Mapping:

config["tool_options"]["enable_monitoring"]  --enable-monitoring
config["tool_options"]["initial_workers"]  --initial-workers N
config["tool_options"]["stall_timeout_minutes"]  --stall-timeout-minutes N

WARC Discovery

ha_backend/indexing/warc_discovery.py:

  • Uses archive_tool.state.CrawlState to load .archive_state.json
  • Uses archive_tool.utils.find_all_warc_files() to locate WARCs
  • Ensures backend and archive-tool use identical logic
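
The real discovery path goes through those archive_tool helpers; as a purely illustrative stand-in, the same idea expressed with only the standard library:

import json
from pathlib import Path

def discover_warcs(output_dir: str) -> list[Path]:
    """Find every .warc.gz under the temp dirs recorded in the state file."""
    state = json.loads((Path(output_dir) / ".archive_state.json").read_text())
    warcs: list[Path] = []
    for temp_dir in state.get("temp_dirs_host_paths", []):
        warcs.extend(sorted(Path(temp_dir).rglob("*.warc.gz")))
    return warcs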

Cleanup

ha_backend/cli/cmd_cleanup_job.py:

  • Uses archive_tool.utils.cleanup_temp_dirs() to remove .tmp* directories
  • Deletes .archive_state.json
  • Updates ArchiveJob.cleanup_status
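
A rough stdlib equivalent of the filesystem side of cleanup (illustrative; the real command uses archive_tool.utils.cleanup_temp_dirs() and also updates the database):

import shutil
from pathlib import Path

def cleanup_output_dir(output_dir: str) -> None:
    """Remove .tmp* crawl dirs and the state file (illustrative sketch)."""
    out = Path(output_dir)
    for temp_dir in out.glob(".tmp*"):
        shutil.rmtree(temp_dir, ignore_errors=True)
    (out / ".archive_state.json").unlink(missing_ok=True)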


Monitoring Features

Stall Detection

When --enable-monitoring is set:

  • Monitors log output every --monitor-interval-seconds (default: 30)
  • Parses "Crawl statistics" JSON from logs
  • Detects stalls: no new pages for --stall-timeout-minutes
  • Action: abort the crawl with a non-zero exit code
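
A minimal sketch of the stall check, assuming each monitoring pass yields a cumulative page count parsed from those statistics lines (illustrative, not the actual monitor):

import time

def watch_for_stall(get_pages_crawled, stall_timeout_minutes: int,
                    interval_seconds: int = 30) -> None:
    """Abort once the page count stops increasing for the stall window."""
    last_count = -1
    last_progress = time.monotonic()
    while True:
        count = get_pages_crawled()  # parsed from crawler log output
        if count > last_count:
            last_count, last_progress = count, time.monotonic()
        elif time.monotonic() - last_progress > stall_timeout_minutes * 60:
            raise SystemExit("Stall detected: no new pages crawled")
        time.sleep(interval_seconds)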

Error Thresholds

  • --error-threshold-timeout N: Abort if N timeout errors
  • --error-threshold-http N: Abort if N HTTP errors
  • Prevents runaway crawls that repeatedly fail

Adaptive Workers

When --enable-adaptive-workers is set:

  • Reduces worker count on sustained errors
  • Min workers: --min-workers (default: 1)
  • Max reductions: --max-worker-reductions (default: 2)
  • Strategy: reduce by 1 each time the threshold is exceeded
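
The reduction policy in sketch form, using the defaults from the flags above (names are illustrative):

def next_worker_count(current: int, reductions_done: int,
                      min_workers: int = 1, max_reductions: int = 2) -> int | None:
    """Return the reduced worker count, or None if no further reduction is allowed."""
    if reductions_done >= max_reductions or current <= min_workers:
        return None
    return current - 1  # reduce by one each time the error threshold trips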

VPN Rotation

When --enable-vpn-rotation is set:

  • Rotates the VPN connection on stalls or errors
  • Command: --vpn-connect-command "vpn connect server"
  • Frequency: every --vpn-rotation-frequency-minutes
  • Max rotations: --max-vpn-rotations

Use case: Avoid IP bans during large crawls
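
A hedged sketch of the rotation step, assuming the connect command is executed through a shell (illustrative; the actual invocation may differ):

import subprocess

def rotate_vpn(connect_command: str, rotations_done: int, max_rotations: int) -> bool:
    """Run the user-supplied VPN connect command, respecting the rotation cap."""
    if rotations_done >= max_rotations:
        return False
    subprocess.run(connect_command, shell=True, check=True)
    return True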


Development

Running Locally

# Direct execution
cd src/archive_tool
python -m archive_tool.cli \
  --name test \
  --output-dir /tmp/test-crawl \
  https://example.com

# Via installed command
archive-tool --name test --output-dir /tmp/test https://example.com

Testing

# Run archive_tool tests
pytest tests/test_archive_tool*.py

# Test state management
pytest tests/test_archive_state.py

# Test WARC discovery
pytest tests/test_warc_discovery.py

Adding New Features

  1. Modify CLI (archive_tool/cli.py):
     • Add the new argument to argparse
     • Update run_with_parsed_args()

  2. Update contract (ha_backend/archive_contract.py):
     • Add the field to the ArchiveToolOptions TypedDict
     • Update validate_tool_options()

  3. Update backend (ha_backend/jobs.py):
     • Add CLI flag construction in run_persistent_job()

  4. Update job registry (ha_backend/job_registry.py):
     • Add to default_tool_options if needed

  5. Add tests:
     • tests/test_archive_contract.py - config validation
     • tests/test_jobs_persistent.py - CLI construction
     • tests/test_archive_tool_*.py - archive_tool behavior

See: src/archive_tool/docs/documentation.md (Development) for details


Troubleshooting

Docker Issues

Problem: "Cannot connect to Docker daemon"

Solution:

sudo systemctl start docker
docker ps  # Verify

Problem: Permission denied accessing Docker socket

Solution:

sudo usermod -aG docker $USER
# Log out and back in

State Issues

Problem: Crawl won't resume

Solution:

# Check state file
cat output_dir/.archive_state.json

# Force fresh start
archive-tool --overwrite ...

Problem: WARCs not found

Solution:

# Manually check
find output_dir -name "*.warc.gz"

# Verify state points to correct dirs
cat output_dir/.archive_state.json | jq '.temp_dirs_host_paths'

Monitoring Issues

Problem: Adaptive workers not triggering

Check:

  1. --enable-monitoring is set
  2. --enable-adaptive-workers is set
  3. Errors exceed the threshold
  4. Not already at --min-workers


Performance Tuning

Worker Count

  • Default: 1 worker (conservative)
  • Small sites: 1-2 workers
  • Medium sites: 2-4 workers
  • Large sites: 4-8 workers (watch resource usage)

Factors:

  • Server CPU/memory
  • Network bandwidth
  • Site's rate limiting
  • Politeness requirements

Memory Usage

Docker container memory scales with worker count:

  • ~500MB base
  • +200-500MB per worker
  • +500MB-1GB extra for large sites

Example: 4 workers ≈ 2-4GB RAM
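
Plugging those rules of thumb into a quick estimator (nothing more precise than the figures above):

def estimate_memory_gb(workers: int, large_site: bool = False) -> tuple[float, float]:
    """Rough low/high memory estimate in GB for one crawl container."""
    low = 0.5 + workers * 0.2 + (0.5 if large_site else 0.0)
    high = 0.5 + workers * 0.5 + (1.0 if large_site else 0.0)
    return round(low, 1), round(high, 1)

print(estimate_memory_gb(4, large_site=True))  # (1.8, 3.5) -- in line with the 2-4GB example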

Disk I/O

WARCs are written continuously:

  • 10-50MB/min for typical sites
  • 100-500MB/min for large sites

Ensure:

  • Fast disk (SSD recommended)
  • Sufficient space (check df -h before starting)
  • No I/O bottlenecks (check with iostat -x 1)
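
A small pre-flight check based on those write rates (a sketch; pick the rate that matches your site size):

import shutil

def check_disk_headroom(output_dir: str, expected_hours: float,
                        mb_per_min: float = 50.0) -> None:
    """Warn if free space looks tight for the expected WARC volume."""
    needed_gb = expected_hours * 60 * mb_per_min / 1024
    free_gb = shutil.disk_usage(output_dir).free / 1024 ** 3
    print(f"expected output ≈ {needed_gb:.1f} GB, free space = {free_gb:.1f} GB")
    if free_gb < needed_gb * 1.5:  # leave ~50% headroom
        print("Warning: free space may be insufficient for this crawl")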