Investigation Report: Indexing Delay / Zero Indexed Pages
Date: 2026-01-19
Subject: Job 6 "indexed_pages" count remaining at 0 despite WARC generation.
Status: RESOLVED (Expected Behavior)
Issue Description
During the deployment of the 2026 Annual Crawl Hardening, it was observed that Job 6 (Health Canada) had generated 56 WARC files but the indexed_pages metric in the database remained at 0. This raised concerns that the indexing pipeline was broken or stalled.
Investigation Steps (Phase 5)
- Static Analysis: Searched for index_job calls in the worker source code.
- Runtime Analysis: Verified healtharchive-worker logs.
- State Verification: Checked filesystem for WARCs vs DB status.
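The state verification step amounts to counting WARC files on disk and comparing that number with what the database reports. A minimal sketch of the filesystem half, assuming the job's WARCs sit under a single output directory and end in .warc or .warc.gz (both the layout and the count_warcs helper are hypothetical, not taken from the HealthArchive code):

```python
from pathlib import Path


def count_warcs(root):
    """Count WARC files anywhere under root (hypothetical helper).

    Assumes files end in .warc or .warc.gz; the real job output layout
    may differ.
    """
    root = Path(root)
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.name.endswith((".warc", ".warc.gz"))
    )
```

A non-zero count alongside indexed_pages=0 is only a problem if the job has already left the running state.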
Findings
- Indexing is Terminal: Code analysis of src/ha_backend/worker/main.py confirmed that index_job(job_id) is only called after the crawl loop exits successfully. Unlike some crawlers that index incrementally, HealthArchive currently indexes in batches after the crawl completes.
- Crawl is Active: Job 6 is still in running state.
  - 56 WARC files exist on disk.
  - last_progress timestamps are updating.
- Conclusion: The indexed_pages=0 metric is correct for a running job. It will update to the full count once the job finishes and the indexing phase begins.
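The terminal-indexing behavior described in the findings can be illustrated with a toy model. This is a sketch under stated assumptions, not the code in src/ha_backend/worker/main.py: the Job dataclass, run_crawl, process_job, the on_progress callback, and the pages-per-WARC figure are all hypothetical. Only the ordering (index_job runs once, after the crawl loop exits) comes from the investigation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a crawl job row."""
    job_id: int
    status: str = "queued"
    warc_files: int = 0
    indexed_pages: int = 0


def run_crawl(job, warc_batches, on_progress=None):
    """Toy crawl loop: writes WARCs but never touches indexed_pages."""
    job.status = "running"
    for _ in range(warc_batches):
        job.warc_files += 1
        if on_progress is not None:
            on_progress(job)


def index_job(job, pages_per_warc=10):
    """Toy indexer: pages_per_warc is an illustrative constant."""
    job.indexed_pages = job.warc_files * pages_per_warc


def process_job(job, warc_batches, on_progress=None):
    """Crawl first, then index once after the loop exits."""
    run_crawl(job, warc_batches, on_progress)
    job.status = "completed"
    index_job(job)


job = Job(job_id=6)
snapshots = []
process_job(
    job,
    warc_batches=3,
    on_progress=lambda j: snapshots.append((j.status, j.warc_files, j.indexed_pages)),
)
# While running, WARCs accumulate and indexed_pages stays at 0.
print(snapshots)  # [('running', 1, 0), ('running', 2, 0), ('running', 3, 0)]
print(job.status, job.indexed_pages)  # completed 30
```

Every snapshot taken during the crawl shows a growing WARC count with indexed_pages at 0, which is the same state observed for Job 6.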
Hardening Actions Taken
To prevent future confusion and catch actual indexing failures:
- New Alert: IndexingNotStartedAfterCrawl (in prometheus-alerts-crawl.yml).
  - Fires if status='completed' AND indexed_pages=0 for > 1 hour.
  - Runbook: docs/operations/runbooks/indexing-not-started.md.
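The alert condition can be restated as a plain predicate, which is handy for reasoning about edge cases or unit-testing the logic outside Prometheus. This is a sketch only: the actual rule lives in prometheus-alerts-crawl.yml and is not reproduced here, and the function name, the completed_at parameter, and the GRACE constant are assumptions introduced for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical constant mirroring the "> 1 hour" window in the alert.
GRACE = timedelta(hours=1)


def indexing_not_started(status, indexed_pages, completed_at, now):
    """Return True when a completed job has had zero indexed pages for over an hour.

    completed_at is an assumed timestamp for when the job entered the
    'completed' state; a running job never triggers the condition.
    """
    if status != "completed" or indexed_pages != 0:
        return False
    if completed_at is None:
        return False
    return now - completed_at > GRACE


now = datetime(2026, 1, 19, 12, 0, tzinfo=timezone.utc)
# A running job with zero indexed pages (the Job 6 situation) does not fire.
print(indexing_not_started("running", 0, None, now))  # False
# A job completed two hours ago with nothing indexed does fire.
print(indexing_not_started("completed", 0, now - timedelta(hours=2), now))  # True
```

Keying the condition on the completed state is what keeps a long-running crawl such as Job 6 from paging anyone.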
Resolution
No fix required. The system is functioning as designed. Monitoring will alert if the post-crawl indexing fails.