Pages table rollout (browse performance + capture counts)
The backend can optionally maintain a pages table that materializes a per‑source “page” concept (grouped by normalized_url_group) from the raw snapshots table.
Important: this is metadata only. It does not modify WARCs, does not delete snapshots, and does not affect replay fidelity.
What it improves
- Browse performance for GET /api/search?view=pages when there is no search query (and no date range). This avoids expensive window functions over the entire snapshots table.
- Adds pageSnapshotsCount to page-browse results so the frontend can show “Captures N”.
Keyword searches (q=...) and date-range filters still run directly against snapshots to keep correctness predictable.
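The routing rule above can be sketched as a small predicate. This is an illustrative sketch only, not the backend's actual code: the function name `use_pages_fastpath`, its parameters, and the `fastpath_enabled` flag (standing in for however the backend reads its fast-path setting) are all assumptions.

```python
from typing import Optional


def use_pages_fastpath(
    view: str,
    q: Optional[str],
    date_from: Optional[str],
    date_to: Optional[str],
    fastpath_enabled: bool = True,
) -> bool:
    """Return True when a browse request could be served from the pages table.

    Hypothetical helper: names and signature are illustrative only.
    """
    if not fastpath_enabled or view != "pages":
        return False
    # Keyword searches and date-range filters still go to snapshots.
    has_query = bool(q and q.strip())
    has_date_range = date_from is not None or date_to is not None
    return not has_query and not has_date_range
```

Under these assumptions, a plain `view=pages` browse request takes the fast path, while any request carrying `q` or a date bound falls back to snapshot-based grouping.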
Rollout steps (production)
1) Apply migrations:
2) Backfill the table once:
Notes:
- On Postgres, the CLI may print “upserted unknown” because the DB driver often does not report an accurate rowcount for large INSERT ... SELECT statements. Use the verification steps below (or SELECT count(*) FROM pages;) to confirm it worked.
- For large datasets this can take a while; run it in tmux or off-peak.
- The worker will keep the table updated for newly indexed jobs (incremental rebuilds happen after indexing).
Verification
1) Confirm pages browse includes capture counts:
curl -s "https://api.healtharchive.ca/api/search?view=pages&pageSize=1" | python3 -m json.tool | head
You should see pageSnapshotsCount as an integer (not null) on results.
2) Confirm metrics (admin token required):
curl -s https://api.healtharchive.ca/metrics \
-H "Authorization: Bearer $HEALTHARCHIVE_ADMIN_TOKEN" \
| grep -E "healtharchive_pages_(table_present|total|fastpath_enabled)|healtharchive_search_mode_total\\{mode=\\\"pages_fastpath\\\"\\}"
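Both checks can also be scripted rather than read by eye. The sketch below is a minimal example under stated assumptions: it assumes the search response is a JSON object with a top-level `results` list, and that the metrics endpoint returns Prometheus-style text lines of the form `name{labels} value` with no trailing timestamp. The helper names and the sample values are hypothetical and are not real output.

```python
from typing import Any, Dict, List


def missing_capture_counts(payload: Dict[str, Any]) -> List[int]:
    """Return indexes of results whose pageSnapshotsCount is not an integer."""
    bad = []
    for i, item in enumerate(payload.get("results", [])):
        count = item.get("pageSnapshotsCount")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(count, int) or isinstance(count, bool):
            bad.append(i)
    return bad


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample line (name plus optional labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # The value follows the last space; timestamps are not handled here.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Illustrative inputs only; the numbers are made up.
SAMPLE_RESPONSE = {"results": [{"pageSnapshotsCount": 3}, {"pageSnapshotsCount": None}]}
SAMPLE_METRICS = """\
# illustrative sample, not real output
healtharchive_pages_table_present 1
healtharchive_pages_fastpath_enabled 1
healtharchive_search_mode_total{mode="pages_fastpath"} 42
"""

print(missing_capture_counts(SAMPLE_RESPONSE))  # [1]
print(parse_metrics(SAMPLE_METRICS)["healtharchive_pages_table_present"])  # 1.0
```

An empty list from `missing_capture_counts` corresponds to the "integer (not null)" expectation in step 1.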
Rollback / safety valve
If anything looks suspicious in production (for example, unexpected browse ordering or unexpected results), you can disable the fast path without touching data:
1) Set HA_PAGES_FASTPATH=0 in /etc/healtharchive/backend.env
2) Restart only the API process:
This forces view=pages browsing to fall back to snapshot-based grouping.