HealthArchive – Replay Service (pywb) runbook
This document covers setting up full‑fidelity web replay (HTML + CSS/JS/images/fonts) for HealthArchive using a dedicated pywb service behind Caddy.
It is intentionally written so a future operator can follow it without needing additional context.
0) What this is (and is not)
What this provides
- A replay origin:
https://replay.healtharchive.ca - Wayback‑style replay from the project’s WARC files
- Natural browsing: links stay inside replay, and captured assets (CSS/JS/images) load from the archive when present
What this does not provide
- Guaranteed completeness. If a page depends on third‑party CSS/JS/images that were not captured into the WARCs, those assets will still be missing at replay time.
- A custom replay UI in pywb itself. HealthArchive provides the primary browsing experience via the frontend wrapper pages (see “Backend wiring” below).
1) Core decisions (contract)
1.1 Collections are per ArchiveJob
- Each `ArchiveJob` becomes a replay “edition”.
- Collection name is stable and mechanical: `job-<job_id>` (example: `job-1`).
This makes it easy to generate replay URLs from DB data later.
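For illustration, the naming rule is mechanical enough to capture in a one-line helper (a sketch; the backend may derive it differently internally):

```python
def collection_name(job_id: int) -> str:
    """Map an ArchiveJob id to its stable pywb collection name."""
    return f"job-{job_id}"

print(collection_name(1))  # job-1
```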
1.2 Replay URL format
We will use pywb’s standard collection routing.
Replay latest capture (most common):

https://replay.healtharchive.ca/<collection>/<url>

List captures (“calendar” / capture list UI):

https://replay.healtharchive.ca/<collection>/*/<url>

Replay closest capture to a timestamp (14-digit UTC YYYYMMDDhhmmss):

https://replay.healtharchive.ca/<collection>/<timestamp>/<url>

Where:
- `<collection>` is `job-<job_id>`
- `<timestamp>` is UTC in `YYYYMMDDhhmmss` (14 digits)

Example (latest capture): `https://replay.healtharchive.ca/job-1/<original_url>`
Note: HealthArchive’s public API generates timestamp-locked replay URLs for snapshots (the <timestamp> form) so the viewer stays anchored to the capture time as you navigate within the backup.
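The URL contract above can be sketched as a small helper (illustrative only; the function names are hypothetical, and the route shapes follow pywb’s standard collection routing):

```python
import re
from typing import Optional

BASE = "https://replay.healtharchive.ca"

def replay_url(job_id: int, url: str, timestamp14: Optional[str] = None) -> str:
    """Latest capture when timestamp14 is None; closest capture otherwise."""
    coll = f"job-{job_id}"
    if timestamp14 is None:
        return f"{BASE}/{coll}/{url}"
    if not re.fullmatch(r"\d{14}", timestamp14):
        raise ValueError("timestamp must be UTC YYYYMMDDhhmmss (14 digits)")
    return f"{BASE}/{coll}/{timestamp14}/{url}"

def capture_list_url(job_id: int, url: str) -> str:
    """Calendar / capture-list view for a URL."""
    return f"{BASE}/job-{job_id}/*/{url}"
```

The timestamp check enforces the 14-digit contract before a URL is emitted, which is how the public API keeps viewers anchored to a capture time.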
1.3 Retention warning: replay depends on WARCs staying on disk
Replay reads from the WARC files referenced by each job.
Important: healtharchive cleanup-job --mode temp currently removes archive_tool temp dirs including WARCs (see src/ha_backend/cli.py:cmd_cleanup_job).
If you run cleanup on a replayable job, replay will break.
Operational rule for now: do not run cleanup-job --mode temp for any job you want replayable.
When replay is enabled (backend env var HEALTHARCHIVE_REPLAY_BASE_URL is set), cleanup-job --mode temp will refuse to run unless you pass --force.
This rule is repeated in:
- docs/deployment/production-single-vps.md
- docs/deployment/hosting-and-live-server-to-dos.md
- docs/development/live-testing.md
2) DNS + TLS
Create DNS:
A replay.healtharchive.ca -> <VPS_PUBLIC_IP>
TLS is handled by Caddy automatically once the site block exists.
Before continuing, SSH to the VPS as your admin user (typically over Tailscale).
3) VPS directory layout
On the VPS:
- WARCs/job outputs already live under: `/srv/healtharchive/jobs`
- Replay service state (config, collections, indexes) will live under: `/srv/healtharchive/replay`
Create directories:
sudo mkdir -p /srv/healtharchive/replay/collections
Create a dedicated system user for the replay volume (a no-login system account is sufficient; adjust flags to your distribution):
sudo useradd --system --shell /usr/sbin/nologin hareplay
Recommended perms (important):
sudo chown -R hareplay:healtharchive /srv/healtharchive/replay
sudo chmod 2770 /srv/healtharchive/replay /srv/healtharchive/replay/collections
Why the hareplay ownership matters:
- Your WARC files are typically `640` and group-owned by `healtharchive`.
- The pywb container is hardened with `--cap-drop=ALL`, which means “root” in the container cannot bypass Unix permissions (no `CAP_DAC_OVERRIDE`).
- We will also run the container as the `hareplay` UID/GID explicitly (below), so pywb can:
  - write its indexes under `/webarchive` (owned by `hareplay:healtharchive`)
  - read group-readable WARCs under `/warcs` (group `healtharchive`)
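As a quick sanity check of this ownership model, a sketch like the following (hypothetical helper, stdlib only) verifies the mode and group of the replay volume from the host:

```python
import grp
import os
import stat

def check_replay_volume(path: str, group: str = "healtharchive") -> list:
    """Return a list of problems with the replay volume's perms (empty = OK)."""
    problems = []
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    # Expect 2770: setgid so new files inherit the shared group.
    if mode != 0o2770:
        problems.append(f"{path}: mode {oct(mode)}, expected 0o2770")
    gid = grp.getgrnam(group).gr_gid
    if st.st_gid != gid:
        problems.append(f"{path}: group id {st.st_gid}, expected {group} ({gid})")
    return problems
```

Run it against `/srv/healtharchive/replay` and `/srv/healtharchive/replay/collections` after the chown/chmod above; an empty list means the perms match the recommendation.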
4) pywb container deployment (systemd + Docker)
We run pywb only on localhost (Caddy is the public edge).
4.1 Create pywb config + rules
Create /srv/healtharchive/replay/config.yaml:
debug: false
# We embed replay inside a HealthArchive wrapper UI later; disable pywb’s framed
# replay chrome so the page itself renders “as captured”.
framed_replay: false
# Prefer stable URLs once a capture is resolved.
redirect_to_exact: true
# pywb 2.9.1 reads cookie rewriting from rewrite rules rather than a top-level
# config knob. Keep the managed rules file alongside this config.
rules_file: /webarchive/rules.yaml
# Optional: expose an aggregate across all on-disk collections at `/all/...`.
# (This is not required for per-job collections like `/job-1/...`.)
# collections:
# all: $all
Create /srv/healtharchive/replay/rules.yaml:
default_filters:
fuzzy_search_limit: "100"
not_exts:
- asp
- aspx
- jsp
- php
- pl
- exe
- dll
mimes:
- application/x-shockwave-flash
- application/dash+xml
- application/x-mpegURL
- application/vnd.apple.mpegurl
url_normalize:
- match: "[?&](_|cb|uncache)=([\\d]+)(?=&|$)"
replace: ""
- match: "[?&]utm_[^=]+=[^&]+(?=&|$)"
replace: ""
- match: "[?&](callback=jsonp)[^&]+(?=&|$)"
replace: "\\1"
- match: "[?&]((?:\\w+)=jquery)[\\d]+_[\\d]+"
replace: "\\1"
- match: "[?&](\\w*(bust|ts)\\w*=1[\\d]{12,15})(?=&|$)"
replace: ""
- match: "[?&](fbclid)=(.*)+(?=&|$)"
replace: ""
rules:
- url_prefix: ""
fuzzy_lookup:
match: "()"
- url_prefix: ""
rewrite:
cookie_scope: removeall
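To see what the `url_normalize` patterns do, here is a sketch that applies a few of the same regexes with Python’s `re` module (for illustration only; pywb applies them through its own fuzzy-match machinery, not like this):

```python
import re

# (pattern, replacement) pairs mirroring part of rules.yaml's url_normalize.
NORMALIZERS = [
    (r"[?&](_|cb|uncache)=([\d]+)(?=&|$)", ""),   # cache-busters
    (r"[?&]utm_[^=]+=[^&]+(?=&|$)", ""),          # UTM tracking params
    (r"[?&](fbclid)=(.*)+(?=&|$)", ""),           # Facebook click ids
]

def normalize(url: str) -> str:
    """Strip cache-busters and tracking params the way the rules describe."""
    for pattern, repl in NORMALIZERS:
        url = re.sub(pattern, repl, url)
    return url

# Two captures of the "same" page now look identical to the index:
print(normalize("https://example.ca/page?id=1&utm_source=news&cb=1712345"))
# https://example.ca/page?id=1
```

This is why replay still finds a capture when the live page appends a fresh cache-busting or tracking parameter on every load.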
Repo-managed template:
- /opt/healtharchive/docs/deployment/pywb/config.yaml
- /opt/healtharchive/docs/deployment/pywb/rules.yaml
- /opt/healtharchive/docs/deployment/pywb/sitecustomize.py
- /opt/healtharchive/docs/deployment/systemd/healtharchive-replay.service
On single-VPS deployments, prefer installing that tracked template instead of editing /srv/healtharchive/replay/config.yaml or /srv/healtharchive/replay/rules.yaml by hand.
4.2 Create systemd service
Preferred (single VPS): install the repo-managed template:
sudo install -m 0644 /opt/healtharchive/docs/deployment/systemd/healtharchive-replay.service /etc/systemd/system/healtharchive-replay.service
Manual equivalent (/etc/systemd/system/healtharchive-replay.service):
[Unit]
Description=HealthArchive replay (pywb)
Wants=network-online.target
After=network-online.target docker.service healtharchive-warc-tiering.service
Requires=docker.service
ConditionPathExists=/srv/healtharchive/replay
ConditionPathExists=/srv/healtharchive/replay/config.yaml
ConditionPathExists=/srv/healtharchive/jobs
[Service]
Type=simple
ExecStartPre=/usr/bin/getent passwd hareplay
ExecStartPre=/usr/bin/getent group healtharchive
ExecStartPre=-/usr/bin/docker rm -f healtharchive-replay
ExecStartPre=/usr/bin/docker pull webrecorder/pywb:2.9.1
ExecStart=/usr/bin/bash -lc 'exec /usr/bin/docker run --rm --name healtharchive-replay -p 127.0.0.1:8090:8080 -e PYTHONPATH=/webarchive --user "$$(/usr/bin/id -u hareplay):$$(/usr/bin/getent group healtharchive | /usr/bin/cut -d: -f3)" --cap-drop=ALL --security-opt no-new-privileges:true -v /srv/healtharchive/replay:/webarchive:rw -v /srv/healtharchive/jobs:/warcs:ro,rshared webrecorder/pywb:2.9.1'
Restart=always
RestartSec=3
TimeoutStopSec=30
[Install]
WantedBy=multi-user.target
Notes:
- We run as `hareplay:healtharchive` to avoid the container needing `useradd`/`su` internally (which fails when `--cap-drop=ALL` removes `CAP_SETUID`/`CAP_SETGID`).
- The `rshared` bind propagation on `/srv/healtharchive/jobs` helps pywb see nested mounts under that tree (e.g., Storage Box tiering bind mounts) without requiring a container restart after mount repairs.
Enable + start:
sudo systemctl daemon-reload
sudo systemctl enable --now healtharchive-replay.service
sudo systemctl status healtharchive-replay.service --no-pager
Local check (on the VPS):
curl -sI http://127.0.0.1:8090/
If wb-manager reindex fails with Permission denied:
- Double-check:
  - `/srv/healtharchive/replay` is owned by `hareplay:healtharchive` (not `root:healtharchive`)
  - the systemd unit runs with `--user <hareplay_uid>:<healtharchive_gid>`
Then restart:
sudo chown -R hareplay:healtharchive /srv/healtharchive/replay
sudo systemctl restart healtharchive-replay.service
5) Caddy config (public HTTPS)
Edit /etc/caddy/Caddyfile and add:
replay.healtharchive.ca {
encode zstd gzip
# Replay needs to be embeddable by the HealthArchive frontend.
# (The frontend wrapper provides the visible banner/controls.)
header {
-X-Frame-Options
Content-Security-Policy "frame-ancestors https://healtharchive.ca https://www.healtharchive.ca"
}
reverse_proxy 127.0.0.1:8090
}
Validate + reload:
sudo caddy fmt --overwrite /etc/caddy/Caddyfile
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy
Public check (from your laptop):
curl -sI https://replay.healtharchive.ca/
5.1) Optional: HealthArchive banner on direct replay pages
The HealthArchive frontend provides the primary replay UX via /snapshot/<id> and /browse/<id> (header, navigation, disclaimers). Users may still open replay.healtharchive.ca directly in a new tab.
To reduce confusion, you can inject a small HealthArchive banner into pywb’s non-framed replay HTML using pywb’s custom_banner.html hook.
Implementation notes:
- The banner is inserted only for non-framed replay (our default).
- When replay is embedded in an iframe, the banner collapses to a minimal UI (View diff + Details + Hide) to avoid duplicating the HealthArchive wrapper header.
- When embedded, the script also emits lightweight `postMessage` events with the current replay URL/timestamp so the HealthArchive frontend can support edition switching while you browse.
- Users can dismiss it via the Hide button (stored in `localStorage` on the replay origin).
Deploy on the VPS:
sudo install -o hareplay -g healtharchive -m 0640 \
/opt/healtharchive/docs/deployment/pywb/config.yaml \
/srv/healtharchive/replay/config.yaml
sudo install -o hareplay -g healtharchive -m 0640 \
/opt/healtharchive/docs/deployment/pywb/rules.yaml \
/srv/healtharchive/replay/rules.yaml
sudo install -o hareplay -g healtharchive -m 0640 \
/opt/healtharchive/docs/deployment/pywb/sitecustomize.py \
/srv/healtharchive/replay/sitecustomize.py
sudo mkdir -p /srv/healtharchive/replay/templates
sudo install -o hareplay -g healtharchive -m 0640 \
/opt/healtharchive/docs/deployment/pywb/custom_banner.html \
/srv/healtharchive/replay/templates/custom_banner.html
sudo systemctl restart healtharchive-replay.service
Notes:
- The banner script calls the HealthArchive public API from the replay origin (for example, `GET /api/replay/resolve`) to resolve the snapshot ID and build the correct “back to snapshot” and compare links. Ensure the backend CORS allowlist includes `https://replay.healtharchive.ca` when the banner is enabled.
- The direct-replay banner is a compact sticky top bar: title, capture date, original URL, an always-visible disclaimer line, and action links (View diff, Details, All snapshots, Raw HTML, Metadata JSON, Cite, Report issue, Hide). “All snapshots” opens a right-aligned popover list.
- On small screens, the banner stacks the action links below the “← HealthArchive.ca” button to avoid overlap; when you scroll, it transitions into a more compact mode that hides the metadata/disclaimer line.
- The “← HealthArchive.ca” button returns to the HealthArchive page the user came from when possible (the frontend passes an explicit return path in the replay URL fragment). If no return path is available, it falls back to the archive search page for the current original URL.
- The banner uses `XMLHttpRequest` with pywb’s wombat opt-out (`xhr._no_rewrite = true`) so API requests are not replay-rewritten. Ensure CORS allows the `X-Pywb-Requested-With` header from `https://replay.healtharchive.ca`.
- Production expects the public API to be reachable at `https://api.<apex>` (for example, `https://api.healtharchive.ca`). If your deployment instead proxies `/api` on the frontend origin, ensure the banner’s API base candidates are still valid for your hostnames.
- If you deploy the backend using `./scripts/vps-deploy.sh --apply --restart-replay`, the deploy helper will also install the tracked replay `config.yaml`, `rules.yaml`, `sitecustomize.py`, `custom_banner.html`, and restart the replay service as part of that run (single-VPS setup).
Note: the banner can be disabled for screenshot generation by adding a fragment to the replay URL: `#ha_nobanner=1`
6) Create a collection and index a job’s WARCs (no copying)
pywb’s wb-manager requires WARC files to exist in the collection’s archive/ directory. We avoid duplicating data by placing symlinks to the real WARC files (mounted read-only at /warcs in the container).
6.0 Recommended: use the backend CLI (one command per job)
If the backend and pywb run on the same VPS, you can make a job replayable via:
sudo systemd-run --wait --pipe \
--property=EnvironmentFile=/etc/healtharchive/backend.env \
/opt/healtharchive/.venv/bin/healtharchive replay-index-job --id 1
Dry-run (prints actions without changes):
sudo systemd-run --wait --pipe \
--property=EnvironmentFile=/etc/healtharchive/backend.env \
/opt/healtharchive/.venv/bin/healtharchive replay-index-job --id 1 --dry-run
6.1 Initialize collection for job 1
Inside the running container (wb-manager operates on /webarchive, the container’s working directory):
sudo docker exec healtharchive-replay wb-manager init job-1
6.2 Link job 1 WARCs into the collection
1) Determine the job output directory:
sudo systemd-run --wait --pipe \
--property=EnvironmentFile=/etc/healtharchive/backend.env \
/opt/healtharchive/.venv/bin/healtharchive show-job --id 1
2) Find WARCs under that output directory:
OUTPUT_DIR="/srv/healtharchive/jobs/imports/legacy-hc-2025-04-21" # example; replace
find "$OUTPUT_DIR" -type f -name '*.warc.gz' | sort > /tmp/job-1-warcs.txt
wc -l /tmp/job-1-warcs.txt
3) Convert host paths → container paths (because we mount /srv/healtharchive/jobs as /warcs):
sed 's|^/srv/healtharchive/jobs|/warcs|' /tmp/job-1-warcs.txt > /tmp/job-1-warcs.container.txt
4) Create symlinks in the collection archive directory (prefixing with a stable counter to avoid name collisions):
COLL_ARCHIVE_DIR="/srv/healtharchive/replay/collections/job-1/archive"
sudo mkdir -p "$COLL_ARCHIVE_DIR"
nl -ba /tmp/job-1-warcs.container.txt | while read -r n p; do
printf -v linkname "warc-%06d.warc.gz" "$n"
sudo ln -sf "$p" "$COLL_ARCHIVE_DIR/$linkname"
done
Note: the symlink targets are container paths under /warcs/..., so they may appear “broken” when inspected on the host. They will resolve correctly inside the container because /srv/healtharchive/jobs is mounted as /warcs.
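Because the link targets only resolve inside the container, a host-side audit has to translate the `/warcs` prefix back before checking for missing WARCs. A sketch (hypothetical helper; the prefixes match the mounts described above):

```python
import os

CONTAINER_PREFIX = "/warcs"
HOST_PREFIX = "/srv/healtharchive/jobs"

def host_target(link_path: str) -> str:
    """Translate a collection symlink's container-path target to the host path."""
    target = os.readlink(link_path)
    if not target.startswith(CONTAINER_PREFIX + "/"):
        raise ValueError(f"{link_path}: unexpected target {target}")
    return HOST_PREFIX + target[len(CONTAINER_PREFIX):]

def missing_warcs(archive_dir: str) -> list:
    """List symlinks whose translated host targets no longer exist on disk."""
    missing = []
    for name in sorted(os.listdir(archive_dir)):
        path = os.path.join(archive_dir, name)
        if os.path.islink(path) and not os.path.exists(host_target(path)):
            missing.append(name)
    return missing
```

Running `missing_warcs("/srv/healtharchive/replay/collections/job-1/archive")` on the host should return an empty list for a healthy collection; anything listed points at a WARC that was moved or deleted (see the retention warning in section 1.3).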
5) Index:
sudo docker exec healtharchive-replay wb-manager reindex job-1
6.3 Verify replay works
Pick a known URL in the job (example HC homepage):
In a browser:
- Open the same URL and click around.
- Confirm CSS/images load and links stay under `replay.healtharchive.ca/job-1/...`.
6.4 Repeat for another job (example: CIHR)
Once the CIHR legacy WARCs are imported and indexed as an ArchiveJob (see docs/operations/legacy-crawl-imports.md), repeat the same steps with that job ID:
- Recommended: `healtharchive replay-index-job --id <id>`
- Manual equivalent:
  - `wb-manager init job-<id>`
  - Symlink that job’s WARCs into `/srv/healtharchive/replay/collections/job-<id>/archive/`
  - `wb-manager reindex job-<id>`
- Verify: `https://replay.healtharchive.ca/job-<id>/<some captured url>/`
7) Troubleshooting
- Blank pages / missing styling: the asset was not captured into the WARC set, or the page uses live third‑party resources not archived.
- Replay 404 but snapshot exists in DB: the job’s WARCs were not linked+indexed into pywb (or you ran `cleanup-job` and deleted WARCs).
- Replay 502 for a specific page but localhost replay is 200: check for malformed replayed cookie headers. A bare archived cookie line such as `AWSALBCORS=...` can make Caddy/Go reject the upstream response before it reaches the client. Ensure `/srv/healtharchive/replay/config.yaml` points `rules_file` at `/webarchive/rules.yaml`, ensure `/srv/healtharchive/replay/rules.yaml` contains `cookie_scope: removeall` under a rewrite rule, ensure the replay container starts with `PYTHONPATH=/webarchive` so `/srv/healtharchive/replay/sitecustomize.py` can drop invalid header names, and restart `healtharchive-replay.service`.
- Replay UI shows “All-time (0 captures)”: that exact URL (including scheme + host, e.g. `www.` vs non-www) likely isn’t present in the WARC set. Confirm via `/<collection>/cdx?url=...` and try host/scheme variants.
- Iframe blocked: check the `frame-ancestors` header on `replay.healtharchive.ca` and ensure you removed `X-Frame-Options`.
- Service crash-loop with `groupadd`/`useradd` and `su: Authentication failure`: the container entrypoint is trying to create/switch users, but `--cap-drop=ALL` removes the capabilities needed. Fix by running the container as the host UID/GID directly via `--user <hareplay_uid>:<healtharchive_gid>`.
8) Backend wiring (optional, but recommended)
If you want the HealthArchive frontend to embed replay by default, configure the backend to emit a browseUrl for each snapshot.
On the VPS (backend host), set in /etc/healtharchive/backend.env:
HEALTHARCHIVE_REPLAY_BASE_URL=https://replay.healtharchive.ca
Then restart the backend service.
8.1 Edition switching (v2: “preserve current page across backups”)
HealthArchive supports switching “editions” (jobs) while keeping you on the same original URL when possible.
This is implemented as:
- `GET /api/sources/{sourceCode}/editions` lists replayable jobs (editions) for the source, including each job’s `entryBrowseUrl` (a good fallback when a specific page wasn’t captured).
- `POST /api/replay/resolve`
  - input: `{ "jobId": <id>, "url": "<original_url>", "timestamp14": "YYYYMMDDhhmmss" | null }`
  - output: a best-effort `browseUrl` for the selected job if a capture exists (or `found=false` when it does not).
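For illustration, a client-side sketch of the resolve call using only the stdlib (the request body follows the description above; the API base URL is an assumption, so match it to your deployment):

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://api.healtharchive.ca"  # assumption; match your deployment

def resolve_payload(job_id: int, url: str, timestamp14: Optional[str] = None) -> bytes:
    """Build the JSON body for POST /api/replay/resolve."""
    return json.dumps(
        {"jobId": job_id, "url": url, "timestamp14": timestamp14}
    ).encode("utf-8")

def resolve(job_id: int, url: str, timestamp14: Optional[str] = None) -> dict:
    """Ask the backend for a browseUrl in the selected job's edition."""
    req = urllib.request.Request(
        f"{API_BASE}/api/replay/resolve",
        data=resolve_payload(job_id, url, timestamp14),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

When the response carries `found=false`, the page was not captured in that job; fall back to the edition’s `entryBrowseUrl` from the editions endpoint.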
The frontend relies on lightweight postMessage events emitted by the replay banner template (see “Optional: HealthArchive banner on direct replay pages” above) to learn the current original URL while the user clicks around inside replay.
Frontend-side details and verification are documented in:
- https://github.com/jerdaw/healtharchive/blob/main/frontend/docs/deployment/verification.md
The end-to-end checks there confirm:
- snapshot pages embed replay correctly, and
- `/browse/<snapshotId>` provides a full-screen browsing wrapper with a persistent HealthArchive banner above the replay iframe.
9) Cached source preview images (optional, recommended)
The frontend /archive page can show a lightweight “homepage preview” tile for each source’s latest replayable backup.
To avoid rendering live iframes on every page load, these previews are served as cached static images generated out-of-band.
9.1 Configure preview directory (VPS)
Choose a directory on the VPS:
- Recommended: `/srv/healtharchive/replay/previews`
Create it with the same ownership model as the replay volume:
sudo mkdir -p /srv/healtharchive/replay/previews
sudo chown -R hareplay:healtharchive /srv/healtharchive/replay/previews
sudo chmod 2770 /srv/healtharchive/replay/previews
In /etc/healtharchive/backend.env, set:
HEALTHARCHIVE_REPLAY_PREVIEW_DIR=/srv/healtharchive/replay/previews
Then restart the API:
9.2 Generate previews (VPS)
Generate (or refresh) previews for all sources with:
sudo systemd-run --wait --pipe \
--property=EnvironmentFile=/etc/healtharchive/backend.env \
/opt/healtharchive/.venv/bin/healtharchive replay-generate-previews
This uses a Playwright container to screenshot each source’s entryBrowseUrl (with #ha_nobanner=1 so the pywb banner is not captured).
Note: The generator caches Playwright’s Node.js dependencies under <preview_dir_parent>/.preview-node/. If you point HEALTHARCHIVE_REPLAY_PREVIEW_DIR at a path inside your repo for local testing, ensure .preview-node/ is ignored by git (it is in .gitignore).
9.3 Verify previews are available
1) Confirm /api/sources advertises entryPreviewUrl where available:
2) Confirm an individual preview serves as an image: