Incident: Replay smoke tests failed (503) due to stale mounts and a failed warc-tiering service (2026-01-16)
Status: closed
Metadata
- Date (UTC): 2026-01-16
- Severity (see severity.md): sev1
- Environment: production
- Primary area: replay + storage
- Owner: (unassigned)
- Start (UTC): 2026-01-15T04:20:00Z (first observed failing replay-smoke metrics)
- End (UTC): 2026-01-16T02:51:56Z (replay-smoke metrics OK)
Summary
The daily replay smoke tests began returning 503 for the legacy imported jobs (HC + CIHR), even though https://replay.healtharchive.ca/ itself was up (200). The underlying issue was that the replay container could not reliably read WARCs under `/srv/healtharchive/jobs/imports/**` due to stale mountpoints (`Transport endpoint is not connected`) and the replay container’s mount namespace not reflecting repaired/updated mounts. Separately, `healtharchive-warc-tiering.service` had been left in a failed state since 2026-01-08, preventing tiered imports from being reliably mounted.
Recovery: re-apply WARC tiering, clear the failed systemd state, and restart the replay service to refresh its mounts; then re-run replay smoke tests.
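For orientation: a stale SSHFS mountpoint fails every access with Errno 107 while the parent filesystem still responds normally. A minimal sketch of the symptom (the exact import path is illustrative):

    # Base mount answers normally:
    df -h /srv/healtharchive/jobs
    # ...but any access to a stale nested mountpoint fails with:
    # "Transport endpoint is not connected" (ENOTCONN / Errno 107)
    stat /srv/healtharchive/jobs/imports/hc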
Impact
- User-facing impact: replay for legacy jobs intermittently failed (HTTP 503 responses from pywb for snapshot requests).
- Internal impact: `ReplaySmokeFailed` monitoring noise and operator intervention required.
- Data impact:
  - Data loss: no evidence
  - Data integrity risk: low/unknown (symptom was read failures, not WARC corruption)
- Recovery completeness: complete (smoke tests returned `200`)
- Duration: ~22h (first failing metric to confirmed recovery)
Detection
- node_exporter metrics:
  - `healtharchive_replay_smoke_ok{job_id="1",source="hc"} 0` + `status_code ... 503`
  - `healtharchive_replay_smoke_ok{job_id="2",source="cihr"} 0` + `status_code ... 503`
- systemd state: `healtharchive-warc-tiering.service` was `failed` since 2026-01-08 with `Transport endpoint is not connected`.
- Container symptom: `docker exec healtharchive-replay ... ls -la /warcs/imports/...` showed `d?????????` and `Transport endpoint is not connected`.
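All three signals can be re-checked from a host shell with the commands below (container and unit names as above):

    # 1) Replay smoke metrics from node_exporter
    curl -s http://127.0.0.1:9100/metrics | rg '^healtharchive_replay_smoke_'

    # 2) Tiering service state and recent logs (look for Errno 107)
    systemctl is-failed healtharchive-warc-tiering.service
    journalctl -u healtharchive-warc-tiering.service -n 20 --no-pager

    # 3) Can the replay container actually read the tiered imports?
    docker exec healtharchive-replay ls -la /warcs/imports/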
Decision log (recommended for sev1)
- 2026-01-16T02:51:00Z — Decision: restart replay after fixing tiering mounts (why: quickest way to ensure the pywb container sees a clean view of `/srv/healtharchive/jobs` and can read WARCs; risks: brief replay downtime, but no data mutation).
- 2026-01-16T16:00:00Z — Decision (post-incident hardening): run pywb with `rshared` bind propagation for `/srv/healtharchive/jobs` (why: allow the container to observe repaired nested mounts without requiring an additional restart; risks: broader mount propagation surface, but still read-only inside the container). A sketch of the corresponding container invocation follows.
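For reference, that decision maps onto Docker's `bind-propagation` mount option. A sketch of what the container invocation could look like (the image name and surrounding flags are illustrative, not the production unit):

    # rshared requires the host path to be on a shared mount, e.g.:
    #   sudo mount --make-rshared /srv/healtharchive
    docker run -d --name healtharchive-replay \
      --mount type=bind,source=/srv/healtharchive/jobs,target=/warcs,readonly,bind-propagation=rshared \
      webrecorder/pywb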
Timeline (UTC)
- 2026-01-08T06:25:23Z — `healtharchive-warc-tiering.service` failed while attempting to operate on `/srv/healtharchive/jobs/imports/...` (stale mount: `Transport endpoint is not connected`).
- 2026-01-15T04:20:00Z — Replay smoke test metrics show `503` for legacy jobs (first observed failing `healtharchive_replay_smoke_*` timestamp).
- 2026-01-16T02:25Z — Verified replay root is up (`curl -I https://replay.healtharchive.ca/` returns `200`), but snapshot requests return `503`.
- 2026-01-16T02:30Z — Confirmed the replay container cannot read tiered import directories (`docker exec healtharchive-replay ...` shows `Transport endpoint is not connected`).
- 2026-01-16T02:51Z — Recovered by re-applying tiering + restarting replay:
  - `sudo systemctl reset-failed healtharchive-warc-tiering.service`
  - `sudo systemctl start healtharchive-warc-tiering.service`
  - `sudo systemctl restart healtharchive-replay.service`
  - `sudo systemctl start healtharchive-replay-smoke.service`
- 2026-01-16T02:51:56Z — Replay smoke metrics return to `200`:
  - `healtharchive_replay_smoke_ok{job_id="1",source="hc"} 1`
  - `healtharchive_replay_smoke_ok{job_id="2",source="cihr"} 1`
- 2026-01-16T16:00Z — Post-incident hardening: updated replay systemd unit to mount `/srv/healtharchive/jobs` with `rshared` bind propagation so pywb can observe nested mount repairs without a restart (see: `../../deployment/replay-service-pywb.md`).
Root cause
- Immediate trigger: one or more tiered paths under
/srv/healtharchive/jobs/imports/**were stale/unreadable (Errno 107: Transport endpoint is not connected), causing WARC reads inside pywb to fail. - Underlying cause(s):
healtharchive-warc-tiering.serviceremainedfailedafter a prior storage incident, so tiered import mountpoints were not being applied/validated by systemd.- The replay service is a long-running Docker container bind-mounting
/srv/healtharchive/jobsinto/warcs. Mount changes/repairs on the host can require a container restart for the container to observe a clean view of the mountpoints.
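The restart requirement follows from Linux mount propagation semantics. A quick way to inspect the mode in play on the host (a sketch):

    # Show the propagation mode of the bind source on the host.
    findmnt -o TARGET,PROPAGATION /srv/healtharchive/jobs
    # Docker's default bind propagation is rprivate: mounts repaired on the
    # host after container start are not propagated into the container's
    # mount namespace, hence the restart. shared/rshared propagates them.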
Contributing factors
- Tiered import jobs are critical to replay smoke (legacy jobs are used as smoke targets).
- Stale mount symptoms were partly masked because:
  - the Storage Box base mount looked healthy, and
  - replay root `/` still returned `200` (illustrated below).
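Concretely, the masking looked like this from outside (the snapshot URL is a placeholder for one of the legacy HC/CIHR smoke targets):

    curl -s -o /dev/null -w '%{http_code}\n' https://replay.healtharchive.ca/
    # -> 200 (root looks healthy)
    curl -s -o /dev/null -w '%{http_code}\n' 'https://replay.healtharchive.ca/<snapshot-path>'
    # -> 503 during the incident (WARC read fails inside pywb)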
Resolution / Recovery
1) Ensure WARC tiering mounts are applied and systemd is not stuck in a failed state:

    sudo systemctl reset-failed healtharchive-warc-tiering.service
    sudo systemctl start healtharchive-warc-tiering.service
    sudo systemctl status healtharchive-warc-tiering.service --no-pager -l

2) Restart replay so the container sees a clean view of `/srv/healtharchive/jobs`:

    sudo systemctl restart healtharchive-replay.service
    sudo systemctl status healtharchive-replay.service --no-pager -l

3) Re-run replay smoke and verify metrics:

    sudo systemctl start healtharchive-replay-smoke.service
    curl -s http://127.0.0.1:9100/metrics | rg '^healtharchive_replay_smoke_'
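After step 3, it is worth confirming that the host and the container agree before closing out (names taken from this incident):

    # Host view: tiering healthy and imports stat cleanly
    systemctl is-active healtharchive-warc-tiering.service
    stat /srv/healtharchive/jobs/imports >/dev/null && echo "host view OK"

    # Container view: the same tree must be readable through the bind mount
    docker exec healtharchive-replay sh -c 'ls /warcs/imports >/dev/null && echo "container view OK"'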
Post-incident hardening (durable fixes)
- Replay service mount propagation:
  - Updated `/etc/systemd/system/healtharchive-replay.service` to mount `/srv/healtharchive/jobs` as `ro,rshared` so nested bind-mount repairs (tiering/hot-path recovery) are visible inside the container.
  - Canonical doc: `../../deployment/replay-service-pywb.md`
- Tiering service resilience:
  - Updated the tiering systemd unit template to run `vps-warc-tiering-bind-mounts.sh --apply --repair-stale-mounts` so it can automatically unmount stale `Errno 107` mountpoints and re-apply binds on start (see the sketch after this list).
  - Canonical playbook: `../playbooks/storage/warc-storage-tiering.md`
- Storage hot-path auto-recovery:
  - Enabled `healtharchive-storage-hotpath-auto-recover.timer` (opt-in via sentinel file) so stale mounts are detected and recovered without requiring a manual incident response for common `Errno 107` cases.
  - Canonical playbook: `../playbooks/storage/storagebox-sshfs-stale-mount-recovery.md`
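The repair behaviour amounts to: detect Errno 107 on a bind target, lazily unmount it, then re-apply the bind. A minimal sketch of that core logic (this is not the real `vps-warc-tiering-bind-mounts.sh`; the `source target` per-line manifest format is an assumption):

    #!/usr/bin/env bash
    # Hypothetical core of --repair-stale-mounts (manifest format assumed).
    while read -r src target; do
      [ -z "$target" ] && continue
      if ! stat "$target" >/dev/null 2>&1; then
        echo "stale mount at $target; lazy-unmounting"
        umount -l "$target"                  # detach the broken mountpoint
      fi
      mountpoint -q "$target" || mount --bind "$src" "$target"  # re-apply bind
    done < /etc/healtharchive/warc-tiering.binds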
Post-incident verification
- Public surface checks: `curl -I https://replay.healtharchive.ca/ | head` returns `200`.
- Storage/mount checks: `systemctl status healtharchive-warc-tiering.service --no-pager -l` is successful.
- Replay job checks: `healtharchive_replay_smoke_ok{job_id="1",source="hc"} 1` and `...{job_id="2",source="cihr"} 1`.
Public communication (optional)
- Public status update: not posted (incident was internal and did not change public-facing expectations beyond the replay smoke targets).
- Public-safe summary: keep on file; if replay becomes a user-facing guarantee in future, revisit whether sev1 incidents should trigger a public note.
Open questions (still unknown)
- Can we make replay smoke targets independent of tiered-import mounts (e.g., keep a tiny always-local “canary replay” job) so storage tiering issues don’t mask replay regressions?
  - Decision: deferred to backlog. Tiering alerting (now implemented) addresses the immediate need for better detection; a canary replay job is a future enhancement.
- Should replay smoke include an explicit “WARC file exists + readable” check to disambiguate pywb failures vs storage failures? (A possible shape for this check is sketched below.)
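If adopted, the check could be as small as the sketch below (the WARC path is hypothetical; the point is to classify a failure as storage vs pywb before interpreting a 503):

    # Hypothetical smoke pre-check: is the target WARC readable at all?
    warc=/srv/healtharchive/jobs/imports/hc/example.warc.gz   # illustrative path
    if head -c 4 "$warc" >/dev/null 2>&1; then
      echo "WARC readable: a 503 points at pywb/replay"
    else
      echo "WARC unreadable (storage/tiering, e.g. Errno 107): route to storage recovery"
    fi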
Action items (TODOs)
- Update playbooks to call out “restart replay after mount/tiering repairs” when smoke returns `503` but replay root is `200`. (owner=eng, priority=high, due=2026-01-16)
- Enable the storage hot-path auto-recover watchdog (`healtharchive-storage-hotpath-auto-recover.timer`) after validating thresholds. (owner=eng, priority=medium, due=2026-01-16)
- Document and apply `rshared` bind propagation for the replay service so nested mount repairs are visible without restarting pywb. (owner=eng, priority=high, due=2026-01-16)
- Enable tiering health metrics + alerting so `healtharchive-warc-tiering.service` failures are visible quickly. (owner=eng, priority=medium, due=2026-01-18)
Automation opportunities
- Automate “tiering failed” detection with metrics + alerting:
  - Enable `healtharchive-tiering-metrics.timer` and alert on a sustained unhealthy signal (e.g., `healtharchive_tiering_metrics_ok == 0` or a “tiering applied” check failing).
- Keep replay smoke meaningful but safe:
  - Prefer smoke probes that are read-only and low-cost.
  - Treat `Errno 107` as an infra/storage failure class, and route recovery through the storage/tiering watchdogs rather than marking replay itself “broken”.
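As a stopgap before full alert-rule wiring, a cron-able poll of the tiering metric could look like the sketch below (the notification hook is a placeholder):

    # Alert when the tiering health metric is absent or not 1.
    ok=$(curl -s http://127.0.0.1:9100/metrics | awk '/^healtharchive_tiering_metrics_ok/ {print $2; exit}')
    if [ "${ok:-0}" != "1" ]; then
      echo "ALERT: warc tiering unhealthy (healtharchive_tiering_metrics_ok=${ok:-absent})"
      # pager/webhook notification goes here
    fi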
References / Artifacts
- Tiering manifest (VPS): `/etc/healtharchive/warc-tiering.binds`
- Tiering script (VPS): `scripts/vps-warc-tiering-bind-mounts.sh`
- Replay smoke playbook: `../playbooks/validation/replay-smoke-tests.md`
- Storage recovery playbook: `../playbooks/storage/storagebox-sshfs-stale-mount-recovery.md`