
Incident: Replay smoke tests failed (503) due to stale mounts + warc-tiering service failed (2026-01-16)

Status: closed

Metadata

  • Date (UTC): 2026-01-16
  • Severity (see severity.md): sev1
  • Environment: production
  • Primary area: replay + storage
  • Owner: (unassigned)
  • Start (UTC): 2026-01-15T04:20:00Z (first observed failing replay-smoke metrics)
  • End (UTC): 2026-01-16T02:51:56Z (replay-smoke metrics OK)

Summary

The daily replay smoke tests began returning 503 for the legacy imported jobs (HC + CIHR), even though https://replay.healtharchive.ca/ itself was up (200). The underlying issue was that the replay container could not reliably read WARCs under /srv/healtharchive/jobs/imports/** because of stale mountpoints (Transport endpoint is not connected) and because the container’s mount namespace did not reflect mounts that had since been repaired or updated on the host. Separately, healtharchive-warc-tiering.service had been left in a failed state since 2026-01-08, preventing tiered imports from being reliably mounted.

Recovery: re-apply WARC tiering, clear the failed systemd state, and restart the replay service to refresh its mounts; then re-run replay smoke tests.
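
For triage, a quick host-side check along these lines will confirm whether any tiered mountpoint has gone stale (a minimal sketch; the path prefix comes from this incident, so adjust it to the tiering manifest):

# List mounted targets under the imports tree and flag any that fail stat()
# with Errno 107 (Transport endpoint is not connected).
findmnt -rn -o TARGET | grep '^/srv/healtharchive/jobs/imports/' | while read -r mp; do
    stat -t "$mp" >/dev/null 2>&1 || echo "STALE: $mp"
done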

Impact

  • User-facing impact: replay for legacy jobs intermittently failed (HTTP 503 responses from pywb for snapshot requests).
  • Internal impact: ReplaySmokeFailed monitoring noise and operator intervention required.
  • Data impact:
      • Data loss: no evidence
      • Data integrity risk: low/unknown (symptom was read failures, not WARC corruption)
  • Recovery completeness: complete (smoke tests returned 200)
  • Duration: ~22h (first failing metric to confirmed recovery)

Detection

  • node_exporter metrics:
      • healtharchive_replay_smoke_ok{job_id="1",source="hc"} 0, with status_code ... 503
      • healtharchive_replay_smoke_ok{job_id="2",source="cihr"} 0, with status_code ... 503
  • systemd state:
      • healtharchive-warc-tiering.service had been failed since 2026-01-08 with Transport endpoint is not connected.
  • Container symptom:
      • docker exec healtharchive-replay ... ls -la /warcs/imports/... showed d????????? entries and Transport endpoint is not connected (reproduced below).
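
The container-side symptom can be reproduced with a read-only listing of the bind-mounted tree (a minimal sketch; the specific job directories from the incident are elided):

# Inside the replay container, stale mountpoints show up as d????????? entries
# and "Transport endpoint is not connected" errors.
docker exec healtharchive-replay ls -la /warcs/imports/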

Decisions

  • 2026-01-16T02:51:00Z — Decision: restart replay after fixing tiering mounts (why: quickest way to ensure the pywb container sees a clean view of /srv/healtharchive/jobs and can read WARCs; risks: brief replay downtime, but no data mutation).
  • 2026-01-16T16:00:00Z — Decision (post-incident hardening): run pywb with rshared bind propagation for /srv/healtharchive/jobs (why: allow the container to observe repaired nested mounts without requiring an additional restart; risks: broader mount propagation surface, but still read-only inside the container).

Timeline (UTC)

  • 2026-01-08T06:25:23Z — healtharchive-warc-tiering.service failed while attempting to operate on /srv/healtharchive/jobs/imports/... (stale mount: Transport endpoint is not connected).
  • 2026-01-15T04:20:00Z — Replay smoke test metrics show 503 for legacy jobs (first observed failing healtharchive_replay_smoke_* timestamp).
  • 2026-01-16T02:25Z — Verified the replay root was up (curl -I https://replay.healtharchive.ca/ returned 200), but snapshot requests still returned 503.
  • 2026-01-16T02:30Z — Confirmed the replay container could not read tiered import directories (docker exec healtharchive-replay ... showed Transport endpoint is not connected).
  • 2026-01-16T02:51Z — Recovered by re-applying tiering and restarting replay:
      • sudo systemctl reset-failed healtharchive-warc-tiering.service
      • sudo systemctl start healtharchive-warc-tiering.service
      • sudo systemctl restart healtharchive-replay.service
      • sudo systemctl start healtharchive-replay-smoke.service
  • 2026-01-16T02:51:56Z — Replay smoke metrics returned to OK (status_code 200):
      • healtharchive_replay_smoke_ok{job_id="1",source="hc"} 1
      • healtharchive_replay_smoke_ok{job_id="2",source="cihr"} 1
  • 2026-01-16T16:00Z — Post-incident hardening: updated replay systemd unit to mount /srv/healtharchive/jobs with rshared bind propagation so pywb can observe nested mount repairs without a restart (see: ../../deployment/replay-service-pywb.md).

Root cause

  • Immediate trigger: one or more tiered paths under /srv/healtharchive/jobs/imports/** were stale/unreadable (Errno 107: Transport endpoint is not connected), causing WARC reads inside pywb to fail.
  • Underlying cause(s):
      • healtharchive-warc-tiering.service remained failed after a prior storage incident, so tiered import mountpoints were not being applied or validated by systemd.
      • The replay service is a long-running Docker container that bind-mounts /srv/healtharchive/jobs into /warcs. Mount changes or repairs on the host can require a container restart before the container sees a clean view of those mountpoints (see the propagation check below).
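
One way to see why a restart was needed: the propagation mode of the host-side bind mount determines whether sub-mounts created or repaired after the container starts become visible inside it. A minimal check, assuming the paths from this incident:

# On the host: a PROPAGATION of "private" means later sub-mounts do not
# propagate into running containers; "shared"/"rshared" means they do.
findmnt -T /srv/healtharchive/jobs -o TARGET,PROPAGATION

# Inside the container: compare what pywb actually sees
# (if findmnt is not in the image, /proc/self/mountinfo shows the same data).
docker exec healtharchive-replay findmnt -o TARGET,PROPAGATION /warcs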

Contributing factors

  • Tiered import jobs are critical to replay smoke (the legacy jobs are used as smoke targets).
  • Stale mount symptoms were partly masked because:
      • the Storage Box base mount looked healthy, and
      • the replay root (/) still returned 200.

Resolution / Recovery

1) Ensure WARC tiering mounts are applied and systemd is not stuck in a failed state:

sudo systemctl reset-failed healtharchive-warc-tiering.service
sudo systemctl start healtharchive-warc-tiering.service
sudo systemctl status healtharchive-warc-tiering.service --no-pager -l

2) Restart replay so the container sees a clean view of /srv/healtharchive/jobs:

sudo systemctl restart healtharchive-replay.service
sudo systemctl status healtharchive-replay.service --no-pager -l

3) Re-run replay smoke and verify metrics:

sudo systemctl start healtharchive-replay-smoke.service
curl -s http://127.0.0.1:9100/metrics | rg '^healtharchive_replay_smoke_'
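
Healthy output matches the post-recovery metrics from this incident:

healtharchive_replay_smoke_ok{job_id="1",source="hc"} 1
healtharchive_replay_smoke_ok{job_id="2",source="cihr"} 1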

Post-incident hardening (durable fixes)

  • Replay service mount propagation:
      • Updated /etc/systemd/system/healtharchive-replay.service to mount /srv/healtharchive/jobs as ro,rshared so nested bind-mount repairs (tiering/hot-path recovery) are visible inside the container (see the sketch after this list).
      • Canonical doc: ../../deployment/replay-service-pywb.md
  • Tiering service resilience:
      • Updated the tiering systemd unit template to run vps-warc-tiering-bind-mounts.sh --apply --repair-stale-mounts so it can automatically unmount stale Errno 107 mountpoints and re-apply binds on start.
      • Canonical playbook: ../playbooks/storage/warc-storage-tiering.md
  • Storage hot-path auto-recovery:
      • Enabled healtharchive-storage-hotpath-auto-recover.timer (opt-in via sentinel file) so stale mounts are detected and recovered without requiring a manual incident response for common Errno 107 cases.
      • Canonical playbook: ../playbooks/storage/storagebox-sshfs-stale-mount-recovery.md
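
As a rough illustration of the mount-propagation change (not the exact unit contents; see the canonical doc above), the container's bind mount would carry rshared propagation along these lines, and the host path must itself be a shared mount for propagation to work:

# Host side: the jobs tree has to be a (r)shared mount; if it is not already
# its own mountpoint, bind it onto itself first (illustrative, not the unit's exact commands).
mount --bind /srv/healtharchive/jobs /srv/healtharchive/jobs
mount --make-rshared /srv/healtharchive/jobs

# Container side (fragment of the docker run invocation; other flags and the
# image name are omitted here):
docker run \
    --mount type=bind,source=/srv/healtharchive/jobs,target=/warcs,readonly,bind-propagation=rshared \
    ...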

Post-incident verification

  • Public surface checks:
      • curl -I https://replay.healtharchive.ca/ | head returns 200.
  • Storage/mount checks:
      • systemctl status healtharchive-warc-tiering.service --no-pager -l shows the unit completed successfully (not failed).
  • Replay job checks:
      • healtharchive_replay_smoke_ok{job_id="1",source="hc"} 1 and ...{job_id="2",source="cihr"} 1

Public communication (optional)

  • Public status update: not posted (incident was internal and did not change public-facing expectations beyond the replay smoke targets).
  • Public-safe summary: keep on file; if replay becomes a user-facing guarantee in future, revisit whether sev1 incidents should trigger a public note.

Open questions (still unknown)

  • Can we make replay smoke targets independent of tiered-import mounts (e.g., keep a tiny always-local “canary replay” job) so storage tiering issues don’t mask replay regressions?
      • Decision: deferred to backlog. Tiering alerting (now implemented) addresses the immediate need for better detection; a canary replay job is a future enhancement.
  • Should replay smoke include an explicit “WARC file exists + readable” check to disambiguate pywb failures from storage failures? (A rough sketch of such a check follows.)
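
If that check is added, a rough shape for it (hypothetical snippet; the real smoke script, WARC path, and output format would differ) could be:

# Hypothetical pre-check: confirm the target WARC is present and readable before
# issuing the pywb snapshot request, so a 503 can be attributed to storage vs replay.
warc=/srv/healtharchive/jobs/imports/example-job/example.warc.gz   # hypothetical path
if head -c 4 "$warc" >/dev/null 2>&1; then
    echo "warc readable: $warc"
else
    echo "warc unreadable (likely storage/mount issue, not pywb): $warc" >&2
fi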

Action items (TODOs)

  • Update playbooks to call out “restart replay after mount/tiering repairs” when smoke returns 503 but replay root is 200. (owner=eng, priority=high, due=2026-01-16)
  • Enable the storage hot-path auto-recover watchdog (healtharchive-storage-hotpath-auto-recover.timer) after validating thresholds. (owner=eng, priority=medium, due=2026-01-16)
  • Document and apply rshared bind propagation for the replay service so nested mount repairs are visible without restarting pywb. (owner=eng, priority=high, due=2026-01-16)
  • Enable tiering health metrics + alerting so healtharchive-warc-tiering.service failures are visible quickly. (owner=eng, priority=medium, due=2026-01-18)

Automation opportunities

  • Automate “tiering failed” detection with metrics + alerting:
      • Enable healtharchive-tiering-metrics.timer and alert on a sustained unhealthy signal (e.g., healtharchive_tiering_metrics_ok == 0 or a “tiering applied” check failing); a quick manual check is sketched after this list.
  • Keep replay smoke meaningful but safe:
      • Prefer smoke probes that are read-only and low-cost.
      • Treat Errno 107 as an infra/storage failure class, and route recovery through the storage/tiering watchdogs rather than marking replay itself “broken”.
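
Until alerting is wired up, the same signal can be checked manually from node_exporter (a sketch; the metric name is the one referenced above and may differ in the final implementation):

# A value of 0 (or a missing series) indicates tiering is unhealthy / not applied.
curl -s http://127.0.0.1:9100/metrics | rg '^healtharchive_tiering_metrics_ok'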

References / Artifacts

  • Tiering manifest (VPS): /etc/healtharchive/warc-tiering.binds
  • Tiering script (VPS): scripts/vps-warc-tiering-bind-mounts.sh
  • Replay smoke playbook: ../playbooks/validation/replay-smoke-tests.md
  • Storage recovery playbook: ../playbooks/storage/storagebox-sshfs-stale-mount-recovery.md