Storage Box / sshfs stale mount recovery (Errno 107)

Use this playbook when HealthArchive crawls/indexing/metrics start failing with:

  • OSError: [Errno 107] Transport endpoint is not connected

This typically indicates a stale FUSE mount (often sshfs) where the mountpoint still exists, but basic filesystem operations (stat, ls, is_dir) fail.
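A quick, read-only probe for this condition (a sketch; the timeout guards against a wedged mount that hangs instead of failing fast):

timeout 5 stat /srv/healtharchive/storagebox >/dev/null 2>&1 \
  && echo "OK: mountpoint stats cleanly" \
  || echo "BAD: stat failed or hung (stale mount suspected)"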

Operational note:

  • The worker skips jobs that recently failed with crawler_status=infra_error for a short cooldown window to prevent retry storms. This reduces alert noise but does not fix the underlying mount issue; use this playbook (or the hot-path auto-recover automation) to repair the stale mountpoint.

For background and the full implementation plan (prevention + automation + integrity), see:

  • ../../../planning/implemented/2026-01-08-storagebox-sshfs-stale-mount-recovery-and-integrity.md
  • Drills (safe on production): storagebox-sshfs-stale-mount-drills.md

Quick triage (60 seconds)

On the VPS (/opt/healtharchive):

0) Capture an evidence bundle (recommended, read-only):

cd /opt/healtharchive
./scripts/vps-capture-hotpath-staleness-evidence.sh --tag pre-repair

Optional: if the affected crawl campaign year differs from the current UTC year, pass it explicitly:

cd /opt/healtharchive
./scripts/vps-capture-hotpath-staleness-evidence.sh --tag pre-repair --year 2026

This writes a timestamped directory under:

  • /srv/healtharchive/ops/observability/hotpath-staleness/

Optional (recommended): after recovery actions complete, capture a second bundle for comparison:

cd /opt/healtharchive
./scripts/vps-capture-hotpath-staleness-evidence.sh --tag post-repair

Optional: diff the latest pre-repair vs post-repair bundles:

root=/srv/healtharchive/ops/observability/hotpath-staleness
before=$(ls -1dt "${root}"/*pre-repair 2>/dev/null | head -n 1)
after=$(ls -1dt "${root}"/*post-repair 2>/dev/null | head -n 1)
./scripts/vps-diff-hotpath-staleness-evidence.sh --before "${before}" --after "${after}"
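Those ls lookups return empty strings when no bundle matches; a small guard (a sketch) avoids invoking the diff script with empty arguments:

if [ -z "${before}" ] || [ -z "${after}" ]; then
  echo "missing pre-repair or post-repair bundle under ${root}" >&2
else
  ./scripts/vps-diff-hotpath-staleness-evidence.sh --before "${before}" --after "${after}"
fi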

1) Snapshot current crawl state:

./scripts/vps-crawl-status.sh --year "$(date -u +%Y)"

Optional: if Phase 2 automation has been enabled, check whether it is already attempting recovery (it is disabled by default and runs only when the sentinel file exists):

systemctl status healtharchive-storage-hotpath-auto-recover.timer --no-pager -l || true
ls -la /etc/healtharchive/storage-hotpath-auto-recover-enabled 2>/dev/null || true
cat /srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json 2>/dev/null || true

If the worker auto-start watchdog is enabled (optional), check it too:

systemctl status healtharchive-worker-auto-start.timer --no-pager -l || true
ls -la /etc/healtharchive/worker-auto-start-enabled 2>/dev/null || true
cat /srv/healtharchive/ops/watchdog/worker-auto-start.json 2>/dev/null || true

Watchdog failure-mode matrix (deterministic triage)

Use this matrix before manual repair so you can classify the watchdog state quickly.

Primary evidence commands:

cat /srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json 2>/dev/null || true
curl -s http://127.0.0.1:9100/metrics | rg '^healtharchive_storage_hotpath_auto_recover_(metrics_ok|detected_targets|deploy_lock_active|last_apply_ok|last_apply_timestamp_seconds|apply_total)'
For each failure class below, the typical signals, meaning, and operator action are:

  • Stale targets detected, no apply attempted yet
    Typical signals: detected_targets > 0, apply_total unchanged, no fresh last_apply_timestamp_seconds
    Meaning: target not yet eligible (confirm-runs/min-age/rate-limit gate)
    Operator action: keep observing for a short window (see the observation loop below); if persistent, run the dry-run drill and then the manual recovery steps below

  • Deploy-lock suppression
    Typical signals: deploy_lock_active == 1, state has last_skip_reason=deploy_lock
    Meaning: apply mode intentionally downgraded to safe dry-run during a deploy
    Operator action: finish the deploy first, then re-check watchdog state; do not force overlapping recovery

  • Apply attempted and failed
    Typical signals: apply_total > 0, last_apply_ok == 0 (especially if older than 24h)
    Meaning: recovery ran but the post-check failed (mount not restored/readable)
    Operator action: follow the full recovery sequence in this playbook; inspect last_apply_errors and last_apply_warnings in the watchdog state JSON

  • Watchdog metrics stale/missing
    Typical signals: metrics_ok == 0, or the stale-metric-timestamp alert is firing
    Meaning: the timer or script is failing before/while writing metrics
    Operator action: check timer/service logs, fix watchdog execution first, then re-run the dry-run drill
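If you land in the first failure class (detected but not yet applied), a short observation loop (a sketch) shows whether the eligibility gate clears on its own:

for i in 1 2 3 4; do
  date -u
  curl -s http://127.0.0.1:9100/metrics \
    | rg '^healtharchive_storage_hotpath_auto_recover_(detected_targets|apply_total|last_apply_ok)'
  sleep 30
done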

2) Confirm Storage Box base mount health:

mount | rg '/srv/healtharchive/storagebox'
ls -la /srv/healtharchive/storagebox >/dev/null && echo "OK: storagebox readable" || echo "BAD: storagebox unreadable"

3) Identify broken “hot paths” (job output dirs):

./scripts/vps-crawl-status.sh --year "$(date -u +%Y)" | rg '^Output dir:'
ls -la /srv/healtharchive/jobs/hc/  # replace with source path(s) as needed

If ls shows "Transport endpoint is not connected" or permission strings like d????????? for job output dirs, continue.
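To probe every reported output dir in one pass (a sketch; it assumes the status script prints lines of the form "Output dir: <path>"):

./scripts/vps-crawl-status.sh --year "$(date -u +%Y)" \
  | rg '^Output dir:' \
  | sed 's/^Output dir:[[:space:]]*//' \
  | while read -r dir; do
      timeout 5 ls "${dir}" >/dev/null 2>&1 && echo "OK  ${dir}" || echo "BAD ${dir}"
    done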


Recovery procedure (safe ordering)

1) Stop the worker

Stop the worker first to prevent repeated filesystem touches while mounts are broken:

sudo systemctl stop healtharchive-worker.service

Note: if healtharchive-worker-auto-start.timer is enabled, it may restart the worker while you are mid-repair. Either:

  • temporarily disable the timer, or
  • temporarily remove /etc/healtharchive/worker-auto-start-enabled,

then re-enable after recovery.
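For example, to pause the auto-start timer for the duration of the repair (a sketch using the unit name shown earlier; stopping the timer does not disable it across reboots):

sudo systemctl stop healtharchive-worker-auto-start.timer
# ...perform the recovery steps below...
sudo systemctl start healtharchive-worker-auto-start.timer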

Optional: if you suspect a crawler container is still running and stuck on IO, inspect it:

docker ps --format 'table {{.ID}}\t{{.Image}}\t{{.Names}}\t{{.Status}}' | rg 'zimit|openzim' || true

Only stop a container if you’re sure it is part of the broken job and not making progress.


2) Identify stale mountpoints (targeted)

This incident class often affects specific job output directories (not necessarily the whole Storage Box mount).

For each affected job output dir (examples shown):

mount | rg '/srv/healtharchive/jobs/(hc|phac|cihr)/'
sudo findmnt -T /srv/healtharchive/jobs/hc/<JOB_DIR> || true

If ls against a path returns Transport endpoint is not connected, treat it as stale.
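To enumerate and probe every mountpoint under the jobs tree in one read-only pass (a sketch):

findmnt -rn -o TARGET | rg '^/srv/healtharchive/jobs/' | while read -r mp; do
  timeout 5 ls "${mp}" >/dev/null 2>&1 && echo "OK    ${mp}" || echo "STALE ${mp}"
done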


3) Unmount stale hot paths (use umount first, then -l only if needed)

For each stale mountpoint, try:

sudo umount /srv/healtharchive/jobs/<source>/<JOB_DIR>

If it fails and the path is still broken/unstat’able, use lazy unmount:

sudo umount -l /srv/healtharchive/jobs/<source>/<JOB_DIR>

Notes:

  • Use targeted unmounts only (specific job dirs), not broad parent directories.
  • umount -l is an emergency tool; use it only for confirmed-stale mountpoints.
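For a confirmed-stale mountpoint, the two attempts can be combined into one line (a sketch; substitute the real path for the placeholder):

mp=/srv/healtharchive/jobs/<source>/<JOB_DIR>  # placeholder: the confirmed-stale path
sudo umount "${mp}" || sudo umount -l "${mp}"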

4) Re-apply tiering mounts

1) Re-apply WARC tiering bind mounts (manifest-driven):

sudo ./scripts/vps-warc-tiering-bind-mounts.sh --apply

If you have confirmed-stale mountpoints and want the script to attempt targeted repair automatically (still requires the worker to be stopped first):

sudo ./scripts/vps-warc-tiering-bind-mounts.sh --apply --repair-stale-mounts

If this fails with Errno 107 under /srv/healtharchive/jobs/imports/..., unmount those stale import mountpoints too and re-run.
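A sketch of that targeted cleanup, assuming the stale mountpoints sit under /srv/healtharchive/jobs/imports/ (probe each, unmount only the broken ones, then re-run the apply):

findmnt -rn -o TARGET | rg '^/srv/healtharchive/jobs/imports/' | while read -r mp; do
  if ! timeout 5 ls "${mp}" >/dev/null 2>&1; then
    sudo umount "${mp}" || sudo umount -l "${mp}"
  fi
done
sudo ./scripts/vps-warc-tiering-bind-mounts.sh --apply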

If the systemd unit is in a failed state, clear it and re-run (prevents repeated WarcTieringFailed alerts):

systemctl is-failed healtharchive-warc-tiering.service && sudo systemctl reset-failed healtharchive-warc-tiering.service || true
sudo systemctl start healtharchive-warc-tiering.service

2) Re-apply annual output tiering (campaign job output dirs → Storage Box):

Preferred (avoids the systemd unit’s internal worker stop/start):

# Ensure we target the production DB (Postgres), not a local fallback (SQLite):
set -a; source /etc/healtharchive/backend.env; set +a
systemctl is-active postgresql.service

sudo --preserve-env=HEALTHARCHIVE_DATABASE_URL,HEALTHARCHIVE_ARCHIVE_ROOT \
  /opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --apply --year "$(date -u +%Y)"

If you want the script to attempt targeted repair for stale mountpoints (Errno 107), pass:

sudo --preserve-env=HEALTHARCHIVE_DATABASE_URL,HEALTHARCHIVE_ARCHIVE_ROOT \
  /opt/healtharchive/.venv/bin/python3 /opt/healtharchive/scripts/vps-annual-output-tiering.py --apply --repair-stale-mounts --allow-repair-running-jobs --year "$(date -u +%Y)"

Alternative (uses the systemd unit, which stops/starts the worker internally):

sudo systemctl start healtharchive-annual-output-tiering.service

5) Recover job state (stuck running → retryable)

Load the backend env (production DB connection):

set -a; source /etc/healtharchive/backend.env; set +a

Recover stale jobs:

/opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs --older-than-minutes 5 --apply --limit 25

If a job ended up failed due to the mount issue and you want it to run again:

/opt/healtharchive/.venv/bin/healtharchive retry-job --id <JOB_ID>

6) Restart the worker

sudo systemctl start healtharchive-worker.service

Replay note (after mount repairs)

If replay smoke tests start returning 503 for previously indexed jobs after a mount/tiering incident, restart replay to refresh its view of /srv/healtharchive/jobs:

sudo systemctl restart healtharchive-replay.service
sudo systemctl start healtharchive-replay-smoke.service
curl -s http://127.0.0.1:9100/metrics | rg '^healtharchive_replay_smoke_'

Validation (confirm we’re actually healthy)

1) Worker is running and picking jobs

sudo systemctl status healtharchive-worker.service --no-pager -l
sudo journalctl -u healtharchive-worker.service -n 80 --no-pager -l

2) Crawls are making progress (not just “running”)

Pick the active job ID and check progress:

./scripts/vps-crawl-status.sh --year "$(date -u +%Y)" --job-id <JOB_ID>

Look for:

  • crawlStatus counters increasing over time (crawled ticks up).
  • healtharchive_crawl_running_job_stalled == 0.
  • last_progress_age_seconds small (tens of seconds to a few minutes).
  • healtharchive_crawl_running_job_state_parse_ok == 1 (state file readable; no sshfs weirdness).
  • healtharchive_crawl_running_job_container_restarts_done not climbing rapidly (avoid restart thrash).
  • new .warc.gz files appearing under the job’s active temp dir.
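A quick way to spot-check these signals over a few minutes (a sketch; the rg pattern matches the metric names listed above, and the find path is a placeholder):

watch -n 30 "curl -s http://127.0.0.1:9100/metrics | rg '^healtharchive_crawl_running_job_(stalled|state_parse_ok|container_restarts_done)'"

# New WARC output in the last 10 minutes (substitute the real job dir):
find /srv/healtharchive/jobs/<source>/<JOB_DIR> -name '*.warc.gz' -mmin -10 | head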

3) Metrics writers are healthy

sudo systemctl start healtharchive-crawl-metrics.service
sudo systemctl start healtharchive-tiering-metrics.service
sudo systemctl status healtharchive-crawl-metrics.service healtharchive-tiering-metrics.service --no-pager -l

If recovery fails

If hot paths are still unreadable after unmount + tiering reapply:

1) Verify Storage Box base mount is readable:

ls -la /srv/healtharchive/storagebox >/dev/null && echo OK || echo BAD
sudo systemctl status healtharchive-storagebox-sshfs.service --no-pager -l

2) Consider restarting the base mount:

sudo systemctl restart healtharchive-storagebox-sshfs.service

3) Re-run tiering reapply steps (WARC tiering + annual output tiering).

If this becomes a recurring pattern, treat it as an infrastructure incident and follow:

  • ../core/incident-response.md

If the persistent failed-apply alert is active (HealthArchiveStorageHotpathApplyFailedPersistent):

  1. Capture last_apply_errors / last_apply_warnings from /srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json (see the jq sketch below).
  2. Run this playbook’s ordered recovery sequence (worker quiesce -> targeted unmount -> tiering re-apply -> stale job recover -> worker restart).
  3. Run the dry-run drill from storagebox-sshfs-stale-mount-drills.md to confirm planned actions are now sane.
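A convenience sketch for step 1, assuming jq is installed and the state file carries the fields referenced above (field names are not guaranteed; inspect the raw JSON if the filter returns nulls):

jq '{last_apply_ok, last_apply_errors, last_apply_warnings, last_skip_reason}' \
  /srv/healtharchive/ops/watchdog/storage-hotpath-auto-recover.json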

sshfs tuning options

The healtharchive-storagebox-sshfs.service uses these sshfs options:

-o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,kernel_cache

These defaults are tuned for reliability:

  • reconnect - automatically reconnect when the SSH connection drops
  • ServerAliveInterval=15 - send SSH keepalives every 15 seconds
  • ServerAliveCountMax=3 - disconnect after 3 missed keepalives (~45s)
  • kernel_cache - use kernel caching for better performance
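For reference, this is roughly how those options appear on an sshfs command line (illustrative only; the user, host, and remote path are placeholders, and the real values live in the systemd unit and /etc/healtharchive/storagebox.env):

sshfs <user>@<storagebox-host>: /srv/healtharchive/storagebox \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,kernel_cache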

If you experience frequent Errno 107 issues, consider these additional options in /etc/healtharchive/storagebox.env (requires service restart):

  • ServerAliveCountMax=5 - tolerate more keepalive misses than the default of 3. When to use: unreliable networks with brief dropouts.
  • ConnectTimeout=30 - limit the initial connection wait. When to use: slow networks, to avoid long hangs.
  • max_write=65536 - use smaller write chunks. When to use: large file writes cause timeouts.
  • workaround=rename - better rename handling. When to use: file moves fail intermittently.
  • auto_cache - smarter caching based on mtime. When to use: you see stale data.

Note: Changing sshfs options can have unintended effects on performance and behavior. Test changes in a non-production environment first.