Incident response playbook (operators)

Goal: restore service safely and capture enough context to prevent repeat incidents.

Canonical references:

  • Production runbook: ../../../deployment/production-single-vps.md
  • Monitoring checklist: ../../monitoring-and-ci-checklist.md
  • Service levels: ../../service-levels.md — for communication commitments and SLOs
  • Escalation procedures: ../../escalation-procedures.md
  • Assistant-guided production sessions: assistant-guided-production-session.md
  • Disaster recovery runbook: ../../../deployment/disaster-recovery.md
  • Baseline drift: ../../baseline-drift.md
  • Incident notes (template + where to file): ../../incidents/README.md
  • Ops runbooks (quick response procedures): ../../runbooks/README.md

First: start an incident note

As soon as you suspect this is “an incident” (not routine maintenance), start a note so you can record a timeline and the exact recovery steps.

  • Create a new file: docs/operations/incidents/YYYY-MM-DD-short-slug.md
  • Copy the template: docs/_templates/incident-template.md
  • Pick an initial severity using: docs/operations/incidents/severity.md

If you can’t easily edit the repo on the VPS, capture the note in a local scratchpad and copy it into the repo later.
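
If you can edit the repo, a minimal sketch of starting the note from the template (assuming a POSIX shell at the repo root; the slug value is a hypothetical example you replace per incident):

  # Copy the template into a dated incident note.
  slug="api-5xx-spike"                              # hypothetical slug; keep it short and descriptive
  note="docs/operations/incidents/$(date +%F)-${slug}.md"
  cp docs/_templates/incident-template.md "$note"
  ${EDITOR:-vi} "$note"                             # record the timeline and initial severity as you work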

Operating rule: repo-first remediation

When an incident appears to require a backend behavior change, source-profile change, scope fix, watchdog change, or CLI reconciliation fix, do not jump straight from diagnosis into VPS recovery commands.

Use this order instead:

  1. Classify the incident first (storage, stale state, scope/config drift, or crawler/site compatibility).
  2. If the fix lives in the repo, make the change in the repo first.
  3. Commit and push the change.
  4. Deploy a pinned ref on the VPS via deploy-and-verify.md.
  5. Verify the VPS checkout contains the intended change.
  6. Only then run reconcile/recover/restart commands that depend on that fix.

This project is an archive, not a just-keep-it-running service. Prefer one auditable, versioned fix plus one controlled recovery over repeated ad hoc retries against stale code.
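
As a sketch of step 5, assuming the live checkout at /opt/healtharchive is a git working tree and that deploy-and-verify.md remains the canonical deploy gate:

  # After committing, pushing, and deploying via deploy-and-verify.md,
  # confirm the live checkout contains the pinned ref before any recovery.
  expected="abc1234"                                # hypothetical: short hash of the pinned ref you deployed
  deployed=$(git -C /opt/healtharchive rev-parse --short HEAD)
  if [ "$deployed" != "$expected" ]; then
    echo "live checkout at $deployed, expected $expected; stop and redeploy" >&2
    exit 1
  fi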

When the site/API looks broken

  1. Confirm what’s failing (public surface):
     cd /opt/healtharchive && ./scripts/verify_public_surface.py
  2. Check services:
     sudo systemctl status healtharchive-api healtharchive-worker --no-pager -l
  3. Check recent logs:
     sudo journalctl -u healtharchive-api -n 200 --no-pager
     sudo journalctl -u healtharchive-worker -n 200 --no-pager
  4. Check baseline drift (production correctness):
     ./scripts/check_baseline_drift.py --mode live
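
The four checks above can be chained into a single triage pass; a sketch that stops at the first failing check (set -e aborts on a non-running unit or a failed verification script):

  #!/usr/bin/env bash
  # One-pass triage: run the checks above in order, stop on the first failure.
  set -euo pipefail
  cd /opt/healtharchive
  ./scripts/verify_public_surface.py
  sudo systemctl status healtharchive-api healtharchive-worker --no-pager -l
  sudo journalctl -u healtharchive-api -n 200 --no-pager
  sudo journalctl -u healtharchive-worker -n 200 --no-pager
  ./scripts/check_baseline_drift.py --mode live
  echo "all checks passed"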

When jobs are stuck (crawl/indexing pipeline)

If the worker is running but jobs never advance, check for a job stuck in status=running after a reboot or unexpected termination.

  1. Load production environment (so the CLI targets Postgres):
     set -a; source /etc/healtharchive/backend.env; set +a
  2. Inspect recent jobs:
     /opt/healtharchive/.venv/bin/healtharchive list-jobs --limit 50
  3. Decide whether recovery depends on undeployed repo changes. If yes, stop here, follow deploy-and-verify.md first, and verify the live checkout contains the intended change before continuing.
  4. Recover stale running jobs (safe dry-run first):
     /opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs --older-than-minutes 180
  5. Apply (sets status=retryable):
     /opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs --older-than-minutes 180 --apply
  6. Verify the worker picks them up:
     sudo journalctl -u healtharchive-worker -n 200 --no-pager
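
A sketch that enforces the dry-run-first pattern using only the CLI calls above, pausing for an explicit confirmation before --apply:

  # Dry-run first, then require an explicit "y" before mutating job state.
  set -a; source /etc/healtharchive/backend.env; set +a
  cli=/opt/healtharchive/.venv/bin/healtharchive
  "$cli" recover-stale-jobs --older-than-minutes 180            # dry run: review the listed jobs
  read -r -p "Mark the jobs above retryable? [y/N] " answer
  if [ "$answer" = "y" ]; then
    "$cli" recover-stale-jobs --older-than-minutes 180 --apply
    sudo journalctl -u healtharchive-worker -n 50 --no-pager    # watch for pickup
  fi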

If you need to deploy a fix

  • Follow deploy-and-verify.md (don’t skip the deploy gate).

What “done” means

  • The public surface verification passes again.
  • The underlying cause is identified (config drift, failed migration, disk, external dependency, etc.).
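
A sketch of a closing check, assuming both verification scripts signal failure with a nonzero exit status:

  # Run before closing the incident.
  cd /opt/healtharchive
  if ./scripts/verify_public_surface.py && ./scripts/check_baseline_drift.py --mode live; then
    echo "verification passes; finish the incident note with cause and follow-ups"
  else
    echo "verification still failing; keep the incident open" >&2
  fi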