Incident response playbook (operators)

Goal: restore service safely and capture enough context to prevent repeat incidents.

Canonical references:

  • Production runbook: ../../../deployment/production-single-vps.md
  • Monitoring checklist: ../../monitoring-and-ci-checklist.md
  • Service levels: ../../service-levels.md — for communication commitments and SLOs
  • Escalation procedures: ../../escalation-procedures.md
  • Assistant-guided production sessions: assistant-guided-production-session.md
  • Disaster recovery runbook: ../../../deployment/disaster-recovery.md
  • Baseline drift: ../../baseline-drift.md
  • Incident notes (template + where to file): ../../incidents/README.md
  • Ops runbooks (quick response procedures): ../../runbooks/README.md

First: start an incident note

As soon as you suspect this is “an incident” (not routine maintenance), start a note so you can record a timeline and the exact recovery steps.

  • Create a new file: docs/operations/incidents/YYYY-MM-DD-short-slug.md
  • Copy the template: docs/_templates/incident-template.md
  • Pick an initial severity using: docs/operations/incidents/severity.md

If you can’t easily edit the repo on the VPS, capture the note in a local scratchpad and copy it into the repo later.
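
If you can edit the repo, a minimal sketch of starting the note from the template (assuming a POSIX shell at the repo root; the slug value is a hypothetical example you replace per incident):

  # Copy the template into a dated incident note.
  slug="api-5xx-spike"                              # hypothetical slug; keep it short and descriptive
  note="docs/operations/incidents/$(date +%F)-${slug}.md"
  cp docs/_templates/incident-template.md "$note"
  ${EDITOR:-vi} "$note"                             # record the timeline and initial severity as you work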

Operating rule: repo-first remediation

When an incident appears to require a backend behavior change, source-profile change, scope fix, watchdog change, or CLI reconciliation fix, do not jump straight from diagnosis into VPS recovery commands.

Use this order instead:

  1. Classify the incident first (storage, stale state, scope/config drift, or crawler/site compatibility).
  2. If the fix lives in the repo, make the change in the repo first.
  3. Commit and push the change.
  4. Deploy a pinned ref on the VPS via deploy-and-verify.md.
  5. Verify the VPS checkout contains the intended change.
  6. Only then run reconcile/recover/restart commands that depend on that fix.

This project is an archive, not a just-keep-it-running service. Prefer one auditable, versioned fix plus one controlled recovery over repeated ad hoc retries against stale code.
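
As a sketch of step 5, assuming the live checkout at /opt/healtharchive is a git working tree and that deploy-and-verify.md remains the canonical deploy gate:

  # After committing, pushing, and deploying via deploy-and-verify.md,
  # confirm the live checkout contains the pinned ref before any recovery.
  expected="abc1234"                                # hypothetical: short hash of the pinned ref you deployed
  deployed=$(git -C /opt/healtharchive rev-parse --short HEAD)
  if [ "$deployed" != "$expected" ]; then
    echo "live checkout at $deployed, expected $expected; stop and redeploy" >&2
    exit 1
  fi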

When the site/API looks broken

  1. Confirm what’s failing (public surface):
     cd /opt/healtharchive && ./scripts/verify_public_surface.py
  2. Check services:
     sudo systemctl status healtharchive-api healtharchive-worker --no-pager -l
  3. Check recent logs:
     sudo journalctl -u healtharchive-api -n 200 --no-pager
     sudo journalctl -u healtharchive-worker -n 200 --no-pager
  4. Check baseline drift (production correctness):
     ./scripts/check_baseline_drift.py --mode live
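
The four checks above can be chained into a single triage pass; a sketch that stops at the first failing check (set -e aborts on a non-running unit or a failed verification script):

  #!/usr/bin/env bash
  # One-pass triage: run the checks above in order, stop on the first failure.
  set -euo pipefail
  cd /opt/healtharchive
  ./scripts/verify_public_surface.py
  sudo systemctl status healtharchive-api healtharchive-worker --no-pager -l
  sudo journalctl -u healtharchive-api -n 200 --no-pager
  sudo journalctl -u healtharchive-worker -n 200 --no-pager
  ./scripts/check_baseline_drift.py --mode live
  echo "all checks passed"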

When jobs are stuck (crawl/indexing pipeline)

If the worker is running but jobs never advance, check for a job stuck in status=running after a reboot or unexpected termination.

  1. Load production environment (so the CLI targets Postgres):
     set -a; source /etc/healtharchive/backend.env; set +a
  2. Inspect recent jobs:
     /opt/healtharchive/.venv/bin/healtharchive list-jobs --limit 50
  3. Decide whether recovery depends on undeployed repo changes. If yes, stop here, follow deploy-and-verify.md first, and verify the live checkout contains the intended change before continuing.
  4. Recover stale running jobs (safe dry-run first):
     /opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs --older-than-minutes 180
  5. Apply (sets status=retryable):
     /opt/healtharchive/.venv/bin/healtharchive recover-stale-jobs --older-than-minutes 180 --apply
  6. Verify the worker picks them up:
     sudo journalctl -u healtharchive-worker -n 200 --no-pager
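
A sketch that enforces the dry-run-first pattern using only the CLI calls above, pausing for an explicit confirmation before --apply:

  # Dry-run first, then require an explicit "y" before mutating job state.
  set -a; source /etc/healtharchive/backend.env; set +a
  cli=/opt/healtharchive/.venv/bin/healtharchive
  "$cli" recover-stale-jobs --older-than-minutes 180            # dry run: review the listed jobs
  read -r -p "Mark the jobs above retryable? [y/N] " answer
  if [ "$answer" = "y" ]; then
    "$cli" recover-stale-jobs --older-than-minutes 180 --apply
    sudo journalctl -u healtharchive-worker -n 50 --no-pager    # watch for pickup
  fi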

If you need to deploy a fix

  • Follow deploy-and-verify.md (don’t skip the deploy gate).

What “done” means

  • The public surface verification passes again.
  • The underlying cause is identified (config drift, failed migration, disk, external dependency, etc.).
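
A sketch of a closing check, assuming both verification scripts signal failure with a nonzero exit status:

  # Run before closing the incident.
  cd /opt/healtharchive
  if ./scripts/verify_public_surface.py && ./scripts/check_baseline_drift.py --mode live; then
    echo "verification passes; finish the incident note with cause and follow-ups"
  else
    echo "verification still failing; keep the incident open" >&2
  fi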