Incident: API 502 Bad Gateway (2026-02-20)
Status: closed
Metadata
- Date (UTC): 2026-02-20
- Severity (see operations/incidents/severity.md): sev1
- Environment: production
- Primary area: api
- Owner: Jeremy Dawson
- Start (UTC): 2026-02-20T02:53:24Z
- End (UTC): 2026-02-20T16:29:00Z
Summary
UptimeRobot detected an HTTP 502 Bad Gateway on api.healtharchive.ca/api/health. This indicates Caddy is unable to route traffic to the backend API (healtharchive-api.service).
Impact
- User-facing impact: The public website and API are unreachable, returning a 502 error.
- Internal impact: All public API routes, admin routes, and metrics are unavailable.
- Data impact:
  - Data loss: unknown (likely none, as worker/db might still be running)
  - Data integrity risk: no
  - Recovery completeness: unknown
- Duration: 13 hours 35 minutes
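The duration figure can be reproduced from the Start and End timestamps in Metadata. The following is a minimal sketch; the helper names `parse_utc` and `format_duration` are illustrative and not part of the project's tooling.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    # Timestamps in this report use the form 2026-02-20T02:53:24Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def format_duration(start: str, end: str) -> str:
    # Whole seconds between the two timestamps, split into h/m/s.
    total_seconds = int((parse_utc(end) - parse_utc(start)).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


# Start and End as recorded in Metadata.
print(format_duration("2026-02-20T02:53:24Z", "2026-02-20T16:29:00Z"))  # 13h 35m 36s
```

Measured to the 16:32:54Z restore entry in the Timeline instead of the recorded End, the same helper gives 13h 39m 30s.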
Detection
- How was it detected (alert, operator check, user report)? UptimeRobot alert for api.healtharchive.ca/api/health.
- What signals were most useful (commands/metrics/logs)? The 502 status code indicates a proxy-to-upstream failure (Caddy -> uvicorn).
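As a rough illustration of why the status code alone was a useful signal, the sketch below maps a health-check response code to a first-guess diagnosis. The function name and the wording of each diagnosis are assumptions made for this example. They are not taken from UptimeRobot, Caddy, or the project's scripts, and a real triage would still need the service logs.

```python
from http import HTTPStatus


def classify_health_status(status_code: int) -> str:
    """Map a health-check status code to a rough first-guess diagnosis."""
    if 200 <= status_code < 300:
        return "healthy"
    if status_code == HTTPStatus.BAD_GATEWAY:  # 502
        # The proxy answered, so DNS, TLS, and the proxy itself are up;
        # the upstream process is the first thing to inspect.
        return "proxy reachable, upstream failed or refused the connection"
    if status_code == HTTPStatus.SERVICE_UNAVAILABLE:  # 503
        return "service reported itself unavailable"
    if status_code == HTTPStatus.GATEWAY_TIMEOUT:  # 504
        return "proxy reachable, upstream did not answer in time"
    return "unexpected status, inspect manually"


print(classify_health_status(502))
```

In this incident the 502 pointed at the upstream process (healtharchive-api.service) and not at the proxy or the network path.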
Decision log (optional but recommended for sev0/sev1)
- 2026-02-20T02:53:24Z — Decision: Investigated the 502 Bad Gateway alert and prepared recovery steps (why: Incident triggered by UptimeRobot, risks: None)
Timeline (UTC)
- 2026-02-20T02:53:24Z — UptimeRobot detected HTTP 502 Bad Gateway on api.healtharchive.ca/api/health.
- 2026-02-20T16:29:00Z — Investigated logs indicating ModuleNotFoundError: No module named 'slowapi'.
- 2026-02-20T16:32:54Z — Restored via ./scripts/vps-deploy.sh --apply, which installed the missing dependencies and restarted the API service.
Root cause
- Immediate trigger: healtharchive-api.service was crashing with ModuleNotFoundError: No module named 'slowapi'.
- Underlying cause(s): A code change merged into main added the slowapi rate-limiting dependency to pyproject.toml and the source code, but the code on the production VPS was pulled, or the service restarted, without running the deployment script, leaving the virtual environment without the new dependency.
Contributing factors
- What made this worse or harder to debug? Deployment without executing the full script (vps-deploy.sh) skipped dependency resolution.
Resolution / Recovery
- Identified API service crash via sudo journalctl -u healtharchive-api -n 200.
- Error was ModuleNotFoundError: No module named 'slowapi'.
- Noticed slowapi was missing from the .venv site-packages.
- Ran ./scripts/vps-deploy.sh --apply to reinstall dependencies (pip install -e .) and restart the FastAPI process.
- Verified recovery with ./scripts/verify_public_surface.py, which showed all checks passed.
Post-incident verification
- Public surface checks: ./scripts/verify_public_surface.py passes (0 failures).
- Worker/job health checks: Worker was unaffected and continued processing crawl jobs.
- Storage/mount checks (if relevant):
- Integrity checks (if relevant):
Public communication (optional; do this when it changes user expectations)
- Public status update (where/when):
- Changelog entry (date/link):
- Public summary (2–5 sentences):
Open questions (still unknown)
- Why did the API service crash?
Action items (TODOs)
- Investigate root cause of API crash from journalctl logs (owner=, priority=, due=)
- Determine how code was updated on the VPS without a deploy being triggered (e.g. manual git pull) and reinforce operator runbooks to always use vps-deploy.sh.
Automation opportunities
- What can be automated safely?
- What should stay manual (risk/false positives)?
References / Artifacts
- Relevant log path(s): sudo journalctl -u healtharchive-api
- Related playbooks/runbooks: docs/deployment/production-single-vps.md