Incident: API search/changes 500 due to missing dedupe migration (2026-02-06)
Status: closed
Metadata
- Date (UTC): 2026-02-06
- Severity (see
operations/incidents/severity.md): sev2 - Environment: production
- Primary area: api
- Owner: jerdaw
- Start (UTC): 2026-02-06T14:55:33Z
- End (UTC): 2026-02-06T15:05:15Z
Summary
After deploying a backend change that referenced Snapshot.deduplicated, production API endpoints /api/search and /api/changes returned HTTP 500 because the Postgres schema was missing the new snapshots.deduplicated column. The issue was detected by the deploy verification script and confirmed via loopback repro + API logs. Recovery was completed by adding the missing Alembic migration and re-deploying, which applied the migration and restored API functionality.
Impact
- User-facing impact:
/api/searchand/api/changesreturned 500 during the incident window. - Internal impact: deploy verification failed; required a follow-up deploy.
- Data impact:
- Data loss: no
- Data integrity risk: low (read queries failing, not silent corruption)
- Recovery completeness: complete
- Duration: ~10 minutes
Detection
- Detected via deploy verification (
scripts/verify_public_surface.py) failing on/api/searchand/api/changes. - Confirmed via loopback repro and
journalctl -u healtharchive-apishowing: psycopg.errors.UndefinedColumn: column snapshots.deduplicated does not exist
Timeline (UTC)
- 2026-02-06T14:55:33Z — Deploy verification reports
/api/search500; loopback repro confirms 500. - 2026-02-06T14:55:34Z — API logs show UndefinedColumn error for
snapshots.deduplicated. - 2026-02-06T15:04:45Z — Follow-up deploy pulls migration change and runs Alembic upgrade.
- 2026-02-06T15:05:15Z — Loopback
/api/searchand/api/changesreturn 200; public API verification passes.
Root cause
- Immediate trigger: application code queried
snapshots.deduplicatedbefore the corresponding migration existed/applied. - Underlying cause(s): the deploy introduced a schema-dependent feature without an accompanying migration in the same deployable unit.
Contributing factors
- The public surface verifier correctly detected the regression, but it was run after the first deploy had already restarted the API.
Resolution / Recovery
- Added the missing Alembic migration for dedupe schema:
alembic/versions/0014_snapshot_deduplication.py- Re-deployed backend so Alembic applied the migration and the API restarted cleanly.
Post-incident verification
- Local on VPS:
curl -sS -i "http://127.0.0.1:8001/api/search?pageSize=1" | headcurl -sS -i "http://127.0.0.1:8001/api/changes?pageSize=1" | head- Public surface:
python3 ./scripts/verify_public_surface.py --api-base https://api.healtharchive.ca --frontend-base https://www.healtharchive.ca --skip-frontend
Action items (TODOs)
- Add a lightweight CI guard that fails if an API query references a missing column in the test DB schema (owner=jerdaw, priority=medium, due=2026-03)
- Update the dedupe feature checklist to explicitly require an Alembic migration in the same PR (owner=jerdaw, priority=low, due=2026-03)
References / Artifacts
- Deploy script:
scripts/vps-deploy.sh - Public verifier:
scripts/verify_public_surface.py - Migration:
alembic/versions/0014_snapshot_deduplication.py