Escalation Procedures
Last Updated: 2026-01-18 Status: Active
This document defines how to categorize, escalate, and respond to incidents affecting the HealthArchive production environment.
1. Severity Levels
We categorize incidents into four levels based on impact and urgency.
| Level | Definition | Response Time | Actions |
|---|---|---|---|
| Sev0 | Critical Outage / Data Loss: System is totally unusable, or confirmed data loss is occurring. | Immediate | 1. Stop all non-recovery work. 2. Notify stakeholders (if any). 3. Initiate Disaster Recovery. |
| Sev1 | Major Degradation: Core features (Search, API) are broken or extremely slow. User impact is high. | < 1 Hour | 1. Engage Primary On-Call. 2. Investigate immediately. 3. Deploy hotfix or rollback. |
| Sev2 | Partial Degradation: Secondary features (e.g., Replay) are broken, or performance issues exist with workarounds. | < 4 Hours | 1. Log incident. 2. Investigate within business hours. 3. Schedule fix for next release window. |
| Sev3 | Minor Issue: Trivial bugs, cosmetic issues, or single-page failures. No broad user impact. | < 24 Hours | 1. Log ticket/issue. 2. Prioritize in normal development backlog. |
2. Escalation Path
Current State: Single Operator
In the current single-maintainer topology, the escalation path is flat.
- Primary: Operator (You) - Responsible for all triage and resolution.
- Backup: None (Bus factor = 1).
- Mitigation: Comprehensive Runbooks and Disaster Recovery docs to allow a skilled third party to recover the system using "Break-Glass" credentials if the primary operator is incapacitated.
Future State: Multi-Operator
When the team grows, follow this hierarchy:
- Level 1 (On-Call): Triage, immediate mitigation, and initial investigation.
- Level 2 (Secondary/Backup): Deep dive debugging, code fixes, and complex recovery.
- Level 3 (Project Lead): Strategic decisions (e.g., data loss acceptance, major architecture rollback).
3. DRI Assignments (Directly Responsible Individuals)
Since we largely operate as a single unit, the Operator is the DRI for all areas. This matrix serves as a template for future delegation.
| Area | DRI | Responsibilities |
|---|---|---|
| Backend API | Operator | FastAPI availability, performance, response correctness. |
| Worker / Crawls | Operator | Job scheduling, zimit/warcio execution, tiering to storage. |
| Database | Operator | PostgreSQL uptime, backup verification, schema migrations. |
| Storage / WARC | Operator | Disk space management, Storage Box connectivity, manifest integrity. |
| Replay Service | Operator | pywb availability and indexing health. |
| Infrastructure | Operator | VPS provisioning, OS updates, systemd maintenance, Tailscale. |
4. Contact Information Storage
For security reasons, do not store phone numbers or sensitive access codes in this git repository.
Production Contact List
Store a secure, read-only file on the production VPS for emergency reference:
- Path: `/etc/healtharchive/contacts.env`
- Permissions: `600` (root/owner only)
- Format: Key-value pairs
```shell
# Example content for /etc/healtharchive/contacts.env
OPERATOR_PHONE="+1-555-0100"
OPERATOR_EMAIL="admin@healtharchive.ca"
SECONDARY_CONTACT_PHONE="+1-555-0101" # Backup contact (if any)
HETZNER_SUPPORT_PIN="12345"
NAMECHEAP_SUPPORT_PIN="67890"
```
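The "secure, read-only file" setup above can be sketched as a small install script. The path and variable names follow this document; run it as root on the VPS (the `CONTACTS_DIR` override is only there so the sketch is easy to test elsewhere):

```shell
# Sketch: install the emergency contacts file with owner-only permissions.
# CONTACTS_DIR defaults to the path used in this document.
CONTACTS_DIR=${CONTACTS_DIR:-/etc/healtharchive}
CONTACTS_FILE="$CONTACTS_DIR/contacts.env"

mkdir -p "$CONTACTS_DIR"
cat > "$CONTACTS_FILE" <<'EOF'
OPERATOR_PHONE="+1-555-0100"
OPERATOR_EMAIL="admin@healtharchive.ca"
EOF
chmod 600 "$CONTACTS_FILE"   # owner read/write only, no group/world access

# Verify: should print 600
stat -c '%a' "$CONTACTS_FILE"
```

The `600` mode matters more than the contents: anyone who can read the file gets support PINs that can be used to impersonate the operator with providers.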
Personal Backup
Mirror this information in your password manager (e.g., 1Password, Bitwarden) under a secure note titled "HealthArchive Emergency Contacts".
5. Break-Glass Procedures
Quick-reference steps for common critical failures where normal access or services are blocked.
A. API Unresponsive (HTTP 502/503/Timeout)
- Access: SSH to the VPS via Tailscale (`ssh haadmin@100.x.y.z`).
- Status: Check whether the service is running.
- Logs: Check the service logs for specific error messages.
- Action: Restart the service.
- Escalation: If restart fails or immediately crashes, check Database connectivity (see B).
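The steps above can be sketched with `systemctl`/`journalctl`; the unit name `healtharchive-api` is an assumption and may differ on your host:

```shell
# Status: is the API unit running?
systemctl status healtharchive-api --no-pager

# Logs: scan the last 100 lines for specific error messages.
journalctl -u healtharchive-api -n 100 --no-pager

# Action: restart the service, then confirm it stayed up.
sudo systemctl restart healtharchive-api
sleep 5
systemctl is-active healtharchive-api
```

If `is-active` reports `failed` or `activating` in a loop, the service is crash-looping; move on to the database checks below rather than restarting repeatedly.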
B. Database Unreachable
- Status: Is Postgres running?
- Resources: Is the disk full?
- Logs: Check the PostgreSQL logs for startup or connection errors.
- Action: Restart Postgres.
- Escalation: If database won't start due to corruption, proceed to Disaster Recovery Scenario B.
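A minimal command sketch for the checks above; the unit name `postgresql` and the data-directory path are assumptions that depend on how Postgres was installed:

```shell
# Status: is Postgres running?
systemctl status postgresql --no-pager

# Resources: is the disk holding the data directory full?
df -h /var/lib/postgresql

# Logs: look for startup or connection errors.
journalctl -u postgresql -n 100 --no-pager

# Action: restart Postgres, then confirm it accepts connections.
sudo systemctl restart postgresql
sudo -u postgres pg_isready
```

`pg_isready` returning "accepting connections" is a faster signal than re-testing the API; if it fails after a clean restart, suspect corruption and escalate.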
C. VPS Unreachable (SSH Down)
- Check Network: Try accessing via a different Tailscale node, or via the public IP (if SSH is open for testing).
- Console Access: Log in to Hetzner Cloud Console > Select Server > Console.
- This bypasses network/SSH config issues.
- Reboot: Use the Hetzner "Power" menu to trigger an ACPI reboot, or a hard reset if the OS is frozen.
- Escalation: If the server is deleted or hardware failed, proceed to Disaster Recovery Scenario A.
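Before reaching for the Hetzner console, the network checks can be run from any other machine; the tailnet hostname and the public IP below are placeholders, not real values:

```shell
# Is the VPS visible on the tailnet at all? (hostname is an assumption)
tailscale status | grep healtharchive

# Probe the public IP directly (substitute the server's real address).
ping -c 3 203.0.113.10

# Is the SSH port reachable, even if login fails?
nc -vz -w 5 203.0.113.10 22
```

If `ping` succeeds but port 22 is closed, the problem is likely sshd or firewall config rather than the host itself, which the Hetzner console can fix without a reboot.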
6. Handoff Procedures
When transferring responsibility (e.g., vacation coverage):
- Sync: Verify current system health (dashboards, logs).
- Access: Confirm backup operator has valid SSH/Tailscale access.
- Docs: Ensure emergency contact info is accessible to the backup.
- Notify: Inform any stakeholders (if applicable) of the active operator change.
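The "Sync" step can be captured as a quick health snapshot to run on the VPS before handing off. The unit names and the `/healthz` endpoint on port 8000 are illustrative assumptions:

```shell
# Pre-handoff health snapshot (run on the VPS).
date -u                                           # timestamp the snapshot
systemctl is-active healtharchive-api postgresql  # core services up?
df -h /                                           # headroom on the root disk
curl -fsS --max-time 10 http://localhost:8000/healthz \
  || echo "API health check FAILED"
```

Paste the output into the handoff message so the incoming operator has a known-good baseline to compare against.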