Escalation Procedures
Last Updated: 2026-01-18 Status: Active
This document defines how to categorize, escalate, and respond to incidents affecting the HealthArchive production environment.
1. Severity Levels
We categorize incidents into four levels based on impact and urgency.
| Level | Definition | Response Time | Actions |
|---|---|---|---|
| Sev0 | Critical Outage / Data Loss: System is totally unusable, or confirmed data loss is occurring. | Immediate | 1. Stop all non-recovery work. 2. Notify stakeholders (if any). 3. Initiate Disaster Recovery. |
| Sev1 | Major Degradation: Core features (Search, API) are broken or extremely slow. User impact is high. | < 1 Hour | 1. Engage Primary On-Call. 2. Investigate immediately. 3. Deploy hotfix or rollback. |
| Sev2 | Partial Degradation: Secondary features (e.g., Replay) are broken, or performance issues exist with workarounds. | < 4 Hours | 1. Log incident. 2. Investigate within business hours. 3. Schedule fix for next release window. |
| Sev3 | Minor Issue: Trivial bugs, cosmetic issues, or single-page failures. No broad user impact. | < 24 Hours | 1. Log ticket/issue. 2. Prioritize in normal development backlog. |
2. Escalation Path
Current State: Single Operator
In the current single-maintainer topology, the escalation path is flat.
- Primary: Operator (You) - Responsible for all triage and resolution.
- Backup: None (Bus factor = 1).
- Mitigation: Comprehensive Runbooks and Disaster Recovery docs to allow a skilled third party to recover the system using "Break-Glass" credentials if the primary operator is incapacitated.
Future State: Multi-Operator
When the team grows, follow this hierarchy:
- Level 1 (On-Call): Triage, immediate mitigation, and initial investigation.
- Level 2 (Secondary/Backup): Deep dive debugging, code fixes, and complex recovery.
- Level 3 (Project Lead): Strategic decisions (e.g., data loss acceptance, major architecture rollback).
3. DRI Assignments (Directly Responsible Individuals)
Since we largely operate as a single unit, the Operator is the DRI for all areas. This matrix serves as a template for future delegation.
| Area | DRI | Responsibilities |
|---|---|---|
| Backend API | Operator | FastAPI availability, performance, response correctness. |
| Worker / Crawls | Operator | Job scheduling, zimit/warcio execution, tiering to storage. |
| Database | Operator | PostgreSQL uptime, backup verification, schema migrations. |
| Storage / WARC | Operator | Disk space management, Storage Box connectivity, manifest integrity. |
| Replay Service | Operator | pywb availability and indexing health. |
| Infrastructure | Operator | VPS provisioning, OS updates, systemd maintenance, Tailscale. |
4. Contact Information Storage
For security reasons, do not store phone numbers or sensitive access codes in this git repository.
Production Contact List
Store a secure, read-only file on the production VPS for emergency reference:
- Path: `/etc/healtharchive/contacts.env`
- Permissions: `600` (root/owner only)
- Format: Key-value pairs
```shell
# Example content for /etc/healtharchive/contacts.env
OPERATOR_PHONE="+1-555-0100"
OPERATOR_EMAIL="admin@healtharchive.ca"
SECONDARY_CONTACT_PHONE="+1-555-0101" # Backup contact (if any)
HETZNER_SUPPORT_PIN="12345"
NAMECHEAP_SUPPORT_PIN="67890"
```
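The "secure, read-only file" setup above can be sketched as a small install script. The path and variable names follow this document; run it as root on the VPS (the `CONTACTS_DIR` override is only there so the sketch is easy to test elsewhere):

```shell
# Sketch: install the emergency contacts file with owner-only permissions.
# CONTACTS_DIR defaults to the path used in this document.
CONTACTS_DIR=${CONTACTS_DIR:-/etc/healtharchive}
CONTACTS_FILE="$CONTACTS_DIR/contacts.env"

mkdir -p "$CONTACTS_DIR"
cat > "$CONTACTS_FILE" <<'EOF'
OPERATOR_PHONE="+1-555-0100"
OPERATOR_EMAIL="admin@healtharchive.ca"
EOF
chmod 600 "$CONTACTS_FILE"   # owner read/write only, no group/world access

# Verify: should print 600
stat -c '%a' "$CONTACTS_FILE"
```

The `600` mode matters more than the contents: anyone who can read the file gets support PINs that can be used to impersonate the operator with providers.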
Personal Backup
Mirror this information in your password manager (e.g., 1Password, Bitwarden) under a secure note titled "HealthArchive Emergency Contacts".
5. Break-Glass Procedures
Quick-reference steps for common critical failures where normal access or services are blocked.
A. API Unresponsive (HTTP 502/503/Timeout)
- Access: SSH to the VPS via Tailscale (`ssh haadmin@100.x.y.z`).
- Status: Check whether the service is running.
- Logs: Check the service logs for specific error messages.
- Action: Restart the service.
- Escalation: If restart fails or immediately crashes, check Database connectivity (see B).
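The steps above can be sketched with `systemctl`/`journalctl`; the unit name `healtharchive-api` is an assumption and may differ on your host:

```shell
# Status: is the API unit running?
systemctl status healtharchive-api --no-pager

# Logs: scan the last 100 lines for specific error messages.
journalctl -u healtharchive-api -n 100 --no-pager

# Action: restart the service, then confirm it stayed up.
sudo systemctl restart healtharchive-api
sleep 5
systemctl is-active healtharchive-api
```

If `is-active` reports `failed` or `activating` in a loop, the service is crash-looping; move on to the database checks below rather than restarting repeatedly.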
B. Database Unreachable
- Status: Is Postgres running?
- Resources: Is the disk full?
- Logs: Check the PostgreSQL logs for startup or connection errors.
- Action: Restart Postgres.
- Escalation: If database won't start due to corruption, proceed to Disaster Recovery Scenario B.
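A minimal command sketch for the checks above; the unit name `postgresql` and the data-directory path are assumptions that depend on how Postgres was installed:

```shell
# Status: is Postgres running?
systemctl status postgresql --no-pager

# Resources: is the disk holding the data directory full?
df -h /var/lib/postgresql

# Logs: look for startup or connection errors.
journalctl -u postgresql -n 100 --no-pager

# Action: restart Postgres, then confirm it accepts connections.
sudo systemctl restart postgresql
sudo -u postgres pg_isready
```

`pg_isready` returning "accepting connections" is a faster signal than re-testing the API; if it fails after a clean restart, suspect corruption and escalate.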
C. VPS Unreachable (SSH Down)
- Check Network: Try accessing via a different Tailscale node, or via the public IP (if SSH is open for testing).
- Console Access: Log in to Hetzner Cloud Console > Select Server > Console.
- This bypasses network/SSH config issues.
- Reboot: Use the Hetzner "Power" menu to trigger an ACPI reboot, or a hard reset if the OS is frozen.
- Escalation: If the server is deleted or hardware failed, proceed to Disaster Recovery Scenario A.
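Before reaching for the Hetzner console, the network checks can be run from any other machine; the tailnet hostname and the public IP below are placeholders, not real values:

```shell
# Is the VPS visible on the tailnet at all? (hostname is an assumption)
tailscale status | grep healtharchive

# Probe the public IP directly (substitute the server's real address).
ping -c 3 203.0.113.10

# Is the SSH port reachable, even if login fails?
nc -vz -w 5 203.0.113.10 22
```

If `ping` succeeds but port 22 is closed, the problem is likely sshd or firewall config rather than the host itself, which the Hetzner console can fix without a reboot.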
6. Handoff Procedures
When transferring responsibility (e.g., vacation coverage):
- Sync: Verify current system health (dashboards, logs).
- Access: Confirm backup operator has valid SSH/Tailscale access.
- Docs: Ensure emergency contact info is accessible to the backup.
- Notify: Inform any stakeholders (if applicable) of the active operator change.
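The "Sync" step can be captured as a quick health snapshot to run on the VPS before handing off. The unit names and the `/healthz` endpoint on port 8000 are illustrative assumptions:

```shell
# Pre-handoff health snapshot (run on the VPS).
date -u                                           # timestamp the snapshot
systemctl is-active healtharchive-api postgresql  # core services up?
df -h /                                           # headroom on the root disk
curl -fsS --max-time 10 http://localhost:8000/healthz \
  || echo "API health check FAILED"
```

Paste the output into the handoff message so the incoming operator has a known-good baseline to compare against.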