Disaster Recovery Runbook
Last Updated: 2026-01-18 Status: Active
Recovery Objectives
In the context of HealthArchive, these objectives define our boundaries for data loss and downtime during a major failure.
- RPO (Recovery Point Objective): The maximum age of files that must be recovered from backup storage for operations to resume. It defines our "data loss tolerance."
- RTO (Recovery Time Objective): The maximum duration of time within which service must be restored after a disaster. It defines our "downtime tolerance."
- MTTR (Mean Time To Recovery): The average time taken to repair a failed component and return it to service.
RPO (Recovery Point Objective)
Target: 24 hours
Rationale:
- We perform nightly backups of the database and configuration.
- Crawl data (WARCs) is tiered to storage regularly.
- Up to 24 hours of data loss (recent crawls, user actions) is considered acceptable for the current service criticality level (research access, no real-time critical operational dependencies). Data can often be re-crawled.
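The RPO can be spot-checked at any time by comparing the newest dump's age against the 24-hour target. A minimal sketch (the backup directory and `.dump` naming are assumptions based on conventions used later in this runbook):

```shell
# Sketch: report the age of the newest backup dump so an RPO breach is visible.
# Backup path and .dump naming are assumptions from this runbook's conventions.
backup_age_hours() {
  # $1: path to a backup file; prints its age in whole hours
  local now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y "$1")
  echo $(( (now - mtime) / 3600 ))
}

latest_dump() {
  # $1: backup directory; prints the most recently modified .dump file
  ls -1t "$1"/*.dump 2>/dev/null | head -n 1
}

# Example usage (path is illustrative):
#   dump=$(latest_dump /srv/healtharchive/backups)
#   age=$(backup_age_hours "$dump")
#   [ "$age" -le 24 ] || echo "RPO breach: newest backup is ${age}h old"
```

The same check feeds the "Backup Age" metric in the drill report template.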
RTO (Recovery Time Objective)
Target: 8 hours
Rationale:
- Recovery involves manual provisioning of a new VPS, installing dependencies, and restoring from backup.
- This timeframe allows a single operator to perform these steps during a standard workday.
MTTR (Mean Time To Recovery)
Target: 4 hours
Rationale:
- For partial failures (e.g., service restart, database recovery without full VPS loss), we aim to restore service within 4 hours.
When to Revisit
These targets should be reviewed:
- Annually: During the full DR drill.
- Service Changes: If the service criticality increases (e.g., adding real-time users).
- Architecture Changes: If moving from a single VPS to a multi-node/HA setup.
- Scale Changes: If the dataset size grows significantly enough to impact restoration times.
Scenarios
Scenario A: Complete VPS loss (NAS backup available)
Most likely DR scenario. Requires provisioning new VPS and restoring from offsite NAS backup.
Scenario B: Database corruption
Restoration from local or NAS pg_dump.
Scenario C: Storage failure
Recovery of WARC files from tiered storage or accepted data loss.
Procedures
1. VPS Complete Restoration (Scenario A)
Prerequisites:
- Access to Hetzner Cloud Console.
- Access to Synology NAS (via physical access or alternative network if Tailscale is down).
- SSH key for haadmin available locally.
Step 1: Provision New VPS
- Create Server: Follow the standard provisioning steps in Production Single VPS.
  - Image: Ubuntu 24.04 LTS.
  - Updates: sudo apt update && sudo apt upgrade -y
  - User: Create the haadmin user and harden SSH.
- Configure Networking:
  - Set up firewall rules (allow 80/443, block public 22, allow Tailscale UDP).
Step 2: Install Base Dependencies
Run as haadmin:
sudo apt install -y docker.io postgresql postgresql-contrib python3-venv python3-pip git curl build-essential pkg-config unzip
sudo systemctl enable --now docker postgresql
Step 3: Re-join Tailscale
- Install Tailscale: curl -fsSL https://tailscale.com/install.sh | sh
- Authenticate: sudo tailscale up --ssh
- Note: If possible, reuse the old IP/hostname from the admin console to simplify ACLs, or update ACLs to trust the new node.
Step 4: Prepare Directories
sudo groupadd --system healtharchive
sudo mkdir -p /srv/healtharchive/{jobs,backups,ops}
sudo chown -R haadmin:haadmin /srv/healtharchive/jobs
sudo chown root:healtharchive /srv/healtharchive/backups /srv/healtharchive/ops
sudo chmod 2770 /srv/healtharchive/backups /srv/healtharchive/ops
Step 5: Retrieve Backup from NAS
If Tailscale is up on both ends:
1. SSH to the NAS: ssh user@nas-ip
2. Rsync the backup to the new VPS:
rsync -av /volume1/nobak/healtharchive/backups/db/latest.dump haadmin@new-vps-ip:/srv/healtharchive/backups/
Step 6: Restore Database
- Create DB and User:
- Restore Schema and Data:
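The two sub-steps above might look like the following sketch. The database name, role, and dump path are assumptions based on names used elsewhere in this runbook; verify them against your backend.env before running anything.

```shell
# Assumed names: database "healtharchive", role "healtharchive", dump at
# /srv/healtharchive/backups/latest.dump. Adjust to your actual configuration.

# Create the role and database (set a real password from your secrets store):
sudo -u postgres psql -c "CREATE ROLE healtharchive LOGIN PASSWORD 'changeme';"
sudo -u postgres psql -c "CREATE DATABASE healtharchive OWNER healtharchive;"

# Restore schema and data from the custom-format (pg_dump -Fc) dump:
sudo -u postgres pg_restore --no-owner --role=healtharchive \
  -d healtharchive /srv/healtharchive/backups/latest.dump
```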
Step 7: Restore Application
- Clone Repository:
- Restore Configuration:
  - Restore /etc/healtharchive/backend.env from your distinct secure offsite storage (e.g., password manager notes). Do not lose this file.
  - If needed, regenerate the ADMIN_TOKEN.
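A sketch of the application-restore step, assuming the layout used elsewhere in this runbook. The repository URL, clone destination, and secrets location are placeholders, not confirmed values.

```shell
# Clone the repository (URL and destination are placeholders):
git clone git@github.com:example/healtharchive.git /srv/healtharchive/app

# Restore backend.env from your offsite secrets store, then lock it down so
# only root and the healtharchive group can read it:
sudo mkdir -p /etc/healtharchive
sudo cp ~/restored-secrets/backend.env /etc/healtharchive/backend.env
sudo chown root:healtharchive /etc/healtharchive/backend.env
sudo chmod 640 /etc/healtharchive/backend.env
```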
Step 8: Re-mount Storage / Restore WARCs
- Mount the Storage Box (tiered storage) to /srv/healtharchive/storagebox using sshfs (see production-single-vps.md).
- If local WARCs were lost (/srv/healtharchive/jobs), you have two options:
  - Rescan: If files exist on the Storage Box, re-import headers (slow).
  - Empty Start: Start with empty local jobs; historical data remains on the Storage Box/index.
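The remount might look like the following sketch; the Storage Box host and username are placeholders (production-single-vps.md has the canonical command).

```shell
# Mount the Storage Box over sshfs (host/user are placeholders):
sudo mkdir -p /srv/healtharchive/storagebox
sshfs -o reconnect,ServerAliveInterval=15 \
  u000000@u000000.your-storagebox.de:/ /srv/healtharchive/storagebox

# Confirm the mount before relying on it:
mountpoint -q /srv/healtharchive/storagebox && echo "storagebox mounted"
```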
2. Database Intact Restoration (Scenario B)
Use this procedure when the VPS is running but the database is corrupted or dropped.
Prerequisites:
- Backup file available (local or NAS).
- PostgreSQL service is running.
Step 1: Locate Backup
- Format: pg_dump -Fc (custom format, compressed).
- Local: /srv/healtharchive/backups/
  - Naming: healtharchive_<timestamp>.dump
  - Retention: 14 days.
- NAS: /volume1/nobak/healtharchive/backups/db/ (needs retrieval)
  - Offsite mirror of /srv/healtharchive/backups via NAS rsync --delete pull.
  - Retention follows the current VPS backup directory contents, not an independent permanent archive. Archive one-off maintenance dumps under /srv/healtharchive/ops/maintenance/... if they must outlive the normal mirrored set.
Step 2: Restore Database
Warning: This will overwrite the current database state.
- Drop and Recreate:
- Restore from Dump:
- Verify Restoration: Check that tables are populated:
- Swap Databases: Stop services to prevent locking:
  Swap:
- Restart Services:
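One way to realize the steps above with minimal downtime is to restore into a side database and swap by rename. A sketch, assuming the database name healtharchive, the service unit names used in this runbook, and a dump at a placeholder path:

```shell
# 1. Restore the dump into a side database (names/paths are assumptions):
sudo -u postgres psql -c "CREATE DATABASE healtharchive_restore OWNER healtharchive;"
sudo -u postgres pg_restore --no-owner -d healtharchive_restore \
  /srv/healtharchive/backups/latest.dump

# 2. Verify the side database looks sane before swapping:
sudo -u postgres psql -d healtharchive_restore -c "SELECT count(*) FROM snapshots;"

# 3. Stop services to prevent locking, then swap by rename:
sudo systemctl stop healtharchive-api healtharchive-worker
sudo -u postgres psql -c "ALTER DATABASE healtharchive RENAME TO healtharchive_old;"
sudo -u postgres psql -c "ALTER DATABASE healtharchive_restore RENAME TO healtharchive;"

# 4. Restart services; drop the old copy only after verification:
sudo systemctl start healtharchive-api healtharchive-worker
# sudo -u postgres psql -c "DROP DATABASE healtharchive_old;"
```

Keeping healtharchive_old around until the verification checklist passes gives a fast rollback path.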
Step 3: Integrity Verification
- Row Counts: Compare SELECT count(*) FROM snapshots with expected values.
- Recent Data: Check for the most recent captures: SELECT * FROM snapshots ORDER BY id DESC LIMIT 5;
- Foreign Keys: pg_restore would have failed on constraint violations, but check application logs for ORM errors.
- Orphaned Records: Ensure core relations are intact:
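An orphan check can be expressed as an anti-join. The table and column names below (snapshots, sources, source_id) are assumptions inferred from this runbook, so adapt them to the actual schema:

```shell
# Count snapshots whose parent source row is missing (expect 0).
# Table/column names are assumed, not confirmed against the schema.
sudo -u postgres psql -d healtharchive -c "
  SELECT count(*) AS orphaned_snapshots
  FROM snapshots s
  LEFT JOIN sources src ON src.id = s.source_id
  WHERE src.id IS NULL;"
```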
Step 4: Partial Restoration (Advanced)
- Specific Table: Use pg_restore -t <tablename> to restore only one table to a temp DB, then copy data.
- Verify on Separate Server: For high-stakes restorations, perform the restoration on a development or temporary VPS first to verify integrity before swapping production.
- Point-in-Time: Requires WAL archiving (currently not enabled; rely on nightly dumps).
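The single-table path might look like this sketch, restoring into a scratch database first; the table name, dump path, and transfer file are placeholders:

```shell
# Restore only one table into a scratch database:
sudo -u postgres psql -c "CREATE DATABASE ha_scratch;"
sudo -u postgres pg_restore --no-owner -d ha_scratch \
  -t snapshots /srv/healtharchive/backups/latest.dump

# Copy the recovered rows across via psql's \copy (CSV as the transfer format):
sudo -u postgres psql -d ha_scratch \
  -c "\copy snapshots TO '/tmp/snapshots.csv' CSV"
sudo -u postgres psql -d healtharchive \
  -c "\copy snapshots FROM '/tmp/snapshots.csv' CSV"

# Clean up the scratch database once the rows are verified:
sudo -u postgres psql -c "DROP DATABASE ha_scratch;"
```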
3. Archive Root Recovery (Scenario C)
Use this procedure when WARC files or the archive storage structure is compromised.
Archive Root Structure:
/srv/healtharchive/jobs/
├── <source_slug>-<year>-<month>/ # Job Output Directories
│ ├── warcs/ # Stable WARC files
│ │ ├── manifest.json # Mapping of source -> stable filenames
│ │ └── warc-000001.warc.gz
│ ├── provenance/ # Metadata preservation
│ │ └── archive_state.json
│ └── logs/
└── tiered/ # Mount point for cold storage (Storage Box)
Recovery Scenarios
Case 1: Local WARCs lost (e.g., accidental deletion), tiered storage intact
This is the most common recovery case.
1. Check Tiered Storage: Verify header-only WARCs or full files exist in /srv/healtharchive/storagebox.
2. Re-import Headers/WARCs (Slow but safe): If the database is intact, the site can operate without the local WARCs, but the replay service will fail for those snapshots. To restore replayability, copy the WARCs back from tiered storage:
# Example: Restore specific job
rsync -av /srv/healtharchive/storagebox/jobs/hc-2026-01/ /srv/healtharchive/jobs/hc-2026-01/
# Check that all files in manifest exist and have correct sizes
cat /srv/healtharchive/jobs/hc-2026-01/warcs/manifest.json | jq .records
Case 2: Tiered storage unavailable, local intact
1. Run in Degraded Mode: Operations can continue using local WARCs.
2. Disable Tiering: Stop the tiering cron job/timer to prevent errors.
3. Restore Connection: Troubleshoot the sshfs mount or Storage Box availability.
4. Re-enable Tiering: Once fixed, the system will resume tiering new WARCs.
Case 3: All copies lost (catastrophic)
1. Accept Data Loss: Crawl data is gone.
2. Clean Database: You may need to truncate the snapshots table if it references missing files, or mark them as lost.
3. Re-crawl: Trigger new manual crawls for critical sources.
Integrity Verification
- WARC Validation:
- Database Consistency: Ensure database records point to existing files (custom script required).
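No such script ships with the system yet; a minimal sketch of what it could do is below. It assumes the database can emit one WARC path per row (the warc_path column is a placeholder, not a confirmed schema name):

```shell
# Report database-referenced files that are missing on disk.
# Reads one path per line on stdin, prints MISSING lines, exits non-zero if any.
check_paths_exist() {
  local missing=0 path
  while IFS= read -r path; do
    [ -z "$path" ] && continue
    if [ ! -f "$path" ]; then
      echo "MISSING: $path"
      missing=1
    fi
  done
  return $missing
}

# Example usage (warc_path is a placeholder for the real column name):
#   sudo -u postgres psql -d healtharchive -At \
#     -c "SELECT warc_path FROM snapshots;" | check_paths_exist
```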
Re-tiering and Consolidation Procedure
If tiered storage was wiped and replaced, or if you need to stabilize newly crawled data:
- Consolidate WARCs: Ensure files are moved from .tmp* to stable warcs/ folders and manifests are updated.
- Verify Local Integrity: Ensure local WARCs match their manifest and are valid.
- Force Tiering: Run the tiering command manually to re-upload everything:
- Verify Tiered Copies: Check that the files on the Storage Box match the local stable WARCs.
4. Service Startup Sequence
Order is critical:
- Database: sudo systemctl start postgresql
  - Health Check: sudo systemctl status postgresql or pg_isready
  - Failure: Check disk space (df -h) and logs (journalctl -u postgresql).
- API: sudo systemctl start healtharchive-api
  - Health Check: curl http://localhost:8001/api/health
  - Failure: Check /etc/healtharchive/backend.env and journalctl -u healtharchive-api -n 100.
- Worker: sudo systemctl start healtharchive-worker
  - Health Check: sudo systemctl status healtharchive-worker (check logs for "Worker started").
  - Failure: Check database connectivity and logs.
- Replay (Optional): Start pywb if configured.
  - Health Check: curl http://localhost:8080 (or configured port).
- Reverse Proxy: sudo systemctl start caddy
  - Health Check: sudo systemctl status caddy
  - Failure: sudo caddy validate --config /etc/caddy/Caddyfile
5. Verification Checklist
Run these checks immediately after startup:
- Database Connectivity: sudo -u postgres psql -d healtharchive -c 'SELECT count(*) FROM sources;' (should return > 0)
- API Health: curl http://localhost:8001/api/health -> {"status":"ok"}
- Public Endpoint (HTTPS): curl -I https://api.healtharchive.ca/api/health (verify TLS works)
- Search Index: Query a known term via the frontend or API.
- Worker Health: Check logs for "Worker started" and no immediate crashes.
- Snapshot Viewing: Visit a known snapshot URL (e.g., the smoke test snapshot ID 1).
- Monitoring Reconnected: Confirm that Healthchecks.io, Prometheus, and external uptime monitors are receiving signals from the new VPS.
DR Drills
Regular testing ensures that these procedures remain effective and that operators are familiar with the recovery process.
Schedule
| Drill Type | Frequency | Next Due | Owner | Scope |
|---|---|---|---|---|
| Tabletop | Quarterly | Q1 2026 | Operator | Review procedure, check credentials, identify gaps. |
| Partial Restore | Quarterly | Q1 2026 | Operator | Restore database summary/integrity check on local dev machine. |
| Full DR | Annual | 2026-06 | Operator | Full recovery from backup to a fresh VPS. |
Procedures
1. Tabletop Drill
Objective: Verify documentation accuracy and credential availability without interacting with production.
- Read-Through: Walk through the "Complete VPS Restoration (Scenario A)" procedure step-by-step.
- Credential Check: Verify you can locate/access:
- Hetzner Cloud Console password/2FA.
- Synology NAS SSH keys.
- Encrypted backup of /etc/healtharchive/backend.env.
- Domain DNS controls (Namecheap).
- Success Criteria:
- All restoration steps are understood and commands are valid.
- All required credentials are confirmed as accessible and current.
- Documentation & Follow-up:
- Fix any broken links, outdated commands, or unclear instructions found during the read-through.
- Record findings in the Results Log (see below).
2. Partial Restoration Drill
Objective: Verify backup integrity and database restorability.
- Retrieve Backup: Download the latest actual healtharchive_<ts>.dump from the NAS or VPS.
- Local Restore:
  - Spin up a local Docker Postgres container or use a local dev DB.
  - Run the Scenario B (Database Corruption) restoration steps against this local instance.
- Success Criteria:
  - pg_restore completes without fatal errors.
  - Row counts for snapshots match or are within expected growth margins.
  - Recent captures are present and readable.
- Documentation & Follow-up:
- Record the size of the backup and restoration time in the Results Log.
- If corruption is found, investigate backup job logs and schedule an immediate re-run.
- Cleanup: Delete the local test database and backup file.
3. Full DR Drill (Annual)
Objective: Prove total system recovery capability.
Prerequisites:
- Perform during a low-traffic window (e.g., weekend).
- Budget ~$5 for temporary VPS costs.
Procedure:
1. Provision: Create a new VPS (e.g., dr-test-2026) in Hetzner. DO NOT DELETE THE EXISTING PRODUCTION VPS.
2. Execute Scenario A: Follow "VPS Complete Restoration" strictly.
   - Modification: When restoring backend.env, change HEALTHARCHIVE_PUBLIC_SITE_URL to the temporary IP or a test subdomain to avoid DNS conflicts.
   - Modification: Do not switch the main DNS (A record) unless you are intentionally testing failover (requires downtime).
3. Verify & Success Criteria:
   - Run the complete "Verification Checklist" on the new host; all checks must pass.
   - Verify you can pull a WARC file from tiered storage.
   - Total restoration time is within the 8-hour RTO.
4. Documentation & Follow-up:
   - Record total time to recovery (RTO metric) and any blockers in the Results Log.
   - Update the MTTR/RTO targets if they are consistently missed or easily exceeded.
5. Teardown:
   - Destroy the temporary VPS.
   - Remove the temporary node from Tailscale.
Results Log
Copy and paste this template to docs/operations/dr-logs/<YYYY-MM-DD>-drill-report.md:
# DR Drill Report: <Date>
**Drill Type:** (Tabletop / Partial / Full)
**Operator:** <Name>
**Time Started:** <HH:MM UTC>
**Time Finished:** <HH:MM UTC>
**Total Duration:** <Minutes>
## Outcome
- [ ] Success (All objectives met)
- [ ] Partial Success (Objectives met with issues)
- [ ] Failure (Could not complete recovery)
## Metrics
- **RTO Achieved:** N/A (or actual time if Full Drill)
- **Backup Age:** <Hours since last backup> (RPO check)
## Issues Encountered
1. Issue description...
## Documentation Updates Required
- [ ] Update section X.Y...