Service Levels
This document describes the service level objectives and commitments for HealthArchive.ca. These are targets, not contractual guarantees.
Last Updated: 2026-01-18
Scope and Context
HealthArchive is a public-interest research archive operated as a best-effort service. These service levels reflect commitments appropriate for the project's resources and mission:
- Infrastructure: Single VPS (Hetzner cx33: 4 vCPU / 8GB RAM / 80GB SSD)
- Staffing: Solo operator, no 24/7 coverage
- Purpose: Public good research tool, not a commercial service
All targets are measured and reviewed on a best-effort basis. Incidents outside business hours may see delayed response.
Availability
Target
99.5% monthly availability
This allows for approximately 3.6 hours of downtime per month, which is realistic for: - Single-server architecture (no redundancy) - Manual maintenance operations - Solo operator response times
Measurement
- Primary endpoint:
GET /api/health(https://api.healtharchive.ca/api/health) - Monitoring method: External uptime monitoring (Healthchecks.io, UptimeRobot)
- Measurement window: Calendar month
- Exclusions: Planned maintenance with advance notice (see Maintenance Windows)
Review
- Semi-annual review of actuals vs target
- Adjust target if infrastructure or staffing changes significantly
Response Times
Target response times for key API endpoints, measured server-side (excludes network latency):
| Endpoint | p50 Target | p95 Target | p99 Target | Notes |
|---|---|---|---|---|
GET /api/health | 50ms | 100ms | 200ms | Minimal processing; use ?details=1 only for operator summary counts |
GET /api/search | 500ms | 2000ms | 5000ms | Complex queries, database-dependent |
GET /api/sources | 100ms | 300ms | 500ms | Lightweight, typically cached |
GET /api/snapshot/{id} | 100ms | 300ms | 500ms | Single record lookup |
GET /api/changes | 200ms | 500ms | 1000ms | Precomputed change feed |
Degradation Criteria
The service is considered degraded when: - p95 latency exceeds target by 2× for 5+ consecutive minutes - p99 latency exceeds target by 3× for 5+ consecutive minutes - Any endpoint timeout rate exceeds 1%
Exclusions
These targets do not apply to: - Attack traffic or abusive request patterns - Bulk export operations (/api/exports/*) - Replay operations (separate service: replay.healtharchive.ca)
Data Freshness
Crawl Cadence
Primary sources (Health Canada, PHAC): Crawled at least annually, with ad-hoc updates as resources permit
- Major annual crawl campaign: typically January
- Ad-hoc crawls: triggered by significant health events or policy changes
- Schedule is best-effort and subject to operator availability
Indexing Latency
- Crawl-to-indexed: Within 24 hours of crawl completion, subject to operator availability
- Indexed-to-searchable: Immediate (same database transaction)
Change Tracking
- Changes computed: Within 24 hours of new snapshots being indexed, subject to operator availability
- Change feed updated: On next
compute-changesrun (automated via systemd timer)
Exceptions
- Manual crawls may have different timelines based on urgency
- Emergency updates (e.g., public health crises) prioritized on case-by-case basis
Maintenance Windows
Window Types
Routine Maintenance
- Examples: Security updates, dependency patches, configuration changes
- Advance Notice: 24 hours (via changelog)
- Maximum Duration: 30 minutes
- Typical Downtime: < 15 minutes
Major Maintenance
- Examples: Database migrations, infrastructure changes, new feature deployments
- Advance Notice: 72 hours (via changelog + announcement if user-facing)
- Maximum Duration: 4 hours
- Typical Downtime: 1-2 hours
Emergency Maintenance
- Examples: Critical security patches, severe bug fixes
- Advance Notice: ASAP (post-hoc notification if required immediately)
- Duration: As needed
- Communication: Documented in changelog after completion
Preferred Timing
- Weekdays, off-peak hours: Early morning (00:00-06:00 UTC) or late evening (22:00-24:00 UTC)
- Avoid: Business hours (14:00-22:00 UTC), weekends, holidays
Post-Maintenance Verification
After all maintenance: - Health check validation (/api/health, /archive) - Smoke test (search query, snapshot retrieval) - External uptime monitor confirmation - Documented in changelog
Communication Commitments
Channels
Public Channels: - Changelog: https://healtharchive.ca/changelog - primary source for planned changes and incidents - Status: https://healtharchive.ca/status - service status overview - No dedicated status page (updates via changelog)
Internal/Operator: - Incident notes (selected public-safe summaries published) - Operations logs (private)
Incident Communication
Following the incident disclosure policy (Option B):
Sev0/Sev1 (Service Down / Major Degradation): - Communicate within 48 hours of resolution, or as soon as practical - Public-safe summary published to changelog - Includes impact, timeline, resolution, and prevention measures
Sev2/Sev3 (Minor Issues): - Include in regular changelog if user-facing - Internal documentation only if operator-only impact
Changelog Cadence
- Major changes: Immediate entry
- Minor changes: Batched weekly or monthly
- Security updates: Published as appropriate (may delay for responsible disclosure)
Limitations
Communication timelines are best-effort and depend on: - Solo operator availability (no 24/7 coverage) - Incident severity and complexity - Need for coordination with external parties (e.g., infrastructure provider)
Performance Baselines
Purpose
Baselines provide reference points for detecting performance degradation and validating improvements. They are not targets but rather observations of typical performance under normal conditions.
Baseline Measurement Approach
Baselines should be measured: - On production hardware (single VPS, current configuration) - Under typical load (not during crawls or heavy operations) - Multiple samples to account for variance - Documented with measurement date and conditions
Current Baselines
[!NOTE] Initial baselines to be measured and documented during implementation. This section will be updated with actual measurements.
API Response Times (server-side, measured via curl timing):
| Endpoint | Baseline p50 | Baseline p95 | Measured Date |
|---|---|---|---|
GET /api/health | TBD | TBD | TBD |
GET /api/search?q=covid | TBD | TBD | TBD |
GET /api/sources | TBD | TBD | TBD |
GET /api/snapshot/{id} | TBD | TBD | TBD |
Operational Baselines:
| Operation | Baseline Throughput | Measured Date |
|---|---|---|
| Crawl (pages/hour) | TBD | TBD |
| Indexing (records/second) | TBD | TBD |
| Change computation (changes/minute) | TBD | TBD |
Baseline Review
- Semi-annual review: Compare current performance against baselines
- After major changes: Re-baseline if infrastructure or architecture changes
- Drift documentation: Document and investigate significant baseline drift (>20%)
Review and Update Process
Review Cadence
Annual Review: - Assess targets vs actuals for the past year - Evaluate appropriateness of commitments - Update targets if resources or infrastructure change significantly
Triggered Reviews: - After infrastructure changes (e.g., VPS upgrade, migration) - After staffing changes (e.g., additional operators) - After major architectural changes (e.g., HA implementation)
Update Process
- Propose changes: Document in roadmap or ADR
- Review against actuals: Compare proposed targets to historical data
- Update documentation: Revise this document
- Communicate changes: Announce via changelog if user-facing impact
- Update monitoring: Adjust alerts and dashboards to match new targets
Document Maintenance
- Location:
docs/operations/service-levels.md - Owner: Primary operator
- Format: Markdown, version-controlled in healtharchive repo
- Navigation: Linked from operations index and docs site navigation
References
- Production Runbook - Infrastructure details and deployment procedures
- Incident Response Playbook - Incident classification and response procedures
- Monitoring Checklist - Monitoring setup and external checks
- Disaster Recovery - Recovery procedures and RTO/RPO targets
- Ops Cadence Checklist - Routine operational tasks
Changelog
| Date | Change | Rationale |
|---|---|---|
| 2026-01-18 | Initial version | Established baseline service level documentation |