Observability Setup and Maintenance Guide

Scope: Complete setup of the private observability stack (Prometheus, Grafana, Alertmanager) on the production VPS.

Canonical boundary doc (read first): observability-and-private-stats.md


Overview

This guide covers the full observability stack installation in order:

  1. Bootstrap - Filesystem + secrets layout
  2. Install Exporters - node_exporter + postgres_exporter
  3. Configure Prometheus - Metrics collection
  4. Configure Grafana - Dashboards + tailnet access
  5. Provision Dashboards - Automated dashboard deployment
  6. Configure Alerting - Alertmanager + rules
  7. Ongoing Maintenance - Upgrades, rotation, troubleshooting

Architecture: All services bind to loopback only. Access via Tailscale SSH port-forward.


1. Bootstrap (Prerequisites)

Goal: Prepare filesystem + secrets layout without installing services.

Preconditions

  • On the production VPS with sudo access
  • /srv/healtharchive/ exists
  • Ops group exists (usually healtharchive)

Procedure

cd /opt/healtharchive
sudo ./scripts/vps-bootstrap-observability-scaffold.sh

Populate secret files (do NOT store under /srv/healtharchive/ops/):

sudoedit /etc/healtharchive/observability/prometheus_backend_admin_token
sudoedit /etc/healtharchive/observability/grafana_admin_password
sudoedit /etc/healtharchive/observability/postgres_grafana_password

Verify

stat -c '%U:%G %a %n' /srv/healtharchive/ops/observability /srv/healtharchive/ops/observability/*
stat -c '%U:%G %a %n' /etc/healtharchive/observability/*
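The mode listing above can be turned into a pass/fail check. The sketch below assumes that a healthy secret file exists, is non-empty, and grants no permissions to "other" users; that definition is an assumption, and the scaffold script may enforce something stricter. The function name check_secret_file is hypothetical.

```shell
# Sketch: pass/fail check for one secret file.
# Assumption (not from the scaffold script): a healthy secret exists,
# is non-empty, and has no permission bits set for "other" users.
check_secret_file() {
  f="$1"
  [ -s "$f" ] || { echo "FAIL $f: missing or empty"; return 1; }
  mode="$(stat -c '%a' "$f")" || return 1
  case "$mode" in
    *0) echo "OK   $f ($mode)" ;;   # last octal digit 0 = nothing for "other"
    *)  echo "FAIL $f: mode $mode is accessible to other users"; return 1 ;;
  esac
}
```

Run it from a root shell once per file, for example check_secret_file /etc/healtharchive/observability/grafana_admin_password.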

Rollback

sudo rm -rf /srv/healtharchive/ops/observability
sudo rm -rf /etc/healtharchive/observability

2. Install Exporters

Goal: Install node_exporter (host metrics) and postgres_exporter (DB health), loopback-only.

Preconditions

  • Bootstrap complete (directories exist)
  • Postgres running locally

Procedure

cd /opt/healtharchive
./scripts/vps-install-observability-exporters.sh          # Dry-run
sudo ./scripts/vps-install-observability-exporters.sh --apply

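With --apply, the script installs the exporter packages, creates the postgres_exporter DB role and grants it pg_monitor, and forces both exporters to bind to loopback (127.0.0.1:9100 for node_exporter, 127.0.0.1:9187 for postgres_exporter).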

Verify

curl -s http://127.0.0.1:9100/metrics | head
curl -s http://127.0.0.1:9187/metrics | head
ss -lntp | grep -E ':9100|:9187'  # Expect 127.0.0.1 only
systemctl --no-pager status prometheus-node-exporter prometheus-postgres-exporter
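The "expect 127.0.0.1 only" check above relies on reading the ss output by eye. As a rough sketch, the same check can be made to print only the offending listeners. The function name non_loopback_listeners is hypothetical, and it assumes the usual ss -lnt column layout, with Local Address:Port in column 4.

```shell
# Sketch: print any listener on the given ports that is NOT bound to loopback.
# Reads `ss -lnt` (or `ss -lntp`) output on stdin; prints nothing when all is well.
non_loopback_listeners() {
  awk -v ports=" $* " '
    $1 == "LISTEN" {
      if (!match($4, /:[0-9]+$/)) next            # column 4 is Local Address:Port
      host = substr($4, 1, RSTART - 1)
      port = substr($4, RSTART + 1)
      if (index(ports, " " port " ") == 0) next   # not a port we were asked about
      if (host != "127.0.0.1" && host != "[::1]") print $4
    }'
}
```

Usage: ss -lnt | non_loopback_listeners 9100 9187. Empty output means both exporters listen on loopback only. The same function works for the other ports in this guide (3000, 9090, 9093).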

Rollback

sudo systemctl disable --now prometheus-node-exporter prometheus-postgres-exporter || true
sudo rm -rf /etc/systemd/system/prometheus-node-exporter.service.d \
            /etc/systemd/system/prometheus-postgres-exporter.service.d
sudo systemctl daemon-reload
sudo apt-get remove -y prometheus-node-exporter prometheus-postgres-exporter
sudo rm -f /etc/healtharchive/observability/postgres_exporter.env \
           /etc/healtharchive/observability/postgres_exporter_password

3. Configure Prometheus

Goal: Install Prometheus and configure scraping of HealthArchive metrics.

Preconditions

  • Exporters installed and loopback-only
  • /etc/healtharchive/observability/prometheus_backend_admin_token set to backend admin token
  • Backend API reachable: curl -s http://127.0.0.1:8001/api/health

Procedure

cd /opt/healtharchive
./scripts/vps-install-observability-prometheus.sh          # Dry-run
sudo ./scripts/vps-install-observability-prometheus.sh --apply

With --apply, the script installs Prometheus, writes its config to /etc/prometheus/prometheus.yml, forces loopback binding (127.0.0.1:9090), and caps retention by time and size (see Prometheus Retention Tuning in section 7).

Verify

curl -s http://127.0.0.1:9090/-/ready
ss -lntp | grep -E ':9090\b'  # Expect 127.0.0.1 only
curl -s http://127.0.0.1:9090/api/v1/targets | head
curl -s "http://127.0.0.1:9090/api/v1/query?query=up%7Bjob%3D%22healtharchive_backend%22%7D" | head
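The last verify command carries a pre-encoded PromQL query, which is hard to read and easy to mistype. The sketch below shows where that string comes from: it percent-encodes everything except unreserved URL characters. The function name urlencode_query is hypothetical, and the sketch assumes ASCII-only input.

```shell
# Sketch: percent-encode a PromQL expression for use in a query string.
# Assumption: input is ASCII only (true for the queries in this guide).
urlencode_query() {
  printf '%s' "$1" | awk '
    BEGIN { for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i }
    {
      out = ""
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c ~ /[A-Za-z0-9._~-]/) out = out c          # unreserved: keep as-is
        else out = out sprintf("%%%02X", ord[c])        # everything else: %XX
      }
      print out
    }'
}
```

Calling urlencode_query 'up{job="healtharchive_backend"}' reproduces the encoded string used above. Alternatively, curl can do the encoding itself: curl -sG --data-urlencode 'query=up{job="healtharchive_backend"}' http://127.0.0.1:9090/api/v1/query.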

Rollback

sudo systemctl disable --now prometheus.service
sudo rm -rf /etc/systemd/system/prometheus.service.d
sudo systemctl daemon-reload
sudo apt-get remove -y prometheus  # Optional

4. Configure Grafana

Goal: Install Grafana as operator-only dashboard, reachable via tailnet.

Preconditions

  • Prometheus installed on 127.0.0.1:9090
  • Tailscale connected to tailnet
  • Secrets set: grafana_admin_password, postgres_grafana_password

Procedure

cd /opt/healtharchive
./scripts/vps-install-observability-grafana.sh          # Dry-run
sudo ./scripts/vps-install-observability-grafana.sh --apply

With --apply, the script binds Grafana to 127.0.0.1:3000, disables anonymous access, resets the admin password from the grafana_admin_password secret, and creates the grafana_readonly Postgres role.

Access Options

Preferred - SSH port-forward (more private):

ssh -L 3000:127.0.0.1:3000 haadmin@<vps-tailscale-ip>
# Then open http://127.0.0.1:3000

Optional - Tailscale Serve (requires HTTPS certs enabled):

./scripts/vps-enable-tailscale-serve-grafana.sh          # Dry-run
sudo ./scripts/vps-enable-tailscale-serve-grafana.sh --apply
sudo tailscale serve status  # Get HTTPS URL

Configure Data Sources (Grafana UI)

  1. Prometheus: URL http://127.0.0.1:9090
  2. Postgres: Host 127.0.0.1:5432, DB healtharchive, User grafana_readonly, TLS disabled

Verify

ss -lntp | grep -E ':3000\b'  # Expect 127.0.0.1 only
# Test data sources in Grafana UI

Rollback

sudo tailscale serve reset  # If using Serve
sudo systemctl disable --now grafana-server.service
sudo rm -rf /etc/systemd/system/grafana-server.service.d
sudo systemctl daemon-reload

5. Provision Dashboards

Goal: Install ops and usage dashboards reproducibly.

Preconditions

  • Prometheus and Grafana running
  • Data sources configured in Grafana UI, with these exact names:
      • Prometheus: named prometheus
      • Postgres: named grafana-postgresql-datasource

Procedure

cd /opt/healtharchive
git pull
./scripts/vps-install-observability-dashboards.sh          # Dry-run
sudo ./scripts/vps-install-observability-dashboards.sh --apply

Verify

In Grafana, find the HealthArchive folder with these dashboards:

  • HealthArchive - Ops Overview
  • HealthArchive - Ops Console (Read-only)
  • HealthArchive - Pipeline Health
  • HealthArchive - Search Performance
  • HealthArchive - Usage (Private, Aggregate)
  • HealthArchive - Impact Summary (Private, Aggregate)

Troubleshooting

  • Permission errors: Add Grafana to ops group: sudo usermod -aG healtharchive grafana && sudo systemctl restart grafana-server
  • Baseline drift on /etc/grafana/provisioning/dashboards/healtharchive.yaml: Re-run sudo ./scripts/vps-install-observability-dashboards.sh --apply. The dashboards installer is the canonical writer for that provisioning file and restores the expected root:root 0644 state.
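  • "Data source not found": Rename the Grafana data sources to the names the dashboards expect (prometheus and grafana-postgresql-datasource), or edit the dashboard JSON to reference your names.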

Rollback

sudo rm -f /etc/grafana/provisioning/dashboards/healtharchive.yaml
sudo rm -rf /srv/healtharchive/ops/observability/dashboards/healtharchive
sudo systemctl restart grafana-server

6. Configure Alerting

Goal: Get notified about real outages without pager fatigue.

Preconditions

  • Prometheus running
  • Node exporter installed (for disk metrics)
  • If using WARC tiering: sudo systemctl enable --now healtharchive-tiering-metrics.timer

Choose Operator Channel

Create a webhook URL (Discord, Slack, or any HTTPS endpoint accepting Alertmanager JSON):

sudoedit /etc/healtharchive/observability/alertmanager_webhook_url

For Pushover:

sudoedit /etc/healtharchive/observability/pushover_app_token
sudoedit /etc/healtharchive/observability/pushover_user_key
sudo ./scripts/vps-install-observability-pushover-relay.sh --apply
# Set webhook URL to: http://127.0.0.1:9911/alertmanager

Procedure

cd /opt/healtharchive
git pull
./scripts/vps-install-observability-alerting.sh          # Dry-run
sudo ./scripts/vps-install-observability-alerting.sh --apply
# Optional: add --mountpoint <path> when the monitored storage is not on /

Alert Rules (High-Signal Set)

  • Backend scrape down (>5m)
  • Disk usage >80% warning, >90% critical
  • Sustained /api/search errors (traffic-gated)
  • Job failures increased
  • Storage Box mount down (if tiering enabled)
  • WARC tiering bind-mount failed
  • Tiering metrics stale (>2 hours)
  • Tiering hot path unreadable
  • Annual campaign sentinel failed (Jan 01 UTC)
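For a quick local spot check of the disk thresholds in the list above, the following sketch classifies a usage percentage the same way (>80% warning, >90% critical). It is not the alert rule itself, which Prometheus evaluates from node_exporter metrics. The function names are hypothetical, and df rounding may differ slightly from the rule's arithmetic.

```shell
# Sketch: classify a disk-usage percentage with the thresholds listed above.
disk_severity() {
  pct="$(printf '%s' "$1" | tr -d '%')"   # accept "85" or "85%"
  if   [ "$pct" -gt 90 ]; then echo critical
  elif [ "$pct" -gt 80 ]; then echo warning
  else                         echo ok
  fi
}

# Print the Use% column for one mountpoint (POSIX df output: line 2, column 5).
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { print $5 }'
}
```

Usage: disk_severity "$(disk_used_pct /)".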

Default Routing Behavior

scripts/vps-install-observability-alerting.sh installs Alertmanager routing with:

  • severity=critical: includes resolved notifications, repeats every 6h.
  • severity=warning/severity=info: no resolved notifications, repeats every 24h.
  • severity=drill: routed to null receiver (visible in UI, no operator notification).

This keeps critical outage signals immediate while reducing warning-level notification churn.
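To make the routing behavior concrete, here is a sketch of an Alertmanager config with the same shape. It is not the file the installer writes. The receiver names are hypothetical, and url_file is assumed to be available, which requires a recent Alertmanager release (older releases need url instead). The function only writes the sketch to a path you choose, so it can be inspected without touching the live config.

```shell
# Sketch: write an illustrative Alertmanager routing config to the given path.
# Assumptions: receiver names are made up; url_file needs a recent Alertmanager.
write_routing_sketch() {
  cat > "$1" <<'EOF'
route:
  receiver: operator-default          # warning/info: no resolved notices, repeat every 24h
  repeat_interval: 24h
  routes:
    - matchers: ['severity="drill"']
      receiver: "null"                # visible in the UI, no notification
    - matchers: ['severity="critical"']
      receiver: operator-critical
      repeat_interval: 6h
receivers:
  - name: "null"
  - name: operator-critical
    webhook_configs:
      - url_file: /etc/healtharchive/observability/alertmanager_webhook_url
        send_resolved: true
  - name: operator-default
    webhook_configs:
      - url_file: /etc/healtharchive/observability/alertmanager_webhook_url
        send_resolved: false
EOF
}
```

After write_routing_sketch /tmp/alertmanager-sketch.yml, the file can be validated with amtool check-config /tmp/alertmanager-sketch.yml.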

Verify

curl -s http://127.0.0.1:9093/-/ready
curl -s http://127.0.0.1:9090/api/v1/rules | head
ss -lntp | grep -E ':9093\b|:9090\b'  # Expect 127.0.0.1 only

Test Delivery

amtool --alertmanager.url=http://127.0.0.1:9093 alert add HealthArchiveTestAlert severity=warning service=healtharchive
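If amtool is unavailable, a test alert can also be posted to Alertmanager's v2 HTTP API. The sketch below only builds the JSON body; the function name test_alert_payload is hypothetical, and it assumes label values that need no JSON escaping.

```shell
# Sketch: build the JSON body for POST /api/v2/alerts.
# Assumption: alert name and severity contain no characters that need JSON escaping.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","service":"healtharchive"}}]\n' "$1" "$2"
}
```

Usage: test_alert_payload HealthArchiveTestAlert drill | curl -s -H 'Content-Type: application/json' --data-binary @- http://127.0.0.1:9093/api/v2/alerts. With severity=drill, the default routing sends the alert to the null receiver, so it appears in the UI without notifying the operator; use severity=warning to exercise real delivery.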

Rollback

sudo systemctl disable --now prometheus-alertmanager.service || \
  sudo systemctl disable --now alertmanager.service
sudo rm -f /etc/prometheus/rules/healtharchive-alerts.yml
sudo systemctl restart prometheus.service

7. Ongoing Maintenance

On VPS:

cd /opt/healtharchive
./scripts/vps-verify-observability.sh

From laptop (tailnet SSH tunnel):

ssh -N \
  -L 3000:127.0.0.1:3000 \
  -L 9090:127.0.0.1:9090 \
  -L 8002:127.0.0.1:8002 \
  haadmin@<vps-tailscale-ip>

Then open Grafana (http://127.0.0.1:3000/) and check:

  • HealthArchive - Ops Overview
  • HealthArchive - Pipeline Health
  • HealthArchive - Usage (Private, Aggregate)

Quarterly Upgrade Cadence

cd /opt/healtharchive
sudo apt-get update && sudo apt-get -y upgrade
sudo systemctl restart prometheus prometheus-alertmanager \
  prometheus-node-exporter prometheus-postgres-exporter \
  grafana-server healtharchive-pushover-relay healtharchive-admin-proxy
./scripts/vps-verify-observability.sh

Dashboard Updates

cd /opt/healtharchive
git pull
sudo ./scripts/vps-install-observability-dashboards.sh --apply

Credential Rotation

Run the script commands in this section from /opt/healtharchive.

Backend admin token:

# Update /etc/healtharchive/backend.env and /etc/healtharchive/observability/prometheus_backend_admin_token
sudo systemctl restart healtharchive-api prometheus healtharchive-admin-proxy
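Because the token lives in two files, a rotation that updates only one of them leaves Prometheus unable to scrape the backend. The sketch below compares the two copies without printing either. The function name tokens_match is hypothetical, and the variable name inside backend.env is not given in this guide, so it is passed as an argument.

```shell
# Sketch: succeed only if the token in an env file equals the token file's content.
# Arguments: <env_file> <VAR_NAME> <token_file>
# Assumption: the env file holds a line of the form VAR_NAME=value (optionally double-quoted).
tokens_match() {
  env_token="$(sed -n "s/^$2=//p" "$1" | head -n 1 | tr -d '"\r')"
  file_token="$(tr -d '\r\n' < "$3")"
  [ -n "$env_token" ] && [ "$env_token" = "$file_token" ]
}
```

Usage, from a root shell and with a placeholder variable name: tokens_match /etc/healtharchive/backend.env YOUR_TOKEN_VAR /etc/healtharchive/observability/prometheus_backend_admin_token && echo in-sync.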

Grafana admin password:

# Update /etc/healtharchive/observability/grafana_admin_password
sudo ./scripts/vps-install-observability-grafana.sh --apply --skip-apt --skip-db-role

Grafana Postgres password:

# Update /etc/healtharchive/observability/postgres_grafana_password
sudo ./scripts/vps-install-observability-grafana.sh --apply --skip-apt
# Update data source in Grafana UI if needed

Alert webhook URL:

# Update /etc/healtharchive/observability/alertmanager_webhook_url
sudo ./scripts/vps-install-observability-alerting.sh --apply

Prometheus Retention Tuning

cd /opt/healtharchive
sudo ./scripts/vps-install-observability-prometheus.sh --apply --skip-apt \
  --retention-time 15d --retention-size 1GB
curl -s http://127.0.0.1:9090/-/ready

Troubleshooting

# Service status
systemctl status grafana-server prometheus prometheus-alertmanager --no-pager -l

# Verify loopback-only binding
ss -lntp | grep -E ':3000|:8002|:9090|:9093|:9100|:9187|:9911'

# Check Prometheus targets
curl -s http://127.0.0.1:9090/api/v1/targets | head
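Piping the targets response through head shows raw JSON, which is hard to scan. As a rough alternative, the sketch below counts targets by their health field using only grep, sort, uniq, and awk. The function name target_health_summary is hypothetical, and it assumes the compact JSON (no spaces around the colon) that Prometheus normally returns.

```shell
# Sketch: count Prometheus targets by health from /api/v1/targets JSON on stdin.
# Assumption: compact JSON, i.e. fields appear as "health":"up" with no spaces.
target_health_summary() {
  grep -o '"health":"[a-z]*"' | sort | uniq -c |
    awk '{ gsub(/"health":|"/, "", $2); print $2 "=" $1 }'
}
```

Usage: curl -s http://127.0.0.1:9090/api/v1/targets | target_health_summary. Any line other than up=N points at a target worth inspecting in the full response.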

See Also