Runbook
Public endpoint down
Trigger symptoms
- `EndpointDown` (warn, 2m) or `EndpointDownCritical` (page, 5m) firing.
- `probe_success == 0` for one or more instances in `blackbox_http_probes`.
- Customers may be unable to load the site / app.
First checks
first-checks.sh
# 1. Which endpoints are failing? Pull labels from the alert in Alertmanager
# or hit Prometheus directly.
curl -s 'http://prometheus:9090/api/v1/query?query=probe_success==0' \
| jq '.data.result[] | .metric.instance'
# 2. Is the container even running? -a is needed because plain `docker ps`
# hides exited containers.
docker ps -a --filter "name=via_prod" --format 'table {{.Names}}\t{{.Status}}' \
| grep -v ' Up ' # anything listed below the header row is restarting / dead
# 3. Recent logs from the affected service.
docker logs --tail 200 via_prod-<service> 2>&1 | tail -60
# 4. Traefik routing for that public hostname.
docker exec via_prod-traefik wget -qO- http://localhost:8080/api/http/routers \
| jq '.[] | select(.rule | contains("api.via-basket.com"))'
# 5. TLS handshake (if HTTPS endpoint).
echo | openssl s_client -connect api.via-basket.com:443 -servername api.via-basket.com 2>&1 \
| grep -E 'subject=|issuer=|Verify return code'
Immediate mitigation
- Container down: `docker compose -f docker-compose.prod.yml up -d <service>`. Tail logs to understand the original crash.
- Crash loop: Read the last 200 log lines for the root exception. Don't restart blindly; you'll just reset the loop counter.
- Traefik routing missing: Confirm the service has the correct `traefik.*` labels, then restart Traefik (`docker compose -f docker-compose.prod.yml restart traefik`).
- TLS cert expired: Inspect Traefik logs for ACME failures; force-renew via the Traefik dashboard or, as a last resort, wipe `acme.json` and restart Traefik (Let's Encrypt rate limits apply to reissuance).
- DNS issue: run `dig api.via-basket.com` from outside the EC2 host.
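The DNS check in the last bullet can be made a little more systematic by asking several public resolvers and comparing their answers. The following is a minimal sketch, not part of the original runbook: `resolve_via` and `dns_consistent` are hypothetical helper names, and the resolvers in the usage line (1.1.1.1, 8.8.8.8) are assumptions, so substitute whichever resolvers you trust.

```shell
#!/usr/bin/env bash
# Sketch: compare the A records several resolvers return for one hostname.
# Function names are hypothetical; resolver addresses are supplied by the caller.

# Print the sorted answers for host $1 as seen by resolver $2.
resolve_via() {
  dig +short A "$1" "@$2" | sort
}

# dns_consistent <host> <resolver>...
# Returns 0 if every resolver gives the same non-empty answer, 1 otherwise.
dns_consistent() {
  local host=$1 first='' answer resolver
  shift
  for resolver in "$@"; do
    answer=$(resolve_via "$host" "$resolver")
    if [ -z "$answer" ]; then
      echo "$resolver: no A record for $host"
      return 1
    fi
    if [ -z "$first" ]; then
      first=$answer
    elif [ "$answer" != "$first" ]; then
      echo "$resolver: answer differs from first resolver"
      return 1
    fi
  done
  echo "$host: consistent across $# resolver(s)"
}
```

Run it from a machine outside the EC2 host, for example `dns_consistent api.via-basket.com 1.1.1.1 8.8.8.8`. A mismatch is a prompt to look closer, not proof of a fault: DNS-based load balancing can legitimately hand different resolvers different addresses.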
Escalation path
- Warn tier (2m): handled by on-call via #ops-alerts.
- Critical tier (5m): page on-call via PagerDuty.
- If a primary host (api/admin/landing) is down > 10 min: post to status page.
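To make the tiers above concrete, here is a small sketch that maps how long an endpoint has been down to the step the list calls for. `escalation_step` is a hypothetical helper, and it assumes the 2m and 5m alert windows can be read as time since the outage began, which may not match how the alert rules are actually evaluated.

```shell
#!/usr/bin/env bash
# Sketch: map outage duration to the escalation step listed in this runbook.
# escalation_step is a hypothetical helper, not an existing tool.
# $1 = seconds the endpoint has been down
# $2 = "primary" for api/admin/landing hosts (default: secondary)
escalation_step() {
  local down=$1 kind=${2:-secondary}
  if [ "$down" -gt 600 ] && [ "$kind" = primary ]; then
    echo "page on-call and post to status page"
  elif [ "$down" -ge 300 ]; then
    echo "page on-call via PagerDuty"
  elif [ "$down" -ge 120 ]; then
    echo "on-call handles via #ops-alerts"
  else
    echo "no alert yet"
  fi
}
```

For instance, `escalation_step 700 primary` reports that the status page should be updated, while the same duration for a non-primary host only reaches the paging tier.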
Recovery / rollback
- If the outage started right after a deploy: redeploy the previous image tag via GitHub Actions (`workflow_dispatch` on the deploy workflow with the prior SHA).
- After recovery, watch `probe_success` for 5 minutes before closing the incident — flapping endpoints often re-fail in the first window.
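The five-minute watch on `probe_success` can be scripted rather than eyeballed. The sketch below is one possible approach under stated assumptions: `failing_probes` and `watch_probe` are hypothetical helper names, the Prometheus address is the one used in first-checks.sh, and the sample count and interval in the usage line are illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: confirm probes stay green for a whole observation window.
# failing_probes and watch_probe are hypothetical helper names.

# Print how many series currently report probe_success == 0.
# Assumes the Prometheus address used in first-checks.sh.
failing_probes() {
  curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=count(probe_success == 0) or vector(0)' \
    | jq -r '.data.result[0].value[1]'
}

# watch_probe <fetch_cmd> <samples> <interval_seconds>
# Returns 0 if every sample is 0, 1 on the first failing sample,
# and 2 if the fetch command fails or prints nothing.
watch_probe() {
  local fetch_cmd=$1 samples=$2 interval=$3 i=1 failing
  while [ "$i" -le "$samples" ]; do
    failing=$("$fetch_cmd") || return 2
    [ -n "$failing" ] || return 2
    if [ "$failing" != "0" ]; then
      echo "sample $i: $failing probe(s) failing"
      return 1
    fi
    echo "sample $i: ok"
    # Skip the pause after the final sample.
    if [ "$i" -lt "$samples" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
}
```

`watch_probe failing_probes 20 15` takes 20 samples 15 seconds apart, roughly five minutes in total, and stops at the first failure so a flapping endpoint is caught before the incident is closed. Passing the fetch command as an argument keeps the loop independent of Prometheus, so a different data source can be swapped in.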