Public endpoint down

Trigger symptoms

`EndpointDown` (warn, 2m) or `EndpointDownCritical` (page, 5m) firing.
`probe_success == 0` for one or more instances in `blackbox_http_probes`.
Customers may be unable to load the site / app.

First checks

first-checks.sh

# 1. Which endpoints are failing? Pull labels from the alert in Alertmanager
#    or hit Prometheus directly.
curl -s 'http://prometheus:9090/api/v1/query?query=probe_success==0' \
  | jq '.data.result[] | .metric.instance'

# 2. Is the container even running?
docker ps --filter "name=via_prod" --format 'table {{.Names}}\t{{.Status}}' \
  | grep -v ' Up '   # anything listed here is restarting / dead

# 3. Recent logs from the affected service.
docker logs --tail 200 via_prod-<service> 2>&1 | tail -60

# 4. Traefik routing for that public hostname.
docker exec via_prod-traefik wget -qO- http://localhost:8080/api/http/routers \
  | jq '.[] | select(.rule | contains("api.via-basket.com"))'

# 5. TLS handshake (if HTTPS endpoint).
echo | openssl s_client -connect api.via-basket.com:443 -servername api.via-basket.com 2>&1 \
  | grep -E 'subject=|issuer=|Verify return code'

Immediate mitigation

Container down: docker compose -f docker-compose.prod.yml up -d <service>. Tail logs to understand the original crash.
Crash loop: Read the last 200 log lines for the root exception. Don’t restart blindly — you’ll just reset the loop counter.
Traefik routing missing: Confirm the service has correct `traefik.*` labels, then reload Traefik (`docker compose restart traefik`).
TLS cert expired: Inspect Traefik logs for ACME failures; force-renew via the Traefik dashboard or wipe `acme.json` and restart (last resort — rate-limited by LE).
DNS issue: dig api.via-basket.com from outside the EC2 host.

Escalation path

Warn tier (2m): handled by on-call via #ops-alerts.
Critical tier (5m): page on-call via PagerDuty.
If a primary host (api/admin/landing) is down > 10 min: post to status page.

Recovery / rollback

If the outage started right after a deploy: redeploy the previous image tag via GitHub Actions (`workflow_dispatch` on the deploy workflow with the prior SHA).
After recovery, watch `probe_success` for 5 minutes before closing the incident — flapping endpoints often re-fail in the first window.

Loading...جاري التحميل...

All runbooks

Runbook

Public endpoint down

Trigger symptoms

`EndpointDown` (warn, 2m) or `EndpointDownCritical` (page, 5m) firing.
`probe_success == 0` for one or more instances in `blackbox_http_probes`.
Customers may be unable to load the site / app.

First checks

first-checks.sh

# 1. Which endpoints are failing? Pull labels from the alert in Alertmanager
#    or hit Prometheus directly.
curl -s 'http://prometheus:9090/api/v1/query?query=probe_success==0' \
  | jq '.data.result[] | .metric.instance'

# 2. Is the container even running?
docker ps --filter "name=via_prod" --format 'table {{.Names}}\t{{.Status}}' \
  | grep -v ' Up '   # anything listed here is restarting / dead

# 3. Recent logs from the affected service.
docker logs --tail 200 via_prod-<service> 2>&1 | tail -60

# 4. Traefik routing for that public hostname.
docker exec via_prod-traefik wget -qO- http://localhost:8080/api/http/routers \
  | jq '.[] | select(.rule | contains("api.via-basket.com"))'

# 5. TLS handshake (if HTTPS endpoint).
echo | openssl s_client -connect api.via-basket.com:443 -servername api.via-basket.com 2>&1 \
  | grep -E 'subject=|issuer=|Verify return code'

Immediate mitigation

Container down: docker compose -f docker-compose.prod.yml up -d <service>. Tail logs to understand the original crash.
Crash loop: Read the last 200 log lines for the root exception. Don’t restart blindly — you’ll just reset the loop counter.
Traefik routing missing: Confirm the service has correct `traefik.*` labels, then reload Traefik (`docker compose restart traefik`).
TLS cert expired: Inspect Traefik logs for ACME failures; force-renew via the Traefik dashboard or wipe `acme.json` and restart (last resort — rate-limited by LE).
DNS issue: dig api.via-basket.com from outside the EC2 host.

Escalation path

Warn tier (2m): handled by on-call via #ops-alerts.
Critical tier (5m): page on-call via PagerDuty.
If a primary host (api/admin/landing) is down > 10 min: post to status page.

Recovery / rollback

If the outage started right after a deploy: redeploy the previous image tag via GitHub Actions (`workflow_dispatch` on the deploy workflow with the prior SHA).
After recovery, watch `probe_success` for 5 minutes before closing the incident — flapping endpoints often re-fail in the first window.