Incident runbooks
These pages are linked from Prometheus alert rules. Each one is deliberately terse and follows the same order: symptoms, checks, remediation, post-incident.
For live incidents: check `#ops-alerts` in Slack for the full alert payload, then open the matching runbook and start at the step that fits the current state of the incident.
Ledger chain broken (writes suspended)
Hash-chain integrity check failed — every payment / approval write is blocked.
Rate limiter failing open
API-key rate limiter fell back to allow-all because Redis is unreachable.
High error rate
A service is returning 5xx above the configured threshold.
Public endpoint down
A blackbox-probed endpoint is failing — customer-visible if the host is api/admin/landing.
Event outbox backlog
The domain-event outbox is growing faster than the downstream consumer drains it.
Queue stuck — zero consumers
A queue has messages but no consumer attached — worker container down or disconnected.
Sustained DLX traffic
Events are dead-lettering at a rate above the noise floor; a consumer is likely stuck in an exception loop.
Disk full on host
EBS volume on the prod EC2 instance is above 85% full; containers are at risk of eviction.
DB connection pool exhausted
A service is opening PG connections faster than it releases them.
Alertmanager silenced
Alerts are being suppressed globally — verify no stale silence survived past its window.
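
The "Ledger chain broken" entry above refers to a hash-chain integrity check. As a rough illustration of what such a check does, here is a minimal Python sketch. The record layout (`prev_hash`, `payload`, `hash`), the use of SHA-256, the all-zero genesis value, and every function name are assumptions made for this example; the real ledger may differ on all of them.

```python
import hashlib
import json
from typing import Optional

# Hypothetical genesis value: the "previous hash" of the very first entry.
GENESIS = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry from its predecessor's hash plus a canonical JSON body."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append a payload, linking it to the hash of the current tail."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"prev_hash": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    )


def first_broken_index(chain: list) -> Optional[int]:
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        link_ok = entry["prev_hash"] == prev
        hash_ok = entry["hash"] == entry_hash(prev, entry["payload"])
        if not (link_ok and hash_ok):
            return i
        prev = entry["hash"]
    return None


if __name__ == "__main__":
    ledger: list = []
    for amount in (100, 250, 75):
        append(ledger, {"type": "payment", "amount": amount})
    print(first_broken_index(ledger))   # intact chain

    ledger[1]["payload"]["amount"] = 999  # simulate tampering
    print(first_broken_index(ledger))   # index of the tampered entry
```

Because each hash covers the previous one, editing any entry invalidates that entry and breaks the link for everything after it. That is one plausible reason an alert like this would suspend all writes instead of isolating a single row.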