Runbook
Event outbox backlog
Trigger symptoms
- `EventOutboxBacklog` (warn, > 500 for 5m) or `EventOutboxBacklogCritical` (page, > 2000 for 2m).
- `rabbitmq_queue_messages{queue=~".*outbox.*"}` rising and not draining.
- Domain events from producers (orders, payments, fleet) are not landing in their downstream consumers.
First checks
first-checks.sh

    # 1. Which queue, how deep, and is anyone consuming?
    docker exec via_prod-rabbitmq rabbitmqctl list_queues name messages consumers \
      | sort -k2 -n -r | head -20

    # 2. Identify the consumer service from the queue name (convention:
    #    "<consumer>.<event-routing-key>"). Then check its container.
    docker logs --tail 200 via_prod-<consumer-service> 2>&1 | tail -80

    # 3. Is the downstream DB healthy? Pool exhaustion is the #1 cause.
    curl -s http://prometheus:9090/api/v1/query \
      --data-urlencode 'query=db_pool_connections_in_use / db_pool_connections_max' \
      | jq '.data.result[] | select(.value[1] | tonumber > 0.9)'

    # 4. Memory pressure on the consumer pod.
    docker stats --no-stream via_prod-<consumer-service>
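Steps 1 and 2 can be partly automated. The sketch below is a minimal example, not an existing script: it assumes `rabbitmqctl list_queues` prints tab-separated columns, assumes the consumer name is everything before the first dot in the queue name, and reuses the warn threshold of 500. The names `parse_list_queues`, `consumer_of`, and `suspects` are hypothetical.

```python
from typing import NamedTuple


class QueueStat(NamedTuple):
    name: str
    messages: int
    consumers: int


def parse_list_queues(output: str) -> list[QueueStat]:
    """Parse `name<TAB>messages<TAB>consumers` rows, skipping banner lines."""
    stats = []
    for line in output.splitlines():
        parts = line.split("\t")
        # Banner and header lines do not have three columns with numeric counts.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        stats.append(QueueStat(parts[0], int(parts[1]), int(parts[2])))
    return stats


def consumer_of(queue_name: str) -> str:
    """Assumed reading of "<consumer>.<event-routing-key>": text before the first dot."""
    return queue_name.split(".", 1)[0]


def suspects(stats, min_depth=500):
    """Queues deeper than min_depth, deepest first; last field is True if nobody consumes."""
    deep = sorted(
        (s for s in stats if s.messages > min_depth),
        key=lambda s: s.messages,
        reverse=True,
    )
    return [(consumer_of(s.name), s.name, s.messages, s.consumers == 0) for s in deep]
```

Feed it the raw output of step 1 (without the `sort | head` pipeline); each returned tuple names the consumer service to inspect in step 2.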
Immediate mitigation
- Consumer down / OOM: `docker compose -f docker-compose.prod.yml up -d <consumer>`. Bump the memory limit if OOM recurs (`mem_limit:` in compose file).
- Consumer crash loop on a poison message: Identify the offending payload from logs, then dead-letter it manually so the queue can drain (publish a one-shot redirect, or `rabbitmqctl purge_queue` as a last resort; known data loss).
- Downstream DB saturated: Open the `db-pool-exhausted` runbook — once the pool recovers, the consumer drains automatically.
- Producer storm: Check `rate(rabbitmq_queue_messages_published_total[5m])` — a sudden 10× spike means a producer regression. Roll the producer service, not the consumer.
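The four cases above amount to a small decision table. The following is a rough sketch of that table, not a tool that exists in the repo: the `Findings` fields, the order in which causes are checked, and the returned strings are all assumptions. The 0.9 pool threshold comes from first check 3 and the 10× factor from the producer-storm bullet.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    consumers: int                # consumer count from `rabbitmqctl list_queues`
    crash_looping: bool           # judged from the consumer logs
    pool_utilisation: float       # db_pool_connections_in_use / db_pool_connections_max
    publish_rate: float           # current rate(rabbitmq_queue_messages_published_total[5m])
    baseline_publish_rate: float  # same query over a normal period


def pick_mitigation(f: Findings) -> str:
    """Map first-check findings to one mitigation (check order is an assumption)."""
    # A crash-looping consumer can also show zero consumers, so test it first.
    if f.crash_looping:
        return "dead-letter the poison message"
    if f.consumers == 0:
        return "restart the consumer"
    if f.pool_utilisation > 0.9:
        return "follow the db-pool-exhausted runbook"
    if f.baseline_publish_rate > 0 and f.publish_rate >= 10 * f.baseline_publish_rate:
        return "producer storm: act on the producer, not the consumer"
    return "no listed cause matches; escalate"
```

The function only encodes the order of questions to ask; the actual commands remain the ones listed in the bullets.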
Escalation path
- Warn tier: on-call handles via #ops-alerts.
- Critical tier: PagerDuty + page the consumer-service owner.
- If `orders.*` or `payments.*` events are stuck for more than 10 min, the financial flow is degraded: escalate to finance + CTO.
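The escalation rules can be written down as a lookup, for example to drive alert annotations. This is an illustrative sketch only: the function name `escalation`, the tier strings, and the assumption that the routing key is the part of the queue name after the first dot are not taken from any existing tooling.

```python
def escalation(tier: str, queue_name: str, stuck_minutes: float) -> list[str]:
    """Return who to involve for a backlog alert, following the rules above."""
    if tier == "warn":
        targets = ["on-call via #ops-alerts"]
    else:
        targets = ["PagerDuty", "consumer-service owner"]

    # Assumed queue naming: "<consumer>.<event-routing-key>".
    routing_key = queue_name.split(".", 1)[1] if "." in queue_name else ""
    if routing_key.startswith(("orders.", "payments.")) and stuck_minutes > 10:
        targets += ["finance", "CTO"]
    return targets
```

For a critical alert on a hypothetical queue `billing.payments.captured` stuck for 12 minutes, this returns PagerDuty, the consumer-service owner, finance, and the CTO.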
Recovery / rollback
- After consumer recovers, watch the queue depth drop. Drain rate ≈ `(consumers × prefetch_count × processing_rate)`. If draining < 100 msg/s on a backlog > 10k, scale consumers (raise replica count or prefetch).
- Never `purge_queue` without explicit lost-data acknowledgement. Outbox events back DB writes — losing them desyncs read models.
- If poison messages were dead-lettered to DLX during the incident, replay them via `python scripts/dlx_replay.py` after the consumer is fixed.
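To decide quickly whether to scale, the drain-rate rule of thumb can be turned into arithmetic. The sketch below assumes `processing_rate` is measured in messages per second per in-flight message, which the runbook does not state. The `publish_rate` term is an addition of this sketch, covering producers that are still publishing during the drain. All function names are hypothetical.

```python
from typing import Optional


def drain_rate(consumers: int, prefetch_count: int, processing_rate: float) -> float:
    """Approximate drain rate in msg/s: consumers x prefetch_count x processing_rate."""
    return consumers * prefetch_count * processing_rate


def should_scale(backlog: int, rate: float) -> bool:
    """Runbook rule: scale when draining < 100 msg/s on a backlog > 10k."""
    return backlog > 10_000 and rate < 100


def eta_seconds(backlog: int, rate: float, publish_rate: float = 0.0) -> Optional[float]:
    """Seconds until the queue is empty, or None if it is not draining at all."""
    net = rate - publish_rate
    if net <= 0:
        return None
    return backlog / net


rate = drain_rate(consumers=2, prefetch_count=10, processing_rate=2.5)  # 50.0 msg/s
print(should_scale(12_000, rate))  # True: under 100 msg/s with more than 10k queued
print(eta_seconds(12_000, rate))   # 240.0 seconds if nothing new is published
```

If `eta_seconds` returns `None`, inflow matches or exceeds the drain rate and adding consumers alone may not help; check for a producer storm first.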