Event outbox backlog

Trigger symptoms

`EventOutboxBacklog` (warn, > 500 for 5m) or `EventOutboxBacklogCritical` (page, > 2000 for 2m).
`rabbitmq_queue_messages{queue=~".*outbox.*"}` rising and not draining.
Domain events from producers (orders, payments, fleet) are not landing in their downstream consumers.
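
As a quick way to reason about a depth reading, the two alert thresholds above can be sketched as a small shell helper. The function name `outbox_tier` is hypothetical, and the sketch deliberately ignores the `for 5m` / `for 2m` hold durations, so a short spike that this helper labels `critical` would not necessarily page.

```shell
# Hypothetical helper: map a queue depth (integer) to the alert tier it
# would fall into, using the thresholds quoted in this runbook.
# Assumption: the hold durations (5m warn, 2m critical) are not modelled.
outbox_tier() {
  local depth="$1"
  if [ "$depth" -gt 2000 ]; then
    echo "critical"
  elif [ "$depth" -gt 500 ]; then
    echo "warn"
  else
    echo "ok"
  fi
}
```

For example, `outbox_tier 1200` falls in the warn band because 1200 is above 500 but not above 2000.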

First checks

first-checks.sh

# 1. Which queue, how deep, and is anyone consuming?
docker exec via_prod-rabbitmq rabbitmqctl list_queues name messages consumers \
  | sort -k2 -n -r | head -20

# 2. Identify the consumer service from the queue name (convention:
#    "<consumer>.<event-routing-key>"). Then check its container.
docker logs --tail 200 via_prod-<consumer-service> 2>&1 | tail -80

# 3. Is the downstream DB healthy? Pool exhaustion is the #1 cause.
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=db_pool_connections_in_use / db_pool_connections_max' \
  | jq '.data.result[] | select(.value[1] | tonumber > 0.9)'

# 4. Memory pressure on the consumer container.
docker stats --no-stream via_prod-<consumer-service>
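
Check 1 answers "is anyone consuming?" by eye. A minimal sketch of automating that read follows, assuming the three-column `name messages consumers` output requested above. The function name `stalled_queues` is hypothetical, and it assumes queue names contain no whitespace.

```shell
# Hypothetical filter: read `rabbitmqctl list_queues name messages consumers`
# output on stdin and print "name messages" for every queue that holds
# messages but has zero consumers attached.
# Rows whose 2nd and 3rd fields are not plain integers (banner and header
# lines) are skipped.
stalled_queues() {
  awk '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && ($2 + 0) > 0 && ($3 + 0) == 0 {
    print $1, $2
  }'
}
```

It could be used as `docker exec via_prod-rabbitmq rabbitmqctl list_queues name messages consumers | stalled_queues`. An empty result means every non-empty queue has at least one consumer, which points away from "consumer down" and toward a slow or blocked consumer.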

Immediate mitigation

Consumer down / OOM: `docker compose -f docker-compose.prod.yml up -d <consumer>`. Bump the memory limit if OOM recurs (`mem_limit:` in the compose file).
Consumer crash loop on a poison message: Identify the offending payload from the consumer logs, then dead-letter it manually so the queue can drain (publish a one-shot redirect). Use `rabbitmqctl purge_queue` only as a last resort: it deletes every message in the queue, which is known data loss.
Downstream DB saturated: Open the `db-pool-exhausted` runbook — once the pool recovers, the consumer drains automatically.
Producer storm: Check `rate(rabbitmq_queue_messages_published_total[5m])`. A sudden 10× spike points to a producer regression. Roll the producer service, not the consumer.
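
The producer-storm check compares the current publish rate with a normal one. A minimal sketch of that comparison is below. The function name `producer_rate_verdict` is hypothetical, and where the baseline comes from is an assumption (for instance, the same PromQL expression evaluated with an `offset` of an hour or a day).

```shell
# Hypothetical helper: compare the current publish rate with a baseline
# (both in msg/s) and print "storm" when the current rate is at least 10x
# the baseline, otherwise "normal".
# Assumption: a baseline of 0 is treated as "no data" and never reports a storm.
producer_rate_verdict() {
  awk -v cur="$1" -v base="$2" 'BEGIN {
    if (base > 0 && cur >= 10 * base) print "storm"; else print "normal"
  }'
}
```

A `storm` verdict only says the ratio crossed 10×. Confirm against recent producer deploys before rolling anything.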

Escalation path

Warn tier: on-call handles via #ops-alerts.
Critical tier: PagerDuty + page the consumer-service owner.
If `orders.*` or `payments.*` events are stuck for > 10 min: financial flow is degraded, escalate to finance + CTO.

Recovery / rollback

After the consumer recovers, watch the queue depth drop. Drain rate ≈ `(consumers × prefetch_count × processing_rate)`. If the queue drains at less than 100 msg/s with a backlog above 10k messages, scale consumers (raise the replica count or the prefetch).
Never `purge_queue` without explicit lost-data acknowledgement. Outbox events back DB writes — losing them desyncs read models.
If poison messages were dead-lettered to DLX during the incident, replay them via `python scripts/dlx_replay.py` after the consumer is fixed.
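
To judge whether the backlog will clear in reasonable time, the arithmetic above can be sketched as two small helpers. Both function names are hypothetical. The ETA uses measured consume and publish rates rather than the rule-of-thumb formula, on the assumption that both rates stay roughly constant while the queue drains.

```shell
# Hypothetical helper: seconds until the backlog is empty, given the backlog
# size and the measured consume and publish rates (msg/s).
# Prints "never" when consumers are not outpacing producers.
drain_eta_seconds() {
  awk -v backlog="$1" -v out="$2" -v in_rate="$3" 'BEGIN {
    net = out - in_rate
    if (net <= 0) { print "never"; exit }
    printf "%.0f\n", backlog / net
  }'
}

# Hypothetical helper: apply this runbook's scaling rule. Prints "scale" when
# the backlog is above 10000 messages and the drain rate is below 100 msg/s,
# otherwise "ok".
scale_hint() {
  awk -v backlog="$1" -v drain="$2" 'BEGIN {
    if (backlog > 10000 && drain < 100) print "scale"; else print "ok"
  }'
}
```

For instance, a 12000-message backlog consumed at 150 msg/s while producers still publish 50 msg/s has a net drain of 100 msg/s, so `drain_eta_seconds 12000 150 50` prints `120`.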
