Queue stuck — zero consumers

Trigger symptoms

`ConsumerStuck` alert active (page-tier).
`rabbitmq_queue_consumers == 0 AND rabbitmq_queue_messages_ready > 10`.
Distinct from `outbox-backlog`: that runbook covers slow consumers; here *no* consumer is attached at all.

First checks

first-checks.sh

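# 1. Which queue is orphaned? Compare numerically: quoting "10" makes
#    awk compare as strings, so a queue with 9 messages would match.
docker exec via_prod-rabbitmq rabbitmqctl list_queues name messages consumers \
  | awk '$3 == 0 && $2 > 10 {print}'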

# 2. Map queue → consumer service. Convention: queue name carries the
#    consumer's service prefix (e.g. "notification.order.placed" →
#    notification-service consumes it).
QUEUE_NAME=<from step 1>

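# 3. Is the consumer container even running? -a also lists exited
#    containers. `docker ps` has no RestartCount field, so read the
#    restart count from `docker inspect` instead.
docker ps -a --filter "name=via_prod-<consumer>" --format \
  'table {{.Names}}\t{{.Status}}'
docker inspect --format '{{.Name}} restarts={{.RestartCount}}' via_prod-<consumer>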

# 4. Why did it die?
docker logs --since 30m via_prod-<consumer> 2>&1 | tail -100

# 5. RabbitMQ connection-side view (does the broker think the consumer
#    ever connected?).
docker exec via_prod-rabbitmq rabbitmqctl list_connections name client_properties \
  | grep -i <consumer>
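
Steps 1 and 2 above can be sketched as two small helpers. This is a minimal sketch, not part of `first-checks.sh`: the function names `orphaned_queues` and `consumer_for_queue` are hypothetical, and the `-service` suffix is an assumption inferred from the `notification.order.placed` example, so adjust it if your service names differ.

```shell
# Reads `rabbitmqctl list_queues name messages consumers` output on stdin
# and prints the names of queues with zero consumers and more than
# THRESHOLD messages (first argument, default 10 to mirror the alert).
# Banner and header lines are skipped because their third column is not
# a number.
orphaned_queues() {
  awk -v threshold="${1:-10}" \
    '$3 ~ /^[0-9]+$/ && $3 + 0 == 0 && $2 + 0 > threshold + 0 { print $1 }'
}

# Maps a queue name to a service name using the prefix convention:
# everything before the first dot, plus an assumed "-service" suffix.
consumer_for_queue() {
  printf '%s-service\n' "${1%%.*}"
}
```

Piping the step 1 `docker exec ... list_queues` output into `orphaned_queues` gives one queue name per line, and each name can then be passed to `consumer_for_queue` to fill in the `<consumer>` placeholder used in steps 3 to 5.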

Immediate mitigation

Container stopped: `docker compose -f docker-compose.prod.yml up -d <consumer>`.
Container up but not connecting: Confirm `RABBITMQ_URL` env value, RabbitMQ container health, and that the consumer’s startup actually called `await broker.connect()` (grep startup logs).
Connect-time crash loop: Read the last exception in the logs. An auth failure, a name-resolution failure, and a Pydantic validation failure on the first message all look the same from outside (a restarting container), so fix the root cause, not the symptom.
Emergency drain: If the consumer can’t be revived quickly and messages are piling up, manually shovel them to the DLX so they aren’t lost (`rabbitmqadmin shovel ...`). Document what was moved and replay later.
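
For the crash-loop case, a rough first pass over the logs can narrow down which of the three causes you are looking at. The sketch below is an illustration only: `classify_crash` is a hypothetical helper, and the patterns it matches (`ACCESS_REFUSED` from the broker, the resolver's "name or service not known" messages, and Pydantic's `ValidationError` class name) are assumptions about what your consumer actually logs.

```shell
# Reads log text on stdin and prints the likely cause of the most recent
# matching error line: auth, dns, validation, or unknown.
# Later lines overwrite earlier ones, so the last matching line wins.
classify_crash() {
  awk '
    BEGIN { cause = "unknown" }
    { line = tolower($0) }
    line ~ /access_refused|authentication failed/ { cause = "auth" }
    line ~ /name or service not known|temporary failure in name resolution/ { cause = "dns" }
    line ~ /validationerror/ { cause = "validation" }
    END { print cause }
  '
}
```

A typical call would pipe the step 4 output into it, for example `docker logs --since 30m via_prod-<consumer> 2>&1 | tail -100 | classify_crash`. Treat the result as a hint for where to read first, not as a diagnosis.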

Escalation path

PagerDuty → on-call.
If the orphaned queue carries `orders.*` or `payments.*`: escalate to finance immediately — money-state propagation is paused.
Outage > 15 min on a customer-visible flow: post status page update.

Recovery / rollback

After the consumer reconnects, verify `rabbitmq_queue_consumers >= 1` and that `rabbitmq_queue_messages_ready` is decreasing. Both must hold for 5 min before closing the incident.
If you shoveled to DLX, run `dlx_replay` once the root cause is fixed; otherwise downstream state stays inconsistent.
Post-mortem required if outage > 10 min: root cause, plus why the container exit went undetected before the alert fired.
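
The 5-minute recovery check can be scripted rather than watched by hand. The following is a sketch under stated assumptions: `sample_queue` and `queue_recovered` are hypothetical names, the sampler reuses the `via_prod-rabbitmq` container from the first checks, and "decreasing" is interpreted strictly (the ready count never rises between samples and ends below where it started, or at zero). A queue that is still receiving new messages may need a looser rule.

```shell
# Prints "<consumers> <messages_ready>" for one queue, COUNT times,
# INTERVAL seconds apart. Defaults of 11 samples at 30 s span 5 minutes.
sample_queue() {
  sq_queue="$1"; sq_count="${2:-11}"; sq_interval="${3:-30}"; sq_i=0
  while [ "$sq_i" -lt "$sq_count" ]; do
    docker exec via_prod-rabbitmq rabbitmqctl -q list_queues name consumers messages_ready \
      | awk -v q="$sq_queue" '$1 == q { print $2, $3 }'
    sq_i=$((sq_i + 1))
    if [ "$sq_i" -lt "$sq_count" ]; then sleep "$sq_interval"; fi
  done
}

# Reads "<consumers> <messages_ready>" samples on stdin, oldest first.
# Exits 0 only if there are at least two samples, every sample has at
# least one consumer, the ready count never rises, and it ends lower
# than it started (or at zero).
queue_recovered() {
  awk '
    NR == 1 { first = $2 + 0 }
    $1 + 0 < 1 { bad = 1 }
    NR > 1 && $2 + 0 > prev { bad = 1 }
    { prev = $2 + 0; last = $2 + 0 }
    END {
      if (NR < 2 || bad) exit 1
      if (last < first || last == 0) exit 0
      exit 1
    }
  '
}
```

Usage would look like `sample_queue notification.order.placed | queue_recovered && echo "safe to close"`. Separating the sampler from the check keeps the decision logic testable without a running broker.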

Runbook — CRITICAL
