Runbook — CRITICAL
Queue stuck — zero consumers
Trigger symptoms
- `ConsumerStuck` alert active (page-tier).
- `rabbitmq_queue_consumers == 0 AND rabbitmq_queue_messages_ready > 10`.
- Differs from `outbox-backlog`: that alert means consumers are attached but slow; this one means *no* consumer is attached at all.
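The metric condition above can be cross-checked by hand against the broker. The following is a minimal sketch, assuming the output of `rabbitmqctl list_queues -q name messages_ready consumers` is piped in as whitespace-separated columns. The function name `orphaned_queues` and its optional threshold argument are illustrative and not part of the existing tooling.

```shell
# Print the names of queues that match the alert condition: zero consumers
# and more than THRESHOLD ready messages (default 10).
# Expects rows of "name messages_ready consumers" on stdin.
orphaned_queues() {
  threshold="${1:-10}"
  awk -v t="$threshold" '
    # Skip header and banner rows: both counters must be plain integers.
    $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $3 + 0 == 0 && $2 + 0 > t + 0 { print $1 }
  '
}
```

Possible usage: `docker exec via_prod-rabbitmq rabbitmqctl list_queues -q name messages_ready consumers | orphaned_queues`. The `+ 0` forces numeric comparison, so a backlog of 9 is not treated as larger than 10.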
First checks
first-checks.sh
# 1. Which queue is orphaned?
docker exec via_prod-rabbitmq rabbitmqctl list_queues name messages consumers \
| awk '$3 == 0 && $2 > 10 {print}'
# 2. Map queue → consumer service. Convention: queue name carries the
# consumer's service prefix (e.g. "notification.order.placed" →
# notification-service consumes it).
QUEUE_NAME=<from step 1>
# 3. Is the consumer container even running?
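docker ps -a --filter "name=via_prod-<consumer>" --format \
'table {{.Names}}\t{{.Status}}'
# RestartCount is not a `docker ps` field; read it from `docker inspect`.
docker inspect --format '{{.RestartCount}}' via_prod-<consumer>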
# 4. Why did it die?
docker logs --since 30m via_prod-<consumer> 2>&1 | tail -100
# 5. RabbitMQ connection-side view (does the broker think the consumer
# ever connected?).
docker exec via_prod-rabbitmq rabbitmqctl list_connections name client_properties \
| grep -i <consumer>
Immediate mitigation
- Container stopped: `docker compose -f docker-compose.prod.yml up -d <consumer>`.
- Container up but not connecting: Confirm `RABBITMQ_URL` env value, RabbitMQ container health, and that the consumer’s startup actually called `await broker.connect()` (grep startup logs).
- Connect-time crash loop: Read the last exception. An auth failure, a name-resolution failure, and a Pydantic validation failure on the first message all look the same from the outside, so fix the root cause, not the symptom.
- Emergency drain: If the consumer can’t be revived fast and messages are piling up, manually shovel them to DLX so they aren’t lost (`rabbitmqadmin shovel ...`). Document and replay later.
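The `<consumer>` placeholder in the commands above comes from the queue-name convention noted in the first checks. A minimal sketch of that lookup follows. It assumes the compose service is named `<prefix>-service`, which is inferred only from the `notification-service` example and may not hold for every service. Both function names are hypothetical.

```shell
# Derive the consumer service name from a queue name, following the
# "<service prefix>.<rest>" convention.
# Assumption: the service is called "<prefix>-service".
consumer_for_queue() {
  queue="$1"
  prefix="${queue%%.*}"   # text before the first dot
  printf '%s-service\n' "$prefix"
}

# Print (do not run) the restart command for the consumer of a queue,
# so the on-call engineer can review it before executing.
restart_command_for_queue() {
  printf 'docker compose -f docker-compose.prod.yml up -d %s\n' \
    "$(consumer_for_queue "$1")"
}
```

Printing the command rather than running it keeps the sketch safe to paste during an incident; check the result against `docker-compose.prod.yml` before executing it.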
Escalation path
- PagerDuty → on-call.
- If the orphaned queue carries `orders.*` or `payments.*`: escalate to finance immediately — money-state propagation is paused.
- Outage > 15 min on a customer-visible flow: post status page update.
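The finance rule above can be encoded as a small check so it is not missed under pressure. This is a sketch only: it assumes `orders.*` and `payments.*` are matched against the start of the queue name, and the function name `needs_finance_escalation` is illustrative. If the patterns refer to a different part of the name, the `case` patterns need adjusting.

```shell
# Return 0 when a queue name falls under the finance escalation rule.
# Assumption: the patterns apply to the beginning of the queue name.
needs_finance_escalation() {
  case "$1" in
    orders.*|payments.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Possible usage: `needs_finance_escalation "$QUEUE_NAME" && echo "Escalate to finance now"`.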
Recovery / rollback
- After the consumer reconnects, verify `rabbitmq_queue_consumers >= 1` and `rabbitmq_queue_messages_ready` is decreasing. Both must hold for 5 min before closing.
- If you shoveled to DLX, run `dlx_replay` once the root cause is fixed — otherwise downstream state stays inconsistent.
- Post-mortem required if outage > 10 min: root cause + why container exit went undetected before the alert fired.
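The closing condition (at least one consumer and a shrinking backlog, held for 5 min) can be checked over a series of samples. The sketch below is one possible reading, not an existing tool: `recovery_holds` is a hypothetical name, the sampling interval is left to the caller, and a sample of 0 ready messages is assumed to count as drained rather than stalled.

```shell
# Read "consumers messages_ready" samples (one per line, oldest first).
# Succeed only if every sample has at least one consumer and the ready count
# is lower than in the previous sample.
# Assumption: a ready count of 0 means the queue is drained and is accepted.
recovery_holds() {
  awk '
    NF < 2 { next }                         # ignore blank or partial lines
    {
      n++
      if ($1 + 0 < 1) bad = 1               # consumer detached again
      if (n > 1 && $2 + 0 >= prev && $2 + 0 != 0) bad = 1   # backlog not shrinking
      prev = $2 + 0
    }
    END { if (bad || n < 2) exit 1 }        # need at least two samples to judge
  '
}

# Example collection loop (commented out; 11 samples 30 s apart span 5 min):
# for i in $(seq 1 11); do
#   docker exec via_prod-rabbitmq rabbitmqctl list_queues -q name consumers messages_ready \
#     | awk -v q="$QUEUE_NAME" '$1 == q { print $2, $3 }'
#   sleep 30
# done | recovery_holds && echo "OK to close"
```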