Runbook
Sustained DLX traffic
Trigger symptoms
- `VIAEventBusDLXMessages` (warn, > 0/s for 1m) or `VIAEventBusDLXSustained` (page, > 100 in 10m).
- `rate(via_dlx_messages_total[5m]) > 0` — events are being NACK-rejected by consumers.
- New rows in `audit_event_dlq` (audit-service DB).
First checks
first-checks.sh
# 1. Top offending consumers / event types in the last 10 min.
curl -s 'http://prometheus:9090/api/v1/query' --data-urlencode \
'query=topk(5, sum by (consumer, event_type) (increase(via_dlx_messages_total[10m])))' \
| jq '.data.result'
# 2. Pull a few raw envelopes from the DLQ table.
docker exec via_prod-audit-db psql -U "$AUDIT_POSTGRES_USER" \
-d "$AUDIT_POSTGRES_DB" -c \
"SELECT id, event_type, consumer, error_message, payload->>'event_id' AS event_id,
created_at
FROM audit_event_dlq
ORDER BY created_at DESC LIMIT 10;"
# 3. Look at the offending consumer's logs around the timestamps.
docker logs --since 15m "via_prod-${CONSUMER:?set CONSUMER to the consumer name from step 1}" 2>&1 \
| grep -iE 'nack|exception|traceback' | tail -40
# 4. RabbitMQ DLX queue depth + consumer count.
docker exec via_prod-rabbitmq rabbitmqctl list_queues name messages consumers \
| grep dlx
Immediate mitigation
- Single consumer NACKing in a loop: caused by an unhandled exception in the message handler. Fix the bug and deploy. Until then, the consumer keeps re-NACKing and the DLX keeps filling.
- Schema mismatch (producer/consumer drift): a producer rolled out a new event shape before the consumer caught up. Either roll back the producer or fast-track the consumer release.
- Single poison message: DLX is doing its job and the bad message is parked. Confirm the rest of the queue keeps draining, then plan a replay once the fix is deployed.
Escalation path
- Warn tier (any non-zero for 1m): on-call investigates via #ops-alerts.
- Page tier (> 100 events in 10m): PagerDuty.
- If `event_type` is `orders.*` or `payments.*`: notify finance — money-state propagation is delayed by however long the event sits in DLX.
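As a rough illustration, the escalation rules above can be written as a small Python function. The function name, argument names, and action strings are hypothetical and not part of the alerting stack; only the page threshold and the `orders.*` / `payments.*` patterns come from this runbook.

```python
from fnmatch import fnmatchcase

# Threshold and patterns taken from the escalation rules in this runbook.
PAGE_THRESHOLD_10M = 100
MONEY_PATTERNS = ("orders.*", "payments.*")


def escalation_actions(dlx_events_10m, nonzero_for_1m, event_types):
    """Return the escalation steps implied by the current DLX state.

    dlx_events_10m: dead-lettered events in the last 10 minutes.
    nonzero_for_1m: True if the DLX rate has been non-zero for 1 minute.
    event_types:    event types seen in the DLX (e.g. from audit_event_dlq).
    The returned action names are illustrative labels only.
    """
    actions = []
    if dlx_events_10m > PAGE_THRESHOLD_10M:
        actions.append("page")            # page tier: PagerDuty
    elif nonzero_for_1m:
        actions.append("warn")            # warn tier: #ops-alerts
    # Money-state events additionally require a heads-up to finance.
    if any(fnmatchcase(et, pat) for et in event_types for pat in MONEY_PATTERNS):
        actions.append("notify-finance")
    return actions


if __name__ == "__main__":
    print(escalation_actions(150, True, ["orders.created", "inventory.updated"]))
```

Note that exactly 100 events in 10 minutes stays in the warn tier, because the page rule is a strict "greater than".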
Recovery / rollback
- After the consumer is fixed, run `python scripts/dlx_replay.py --since "<incident-start>"` to re-publish dead-lettered envelopes back to their original exchange. The replay is idempotent (event_id de-dup at the consumer).
- Never hand-delete `audit_event_dlq` rows — they’re the audit trail for the incident. The retention worker prunes them after 30 days.
- Post-mortem required if DLX accumulated > 1000 events, OR if any `payments.*` / `orders.*` event sat in DLX > 30 min — reconcile bank/PSP totals against ledger for the affected window.
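The replay is safe to re-run only because consumers de-duplicate on `event_id`. The sketch below shows that idea in miniature. It is not the actual consumer code: the class name is hypothetical, and the in-memory set stands in for whatever durable store the real consumers use.

```python
class IdempotentConsumer:
    """Minimal sketch of event_id de-duplication (assumed design, not the real consumer)."""

    def __init__(self):
        self._seen_event_ids = set()  # assumption: real consumers persist this durably
        self.applied = []             # envelopes whose side effects were applied

    def handle(self, envelope):
        """Apply the envelope once; return False if it was already processed."""
        event_id = envelope["event_id"]
        if event_id in self._seen_event_ids:
            return False              # replayed duplicate: acknowledge and skip
        self.applied.append(envelope)
        self._seen_event_ids.add(event_id)
        return True


if __name__ == "__main__":
    consumer = IdempotentConsumer()
    envelope = {"event_id": "e-1", "event_type": "orders.created"}
    consumer.handle(envelope)   # original delivery
    consumer.handle(envelope)   # same envelope re-published by a replay
    print(len(consumer.applied))
```

Under this assumption, replaying a window that overlaps already-processed events applies each event at most once.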