Runbook — CRITICAL (PAGE)
Ledger chain broken — writes suspended
Do NOT restart blindly. Restart re-runs the check but does not repair a real divergence.
Trigger symptoms
- `LedgerChainBroken` alert active (page-tier).
- `ledger_chain_broken == 1` in Prometheus.
- ledger-service log: `CHAIN INTEGRITY FAILURE — N errors`.
- All payment / approval / certificate writes are suspended.
First checks
first-checks.sh
# 1. Confirm Redis flag matches the alert. docker exec via_prod-redis redis-cli GET ledger:chain_broken # Expect "1". If "0" or empty, the gauge is stale — see step 5. # 2. Pull the integrity errors from the most recent verification. docker logs --tail 200 via_prod-ledger-service 2>&1 \ | grep -iE 'CHAIN INTEGRITY|chain.*failure|invalid block' # 3. Re-run verification on demand to capture the offending block(s). curl -s -H "Authorization: Bearer $ADMIN_JWT" \ https://api.via-basket.com/admin/ledger/verify | jq '.errors[:5]' # 4. Identify the last KNOWN-GOOD block id. docker exec via_prod-ledger-db psql -U "$LEDGER_POSTGRES_USER" \ -d "$LEDGER_POSTGRES_DB" -c \ "SELECT id, prev_hash, hash, created_at FROM ledger_blocks ORDER BY id DESC LIMIT 20;"
Immediate mitigation
- DO NOT restart ledger-service. Restart neither repairs the chain nor preserves the failing-block evidence in memory.
- Confirm the write-suspend gate is actually blocking: a `POST /ledger/blocks` attempt should fail with HTTP 503 / `chain_broken`.
- Classify the failure from the errors surfaced:
- DB tampering (hash drift): Open a P0 security incident, isolate the ledger DB, take an immediate snapshot.
- Partial outage (missing block): Check ledger-service logs at the time of the last successful block write.
- Failed migration mid-flight: Verify the most recent `alembic upgrade head` ran to completion (no rolled-back transactions).
- Have finance pause any manual settlement work until the chain is verified clean.
Escalation path
- PagerDuty: oncall-payments (5-min response SLA).
- If tampering confirmed: CTO + security + legal immediately.
- If divergence persists past 30 min: declare a customer-visible read-only mode and post to status page.
Recovery / rollback
- Never delete or hand-edit blocks. A repair must always be a NEW correction block so the original divergence stays auditable.
- Once the cause is identified, run the chosen repair script (if one exists) with the explicit `--confirm` flag, while a finance reviewer watches.
- After repair, `_verify_chain_integrity_loop` will clear `ledger:chain_broken` automatically on the next successful check (≤ 6 h, or kick a manual `POST /admin/ledger/verify`).
- Open a post-mortem within 24 h: root cause + why detection lagged + reconciliation against bank/PSP totals during the suspended-write window.