Azure PostgreSQL Flexible Server: Inactive Replication Slots Eating Your Storage (And How to Fix It)

Inactive replication slots in Azure Database for PostgreSQL Flexible Server can silently fill your disk with WAL files. Here’s how to spot, drop, and prevent them.

The Problem: WAL Explosion from Orphaned Slots

Replication slots make PostgreSQL retain WAL until every consumer (CDC tools, read replicas) has received it, so consumers do not miss changes. An inactive slot, left behind by a stopped CDC job, a deleted replica, or a failed experiment, pins old WAL indefinitely, and the retained WAL keeps growing until storage is full. Azure Flexible Server has safeguards such as storage auto-grow, but these only delay the problem: an orphaned slot can still fill the disk and cause an outage.

Spot the Culprits

Run these to identify storage hogs:

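-- WAL retained by each slot (biggest first)
-- Retained WAL is the distance from the slot's restart_lsn to the current WAL head,
-- not the distance from '0/0' (which is just the absolute LSN position).
SELECT slot_name, plugin, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;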

-- Current consumer lag relative to WAL head
-- (confirmed_flush_lsn is set for logical slots only; physical slots show NULL)
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag_size
FROM pg_replication_slots;

-- Inactive slots only
SELECT * FROM pg_replication_slots WHERE NOT active;

-- Check active physical replication (HA/replicas)
SELECT pid, state, sent_lsn, replay_lsn, write_lag 
FROM pg_stat_replication;

Focus on inactive logical slots (slot_type='logical', active=false).
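
To narrow the list to the slots worth investigating first, the checks above can be combined into one query. This is a sketch, assuming PostgreSQL 13 or later for the wal_status column (remove that column on older versions) and that it runs on the primary, since pg_current_wal_lsn() is not available during recovery.

```sql
-- Inactive logical slots, ranked by how much WAL each one is holding back.
-- wal_status exists only on PostgreSQL 13+ (assumption about your server version).
SELECT slot_name,
       plugin,
       database,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND NOT active
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
```

Anything near the top of this list with no known owner is a candidate for cleanup.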

Clean Them Up Safely

Drop slots one at a time. Never drop an active slot or an Azure-managed HA slot such as azure_standby:

SELECT pg_drop_replication_slot('your_inactive_slot_name');
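
A slightly more defensive variant, as a sketch: routing the call through pg_replication_slots means nothing is dropped if the named slot is active, is a physical slot, or does not exist. The slot name is a placeholder, and the logical-only filter is an assumption that fits the focus on logical slots above.

```sql
-- Drops the slot only if it is an inactive logical slot.
-- Returns zero rows (and drops nothing) if the slot is active, physical, or missing.
-- 'your_inactive_slot_name' is a placeholder.
SELECT slot_name, pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'your_inactive_slot_name'
  AND slot_type = 'logical'
  AND NOT active;
```

If a consumer reconnects between the filter and the drop, PostgreSQL still raises an error instead of dropping an active slot, so the guard costs nothing.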

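Storage recovers as checkpoints remove or recycle the old WAL segments, which can take minutes to hours. Verify in the Azure portal under Metrics: Storage Used shows overall disk usage, and Transaction Log Storage Used tracks WAL specifically.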
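
The recovery can also be watched from SQL. A minimal sketch, assuming your role is allowed to call pg_ls_waldir() (it requires superuser or pg_monitor membership by default; whether your Azure admin role qualifies is worth checking):

```sql
-- Total size of the WAL directory; run before and after the drop and compare.
SELECT count(*) AS wal_files,
       pg_size_pretty(sum(size)) AS wal_total
FROM pg_ls_waldir();
```

The total should fall over the next few checkpoints once no slot pins the old segments.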

Prevention: Best Practices

Practice	Action	Why
Monitor slots	Alert on inactive slots >24h or WAL >20% disk	Catches issues early
Limit WAL retention	Set `max_slot_wal_keep_size = '20GB'` (PG13+)	Auto-invalidates lagging slots
Config for CDC/replicas	`wal_level = logical`; `max_replication_slots` >= replicas + CDC slots + 4 (HA)	Leaves a slot available for every consumer
Cleanup workflow	Stop the CDC job or remove the replica, then drop its slot as part of the same change	No orphans
Azure limits	Check portal Server Parameters; monitor replicas	HA needs ~4 slots
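
The monitoring and limit rows above can be backed by two small checks. This is a sketch, assuming PostgreSQL 13+ for wal_status and safe_wal_size. On PostgreSQL 17 and later the inactive_since column could drive the 24-hour rule directly; it is left out here so the queries run on older versions.

```sql
-- 1. Slot capacity: slots in use versus the configured maximum.
SELECT count(*) AS slots_in_use,
       current_setting('max_replication_slots')::int AS slots_allowed
FROM pg_replication_slots;

-- 2. Slots to alert on (PostgreSQL 13+).
--    wal_status 'unreserved': required WAL may be removed at the next checkpoint.
--    wal_status 'lost': required WAL is gone and the slot is no longer usable.
--    safe_wal_size is NULL when max_slot_wal_keep_size is -1 (no limit).
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(safe_wal_size) AS wal_left_before_invalidation
FROM pg_replication_slots
WHERE wal_status IN ('unreserved', 'lost')
   OR NOT active;
```

A slot that shows up as lost cannot be resumed; the consumer has to be re-initialized and the slot dropped.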

Real-World Traps

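CDC tools (Debezium, DMS, Fivetran) create a slot per task; drop the slot when the job is stopped for good.
Deleted read replicas can leave slots behind; confirm in the Azure portal that the replica is gone before dropping its slot.
HA failover recreates azure_standby; leave it alone.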
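
When it is unclear which tool or replica a slot belongs to, the slot list can be joined to the session list. A sketch using only standard catalog views; what appears in application_name depends on how each client identifies itself, so treat it as a hint, not proof of ownership.

```sql
-- Who owns each slot? Active slots are matched to their backend session;
-- inactive slots show NULL for the session columns.
SELECT s.slot_name,
       s.slot_type,
       s.plugin,
       s.database,
       s.active,
       a.application_name,
       a.client_addr
FROM pg_replication_slots AS s
LEFT JOIN pg_stat_activity AS a ON a.pid = s.active_pid
ORDER BY s.active, s.slot_name;
```

Inactive slots sort first; the plugin and database columns are usually enough to trace them back to a CDC task.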
