logs – Database Blog

Inactive replication slots in Azure Database for PostgreSQL Flexible Server can silently fill your disk with WAL files. Here’s how to spot, drop, and prevent them.

The Problem: WAL Explosion from Orphaned Slots

Replication slots ensure WAL retention so consumers (CDC tools, read replicas) don’t miss changes. Inactive slots—created by stopped CDC jobs, deleted replicas, or failed experiments—pin old WAL indefinitely, consuming storage until it fills. Azure Flexible Server has safeguards like auto-grow, but slots can still cause outages.

Spot the Culprits

Run these to identify storage hogs:

-- WAL retained by each slot (biggest first)
SELECT slot_name, plugin, slot_type, active, 
       pg_size_pretty(pg_wal_lsn_diff(restart_lsn, '0/0')) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(restart_lsn, '0/0') DESC;

-- Current lag relative to WAL head
SELECT slot_name, 
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size
FROM pg_replication_slots;

-- Inactive slots only
SELECT * FROM pg_replication_slots WHERE NOT active;

-- Check active physical replication (HA/replicas)
SELECT pid, state, sent_lsn, replay_lsn, write_lag 
FROM pg_stat_replication;

Focus on inactive logical slots (slot_type='logical', active=false).

Clean Them Up Safely

Drop one by one—never active or Azure HA slots like azure_standby:

SELECT pg_drop_replication_slot('your_inactive_slot_name');

Storage recovers as WAL checkpoints recycle old segments (minutes to hours). Verify with Azure Metrics > Disk Used.

Prevention: Best Practices

Practice	Action	Why
Monitor slots	Alert on inactive slots >24h or WAL >20% disk	Catches issues early
Limit WAL retention	Set `max_slot_wal_keep_size = '20GB'` (PG13+)	Auto-invalidates lagging slots
Config for CDC/replicas	`wal_level=logical`, `max_replication_slots >= replicas + CDC + 4 (HA)`	Reserves space
Cleanup workflow	Drop slot before stopping CDC job/replica	No orphans
Azure limits	Check portal Server Parameters; monitor replicas	HA needs ~4 slots

Real-World Traps

CDC tools (Debezium, DMS, Fivetran) create per-task slots; drop on job stop.
Deleted read replicas leave slots; check Azure portal first.
HA failover recreates azure_standby; ignore it.

Pre-requisite

Ensure that Audit Logs are enabled on all the production instances for Azure MySql as documented here — https://docs.microsoft.com/en-us/azure/mysql/howto-configure-audit-logs-portal

Make sure to enable Auditing on Server Parameters section (at least Connection audit log event type).

The logs can be saved to Storage Account, Event Hubs or Azure Log Analytics Workspace. We used Log Analytics Workspace for the runbook because it provides rich querying capabilities.

The following Log Analytics Query and the corresponding screenshot shows that individual user activity is tracked in the Audit Logs in spite of them using the Azure AD Group as username to connect Azure MySQL — external_user_s field logs the exact individual user, connection_id_d and thread_id_d carry over the Connection/ThreadId which can be used to tie the user across all log records.

AzureDiagnostics
| where Category == "MySqlAuditLogs"
| order by TimeGenerated desc
| project TimeGenerated, event_class_s, event_subclass_s, user_s, external_user_s, connection_id_d, thread_id_d, event_time_t, sql_text_s, db_s, table_name_s, error_code_d