Azure PostgreSQL Flexible Server: Inactive Replication Slots Eating Your Storage (And How to Fix It)

Inactive replication slots in Azure Database for PostgreSQL Flexible Server can silently fill your disk with WAL files. Here’s how to spot, drop, and prevent them.

The Problem: WAL Explosion from Orphaned Slots

Replication slots ensure WAL retention so consumers (CDC tools, read replicas) don’t miss changes. Inactive slots—created by stopped CDC jobs, deleted replicas, or failed experiments—pin old WAL indefinitely, consuming storage until it fills. Azure Flexible Server has safeguards like auto-grow, but slots can still cause outages.

Spot the Culprits

Run these to identify storage hogs:

-- WAL retained by each slot (biggest first)
SELECT slot_name, plugin, slot_type, active, 
       pg_size_pretty(pg_wal_lsn_diff(restart_lsn, '0/0')) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(restart_lsn, '0/0') DESC;
-- Current lag relative to WAL head
SELECT slot_name, 
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size
FROM pg_replication_slots;
-- Inactive slots only
SELECT * FROM pg_replication_slots WHERE NOT active;
-- Check active physical replication (HA/replicas)
SELECT pid, state, sent_lsn, replay_lsn, write_lag 
FROM pg_stat_replication;

Focus on inactive logical slots (slot_type='logical'active=false).

Clean Them Up Safely

Drop one by one—never active or Azure HA slots like azure_standby:

SELECT pg_drop_replication_slot('your_inactive_slot_name');

Storage recovers as WAL checkpoints recycle old segments (minutes to hours). Verify with Azure Metrics > Disk Used.

Prevention: Best Practices

PracticeActionWhy
Monitor slotsAlert on inactive slots >24h or WAL >20% diskCatches issues early 
Limit WAL retentionSet max_slot_wal_keep_size = '20GB' (PG13+)Auto-invalidates lagging slots 
Config for CDC/replicaswal_level=logicalmax_replication_slots >= replicas + CDC + 4 (HA)Reserves space 
Cleanup workflowDrop slot before stopping CDC job/replicaNo orphans 
Azure limitsCheck portal Server Parameters; monitor replicasHA needs ~4 slots 

Real-World Traps

  • CDC tools (Debezium, DMS, Fivetran) create per-task slots; drop on job stop.
  • Deleted read replicas leave slots; check Azure portal first.
  • HA failover recreates azure_standby; ignore it.

Query Azure MySQL Audit Logs

The common practice for user management in Azure AD is to create Azure AD Security Groups for users and assign permissions to groups rather than individual users. For example, the common practice for users who need access to Azure MySQL instance would to be add them to an Azure AD Group (say MySqlDb_Users) and then setup this Azure AD Group for access on Azure MySQL Instance.

Pre-requisite

  • Ensure that Audit Logs are enabled on all the production instances for Azure MySql as documented here — https://docs.microsoft.com/en-us/azure/mysql/howto-configure-audit-logs-portal
  • Make sure to enable Auditing on Server Parameters section (at least Connection audit log event type).
  • The logs can be saved to Storage Account, Event Hubs or Azure Log Analytics Workspace. We used Log Analytics Workspace for the runbook because it provides rich querying capabilities.

The following Log Analytics Query and the corresponding screenshot shows that individual user activity is tracked in the Audit Logs in spite of them using the Azure AD Group as username to connect Azure MySQL — external_user_s field logs the exact individual user, connection_id_d and thread_id_d carry over the Connection/ThreadId which can be used to tie the user across all log records.

AzureDiagnostics
| where Category == "MySqlAuditLogs"
| order by TimeGenerated desc
| project TimeGenerated, event_class_s, event_subclass_s, user_s, external_user_s, connection_id_d, thread_id_d, event_time_t, sql_text_s, db_s, table_name_s, error_code_d