Azure Site Recovery Replication Broken — RPO Breach or Health Critical
Azure Site Recovery (ASR) replication health degrades to Critical or Warning, RPO exceeds the configured threshold, and the protected workload is left without a recovery point that meets its RPO target. Typical causes are a Mobility Service agent version mismatch, process server health issues, or sustained high disk churn.
Indicators
- ASR replication health shows 'Critical' or 'Warning' in Recovery Services Vault
- RPO exceeds the configured threshold (typically 15 minutes for production VMs)
- Mobility Service agent shows 'Paused' or error state
- Process server shows unhealthy in Vault > Site Recovery Infrastructure
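The RPO indicator can be checked mechanically once the last recovery point time is known. The following is a minimal sketch, assuming the current RPO is approximated by the age of the latest recovery point and that the threshold is the 15 minutes mentioned above. The function names are hypothetical and not part of any ASR tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed default: the 15-minute production threshold from the indicator list.
DEFAULT_RPO_THRESHOLD = timedelta(minutes=15)


def recovery_point_age(last_recovery_point: datetime,
                       now: Optional[datetime] = None) -> timedelta:
    """Return how old the latest recovery point is (both times timezone-aware)."""
    now = now or datetime.now(timezone.utc)
    return now - last_recovery_point


def rpo_breached(last_recovery_point: datetime,
                 threshold: timedelta = DEFAULT_RPO_THRESHOLD,
                 now: Optional[datetime] = None) -> bool:
    """True when the latest recovery point is older than the threshold."""
    return recovery_point_age(last_recovery_point, now) > threshold
```

Pass the last recovery point time shown for the replicated item in the vault as a timezone-aware `datetime`; a result of `True` corresponds to the RPO breach indicator.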
Likely causes
- Mobility Service agent version mismatch with process server or configuration server
- Process server certificate expired or connectivity issue to vault
- Source VM disk change rate exceeding replication bandwidth capacity
- Temporary network disruption between source site and Azure creating replication lag
- Process server running out of disk space on cache drive
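To judge whether disk churn is outrunning the link, it helps to put both figures in the same unit. The sketch below is a rough back-of-the-envelope calculation only: it assumes churn is measured in megabytes per second, bandwidth in megabits per second, and it ignores compression and protocol overhead, both of which affect real replication traffic. The function names are hypothetical.

```python
def churn_to_mbps(churn_mb_per_s: float) -> float:
    """Convert a disk change rate in MB/s to megabits per second (x8)."""
    return churn_mb_per_s * 8.0


def backlog_growth_mb_per_s(churn_mb_per_s: float, bandwidth_mbps: float) -> float:
    """Rate at which unreplicated data accumulates, in MB/s.

    Returns 0.0 when the link can keep up with the change rate.
    Assumption: no compression and no protocol overhead.
    """
    link_mb_per_s = bandwidth_mbps / 8.0
    return max(0.0, churn_mb_per_s - link_mb_per_s)
```

For example, a source VM changing 10 MB/s needs 80 Mbps; on a 50 Mbps link the backlog grows by about 3.75 MB/s, so RPO keeps increasing until churn drops or bandwidth is added.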
Diagnostic steps
- Recovery Services Vault > Replicated items > select VM: review replication health, error details and last recovery point time
- Check Mobility Service version: compare agent version on source VM against current vault-recommended version; update via Vault > Replicated item > Update Mobility Service
- On process server: check C:\ProgramData\ASR\home\svsystems\var\log\outbound.log and inbound.log for connection errors
- Vault > Site Recovery Infrastructure > Process Servers: verify heartbeat status, CPU, memory and cache disk free space (should be >30%)
- If churn is the issue: run ASR Deployment Planner against workload to calculate required bandwidth and determine if process server upgrade is needed
- Restart 'Microsoft Azure Recovery Services Agent' and 'Microsoft Azure Site Recovery Process Server' services on process server if replication is stuck
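Two of the checks above, cache disk free space and log review, can be scripted on the process server. This is a minimal sketch using only the Python standard library. The 30% threshold comes from the step above; the keyword list is an assumption, since the exact wording of connection errors in inbound.log and outbound.log is not documented here, so adjust it to what the logs actually contain.

```python
import shutil
from typing import Iterable, List, Sequence

# Threshold from the diagnostic step: cache disk free space should be >30%.
MIN_FREE_PERCENT = 30.0

# Assumed keywords; the real log wording may differ.
ERROR_KEYWORDS = ("error", "failed", "timed out", "refused")


def free_percent(path: str) -> float:
    """Percentage of free space on the volume that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100.0


def cache_disk_healthy(path: str, minimum: float = MIN_FREE_PERCENT) -> bool:
    """True when free space on the cache volume is above the minimum."""
    return free_percent(path) > minimum


def find_connection_errors(lines: Iterable[str],
                           keywords: Sequence[str] = ERROR_KEYWORDS) -> List[str]:
    """Return log lines containing any keyword (case-insensitive match)."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, call `cache_disk_healthy` with the cache drive path, and pass an open file handle for outbound.log or inbound.log (opened with `errors="replace"` to tolerate odd bytes) to `find_connection_errors`.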
Resolution path
- Identify whether root cause is agent version, process server health, or network bandwidth
- Update Mobility Service agent from vault console
- Resolve process server disk, CPU or connectivity issues
- Restart ASR services on process server to clear stuck replication
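The first step of the resolution path, identifying the root cause category, can be sketched as a simple triage over values collected during diagnosis. This is an illustrative sketch only: the function and parameter names are hypothetical, version strings are assumed to be dotted integers, and the inputs are read manually from the vault and process server rather than fetched through any API.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as '9.55.0.0' into a comparable tuple.

    Assumption: every component is an integer; anything else raises ValueError.
    """
    return tuple(int(part) for part in version.split("."))


def suspected_causes(agent_version: str,
                     recommended_version: str,
                     heartbeat_ok: bool,
                     cache_free_percent: float,
                     churn_mbps: float,
                     bandwidth_mbps: float) -> List[str]:
    """Return the root cause categories that the collected values point to."""
    causes = []
    if parse_version(agent_version) < parse_version(recommended_version):
        causes.append("agent version")
    # Cache disk free space should be >30%, so 30% or less counts as unhealthy.
    if not heartbeat_ok or cache_free_percent <= 30.0:
        causes.append("process server health")
    if churn_mbps > bandwidth_mbps:
        causes.append("network bandwidth")
    return causes
```

An empty list means none of the three categories is indicated by these values, in which case restarting the ASR services on the process server is the remaining step in the path.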
Prevention
- Enable Azure Monitor alerts for ASR replication health and RPO breach thresholds
- Run test failover quarterly to confirm RTO and RPO targets are achievable
- Enable automatic Mobility Service agent updates via vault policy
- Monitor process server resource utilisation — upgrade before saturation
Tools
- Azure Recovery Services Vault portal
- ASR Mobility Agent (Windows/Linux service)
- Azure Monitor (ASR replication health alerts)
- Azure Site Recovery Deployment Planner
- ASR process server logs (inbound.log / outbound.log)