Failed Domain Controller — recovery without making it worse
A DC has failed (hardware, OS, NTDS corruption, or network isolation). The danger isn't the failure — it's the recovery shortcut that breaks the rest of the forest.
Indicators
- DC unreachable on LDAP (TCP 389) or SMB (TCP 445), or pingable but with directory services not responding
- dcdiag failures across multiple categories on the DC itself
- Replication errors on partner DCs (RPC error 1722, event ID 1925, replication error 8606)
- USN rollback warnings if a DC was restored from snapshot incorrectly
- If the failed DC held FSMO roles, dependent operations stall
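The first indicator can be checked from any workstation with a plain TCP probe. The following is a minimal sketch, not a replacement for dcdiag: it only tells you whether the ports accept a connection, not whether the directory service behind them is healthy. The function names and the 3-second timeout are illustrative assumptions.

```python
import socket

# Ports named in the indicators list: LDAP and SMB.
DC_PORTS = {"LDAP": 389, "SMB": 445}


def probe_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_dc(host: str, ports: dict[str, int] = DC_PORTS) -> dict[str, bool]:
    """Probe each named service port and report which ones accept connections."""
    return {name: probe_port(host, port) for name, port in ports.items()}
```

Calling probe_dc with the DC's hostname returns a mapping such as service name to True/False. A host that answers ping but returns False for both ports matches the "pingable but services not responding" case.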
Likely causes
- Hardware / OS failure on the DC itself
- NTDS database corruption (jet errors, dirty shutdown)
- Disk full on volume holding NTDS.dit
- Tombstone lifetime exceeded: DC offline for more than 180 days (the usual default tombstone lifetime; confirm the forest's actual value) and then powered on (NEVER bring it back)
- DNS / time service drift breaking Kerberos before the DC itself failed
Diagnostic steps
- Decide whether to repair this DC, or to seize its roles and remove it via metadata cleanup. Drivers: time available, criticality of the FSMO roles held, and age of the DC's last successful replication
- Run dcdiag /v /c /e and repadmin /showrepl /errorsonly from a healthy DC to establish the actual fault before acting
- Check free space on the volume holding NTDS.dit and review the Directory Service event log on the failed DC, especially event IDs 1644, 2087, 2088, 1311
- If repair is viable: stop the NTDS service, run esentutl /g for an integrity check, and use esentutl /p to repair only as a last resort (back up the .dit first)
- If demotion is viable: run ntdsutil metadata cleanup from a healthy DC; remove all DNS/SRV records, the computer object, and the NTDS Settings object
- Never restore a DC from a VM snapshot/checkpoint into the production network: USN rollback poisons replication. Always demote and rebuild
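When many partners are involved, the repadmin step is easier to read if the output is captured as CSV and filtered. The sketch below assumes the output came from repadmin /showrepl * /csv and that the column headers match the ones used here; verify them against your own output before relying on it. The sample rows are invented for illustration.

```python
import csv
import io

# Invented sample in the assumed repadmin /showrepl * /csv layout.
SAMPLE = '''showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,HQ,DC02,"DC=example,DC=com",HQ,DC01,RPC,12,2024-01-10 03:15:00,2024-01-08 22:00:00,1722
showrepl_INFO,HQ,DC02,"DC=example,DC=com",HQ,DC03,RPC,0,0,2024-01-10 03:10:00,0
'''


def failing_links(showrepl_csv: str) -> list[dict[str, str]]:
    """Return the replication links whose failure count is greater than zero."""
    reader = csv.DictReader(io.StringIO(showrepl_csv))
    failures = []
    for row in reader:
        count = (row.get("Number of Failures") or "0").strip()
        if count.isdigit() and int(count) > 0:
            failures.append({
                "source": row.get("Source DSA", ""),
                "destination": row.get("Destination DSA", ""),
                "naming_context": row.get("Naming Context", ""),
                "failures": count,
                "last_status": row.get("Last Failure Status", ""),
            })
    return failures
```

With the sample above, failing_links(SAMPLE) returns one entry: the DC01 to DC02 link with status 1722. Links that all fail from the same source point at that DC rather than at the network between partners.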
Resolution path
- Establish ground truth via dcdiag/repadmin
- Choose repair vs demote-and-rebuild
- If demoting: metadata cleanup → DNS cleanup → site cleanup → SYSVOL cleanup
- Stand up the replacement DC; seize FSMO roles onto a healthy DC if the failed DC held them
- Validate replication, SYSVOL, and login flow before signing off
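For the DNS cleanup step, it can help to generate the list of DC locator SRV names to review for entries that still point at the removed DC. This is a sketch under assumptions: the domain and site names are placeholders, the function name is hypothetical, and the list is a common subset rather than every record a DC registers (GC, PDC, and replication alias records are not covered).

```python
def dc_locator_srv_names(domain: str, site: str) -> list[str]:
    """Build common DC locator SRV record names to review for stale entries."""
    return [
        f"_ldap._tcp.{domain}",
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.{site}._sites.{domain}",
        f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}",
        f"_kerberos._tcp.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.{site}._sites.{domain}",
        f"_kpasswd._tcp.{domain}",
    ]


def nslookup_commands(domain: str, site: str) -> list[str]:
    """Turn each record name into an nslookup query that can be pasted into a shell."""
    return [f"nslookup -type=SRV {name}" for name in dc_locator_srv_names(domain, site)]
```

Any answer that still lists the failed DC's hostname after metadata cleanup is a stale record to remove in DNS Manager or with dnscmd.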
Prevention
- Minimum two DCs per site, three for any business >50 users
- System state backup nightly on every DC
- Monitor replication continuously (dcdiag scheduled task or RMM check)
- Tombstone lifetime alarm if any DC offline >100 days
- DC OS patching on staggered schedule, never all at once
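The offline-DC alarm above reduces to a date comparison. A minimal sketch, assuming you already have each DC's last successful replication time from your monitoring: the 100-day threshold comes from the list above, the 180-day tombstone lifetime is an assumed default that should be confirmed for the forest, and the function and label names are illustrative.

```python
from datetime import datetime, timedelta

ALARM_DAYS = 100      # alarm threshold from the prevention list
TOMBSTONE_DAYS = 180  # assumed tombstone lifetime; confirm the forest's real value


def classify_offline_dc(last_replication: datetime, now: datetime) -> str:
    """Classify a DC by whole days since its last successful replication."""
    days_offline = (now - last_replication).days
    if days_offline > TOMBSTONE_DAYS:
        return "expired"  # never reconnect; clean up metadata and rebuild
    if days_offline > ALARM_DAYS:
        return "alarm"    # still recoverable, but the window is closing
    return "ok"


# Example with made-up dates:
now = datetime(2024, 7, 1)
print(classify_offline_dc(now - timedelta(days=120), now))  # prints "alarm"
```

Keeping the alarm well below the tombstone lifetime leaves time to either bring the DC back into replication or plan its removal before it becomes unsafe to reconnect.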
Tools
- dcdiag, repadmin, ntdsutil
- esentutl (NTDS.dit integrity / repair)
- PowerShell ActiveDirectory module: Get-ADReplicationFailure, Move-ADDirectoryServerOperationMasterRole
- DNS Manager / dnscmd
- Sites and Services console
References
- Microsoft Learn — Forcing AD removal & metadata cleanup
- Microsoft Learn — USN rollback detection
- Microsoft Learn — Tombstone lifetime
- Engineer Direct guide — Recover a failed domain controller