P1 · Active Directory

Failed Domain Controller — recovery without making it worse

A DC has failed (hardware, OS, NTDS corruption, or network isolation). The danger isn't the failure — it's the recovery shortcut that breaks the rest of the forest.

Indicators

DC unreachable on LDAP (TCP 389) or SMB (TCP 445), or it answers ping but its services do not respond
dcdiag failures across multiple categories on the DC itself
Replication errors on partner DCs (for example Directory Service event 1925, or repadmin status codes 1722 and 8606)
USN rollback warnings if a DC was restored from snapshot incorrectly
If the failed DC held FSMO roles, operations that depend on them stall (for example, new RID pool allocation while the RID master is down)
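
The first indicator can be checked from any workstation before touching the DC. A minimal sketch, assuming a plain TCP connect is an acceptable reachability test; the function name probe_ports and the hostname in the usage note are illustrative, not part of any Microsoft tooling:

```python
import socket

# The two ports named in the indicators above.
DC_PORTS = {"LDAP": 389, "SMB": 445}


def probe_ports(host, ports=DC_PORTS, timeout=3.0):
    """Return {service name: True/False} for a TCP connect to each port."""
    results = {}
    for name, port in ports.items():
        try:
            # create_connection completes a TCP handshake; the "with"
            # block closes the socket again straight away.
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            # Refused, timed out, unreachable, or name not resolvable.
            results[name] = False
    return results
```

Calling probe_ports("dc01.corp.example") with a placeholder hostname returns a dictionary such as {"LDAP": False, "SMB": True}. An open port only shows that something is listening; it does not prove the directory service behind it is healthy.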

Likely causes

Hardware / OS failure on the DC itself
NTDS database corruption (jet errors, dirty shutdown)
Disk full on volume holding NTDS.dit
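Tombstone lifetime exceeded: DC offline longer than the forest's tombstone lifetime (180 days by default in most forests, 60 days in some older ones) and then powered on. NEVER bring such a DC back
DNS misconfiguration or time-service drift that broke Kerberos before the DC itself failed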
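
The tombstone rule above is simple date arithmetic. A small sketch, assuming you know when the DC last replicated successfully; reconnect_verdict and its return strings are hypothetical names, and the 180-day default is only a fallback:

```python
from datetime import datetime, timedelta


def reconnect_verdict(last_good_replication, now, tombstone_days=180):
    """Say whether a long-offline DC is still inside the tombstone lifetime.

    tombstone_days should be read from the forest (the tombstoneLifetime
    attribute on CN=Directory Service,CN=Windows NT,CN=Services in the
    configuration partition); 180 is only a fallback default here.
    """
    offline_for = now - last_good_replication
    if offline_for > timedelta(days=tombstone_days):
        # Past the lifetime: the DC may hold objects that every other DC
        # has already garbage collected.
        return "never-reconnect"
    return "within-tombstone-lifetime"
```

A verdict of "within-tombstone-lifetime" is not a green light on its own. It only rules out this one cause.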

Diagnostic steps

Run dcdiag /v /c /e and repadmin /showrepl /errorsonly from a healthy DC to establish the actual fault before taking any action
Decide: repair this DC, or seize its roles and remove it with metadata cleanup? The drivers are time available, criticality of the FSMO roles it holds, and age of its last successful replication
Check free space on the volume holding NTDS.dit and review the Directory Service event log on the failed DC, especially event IDs 1644, 2087, 2088, and 1311
If repair is viable: stop the NTDS service, run esentutl /g against NTDS.dit to check integrity, and use esentutl /p only as a last resort. /p discards pages it cannot repair, so back up the .dit file first
If demotion is the better route: run ntdsutil metadata cleanup from a healthy DC, then remove the failed DC's DNS records (including SRV records), its computer object, and its NTDS Settings object
Never restore a DC from a VM snapshot/checkpoint into the production network — USN rollback poisons replication. Always demote and rebuild
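
It helps to keep the output of the two ground-truth commands on disk before anything is changed, so later steps can be compared against it. A minimal sketch, assuming Python is available on the healthy DC; capture_ground_truth and the file naming are illustrative, and only the two command lines come from the steps above:

```python
import subprocess
from datetime import datetime
from pathlib import Path

# Exactly the two commands named in the diagnostic steps above.
GROUND_TRUTH_COMMANDS = {
    "dcdiag": ["dcdiag", "/v", "/c", "/e"],
    "repadmin": ["repadmin", "/showrepl", "/errorsonly"],
}


def capture_ground_truth(out_dir, commands=GROUND_TRUTH_COMMANDS,
                         runner=subprocess.run):
    """Run each command and save its output to a timestamped text file.

    Returns {label: (exit code, path of the saved output)}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    results = {}
    for label, argv in commands.items():
        # check=False: a non-zero exit code is itself evidence worth
        # keeping, so it must not raise.
        proc = runner(argv, capture_output=True, text=True, check=False)
        path = out_dir / f"{stamp}-{label}.txt"
        path.write_text(proc.stdout + proc.stderr, encoding="utf-8")
        results[label] = (proc.returncode, path)
    return results
```

The runner parameter exists so the function can be exercised without a domain; in normal use it stays at its default and the commands run on the machine where the script is started.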

Resolution path

Establish ground truth via dcdiag/repadmin
Choose repair vs demote-and-rebuild
If demoting: metadata cleanup → DNS cleanup → site cleanup → SYSVOL cleanup
Stand up replacement DC, seize FSMO if needed
Validate replication, SYSVOL, and login flow before signing off
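
Whether FSMO seizure is needed depends on which roles the failed DC actually held. A sketch that answers that from the text output of netdom query fsmo, run on a healthy DC. netdom is not in this article's tool list, and the two-column layout in the sample is an assumption about its output; roles_held_by and the hostnames are hypothetical:

```python
import re

# Assumed layout of "netdom query fsmo" output: role name, a run of
# spaces, then the holder's FQDN. Hostnames are placeholders.
SAMPLE_OUTPUT = """\
Schema master               dc02.corp.example
Domain naming master        dc02.corp.example
PDC                         dc01.corp.example
RID pool manager            dc01.corp.example
Infrastructure master       dc02.corp.example
The command completed successfully.
"""


def roles_held_by(netdom_output, failed_dc):
    """Return the FSMO role names whose holder is failed_dc."""
    failed = failed_dc.lower()
    held = []
    for line in netdom_output.splitlines():
        # Role name and holder are separated by two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) != 2:
            continue  # blank lines and the trailing status message
        role, holder = parts
        holder = holder.lower()
        # Accept either the full FQDN or the short host name.
        if holder == failed or holder.split(".")[0] == failed:
            held.append(role)
    return held
```

With the sample above, roles_held_by(SAMPLE_OUTPUT, "dc01") returns ["PDC", "RID pool manager"]. An empty list means no seizure is needed; otherwise those are the roles to move with Move-ADDirectoryServerOperationMasterRole, using -Force when the old holder is unreachable.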

Prevention

Minimum two DCs per site, three for any business >50 users
System state backup nightly on every DC
Monitor replication continuously (dcdiag scheduled task or RMM check)
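Alert when any DC has been offline for more than 100 days, well inside the default 180-day tombstone lifetime (use a lower threshold if the forest's lifetime is shorter)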
DC OS patching on staggered schedule, never all at once
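
The replication monitoring and 100-day alarm above could be combined in one scheduled check. A rough sketch, assuming the input is the CSV text produced by repadmin /showrepl * /csv and that its header and timestamp format match the sample below. That layout is an assumption to verify against real output; flag_replication_links and the sample rows are illustrative:

```python
import csv
import io
from datetime import datetime, timedelta

# Assumed header and row layout of "repadmin /showrepl * /csv".
SAMPLE_CSV = """\
showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,HQ,DC02,"DC=corp,DC=example",HQ,DC01,RPC,14,2024-06-30 09:00:00,2024-03-01 08:00:00,1722
showrepl_INFO,HQ,DC02,"DC=corp,DC=example",HQ,DC03,RPC,0,0,2024-06-30 09:05:00,0
"""


def flag_replication_links(csv_text, now, stale_after_days=100):
    """Return (source, destination, naming context, reason) per bad link."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        link = (row["Source DSA"], row["Destination DSA"],
                row["Naming Context"])
        if int(row["Number of Failures"] or 0) > 0:
            flagged.append(link + ("failing",))
            continue
        try:
            last_ok = datetime.strptime(row["Last Success Time"],
                                        "%Y-%m-%d %H:%M:%S")
        except (TypeError, ValueError):
            # Missing or unparseable timestamp: treat as never succeeded.
            flagged.append(link + ("no recorded success",))
            continue
        if now - last_ok > timedelta(days=stale_after_days):
            flagged.append(link + ("stale",))
    return flagged
```

An RMM check or scheduled task could run this against fresh output and alert whenever the returned list is not empty.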

Tools

dcdiag, repadmin, ntdsutil
esentutl (NTDS.dit integrity / repair)
PowerShell ActiveDirectory module: Get-ADReplicationFailure, Move-ADDirectoryServerOperationMasterRole
DNS Manager / dnscmd
Sites and Services console

active-directory · domain-controller · ntdsutil · replication · fsmo