Storage performance has collapsed (latency spike)
Disk I/O latency has risen to the point that VMs, databases, or file services are unusable. Find the bottleneck before throwing hardware at it.
Indicators
- Disk queue length sustained >2 per spindle
- Average disk latency >20ms (HDD) / >5ms (SSD) sustained
- SQL Server PAGEIOLATCH_* waits dominating
- Users report file open / save delays of seconds
- Backup jobs overrunning their windows
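The latency indicator above depends on what "sustained" means. A minimal sketch of one possible definition follows, assuming latency is sampled at a fixed interval and that a run of consecutive samples over the limit counts as sustained. The function name, the sample format, and the default of five samples are illustrative assumptions, not part of any monitoring product.

```python
# Latency limits taken from the Indicators list (milliseconds).
LATENCY_LIMIT_MS = {"hdd": 20.0, "ssd": 5.0}


def is_sustained_breach(samples_ms, media, min_consecutive=5):
    """Return True if latency stays above the media limit for at least
    min_consecutive back-to-back samples.

    min_consecutive is a hypothetical definition of "sustained"; tune it
    to the sampling interval in use.
    """
    limit = LATENCY_LIMIT_MS[media]
    run = 0  # length of the current streak of over-limit samples
    for value in samples_ms:
        run = run + 1 if value > limit else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # One fast sample in the middle breaks the streak, so this is not sustained.
    print(is_sustained_breach([25, 30, 4, 28, 35], "hdd"))   # False
    print(is_sustained_breach([25, 30, 22, 28, 35], "hdd"))  # True
```

Counting consecutive samples rather than averaging keeps a single burst from tripping the check while still catching a plateau.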
Likely causes
- Backup or AV scan running during business hours
- RAID rebuild in progress
- Controller cache battery failed, forcing the cache into write-through mode
- Snapshot chain too long (Hyper-V checkpoint / VMware snapshot / Veeam stuck snapshot)
- Noisy-neighbour VM consuming the IOPS budget
- TRIM / garbage collection on a full SSD array
Diagnostic steps
- Identify the layer: guest OS perfmon, hypervisor (esxtop / Hyper-V perf counters), array dashboard. Each layer's latency includes the layers beneath it, so the lowest layer that still shows the spike is where the bottleneck sits
- Check for stuck snapshots: Get-VMSnapshot, vSphere snapshot manager, Veeam orphaned snapshots
- Verify controller cache state: vendor utility (perccli, ssacli) for battery/cache module health
- Identify per-VM IOPS distribution: find the noisy neighbour
- Confirm no rebuild / scrub / consistency check active
- Check free space: heavily-used SSD pools degrade severely above ~85% usage
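Two of the diagnostic steps above, locating the layer and finding the noisy neighbour, come down to simple comparisons once the numbers are collected. The sketch below assumes the latency and IOPS figures have already been gathered by hand from perfmon, esxtop, or the array dashboard. It also assumes each layer's reading includes the time spent in the layers beneath it. All names and input shapes are hypothetical.

```python
# Layers ordered from the top of the stack (application side) to the bottom (disks).
LAYERS = ["guest", "hypervisor", "array"]


def locate_bottleneck(latency_ms, threshold_ms):
    """Return the lowest layer whose latency exceeds threshold_ms, or None.

    Assumption: a layer's figure includes everything below it, so the
    lowest layer still over the threshold is the likeliest origin.
    """
    culprit = None
    for layer in LAYERS:  # walk downwards, remembering the last offender
        if latency_ms[layer] > threshold_ms:
            culprit = layer
    return culprit


def iops_share(per_vm_iops):
    """Return (vm, fraction_of_total_iops) pairs, largest consumer first."""
    total = sum(per_vm_iops.values())
    if total == 0:
        return []
    shares = ((vm, iops / total) for vm, iops in per_vm_iops.items())
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Guest and hypervisor are slow but the array is healthy:
    # the delay is being added at the hypervisor layer.
    print(locate_bottleneck({"guest": 45, "hypervisor": 40, "array": 3}, 20))
    # Hypothetical VM names; the first entry is the noisy-neighbour candidate.
    print(iops_share({"sql01": 6000, "fs01": 1500, "web01": 500})[0])
```

A VM holding most of the share is only a candidate; check whether its load is new before moving it.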
Resolution path
- Remove the immediate cause (consolidate snapshots, reschedule backup, replace cache battery)
- Move noisy workloads off shared storage
- Evaluate capacity headroom — IOPS budget vs steady-state demand
- If structural: tier upgrade (SSD/NVMe), array refresh, or workload redistribution
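The headroom evaluation above is a single ratio. A minimal sketch, assuming the IOPS budget comes from the array's rated or benchmarked figure and the demand from monitoring; the function name and the example figures are illustrative only.

```python
def iops_headroom(budget_iops, steady_state_iops):
    """Return spare IOPS as a fraction of the budget.

    A negative result means steady-state demand already exceeds the budget,
    which points to a structural problem rather than a transient one.
    """
    if budget_iops <= 0:
        raise ValueError("budget_iops must be positive")
    return (budget_iops - steady_state_iops) / budget_iops


if __name__ == "__main__":
    print(iops_headroom(20000, 15000))  # 0.25, a quarter of the budget is spare
```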
Prevention
- Snapshot lifetime alerting (>24h triggers review)
- Backup windows aligned with low-utilisation periods
- Capacity planning based on 95th-percentile IOPS, not average
- Quarterly DiskSpd baseline to detect drift
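Two of the prevention items above are easy to illustrate: why the 95th percentile matters more than the average, and how a 24-hour snapshot age check might look. This is a sketch under stated assumptions. The nearest-rank percentile is one of several common definitions, and the snapshot inventory is assumed to be a plain mapping of name to creation time, exported by whatever tooling is in use. All names are hypothetical.

```python
import math
from datetime import datetime, timedelta
from statistics import mean


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def stale_snapshots(snapshots, now, max_age=timedelta(hours=24)):
    """Return the names of snapshots older than max_age, sorted by name.

    snapshots maps a snapshot name to its creation datetime.
    """
    return sorted(name for name, created in snapshots.items()
                  if now - created > max_age)


if __name__ == "__main__":
    # 18 quiet intervals and 2 busy ones: the average hides the peak.
    iops = [1000] * 18 + [9000] * 2
    print(mean(iops))                         # 1800
    print(percentile_nearest_rank(iops, 95))  # 9000

    now = datetime(2024, 1, 10, 12, 0)
    inventory = {
        "vm1-pre-patch": datetime(2024, 1, 8, 9, 0),  # about 51 hours old
        "vm2-backup": datetime(2024, 1, 10, 2, 0),    # 10 hours old
    }
    print(stale_snapshots(inventory, now))    # ['vm1-pre-patch']
```

In the example, sizing to the average of 1800 IOPS would leave the array five times short during the busy intervals.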
Tools
- perfmon / Resource Monitor
- esxtop (VMware): disk views d (adapter), u (device), and v (VM)
- Hyper-V Performance Counters (Hyper-V Virtual Storage Device)
- Storage array dashboards (Synology, Dell, HPE, Pure, NetApp)
- DiskSpd or fio for synthetic baseline
References
- Microsoft Learn — Hyper-V storage performance counters
- VMware KB — Identifying high latency in vSphere
- Brent Ozar — SQL Server wait stats reference