RAID array degraded or failed
One or more disks have failed in a RAID set; array is degraded, rebuilding, or offline. Goal: avoid a second-disk failure during rebuild.
Indicators
- Storage controller alerts (Dell PERC, HPE Smart Array, MegaRAID, Adaptec)
- Volume offline or read-only at OS level
- Slow I/O, VMs paused-critical due to storage latency
- Predictive failure (SMART) on additional disks during rebuild
Likely causes
- Single drive failure in RAID5 with no hot spare → degraded
- Two-drive failure in RAID5, or any single-drive failure in RAID0 → array lost
- Battery/cache module failure on controller → write-through fallback (severe perf hit, not data loss)
- Firmware bug or controller hang that makes a healthy array appear degraded or failed
- Rebuild running under heavy I/O load, stressing the remaining disks
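The failure counts above follow from how many disks each RAID level can lose. A minimal sketch of that mapping, assuming RAID10 is built from two-way mirror pairs; the function names and state labels are illustrative and do not come from any vendor tool:

```python
def guaranteed_tolerance(raid_level: str, disk_count: int) -> int:
    """Disk failures the array is guaranteed to survive (worst case)."""
    tolerance = {
        "RAID0": 0,               # striping only, no redundancy
        "RAID1": disk_count - 1,  # every disk holds a full copy
        "RAID5": 1,               # single parity
        "RAID6": 2,               # dual parity
        "RAID10": 1,              # worst case: both failures in one mirror pair
    }
    if raid_level not in tolerance:
        raise ValueError(f"unsupported RAID level: {raid_level}")
    return tolerance[raid_level]


def array_state(raid_level: str, disk_count: int, failed: int) -> str:
    """Classify the array as optimal, degraded, lost, or pair-dependent."""
    if failed == 0:
        return "optimal"
    if failed <= guaranteed_tolerance(raid_level, disk_count):
        return "degraded"
    # RAID10 can survive one failure per mirror pair, but not two in the same pair.
    if raid_level == "RAID10" and failed <= disk_count // 2:
        return "check mirror pairs"
    return "lost"
```

For RAID10 the outcome depends on which disks failed, so the sketch returns a prompt to check the mirror pairs instead of guessing.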
Diagnostic steps
- Identify controller and array state via vendor tool (perccli, ssacli, storcli); never reseat or replace blindly
- Verify backup currency BEFORE touching the array: last successful job timestamp, restore-test status
- Pull SMART/predictive-failure data on every member disk; refuse to rebuild onto a disk reporting reallocated sectors
- Reduce I/O load during rebuild: pause non-critical VMs, postpone backup jobs
- If array is offline due to multi-disk failure: STOP. Do not initialise, do not 'force online' without a copy / image of every member disk
- For unrecoverable arrays: engage data-recovery specialist with raw disk images, not the live disks
Resolution path
- Confirm backup is current and restorable
- Replace failed disk with verified-healthy spare
- Allow rebuild to complete with reduced I/O load
- Post-rebuild: scrub / consistency check, then re-evaluate RAID level (RAID6 / RAID10 for >2TB drives)
- If array lost: rebuild from backup onto fresh hardware
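The first two steps of the resolution path work as a gate before any disk is swapped. A minimal sketch of that gate; the `PreflightFacts` fields and the 24-hour backup-age threshold are assumptions to adjust to local policy, and the values would be filled in by hand or from whatever backup and SMART tooling is in use:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class PreflightFacts:
    last_backup_success: datetime   # timestamp of last successful backup job
    restore_test_passed: bool       # has a restore test actually succeeded?
    spare_reallocated_sectors: int  # SMART reallocated count on the spare


def rebuild_blockers(
    facts: PreflightFacts,
    now: datetime,
    max_backup_age: timedelta = timedelta(hours=24),  # assumed policy value
) -> List[str]:
    """Return the reasons not to start the rebuild; empty list means go."""
    blockers = []
    if now - facts.last_backup_success > max_backup_age:
        blockers.append("backup is older than the allowed age")
    if not facts.restore_test_passed:
        blockers.append("no successful restore test on record")
    if facts.spare_reallocated_sectors > 0:
        blockers.append("spare disk reports reallocated sectors")
    return blockers
```

Returning the list of blockers, not a single boolean, keeps the reason for a stop visible in the incident notes.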
Prevention
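- RAID6 or RAID10 for arrays with >2TB drives: a RAID5 rebuild must read every surviving disk in full, so the chance of hitting an unrecoverable read error (URE) grows with drive size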
- Hot spare configured and tested
- Predictive failure alerts ingested into monitoring
- Quarterly consistency checks scheduled off-hours
- Battery / cache module replacement on schedule
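The RAID5 rebuild risk behind the first prevention item can be estimated with simple arithmetic. A rough sketch, assuming a spec-sheet URE rate of one error per 1e14 bits read and independent bit errors; both are simplifying assumptions, and spec-sheet rates are generally quoted as worst-case bounds, so treat the result as a pessimistic estimate and not a measurement:

```python
import math


def rebuild_ure_probability(
    surviving_disks: int,
    disk_size_tb: float,
    ure_rate_bits: float = 1e14,  # assumed: 1 unrecoverable error per 1e14 bits
) -> float:
    """Probability of at least one URE while reading all surviving disks."""
    bits_read = surviving_disks * disk_size_tb * 1e12 * 8  # decimal TB to bits
    p_bit = 1.0 / ure_rate_bits
    # 1 - (1 - p)^n, computed with log1p/expm1 to avoid floating-point
    # cancellation when p is tiny and n is huge.
    return -math.expm1(bits_read * math.log1p(-p_bit))
```

Under these assumptions, a five-disk RAID5 of 4 TB drives (four surviving disks read in full) comes out at roughly 0.72, and the same array at a 1e15 rate at roughly 0.12. The trend with drive size matters more than the exact figure.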
Tools
- Vendor RAID utilities: perccli (Dell), ssacli/ssaducli (HPE), storcli (Broadcom/LSI), arcconf (Adaptec)
- iDRAC / iLO / IPMI for out-of-band status
- smartctl for individual disk health
- Veeam SureBackup or restore test to validate backup before risk-taking actions
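A first status query with the vendor utilities listed above might be wrapped as follows. This is a sketch under assumptions: the controller index (`/c0`, or `1` for arcconf) and the binary names vary by host (some installs ship `storcli64` or `perccli64`), and only read-only "show" style commands are listed:

```python
import shutil
from typing import Callable, Dict, List, Optional

# Read-only status commands; controller indexes here are assumptions.
STATUS_COMMANDS: Dict[str, List[str]] = {
    "perccli": ["perccli", "/c0", "show"],
    "storcli": ["storcli", "/c0", "show"],
    "ssacli": ["ssacli", "ctrl", "all", "show", "status"],
    "arcconf": ["arcconf", "getconfig", "1"],
}


def pick_status_command(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the status command for the first vendor tool found on PATH."""
    for tool, argv in STATUS_COMMANDS.items():
        if which(tool):
            return argv
    return None
```

The `which` parameter exists only so the lookup can be exercised without the tools installed; in normal use the default `shutil.which` is enough, and the returned argument list can be passed to `subprocess.run`.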
References
- Vendor controller user guides (Dell PERC, HPE Smart Array)
- SNIA — RAID rebuild risk and URE statistics
- Backup vendor restore-verification documentation