Resource Exhaustion: Disk Full and OOM Kills Causing System/Application Failures
Resource exhaustion occurs when a host runs out of disk space or of physical or virtual memory. On a full disk, application writes fail with ENOSPC errors; under memory exhaustion, the OS invokes the Out-of-Memory (OOM) killer. Either condition leads to service crashes, database commit failures, log truncation, and cascading failures. Remediation requires immediate identification of the exhausted resource, emergency space/memory recovery, and addressing root causes such as unbounded log growth or memory leaks.
Indicators
- Applications returning 'No space left on device' (errno ENOSPC) on Linux or 'There is not enough space on the disk' on Windows
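- OOM kill events in the kernel log (dmesg, or /var/log/kern.log on Debian-family systems): 'Out of memory: Kill process <pid> (<name>) score <N> or sacrifice child' on older kernels, 'Out of memory: Killed process <pid> (<name>)' on newer ones
- Services or containers abruptly exit with exit code 137 (128 + 9, meaning the process was killed by SIGKILL, the signal the OOM killer sends; confirm against the kernel log, since any other sender of SIGKILL produces the same code)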
- Disk usage at or near 100% on one or more mount points (e.g., /, /var, /tmp, data volume)
- Database refusing new writes or transactions rolling back due to inability to write WAL/redo logs
- System logs (syslog, journald, Windows Event Log) stopping or truncating due to full log partition
- Application health checks failing or returning HTTP 500 errors coinciding with resource exhaustion event
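The exit code 137 indicator follows the common shell convention that a status above 128 means "terminated by signal (status minus 128)". The sketch below decodes a status that way; `describe_exit_status` is a hypothetical helper name, and the function only interprets the number. It cannot tell who sent the signal, so a SIGKILL result still needs confirming in the kernel log.

```shell
# Describe a process or container exit status.
# Assumption: statuses above 128 follow the "128 + signal number" convention.
describe_exit_status() {
  status=$1
  if [ "$status" -gt 128 ] 2>/dev/null; then
    signal=$((status - 128))
    if [ "$signal" -eq 9 ]; then
      echo "signal 9 (SIGKILL): possible OOM kill, confirm in the kernel log"
    else
      echo "signal $signal"
    fi
  else
    echo "exit code $status (not signal-terminated)"
  fi
}
```

For example, `describe_exit_status 137` reports SIGKILL, while `describe_exit_status 1` reports an ordinary application error.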
Likely causes
- Log files growing unbounded due to missing or misconfigured log rotation (logrotate, systemd journal limits)
- Database transaction logs, WAL files, or temp tablespace consuming all available disk
- Runaway process or memory leak causing a single application to consume all available RAM, triggering OOM killer
- Container or VM provisioned without adequate memory limits, allowing a workload to starve the host
- Large core dump files filling /var/crash or equivalent directory
- Backup or archive jobs writing large files to the same partition as application data
- Disk quota misconfiguration allowing a single user or service account to consume all space
- tmpfs or /tmp partition sized too small relative to workload demand
Diagnostic steps
- Check disk usage across all mounted filesystems: `df -h`. Pinpoints the exhausted volume so remediation is targeted correctly.
- Identify the largest directories consuming space on the full partition: `du -xh --max-depth=1 <mountpoint> 2>/dev/null | sort -rh | head -20` (`-x` keeps `du` on that one filesystem). Reveals whether logs, databases, core dumps, or another artifact is the culprit.
- Check kernel and system logs for OOM kill events: `dmesg -T | grep -i 'oom\|killed process\|out of memory'`. Confirms whether OOM kills occurred, identifies the killed process, and timestamps the event for correlation.
- Check current memory usage and identify top memory consumers: `free -h && ps aux --sort=-%mem | head -20`. Determines whether memory pressure is still present and which process is consuming the most RAM.
- On Linux, check journald for OOM events: `journalctl -k --since '1 hour ago' | grep -i 'oom\|memory'`; on Windows, check the System Event Log for source 'Microsoft-Windows-Resource-Exhaustion-Detector'. Provides additional context on the frequency and pattern of OOM events.
- Identify open but deleted files holding disk space (Linux only): `lsof +L1 | grep deleted`. Files deleted while a process still holds them open do not free disk space until the process closes the file descriptor, a common cause of persistent 'disk full' after apparent cleanup.
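When the kernel log contains many OOM events, a per-process tally makes the pattern easier to see than raw grep output. The following is a minimal sketch that reads kernel log text on stdin and counts victims by process name. It assumes each kill is logged with the phrase `Killed process <pid> (<name>)`; `oom_victim_counts` is a hypothetical helper name, and process names containing `)` are not handled.

```shell
# Count OOM-killed processes by name from kernel log text on stdin.
# Assumption: each kill is logged as "... Killed process <pid> (<name>) ...".
oom_victim_counts() {
  sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
    sort | uniq -c | sort -rn |
    awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n, $0 }'
}
```

Typical use would be `dmesg -T | oom_victim_counts` or `journalctl -k | oom_victim_counts`. Each output line is a count followed by a process name, most frequent first.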
Resolution path
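- 1. Immediately free disk space to restore basic system function: delete core dumps (`rm -rf /var/crash/*`), remove or compress old log files (`find /var/log -name '*.log' -mtime +7 -exec gzip {} \;`), and truncate large application logs in place (`truncate -s 0 <logfile>`) rather than deleting files a process still holds open. Because `gzip` writes the compressed copy before removing the original, delete or move something first if the volume is completely full.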
- 2. If deleted files are held open by running processes (identified via `lsof +L1 | grep deleted`), restart the holding service to release the file descriptors and reclaim space: `systemctl restart <service-name>`.
- 3. If an OOM kill terminated a critical service, restart it after confirming memory pressure has subsided: `systemctl restart <service-name>` — verify it stays up with `systemctl status <service-name>`.
- 4. If the OOM condition is ongoing (process still leaking memory), identify and stop the leaking process: try `kill <pid>` (SIGTERM) first and fall back to `kill -9 <pid>` if it does not exit. Then restart the service under a memory cgroup limit or with ulimit constraints.
- 5. For disk full caused by log accumulation, immediately configure or repair log rotation: install/verify logrotate config in `/etc/logrotate.d/<app>` with `rotate`, `compress`, `maxsize`, and `daily` directives, then test with `logrotate -d /etc/logrotate.conf`.
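- 6. If a database WAL or temp file is the cause, follow the database-specific procedure to truncate or archive logs. PostgreSQL: never delete files from `pg_wal` by hand; fix whatever is retaining WAL (for example a failing `archive_command` or an inactive replication slot), and use `SELECT pg_switch_wal();` to close the current segment so it can be archived, after which checkpoints can recycle old segments. MySQL: `PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY);`.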
- 7. Expand the affected disk volume or partition if immediate cleanup is insufficient: resize the underlying block device, then extend the filesystem (`resize2fs /dev/sdX` for ext4, or `xfs_growfs /mountpoint` for XFS).
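For the log rotation step, the sketch below writes a logrotate stanza using the directives named above (`rotate`, `compress`, `maxsize`, `daily`). The helper name `write_logrotate_stanza`, the `100M` size, the 7-file retention, and the extra directives (`delaycompress`, `missingok`, `notifempty`, `copytruncate`) are illustrative assumptions, not values taken from this runbook; tune them to the application.

```shell
# Write a logrotate stanza for one log path to a destination file.
# Both arguments are caller-supplied; sizes and retention are example values.
write_logrotate_stanza() {
  log_glob=$1   # e.g. '/var/log/myapp/*.log' (hypothetical path)
  dest=$2       # e.g. /etc/logrotate.d/myapp (hypothetical file name)
  cat > "$dest" <<EOF
$log_glob {
    daily
    maxsize 100M
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
}
```

A possible invocation is `write_logrotate_stanza '/var/log/myapp/*.log' /etc/logrotate.d/myapp && logrotate -d /etc/logrotate.d/myapp`, where `-d` performs a dry run. Note that `maxsize` is only evaluated when logrotate actually runs, so a once-a-day schedule may not catch a log that fills the disk within hours. `copytruncate` suits applications that cannot reopen their log file, at the cost of possibly losing lines written during the copy.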
Prevention
- Implement log rotation for all application and system logs with a defined maxsize and retention period using logrotate or equivalent, and test the configuration in staging.
- Set memory resource limits on all services and containers (cgroup v1 `memory.limit_in_bytes` or cgroup v2 `memory.max` on Linux, container memory limits in Kubernetes/Docker) to prevent a single workload from exhausting host memory.
- Configure disk usage alerting at 75% and 90% thresholds on all partitions so the on-call engineer is notified before exhaustion, not after.
- Separate high-volume write paths (logs, database data, temp files) onto dedicated volumes so that log exhaustion cannot affect database or OS partitions.
- Enforce core dump size limits via `/etc/security/limits.conf` (`* hard core 0` or a bounded size) to prevent crash dumps from filling production disks.
- Regularly audit and enforce database log retention policies (binary log expiry, WAL archiving) aligned with recovery time objectives.
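The 75% and 90% alerting thresholds can be illustrated with a small filter over `df -P` output. This is a sketch of the classification logic only, not a replacement for a monitoring system; `disk_alerts` is a hypothetical helper name, and it assumes device names and mount points contain no spaces.

```shell
# Classify filesystems by usage. Reads `df -P` output on stdin.
# Arguments: warning and critical thresholds in percent (default 75 and 90).
disk_alerts() {
  warn=${1:-75}
  crit=${2:-90}
  awk -v warn="$warn" -v crit="$crit" '
    NR > 1 {
      pct = $5                  # Capacity column, e.g. "95%"
      sub(/%/, "", pct)
      if (pct + 0 >= crit)      print "CRITICAL", $6, pct "%"
      else if (pct + 0 >= warn) print "WARNING", $6, pct "%"
    }'
}
```

Running `df -P | disk_alerts 75 90` prints one line per filesystem at or above a threshold and nothing when all filesystems are below the warning level, which makes it easy to wire into a cron job or a check script.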
Tools
- df — report filesystem disk space usage
- du — estimate file and directory space usage
- dmesg — kernel ring buffer, captures OOM kill events
- free — display memory usage summary
- ps aux — list processes with CPU and memory usage
- lsof — list open files, including deleted-but-held files consuming disk space
- journalctl — query systemd journal for OOM and kernel events
- logrotate — manage log file rotation, compression, and deletion
- resize2fs / xfs_growfs — online filesystem expansion after block device resize
- top / htop — interactive real-time process and memory monitor
- vmstat — report virtual memory, swap, and I/O statistics