P1 · Windows Server

Resource Exhaustion: Disk Full and OOM Kills Causing System/Application Failures

Resource exhaustion occurs when a host runs out of disk space or memory (physical RAM plus swap). When disk is exhausted, application writes fail with ENOSPC errors; when memory is exhausted, the Linux kernel invokes the Out-of-Memory (OOM) killer. Either condition leads to service crashes, database commit failures, log truncation, and cascading failures. Remediation requires immediate identification of the exhausted resource, emergency space/memory recovery, and addressing root causes such as unbounded log growth or memory leaks.

Indicators

Applications returning 'No space left on device' (errno ENOSPC) on Linux or 'There is not enough space on the disk' on Windows
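OOM kill events in kernel log: 'Out of memory: Kill process <pid> (<name>) score <N> or sacrifice child' on older kernels, or 'Out of memory: Killed process <pid> (<name>)' on newer kernels, in /var/log/kern.log or dmesg
Services or containers abruptly exit with exit code 137 (128 + 9, meaning SIGKILL, the signal the OOM killer sends; any other SIGKILL produces the same code, so confirm against the kernel log)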
Disk usage at or near 100% on one or more mount points (e.g., /, /var, /tmp, data volume)
Database refusing new writes or transactions rolling back due to inability to write WAL/redo logs
System logs (syslog, journald, Windows Event Log) stopping or truncating due to full log partition
Application health checks failing or returning HTTP 500 errors coinciding with resource exhaustion event
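
Exit code 137 in the list above decodes as 128 + 9: shells and container runtimes conventionally report a process terminated by signal N as exit status 128 + N, and signal 9 is SIGKILL. The sketch below only illustrates that arithmetic. `describe_exit_code` is a hypothetical helper name, not a standard utility, and the upper bound of 192 assumes the Linux signal range of 1 to 64.

```shell
#!/bin/sh
# describe_exit_code: hypothetical helper illustrating the 128 + N convention.
# Assumption: statuses 129..192 mean "terminated by signal (status - 128)".
describe_exit_code() {
    code=$1
    if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        echo "terminated by signal $((code - 128))"
    else
        echo "exited with status $code"
    fi
}

# Usage:
#   describe_exit_code 137    # signal 9 is SIGKILL
#   describe_exit_code 143    # signal 15 is SIGTERM
```

The code alone does not prove an OOM kill, since an operator or orchestrator sending SIGKILL yields the same status. Treat it as a prompt to check the kernel log.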

Likely causes

Log files growing unbounded due to missing or misconfigured log rotation (logrotate, systemd journal limits)
Database transaction logs, WAL files, or temp tablespace consuming all available disk
Runaway process or memory leak causing a single application to consume all available RAM, triggering OOM killer
Container or VM provisioned without adequate memory limits, allowing a workload to starve the host
Large core dump files filling /var/crash or equivalent directory
Backup or archive jobs writing large files to the same partition as application data
Disk quota misconfiguration allowing a single user or service account to consume all space
tmpfs or /tmp partition sized too small relative to workload demand
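
To check whether a workload really runs without a memory limit (the container/VM cause above), one option is to read the limit file in its cgroup directory. The sketch below assumes cgroup v2 exposes `memory.max` (where the literal value `max` means no limit) and cgroup v1 exposes `memory.limit_in_bytes`. `cgroup_mem_limit` is a hypothetical helper, and the caller must supply the cgroup directory, whose location varies by distribution and init system.

```shell
#!/bin/sh
# cgroup_mem_limit: hypothetical helper. Prints the memory limit stored in the
# cgroup directory given as $1, "unlimited" for a v2 cgroup with no cap, or
# "unknown" if neither limit file is readable.
# Assumption: on cgroup v1 an unlimited cgroup shows a very large byte count,
# which is printed as-is rather than interpreted.
cgroup_mem_limit() {
    dir=$1
    if [ -r "$dir/memory.max" ]; then
        limit=$(cat "$dir/memory.max")
        if [ "$limit" = "max" ]; then
            echo "unlimited"
        else
            echo "$limit"
        fi
    elif [ -r "$dir/memory.limit_in_bytes" ]; then
        cat "$dir/memory.limit_in_bytes"
    else
        echo "unknown"
    fi
}

# Usage (the path is an example for a systemd host using cgroup v2):
#   cgroup_mem_limit /sys/fs/cgroup/system.slice/<service-name>.service
```

A result of "unlimited" for a memory-hungry service suggests the missing-limit cause is in play.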

Diagnostic steps

Check disk usage across all mounted filesystems: `df -h`
Pinpoints the exhausted volume so remediation is targeted correctly.

Identify the largest directories consuming space on the full partition: `du -xh --max-depth=1 <mountpoint> 2>/dev/null | sort -rh | head -20`
Reveals whether logs, databases, core dumps, or another artifact is the culprit. The `-x` flag keeps `du` on the affected filesystem instead of descending into other mounts.

Check kernel and system logs for OOM kill events: `dmesg -T | grep -i 'oom\|killed process\|out of memory'`
Confirms whether OOM kills occurred, identifies the killed process, and timestamps the event for correlation.

Check current memory usage and identify top memory consumers: `free -h && ps aux --sort=-%mem | head -20`
Determines whether memory pressure is still present and which process is consuming the most RAM.

On Linux, check journald for OOM events: `journalctl -k --since '1 hour ago' | grep -i 'oom\|memory'`; on Windows, check the System event log for events from source 'Microsoft-Windows-Resource-Exhaustion-Detector'
Provides additional context on the frequency and pattern of OOM events.

Identify open but deleted files holding disk space (Linux only): `lsof +L1 | grep deleted`
On Linux, a file deleted while a process still holds it open does not free its disk space until the process releases the file descriptor. This is a common cause of a persistent 'disk full' condition after apparent cleanup.
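
The OOM log checks above return raw kernel lines. A small filter can reduce them to the PID and name of each victim for correlation with service restarts. This is a sketch under the assumption that the lines follow one of the message shapes quoted in the Indicators section ("Kill process" on older kernels, "Killed process" on newer ones, including the "Memory cgroup out of memory" variant). `oom_victims` is a hypothetical helper name, and exact message formats vary by kernel version.

```shell
#!/bin/sh
# oom_victims: hypothetical helper. Reads kernel log text on stdin and prints
# "<pid> <name>" for every line that reports an OOM kill.
# Assumption: the line contains "out of memory: Kill process <pid> (<name>)"
# or "out of memory: Killed process <pid> (<name>)"; other lines are dropped.
oom_victims() {
    sed -nE 's/.*[Oo]ut of memory: Kill(ed)? process ([0-9]+) \(([^)]*)\).*/\2 \3/p'
}

# Usage (reading the kernel log usually requires elevated privileges):
#   dmesg -T | oom_victims
#   journalctl -k --since '1 hour ago' | oom_victims
```

Process names that themselves contain a closing parenthesis would be cut short by this pattern, so treat the output as a triage aid and read the full log line before acting.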

Resolution path

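1. Immediately free disk space to restore basic system function: compress old log files (`find /var/log -name '*.log' -mtime +7 -exec gzip {} \;`), delete core dumps that are no longer needed for debugging (`rm -rf /var/crash/*`), and truncate large active application logs in place (`truncate -s 0 <logfile>`) rather than deleting them, so the writing process keeps a valid file handle and the space is released at once. If the volume is at 100%, delete or truncate first, because `gzip` needs room to write the compressed copy before it removes the original. Copy anything that must be kept to another volume beforehand.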
2. If deleted files are held open by running processes (identified via `lsof +L1 | grep deleted`), restart the holding service to release the file descriptors and reclaim space: `systemctl restart <service-name>`.
3. If an OOM kill terminated a critical service, restart it after confirming memory pressure has subsided: `systemctl restart <service-name>`. Then verify it stays up with `systemctl status <service-name>`.
4. If the OOM condition is ongoing (process still leaking memory), identify and stop the leaking process: try `kill <pid>` (SIGTERM) first so it can shut down cleanly, and fall back to `kill -9 <pid>` only if it does not exit. Then restart the service under a memory cgroup limit or with ulimit constraints.
5. For disk full caused by log accumulation, immediately configure or repair log rotation: install/verify logrotate config in `/etc/logrotate.d/<app>` with `rotate`, `compress`, `maxsize`, and `daily` directives, then test with `logrotate -d /etc/logrotate.conf`.
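6. If a database WAL or temp file is the cause, follow the database-specific procedure to truncate or archive logs rather than deleting files by hand (e.g., PostgreSQL: restore a working `archive_command` or remove stale replication slots so the server can recycle WAL, using `SELECT pg_switch_wal();` to close the current segment for archiving; MySQL: `PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY);`).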
7. Expand the affected disk volume or partition if immediate cleanup is insufficient: resize the underlying block device, then extend the filesystem (`resize2fs /dev/sdX` for ext4, or `xfs_growfs /mountpoint` for XFS).
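
For step 5, a logrotate stanza with the directives named there might look like the output of the sketch below. `write_logrotate_conf` is a hypothetical helper, and the retention count (7) and size cap (100M) are placeholder values chosen for illustration, not recommendations from this runbook.

```shell
#!/bin/sh
# write_logrotate_conf: hypothetical helper. Writes a logrotate stanza for the
# log glob in $1 to the file named in $2.
# Assumption: "rotate 7" and "maxsize 100M" are placeholders to tune per app.
write_logrotate_conf() {
    log_glob=$1
    dest=$2
    cat > "$dest" <<EOF
$log_glob {
    daily
    rotate 7
    maxsize 100M
    compress
    missingok
    notifempty
}
EOF
}

# Usage (paths are examples):
#   write_logrotate_conf '/var/log/<app>/*.log' /etc/logrotate.d/<app>
#   logrotate -d /etc/logrotate.d/<app>   # debug mode: reports actions, changes nothing
```

Note that `maxsize` is only evaluated when logrotate actually runs. If it is scheduled once a day, a log can still grow past the cap between runs, so a fast-growing log may need a more frequent schedule.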

Prevention

Implement log rotation for all application and system logs with a defined maxsize and retention period using logrotate or equivalent, and test the configuration in staging.
Set memory resource limits on all services and containers (`memory.limit_in_bytes` under cgroup v1 or `memory.max` under cgroup v2 on Linux, container memory limits in Kubernetes/Docker) to prevent a single workload from exhausting host memory.
Configure disk usage alerting at 75% and 90% thresholds on all partitions so the on-call engineer is notified before exhaustion, not after.
Separate high-volume write paths (logs, database data, temp files) onto dedicated volumes so that log exhaustion cannot affect database or OS partitions.
Limit core dump size via `/etc/security/limits.conf` (`* hard core 0` or a bounded size) to prevent crash dumps from filling production disks.
Regularly audit and enforce database log retention policies (binary log expiry, WAL archiving) aligned with recovery time objectives.
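
Where a full monitoring stack is not yet in place, the 75% and 90% alert thresholds above can be approximated with a filter over `df -P` output, run from cron or a health check. This is a minimal sketch rather than a substitute for real alerting. `disk_alerts` is a hypothetical helper, and it assumes the POSIX `df -P` column layout with usage in column 5 and the mount point in column 6.

```shell
#!/bin/sh
# disk_alerts: hypothetical helper. Reads `df -P` output on stdin and prints
# "<LEVEL> <mount> <use>%" for each filesystem at or above the warning level.
# $1 = warning threshold in percent (default 75)
# $2 = critical threshold in percent (default 90)
# Limitation: mount points containing spaces are not handled.
disk_alerts() {
    warn=${1:-75}
    crit=${2:-90}
    awk -v w="$warn" -v c="$crit" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            use += 0
            if (use >= c)      print "CRITICAL", $6, use "%"
            else if (use >= w) print "WARNING", $6, use "%"
        }'
}

# Usage:
#   df -P | disk_alerts 75 90
```

Reading from stdin keeps the filter testable with canned input and lets the same function be applied to `df -P` output collected from remote hosts.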

Tools

df — report filesystem disk space usage
du — estimate file and directory space usage
dmesg — kernel ring buffer, captures OOM kill events
free — display memory usage summary
ps aux — list processes with CPU and memory usage
lsof — list open files, including deleted-but-held files consuming disk space
journalctl — query systemd journal for OOM and kernel events
logrotate — manage log file rotation, compression, and deletion
resize2fs / xfs_growfs — online filesystem expansion after block device resize
top / htop — interactive real-time process and memory monitor
vmstat — report virtual memory, swap, and I/O statistics

disk-full · oom-killer · resource-exhaustion · memory · storage · linux · windows-server · incident-response · P1 · log-management · containers · database