What is OOM Killer and Why It Appears
OOM Killer (Out-of-Memory Killer) is a Linux kernel mechanism that automatically terminates processes when the system has exhausted available RAM and swap space. Its goal is to free up memory so the kernel and critical system processes can continue operating, preventing a complete system crash.
Typically, OOM Killer activates when:
- Physical RAM and swap are 100% full.
- An application has a memory leak.
- Too many memory-intensive processes are running on the server.
- Memory limits in containers (Docker/Kubernetes) are misconfigured.
If you see "Killed process" messages in the kernel log, or an application suddenly exits with code 137 (128 + signal 9, SIGKILL), the OOM Killer is the likely culprit.
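The 137 exit code is simply 128 plus the signal number 9 (SIGKILL). A quick sketch that demonstrates the mapping, simulated here with a manual kill rather than a real OOM event:

```shell
# Simulate a SIGKILLed process: the parent observes exit status 128 + 9 = 137,
# exactly what an OOM-killed process reports.
sh -c 'kill -9 $$'
status=$?
echo "exit status: $status"   # prints "exit status: 137"
if [ "$status" -eq 137 ]; then
    echo "process died from SIGKILL - check dmesg for OOM messages"
fi
```

The same check works in service wrappers and CI scripts: any status of 137 is worth cross-referencing against the kernel log.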
How OOM Killer Works
The Linux kernel calculates an oom_score for each process based on:
- The proportion of memory consumed by the process (primary factor).
- Process privileges (root processes are less likely to be killed).
- The administrator-set adjustment, oom_score_adj (older kernels also factored in process lifetime, making long-running processes less likely to be chosen).
The process with the highest oom_score is selected for termination. However, this isn't always optimal: OOM Killer might kill an important service while leaving a background process with a leak.
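Both numbers are exposed per process under /proc. A minimal read-only sketch that inspects the current shell (any PID works the same way):

```shell
# oom_score is the kernel's computed badness; oom_score_adj is the admin bias.
score=$(cat /proc/self/oom_score)
adj=$(cat /proc/self/oom_score_adj)
echo "PID $$: oom_score=$score oom_score_adj=$adj"
```

A higher oom_score means the process is a more likely victim; the adjustment is covered in the resolution steps below.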
Diagnosing the Problem
Before taking any action, confirm that OOM Killer is the cause.
- Check kernel logs:
dmesg | grep -i kill
Example output:
[12345.678] Out of memory: Kill process 1234 (nginx) score 500 or sacrifice child
[12345.680] Killed process 1234 (nginx) total-vm:1234567kB, anon-rss:456789kB, file-rss:0kB
Here, the process nginx with PID 1234 was killed.
- Assess total memory:
free -h
Pay attention to the total, used, available, and Swap columns. If available is near zero and Swap is also full, the system is in a critical state.
- Find the memory-consuming process:
top -b -n 1 | head -20
Or use htop sorted by memory (press F6 → MEM%).
- Check each process's oom_score:
for pid in $(ps -e -o pid=); do echo "PID $pid: $(cat /proc/$pid/oom_score 2>/dev/null) (adj: $(cat /proc/$pid/oom_score_adj 2>/dev/null))"; done | sort -k3 -n -r | head -10
This shows the top 10 processes with the highest oom_score.
Problem Resolution Methods
Step 1: Configure oom_score_adj to Protect Key Processes
Each process can be assigned an oom_score_adj value from -1000 (fully protected) to +1000 (first to be killed). This is the fastest way to protect a process.
For a one-time setting (until reboot):
# Replace <PID> with the process ID (lowering the score requires root)
echo -1000 > /proc/<PID>/oom_score_adj
For a permanent setting via systemd (recommended), create a drop-in file:
# /etc/systemd/system/your-service.service.d/oom-protect.conf
[Service]
OOMScoreAdjust=-1000
Then reload and restart the service: systemctl daemon-reload && systemctl restart your-service. You can verify the setting with systemctl show your-service --property=OOMScoreAdjust.
Important: Do not set oom_score_adj=-1000 for all processes — this may prevent OOM Killer from freeing memory and cause the system to hang.
Step 2: Use cgroups to Limit Memory
cgroups (control groups) allow setting hard memory limits for process groups. This is the best approach for containers and isolated services.
Via systemd (modern distributions):
# Run a command with a 500 MB limit
systemd-run --scope -p MemoryMax=500M /path/to/command
# Or for an existing service, create a drop-in:
# /etc/systemd/system/your-service.service.d/limits.conf
[Service]
MemoryMax=1G
# Cap swap separately if swap is in use (cgroup v2 only)
MemorySwapMax=2G
Manually via cgroup v2:
# Create a cgroup
sudo mkdir /sys/fs/cgroup/mylimit
# Set a 1 GB limit
echo $((1*1024*1024*1024)) | sudo tee /sys/fs/cgroup/mylimit/memory.max
# Start a process in this group
# (a plain "sudo echo $$ > file" fails: the redirection runs unprivileged, so use tee)
echo $$ | sudo tee /sys/fs/cgroup/mylimit/cgroup.procs && /path/to/your/app
Step 3: Tune Kernel Parameters
Adjust OOM Killer behavior at the kernel level.
Option A: Disable memory overcommit (strict control):
In /etc/sysctl.conf, add:
vm.overcommit_memory = 2
# Allow committing only up to swap + 100% of RAM
vm.overcommit_ratio = 100
Apply: sudo sysctl -p. This prevents allocating non-existent memory but may cause fork: Cannot allocate memory errors in applications.
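Before switching modes, it is worth checking the current policy and the kernel's commit accounting. A read-only sketch:

```shell
# Current overcommit policy (0 = heuristic, 1 = always allow, 2 = strict)
echo "overcommit_memory = $(cat /proc/sys/vm/overcommit_memory)"
echo "overcommit_ratio  = $(cat /proc/sys/vm/overcommit_ratio)"
# CommitLimit is enforced only in mode 2; Committed_AS is what is promised now
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```

If Committed_AS is already close to CommitLimit, enabling strict mode will start failing allocations immediately.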
Option B: Panic mode instead of killing (for debugging):
vm.panic_on_oom = 2
On memory shortage, the system triggers a kernel panic; this is useful for collecting crash dumps but unsuitable for production.
Option C: Change OOM Killer aggressiveness (rarely used):
# Kill the task that triggered the failing allocation instead of scoring all processes
vm.oom_kill_allocating_task = 1
Step 4: Optimize the Application or Increase Resources
If the issue is caused by a memory leak:
- Use profilers: valgrind --leak-check=full, heaptrack, perf.
- For Java apps: tune -Xmx and -Xms in the JVM.
- For Python: check for leaks (e.g., via tracemalloc).
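When no profiler is installed, the kernel's own counters give a first signal of growth. This read-only sketch inspects the current shell, but substituting a suspect PID for "self" works the same way:

```shell
# VmPeak is the largest virtual size seen; VmRSS is resident memory right now.
# Sampling these values for a suspect PID over time makes a leak obvious.
grep -E 'VmPeak|VmRSS' /proc/self/status
```

A VmRSS that climbs steadily between samples while the workload stays constant is a strong hint of a leak.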
If the load is legitimate:
- Increase the server's RAM.
- Add a swap file (temporary fix, not a panacea):
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
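After running the swap commands, a read-only check confirms the area is active:

```shell
# Active swap areas (the header line prints even when none are enabled)
cat /proc/swaps
# Totals from the kernel's point of view, converted from kB to GB
awk '/SwapTotal|SwapFree/ {printf "%s %.1f GB\n", $1, $2/1048576}' /proc/meminfo
```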
Preventing OOM Killer
- Memory monitoring:
  - Use Prometheus + node_exporter or Zabbix.
  - Set alerts for RAM usage > 80%.
  - Quick check command: awk '/MemAvailable/ {printf "%.1f GB available\n", $2/1048576}' /proc/meminfo
- Log memory usage via cron, e.g. every 5 minutes:
  */5 * * * * /usr/bin/free -h >> /var/log/memory.log
- Regular process audits:
  - Look for processes with abnormally high oom_score.
  - Verify container limits are appropriate.
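The audit can be scripted directly from /proc. A minimal sketch that lists the five processes the kernel would consider first:

```shell
# For every PID, print "score pid name", then keep the five highest scores.
for d in /proc/[0-9]*; do
    pid=${d#/proc/}
    score=$(cat "$d/oom_score" 2>/dev/null) || continue  # process may have exited
    name=$(cat "$d/comm" 2>/dev/null)
    echo "$score $pid $name"
done | sort -rn | head -5
```

Running this periodically and diffing the output makes a slowly leaking process stand out before the OOM Killer has to act.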
Container-Specific Behavior (Docker/Kubernetes)
In containers, OOM Killer operates within cgroup isolation, but if a container exhausts its limit, the kernel kills processes inside it.
Docker:
# 512 MB RAM limit; --memory-swap sets the RAM+swap total, so this permits 512 MB of swap
docker run -d --memory=512m --memory-swap=1g your-image
# Check limits
docker stats
Kubernetes:
resources:
limits:
memory: "512Mi"
cpu: "500m"
requests:
memory: "256Mi"
Ensure requests and limits are set appropriately. When the memory limit is exceeded, the container is killed and the pod reports the OOMKilled status.
Common Configuration Mistakes
- Protecting all processes with oom_score_adj=-1000: this effectively disables OOM Killer entirely, which can lead to a complete system lockup during memory shortages.
- Setting cgroup limits higher than physical RAM: even with high limits, OOM Killer will still trigger at the host level.
- Ignoring swap: swap slows the system but can buy reaction time; completely disabling it (swapoff -a) makes OOM Killer trigger sooner.
- Misinterpreting logs: "Killed process" can also come from a manual kill -9. Always check dmesg and the process exit code (137 = 128 + 9, SIGKILL, often from OOM).
What's Next?
After applying measures, verify:
- Stability under load (test with stress-ng or real traffic).
- Absence of new OOM Killer entries in logs.
- Protected processes are functioning correctly (not consuming excessive memory at others' expense).
If the problem persists, consider architectural changes: sharding, caching, or using more efficient data processing algorithms.
Remember: OOM Killer is the system's last line of defense. The best strategy is to prevent it from triggering through monitoring and prudent resource planning.