Introduction to Linux Performance Monitoring
Linux performance monitoring is not just about checking CPU load. It's a comprehensive analysis of the system: processor, memory, disks, network, and I/O. Understanding metrics helps prevent downtime, optimize resource costs, and quickly respond to anomalies.
In this guide, you'll master both basic utilities and advanced tools. We'll focus on practical scenarios: how to find a "hot" process, why a disk is slow, why the network is overloaded. All commands work on most distributions (Ubuntu, CentOS, Debian, Fedora).
Basic Utilities for Daily Use
top and htop: Interactive Process Monitoring
top is usually the first tool to reach for. Run it and study the screen:
top
Key lines:
- %Cpu(s): breakdown into us (user processes), sy (system), id (idle), and wa (I/O wait).
- KiB Mem (MiB Mem in newer versions): RAM usage: used, free, buff/cache.
- KiB Swap: swap usage.
Sorting: press P (by CPU), M (by memory). To show individual threads instead of whole processes, add -H at startup: top -H.
Tip: htop is an interactive alternative with colors, a process tree, and convenient process management. Install it via sudo apt install htop or sudo yum install htop (on CentOS this may require the EPEL repository).
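The us, sy, id, and wa percentages come from counters the kernel exposes in /proc/stat. The following is a minimal sketch of that calculation, assuming the usual field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal). The function name cpu_breakdown is hypothetical, and the result is only an approximation of what top reports.

```shell
#!/bin/bash
# cpu_breakdown (hypothetical helper): given two snapshots of the aggregate
# "cpu" line from /proc/stat, print the share of CPU time spent in each
# state during the interval between them.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Fields 2..9: user nice system idle iowait irq softirq steal
        NR == 1 { for (i = 2; i <= 9; i++) first[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) { delta[i] = $i - first[i]; total += delta[i] }
            if (total <= 0) { print "no CPU time elapsed between snapshots"; exit 1 }
            printf "us=%.1f ni=%.1f sy=%.1f id=%.1f wa=%.1f\n",
                100 * delta[2] / total, 100 * delta[3] / total,
                100 * delta[4] / total, 100 * delta[5] / total,
                100 * delta[6] / total
        }'
}
```

To try it on a live system, take two snapshots a couple of seconds apart: a=$(head -n 1 /proc/stat); sleep 2; b=$(head -n 1 /proc/stat); cpu_breakdown "$a" "$b".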
vmstat: Virtual Memory Statistics
vmstat prints a system summary every N seconds. Ideal for a quick "health check". The first row reports averages since boot; each later row covers one interval.
vmstat 2
Example output:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 123456  78900 456789    0    0   100   200  123  456 30 10 55  5  0
Decoding:
- r: processes in the run queue. A value consistently above the number of cores indicates a CPU shortage.
- si/so: memory swapped in from and out to disk per second. Sustained non-zero values indicate insufficient RAM.
- us/sy: user and system CPU time. High values (>80%) indicate heavy CPU load.
- wa: CPU time spent waiting for I/O. High wa (e.g., >20%) points to a storage bottleneck.
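The rules of thumb above can be applied to a single vmstat row mechanically. This is a minimal sketch, assuming the 17-column layout shown in the example output; the function name check_vmstat_line is hypothetical, and the thresholds are the ones quoted in this section rather than universal limits.

```shell
#!/bin/bash
# check_vmstat_line (hypothetical helper): apply the rules of thumb to one
# data row of vmstat output. $1 = the row, $2 = number of CPU cores.
check_vmstat_line() {
    echo "$1" | awk -v cores="$2" '
        # Assumed columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
        {
            if ($1 > cores)        print "run queue " $1 " exceeds " cores " core(s)"
            if ($7 > 0 || $8 > 0)  print "swap activity: si=" $7 " so=" $8
            if ($13 + $14 > 80)    print "CPU busy: us+sy=" ($13 + $14) "%"
            if ($16 > 20)          print "high I/O wait: wa=" $16 "%"
        }'
}
```

A possible invocation is check_vmstat_line "$(vmstat 1 2 | tail -n 1)" "$(nproc)", which checks the second row (a one-second sample) rather than the since-boot averages. No output means none of the thresholds was crossed.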
iostat: Disk and CPU Details
Install the sysstat package if you haven't already. Command:
iostat -x 2
Key metrics for disks (Device):
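- %util: percentage of time the device had at least one request in flight. Close to 100% usually means the disk is saturated, although devices that serve requests in parallel (SSDs, RAID arrays) may still have headroom.
- await: average time (in ms) for a request to complete, including time spent in the queue. Recent sysstat versions split it into r_await and w_await. High values (e.g., tens of milliseconds on an SSD) indicate a problem.
- svctm: average service time per operation. Compare with await: if await >> svctm, the queue is large. This field is unreliable and has been removed from recent sysstat releases; where it is missing, use aqu-sz (avgqu-sz in older versions) to judge queue length.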
For CPU: %user, %system, %idle.
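Because the column layout of iostat -x differs between sysstat versions, a script is safer looking up %util by its header name than by position. Here is a minimal sketch under that assumption; busy_disks is a hypothetical name, and the default threshold of 80 is only illustrative.

```shell
#!/bin/bash
# busy_disks (hypothetical helper): read `iostat -x` output on stdin and print
# devices whose %util exceeds the threshold in $1 (default 80).
busy_disks() {
    awk -v limit="${1:-80}" '
        # Header row: remember which column holds %util.
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # A blank line ends the device table; ignore rows until the next header.
        NF == 0 { col = 0; next }
        col && $col + 0 > limit { print $1, $col }
    '
}
```

For example: LC_ALL=C iostat -x 2 3 | busy_disks 80. Setting LC_ALL=C keeps the decimal separator a dot so that the numeric comparison works.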
df and du: Disk Space
Quick check of free space:
df -h
The -h flag prints sizes in human-readable units (K, M, G). Watch the Use% column: above 90%, clean up logs or grow the volume.
To find the largest "space eaters" in a specific folder:
du -sh /var/* | sort -rh | head -10
This shows the 10 largest entries directly under /var.
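The 90% rule can be checked automatically. A minimal sketch, assuming the POSIX output format of df -P (usage in column 5, mount point in column 6); full_filesystems is a hypothetical name, and mount points containing spaces are not handled.

```shell
#!/bin/bash
# full_filesystems (hypothetical helper): read `df -P` output on stdin and
# print mount points whose usage exceeds the threshold in $1 (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR == 1 { next }                 # skip the header row
        {
            used = $5
            sub(/%$/, "", used)          # "93%" -> "93"
            if (used + 0 > limit) print $6, used "%"
        }'
}
```

Usage: df -P | full_filesystems 90.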
ss and netstat: Network Activity
ss is the modern replacement for netstat. Quick view of connections:
ss -tuln
Flags:
- -t: TCP
- -u: UDP
- -l: listening sockets only
- -n: numeric output (no name resolution)
For interface statistics:
ip -s link
Or for detailed network packet stats:
nstat
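A quick way to spot unusual connection patterns is to count sockets per TCP state. The sketch below assumes the output of ss -tan, where the first column is the state and the first line is a header; count_tcp_states is a hypothetical name.

```shell
#!/bin/bash
# count_tcp_states (hypothetical helper): read `ss -tan` output on stdin and
# print how many sockets are in each TCP state, most common first.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ }
         END { for (s in count) print count[s], s }' | sort -rn
}
```

Usage: ss -tan | count_tcp_states.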
Advanced Tools for Deep Analysis
sar: Historical Data Collection
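sar (System Activity Reporter) reads metrics that the sysstat collector records at a fixed interval (10 minutes by default). Data is stored in /var/log/sysstat/ on Debian/Ubuntu or /var/log/sa/ on RHEL/CentOS/Fedora, one file per day (e.g., sa14 for day 14).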
View today's data:
sar -u # CPU
sar -r # Memory
sar -b # I/O
sar -n DEV # Network interfaces
With an interval and a count, sar samples live instead of reading history. Example: sar -u 2 5 reports CPU usage every 2 seconds, 5 times.
Advantage: you can see what happened during a problem, even if you weren't at the terminal.
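To find when a problem happened, the recorded samples can be filtered for low idle time. This is a minimal sketch, assuming the usual sar -u layout in which %idle is the last column; busy_intervals is a hypothetical name, and the sa14 path in the usage line is only an example.

```shell
#!/bin/bash
# busy_intervals (hypothetical helper): read `sar -u` output on stdin and print
# the samples whose %idle (the last column) is below the threshold in $1
# (default 20).
busy_intervals() {
    awk -v limit="${1:-20}" '
        # Skip column headers, the summary row, the banner, and restart markers.
        /%idle/ || /^Average/ || /^Linux/ || /RESTART/ { next }
        NF < 3 { next }          # blank or incomplete lines
        $NF + 0 < limit { print }
    '
}
```

Usage: LC_ALL=C sar -u -f /var/log/sysstat/sa14 | busy_intervals 20.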
nmon: Interactive Monitoring of All Resources
Install nmon (sudo apt install nmon). Run:
nmon
Keys:
- c: CPU
- m: memory
- d: disks
- n: network
- t: top processes
- q: exit
nmon is useful for a quick overview and session recording (.nmon file), which can later be analyzed in Excel or via nmon2csv.
glances: Cross-Platform Monitoring
glances is a Python utility that combines many metrics in one interactive interface. Installation:
pip install glances
# or for the system:
sudo apt install glances
Run: glances. Supports colors, alerts (thresholds), export to JSON, InfluxDB, Elasticsearch.
Graphical and Web Solutions
For long-term monitoring and visualization, use combinations:
- Prometheus + Grafana: metrics collected via exporters (node_exporter) and displayed in Grafana dashboards.
- Netdata: "out-of-the-box" monitoring with a web interface on port 19999. Install: bash <(curl -Ss https://my-netdata.io/kickstart.sh).
- Zabbix/Nagios: for enterprise monitoring with alerts.
Practical Scenarios
Scenario 1: High CPU Load
- Run top or htop.
- Sort by %CPU. Find the process with the highest consumption.
- If it's java, python, or node, check the application logs.
- If it's kworker or migration, the problem might be in the kernel or in IRQ handling.
- Use perf top for profiling (install linux-tools).
Scenario 2: Disk Fully Busy
- iostat -x 2: look at %util and await per disk.
- iotop (install via sudo apt install iotop): shows which process is writing/reading.
- If await is high but %util is low, the problem might be in the network (NFS, iSCSI).
- Check disk queue: cat /proc/diskstats | grep <device>.
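In /proc/diskstats, the queue-related number is the count of I/O requests currently in progress, which is field 12 of each line (after major, minor, and device name). A minimal sketch that extracts it; inflight_io is a hypothetical name, and the optional second argument exists only so the function can be pointed at a saved copy of the file.

```shell
#!/bin/bash
# inflight_io (hypothetical helper): print the number of I/O requests currently
# in progress for device $1, read from a diskstats-format file
# ($2, default /proc/diskstats). Field 12 of each line holds that counter.
inflight_io() {
    awk -v dev="$1" '$3 == dev { print $12 }' "${2:-/proc/diskstats}"
}
```

Usage: inflight_io sda. A value that stays high across repeated calls suggests requests are queuing up on that device.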
Scenario 3: Memory Shortage
- free -h: look at available and swap.
- If swap is actively used (si/so in vmstat > 0), RAM is insufficient.
- ps aux --sort=-%mem | head -n 11: top 10 by memory (the first line is the header).
- Check cache: cat /proc/meminfo | grep -E "Cached|Buffers". A large cache is normal; the OS uses free RAM for it.
- If a process is "eating" memory, look for leaks (e.g., via valgrind for C/C++).
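The available figure shown by free comes from the MemAvailable line in /proc/meminfo. A minimal sketch that turns it into a percentage of total RAM, assuming a kernel recent enough to provide MemAvailable; mem_available_pct is a hypothetical name.

```shell
#!/bin/bash
# mem_available_pct (hypothetical helper): print MemAvailable as a whole-number
# percentage of MemTotal (rounded down), read from a meminfo-format file
# ($1, default /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", 100 * avail / total }' "${1:-/proc/meminfo}"
}
```

Usage: mem_available_pct. The result can be compared against whatever threshold suits the workload.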
Scenario 4: Network Overload
- ip -s link: errors and dropped packets per interface.
- ss -s: summary of sockets (e.g., many TIME-WAIT).
- nethogs (install separately): shows traffic per process.
- iftop: similar to top, but for network.
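The same error and drop counters are available in /proc/net/dev, which is easier to parse in a script. A minimal sketch, assuming the standard layout of that file (two header lines, then one line per interface with receive counters followed by transmit counters); iface_problems is a hypothetical name.

```shell
#!/bin/bash
# iface_problems (hypothetical helper): print interfaces with non-zero error or
# drop counters, read from a file in /proc/net/dev format
# ($1, default /proc/net/dev).
iface_problems() {
    awk 'NR > 2 {
             sub(/:/, " ")       # separate "eth0:" from the first counter
             # After the name: $4=rx_errs $5=rx_drop $12=tx_errs $13=tx_drop
             if ($4 + $5 + $12 + $13 > 0)
                 printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n", $1, $4, $5, $12, $13
         }' "${1:-/proc/net/dev}"
}
```

Usage: iface_problems. The counters are cumulative since boot, so it is the growth between two runs that matters, not the absolute value.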
Automation and Alerting
For regular data collection, enable the sysstat collector, which runs on a schedule (cron or a systemd timer, depending on the distribution):
# Enable data collection (if not running)
sudo systemctl enable sysstat
sudo systemctl start sysstat
The file /etc/default/sysstat (Debian/Ubuntu) or /etc/sysconfig/sysstat (RHEL/CentOS) contains collection parameters (e.g., SA1_OPTIONS="-S XALL" for all metrics). On Debian/Ubuntu, collection may also require ENABLED="true" in that file.
For alerts, use:
- monit: a simple daemon that watches processes, disks, and CPU.
- nagios/zabbix: complex systems with web interfaces.
- Bash/Python scripts that check metrics and send notifications (e.g., via mail or the Telegram API).
Example script to check CPU load:
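#!/bin/bash
# Send an e-mail alert when the 1-minute load average exceeds the core count.
LOAD=$(awk '{print $1}' /proc/loadavg)  # 1-minute load average
THRESHOLD=$(nproc)                      # number of cores
# bc prints 1 when the comparison is true and 0 otherwise
if (( $(echo "$LOAD > $THRESHOLD" | bc -l) )); then
    echo "High load: $LOAD" | mail -s "Alert: CPU load" admin@example.com
fi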
Interpreting Metrics and Prevention
Key Indicators
- CPU: %idle < 20% means overload. For web servers, though, 70-80% utilization can be normal if the run queue stays short.
- Memory: available < 10% of total is an alarm. Watch swap: if it is active, RAM is insufficient.
- Disk: await > 10 ms on an SSD or > 20 ms on an HDD indicates a problem. %util > 80% means the disk is struggling to keep up.
- Network: rising drop/error counters point to overload or a driver problem.
Prevention
- Regularly check logs (/var/log/syslog, dmesg).
- Set up monitoring with thresholds (e.g., CPU > 90% for 5 minutes).
- Limit processes via cgroups (systemd slices, Docker limits).
- Update the kernel and drivers: sometimes problems are fixed in new versions.
- For I/O-intensive tasks, use ionice and nice.
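A threshold such as "CPU > 90% for 5 minutes" fires only when every sample in the window is above the limit, which avoids alerts on short spikes. A minimal sketch of that check, assuming one numeric sample per line on stdin; sustained_over is a hypothetical name, and how the samples are collected is left to the caller.

```shell
#!/bin/bash
# sustained_over (hypothetical helper): succeed (exit status 0) only if the
# last $2 values read from stdin, one per line, are all above the limit $1.
sustained_over() {
    tail -n "$2" | awk -v limit="$1" -v need="$2" '
        $1 + 0 > limit { over++ }
        END { exit !(NR == need && over == need) }'
}
```

With one sample per minute appended to a file, a check such as sustained_over 90 5 < samples.txt && echo "CPU above 90% for 5 minutes" would implement the example threshold; samples.txt is a placeholder name.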
Common Beginner Mistakes
- Looking only at CPU usage in top without considering wa, and so missing I/O problems.
- Treating free in free -m as "free memory" and ignoring cache. Use available instead.
- Ignoring si/so in vmstat: swap kills performance.
- Not setting up alerts, and finding out about a problem only after the server has crashed.
Conclusion
Monitoring is a continuous process. Start with basic utilities (top, vmstat, iostat), then add sar for history and glances/nmon for a comprehensive overview. For production environments, definitely set up graphical dashboards (Grafana) and alerts.
Remember: metrics without context are useless. Know your workload: requests per second, data volume, peak hours. Then anomalies will be visible immediately.