Linux Performance Monitoring: Complete Guide

Introduction / Why This Matters

Linux performance monitoring isn't about complex scripts; it's about quickly understanding what exactly is slowing down the system. Without that, every "the server is slow" report is answered with guesswork. You'll learn how to pinpoint in 60 seconds whether the CPU, memory, disk, or network is the source of the problem. This is a fundamental skill for administering any Linux server, from a home setup to production.

Requirements / Preparation

Before you begin, ensure:

You have SSH access to the server with sudo privileges (some commands require root).
Basic utilities are installed. We'll start by installing sysstat and htop (Step 1).
For a clean test, work in text mode without a graphical shell. On a GNOME/KDE desktop the tools still work; run them in a terminal emulator and keep in mind that the desktop itself adds some background load.

Step 1: Install Basic Monitoring Utilities

In most minimal Linux installations (especially in containers or on servers), convenient tools like htop are not available. The standard set (top, free, df) only provides a general overview. We need detailed data.

# For Ubuntu/Debian
sudo apt update
sudo apt install sysstat htop iftop iotop -y

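# For RHEL/CentOS/AlmaLinux
# htop and iftop are packaged in EPEL; enable that repository first
# (on CentOS/AlmaLinux: sudo yum install epel-release -y)
sudo yum install sysstat htop iftop iotop -y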

# For Fedora
sudo dnf install sysstat htop iftop iotop -y

What we're installing:

sysstat — a suite including iostat (disks), mpstat (CPU cores), sar (history).
htop — an improved interactive process viewer.
iftop — network traffic monitoring by connection.
iotop — I/O activity monitoring per process (requires root).
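The three install variants above can be folded into one helper. This is a minimal sketch, not part of the original guide: the function names detect_pkg_manager and install_command are hypothetical, and the script only prints the command so it can be reviewed before running.

```shell
# Print the first supported package manager found on PATH, or "unknown".
detect_pkg_manager() {
    for pm in apt-get dnf yum; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    echo "unknown"
}

# Print (not run) the install command for the given package manager.
install_command() {
    case "$1" in
        apt-get) echo "sudo apt-get update && sudo apt-get install -y sysstat htop iftop iotop" ;;
        dnf|yum) echo "sudo $1 install -y sysstat htop iftop iotop" ;;
        *)       echo "echo 'No supported package manager found'" ;;
    esac
}

install_command "$(detect_pkg_manager)"
```

dnf is checked before yum because on recent RHEL-family systems yum is typically a compatibility alias for dnf.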

Step 2: Assess Overall CPU and Process Load

The first thing to understand is not 'if it's slow' but what exactly is loading the system. Run:

htop

In htop, focus on the top section:

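CPU bars (CPU1, CPU2...): show the load on each core. If every bar is full, the CPU is saturated. In the default color scheme green is user time and red is kernel time, so mostly red bars mean the time is being spent in the kernel rather than in application code.
Load average (1, 5, 15 min averages): the average number of tasks running or waiting to run; on Linux it also counts tasks in uninterruptible sleep (usually waiting for I/O). Rule of thumb: the load average should not significantly exceed the number of CPU cores. For example, on a 4-core server, values of 4.0, 3.5, 2.0 are normal, while 10.0, 8.0, 6.0 indicate critical overload.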
Process list — sort by %CPU (press F6 -> PERCENT_CPU). A process consistently 'hogging' 80-100% of one core is the likely culprit.

If htop is unavailable, use top:

top

Press 1 to show each core's load. Exit with q.
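The load-average rule of thumb can be checked without an interactive tool. The sketch below reads /proc/loadavg and divides the 1-minute value by the core count; the function names and the 1.0 and 2.0 cutoffs are illustrative assumptions, not fixed limits.

```shell
# Ratio of a load average to the number of CPU cores.
# Arguments: load (e.g. 8.0), cores (e.g. 4).
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Classify the ratio. The cutoffs are assumptions chosen for illustration.
load_verdict() {
    awk -v r="$1" 'BEGIN {
        if (r <= 1.0)      print "ok"
        else if (r <= 2.0) print "busy"
        else               print "overloaded"
    }'
}

if [ -r /proc/loadavg ]; then
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r load1 rest < /proc/loadavg
    cores=$(nproc 2>/dev/null || echo 1)
    ratio=$(load_per_core "$load1" "$cores")
    echo "load1=$load1 cores=$cores ratio=$ratio verdict=$(load_verdict "$ratio")"
fi
```

With the figures from the example above, a load of 10.0 on 4 cores gives a ratio of 2.50, which this sketch labels overloaded.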

Step 3: Analyze Memory and Swap Usage

Even if the CPU is free, the system can 'slow down' due to insufficient RAM and active swap usage.

In htop, check the Mem and Swp lines:

Mem: shows total, used, buffers/cache.
Swp: a non-zero 'used' value means pages have been moved to disk at some point. It becomes a problem when the value keeps growing, because the system is then actively using disk as memory, which is very slow.

For precise numbers:

free -h

Key columns:

used — how much memory is occupied.
available — the most important estimate of memory available for new processes without swapping.
swap used: if greater than 0 and still growing, memory is short and the system is swapping.

Symptom: The application runs but response is 'laggy'. Cause: constant swapping.
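To tell stale swap usage from active swapping, it helps to look at the kernel counters directly. The following sketch computes the share of RAM reported as MemAvailable and prints the cumulative swap-in and swap-out page counters from /proc/vmstat; the function names are hypothetical. If pswpin and pswpout grow between two runs, the system is swapping right now (the si and so columns of vmstat 1 show the same activity per second).

```shell
# Percentage of RAM the kernel estimates is available without swapping.
# Argument: a file in /proc/meminfo format (defaults to /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Cumulative pages swapped in and out since boot.
# Argument: a file in /proc/vmstat format (defaults to /proc/vmstat).
swap_counters() {
    awk '$1 == "pswpin" || $1 == "pswpout" { print $1 "=" $2 }' "${1:-/proc/vmstat}"
}

if [ -r /proc/meminfo ]; then
    echo "MemAvailable: $(mem_available_pct)%"
fi
if [ -r /proc/vmstat ]; then
    swap_counters
fi
```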

Step 4: Check Disk Subsystem Load

Disks (especially HDDs or overloaded SSDs) are a common bottleneck. Use iostat:

iostat -x 1

Key columns in the output (-x for extended):

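%util: percentage of time the device had at least one request in flight. Target: < 70-80%. For a single HDD, 100% means the disk is fully loaded; SSDs, NVMe drives, and RAID arrays serve requests in parallel, so they can show 100% and still have headroom.
await: average time (in milliseconds) for an I/O request to complete, including time spent waiting in the queue. Target: for SSD < 1-5 ms, for HDD < 20-50 ms. High values (100+ ms) indicate a problem. Newer sysstat versions report it separately as r_await (reads) and w_await (writes).
svctm: average service time. The value is unreliable and has been removed from recent sysstat versions; rely on await instead.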

Example output:

Device        r/s      w/s     rkB/s      wkB/s   await   svctm   %util
sda          0.00   150.00      0.00    6144.00    5.20    1.20   18.00
nvme0n1      5.00   200.00   1024.00   40960.00   12.50    0.80   16.40

Here sda (likely an HDD) has an await of 5.2 ms, which is normal. nvme0n1 at 12.5 ms is above the SSD target given above and is worth watching, although its %util is low. If await were 100 ms with %util at 90%, the disk would be overwhelmed.

Tip: If iostat doesn't show the desired devices, specify them explicitly: iostat -x 1 /dev/sda /dev/nvme0n1.
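Scanning iostat output by eye gets tedious on hosts with many devices. The sketch below filters iostat -x text for devices above a utilization or latency limit. The function name flag_busy_disks and the default limits (80% and 50 ms) are assumptions; columns are located by header name because their positions differ between sysstat versions.

```shell
# Read `iostat -x` text on stdin and print devices whose %util or latency
# exceeds the limits. Arguments: %util limit, latency limit in ms.
flag_busy_disks() {
    awk -v util_max="${1:-80}" -v await_max="${2:-50}" '
        # Forget column positions when a CPU section starts.
        $1 == "avg-cpu:" { split("", col); next }
        # Header row: remember the position of every column name.
        $1 ~ /^Device/ {
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("%util" in col) && NF >= col["%util"] {
            util = $(col["%util"]) + 0
            lat = 0
            # Older sysstat prints "await"; newer prints r_await and w_await.
            if ("await" in col) lat = $(col["await"]) + 0
            if ("r_await" in col && $(col["r_await"]) + 0 > lat) lat = $(col["r_await"]) + 0
            if ("w_await" in col && $(col["w_await"]) + 0 > lat) lat = $(col["w_await"]) + 0
            if (util > util_max || lat > await_max)
                printf "%s util=%.1f%% latency=%.1fms\n", $1, util, lat
        }'
}

# Usage (three one-second samples; LC_ALL=C keeps "." as the decimal point):
#   LC_ALL=C iostat -x 1 3 | flag_busy_disks 80 50
```

The first iostat report shows averages since boot, so with this usage the later samples are the ones that reflect current load.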

Step 5: Examine Network Activity and Errors

The network can 'fail' due to channel overload, interface errors, or application issues.

Quick real-time traffic view:

sudo iftop -nP

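-n: don't resolve IPs to names (faster).
-P: show ports.

Inside iftop, press 1, 2, or 3 to sort by the corresponding traffic column (the 2 s, 10 s, and 40 s averages), and t to cycle between showing sent and received traffic together or separately. The connections at the top of the list are the ones loading the channel.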

More detailed interface statistics:

ip -s link show eth0  # or ens3, enp0s3, etc.

In the output, look for:

rx errors / tx errors — number of receive/transmit errors. Non-zero values require checking the cable, switch, driver.
rx dropped / tx dropped — packets dropped by the kernel due to lack of resources (buffers). Growth in these values under high load indicates congestion.
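What matters with these counters is whether they grow, not their absolute value since boot. A minimal sketch for reading them from sysfs and comparing two readings follows; the function names are hypothetical and eth0 is a placeholder interface name.

```shell
# Print error and drop counters for one interface from sysfs.
# Arguments: interface name, optional sysfs root (default /sys/class/net).
nic_counters() {
    iface="$1"
    root="${2:-/sys/class/net}"
    for name in rx_errors tx_errors rx_dropped tx_dropped; do
        file="$root/$iface/statistics/$name"
        if [ -r "$file" ]; then
            echo "$name=$(cat "$file")"
        fi
    done
}

# Difference between two readings of the same counter.
# Arguments: earlier value, later value.
counter_delta() {
    echo $(( $2 - $1 ))
}

# Usage: take two readings some seconds apart and compare them.
#   nic_counters eth0; sleep 10; nic_counters eth0
```

A delta of zero under load suggests the historical errors are old; a steadily growing delta is the case worth investigating.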

Step 6: Collect Historical Data for In-Depth Analysis

If the issue is periodic (e.g., 'slows down every day at 14:00'), you need to look at history. The sysstat data collector (sa1, run periodically by a systemd timer or cron job) handles this.

Check if it's running:
```
sudo systemctl status sysstat
```
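If the unit is active, data is already being collected (sysstat is a one-shot unit, so the status is usually active (exited) rather than active (running)). If it is inactive or failed, enable it: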
```
sudo systemctl enable --now sysstat
```
Viewing archives: Data is stored in binary format in /var/log/sysstat/ on Debian/Ubuntu and in /var/log/sa/ on RHEL-family systems; adjust the paths below accordingly. Read it with the sar command.
- CPU for yesterday:
```
sudo sar -u -f /var/log/sysstat/sa$(date -d yesterday +%d)
```
- Disk averages for today (samples are taken every 10 minutes by default):
```
sudo sar -d -f /var/log/sysstat/sa$(date +%d) | grep -E "DEV|Average"
```
- Memory (header row plus the daily averages; the key columns are kbmemfree, kbmemused, %memused):
```
sudo sar -r -f /var/log/sysstat/sa$(date +%d) | grep -E "kbmemfree|Average"
```
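To adjust the collection interval (e.g., every minute), change the schedule that runs sa1: the cron entry in /etc/cron.d/sysstat, or the OnCalendar setting of sysstat-collect.timer on systemd-based setups. The options in /etc/default/sysstat (Debian/Ubuntu) or /etc/sysconfig/sysstat (RHEL), such as SA1_OPTIONS, control which statistics are collected, not how often.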
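For a problem that recurs at a known time, sar can restrict the report to a time window with -s (start) and -e (end). The sketch below builds the archive path for N days ago and queries CPU usage around 14:00. The function names are hypothetical, the default directory assumes the Debian/Ubuntu layout, and date -d requires GNU date.

```shell
# Path of the sysstat archive for N days ago (0 = today).
# Arguments: days back, archive directory (Debian/Ubuntu default assumed).
sa_file() {
    days="${1:-0}"
    dir="${2:-/var/log/sysstat}"
    echo "$dir/sa$(date -d "$days days ago" +%d)"
}

# CPU usage between two times of day from one archive.
# Arguments: archive file, start time, end time (hh:mm:ss).
sar_cpu_window() {
    sar -u -s "$2" -e "$3" -f "$1"
}

file=$(sa_file 1)
if command -v sar >/dev/null 2>&1 && [ -r "$file" ]; then
    sar_cpu_window "$file" 13:30:00 14:30:00
else
    echo "no readable archive at $file (or sar is not installed)"
fi
```

On RHEL-family systems, pass /var/log/sa as the second argument to sa_file.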

Verification

After completing the steps, you should:

Identify the resource bottleneck: CPU (cores near 100% busy, high load average), Memory (available low, swap active), Disk (high await and %util), Network (growing errors/dropped counters, link saturated).
Find the 'culprit': specific process (htop), operation type (iotop — many writes?), specific network connection (iftop).
Obtain data for further action: e.g., 'Process java with PID 1234 consumes 300% CPU' or 'Disk /dev/nvme0n1 has await of 150 ms at %util 95%'.

If the issue is localized at the application level (e.g., a specific Java process), further diagnosis will depend on it (log analysis, profiling).

Potential Issues

iostat: command not found — the sysstat package is not installed. See Step 1.
Permission denied when running iotop, iftop, or sar: these commands need root or read access to the archive files (iostat itself normally runs as a regular user). Use sudo or log in as root.
Zero values in iostat — the disk might not be in use or the system uses virtual block devices (in containers). Check lsblk and df -h.
iftop doesn't show the interface — specify it explicitly: sudo iftop -i eth0.
No sar data for past days — the sysstat daemon wasn't running earlier. Data is collected only from the time the service was started.
High await with low %util — may indicate issues with the disk controller, driver, or hardware failures. Check dmesg | grep -i error.
top/htop shows 100% CPU but no process with high %CPU: the time may be going to interrupt handling (hi and si in top) or to I/O wait (wa), which is driven by processes in D state (uninterruptible sleep, usually waiting for I/O). Neither is attributed to any process's %CPU. In htop, press F2 -> Display options and enable Detailed CPU time to split the bars into system, I/O wait, IRQ, and soft-IRQ time. For I/O-bound processes, use iotop.
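Processes in D state can also be listed directly with ps. This is a minimal sketch with a hypothetical function name; it assumes the procps version of ps, whose state field is a single character, and it prints only the first word of each command name.

```shell
# Filter `ps -eo state,pid,comm` output (stdin) down to tasks in
# uninterruptible sleep (state D). Prints the PID and command name.
d_state_only() {
    awk 'NR > 1 && $1 == "D" { print $2, $3 }'
}

if command -v ps >/dev/null 2>&1; then
    ps -eo state,pid,comm 2>/dev/null | d_state_only
fi
```

D state is often brief, so an empty result on a single run does not rule it out; repeating the command a few times during the slowdown gives a better picture.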
