What a SMART Error Means
A SMART (Self-Monitoring, Analysis, and Reporting Technology) error is a signal from the built-in self-diagnostic system of your hard disk drive (HDD) or solid-state drive (SSD). It indicates that one or more monitored disk parameters have exceeded acceptable thresholds, which often precedes complete drive failure.
Typical smartctl command output when a problem exists:
SMART overall-health self-assessment test result: FAILED!
Or in the attribute table:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED  RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033  100   100   050    Pre-fail Always   -            10
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always   -            5
In this example, Reallocated_Sector_Ct (reallocated sectors) and Current_Pending_Sector (pending reallocation) have non-zero values, indicating physical damage to the disk surface.
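Checking the attribute table by eye is error-prone, so a small filter helps. The sketch below scans a smartctl attribute table for non-zero raw values among a few commonly critical IDs (5, 187, 197, 198); the helper name and the ID list are illustrative choices, not a smartctl feature.

```shell
# flag_bad_attrs: read a smartctl attribute table on stdin and warn on
# non-zero raw values for commonly critical attribute IDs.
# (Helper name and ID list are illustrative; some drives append extra text
# to RAW_VALUE, so treat the output as a hint, not a verdict.)
flag_bad_attrs() {
    awk '$1 ~ /^(5|187|197|198)$/ && $NF+0 > 0 {
        printf "WARNING: %s (ID %s) raw value = %s\n", $2, $1, $NF
    }'
}
# Usage: sudo smartctl -A /dev/sdX | flag_bad_attrs
```

Attributes such as Power_On_Hours are deliberately skipped: a large raw value there is normal aging, not damage.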
Causes
SMART errors occur due to natural wear or external factors affecting drive reliability:
- Mechanical wear (for HDDs): Wear on the motor, heads, or bearings after several years of active use.
- Electronic components: Failures in the drive controller or memory (particularly relevant for SSDs).
- Overheating: Prolonged operation at high temperatures accelerates media degradation.
- Bad sectors: Appearance of unreadable sectors due to surface damage (HDD) or cell wear (SSD).
- Physical impact: Shocks, vibrations, or power surges during operation.
- Cables and interface: Poor connections (especially for SATA/IDE), leading to CRC errors (UDMA_CRC_Error_Count).
Method 1: Checking Disk Status with smartctl
This is the primary diagnostic method. Ensure the smartmontools utility is installed:
# For Ubuntu/Debian
sudo apt update && sudo apt install smartmontools
# For CentOS/RHEL/Fedora
sudo yum install smartmontools
# or
sudo dnf install smartmontools
- Determine the disk device name:
lsblk
Or for a more detailed list:
sudo fdisk -l
Find your disk (e.g., /dev/sda, /dev/sdb, /dev/nvme0n1). Be cautious: working with the wrong device can lead to data loss.
- Run a full SMART check:
sudo smartctl -a /dev/sdX
Replace sdX with your device (e.g., sda).
- Analyze the output:
- Overall-health: Look for the line SMART overall-health self-assessment test result. If it says FAILED, the disk is critically unhealthy.
- Attribute table: Pay attention to these attributes:
  - Reallocated_Sector_Ct (ID 5): Number of sectors reallocated to spare areas. A non-zero value indicates wear.
  - Current_Pending_Sector (ID 197): Sectors awaiting reallocation. Usually indicates unrecoverable read errors.
  - UDMA_CRC_Error_Count (ID 199): Data integrity errors on the cable. Often resolved by replacing the SATA cable.
  - Power_On_Hours (ID 9): Total power-on time. Compare with the drive's MTBF (mean time between failures).
- Threshold exceeded: If the WHEN_FAILED column shows FAILING_NOW (or In_the_past), this is critical.
💡 Tip: For SSDs, key attributes are Media_Wearout_Indicator (media wear) and Available_Reservd_Space (remaining reserve space).
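The overall-health check lends itself to scripting. The sketch below interprets the self-assessment line from smartctl -H output read on stdin; the helper name check_health and its exit codes are illustrative, not part of smartmontools.

```shell
# check_health: interpret the overall-health line from `smartctl -H` output
# read on stdin. Prints OK / FAILED / UNKNOWN and sets a matching exit code.
# (Illustrative helper, not a smartmontools command.)
check_health() {
    line=$(grep "overall-health self-assessment") || { echo "UNKNOWN"; return 2; }
    case "$line" in
        *PASSED*) echo "OK";      return 0 ;;
        *FAILED*) echo "FAILED";  return 1 ;;
        *)        echo "UNKNOWN"; return 2 ;;
    esac
}
# Usage: sudo smartctl -H /dev/sdX | check_health
```

The exit code makes the helper easy to chain into monitoring or cron jobs.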
Method 2: Running an Extended Self-Test
A short test (short) quickly checks electronics and basic sectors. A long test (long) scans the entire disk surface.
- Start a long test (it can take 1 to 10 hours depending on disk size):
sudo smartctl -t long /dev/sdX
The command will return an estimated completion time.
- Check progress (optional):
sudo smartctl -a /dev/sdX | grep -i "self-test"
Or simply wait for completion.
- After completion, run again:
sudo smartctl -a /dev/sdX
Scroll to the Self-test Log section. Look for entries such as Completed: read failure or any other failure status. Any error in this log is serious cause for drive replacement.
⚠️ Important: The long test stresses the disk. Avoid running it on a system disk during high load if possible.
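Rather than re-running smartctl by hand, you can poll until the test finishes. The predicate below matches the "Self-test routine in progress" wording that smartctl typically prints for ATA drives while a test runs; the helper name and the 5-minute interval in the example are arbitrary choices.

```shell
# selftest_running: succeed while smartctl output (read on stdin) still
# reports a self-test in progress. The matched string is the typical ATA
# smartctl wording; the helper name is illustrative.
selftest_running() {
    grep -q "Self-test routine in progress"
}

# Example polling loop (run manually; /dev/sdX and the interval are placeholders):
# while sudo smartctl -a /dev/sdX | selftest_running; do sleep 300; done
# sudo smartctl -l selftest /dev/sdX   # then review the self-test log
```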
Method 3: Checking System Logs for I/O Errors
Sometimes SMART doesn't report issues promptly, but the Linux kernel logs read/write errors.
- Review the system journal (journalctl):
sudo journalctl -xe | grep -iE "error|fail|ata|scsi|nvme"
Or for a more precise, device-specific search (dmesg logs the short name, e.g. sda, not /dev/sda):
sudo dmesg | grep -i sdX
- Typical messages:
[ 1234.567890] sd 0:0:0:0: [sda] Sense Key : Medium Error [current]
[ 1234.567891] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error
This indicates unrecoverable read errors, often linked to physical damage.
- For NVMe drives, use the nvme-cli utility:
sudo nvme smart-log /dev/nvme0
Watch for critical_warning and media_errors.
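To quantify how often a specific disk hits medium errors, you can count the matching sense-key lines. The helper below is an illustrative sketch: it reads kernel log text on stdin and counts "Medium Error" entries for the disk name you pass (e.g. sda).

```shell
# count_medium_errors: count "Medium Error" sense keys for one disk in
# kernel log text read from stdin. (Illustrative helper; note that
# `grep -c` exits non-zero when the count is 0.)
count_medium_errors() {
    grep -c "\[$1\].*Medium Error"
}
# Usage: sudo dmesg | count_medium_errors sda
```

A rising count between checks means the drive is actively failing reads, even if SMART still reports PASSED.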
Method 4: Backing Up Data and Replacing the Drive
If diagnostics confirm a critical state (FAILED in overall-health, rising Reallocated_Sector_Ct or Current_Pending_Sector values, errors in logs), act immediately:
- Create a full backup of all data from the problematic drive. Use dd, rsync, or backup tools.
# Example: rsync for copying home directories
sudo rsync -avh /home/ /backup/home_backup/
- If the drive is in a RAID array:
  - For software RAID (mdadm), check status: cat /proc/mdstat.
  - Replace the drive in the array following your RAID level's procedure (usually mdadm --manage /dev/md0 --add /dev/sdX).
  - For hardware RAID, use the controller's utilities (e.g., storcli for LSI).
- Physical replacement:
- Shut down the system (if it's the system drive).
- Replace the drive with a new one.
- Restore data from backup or reinstall the system.
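A backup from a failing drive is only useful if it is complete, so verify it before replacing the disk. The sketch below is a minimal helper (the name verify_backup is illustrative) that compares a source tree with its copy using POSIX diff -r.

```shell
# verify_backup: compare a source tree with its backup copy using diff -rq.
# Prints "backup verified" when the trees are identical.
# (Illustrative helper; for very large trees a checksum-based comparison
# may be preferable to diff.)
verify_backup() {
    if diff -rq "$1" "$2" >/dev/null 2>&1; then
        echo "backup verified"
    else
        echo "differences found"
    fi
}
# Example: verify_backup /home/ /backup/home_backup/
```

Keep in mind that files the failing drive could not read at all will be silently absent from the backup, which is exactly what this comparison catches.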
Prevention
To minimize the risk of sudden drive failure:
- Regular SMART checks: Set up a cron job for automatic scans.
# Example crontab entry (daily at 2:00 AM)
0 2 * * * /usr/sbin/smartctl -H /dev/sdX | grep -q "FAILED" && echo "SMART FAIL on /dev/sdX" | mail -s "Disk Alert" admin@example.com
- Monitor temperature: Drives should not operate above 50–60°C. Use hddtemp or smartctl -A.
- Use RAID 1/5/6/10 for critical data. RAID doesn't replace backups but improves fault tolerance.
- Avoid physical stress: Do not move an operating HDD, ensure adequate cooling.
- Update drive firmware if the manufacturer releases reliability improvements.
- For SSDs: Monitor remaining write endurance (Media_Wearout_Indicator). SSDs have a limited number of rewrite cycles.
💡 Tip: For SSDs, also avoid filling the disk beyond 80–90% to leave space for wear-leveling algorithms.
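As an alternative to a hand-rolled cron job, smartmontools ships the smartd daemon, which can run scheduled self-tests and send mail itself. A minimal /etc/smartd.conf sketch (device, schedule, and address are placeholders to adapt):

```
# /etc/smartd.conf - minimal sketch; adjust the device and mail address.
# Monitor /dev/sda: track all attributes (-a), run a short self-test every
# Sunday at 02:00 (-s, format T/MM/DD/d/HH), and mail alerts on failure (-m).
/dev/sda -a -s (S/../../7/02) -m admin@example.com
```

After editing the file, restart the daemon (e.g., sudo systemctl restart smartd) for the changes to take effect.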