
SMART Error in Linux: How to Diagnose and Fix

This article explains what SMART errors mean on Linux and walks through proven solutions, including commands for diagnostics and drive replacement.

Updated on February 16, 2026
15-30 min
Medium
FixPedia Team
Applies to: Ubuntu 20.04+, CentOS 7+, Debian 10+, any Linux with smartmontools

What a SMART Error Means

A SMART (Self-Monitoring, Analysis, and Reporting Technology) error is a signal from the self-diagnostic system built into a hard disk drive (HDD) or solid-state drive (SSD). It indicates that one or more monitored parameters have crossed the manufacturer's thresholds, which often precedes complete drive failure.

Typical smartctl command output when a problem exists:

SMART overall-health self-assessment test result: FAILED!

Or in the attribute table:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       10
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       5

In this example, Reallocated_Sector_Ct (reallocated sectors) and Current_Pending_Sector (sectors pending reallocation) have non-zero raw values in the RAW_VALUE column, which points to physical damage to the disk surface.
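As a rough illustration of reading that table programmatically, the sketch below filters attribute rows for a watch list of IDs and prints those with a non-zero raw value. The function name nonzero_attrs and the default watch list (5 and 197, the two attributes from the example) are assumptions made for this sketch, not part of smartmontools.

```shell
# Hypothetical helper: reads `smartctl -A` style output on stdin and prints
# "ATTRIBUTE_NAME RAW_VALUE" for watched IDs whose raw value is non-zero.
# $1 (optional): space-separated list of attribute IDs to watch.
nonzero_attrs() {
    awk -v ids="${1:-5 197}" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) watch[a[i]] = 1 }
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # column 10 is the first token of RAW_VALUE.
        $1 ~ /^[0-9]+$/ && NF >= 10 && ($1 in watch) && $10 + 0 > 0 { print $2, $10 }
    '
}
```

On a real system this would be fed from the drive, for example: sudo smartctl -A /dev/sdX | nonzero_attrs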

Causes

SMART errors occur due to natural wear or external factors affecting drive reliability:

  1. Mechanical wear (for HDDs): Wear on the motor, heads, or bearings after several years of active use.
  2. Electronic components: Failures in the drive controller or memory (particularly relevant for SSDs).
  3. Overheating: Prolonged operation at high temperatures accelerates media degradation.
  4. Bad sectors: Appearance of unreadable sectors due to surface damage (HDD) or cell wear (SSD).
  5. Physical impact: Shocks, vibrations, or power surges during operation.
  6. Cables and interface: Poor connections (especially for SATA/IDE), leading to CRC errors (UDMA_CRC_Error_Count).

Method 1: Checking Disk Status with smartctl

This is the primary diagnostic method. Ensure the smartmontools utility is installed:

# For Ubuntu/Debian
sudo apt update && sudo apt install smartmontools

# For CentOS/RHEL/Fedora
sudo yum install smartmontools
# or
sudo dnf install smartmontools
  1. Determine the disk device name:
    lsblk
    

    Or for a more detailed list:
    sudo fdisk -l
    

    Find your disk (e.g., /dev/sda, /dev/sdb, /dev/nvme0n1). Be cautious: working with the wrong device can lead to data loss.
  2. Run a full SMART check:
    sudo smartctl -a /dev/sdX
    

    Replace sdX with your device (e.g., sda).
  3. Analyze the output:
    • Overall-health: Look for the line SMART overall-health self-assessment test result. If it says FAILED, the disk is critically unhealthy.
    • Attribute table: Pay attention to these attributes:
      • Reallocated_Sector_Ct (ID 5): Number of sectors reallocated to spare areas. A non-zero value indicates wear.
      • Current_Pending_Sector (ID 197): Sectors awaiting reallocation. Usually indicates unrecoverable read errors.
      • UDMA_CRC_Error_Count (ID 199): Data integrity control errors on the cable. Often resolved by replacing the SATA cable.
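      • Power_On_Hours (ID 9): Total power-on time. Use it to judge the drive's age against its warranty period; MTBF (mean time between failures) is a fleet-wide statistic and does not predict the lifespan of a single drive.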
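    • Threshold exceeded: In the attribute table, if the WHEN_FAILED column shows FAILING_NOW, the attribute is currently at or below its threshold. This is critical, especially for attributes whose TYPE is Pre-fail. In_the_past means the threshold was crossed earlier.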
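
The checks in step 3 can be sketched as a small filter, shown here only as an illustration. The function name smart_findings is hypothetical; it assumes the standard smartctl column layout, where WHEN_FAILED is the ninth column and a dash means the threshold was never crossed.

```shell
# Hypothetical helper: reads `smartctl -H -A` style output on stdin.
# Prints one line per finding and returns 1 if anything was found, 0 otherwise.
smart_findings() {
    awk '
        # Overall health line: anything other than PASSED is reported.
        /overall-health self-assessment test result:/ && $NF != "PASSED" {
            print "health: " $NF; bad = 1
        }
        # Attribute rows: column 9 is WHEN_FAILED; "-" means never failed.
        $1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" {
            print $2 ": " $9; bad = 1
        }
        END { exit bad + 0 }
    '
}
```

A possible invocation on a real drive: sudo smartctl -H -A /dev/sdX | smart_findings || echo "attention needed"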

💡 Tip: For SSDs, key attributes are Media_Wearout_Indicator (media wear) and Available_Reservd_Space (remaining reserve space). Attribute names vary by vendor, so your drive may report wear under a different name.

Method 2: Running an Extended Self-Test

A short test (short) quickly checks electronics and basic sectors. A long test (long) scans the entire disk surface.

  1. Start a long test (can take 1 to 10 hours depending on disk size):
    sudo smartctl -t long /dev/sdX
    

    The command will return an estimated completion time.
  2. Check progress (optional):
    sudo smartctl -c /dev/sdX | grep -A1 "Self-test execution status"
    

    Or simply wait for completion.
  3. After completion, run again:
    sudo smartctl -a /dev/sdX
    

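    Scroll to the self-test log section (or print only that section with sudo smartctl -l selftest /dev/sdX). Look for entries whose status is anything other than Completed without error, such as Completed: read failure. Any failure in the log is a serious reason to replace the drive.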
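
The log check in step 3 could be automated along these lines. This is a sketch: the function name failed_selftests is hypothetical, and the matched status strings ("Completed: ..." for failed runs, "Fatal or unknown error") are written from memory of smartmontools output, so verify them against what your version prints.

```shell
# Hypothetical helper: reads `smartctl -l selftest` style output on stdin and
# prints only the log entries that ended in a failure.
# Exit status follows grep: 0 if at least one failed entry was found, 1 if none.
failed_selftests() {
    # Log entries start with "# <number>". "Completed without error" does not
    # contain "Completed: ", so successful runs are skipped.
    grep -E '^# *[0-9]+ .*(Completed: |Fatal or unknown error)'
}
```

A possible invocation: sudo smartctl -l selftest /dev/sdX | failed_selftests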

⚠️ Important: The long test stresses the disk. Avoid running it on a system disk during high load if possible.

Method 3: Checking System Logs for I/O Errors

Sometimes SMART doesn't report issues promptly, but the Linux kernel logs read/write errors.

  1. Review the system journal (journalctl):
    sudo journalctl -k | grep -iE "error|fail|ata|scsi|nvme"
    

    Or, for a device-specific search (kernel messages use the short device name, e.g. sda, not /dev/sda):
    sudo dmesg | grep -i sdX
    
  2. Typical messages:
    [ 1234.567890] sd 0:0:0:0: [sda] Sense Key : Medium Error [current] 
    [ 1234.567891] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error
    

    This indicates unrecoverable read errors, often linked to physical damage.
  3. For NVMe drives, use the nvme-cli utility:
    sudo nvme smart-log /dev/nvme0
    

    Watch for critical_warning and media_errors.
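
To see which disks the kernel is complaining about, the Medium Error messages shown in step 2 can be counted per device. The sketch below assumes the message format from that example, with the device name in square brackets such as [sda]; the function name medium_errors_by_disk is hypothetical.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "<device> <count>" for every sdX device that logged a Medium Error.
medium_errors_by_disk() {
    awk '
        /Medium Error/ && match($0, /\[sd[a-z]+\]/) {
            # Strip the surrounding brackets from the matched "[sdX]".
            count[substr($0, RSTART + 1, RLENGTH - 2)]++
        }
        END { for (d in count) print d, count[d] }
    ' | sort
}
```

A possible invocation: sudo dmesg | medium_errors_by_disk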

Method 4: Backing Up Data and Replacing the Drive

If diagnostics confirm a critical state (FAILED in overall-health, rising Reallocated_Sector_Ct or Current_Pending_Sector values, errors in logs), act immediately:

  1. Create a full backup of all data from the problematic drive. Use dd, rsync, or backup tools.
    # Example rsync for copying home directories
    sudo rsync -avh /home/ /backup/home_backup/
    
  2. If the drive is in a RAID array:
    • For software RAID (mdadm), check status: cat /proc/mdstat.
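    • Replace the drive in the array following your RAID level's procedure: usually mark the failing member as failed and remove it (mdadm --manage /dev/md0 --fail /dev/sdX --remove /dev/sdX), then add the new drive after the swap (mdadm --manage /dev/md0 --add /dev/sdX).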
    • For hardware RAID, use the controller's utilities (e.g., storcli for LSI).
  3. Physical replacement:
    • Shut down the system (if it's the system drive).
    • Replace the drive with a new one.
    • Restore data from backup or reinstall the system.
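
For software RAID, the status check in step 2 can be narrowed down to arrays that are running degraded. The sketch below assumes the usual /proc/mdstat layout, where each array's status line ends in a map such as [UU] (all members up) or [_U] (one member missing); the function name degraded_arrays is hypothetical.

```shell
# Hypothetical helper: reads /proc/mdstat content on stdin and prints the
# name of every md array whose member map contains a missing slot ("_").
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]".
        /^md[0-9a-z_]* *:/ { name = $1 }
        # Status line with a member map such as [_U] or [U_U].
        name != "" && /\[[U_]*_[U_]*\]/ { print name; name = "" }
    '
}
```

A possible invocation: degraded_arrays < /proc/mdstat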

Prevention

To minimize the risk of sudden drive failure:

  • Regular SMART checks: Set up a cron job that checks drive health automatically and alerts you on failure.
    # Example crontab (daily at 2:00 AM)
    0 2 * * * /usr/sbin/smartctl -H /dev/sdX | grep -q "FAILED" && echo "SMART FAIL on /dev/sdX" | mail -s "Disk Alert" admin@example.com
    
  • Monitor temperature: Drives should not operate above 50–60°C. Check with smartctl -A (attribute Temperature_Celsius) or with hddtemp where it is still packaged.
  • Use RAID 1/5/6/10 for critical data. RAID doesn't replace backups but improves fault tolerance.
  • Avoid physical stress: Do not move an operating HDD, ensure adequate cooling.
  • Update drive firmware if the manufacturer releases reliability improvements.
  • For SSDs: Monitor remaining write endurance (Media_Wearout_Indicator). SSDs have a limited number of rewrite cycles.

💡 Tip: For SSDs, also avoid filling the disk beyond 80–90% to leave space for wear-leveling algorithms.
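
The temperature advice above could be turned into a scripted check along these lines. This is a sketch under assumptions: it reads attribute ID 194, which many but not all drives report as Temperature_Celsius, and the default limit of 55 is simply a value inside the 50–60°C range mentioned above. The function name hot_drive is hypothetical.

```shell
# Hypothetical helper: reads `smartctl -A` style output on stdin.
# $1 (optional): temperature limit in degrees Celsius (default 55, an assumption).
# Prints "ATTRIBUTE_NAME TEMP" and returns 0 if the drive is above the limit;
# returns 1 otherwise.
hot_drive() {
    awk -v limit="${1:-55}" '
        # Column 10 is the first token of RAW_VALUE, e.g. "60" in "60 (Min/Max 20/65)".
        $1 == 194 && $10 + 0 > limit + 0 { print $2, $10; hot = 1 }
        END { exit !hot }
    '
}
```

A possible invocation: sudo smartctl -A /dev/sdX | hot_drive 55 && echo "drive is running hot"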


Hints

Install the smartmontools utility
Identify the disk device name
Run a full SMART diagnostic
Perform a long self-test
Take action on confirmed issues