PCLinuxOS Magazine

Hard Drive Failure - And Recovery


by Phil

I have a typical desktop box running PCLinuxOS. It has two drives: sda is the main workaday drive, and sdb is the backup drive which is brought up maybe once a day via a cron job. Both are about 3 years old and are the same size, both at 500GB. Of note, sdb was partitioned into four separate partitions, on the basis (hope?) that if one partition failed the others would remain functional.

One day, while logging out of a second KDE desktop session back to my main session, I noticed the session had crashed and lots of error messages were scrolling up the screen. The desktop was unusable, so I rebooted using the REISUB technique (Alt+SysRq, then R, E, I, S, U, B).

The machine rebooted to the login screen, but KDE would not start, only getting as far as the globe. After several attempts, I changed the session to LXDE, which started without issue. (Lesson: have another desktop installed.)

Having a non-booting desktop triggered a recollection that this might be a full root (/) partition. I have done that before. So:

df -al
(Alternative is More Applications > Monitoring > KDiskFree)

This said / was 100% full.

The solution is to open a root terminal and then hunt for the problem file(s) with:

du -hsx * | sort -rh | head -10

There were many directories copied over from elsewhere, I am guessing from sda. I do not recall copying all those directories over, and the sources were different partitions. Since I also have a failing, mind-of-its-own mouse, I suspect I accidentally dragged directories across in a root Dolphin session.

I purged all files that should not have been in the root (sda1) partition and logged back into a fully functional KDE session.
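The hunt itself is iterative: run the command, descend into the biggest directory, repeat. A minimal sketch (the helper name is my own, not from the article):

```shell
# List the largest entries under a directory, then re-run it one level
# further down until the space hog is found. -x keeps du from crossing
# into other mounted filesystems; run as root so nothing is skipped.
top_usage() {                  # $1 = directory, $2 = number of entries
    du -hsx "$1"/* 2>/dev/null | sort -rh | head -n "${2:-10}"
}

# Example: top_usage /      - the article's one-liner, run from /
#          top_usage /var   - then drill into the biggest result
```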

From here, I decided to check all my partitions and drives. First I used Dolphin to look at partitions. Drive sda was in the clear. When trying to look at sdb partitions with Dolphin, this error message was displayed:

An error occurred while accessing '122.5 GiB Hard Drive', the system responded:
The requested operation has failed: Error mounting /dev/sdb5 at /media/044df901-45ee-4986-b59c-ecbb6b3ac508:
Command-line `mount -t "ext4" -o "uhelper=udisks2,nodev,nosuid" "/dev/sdb5" "/media/044df901-45ee-4986-b59c-ecbb6b3ac508"' exited with non-zero exit status 32:
mount: wrong fs type, bad option, bad superblock on /dev/sdb5,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try dmesg | tail

I then tried fdisk -l (list, as root), and extracted the information for sdb:

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

I then decided to check and try to repair all partitions with fsck. To do this, all partitions being checked must be unmounted. Therefore I used a "live" LXDE system (from a Live CD or Live USB).

The fsck command is as below. Drive sda was fine, but all partitions in sdb had many inode errors:

fsck /dev/sdN   (checks a whole drive; no changes unless instructed)
fsck /dev/sdNx   (checks a single partition; no changes unless instructed)
fsck -fy /dev/sdNx   (checks and repairs a partition without asking for confirmation, so exercise caution)
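Put together, the live-session pass over the drive looks something like this (a sketch of the idea, run as root from the live system; the partition names are this machine's):

```shell
# Check and repair a list of partitions. Each one must be unmounted
# first - never run fsck on a mounted filesystem.
check_partitions() {
    for part in "$@"; do
        umount "$part" 2>/dev/null   # ignore "not mounted" complaints
        fsck -fy "$part"
    done
}

# Example (makes repairs without asking!):
# check_partitions /dev/sdb1 /dev/sdb5 /dev/sdb6 /dev/sdb7
```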

Normally, fsck takes very little time to check a partition or drive. This happens at boot from time to time, which is why your drives are usually in good condition. Of particular note, two of the sdb partitions took a few minutes to complete, but the other two took well over an hour each, with a vast number of errors. When done:


fdisk -l /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

   Device Boot      Start        End      Blocks   Id  System
/dev/sdb1   *           63  256558049  128278993+  83  Linux
/dev/sdb2        256558050  976768064  360105007+   5  Extended
/dev/sdb5        256558113  513405269  128423578+  83  Linux
/dev/sdb6        513405333  770204294  128399481   83  Linux
/dev/sdb7        770204358  976768064  103281853+  83  Linux

However, the error message "an error has occurred" persisted when trying to look at the partitions with Dolphin.

Next was to run testdisk as root on sdb. It is an excellent utility:

http://www.cgsecurity.org/wiki/TestDisk

This resulted in sdb becoming even more mangled. Fdisk said:

   Device Boot      Start        End      Blocks   Id  System
/dev/sdb1         63125685  976768064  456821190    f  W95 Ext'd (LBA)
/dev/sdb5         63125701   78122817    7498558+  83  Linux

Next, try to repair by using a different superblock. From the forum:

When a filesystem gets corrupted, fsck uses the superblock to repair it. If the main superblock is itself corrupted, there are backup superblocks one can use to do the repair. In a default ext4 filesystem, the superblocks are in these locations.

        Superblock backups stored on blocks:
            32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
            4096000, 7962624, 11239424, 20480000, 23887872

To use these in a filesystem check, the command would be:

fsck -fy -b 32768 /dev/sd<whatever> <Enter>

Replace <whatever> with the actual partition designation, and 32768 with each backup superblock listed, until you find one that isn't corrupted and works to restore the filesystem. Part of the restoration is the repair of any damaged superblocks, including those you may have also tried, but didn't work. You have multiple chances to fix your filesystem, on each affected partition. Don't give up until you've tried them all.
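That advice amounts to a loop over the backup superblocks, which can be sketched like so (the partition name is an example; run as root on the unmounted partition):

```shell
# Try each ext4 backup superblock in turn until one lets fsck repair
# the filesystem. The block numbers are the defaults listed above;
# they can be printed non-destructively with: mke2fs -n /dev/sdb5
try_superblocks() {            # $1 = partition, e.g. /dev/sdb5
    for sb in 32768 98304 163840 229376 294912 819200 884736 \
              1605632 2654208 4096000 7962624 11239424 20480000 23887872; do
        echo "Trying backup superblock $sb"
        if fsck -fy -b "$sb" "$1"; then
            echo "Repaired using superblock $sb"
            return 0
        fi
    done
    echo "All backup superblocks failed" >&2
    return 1
}

# Example: try_superblocks /dev/sdb5
```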

The result was no change. All superblocks were tried.

Next, I tried to use photorec to recover a clonezilla image of a machine which I keep an eye on. Photorec is an excellent utility which will recover files off a broken drive if it can find them, and it can be tweaked to search out particular file types. However, the recovered files are given numeric names, and you have to go through them to rename each one. A painful and long process if you have a lot of files.

I installed SMART tools and associated programmes, and from a MATE session, SMART says sdb is good.

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     11303         -
# 2  Short offline       Completed without error       00%     11302         -

I now gave up on the recovery process, having damaged the filesystem, and tried to partition and format the drive with PCC, GParted and fdisk (with mkfs to format a created partition). The result was one partition which would not mount.


Summary

I have attempted to repair sdb using anything I could think of, with some suggestions from the forum applied. (I am always amazed and grateful for the expertise available on tap at the forum). The net result of my meddling was a broken sdb drive and no recovered files. All data except one directory and clonezilla image files are on the primary drive, so there is a loss but not a great loss. At this stage, I am convinced the drive is broken and needs replacing.


Recovery (Under Remote Control)

From here, I reached out to the forum for help on how to partition a drive from scratch. Under remote control, I applied commands as directed, posted back the results, and awaited orders. I had to sit on my hands and not do anything else, which is very hard to do.

I check the memory for errors using memtest from the grub screen options at boot. The memory is good.

I zero the MBR:

dd if=/dev/zero of=/dev/sdb bs=512 count=1

1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000417889 s, 1.2 MB/s

(Additional note, for information):

"I only wish to add that if the drive was using GUID (GPT) at some point, then it has a second instance of the partition scheme information (at the end of the drive), and this too needs to be wiped."
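Acting on that note takes a little arithmetic: the backup GPT occupies the last 33 sectors of the disk (one header sector plus 32 sectors of partition entries). A sketch, assuming blockdev is available:

```shell
# Zero the backup GPT at the end of the drive. blockdev --getsz
# reports the drive size in 512-byte sectors.
wipe_backup_gpt() {            # $1 = whole-disk device, e.g. /dev/sdb
    sectors=$(blockdev --getsz "$1")
    dd if=/dev/zero of="$1" bs=512 seek=$((sectors - 33)) count=33
}

# Example (destructive!): wipe_backup_gpt /dev/sdb
```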

Zero the drive:

dd if=/dev/zero of=/dev/sdb bs=4096 <Enter>

This takes many hours to complete. Leave it to run.

Open the box, replace the cables (Hardware fault?).

Run fsck -fyc /dev/sdb1

fdisk -l /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x68b42369

   Device Boot      Start        End      Blocks   Id  System
/dev/sdb1             2048  976773167  488385560   83  Linux

fsck -fyc /dev/sdb1
fsck from util-linux 2.22.2
e2fsck 1.42.9 (28-Dec-2013)
Superblock has an invalid journal (inode 8).
Clear? yes

*** ext3 journal has been deleted - filesystem is now ext2 only ***

Resize inode not valid.  Recreate? yes

ext2fs_block_iterate: Ext2 file too big while sanity checking the bad blocks inode

/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****

Try this command:

mke2fs -t ext4 /dev/sdb1       <Enter>

Result:

mke2fs -t ext4 /dev/sdb1
mke2fs 1.42.9 (28-Dec-2013)
Could not stat /dev/sdb1 --- No such file or directory

Fdisk - Device sdb1 no longer exists.

Fdisk following day:

   Device Boot      Start        End      Blocks   Id  System
/dev/sdb1             2048  976773167  488385560   83  Linux

Instruction - To test for hardware fault remove the drive from the case and reattach as a USB device. Run fdisk -l:

   Device Boot      Start        End      Blocks   Id  System
/dev/sdb1             2048  976773167  488385560   83  Linux

Instruction

I want you to try creating a new filesystem on /dev/sdb1, now that it's entirely free of the original computer.

mke2fs -t ext4 /dev/sdb1   <Enter>

Once the process completes, run these commands:

mkdir -p /mnt/here   <Enter>

mount /dev/sdb1 /mnt/here   <Enter>

...and if no errors up to this point:

ls -l /mnt/here   <Enter>

Post your results.

Results:

   Device Boot      Start        End      Blocks   Id  System
/dev/sdb1             2048  976773167  488385560   83  Linux

[root@localhost philip]# mke2fs -t ext4 /dev/sdb1
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
30531584 inodes, 122096390 blocks
6104819 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3727 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Allocating group tables: done                           
Writing inode tables: done                           
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done     


[root@localhost philip]# mkdir -p /mnt/here

[root@localhost philip]# mount /dev/sdb1 /mnt/here
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so

Instruction

Try:

dmesg | tail   <Enter>

Result:

# dmesg | tail
sd 13:0:0:0: [sdb] No Caching mode page found
sd 13:0:0:0: [sdb] Assuming drive cache: write through
sd 13:0:0:0: [sdb] No Caching mode page found
sd 13:0:0:0: [sdb] Assuming drive cache: write through
 sdb: sdb1
sd 13:0:0:0: [sdb] No Caching mode page found
sd 13:0:0:0: [sdb] Assuming drive cache: write through
sd 13:0:0:0: [sdb] Attached SCSI disk
EXT4-fs (sdb1): ext4_check_descriptors: Checksum for group 0 failed (2659!=31464)
EXT4-fs (sdb1): group descriptors corrupted!

Instruction:

Open fdisk, and use the d command to delete the sdb1 partition, then the n command to create a new partition. Accept the primary designation and the partition number but, for the start sector, use 4096. Then accept the default for the last sector. Finish with the w command, then do an fdisk -l again.

Run the commands:

mke2fs -t ext4 -c /dev/sdb1   <Enter>

The -c argument does a bad block check before actually writing the file system.

mount /dev/sdb1 /mnt/here   <Enter>

Starting the partition at a different location on the drive avoids the location of the main superblock that has constantly failed, without losing an appreciable amount of overall space. If that is the only truly bad block causing all your problems, this should work. If the drive is riddled with bad blocks, and continues to fail to create usable filesystems, I'd be looking for a new one.

If the partition actually mounts:

ls -l /mnt/here   <Enter>

Input:

fdisk /dev/sdb

Command (m for help): d
Selected partition 1
Partition 1 is deleted

Command (m for help): p

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x68b42369

   Device Boot      Start         End      Blocks   Id  System

Command (m for help):

Create new partition:

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p):
Using default response p
Partition number (1-4, default 1): 1
First sector (2048-976773167, default 2048): 4096
Last sector, +sectors or +size{K,M,G} (4096-976773167, default 976773167):
Using default value 976773167
Partition 1 of type Linux and of size 465.8 GiB is set

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

fdisk -l /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x68b42369

   Device Boot      Start        End      Blocks   Id  System
/dev/sdb1             4096  976773167  488384536   83  Linux

NOTE - This next part took many hours

mke2fs -t ext4 -c /dev/sdb1

mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
30531584 inodes, 122096134 blocks
6104806 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3727 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group

Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Checking for bad blocks (read-only test): done
Allocating group tables: done                           
Writing inode tables: done                           
Creating journal (32768 blocks): done

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x68b42369

   Device Boot      Start        End      Blocks   Id  System
/dev/sdb1             4096  976773167  488384536   83  Linux

mount /dev/sdb1 /mnt/here

ls -l /mnt/here

total 16
drwx------ 2 root root 16384 Mar 10 00:12 lost+found/

SUCCESS!

Additional information:

In order to be able to write to that partition with the greatest ease while it is mounted on /mnt/here (replace user:user with your own username and group):

chown -R user:user /mnt/here   <Enter>


Synopsis

A hard drive is composed of 512-byte sectors (also called units by fdisk) onto which your data is saved. The sectors are organised into blocks, typically of 4096 bytes, by the onboard drive firmware.

Somewhere between sectors 2048 and 4096, something has been damaged, so that reads and writes to that part of the disk fail due to a bad block. By starting the partition beyond the bad block, the disk now works.
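In concrete terms, the space given up by moving the start sector is tiny:

```shell
# Moving the partition start from sector 2048 to 4096 sacrifices
# (4096 - 2048) sectors of 512 bytes each:
echo "$(( (4096 - 2048) * 512 )) bytes given up"   # 1048576 bytes = 1 MiB
```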


