I have my home directories on a separate drive from the OS, and after upgrading Fedora I went to remount my home directory and ran into problems. First off, the new Fedora had renamed hda as sda, so all the drive naming was off. I had my home directory disk mirrored in RAID 1, but some months ago one of the drives went bad and I dropped it from the array, leaving it plugged in until I had time to deal with it.
So I looked around for my home directory disk and mounted the one that had gone bad. It was working fine when I mounted it, so I didn’t notice until a few days later, when I saw that recently created files were missing and eventually figured it out.
So I could just find the ‘good’ home directory disk and mount it, right? Not so easy. It turned out to have a small ext2 partition and a large LVM partition, which is not the way it should be: it should have a single partition. I mounted the small one, and it threw ‘read past end of disk’ errors. It took quite a while to figure out how to mount and read the LVM partition. LVM is a really, really bad idea. The LVM tools couldn’t find a filesystem on the LVM partition, and after much hair pulling I realized that there really *wasn’t* one, and that the partition shouldn’t exist at all. Because it was LVM, this took about 10X longer than it should have.
Now running with the hypothesis that the home directory disk had picked up a disk error in the partition table (Argg^&@$#@!), I made a backup image of the drive using dd:
dd if=/dev/hdc of=/data/hdc_copy.bin
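A copy is only a useful safety net if it matches the source. Here is a minimal sketch of copying and then verifying an image. It runs against a scratch file rather than a real disk so it can be tried without root; the helper name image_and_verify and the file names are made up for illustration.

```shell
# Hypothetical helper: copy SRC to DEST with dd, then compare byte for byte.
image_and_verify() {
    src=$1
    dest=$2
    # A 1 MiB block size only speeds up the copy; it does not change the result.
    dd if="$src" of="$dest" bs=1048576 2>/dev/null || return 1
    # cmp -s is silent and exits non-zero if the two differ.
    cmp -s "$src" "$dest"
}

# Demo on a 64 KiB scratch file standing in for the disk.
workdir=$(mktemp -d)
dd if=/dev/urandom of="$workdir/fake_disk.bin" bs=512 count=128 2>/dev/null

if image_and_verify "$workdir/fake_disk.bin" "$workdir/fake_disk_copy.bin"; then
    verify_status=ok
else
    verify_status=mismatch
fi
echo "$verify_status"

rm -rf "$workdir"
```

For the real drive the arguments would be /dev/hdc and /data/hdc_copy.bin. This assumes the disk still reads cleanly: by default dd stops at the first read error (conv=noerror tells it to carry on), and the comparison would fail at the same spot.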
Then I attached the image to a loopback device:
losetup /dev/loop0 /data/hdc_copy.bin
I then started working on the image to repair it. Looking around, testdisk seemed promising, so I installed and ran it. Testdisk finds potential partitions on the disk and lets you view the files in them to see whether it has guessed right. After a few tries I found a testdisk partition that contained my home directories. At this point I used testdisk’s copy function to save the most critical recently changed directories, which had no backup. This worked and I was hopeful. Then I had testdisk write the partition it had found to the disk image.
Now running fdisk on /dev/loop0 showed the single partition (/dev/loop0p1) spanning the whole disk, as expected. /dev/loop0 can’t be mounted by itself because it is an image of the whole disk, not a file system, and /dev/loop0p1 isn’t a device in /dev, just a label that fdisk prints. So I had to attach the partition as a second loopback device, using the info from fdisk to find the correct offset:
Disk /dev/loop0: 203.9 GB, 203928109056 bytes
1 heads, 1 sectors/track, 398297088 cylinders, total 398297088 sectors
Units = cylinders of 1 * 512 = 512 bytes
Disk identifier: 0x00000000
      Device Boot      Start         End      Blocks   Id  System
/dev/loop0p1              64   398283327   199141632   83  Linux
So I tried an offset of 64 * 512 = 32768:
losetup -o 32768 /dev/loop1 /dev/loop0
and then ran e2fsck on /dev/loop1. But e2fsck wasn’t happy, and none of the alternate superblocks worked either. Finally I found a reference for doing this that mentioned switching fdisk’s display units to sectors first:
Command (m for help): u
Changing display/entry units to sectors
Command (m for help): p
Disk /dev/loop0: 203.9 GB, 203928109056 bytes
1 heads, 1 sectors/track, 398297088 cylinders, total 398297088 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x00000000
      Device Boot      Start         End      Blocks   Id  System
/dev/loop0p1              63   398283326   199141632   83  Linux
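Ah ha, the sector offset is *really* 63. The first listing was in cylinder units, which fdisk numbers from 1, so both Start and End came out one too high. So the offset is 63 * 512 = 32256 bytes, and after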
losetup -o 32256 /dev/loop1 /dev/loop0
e2fsck now sees the file system and works! BTW sending e2fsck the SIGUSR1 signal makes it show a progress bar:
kill -s SIGUSR1 <e2fsck pid>
and after e2fsck completes I can mount the now good file system:
mount /dev/loop1 /mnt/home_recover
and it works!
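The offset arithmetic is easy to get wrong by hand, so here is a minimal sketch of it, assuming 512-byte sectors as reported by fdisk above. The helper name partition_offset is made up for illustration, and the losetup command is only printed, not run.

```shell
# Hypothetical helper: byte offset of a partition = start sector * sector size.
partition_offset() {
    # $1 = start sector, $2 = sector size in bytes (assumed 512 if omitted)
    echo $(( $1 * ${2:-512} ))
}

# Partition line copied from the sector-unit fdisk listing above.
fdisk_line='/dev/loop0p1 63 398283326 199141632 83 Linux'

# Field 2 is the start sector, unless the Boot column holds a "*".
start_sector=$(printf '%s\n' "$fdisk_line" | awk '{ print ($2 == "*" ? $3 : $2) }')
offset=$(partition_offset "$start_sector" 512)

# Print the command instead of running it, since attaching needs root.
echo "losetup -o $offset /dev/loop1 /dev/loop0"
```

If /dev/loop1 is still attached from an earlier attempt, it has to be released with losetup -d /dev/loop1 before it can be reused. mount can also take the offset directly, in the form mount -o loop,offset=32256 followed by the image and mount point, which avoids the second loop device.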
With a good copy of the home directory filesystem, I was now willing to risk changing the original drive, and ran testdisk and e2fsck on it following the same course. I was able to fix the partition table, clean the filesystem, and mount it! My home directories are all back!