Sunday, November 13, 2011

Crapping all over my raid

I have a recently dead machine with 4 disks in a raid 5 set. And I have a working machine with its own disks and room for more. So I moved the disks from the dead machine to the working one.

When booting, something unremarkable happened: Ubuntu said one of the disks needed checking, and it fsck'ed. For a long time. The machine already had some 2TB disks, so that would take time. A bit later the boot sequence tells me that /boot is broken - would I like a shell to fix it? What?

Get shell. Run "fsck /boot", answer "y" to all. Fsck terminates claiming there are still errors. Run again. Still errors? Have a look in /etc/fstab. Ooooh.

  /dev/sda1 /boot ext2 defaults 0   2


Oh shit. New disk controller on board, new disks. /dev/sda1 isn't the boot partition anymore, it's one of my raid disks. I'm still not really getting the consequences of what has happened. But I look in /dev/disk/by-uuid and see this:

   lrwxrwxrwx 1 root root  10 2011-11-13 15:16 c33afa82-c287-47fb-9b10-aca0524cfbc1 -> ../../sde1

Everything else is LVM. So I edit my fstab to use UUID=c33afa82-c287-47fb-9b10-aca0524cfbc1 for the device name, so that never happens again - on this machine.
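The corrected entry looks something like this - the same fields as before, just with a stable identifier instead of a device name:

    UUID=c33afa82-c287-47fb-9b10-aca0524cfbc1 /boot ext2 defaults 0 2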

Lesson: Always mount by UUID (or lvm device name) because device names change for the simplest of reasons - like the kernel changing the probe order on the PCI bus. Most (all) distributions get this right, but I didn't, and I'm too much of an old timer to have UUID as a knee-jerk reflex when I edit fstab.
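If you don't have the UUIDs handy, blkid will list them; for my boot partition it prints something along these lines:

    blkid /dev/sde1
    /dev/sde1: UUID="c33afa82-c287-47fb-9b10-aca0524cfbc1" TYPE="ext2"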

Get the raid running. All the lvm devices that should be there appear. I had better check the filesystems after this upset. The most important first: /dev/mapper/DiskMd0-media - the filesystem with all the family videos of the kids growing up on it. Fsck shows errors, lots of them. I press "y". After a little while it dawns on me: "hmmm, these would be errors introduced because fsck scribbled all over the disk earlier, right?" Ctrl-C! Fsck reports that the filesystem was modified. So fsck.ext2 has crapped all over one raid disk. Damn. What now? It's a raid 5, so one failed disk is survivable. But by this time there is a graphical login on the console and the VC with the original fstab on it has scrolled long past. And my head is like teflon when it comes to stuff like that, so I don't remember which drive had been crapped all over. Sdb? Sda?
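For the record, listing the members of the array is easy enough - it just doesn't tell you which one fsck scribbled on:

    cat /proc/mdstat
    mdadm --detail /dev/md0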

Angst follows. Lots of reading and re-reading the mdadm man page way too impatiently. Can I remove one drive from the raid and run fsck to check if the filesystem is now consistent - proving that I removed the right drive? And if I removed the wrong drive, can I add it back causing no new problems?

    mdadm /dev/md0 --fail /dev/sda1
    fsck.ext2 -n /dev/mapper/DiskMd0-misc
    ...

Note the -n: it makes fsck NOT modify the filesystem no matter what. Because if it modified the filesystem while the wrong disk was removed, matters would get worse. ... Whew. No errors. Had there been errors, I planned to do

  mdadm /dev/md0 --re-add /dev/sda1

which should put the disk back in the raid with no new problems - provided that I didn't change the raid after it was removed (see the mdadm man page). Instead I could do

  mdadm /dev/md0 --remove /dev/sda1
  mdadm /dev/md0 --add /dev/sda1

And in /proc/mdstat I could see that the raid was rebuilding /dev/sda1, the disk that demonstrably was the one that had been crapped all over because of my fstab stupidity.
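If you want to watch the rebuild tick along without retyping, something like this does it:

    watch -n 5 cat /proc/mdstat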

I have to add that I take more care at work. But not at home, since it's not work. And I do have a copy of the movies of the kids growing up on another disk.

So that's what I used (some of) my Sunday for.

But at least I could recover.