Mar 29, 2013
 

The Linux device mapper provides a “snapshot” capability which makes it possible to cheaply get a copy of a block device by using copy-on-write to store only the modified sections of the device.  Systems like LVM and EVMS use this to provide a temporary copy of a filesystem for backup while other software continues to modify the original.  However, it has some other interesting uses.

I’ve been using the software RAID feature of the Linux kernel for more than a decade on many machines.  A couple of times I have found myself in the unfortunate situation of having a failed software RAID from which I would like to retrieve data.  (Of course one should rely on backups as much as possible, but sometimes it’s still beneficial to get at the very last contents of the disks on a server.)  While doing this, you want to avoid any accidental modification of the original disks (or of copies of their contents, if you’re working from disk image files).  One way to do this is with the device mapper’s snapshot capability.  Using it this way is not well documented, so I thought I’d write up how I did it.

Here’s an overview of the process I used:

  1. Copy the RAID member devices somewhere else before working with them (i.e. to a different disk on a different machine).
  2. Determine which RAID members to use when re-creating the array.  (Usually there are multiple ways to do this.)
  3. Use the device mapper’s snapshot capability to get writable virtual copies of the RAID members.
  4. Recreate the array using the snapshot devices (keeping the originals unmodified).
  5. Check the integrity of the data (which usually involves modifications by utilities such as fsck).
  6. If necessary, repeat trying alternative re-creations (see step #2).

The last time I did this I had a 5-device RAID5 that had two drives ejected for errors before a replacement disk could be added to the system.  (If it had been only one disk it would have been no problem, as that’s the benefit of RAID5.)  With the disks still on-line on the original machine, before doing anything else I copied their contents into image files on another machine. (Obviously you need sufficient space for this, and you might want to use disk partitions rather than image files.)

for dev in sd{a,b,c,d,e}1; do
  # conv=noerror keeps dd going past read errors on a failing disk
  sudo dd if=/dev/$dev conv=noerror \
  | ssh user@otherhost dd of=/tmp/raid_rescue/$dev.img
done
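
If the drives are still readable end-to-end, checksumming both sides is a cheap way to confirm the images arrived intact (it does mean a second pass over failing disks, so it may not always be worth the extra wear):

# On the original machine
sudo sha1sum /dev/sd{a,b,c,d,e}1
# On the host holding the copies
sha1sum /tmp/raid_rescue/*.img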

Over on the other host, I extracted the software RAID super-blocks from the image files.  (There’s information in the super-blocks that you need to recreate an array.)

for img in /tmp/raid_rescue/*.img; do
  /sbin/mdadm --examine $img > $img.sb
done

Then I checked to see when each RAID member was last updated:

% grep "Update Time" /tmp/raid_rescue/*.img.sb
/tmp/raid_rescue/sda1.img.sb:    Update Time : Tue Mar  5 11:16:49 2013
/tmp/raid_rescue/sdb1.img.sb:    Update Time : Tue Mar  5 11:16:49 2013
/tmp/raid_rescue/sdc1.img.sb:    Update Time : Sun Mar  3 16:45:29 2013
/tmp/raid_rescue/sdd1.img.sb:    Update Time : Tue Mar  5 11:16:49 2013
/tmp/raid_rescue/sde1.img.sb:    Update Time : Mon Mar  4 07:28:08 2013

From this I can see that three were current when I made the copy (sda1, sdb1, and sdd1) and two others were removed not long before (sdc1 and sde1). Since sdc1 was ejected before sde1 and I only need 4 of the 5 devices to recreate this array, I’m going to do so with sde1 but without sdc1. (If that goes badly I could try again with sdc1 and without sde1.)
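
The Events counter in each super-block tells a similar story (a member stops being updated once it is ejected, so the earliest failures have the lowest counts), which can help if the update times alone are ambiguous:

% grep Events /tmp/raid_rescue/*.img.sb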

Next I need to see where each of these devices belongs in the array:

% grep "Device Role" /tmp/raid_rescue/*.img.sb
/tmp/raid_rescue/sda1.img.sb:   Device Role : Active device 1
/tmp/raid_rescue/sdb1.img.sb:   Device Role : Active device 2
/tmp/raid_rescue/sdc1.img.sb:   Device Role : Active device 3
/tmp/raid_rescue/sdd1.img.sb:   Device Role : Active device 4
/tmp/raid_rescue/sde1.img.sb:   Device Role : Active device 0

From this I can see the order of the devices in the failed array, which I’ll need when I recreate it:

  1. sde1
  2. sda1
  3. sdb1
  4. missing (leaving out sdc1)
  5. sdd1

When creating a snapshot with the device mapper, both the original and the storage used for copy-on-write (CoW) must be block devices.  I’ll use sparse files for the copy-on-write storage and then set up loop devices for the images and the copy-on-write areas. I’ll capture the created loop device names in small files so I can use them later.

for img in /tmp/raid_rescue/*.img; do
  # Write one byte at a 1 GiB offset to create a sparse file for the CoW data
  dd if=/dev/zero of=$img-cow bs=1 seek=1G count=1
  # Attach loop devices and record their names for later steps
  /sbin/losetup -f --show $img-cow > $img-cow-loop
  /sbin/losetup -f --show $img > $img-loop
done

In case that’s not clear, here’s what I’ve set up so far:

[Diagram: Loop Devices Created]
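
If it’s not obvious which loop device ended up where, losetup can list every attached loop device along with its backing file, and the saved names can be checked directly:

# Show all attached loop devices and the files backing them
/sbin/losetup -a
# Or look at one of the saved mappings
cat /tmp/raid_rescue/sda1.img-loop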

Next comes the tricky (and less well documented) part: creating the snapshot devices.  I’ll create two device mapper devices for each of my image files: one of type “snapshot-origin” and one of type “snapshot”.  This is a little complicated, but it allows modifications to either the original or the snapshot.  (Writes to the snapshot device go directly into the copy-on-write device, and writes to the snapshot-origin device cause the original block to be copied into the CoW device first.)  I don’t plan to modify the original, but (as I understand it) I still have to create the snapshot-origin device for each of my partition images.

for dev in sd{a,b,c,d,e}1; do
  img="/tmp/raid_rescue/$dev.img"
  # Get the saved loop device names
  loop=`cat $img-loop`
  cow_loop=`cat $img-cow-loop`
  # Determine the size of the original in 512-byte sectors
  blocks=`blockdev --getsize $loop`
  # Create the two device mapper devices
  # ("p 128" = persistent CoW store, 128-sector chunks)
  echo 0 $blocks snapshot-origin $loop | dmsetup create rr-$dev-orig
  echo 0 $blocks snapshot $loop $cow_loop p 128 | dmsetup create rr-$dev-tmp
done

I now have another layer of virtual devices:

[Diagram: Snapshot Devices Created]
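
Before going further it’s worth asking the device mapper what it actually created; dmsetup can list the new devices and show the table behind any one of them:

# List the device mapper devices created above
dmsetup ls | grep rr-
# Show the table for one snapshot (origin, CoW device, persistence flag, chunk size)
dmsetup table rr-sda1-tmp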

Now that I have the snapshot devices, I need to erase the RAID super-blocks on them. (This change will of course only go into the sparse CoW files.)

for dev in sd{a,b,c,d,e}1; do
  /sbin/mdadm --zero-superblock /dev/mapper/rr-$dev-tmp
done

Before I try to recreate the array, there are two more pieces of information I need from the old super-blocks so I can create it the same way: the metadata version and the chunk size the old array was using. (These are the same in every member’s super-block, so I only need to check one.)

% grep -E "Chunk|Version"  /tmp/raid_rescue/sde1.img.sb
        Version : 1.2
     Chunk Size : 64K

Finally, I can create a new array using my snapshot devices. Passing “missing” for a device means that there is no device for that role in the array (since I’m leaving sdc1 out). Passing “--assume-clean” is a good idea when trying to recreate an array like this, as it will prevent any attempt at a rebuild (though since I’m using 4 devices in a 5-device array it should be superfluous).

# mdadm --create /dev/md/rr \
  --level=5 --raid-devices=5 --chunk=64 \
  --assume-clean --metadata=1.2 \
  /dev/mapper/rr-sde1-tmp \
  /dev/mapper/rr-sda1-tmp \
  /dev/mapper/rr-sdb1-tmp \
  missing \
  /dev/mapper/rr-sdd1-tmp
mdadm: array /dev/md/rr started.

At this point /dev/md/rr is my re-created array:

[Diagram: Recreated RAID Array]
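
It’s a good idea to confirm that the kernel sees the array the way I intended before using it; the usual status checks work on the recreated array just like any other:

# Overall md status (should show 4 of 5 members active)
cat /proc/mdstat
# Detailed view of the recreated array
mdadm --detail /dev/md/rr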

The two times that I’ve done this, the array was the sole physical volume in an LVM volume group. Once the array was recreated, the volume group was detected automatically, so I just needed to activate it:

# sudo vgchange -a y RaidVG
  22 logical volume(s) in volume group "RaidVG" now active

I didn’t really care about a few of those volumes (like /tmp or swap space).  I went through and ran fsck on all the filesystems with important data, plus any that might have useful information (e.g. configuration files). Only one of the 22 (the root filesystem for a virtual machine) came back with significant errors from fsck, and even that wasn’t a total loss (most of the configuration files it contained were still there). So overall it was quite a successful data recovery effort.
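
As a sketch of that checking pass (the logical volume name below is just a placeholder for whatever lvs reports; only the volume group name comes from the output above):

# List the logical volumes in the recreated volume group
sudo lvs RaidVG
# Dry run first: report problems without changing anything (LV name is a placeholder)
sudo fsck -n /dev/RaidVG/some_filesystem
# If the damage looks manageable, let fsck repair it for real
sudo fsck /dev/RaidVG/some_filesystem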

After doing all this, the copy-on-write sparse files only took up a minimal amount of space (relative to the multi-GB device image files):

% du -shc sd?1.img-cow
34M     sda1.img-cow
34M     sdb1.img-cow
33M     sdd1.img-cow
34M     sde1.img-cow
133M    total

Of course, doing this by hand as I’ve explained above is rather tedious, not to mention error-prone. I wouldn’t want to do it quite like that, so I wrote a Perl script to do most of it for me. It might be a useful starting point if you need to do something similar.

Here are some references for further reading on using the Linux device mapper: