I’ve been fixing a friend’s iBook G4 this week. The computer wouldn’t boot up—it couldn’t find anything to boot from—and emitted an alarmingly loud noise. I suspected a dead hard disk; by booting from a Linux CD, I was able to prove this. The computer worked fine, but the disk didn’t. I ordered a new hard disk, which arrived a few days later.

Replacing the hard disk was a convoluted process requiring time, care, and an obsessive attention to detail, but I managed it without any parts left over, and reinstalled the software. And that was the easy part done.

Unfortunately, my friend had no backups. The documents and, more importantly, photos, existed only on the defunct hard drive. Could I get them back? I rigged it up to a USB to ATA adapter, plugged it into my computer (running Linux, of course), and waited.

watch ls /dev/sd*

I waited for the disk to appear as a device node. After a few minutes of alarming grinding noises, it did so, giving me sdc and sdc1 to sdc4. In my experience, this is normal when Linux reads a disk with an Apple Partition Map: three of the partitions are housekeeping of some description, while one holds the actual HFS+ data.

I tried mounting the partitions in turn, something like this:

sudo mount -t hfsplus -o ro /dev/sdc4 /media/recovery

However, none of the mounts worked: the driver couldn’t find a superblock. I wasn’t really surprised, as I’d already established that the disk wouldn’t mount properly, but it was worth a try.

My priority now was to recover as much off the disk as I could, while it would still yield up something.

sudo dd if=/dev/sdc of=recovery.img conv=noerror

The conv=noerror option tells dd to continue reading after read errors. This was what I wanted: to read everything that could be read.
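To make the idea of carrying on after read errors concrete, here is a minimal Ruby sketch. It is not what dd does internally, and the name tolerant_copy and its parameters are made up for illustration. One design choice is my own assumption: it fills each unreadable block with zeros so that later offsets stay aligned, which is roughly what adding sync to the conv options is meant to achieve.

```ruby
# Copy `size` bytes from `src` to `dst` in fixed-size blocks, carrying on
# after read errors. Unreadable blocks are written out as zeros so that
# everything after them keeps its original offset.
# Returns the byte offsets of the blocks that could not be read.
def tolerant_copy(src, dst, size:, block_size: 512)
  bad_offsets = []
  offset = 0
  while offset < size
    want = [block_size, size - offset].min
    begin
      src.seek(offset)
      chunk = src.read(want)
    rescue SystemCallError            # e.g. Errno::EIO from a failing disk
      bad_offsets << offset
      chunk = "\0" * want             # placeholder for the lost block
    end
    break if chunk.nil? || chunk.empty?   # source ended earlier than expected
    dst.write(chunk)
    offset += chunk.bytesize
  end
  bad_offsets
end
```

In practice src would be the device opened in binary mode and dst the image file. For a real recovery I would still reach for dd or a dedicated tool rather than a script like this.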

In another shell, I watched the image file slowly grow:

watch ls -l recovery.img

This was Thursday evening.

On Friday morning, 7 of 60 GB had been copied. It was slow, but at least it was working. Looking at the output of dd showed that it had not been able to read some sectors in the first 490 MB. That was fine, though: that space would be taken by the operating system, rather than irreplaceable user data. I left it and went to work, from where I kept an eye on the progress via SSH.

By Friday evening, it was up to 48 GB. That was positively rapid! I looked forward to it being finished by Saturday morning.

But it wasn’t finished. It was only up to 51 GB. There wasn’t much progress between Saturday and Sunday morning, either: it had reached only 52.8 GB, and the transfer rate was down to tens of kilobytes per second. I decided to stop the copy, bearing in mind two facts:

  1. I knew that the disk wasn’t full, so the user data was most likely to be within the first 52 GB;
  2. I could always come back and try to recover the last section later.

I moved on to the next step: extracting as much as possible from the raw image. Having established that I couldn’t just mount the disk and read the files (well, not without very expensive recovery software), I took a different tack. I’d use a forensic tool to extract all the Microsoft Office documents and JPEG images from the disk. This approach had a couple of drawbacks:

  1. They wouldn’t have filenames;
  2. Old versions would be trawled up as well as newer ones.

In the case of JPEG files, differing versions weren’t a problem, because my friend hadn’t been editing them. For Word documents, though, they were. Fortunately, both file types have metadata within the file. The EXIF data in a digital camera JPEG file would tell me when and with what the picture was taken; the metadata in Office files includes the date they were updated and the title.

The tool I used was foremost, whose manual states:

Original Code written by Special Agent Kris Kendall and Special Agent Jesse Kornblum of the United States Air Force Office of Special Investigations.

And:

Because Foremost could be used to obtain evidence for criminal prosecutions, we take all bug reports very seriously. Any bug that jeopardizes the forensic integrity of this program could have serious consequenses. When submitting a bug report, please include a description of the problem, how you found it, and your contact information.

Serious software, in other words. And it’s free:

This program is a work of the US Government. In accordance with 17 USC 105, copyright protection is not available for any work of the US Government.

It’s also extremely easy to use. I wanted to extract JPEG images and Office documents:

foremost -t jpg -t ole -i recovery.img

It took less than half an hour to spit out all the files it had found, which it sorted into directories by file type.
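For anyone wondering how files can be pulled out of a raw image without a working filesystem, the basic trick is to scan for the byte patterns that mark the start and end of a known file type. The Ruby sketch below shows that idea for JPEG only. It is a toy, not a description of foremost's actual code: carve_jpegs and max_size are hypothetical names, it works on an in-memory buffer rather than a 52 GB image, and it will cut a photo short if an embedded thumbnail's end marker comes first.

```ruby
JPEG_START = "\xFF\xD8\xFF".b   # start-of-image marker plus the next marker's first byte
JPEG_END   = "\xFF\xD9".b       # end-of-image marker

# Return [offset, bytes] pairs for every span in `data` that begins with a
# JPEG start marker and runs to the next end marker.
def carve_jpegs(data, max_size: 20 * 1024 * 1024)
  data = data.b                 # treat the input as raw bytes
  found = []
  pos = 0
  while (start = data.index(JPEG_START, pos))
    finish = data.index(JPEG_END, start + JPEG_START.bytesize)
    break if finish.nil?        # no end marker left, so nothing more to carve
    length = finish + JPEG_END.bytesize - start
    if length <= max_size
      found << [start, data.byteslice(start, length)]
      pos = finish + JPEG_END.bytesize
    else
      pos = start + 1           # implausibly large: skip this start marker
    end
  end
  found
end
```

Because the filesystem is never consulted, files recovered this way have no names, and deleted or superseded copies turn up alongside current ones.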

The jpg directory included a lot of noise: every image written temporarily to disk by the web browser was in there, as was every picture from the Kōjien dictionary.

I used a script wrapped around the output of gm identify to split the images out into those that were greater than 1000 pixels on both sides (likely digital camera photos) and those that weren’t. In this case, it worked perfectly. People who browse a lot of high-resolution porn might have less clearly delineated results, I guess …

I browsed the images in gThumb and manually deleted any that it wasn’t able to thumbnail: these were truncated or otherwise damaged.

I then used checksums and the output of exif to eliminate duplicate images, and to sort them into directories by the date they were taken.
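The two halves of that step can be sketched in Ruby as below. Again, this is an illustration under assumptions, not my original script: the function names are hypothetical, SHA-256 is simply one reasonable choice of checksum, and the EXIF timestamp is taken as an already extracted string in the usual "YYYY:MM:DD HH:MM:SS" form rather than read from the exif tool here.

```ruby
require "digest"

# Keep the first path seen for each distinct file content.
def unique_by_checksum(paths)
  seen = {}
  paths.each do |path|
    digest = Digest::SHA256.file(path).hexdigest
    seen[digest] ||= path
  end
  seen.values
end

# Turn an EXIF timestamp such as "2005:08:14 15:32:10" into a directory
# name such as "2005-08-14"; anything else goes to "undated".
def date_directory(exif_datetime)
  match = exif_datetime.to_s.match(/\A(\d{4}):(\d{2}):(\d{2})/)
  match ? match[1..3].join("-") : "undated"
end
```

Checksums only catch byte-for-byte duplicates. Two carved copies of the same photo that were truncated at different points will have different checksums, so some duplicates still have to be caught by eye.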

For the Office documents, I used wvSummary to extract the title and date from each file and rename them accordingly. I manually deleted older versions of the same file and templates from Office itself.
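The renaming itself needs only a little care, because a document title can contain characters that are awkward in a filename. Here is a small Ruby sketch of that part, assuming the title and date have already been pulled out of the wvSummary output (I am not reproducing its output format here). The name document_filename and the date-first layout are illustrative choices, not a record of what my script produced.

```ruby
# Build a filesystem-safe name such as "2006-03-01 Holiday plans.doc"
# from a document's title and date. Either value may be nil or empty.
def document_filename(title, date, ext: ".doc")
  safe_title = title.to_s
                    .gsub(/[[:cntrl:]]/, "")   # drop control characters
                    .gsub(/[\/\\:]/, "-")      # replace path separators and colons
                    .strip
  safe_title = "untitled" if safe_title.empty?
  [date.to_s.strip, safe_title].reject(&:empty?).join(" ") + ext
end
```

Putting the date first means that versions of the same document sort together in date order, which makes it easier to spot the older ones for deletion.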

All told, it wasn’t particularly difficult or time-consuming, but it was extremely satisfying to snatch sentimentally important data back from the jaws of entropy. There’s a Promethean joy in getting one over on the gods.

Besides Ruby (for scripting) and dd (for copying the disk image), the additional Ubuntu packages that I used can be installed thus:

sudo apt-get install foremost graphicsmagick wv exif