Two weeks ago, on a Friday afternoon, I received an emergency call from a friend. Her sister had stored all of her university work on an external hard drive. The rest of the story is quite predictable from this point on. Of course the data was stored nowhere else (this drive SHOULD have been only a backup drive but grew into the main working device), and suddenly, from one second to the next, the drive was no longer accessible. Furthermore, the drive emitted a clicking sound every few seconds. Several attempts to get the drive running again had failed. My friend had already checked the most obvious causes, like dirty contacts or loose connections, without success, so they gave me a call because they already suspected a head crash.
When the drive arrived in my hands I had the following situation: an external, almost brand-new, flat-lying 1TB hard disk with USB and eSATA connectors. To limit further damage while the drive was running, I immediately fixed it in a vertical position using a stand from another external disk I had lying around. My reasoning was that, if the platter surfaces had indeed taken some damage, this would reduce the bouncing of particles (because some of them would collect at the bottom of the enclosure) and slow down the degradation to some degree. A very quick test from within Windows showed that the drive tried to register with the system but failed to do so. So no more tinkering there; I quickly started up a Linux system for recovery.
I started up SystemRescueCd, which I had installed on a USB stick (using SARDU) for situations like this. Connecting the drive via eSATA failed because the drive didn't show up in /dev, so I had to fall back to the slower USB connection for all following steps. Over USB it took a while (about 30-60s) until the device showed up in /dev, but then it was reasonably accessible. The first thing I checked was the SMART info using
smartctl -a /dev/sdd
where it became pretty obvious that the drive was badly damaged: about 100 reallocated sectors and a handful of pending reallocations. Very strong signs of a head crash indeed, so there was no time to waste; I had to get as much data off the disk as possible.
Trying to mount the disk failed, so I could not just copy the files off but had to make a complete image first and work with that later on, without the failing drive. At this point another problem struck: I had nothing around where I could store a 1TB image file. At most I could free up 600GiB on a Linux drive.
I had to make another call to find out that there should be an NTFS filesystem on it with about 200GiB of data. The drive was relatively new and had not seen much activity beyond storing and occasionally updating the files. So I hoped for a lot of uninitialized areas which would be easily compressible. A quick check with
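For repeated checks during a rescue, the two counters that matter most here can be filtered out of the SMART attribute table. The following is only a minimal sketch: it assumes the usual column layout of `smartctl -A`, with the attribute name in the second column and the raw value in the tenth, and the helper name smart_sector_counts is made up for illustration.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# values of the reallocated and pending sector counters.
# Assumes the raw value is the 10th whitespace-separated column.
smart_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
        print $2 "=" $10
    }'
}

# Usage (needs root and the real device):
#   smartctl -A /dev/sdd | smart_sector_counts
```

Not every drive reports both attributes under these names, so an empty result does not mean the disk is healthy.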
confirmed my speculation: there were large zeroed-out areas at the end of the disk. This confirmation took a while because I seemed to hit defective areas right at the beginning of the disk, where the tool stalled until the read-error timeout kicked in.
The Linux filesystem ext3 supports sparse files, which do not allocate disk blocks for unwritten (zeroed) areas of a file, so I hoped that the 1TB image file would still fit into my 600GiB of free space.
A simple copy of /dev/sdd (with cp or dd) would fail because of the errors on the disk; luckily there are tools available which save the readable areas and then try to recover the failing ones. I chose ddrescue for this job because it has a built-in switch for creating sparse target images, which saved me from manually creating one. I roughly followed the instructions from the Forensics wiki and made a first pass over the disk without retrying failing sectors to save as much of the intact data as possible.
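One way to do a check like this (a sketch, not necessarily the tool used above) is to compare a region of the device against /dev/zero with GNU cmp. The function name is_zeroed and its arguments are made up for illustration.

```shell
# Hypothetical helper: succeed if COUNT bytes starting at byte OFFSET of FILE
# are all zero. Relies on GNU cmp: -n limits the compared length, -i skips
# OFFSET bytes in the first file and 0 bytes in /dev/zero, -s keeps it quiet.
is_zeroed() {
    file=$1 offset=$2 count=$3
    cmp -s -n "$count" -i "$offset:0" "$file" /dev/zero
}

# Usage: check 1MiB located 900GiB into the device (needs root):
#   is_zeroed /dev/sdd $((900 * 1024 * 1024 * 1024)) $((1024 * 1024)) && echo zeroed
```

Probing a few regions near the end of the disk keeps the number of reads on the failing drive small; a read error in the probed region makes the check fail as well.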
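The effect is easy to try out on a small scale. This sketch assumes GNU coreutils (truncate and stat) and a filesystem with sparse file support; it creates a 100MiB file without writing any data and compares its apparent size with the space actually allocated.

```shell
# Create a 100MiB file that consists of a single hole (no data written).
img=$(mktemp)
truncate -s 100M "$img"

# Apparent size in bytes vs. space actually allocated on disk
# (GNU stat: %s = size, %b = allocated blocks, %B = bytes per block).
apparent=$(stat -c %s "$img")
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))
echo "apparent=$apparent allocated=$allocated"

rm -f "$img"
```

The same comparison on the finished image tells how much of the free space the rescue really consumed.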
ddrescue -d -S -n /dev/sdd disksddsparse logfile
This first run took quite a few hours because it had to transfer 1TB over USB at 30MB/sec (at best; almost zero when hitting defective sectors). Thanks to the logfile (the last parameter) I was able to interrupt the process overnight, as I didn't want to let it run unattended for too long. During the copy I checked the SMART info from time to time in a second terminal, which showed me that either the disk was degrading by the minute or the disk logic was only now counting previously undetected errors. But the longer the initial rescue ran, the larger the intervals between the errors became, which raised my hopes. In the end the first run finished with the full 1TB image stored on my disk (which took only ~250GiB because of the sparse option), with about 130MiB of errors scattered across ~1100 locations. Not that bad, but there was surely more to gain, so on to the second run.
In this second run I started ddrescue in a mode where it takes a closer look at the defective spots on the disk and tries to narrow down the exact location of the error within each error area, to get out all bytes which are not really affected. These steps are called splitting and trimming of the defects.
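ddrescue prints these totals itself, but they can also be derived from the logfile. The sketch below assumes the documented logfile layout: comment lines starting with #, one status line, and then `pos size status` lines with hexadecimal values, where a status of `-` marks bad sectors. The helper name summarize_bad is made up for illustration.

```shell
# Hypothetical helper: print "<areas> <bytes>" for all blocks marked as bad
# sectors ('-') in a ddrescue logfile. The body runs in a subshell so its
# variables do not leak into the caller.
summarize_bad() (
    areas=0 bytes=0 seen_status=0
    while read -r pos size status; do
        case "$pos" in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        if [ "$seen_status" -eq 0 ]; then
            seen_status=1                 # first data line is the status line
            continue
        fi
        if [ "$status" = "-" ]; then
            areas=$(( areas + 1 ))
            bytes=$(( bytes + $size ))    # size is hexadecimal, e.g. 0x200
        fi
    done < "$1"
    echo "$areas $bytes"
)

# Usage: summarize_bad logfile
```

Running this between passes shows how much each pass actually recovered.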
ddrescue -d -S /dev/sdd disksddsparse logfile
This repair run finished faster because it only checked the errors; nevertheless it still took some hours. It was quite successful, as it lowered the number of error locations to ~904 and the affected data area to 512KiB. Wow. I wondered if there was more to squeeze out. So next: retrim the error areas and retry them without a retry limit.
ddrescue -d -S --retrim --max-retries=-1 /dev/sdd disksddsparse logfile
Again I let this run for some hours, and when it seemed to make only minimal progress (around the 5th automatic retry) it was down to 859 errors summing up to 490KiB. So, finally, the outcome of the rescue operation looked quite promising. Just for the curious: the smartctl statistics were far beyond good and evil, with about 900 reallocated sectors and 1300 pending. And big fat letters telling me "FAILING NOW"...
The last step now was to mount the partition within the disk-image. I found out the offset for the partition mount by comparing the outputs of the following hexedits and finding the second one in the first one (luckily Linux could detect the partition itself).
If this hadn't been possible, I would have calculated the partition offset using one of the guides on the internet, here (German) or here.
After that I could mount the partition using...
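For a classic MBR partition table such a calculation is short. The sketch below rests on several assumptions: the image uses an MBR (not GPT), the sector size is 512 bytes, the partition of interest is the first table entry (its start LBA is a 32-bit little-endian value at byte 454), and the machine running it is little-endian. The function name partition_offset is made up for illustration.

```shell
# Hypothetical helper: print the byte offset of the first MBR partition.
# $1 = image file, $2 = sector size in bytes (defaults to 512).
partition_offset() {
    sector_size=${2:-512}
    # First partition entry starts at byte 446; its start LBA is 8 bytes in.
    start_lba=$(od -A n -t u4 -j 454 -N 4 "$1" | tr -d ' ')
    echo $(( start_lba * sector_size ))
}

# Usage: printf '0x%x\n' "$(partition_offset disksddsparse)"
```

A partition starting at sector 63, the traditional default for older partitioning tools, gives 63 * 512 = 32256 bytes, which is 0x7e00 in hexadecimal.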
mount disksddsparse /mnt/image -o ro,loop,offset=0x7e00
... and began to copy the files out of the partition. There were some filename encoding issues and warnings during the copy, which were finally resolved by mounting with an explicitly specified charset.
mount disksddsparse /mnt/image -o ro,loop,offset=0x7e00,iocharset=utf8
Well, that's the story of a saved academic career (or at least of a gigantic pile of work). I hope that my experiences help someone else with rescuing data from a failing disk. Now I just have to decide what gift to take in exchange for this rescue operation... ;)