Bit rot
Posted: Sun Apr 04, 2010 10:56 pm
In a fit of paranoia some time ago, I decided that I should assume every file I have decays over time. Not in a literal sense, but just because the likelihood of an operator error, a virus, a hacker, a power surge, or some other miscellaneous hardware failure corrupting or destroying one or more files seemed pretty high. But I'm lazy, so it wasn't until last week that I did anything serious about it. I've been ripping my DVD collection, and as part of that, I checksummed (SHA256) every ripped file immediately after the rip, and then mirrored them to another system soon after. Stupidly paranoid, right? Even I thought so.
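For anyone who wants to try the same thing, here is a minimal sketch of that checksum-then-verify routine in Python. It is not the program I actually used; the function names (`sha256_of`, `build_manifest`, `verify`) and the idea of keeping the checksums in a dictionary keyed by relative path are assumptions made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != expected:
            bad.append(rel)
    return bad
```

The idea would be to call `build_manifest` right after ripping, save the result somewhere safe (for example with `json.dump`), and later run `verify` against both the original directory and each mirror.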
Except that I saw the first error within days. I was rerunning the checksum program to test something unrelated, and between two consecutive runs, the checksum of one of the ripped files changed. As far as I know, nothing was writing to that drive, let alone to that file. That was the original copy, on the drive where the file was first ripped. The checksum of that file had been computed at least three times before. The mirrored copy (on another machine) still had the original checksum, and comparing the files showed that one byte in the middle had changed. Not by one bit, or by any obvious pattern. Just changed. The file displayed the new checksum consistently over several runs of the program. So, I copied over the mirrored file, and reran the checksum overnight. The next morning, both the original file and the restored-from-mirror copy had the original checksum. Creepy, huh? My best guess is that there was an uncorrected disk read error or some error in memory. The main-memory disk cache was large enough that the file might not have been reread from disk until the overnight run, which read ~300GB, more than enough to flush every cache. Or maybe the error was persistent, and I accidentally copied the restored file over the bad one. No way to be sure.
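The byte-level comparison is easy to reproduce. Below is a rough sketch, again in Python and again not the tool I used; `differing_offsets` and its `limit` parameter are hypothetical names. It walks two files in parallel and reports the offset and both byte values wherever they disagree.

```python
def differing_offsets(path_a, path_b, chunk_size=1 << 20, limit=100):
    """Return up to `limit` tuples (offset, byte_in_a, byte_in_b).

    If one file is shorter than the other, a final entry
    (offset_where_shorter_file_ends, None, None) is appended.
    """
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(diffs) < limit:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((offset + i, x, y))
                        if len(diffs) >= limit:
                            break
                if len(a) != len(b) and len(diffs) < limit:
                    # One file ended early; record where and stop.
                    diffs.append((offset + min(len(a), len(b)), None, None))
                    break
            offset += len(a)
    return diffs
```

For each reported pair, `bin(x ^ y).count("1")` gives the number of flipped bits, which is a quick way to tell a single-bit flip from a byte that changed wholesale.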
So, now, a couple of days later, I haven't seen any problems with that file. I created a third mirror (part of my plan even before the first failure) and ripped a whole bunch more files. As part of maintaining the mirrors, I've been copying a lot of files back and forth with a combination of rsync, Windows shares, and Samba. Last night, I ran the checksum program on two copies and got a different set of errors on each. In both cases, the errors are in previously untested copies, rather than the originals, so for these, I'm suspecting errors in the network hardware or drivers (I can't imagine both rsync and Windows sharing having data corruption issues).
The only things preventing me from freaking out are the statistics and the consequences. I've seen maybe twenty bad bytes out of several terabytes of reads and transfers. Over the past decade, I probably haven't done more than 10 times that much reading and copying. So, assuming constant error rates, that works out to roughly 200 bad bytes, so probably less than a kilobyte of damage has been done. The overwhelming majority of the files I have are stuff that wouldn't be seriously affected by one-byte errors: MPEG-2 video, mp3s, game media, etc. If it hit an executable, it would probably just result in occasional crashes, which I would probably blame on crappy programming. Most documents I have are things I'll never look at again, like old class work. The only way I could imagine it causing financial loss would be corrupting a file that stores some investment cost basis, causing me to pay some extra tax. But if it was off by much I'd probably notice. So, I'll probably try to expand my checksumming/mirroring process to other files over time, but for the most part, I'm probably just going to bury my head in the sand and ignore it.
Anyway, is this something any of you have come across before? Are you just living with it, or do you have some kind of backup and verification procedures to mitigate it?