Bit rot

George
Veteran Doodler
Posts: 1267
Joined: Sun Jul 18, 2004 12:26 am
Location: Arlington, VA

Bit rot

Post by George »

In a fit of paranoia some time ago, I decided that I should assume every file I have decays over time. Not in a literal sense, but just because the likelihood of an operator error, a virus, a hacker, a power surge, or miscellaneous other hardware failure corrupting or destroying one or more files seemed pretty high. But, I'm lazy, so it wasn't until last week that I did anything serious about it. I've been ripping my DVD collection, and as part of that, I checksummed (SHA256) every ripped file immediately after the rip, and then mirrored them to another system soon after. Stupidly paranoid, right? Even I thought so.
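For anyone wanting to try the same thing, here is a minimal sketch of checksumming every file under a directory with SHA-256. It is not the actual script used for the rips; the function names `sha256_of_file` and `checksum_tree` are made up for illustration, and the 1 MiB chunk size is an arbitrary choice.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so multi-gigabyte rips never sit in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Saving the returned dictionary (as JSON, say) right after a rip gives a baseline that later runs can be compared against.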

Except that I saw the first error within days. I was rerunning the checksum program to test something unrelated, and between two consecutive runs, the checksum of one of the ripped files changed. As far as I know, nothing was writing to that drive, let alone to that file. That was on the drive that the file was originally ripped from. The checksum of that file had been computed at least three times before. The mirrored copy (on another machine) still had the original checksum, and comparing the files showed one byte in the middle had changed. Not by one bit, or by any obvious pattern. Just changed. The file displayed the new checksum consistently over several runs of the program. So, I copied over the mirrored file, and reran the checksum overnight. The next morning, both the original file and mirrored-restored copy had the original checksum. Creepy, huh? My best guess is that there was an uncorrected disk read error or some error in memory. The main-memory disk cache was large enough that the file might not have been reread from disk until the overnight run, which read ~300GB, more than enough to flush every cache. Or, maybe the error was persistent, and I accidentally copied the restored file over the bad one. No way to be sure.
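The byte-level comparison between a file and its mirror can be sketched roughly as below. This is only an illustration of the idea, not the tool that was actually used; `differing_offsets` and its `limit` parameter are hypothetical names, and the sketch assumes both copies are supposed to be the same size.

```python
import os


def differing_offsets(path_a, path_b, chunk_size=1 << 20, limit=100):
    """Return up to `limit` (offset, byte_a, byte_b) tuples where two files differ."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        raise ValueError("files differ in size; offsets would not line up")
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(diffs) < limit:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a:
                break
            if a != b:
                # Only walk byte by byte through chunks that actually differ.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((offset + i, x, y))
                        if len(diffs) >= limit:
                            break
            offset += len(a)
    return diffs
```

A single corrupted byte shows up as a one-element list giving the offset and both values.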

So, now, a couple days later, I haven't seen any problems with that file. I created a third mirror (part of my plan even before the first failure) and ripped a whole bunch more files. As part of maintaining the mirrors, I've been copying a lot of files back and forth with a combination of rsync, Windows shares, and Samba. Last night, I ran the checksum program on two copies and got a different set of errors on each. In both cases, the errors are in previously untested copies, rather than the originals, so for these, I'm suspecting errors in the network hardware or drivers (I can't imagine both rsync and Windows sharing having data corruption issues).

The only things preventing me from freaking out are the statistics and consequences. Maybe twenty bad bytes out of several terabytes of reads and transfers. Over the past decade, I probably haven't done more than 10 times that. So, assuming constant error rates, probably less than a kilobyte of damage has been done. The overwhelming majority of files I have are stuff that wouldn't be seriously affected by one-byte errors: MPEG2 video, mp3s, game media, etc. If it hit an executable, it would probably just result in occasional crashes, which I probably would blame on crappy programming. Most documents I have are things I'll never look at again, like old class work. The only way I could imagine it causing financial loss would be corrupting a file that stores some investment cost basis, causing me to pay some extra tax. But if it was off by much I'd probably notice. So, I'll probably try to expand my checksumming/mirroring process to other files over time, but for the most part, I'm probably just going to bury my head in the sand and ignore it.
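The back-of-envelope estimate works out roughly like this. The post only says "several terabytes", so the 3 TB figure below is an assumption picked purely to get a number; the twenty bad bytes and the factor of ten are from the estimate above.

```python
# All figures are rough. "Several terabytes" is assumed to be 3 TB here.
bad_bytes_observed = 20
bytes_read_recently = 3 * 10**12   # assumption, not stated in the post
decade_multiplier = 10             # "not more than 10 times that"

# Bad bytes per byte read or transferred, assuming a constant error rate.
error_rate = bad_bytes_observed / bytes_read_recently

# Upper estimate of total damage over the decade, in bytes.
decade_damage = bad_bytes_observed * decade_multiplier

print(f"about {error_rate:.1e} bad bytes per byte read")
print(f"at most about {decade_damage} bad bytes over the decade")
```

With these numbers the decade total comes to a couple hundred bytes, comfortably under a kilobyte.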

Anyway, is this something any of you have come across before? Are you just living with it, or do you have some kind of backup and verification procedures to mitigate it?
I feel like I just beat a kitten to death... with a bag of puppies.

VLSmooth
Tenth Dan Procrastinator
Posts: 3055
Joined: Fri Jul 18, 2003 3:02 am
Location: Varies

Re: Bit rot

Post by VLSmooth »

I've experienced isolated issues with failed checksums, but they're exceedingly rare.

The vast majority of the media I encounter has inline CRC32s[1] (example: "foo [A2C4E6F8].mkv") and can be checked against / repaired against a seeded torrent. I append inline CRC32s to files that don't have them with crc32_inline.pl, a little Perl script I wrote. It is accessible via my Windows send-to menu, checks pre-existing inline CRC32s, outputs a stdout report, and creates a rename.bat to add inline CRC32s to files that lack them. These CRC32s are checked after moving files and on the occasional whim.
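The inline-CRC32 convention is easy to sketch. The following is not crc32_inline.pl (that is a Perl script whose source isn't shown here), just a rough Python illustration of the same idea. The function names and the exact tag pattern are assumptions based on the "foo [A2C4E6F8].mkv" example.

```python
import re
import zlib
from pathlib import Path

# Matches an 8-hex-digit tag in square brackets, e.g. "foo [A2C4E6F8].mkv".
INLINE_CRC = re.compile(r"\[([0-9A-Fa-f]{8})\]")


def crc32_of_file(path, chunk_size=1 << 20):
    """Return the CRC32 of a file as an unsigned 32-bit integer."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def check_inline_crc(path):
    """Return True/False if the name carries a CRC tag, None if it has no tag."""
    path = Path(path)
    match = INLINE_CRC.search(path.name)
    if match is None:
        return None
    return int(match.group(1), 16) == crc32_of_file(path)


def name_with_crc(path):
    """Suggest a new file name with the inline CRC32 appended before the extension."""
    path = Path(path)
    return f"{path.stem} [{crc32_of_file(path):08X}]{path.suffix}"
```

A wrapper could print a report from `check_inline_crc` and emit rename commands from `name_with_crc` for files that return `None`.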

My approach is significantly more ad hoc than yours, but so far I'm satisfied and it's easy to use.


Footnote: [1] CRC32 has a much higher probability of collision than SHA256, but I wanted to stick with a common pre-existing convention. CRC32 is short enough to embed in a filename and it's nice to not require separate sfv/md5/etc files.

George

Re: Bit rot

Post by George »

Unfortunately, straight VOB rips have fixed file names, so I couldn't embed a checksum. But I kind of like the idea of using the Send To menu. Might require some cleverness to make it work with my scripts, but I think it's practical.

Speed and space weren't really priorities, so I started with MD5, figuring I'd use an existing checksum tool. Unfortunately, months of experimentation convinced me that all existing tools suck (especially the one that carried a virus). QuickSFV was the best of a bad lot, but it can't verify and update checksums in a single run, nor does it recognize renames. So I resorted to Python. All hash functions have the same interface, so I figured why not go all out?
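Since the script itself isn't posted, here is a guess at what "verify and update in a single run, and recognize renames" could look like. Everything below (`file_digest`, `verify_and_update`, the report keys) is hypothetical; the only real API used is `hashlib.new`, which takes the algorithm name as a string and is what makes swapping MD5 for SHA-256 trivial.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file with any algorithm name that hashlib.new accepts."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_and_update(root, recorded, algorithm="sha256"):
    """Compare files under root against `recorded` ({relative path: digest}).

    Returns (updated_record, report) from a single pass over the files.
    """
    root = Path(root)
    current = {
        p.relative_to(root).as_posix(): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    report = {"ok": [], "changed": [], "new": [], "renamed": [], "missing": []}
    # Recorded paths that no longer exist are rename candidates, keyed by digest.
    missing = {path: d for path, d in recorded.items() if path not in current}
    by_digest = {}
    for path, d in missing.items():
        by_digest.setdefault(d, []).append(path)
    for path, d in current.items():
        if path in recorded:
            report["ok" if recorded[path] == d else "changed"].append(path)
        elif by_digest.get(d):
            # Same content under a new name: treat it as a rename.
            old = by_digest[d].pop()
            report["renamed"].append((old, path))
            del missing[old]
        else:
            report["new"].append(path)
    report["missing"] = sorted(missing)
    return current, report
```

Anything in `report["changed"]` deserves a look before the updated record is saved, since it is either an intentional edit or corruption.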

Jonathan
Grand Pooh-Bah
Posts: 6722
Joined: Tue Sep 19, 2006 8:45 pm
Location: Portland, OR

Re: Bit rot

Post by Jonathan »

At work, I have had some issues with transferring multi-gigabyte Norton Ghost images across network links. SMB and TCP/IP are flaky. Always check the result of a file transfer.

I am not worried about bad bits or bytes on my personal data, and my personal binaries can be redownloaded easily. Lossy compression is adding more junk into my data than any conceivable error rate.

I am mystified by your original failure. I can't even think of a system which would munge a byte (as opposed to a bit or a sector or a DRAM page or OS page or what-have-you). If you are taking a significant hit in storage and throughput performance, perhaps it is financially worthwhile for you to invest in some error-correcting hardware?

I wonder if anyone has done any testing to see what the error rate is on different TCP/IP stacks, ethernet drivers, ssh implementations, and so on.

quantus
Tenth Dan Procrastinator
Posts: 4891
Joined: Fri Jul 18, 2003 3:09 am
Location: San Jose, CA

Re: Bit rot

Post by quantus »

Your most likely failure is going to be a whole drive going bad, so sticking with some sort of RAID system like the ReadyNAS is probably fine. The biggest source of failure then becomes not properly disconnecting after sending a file over Samba, since Windows sometimes forgets to send the last few chunks of data for a while. rsync/ftp are much more reliable transfer methods.

It's possible for a hard drive to munge a byte since it processes data in chunks of a few bits to a few bytes at a time, depending on the code used to make it more resistant to burst errors. It's odd that it would work again later though. From what I understand, most physical network layers also use a form of LDPC that could have a similar effect. In any case though, I find it hard to believe that it would pass one of the higher layer's CRC checks. Data is not checked just once, but a few times to try to catch different types of errors. For instance, hard drives will encode a CRC with the data being written that is xor'ed with the physical address, so if a read at the wrong location takes place the CRC will fail even though an un-xor'ed CRC would've passed.
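The address-xor trick can be illustrated with a toy example. This is only a sketch of the idea using `zlib.crc32`; real drive firmware uses its own codes and layouts, and the function names and the 32-bit address are made up here.

```python
import zlib


def stored_crc(data, address):
    """CRC32 of the sector data, xor'ed with the address it is written to."""
    return (zlib.crc32(data) ^ address) & 0xFFFFFFFF


def read_ok(data, crc_on_disk, requested_address):
    """Check a sector read against the address the reader asked for."""
    return stored_crc(data, requested_address) == crc_on_disk


sector = b"some sector payload"
crc = stored_crc(sector, address=1000)

# Reading the right location: the data and the address both match.
assert read_ok(sector, crc, requested_address=1000)

# The head landed on sector 1000 while 1001 was requested. The data is
# internally intact, so a plain CRC would pass, but the xor'ed one fails.
assert not read_ok(sector, crc, requested_address=1001)
```

The point is that intact data from the wrong place is still caught, which a CRC over the data alone cannot do.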

An alpha particle in your RAM or cache or signal traces would only upset a single bit and not munge an entire byte...

Someone should write a virus which just looks for large files and only changes some small, odd number of bytes in the middle of the file just to mess with George.
Have you clicked today? Check status, then: People, Jobs or Roads

George

Re: Bit rot

Post by George »

I haven't seen any recurrences of that spontaneous byte change, so who knows. I have seen several more network transfer corruptions, both with SMB and with rsync. I'm not too surprised since both systems involved have nForce chipsets, and I remember reading about them being flaky.

I've lost two drives to click-of-death over the years, so yeah, total drive failure is something I'm hoping to defend against with the mirroring. I shy away from RAID because it doesn't offer any protection against what I still consider the biggest threat--me. When I do something stupid, RAID happily replicates my stupidity as often as necessary to corrupt all its backups. The same applies to any file system write, so it doesn't offer any defense against defective or malicious software either. At least having mirrors on two systems gives me a chance to notice when something goes wrong before I obliterate the backup. It's currently far less convenient, though I hope to automate most of it.

There certainly used to be viruses that randomly vandalize files on the drive, but I wonder if they've been supplanted by botnet clients. If you were trying to covertly hijack a computer, causing a bunch of errors seems counterproductive. If you just wanted to mess with someone, it seems like you could do better than just random corruption.

George

Re: Bit rot

Post by George »

Ooh, spoke too soon, just got another one byte error. Middle of a ~70MB file during a checksum on ~650 GB worth of files. Not the same file as before. Not the same drive as before, but otherwise the same system. File compared equal to the mirror earlier in the afternoon.

File "Troy\Special Features\VIDEO_TS\VTS_03_1.VOB" property sha256 changed from 3965d56e4e281db8fd27eda1aa700d5b3b5eaa75f21fbd7cc21c4be8d89da082 to 7979252dcb48af01639601b69df50742ba79c86d0f0d93bbd8f272642d6031a1.

I compared with WinMerge as soon as I noticed the error. It may have been soon enough that the bad value was still in a cache. WinMerge did find a one byte change. The display was a mix of binary and ASCII, but I think the value changed from 0x11 to comma (=0x2C, 5 bits different, not a shift or inverse).
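The "5 bits different, not a shift or inverse" claim is quick to double-check. A small sketch, with `describe_byte_change` being a made-up helper name:

```python
def describe_byte_change(old, new):
    """Summarize how two byte values differ at the bit level."""
    diff = old ^ new
    return {
        "xor": diff,
        "bits_flipped": bin(diff).count("1"),
        # Bitwise inverse within 8 bits.
        "is_inverse": new == (~old & 0xFF),
        # Any left or right shift by 1 to 7 positions within 8 bits.
        "is_shift": any(
            new == ((old << n) & 0xFF) or new == (old >> n)
            for n in range(1, 8)
        ),
    }


result = describe_byte_change(0x11, 0x2C)
print(result)
# bits_flipped is 5 (0x11 ^ 0x2C == 0x3D), and both flags are False
```

So the observed change really is five flipped bits with no obvious pattern, which is what makes a plain single-bit flip hard to square with it.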

In the meantime, the checksum program kept running, processing another 20GB or so, presumably flushing anything in memory. When it finished, I ran the checksums just of the one directory.

Reading properties of file "G:VTS_03_1.VOB".
  sha256 = 3965d56e4e281db8fd27eda1aa700d5b3b5eaa75f21fbd7cc21c4be8d89da082

... and now WinMerge concurs and says the files are identical.

At least the problem fixes itself. It could have spread if it happened during a mirroring operation, but since the mirroring of these files is now complete, there's no real danger. I'll probably kick off a Memtest86 run overnight, just in case it's something simple like a failing memory module.

George

Re: Bit rot

Post by George »

Yay, I have a bad stick of memory. Memtest showed intermittent but repeated one-bit failures, so hopefully everything else in the system is working properly. Bonus points if anyone can figure out why a one-bit memory error manifested as a one-byte error in practice. Maybe some kind of failed error correction?

Jonathan

Re: Bit rot

Post by Jonathan »

Consumer systems have a little bit of ECC in the data caches but nothing on DRAM. There's no ECC to fail.

Generally, customers only like systems with both N bit error correction and N+1 bit detection. If the system fixes 1 bit errors, it usually ought to detect 2 bit errors and bluescreen (at least recent systems do this). The likelihood of a 3 bit error (which generally slips past) is very small. This is going out to double error correction and triple error detection in a few years.
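To make the "correct N, detect N+1" idea concrete for N = 1, here is a toy single-error-correct, double-error-detect code: Hamming(7,4) plus one overall parity bit, protecting just 4 data bits. This is an illustrative sketch and not how any particular memory controller does it (ECC DIMMs protect 64 data bits with 8 check bits); the function names are made up.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    code = [0] * 8  # code[1..7] is Hamming(7,4); code[0] is overall parity
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the error is uncorrectable."""
    code = list(code)
    # The syndrome is the xor of the positions of all set bits; it is 0 for a
    # valid codeword and equals the flipped position after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: assume exactly one and fix it. A syndrome of 0
        # means the overall parity bit (index 0) itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "double error detected", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1010)

one = list(word); one[5] ^= 1
print(decode(one))    # corrected, original data recovered

two = list(word); two[5] ^= 1; two[6] ^= 1
print(decode(two))    # double error detected

three = list(word); three[1] ^= 1; three[2] ^= 1; three[3] ^= 1
print(decode(three))  # reported as "corrected", but the data is wrong
```

The third case is the one that slips past: three flips look like a single flip to the decoder, so it "fixes" the wrong bit and hands back bad data without complaint.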
