MemeStreams | MemeStreams Discussion

RAID Z : Jeff Bonwick's Weblog
by Lost at 11:50 am EST, Jan 21, 2007

RAID-Z
The original promise of RAID (Redundant Arrays of Inexpensive Disks) was that it would provide fast, reliable storage using cheap disks. The key point was cheap; yet somehow we ended up here. Why?
RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, even-odd, and Row Diagonal Parity) never quite delivered on the RAID promise -- and can't -- due to a fatal flaw known as the RAID-5 write hole. Whenever you update the data in a RAID stripe you must also update the parity, so that all disks XOR to zero -- it's that equation that allows you to reconstruct data when a disk fails. The problem is that there's no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage.
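The "all disks XOR to zero" equation can be sketched in a few lines of Python. This is only a toy illustration, not code from any RAID product: the four-byte blocks, the three-data-plus-one-parity layout, and the helper name `xor_blocks` are all assumptions made for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: three data disks plus one parity disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# The invariant: every disk in the stripe XORs to zero.
stripe_sum = xor_blocks(stripe)

# Simulate losing disk 1 and rebuilding it from the survivors.
survivors = [block for i, block in enumerate(stripe) if i != 1]
rebuilt = xor_blocks(survivors)
```

Because the stripe XORs to zero, XORing any three surviving blocks yields the fourth, which is why a single failed disk can be rebuilt.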
To see this, suppose you lose power after writing a data block but before writing the corresponding parity block. Now the data and parity for that stripe are inconsistent, and they'll remain inconsistent forever (unless you happen to overwrite the old data with a full-stripe write at some point). Therefore, if a disk fails, the RAID reconstruction process will generate garbage the next time you read any block on that stripe. What's worse, it will do so silently -- it has no idea that it's giving you corrupt data.
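The failure sequence described above can be walked through with the same kind of toy model. Everything here is a sketch under stated assumptions: the block contents, the three-data-plus-one-parity layout, and the `xor_blocks` helper are hypothetical, and the "crash" is simply the absence of the parity update.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Consistent stripe before the crash (hypothetical contents).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# The new data block reaches disk 0 ...
d0 = b"ZZZZ"
# ... and power is lost here, so the parity update never happens.

# The stripe no longer XORs to zero, but nothing records that fact.
stripe_is_consistent = xor_blocks([d0, d1, d2, parity]) == bytes(4)

# Later, disk 1 fails and is reconstructed from the other three disks.
rebuilt_d1 = xor_blocks([d0, d2, parity])
reconstruction_is_correct = rebuilt_d1 == b"BBBB"
```

Note that the block that comes back wrong (disk 1) is not even the block that was being written; reconstruction raises no error and simply returns different bytes.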
There are software-only workarounds for this, but they're so slow that software RAID has died in the marketplace. Current RAID products all do the RAID logic in hardware, where they can use NVRAM to survive power loss. This works, but it's expensive.
There's also a nasty performance problem with existing RAID schemes. When you do a partial-stripe write -- that is, when you update less data than a single RAID stripe contains -- the RAID system must read the old data and parity in order to compute the new parity. That's a huge performance hit. Where a full-stripe write can simply issue all the writes asynchronously, a partial-stripe write must do synchronous reads before it can even start the writes.
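To make the cost difference concrete, here is a minimal sketch that counts disk I/Os for the two write paths. The `Stripe` class, its counters, and the single-parity layout are assumptions for illustration; a real array issues these operations against separate disks rather than Python lists.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

class Stripe:
    """Toy single-parity stripe that counts simulated disk reads and writes."""

    def __init__(self, data_blocks):
        self.blocks = list(data_blocks)
        self.parity = xor_blocks(self.blocks)
        self.reads = 0
        self.writes = 0

    def full_stripe_write(self, new_blocks):
        # Parity is computed from the new data already in memory: no reads.
        self.blocks = list(new_blocks)
        self.parity = xor_blocks(self.blocks)
        self.writes += len(self.blocks) + 1

    def partial_stripe_write(self, index, new_block):
        # Read-modify-write: the old data and old parity must be read first.
        old_block = self.blocks[index]
        self.reads += 1
        old_parity = self.parity
        self.reads += 1
        # new parity = old parity XOR old data XOR new data
        self.parity = xor_blocks([old_parity, old_block, new_block])
        self.blocks[index] = new_block
        self.writes += 2

full = Stripe([b"AAAA", b"BBBB", b"CCCC"])
full.full_stripe_write([b"DDDD", b"EEEE", b"FFFF"])

partial = Stripe([b"AAAA", b"BBBB", b"CCCC"])
partial.partial_stripe_write(1, b"XXXX")
```

In this model the full-stripe write needs zero reads, while updating a single block costs two reads that must complete before either of its two writes can be issued.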
Once again, expensive hardware offers a solution: a RAID array can buffer partial-stripe writes in NVRAM while it's waiting for the disk reads to complete, so the read latency is hidden from the user. Of course, this only works until the NVRAM buffer fills up. No problem, your storage vendor says! Just shell out even more cash for more NVRAM. There's no problem your wallet can't solve.
Partial-stripe writes pose an additional problem for a transactional filesystem like ZFS. A partial-stripe write necessarily modifies live data, which violates one of the rules that ensures transactional semantics. (It doesn't matter if you lose power during a full-stripe write for the same reason that it doesn't matter if you lose power during any other write in ZFS: none of the blocks you're writing to are live yet.)
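The "none of the blocks you're writing to are live yet" rule can be illustrated with a toy copy-on-write store. This is a sketch of the general idea only, not how ZFS is implemented: the `CowStore` class, its single `live_root` pointer, and the `crash_before_commit` flag are all hypothetical.

```python
class CowStore:
    """Toy copy-on-write store: updates never overwrite the live block."""

    def __init__(self):
        self.blocks = {}        # address -> bytes
        self.live_root = None   # address of the currently live version
        self.next_addr = 0

    def _allocate(self, payload):
        # New data always goes to a fresh address.
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = payload
        return addr

    def update(self, payload, crash_before_commit=False):
        new_addr = self._allocate(payload)
        if crash_before_commit:
            # Power lost: the new block is orphaned, the live data is untouched.
            return
        # Commit is a single pointer switch to the new block.
        self.live_root = new_addr

    def read(self):
        return self.blocks[self.live_root]

store = CowStore()
store.update(b"version 1")
store.update(b"version 2", crash_before_commit=True)
after_crash = store.read()
store.update(b"version 3")
after_commit = store.read()
```

A crash in the middle of an update leaves the old version readable. A partial-stripe write breaks this property because it has to modify parity that protects data which is already live.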
If only we didn't have to do those evil partial-stripe writes...
Enter RAID-Z.

We deploy RAID-Z with our solutions. I know there are some storage gurus on MemeStreams. Thoughts?