A good, clear explanation of the difference between RAID-Z and prior RAID implementations.

From Jeff Bonwick’s weblog:

“RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width. Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, completely eliminates the RAID write hole. RAID-Z is also faster than traditional RAID because it never has to do read-modify-write.

Whoa, whoa, whoa – that’s it? Variable stripe width? Geez, that seems pretty obvious. If it’s such a good idea, why doesn’t everybody do it?

Well, the tricky bit here is RAID-Z reconstruction. Because the stripes are all different sizes, there’s no simple formula like “all the disks XOR to zero.” You have to traverse the filesystem metadata to determine the RAID-Z geometry. Note that this would be impossible if the filesystem and the RAID array were separate products, which is why there’s nothing like RAID-Z in the storage market today. You really need an integrated view of the logical and physical structure of the data to pull it off.”
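To make the quoted idea concrete, here is a toy sketch in Python - emphatically not how ZFS actually lays out blocks (the names and layout details are invented purely for illustration) - showing a block becoming its own variable-width stripe with a parity column:

```python
# Toy sketch of "every block is its own RAID-Z stripe".  This is NOT ZFS
# code; SECTOR, lay_out_block and the layout details are made up just to
# illustrate the variable-stripe-width idea from the quote.

SECTOR = 512  # assumed device sector size

def xor_parity(sectors):
    """Single-parity column: XOR of equal-length data sectors."""
    out = bytearray(SECTOR)
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def lay_out_block(block, n_devices):
    """Split one filesystem block into rows of up to (n_devices - 1) data
    sectors plus one parity sector each.  The stripe width follows the
    block size, so there is no fixed geometry shared by all stripes."""
    block += b"\0" * ((-len(block)) % SECTOR)          # pad to whole sectors
    sectors = [block[i:i + SECTOR] for i in range(0, len(block), SECTOR)]
    rows = []
    for i in range(0, len(sectors), n_devices - 1):
        data = sectors[i:i + n_devices - 1]
        rows.append([xor_parity(data)] + data)         # parity column + data columns
    return rows

# A 4 KB block on a 5-wide group fills two full-width rows; a 1 KB block
# occupies only three columns.
print([len(row) for row in lay_out_block(b"x" * 4096, 5)])  # [5, 5]
print([len(row) for row in lay_out_block(b"x" * 1024, 5)])  # [3]
```

Because the row width follows the block size, there is no fixed set of columns that “XOR to zero” - which is exactly why reconstruction has to traverse the filesystem metadata.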

My first job out of college included programming drivers for a disk controller that was an ancestor of IDE (think WD1000). Having spent so much time thinking about how to get optimal performance out of a disk drive, when the RAID notion was published I was naturally interested. I spent some time mulling over the problem and came to a similar conclusion - so the above explanation immediately made perfect sense. Of course, ideas do not always work out when implemented. It could be I’d missed something critical, which would explain the prior RAID implementations. I am delighted that someone translated the hazy notion into working code!

As Tim Bray wrote, “This stuff matters.”, linking to “WHEN TO (AND NOT TO) USE RAID-Z”. I found Roch’s explanation … troubling … which led me to hunt down the RAID-Z explanation above.

I strongly suspect that Roch’s opening statement is wrong.

“RAID-Z is the technology used by ZFS to implement a data-protection scheme which is less costly than mirroring in terms of block overhead.”

When a data block is bad, you must have a good copy somewhere else to recover the data. That means that with an N-disk array, a RAID that can tolerate a single disk failure can store at most the equivalent of N/2 disks of data (slightly less, due to some worthwhile bookkeeping overhead). You cannot get around this basic requirement.
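Spelled out as a trivial calculation - under my assumption that surviving a single drive failure requires a complete second copy of every data block, which is precisely the point argued below:

```python
def usable_disks_with_full_copies(n_drives, copies_per_block=2):
    """Usable capacity, in whole-drive equivalents, if every data block
    must exist as `copies_per_block` complete copies on distinct drives.
    This encodes the assumption above, not RAID-5-style parity math."""
    return n_drives / copies_per_block

print(usable_disks_with_full_copies(10))  # at most 5.0 drives' worth, before bookkeeping overhead
```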

On the other hand, you can take advantage of the replicated data to increase read performance - at least in principle. To get any sort of a handle on whether this works out, you have to look at the physics of current-generation disks and the constraints of current-generation controllers. Though the underlying physics have not changed, my exact knowledge is nearly 25 years out of date.

Write performance is also very much dependent on current disk and controller implementations. Given that the time to program a disk controller is much smaller than the time required for the disks to perform the requested operation, you could easily have all disks performing concurrent operations. The physics say that the write performance of an N-disk RAID should, under full load, be equivalent to N/2 independent disks - and under light to moderate load, the performance of a RAID should be nearly equal to that of independent disks. Controller or motherboard implementations may preclude full performance.
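A back-of-the-envelope sketch of that write argument, assuming (as above) that each logical block costs two device writes:

```python
def loaded_write_equivalent(n_drives, device_writes_per_logical=2):
    """Aggregate write throughput at saturation, expressed as a count of
    independent disks: n drives' worth of device writes divided by the
    device writes each logical block costs (two, under my assumption)."""
    return n_drives / device_writes_per_logical

print(loaded_write_equivalent(10))  # 5.0 -- the N/2 equivalent under full load
# Under light load the two device writes proceed concurrently, so latency
# looks closer to that of a single independent disk.
```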

Current-generation SATA controllers dedicate a channel (and cable) to each drive. Looking at the published numbers (always a bit chancy), it looks like SATA transfer rates are 2 to 3 times the disk transfer rates, and SATA transfer rates are far below memory transfer rates. This suggests that even a “dumb” controller doing large contiguous read or write operations should be able to keep 2 to 3 disks active concurrently. A “smart” controller (with dedicated DMA per channel) should in that case be able to keep all drives active concurrently - up to the memory transfer limit of perhaps 5-10 drives (as a WAG). Of course, most read or write operations are preceded by a seek. Given that the current average seek latency is about twice the time for a full-track write (about 0.5 to 1 MB - another WAG), you might expect under more typical(?) usage to keep 10-20 drives concurrently active.
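The same back-of-the-envelope reasoning as a tiny calculation, using the WAG figures above (none of these numbers are measurements, and the mapping onto real controllers is loose):

```python
def drives_kept_busy(shared_bw, seek_time=0.0, transfer_time=1.0):
    """Rough count of drives a shared data path can keep busy.  Bandwidths
    are in units of one drive's streaming rate; each drive only moves data
    for transfer_time out of every (seek_time + transfer_time)."""
    duty_cycle = transfer_time / (seek_time + transfer_time)
    return shared_bw / duty_cycle

# Pure streaming through a "dumb" controller, SATA link at ~3x the media rate:
print(drives_kept_busy(shared_bw=3))                                    # ~3 drives
# More typical usage: seek ~2x the transfer time, against a memory path
# worth perhaps 5 drives' streaming rate:
print(drives_kept_busy(shared_bw=5, seek_time=2.0, transfer_time=1.0))  # ~15 drives, the same ballpark as above
```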

Quoting from Roch:

“To store file data onto a RAID-Z group, ZFS will spread a filesystem (FS) block onto the N devices that make up the group. So for each FS block, (N - 1) devices will hold file data and 1 device will hold parity information. This information would eventually be used to reconstruct (or resilver) data in the face of any device failure. We thus have 1 / N of the available disk blocks that are used to store the parity information. A 10-disk RAID-Z group has 9/10th of the blocks effectively available to applications.”

It seems to me that in the usual case - allowing for single disk failure - a single filesystem storage quantum (a “block”) would have to be written to exactly two drives. You also need to write some bookkeeping information that - with integration between the RAID and the filesystem - could cost only a fraction of an I/O. Under light to moderate load (the likely usual case), keeping the bookkeeping information on a third drive may perform better.

The upshot is that I expect a 10-disk array would in fact have the equivalent of (10-1)/2 disks (or 45% of the blocks of storage) effectively available to applications. This differs from Roch’s statement.
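For concreteness, here are the two accountings side by side for a 10-disk group, each taken at face value (a sketch of the disagreement, not a resolution of it):

```python
def usable_fraction_roch(n):
    """Roch's accounting: parity consumes 1/n of the blocks, so (n-1)/n is usable."""
    return (n - 1) / n

def usable_fraction_mine(n):
    """My accounting: each block lands on two drives, with roughly one
    drive's worth going to bookkeeping, leaving (n-1)/2 drives' worth."""
    return (n - 1) / 2 / n

n = 10
print(f"Roch: {usable_fraction_roch(n) * n:.1f} of {n} disks usable ({usable_fraction_roch(n):.0%})")
print(f"Mine: {usable_fraction_mine(n) * n:.1f} of {n} disks usable ({usable_fraction_mine(n):.0%})")
# Roch: 9.0 of 10 disks usable (90%)
# Mine: 4.5 of 10 disks usable (45%)
```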

Roch again:

“Now let’s look at this from the performance angle in particular that of delivered filesystem blocks per second (FSBPS). A N-way RAID-Z group achieves it’s protection by spreading a ZFS block onto the N underlying devices. That means that a single ZFS block I/O must be converted to N device I/Os. To be more precise, in order to acces an ZFS block, we need N device I/Os for Output and (N - 1) device I/Os for input as the parity data need not generally be read-in.”

My expectation is exactly one device I/O per block on read (plus perhaps a fractional I/O for bookkeeping), and two device I/Os per block on write (plus the fractional bookkeeping I/O - all of which may be concurrent).

Roch once more:

“Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices each capable of 200-IOPS, will globally act as a 200-IOPS capable RAID-Z group. This is the price to pay to achieve proper data protection without the 2X block overhead associated with mirroring.”

Again, assuming an N-disk RAID that allows for a single disk failure, my expectation for loaded read performance is typically the equivalent of (N-1) independent disks. For loaded aggregate write performance, I expect the equivalent of (N-1)/2 independent disks - approaching (N-1) performance under lighter loads. To allow for a single disk failure you cannot get around the same sort of 2X block overhead as mirroring, but with careful design (including perhaps skewing storage across the array) you could get higher typical performance numbers.
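Putting the read-IOPS side of the disagreement into Roch’s own 10-disk, 200-IOPS-per-device example (just the two claims written out as arithmetic, not measurements):

```python
def random_read_iops_roch(n, per_device_iops):
    """Roch's first approximation: every block read touches (n - 1)
    devices, so the whole group delivers about one device's worth of IOPS."""
    return per_device_iops

def random_read_iops_mine(n, per_device_iops):
    """My expectation: one device I/O per block read (plus fractional
    bookkeeping), so reads scale with the number of data-bearing devices."""
    return (n - 1) * per_device_iops

print(random_read_iops_roch(10, 200))  # 200 read IOPS for the whole group
print(random_read_iops_mine(10, 200))  # 1800 read IOPS for the whole group
```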

Admittedly my exact knowledge of the hardware is way out of date. Still, I suspect that Roch’s write-up is inaccurate.

I am aware that the above could be explained in greater detail. Unlike Jeff (or Roch?), I do not do this for a day job, so you are pretty much on your own if you find the above hard to follow. :)

The above leads to an interesting side-speculation. If generic SATA controllers are just slightly “dumb”, then with a bit of semi-custom silicon Sun could field inexpensive boxes with ~2-5 times the SATA RAID-Z performance of generic boxes. I don’t know how this compares to actual current hardware.

My rough estimate is that for current technology the “sweet spot” for RAID arrays is in the range of 4 to 8 drives. In this range a RAID would yield the biggest bang-for-the-buck for the widest range of applications. A 10-disk RAID-Z array would yield maximum benefit for practically all applications, with slightly better storage utilization. Outside this range few applications would benefit. Just a guess… :)

Update: Roch pointed out something I’d missed, so the above needs update … (later).