ZFS vs. the ISO/OSI 7 layer model.

The Blob on the Road To Damascus.

Many years ago, long before the internet became the ubiquitous behemoth of today, I worked for a small outfit supporting about 30 remote sites using the same nameless brand of computer. One of my duties was designing and coding a semi-automatic software updating system, and eventually doing most of the work of running it.

The remote sites were connected by an X.25 network, above which guys from another site had grafted a transport service layer which all the apps used. Various people were using and developing mutant versions of ftp, mail and telnet, name services and job transfer protocols. Our technology was cool, neat, well designed, ISO/OSI standards compliant and about to go global.

As time passed and colleagues moved on I ended up babysitting a lot of this stuff, from the kernel hacks right up the stack to the software updating system. It was a small part of my job: it all worked, and I had never bothered to study the code in any detail.

One day the file checksumming code I'd built into the software updater as an afterthought started screaming all the time. Our network was so reliable that this had never happened before. It shouldn't happen: it was clearly a bug, and I was told to find it and squash it, fast. We knew nothing was wrong with the hardware, because in those days we had on-site engineers, and they'd already looked and found nothing wrong.

I knew the code in the upper layers, and it was OK, so I had to dive into the kernel networking. Some three weeks later, on the verge of a nervous breakdown, I admitted to my boss that I'd looked through the entire stack, and I couldn't find the problem. My boss trusted me enough to demand that a better engineer be brought in from the vendor's National Center to "assist" our local guys in stripping down and rebuilding the offending computer, again.

What I couldn't tell my boss, and what was driving me crazy, was that I agreed with the engineers. Each layer in the ISO/OSI model guarantees a certain service to the layer above it; this is a central feature of the stack's design. A fact. Hence, the observed behaviour could not occur due to a hardware fault. It had to be a software problem, and one that I, top techie on the block, couldn't find.

A few days later, after much swearing and poking around with voltmeters and oscilloscopes, our engineer brought us our HDLC offloader for inspection. This expensive Z80-based card handled the CPU-intensive work of checksumming, framing and error-correcting the data stream coming from or going to the X.25 port, and it was the first thing we'd had checked out when the problems started. We were calmly enraged.

Our engineer pointed to a badly-soldered connection on the card, and explained that it was the "error line". If the Z80 on the card got an error, it set a status register, raised the error line, and signalled an interrupt to the main CPU. The rig the engineers were using to test the card had no way to simulate data errors, so the fault had escaped detection first time around.

Ten minutes later, the engineer had re-soldered the connection, replaced the card in the backplane, and magically everything was working right again. With the exception of my head.

I was traumatised by this. I'd done three weeks of late night digging into complex source code written in three languages, despaired at finding nothing wrong, and then had my belief in the veracity of the ISO/OSI standard networking model shattered. All because of a solder blob.

And to add insult to injury, a mate who was an Electrical Engineer and who knew nothing about networks found it quite hilarious: he'd been babbling about "end to end data integrity" for months, but he had a beard and wore sandals, so everyone ignored him. Also, he was using this weird, toy operating system, called UNIX...

Of protocols and pragmatism.

TCP/IP doesn't really fit into the ISO/OSI 7 layer model. It's a pragmatic protocol, not one designed by a committee. It doesn't really have layers, and it was proven and in use long before the concept of a layered model became de rigueur.

In just about every modern TCP/IP network, there are layers. If a packet sent across an Ethernet is somehow corrupted during transmission, the receiver will simply throw it away, and TCP/IP will arrange for the packet to be re-transmitted. WAN links are smarter: the lower levels will reliably re-transmit any corrupt packet until it arrives successfully, without TCP/IP knowing anything about it, exactly as with an ISO/OSI network.

So what's so special about TCP/IP? The answer is subtle but simple. TCP/IP doesn't trust the lower layers. With TCP/IP, just before a packet is sent out, the sender calculates a checksum and embeds it into the packet, and the receiver validates the packet using that checksum. From an ISO/OSI perspective, that's completely unnecessary, because each layer in the underlying network guarantees to the layer above that the data it's passing on is valid. Yeah, right.
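
For the curious, here's a minimal Python sketch of the RFC 1071 ones'-complement checksum that IP, TCP and UDP all use. It's purely illustrative: a real stack does this in C, incrementally, and over pseudo-headers as well.

    def internet_checksum(data: bytes) -> int:
        """RFC 1071 ones'-complement checksum, as used by IP, TCP and UDP."""
        if len(data) % 2:
            data += b"\x00"                           # pad odd-length input
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]     # sum 16-bit big-endian words
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
        return ~total & 0xFFFF                        # ones' complement of the sum

    # The sender embeds the checksum in the header; summing the packet
    # *including* the checksum at the receiving end must then give zero.
    packet = b"some TCP segment"
    csum = internet_checksum(packet)
    assert internet_checksum(packet + csum.to_bytes(2, "big")) == 0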

This is the "end-to-end data integrity" that my UNIX-loving friend was trying to explain to everyone all those years ago. UNIX preferred TCP/IP, and that's the main reason TCP/IP flourished while X.25 became extinct.

The network is the, um, storage?

The hard drive in your computer is clearly not a network. Anyone can see that.

What about an external drive, connected with USB or Firewire? OK, there's a cable involved, but it's not a network, is it?

A SCSI bus? That's not a network. What about Fibre Channel (FC) connections? Well, maybe. The fibre has to go from one end of the datacenter to the other, but surely it's not really a network just because the cable is longer?

However, if you take that FC link and plug it into a switch, it somehow becomes a SAN, or Storage Area Network. Well, OK, but it's not a proper network, is it?

Rent a fibre or use FCoE to your second datacenter, though, and there's absolutely no doubt about it. Your storage is a network. With added pain.

But examine each of these setups closely, and you'll find they're actually all networks. That FC card is using layered protocols to send data between the host and the storage controller. Your SCSI HBA talks to the drive using the SCSI protocol, which includes parity checking for the bus and a mechanism to re-transmit bad requests. Likewise your USB or Firewire controller: although the hardware technology is completely different, the software does the same job, and in much the same way.

What about that internal hard drive, though? Well, it's a network too. The controller talks to the drive's firmware over the PATA or SATA bus, and there are various checks for data integrity at both ends of the cable. If you've ever tried to use a PATA drive with a substandard or overlong ribbon cable, you'll know all about UDMA CRC errors.

In addition to all these error-checking and error-correcting protocols which are running "on the wire" to prevent data corruption between the host and the drive, the drive itself has sophisticated methods for ensuring that data which is written to it can be reliably read back. Every block written to the drive has an ECC appended to it, which can both validate and to some degree correct the data it's associated with. The drive can tell when some part of the disk is bad, and write your data somewhere good instead, keeping track of where everything is written. And it can constantly monitor the state of the disk, re-arranging data when correctable errors are found, reporting any discrepancies to the host.
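
To see what "validate and to some degree correct" means in miniature, here's a toy Hamming(7,4) code in Python. It protects only four bits and fixes only single-bit errors; real drives use far stronger codes (Reed-Solomon and the like) over entire sectors, but the principle is the same.

    def hamming74_encode(nibble: int) -> int:
        """Pack 4 data bits plus 3 parity bits into a 7-bit codeword."""
        d = [(nibble >> i) & 1 for i in range(4)]
        code = [0] * 8                         # positions 1..7; index 0 unused
        code[3], code[5], code[6], code[7] = d
        code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
        code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
        code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
        return sum(code[pos] << (pos - 1) for pos in range(1, 8))

    def hamming74_decode(word: int) -> int:
        """Return the 4 data bits, correcting any single flipped bit."""
        code = [0] + [(word >> (pos - 1)) & 1 for pos in range(1, 8)]
        syndrome = (  (code[1] ^ code[3] ^ code[5] ^ code[7])
                    + (code[2] ^ code[3] ^ code[6] ^ code[7]) * 2
                    + (code[4] ^ code[5] ^ code[6] ^ code[7]) * 4)
        if syndrome:                           # non-zero: its value *is* the
            code[syndrome] ^= 1                # position of the bad bit
        return code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3

    word = hamming74_encode(0b1011)
    assert hamming74_decode(word ^ (1 << 5)) == 0b1011  # survives one flipped bit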

It's pretty fundamental: if you write something to a drive, then when you read it back you'll get exactly the data you wrote, or you'll get some kind of error which means that your data has become corrupt.

Let's rephrase that. Each layer in the storage model guarantees a certain service to the layer above it.

So what has all this got to do with ZFS?

ZFS compares to traditional layered storage in the same way that TCP/IP compares to the ISO/OSI 7 layer network model. TCP/IP doesn't rely on or trust the lower layers of any network which it's going through, even though they might have error detection algorithms which are theoretically far stronger than its own. Likewise, ZFS is happy to use all of the error correction and detection features of the underlying storage layers, but it's perfectly OK without them because it checks for itself. ZFS doesn't trust the lower layers.
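
In miniature, the trick looks something like this. It's a toy sketch of the idea, not ZFS's on-disk format: ZFS actually uses fletcher4 or SHA-256 checksums, stored in the parent block pointer rather than alongside the data.

    import hashlib

    def write_block(disk: dict, addr: int, payload: bytes) -> tuple:
        """Write a block; return a 'block pointer' of (address, checksum).
        The checksum lives in the *parent* block, not next to the data,
        so the data can't vouch for itself."""
        disk[addr] = payload
        return (addr, hashlib.sha256(payload).digest())

    def read_block(disk: dict, blkptr: tuple) -> bytes:
        addr, expected = blkptr
        payload = disk[addr]
        if hashlib.sha256(payload).digest() != expected:
            # ZFS would now fetch a redundant copy and repair the bad block
            raise IOError(f"checksum mismatch reading block {addr}")
        return payload

    disk = {}
    ptr = write_block(disk, 7, b"precious data")
    disk[7] = b"precious daty"          # silent corruption below the filesystem
    read_block(disk, ptr)               # raises IOError: caught on read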

Again, this is a subtle difference, and one which most people would be inclined to dismiss. Researchers and corporations spend vast amounts of effort and money making their storage products reliable and error-free. Nobody would argue with the claim that storage hardware is much better today than it was a decade ago. Even so, it's rare to find a user who hasn't had some kind of storage hardware problem.

Storage gremlins are particularly pernicious. They damage valuable data. Often, even though you're told about the damage, it's hard to tell what's been damaged, and you have to resort to that time-wasting full restore from backup. You avoid this by mirroring your data, or using RAID5: when an error occurs on one drive, you use the other (or others) to recreate the contents of the bad drive. You know the only error was on your bad drive, because no errors were reported on the others.
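
The arithmetic behind that reconstruction is nothing more than XOR. A toy stripe, to make the point, and its limits, concrete:

    def xor_blocks(*blocks: bytes) -> bytes:
        """Byte-wise XOR of equal-length blocks."""
        out = bytearray(len(blocks[0]))
        for blk in blocks:
            for i, b in enumerate(blk):
                out[i] ^= b
        return bytes(out)

    # One RAID5 stripe: three data blocks plus their parity.
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks(d0, d1, d2)

    # The drive holding d1 dies; XORing the survivors rebuilds it.
    assert xor_blocks(d0, d2, parity) == d1
    # But this only works because the failed drive *announced* the error.
    # If it had silently returned garbage, the rebuild would be garbage too.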

It's hard to get anyone to think about bitrot. To most people, it's a myth. No storage manufacturer is likely to admit that errors can creep into their data, undetected by all the layers of checking and correction. They'll tell you about MTBFs and give you the odds of an error occurring, but I've never seen the probability of an undetected error quoted. How could they calculate it?

But think about it this way: the more terabytes that global megacorps and geeks like me add to our storage, the more likely it is, statistically, that some undetected corruption will creep in and wreck precious data.
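
A back-of-envelope calculation makes the point. The silent-error rate below is an outright assumption, since, as noted above, nobody publishes one, but any plausible value tells the same story:

    import math

    p_bit = 1e-21                  # ASSUMED odds of one silent error per bit read
    bits_read = 100e12 * 8 * 365   # a shop reading 100 TB per day, for a year

    # P(at least one silent error) = 1 - (1 - p)^n; log1p/expm1 keep the
    # arithmetic honest when p is far below double-precision epsilon.
    p_any = -math.expm1(bits_read * math.log1p(-p_bit))
    print(f"~{p_any:.2%} per year")    # roughly 0.03% -- per shop. Multiply by
                                       # every shop and every year, and silent
                                       # corruption stops being hypothetical.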

Me and the global megacorps don't care. Because we're using ZFS.

Author: Ian Pallfreeman <ip@xenopsyche.com>
9th September 2011.