Thinking about storage and the cloud (OpenStack) took me … a bit sideways.

Storage for backup is write-mostly, read almost never, and potentially massively deduplicated.

Storage for general cloud usage tends to be - at times - read-mostly, and potentially deduplicated.

The joker with backup is the disaster-recovery (DR) scenario. In the DR case, the critical majority usage becomes massively read-mostly.

Can we optimize for both cases?

Deduplication bet

Deduplication for storage is now a common offering. This is a good thing, insofar as it offers better performance and economics. It is also a risk you need to understand.

High quality large-value hash functions, and abundant inexpensive CPU cycles - combined mean we can optimize our use of storage. Representing a block of data by a smaller string of bits is a bet. If the string of bits is long enough, a failed hash (collision) is less likely than failures due to cosmic rays - or the Sun dying. (A very good bet.)
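How good is the bet? A back-of-the-envelope birthday-bound estimate makes the point. This sketch assumes a uniform 256-bit hash (SHA-256 class) over unique blocks; the numbers are illustrative, not a vendor's guarantee:

```python
# Birthday approximation: p ~ n^2 / 2^(b+1), valid for n << 2^(b/2).
def collision_probability(n_blocks: int, hash_bits: int = 256) -> float:
    return n_blocks ** 2 / 2 ** (hash_bits + 1)

# 4 PB of storage in 4 KB blocks is about 2^40 distinct blocks.
p = collision_probability(2 ** 40)
print(p)  # on the order of 1e-54: far below hardware error rates
```

At those odds, a random bit-flip in RAM is the thing to worry about, not the hash.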

The problem is not the hash. The problem is the chunk size.

Most storage vendors use 4K block sizes. Why? Because 4K is the most common smallest filesystem allocation unit for both Unix and Windows. When you chunk data at 4K boundaries, most of the time you win the bet.

Also note that Oracle tends to use 8K as a block size. If much of your use is Oracle, with an 8K chunk size in storage - again you almost always win the bet.

Site perspective

From the perspective of an individual site, betting on deduplicated storage should make you nervous.

A small update to the applications running on your site could have a large effect on the deduplication ratio. You might want to keep your storage utilization, in excess of physical capacity, fairly small.

Vendor role

From the perspective of the vendor, the deduplication bet is spread across the entire customer base.

If you bought your storage from a high-end vendor (like EMC), they like to monitor your usage. If your usage starts to spike … a polite guy shows up at your door, with a new box of storage. You drop in the new box, with little or no disruption to your operations.

From the vendor’s perspective, betting across the entire customer base is a good bet.

A bit later … when things have settled … the polite guy from your vendor asks whether you want to buy the new storage, or reduce your usage. Not really unfair, as you are using more. Also the vendor reduced a crisis to a slight discomfort. Quite literally - you got what you paid for.

If you bought from a low-end vendor, and you ran out of storage … your life is not fun.

Problem of scale

Using 4K blocks for fixed-size hashed storage is a big problem at scale. Tracking each (tiny) 4KB block of data carries overhead, and as you scale up the size of storage, that overhead grows with it. Your vendor starts to run into problems scaling up.
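To put a rough number on the overhead - this sketch assumes a 64-byte per-block record (hash, refcount, location), which is an illustrative guess, not any particular vendor's format:

```python
# Rough metadata cost of fixed-size deduplicated blocks, per petabyte.
def metadata_overhead(capacity_bytes: int, block_size: int,
                      record_bytes: int = 64):
    n_blocks = capacity_bytes // block_size
    meta = n_blocks * record_bytes
    return meta, meta / capacity_bytes

PB = 2 ** 50
for block in (4096, 8192, 65536):
    meta, ratio = metadata_overhead(PB, block)
    print(f"{block // 1024:>3} KB blocks: {meta / 2**40:.1f} TB metadata "
          f"({ratio:.2%} of 1 PB)")
```

At 4K blocks that is ~16 TB of metadata per petabyte - metadata you want in fast memory, which is exactly where the scaling pain comes from. Bump the block size to 64K and it drops to ~1 TB.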

Quite logically, your vendor wants to scale up the block size. Perhaps you are not especially bothered by the software update from your vendor, and you wonder why they appear nervous.

Problem is, your usage might explode.

Say your application load changes, so that the same data is stored, but the offsets shift by 1KB. If your storage hashes on 4KB boundaries … suddenly you may need 4x as much storage, as each block of data can now land at any of four alignments within a chunk, and each alignment hashes differently. If your storage hashes on 8KB boundaries … suddenly you may need 8x as much storage.
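A toy demonstration of the fragility, assuming plain SHA-256 hashing of fixed 4K chunks (not any vendor's actual format): the same megabyte of data, shifted by 1KB, shares essentially nothing with the aligned copy.

```python
import hashlib
import os

# Hash fixed-size chunks of a byte string at fixed offsets.
def chunk_hashes(data: bytes, size: int = 4096):
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}

data = os.urandom(1 << 20)                       # 1 MiB of "application data"
aligned = chunk_hashes(data)                     # stored at offset 0
shifted = chunk_hashes(b"\x00" * 1024 + data)    # same data, moved 1 KB

print(len(aligned & shifted))  # expect 0: no chunk deduplicates
```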

Your deduplicated storage has exploded.

Fixed-sized hashed blocks are vulnerable to this explosion. Variable-sized hashed blocks are more robust to such changes, but have much more overhead.

Variable versus fixed blocks

Fixed-size deduplicated block storage is a lot easier - which is why it is the basis of most deduplicated storage. Variable block size - with boundaries chosen by a rolling hash - is much, much harder. The overhead per chunk is higher.
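To make the rolling-hash idea concrete, here is a minimal content-defined chunker. The shift-and-add hash below is a toy (real systems use Rabin fingerprints or buzhash); the mask and size limits are illustrative choices, not anyone's tuned parameters:

```python
import random

# Cut a chunk where the low 12 bits of a rolling hash are zero (~5 KB
# average), bounded by min/max sizes. Old bytes age out of the 32-bit
# hash after 32 shifts, so cut points depend only on nearby content.
def cdc_chunks(data: bytes, mask: int = 0x0FFF,
               min_size: int = 1024, max_size: int = 65536):
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# The same 1 KB shift that defeats fixed blocks: here the boundaries
# resynchronize, because cuts follow content, not absolute offsets.
random.seed(0)
data = random.randbytes(1 << 20)
a = set(cdc_chunks(data))
b = set(cdc_chunks(b"\x00" * 1024 + data))
shared = len(a & b) / len(a)
print(f"{shared:.0%} of chunks still deduplicate after the shift")
```

The per-chunk cost is visible even in the sketch: a hash update per byte on write, plus larger, variable-length chunk records to track.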

Unless your vendor is very, very good at algorithms, you will take a huge hit in performance.

Lazy hybrid - and key insight

At scale, the key insight is that storage is cheap, but I/O is expensive.

Flash storage is (becoming) cheap, but the I/O activity between nodes is expensive. That equation does not look likely to change.

Taken together that means the primary purpose of deduplication is to minimize I/O.
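One way to see "deduplication as an I/O minimizer": ship hashes first, and only send blocks the other side has never seen. The protocol and names below are a hypothetical sketch, not any real product's API:

```python
import hashlib
import os

BLOCK = 65536  # a larger fixed block, per the scaling argument above

# Toy dedup store: the "server" side keeps blocks keyed by hash.
class Server:
    def __init__(self):
        self.store = {}
    def missing(self, hashes):
        return [h for h in hashes if h not in self.store]
    def put(self, blocks):
        self.store.update(
            {hashlib.sha256(b).digest(): b for b in blocks})

# "Client" side: returns bytes actually shipped over the wire.
def backup(server: Server, data: bytes) -> int:
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    hashes = [hashlib.sha256(b).digest() for b in blocks]
    need = set(server.missing(hashes))
    to_send = [b for h, b in zip(hashes, blocks) if h in need]
    server.put(to_send)
    return sum(len(b) for b in to_send)

srv = Server()
data = os.urandom(1 << 20)
first = backup(srv, data)   # full transfer: nothing known yet
second = backup(srv, data)  # repeat backup: only hashes cross the wire
print(first, second)  # 1048576 0
```

Whether the bytes land deduplicated on disk matters less than the fact that they never crossed the network twice.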

Solution?

Fixed-size blocks are fragile. Variable-sized blocks are heavy. What if you could realize the advantages of both? What if you could cheat?

What if you could use fixed-size blocks, but bump the size up enough to get vendors past their scaling problems?

If the primary aim of deduplication is to minimize I/O, then the choice of algorithms on both the client and server side becomes somewhat simple. If you chunk on pattern-dependent boundaries - to strongly constrain explosions in deduplication, but not to force the highest possible deduplication - is the end sum better?

Reality check

All the above is theory. Meaningful experimental measurement might lead to a different result. Or not.