Have an experiment in mind, and ended up going down a rabbit-hole. :)

Problem

Use of deduplication in storage has become somewhat common. But there are problems.

Fixed-size hashed blocks are easiest to implement, but have problems with "explosive decompression" (sudden, unplanned growth in actual usage) and with scale.

One of the bits of learned customer-site “wisdom” I read on the web was to keep your deduplicated storage at 30-50% usage, especially if you are seeing high deduplication rates. My guess is that experienced sites have seen occasional explosive growth in usage. While the site might be seeing 80-90% deduplication, because they have to keep that much free space, the effective deduplication rate is more like 60-70%. So if you can offer 60-70% deduplication with stable usage, that would be more than competitive.
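To make the arithmetic concrete, here is a small sketch (my own illustrative numbers, not anyone's measured data) of how a free-space ceiling erodes the raw deduplication rate:

    # Effective deduplication when usage must stay below a ceiling.
    # Raw dedup of 90% means physical bytes are 10% of logical bytes;
    # if policy caps usage at 50%, you must provision twice the physical size.
    def effective_dedup(raw_dedup, max_usage):
        physical_fraction = 1.0 - raw_dedup            # physical / logical
        provisioned_fraction = physical_fraction / max_usage
        return 1.0 - provisioned_fraction

    for raw in (0.80, 0.90):
        for usage in (0.30, 0.50):
            eff = effective_dedup(raw, usage)
            print(f"raw {raw:.0%} at max usage {usage:.0%} -> effective {eff:.0%}")

The exact number depends on which corner of that range you pick, but the middling combinations land near the 60-70% figure above.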

Fixed-size blocks are fragile - at least, that is my guess from theory. (I do not have any vendor’s secret numbers, and would not expect to see those numbers published.) It is a bit of a reading-between-the-lines exercise. I seem to see the same story, repeated:

  1. Vendor starts with 4K block-size. Works well. Good bet.
  2. Runs into problems scaling up.
  3. Increases block-size - a major disruptive update.
  4. Lots of unhappy noises from customers.

Keep in mind - implementing fixed-size hashing is simpler. You get a working product with good performance in a short time. I bet there are quite a number of attempts at variable-size hashing that failed to become a viable product. (It is a harder problem, and easy to get wrong in some aspect.)

Insight - minimize I/O first

Thinking about storage at scale, I eventually realized deduplication is more about minimizing I/O, and less about saving space.

Storage space is relatively cheap. Even flash storage is under $0.50/GB, currently, and headed lower.

I/O operations are not cheap, especially when the storage is remote, and especially in a cloud. Flash storage offers much higher rates than spinning disks, but with many applications running against the same storage, I/O can become the bottleneck.

Insight - waste space to make an efficient hybrid

The main advantage of deduplication is to reduce I/Os, not to save space. What if we could combine the simplicity and performance of fixed-size hashing with the stability of variable-size hashing? What if the base implementation were fixed-size, but with larger blocks, so lower overhead? What if we used a rolling hash to occasionally re-align the stream, but not too often? To re-align the stream, simply insert zeros from the point where the rolling hash fires to the end of the block.

This was another flash of insight. Maybe I save space by wasting space! :)

The majority of the implementation retains the simplicity and speed of fixed-size hashing. You do not get deduplication rates as high, but actual usage is much less fragile. You can be comfortable running with higher actual usage.
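A minimal sketch of that hybrid chunker, under my own assumptions (the 64KB block size, the low-bits trigger, and the rolling-hash interface are placeholders, not a worked-out design):

    def hybrid_blocks(stream, rolling_hash, block_size=64 * 1024,
                      realign_mask=(1 << 20) - 1):
        """Yield fixed-size blocks. When the rolling hash "fires" (low bits
        all zero), pad the rest of the current block with zeros so the next
        block starts re-aligned with the content."""
        block = bytearray()
        for byte in stream:
            block.append(byte)
            if (rolling_hash.update(byte) & realign_mask) == 0:
                # Waste space to regain alignment: zero-fill to the boundary.
                block.extend(b"\x00" * (block_size - len(block)))
            if len(block) == block_size:
                yield bytes(block)
                block = bytearray()
        if block:
            block.extend(b"\x00" * (block_size - len(block)))
            yield bytes(block)

Every block is still the same size, so the index and allocation logic stay as simple as the pure fixed-size case; the only cost of a re-alignment is the zero padding, which is exactly the space being deliberately wasted.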

Experiment needed

The above is all a guess.

As a first try, I would choose 64KB fixed-size blocks and a rolling hash that forces a re-alignment every ~1MB (or much more). The aim is to try a range of sizes - I expect there is a “knee” in the curve … somewhere.
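Measuring one run is just bookkeeping. A sketch of what I would record for each (block size, re-align interval) setting, feeding it the blocks the chunker above produces:

    import hashlib

    def dedup_ratio(blocks):
        """Fraction of blocks that duplicate an earlier block, i.e. the raw
        deduplication rate for one run at one setting."""
        seen = set()
        total = dups = 0
        for block in blocks:
            digest = hashlib.blake2b(block, digest_size=16).digest()
            if digest in seen:
                dups += 1
            else:
                seen.add(digest)
            total += 1
        return dups / total if total else 0.0

Run that over a grid of block sizes and re-align intervals against a realistic data set, plot, and look for the knee.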

Down the rabbit-hole

I need to collect numbers.

Which leads to measuring a cheap rolling hash. Roughly good enough will do.
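By "cheap" I mean something like a Rabin-Karp style polynomial hash over a small sliding window. A sketch of the sort of thing I would measure (the window, base, and modulus here are arbitrary choices, not tuned):

    class RollingHash:
        """Rabin-Karp style polynomial hash over a fixed sliding window.
        Cheap per byte: a multiply, an add, a subtract, and a modulo."""

        MOD = (1 << 61) - 1      # a Mersenne prime; arbitrary but convenient
        BASE = 257

        def __init__(self, window=48):
            self.window = window
            self.buf = [0] * window       # circular buffer of recent bytes
            self.pos = 0
            self.value = 0
            # BASE**window mod MOD, used to remove the oldest byte
            self.drop = pow(self.BASE, window, self.MOD)

        def update(self, byte):
            """Slide the window one byte; return the new hash value."""
            old = self.buf[self.pos]
            self.buf[self.pos] = byte
            self.pos = (self.pos + 1) % self.window
            self.value = (self.value * self.BASE + byte - old * self.drop) % self.MOD
            return self.value

The chunker above treats the hash as having "fired" when some number of low bits are all zero; the width of that mask sets the expected distance between re-alignments.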

Which leads to measuring a (hopefully) efficient fingerprinting hash. Skein is not as fast as I would like. Really want throughput similar to SSD storage, so ~1GB/s would be nice. Alternatives?
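For ballpark throughput numbers, timing the implementations that ship in Python's hashlib is enough to rank candidates (a rough single-threaded, in-memory measurement, not a serious benchmark):

    import hashlib, os, time

    def throughput_mb_s(name, data, rounds=20):
        """Rough single-threaded MB/s for one of hashlib's algorithms."""
        start = time.perf_counter()
        for _ in range(rounds):
            hashlib.new(name, data).digest()
        elapsed = time.perf_counter() - start
        return len(data) * rounds / elapsed / 1e6

    data = os.urandom(64 * 1024 * 1024)     # 64MB of incompressible test data
    for name in ("sha256", "sha3_256", "blake2b"):
        print(f"{name:10s} {throughput_mb_s(name, data):8.0f} MB/s")

The absolute numbers depend heavily on the machine and the build, so I would only trust the relative ordering from something this crude.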

Interesting that the NIST competition for SHA3 has had a bit of a tail effect: the work to improve hash functions continued after the NIST SHA3 competition ended.

Seems Blake2 might(?) be the likely interesting candidate. Blake2 makes (some?) use of SIMD, and offers (some?) use of multiple CPUs through OpenMP. (Insert pause to grok those two subjects.) For my application … I might want to use SIMD within Blake2, but apply OpenMP at the application level rather than inside the fingerprinting hash.
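What that application-level split would look like, roughly (sketched in Python rather than C, with blake2b standing in for whichever fingerprinting hash wins out): hash many blocks concurrently across a worker pool, and keep each individual hash call single-threaded.

    import hashlib
    from concurrent.futures import ProcessPoolExecutor

    def fingerprint(block):
        # A 16-byte digest is plenty for a dedup index; not a security boundary.
        return hashlib.blake2b(block, digest_size=16).digest()

    def fingerprint_all(blocks, workers=4):
        """Fan per-block hashing out across CPUs at the application level."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fingerprint, blocks, chunksize=64))

The same shape applies in C with OpenMP: a parallel loop over blocks, each iteration calling a plain single-threaded hash.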

Note that using a crypto-grade hash is probably overkill for this usage. If I were doing this for a product, I would spend the weeks needed to invent and prove a faster fingerprinting hash. (To be clear, this is not a trivial exercise.) For the purposes of this experiment, I want to spend less time.