Prelude to an experiment, and down the rabbit-hole
Have an experiment in mind, and ended up going down a rabbit-hole. :)
Use of deduplication in storage has become somewhat common. But there are problems.
Fixed-size hashed blocks are easiest, but have problems with explosive decompression and scale.
One of the bits of learned customer-site "wisdom" I read on the web was to keep your deduplicated storage at 30-50% usage especially if you are seeing high deduplication rates. My guess is experienced sites have seen occasional explosive growth in usage. While the site might be seeing 80-90% deduplication, because they have to keep large free space, the effective deduplication rate is more like 60-70%. So if you can offer 60-70% deduplication with stable usage, this would be more than competitive.
Fixed-size blocks are fragile - at least that is my guess from theory. (I do not have any vendor's secret numbers, and would not expect to see those numbers published.) Bit of a reading-between-the-lines exercise. Seem to see the same story, repeated:
- Vendor starts with 4K block-size. Works well. Good bet.
- Runs into problems scaling up.
- Increases block-size - a major disruptive update.
- Lots of unhappy noises from customers.
Keep in mind - implementing fixed-size hashing is simpler. You get a working product with good performance in short time. I bet there are quite a number of attempts at variable-size hashing that failed to become a viable product. (Harder problem and easy to get wrong, in some aspect.)
Insight - minimize I/O first
Thinking about storage at scale - eventually realized deduplication is more about minimizing I/O, and less about saving space.
Storage space is relatively cheap. Even flash storage is under $0.50/GB, currently, and headed lower.
I/O operations are not cheap, especially when the storage is remote, and especially in a cloud. Flash storage offers much higher rates than spinning disks, but with many applications running against the same storage, I/O can become the bottleneck.
Insight - waste space to make an efficient hybrid
The main advantage of deduplication is to reduce I/Os, not to save space. What if we could combine the simplicity and performance of fixed-size hashing, with the stability of variable-size hashing? What if the base implementation was fixed-size (but larger blocks, so lower overhead). What if we used a rolling-hash to occasionally re-align the stream, but not too often? To re-align the stream, simply insert zeros from the point the rolling-hash fires, to the end of the block.
This was another flash of insight. Maybe I save space by wasting space! :)
The majority of the implementation retains the simplicity and speed of fixed-size hashing. You do not get deduplication rates as high, but actual usage is much less fragile. You can be comfortable running with higher actual usage.
The above is all a guess.
As a first try, I would choose 64KB fixed-size blocks and a rolling hash that forces re-aligns every ~1MB (or much more). Aim is to try a range of sizes - I expect there is a "knee" in the curve ... somewhere.
Down the rabbit-hole
I need to collect numbers.
Which leads to a measuring a cheap rolling hash. Roughly good enough.
Which leads to measuring a (hopefully) efficient fingerprinting hash. Skein is not as fast as I would like. Really want throughput similar to SSD storage, so ~1GB/s would be nice. Alternatives?
Interesting that the NIST competition for SHA3 has a bit of a tail effect. The competition to improve hash functions continued after the NIST SHA3 competition ended.
Seems Blake2 might(?) be a likely interesting result. Blake2 makes (some?) use of SIMD, and offers (some?) use of multiple CPUs through OpenMP. (Insert pause to grok those two subjects.) For my application ... I might want to use SIMD within Blake2, but use OpenMP not within the fingerprinting hash but at the application level.
Note that using a crypto-grade hash is probably overkill for this usage. If I were doing this for a product, I would spend the weeks needed to invent and prove a faster fingerprinting hash. (Be clear this is not a trivial exercise.) For the purposes of this experiment, I want to spend less time.