
Building a Cloud - futures

This is more of a futures question. Say you have an empty rack (or racks) in your datacenter, and are going to build a cloud. How do you know what to buy or build?

I have this long-running [Gedankenexperiment][gedanken] (thought experiment)[^1] in the back of my mind. If I were to design hardware for a cloud, how would I proceed?

This is my take on what makes sense when building out a cloud.

My opinion, not any pretense at absolute truth. :)

Learn from the best

If you are going to build hardware for the cloud, learn from the best. Amazon, Google, Facebook (and more recently Apple and Microsoft) all have massively more experience than just about anyone else. The Open Compute Project (see Wikipedia and the project site) is a huge hint.

I also visited this aspect in a prior post.

Learn from the past

The needs of hyper-scale public clouds are not the needs of the private cloud.

If you have re-written all your applications for the public-cloud model, you can use public-cloud hardware. You are not going to re-write all your applications. That means you need some hardware more resistant to failure - more in the direction of "enterprise" hardware. In the private cloud, you are going to find many legacy applications, with need for varying levels of protection from hardware failure.

End of the day - developers want a single model for deploying applications. The cloud is that model.

Tiers - base

Protection from failure can be factored into tiers, ordered by cost - the lowest tier is the lowest cost.

The base tier is bare compute nodes, with no redundant hardware. Failure of any part of a node means local state is lost. Applications written for the cloud do not keep any critical data on the local node. If your application does not keep any critical data locally, it is a candidate to run on this lowest-cost tier.

Tiers - in-rack storage

Lots of applications were designed for "crash-consistency". Windows boxes tended to crash - from either hardware or software glitches. If your application could recover in good order, it worked well on the Windows boxes of the 1990s and 2000s.

Applications written to this model are tolerant of hardware failures, with the exception of storage.

This is an argument for moving storage off compute nodes to attached in-rack storage. The in-rack storage needs better protection from failure, so is more costly, but the cost is amortized across the entire rack.
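A back-of-envelope sketch of the amortization argument (all prices here are hypothetical placeholders, not quotes):

```python
# Back-of-envelope amortization of in-rack storage.
# All numbers are hypothetical placeholders, not real prices.
rack_nodes = 40                 # compute nodes sharing one rack
storage_node_cost = 20_000.0    # one redundant in-rack storage node

per_node_share = storage_node_cost / rack_nodes
print(f"storage cost amortized per compute node: ${per_node_share:,.0f}")
```

Even a storage node much more expensive than any single compute node adds only a modest per-node cost once shared across the rack.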

Specialized in-rack storage can be designed for redundancy. How that redundancy is implemented does not matter. What matters is cost, and that no single failure means lost data.

Zinger - fast-enough connection(s) to storage

The zinger here is the connection between compute nodes and storage. 1Gb/s ethernet connections are cheap, but not remotely fast enough. A single 10Gb/s network connection is almost fast enough (compared to local flash storage) but - at least at present - rather expensive. If you have to gang multiple 10Gb/s ethernet connections, the cost over a rack is going to add up rather a lot.

Keep in mind, when a packet leaves a box over ethernet, it can be routed anywhere on the planet (and perhaps beyond). That generality has a cost. The connection to in-rack storage is point-to-point, and only has to travel a few feet.

As best I can tell, the cost of a single 10Gb/s ethernet connection (counting both the switch end and the compute-node end) is about the cost of a terabyte of local flash.

To compete with local flash storage, a cheaper connection with more bandwidth is needed.

Keep in mind, cloud hardware was originally built around the economics of mass-market hardware. The mass market is stuck on 1Gb/s ethernet, and this looks unlikely to change any time soon. What is the mass market using for faster connections? At present (thanks to Apple) the faster/cheaper choice is [Thunderbolt][thunderbolt].

As best I can tell, Thunderbolt 2 is (currently) about a tenth the cost of 10Gb/s ethernet (and at twice the throughput).
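To put the options discussed above side by side, here is the raw arithmetic. The flash figures are typical ballpark values of the era, not measurements:

```python
# Usable bandwidth of the connection options discussed above, in MB/s.
# Flash figures are typical ballpark values, not measurements.
MB_PER_GBIT = 1000 / 8          # 1 Gb/s = 125 MB/s

links_mb_s = {
    "1GbE":           1 * MB_PER_GBIT,   #  125 MB/s
    "10GbE":         10 * MB_PER_GBIT,   # 1250 MB/s
    "Thunderbolt 2": 20 * MB_PER_GBIT,   # 2500 MB/s
}
flash_mb_s = {"SATA flash": 550, "NVMe flash": 2000}  # sequential, typical

for name, rate in {**links_mb_s, **flash_mb_s}.items():
    print(f"{name:>13}: {rate:6.0f} MB/s")
```

The point of the table: 1GbE is far below even a single SATA flash device, 10GbE is merely comparable, and only a 20Gb/s link clearly keeps up with fast local flash.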

Thunderbolt is a point-to-point connection, which is exactly what you need for in-rack storage.

This is a bet. Some daring vendor is going to offer both compute nodes and storage nodes connected by Thunderbolt, and offer better performance at a lower price. This matters a lot in the cloud. With computing at large scale (where everyone is headed), efficiencies matter.

Tiers - fault-tolerant compute nodes

The reason "enterprise" hardware exists is there are some applications that simply cannot fail. The cost of failure to the business is very high. That means purchase of (expensive) extremely redundant hardware is well justified.

This use-case does not go away with the cloud. The programmer-model and the APIs used do not change. What changes is the "flavor" used to deploy applications.

What is now counted as "enterprise" grade hardware will be rolled into the cloud. Clouds - especially private clouds - will offer an extremely fault-tolerant tier.

Keep in mind, buying better hardware - however expensive - can be much cheaper for a business than rewriting software.
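The three tiers above can be summed up as a selection rule. A minimal sketch - the tier names and criteria are my own shorthand, not any standard:

```python
# Hypothetical sketch: pick the cheapest tier an application tolerates.
# Tier names and the selection rules are illustrative, not a standard.

def pick_tier(keeps_local_state: bool, crash_consistent: bool,
              cannot_fail: bool) -> str:
    if cannot_fail:
        return "fault-tolerant"       # "enterprise" grade hardware
    if keeps_local_state:
        if crash_consistent:
            return "in-rack-storage"  # survives node loss; storage protected
        return "fault-tolerant"       # cannot survive losing the node
    return "base"                     # no critical local state

print(pick_tier(keeps_local_state=False, crash_consistent=True,
                cannot_fail=False))
```

The rule mirrors the argument above: cost follows how much protection from hardware failure the application actually needs.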


Zinger - the hyperconverged model

The hyperconverged model is simpler for the site deploying a cloud. All nodes have storage, all nodes are equal. Need to scale up? Deploy more nodes.

Zinger - the cost of I/O

The problem is that past the base tier, this model depends on replication of storage between nodes. Replication at scale requires hyper-fast ethernet - and hyper-fast ethernet is very expensive.

The approach of using cheap/fast point-to-point links (as is practical with in-rack storage) is not suitable for the hyperconverged model. Point-to-point connections between each and every compute node in a rack are not especially practical.

Hyperconverged nodes as a solution for anything above the base tier are expensive. Faster ethernet is expensive compared to the low cost of local or in-rack connection to storage. When the overly-general connection costs you the same as a terabyte (or more) of local flash storage, the trade-off is not good.
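The replication cost is easy to sketch. With replication factor R, every local write must be shipped to R-1 other nodes over the network. All numbers below are illustrative, not measurements:

```python
# Hypothetical sketch: replication traffic per hyperconverged node.
# Numbers are illustrative, not measurements.
replication_factor = 3      # each write stored on 3 nodes in total
write_rate_mb_s = 200       # sustained volume writes per node, MB/s

# Each local write is shipped to (R - 1) other nodes over ethernet.
net_mb_s = write_rate_mb_s * (replication_factor - 1)
net_gbit_s = net_mb_s * 8 / 1000
print(f"replication traffic per node: {net_gbit_s:.1f} Gb/s")
```

At these (made-up) rates, replication traffic alone saturates a 1Gb/s link several times over - which is why the model pushes you toward expensive faster ethernet.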

Where hyperconverged works

There are cases where hyperconverged does well.

  1. Premium / "enterprise" grade / cost is (almost) no object tier.
  2. Base tier where replication/redundancy is not needed.
  3. Sites where average rate of volume writes is low.
  4. Sites where a small proportion of applications do a lot of volume I/O.

If you are building a first cloud, and you know the cloud is going to host critical applications that require reliable hardware, then (1) is a good choice. You get first-class support from your vendor. You can safely run all your applications on the cloud. As you gain experience - and collect metrics - you can plan for building more cost-effective tiers ("flavors" in the cloud) to exactly fit your usage.

If you are building a first cloud, and want to keep costs minimal, then (2) is a good choice. Keep in mind some of your applications should not run on such a cloud, as failures are expensive. While somewhat painful, experience running your suite of applications on such a cloud will - over time - help identify applications that need better reliability.

(Note that at present it is not clear if hyperconverged hardware is well supported in OpenStack - at least without help from your vendor.)

If your applications do relatively little I/O to storage (especially writes), then (3) applies. With a small-enough rate of writes, replication (for redundancy) is much less of an issue. You might even be able to scale back the network connections to cheap 1Gb/s ethernet. There likely are some fairly large segments of your applications that fit this model.

If the bulk of your applications fits (3), and a small segment fits (4), then the I/O-intense applications can spread their load across a larger number of (lightly used) nodes.

As your experience and use of the cloud matures, you will know enough to build out racks with the right balance and placement of compute and storage. Well-balanced racks are going to give you better performance for less money.

If the cost of 10Gb/s (and up) ethernet drops radically, then hyperconverged nodes will make sense in more cases. At present, the high cost of ethernet fast enough for connection to storage strongly tilts the balance toward local flash storage.


There are of course lots of hybrids of the above.

I expect a private cloud will offer flavors for each tier. The proportion to each tier will be specific to the site, and will change over time.
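If the cloud is OpenStack, the tiers could surface as flavors. A hypothetical sketch - the flavor names, sizes, and properties here are mine, not from any distribution:

```shell
# Hypothetical flavors, one per tier (names, sizes, properties illustrative).
openstack flavor create --vcpus 4 --ram 8192 --disk 80 \
    --property tier=base t1.base
openstack flavor create --vcpus 4 --ram 8192 --disk 0 \
    --property tier=rackstore t1.rackstore
openstack flavor create --vcpus 4 --ram 16384 --disk 80 \
    --property tier=faulttol t1.enterprise
```

The `tier` property can then drive scheduling, so each application lands on hardware matching the protection it needs.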

The vendor that can get their collective head around what is needed by the customer, and can offer optimal hardware, will win. This is a different game than the past. As history has shown, a large fraction of current vendors will fail to adapt.

This is all fairly straightforward.

[gedanken]: https://en.wikipedia.org/wiki/Thought_experiment
[openstack]: https://www.openstack.org/
[flapping]: https://cloudplatform.googleblog.com/2015/03/Google-Compute-Engine-uses-Live-Migration-technology-to-service-infrastructure-without-application-downtime.html
[thunderbolt]: https://en.wikipedia.org/wiki/Thunderbolt_(interface)
See also: [building a cloud](/weblog/2016/building-a-cloud-1)

[^1]: As you might guess, my degree from University was Physics, and quite a number of my professors were German.