Building a Cloud

2016-07-05

Say you have an empty rack (or racks) in your datacenter and are going to build a cloud. How do you know what to buy?

I have this long-running Gedankenexperiment (thought experiment)[^1] in the back of my mind. If I were to put together a cloud, how would I proceed?

Amazon - an exact answer

If I were at Amazon, obtaining the answer would be fairly straightforward. Amazon keeps lots (and lots) of metrics. I could crunch through the numbers, looking for patterns of usage. If I were building out a general-purpose datacenter:

1. Look for the biggest cluster in the metrics.
2. Design a collection of hardware to efficiently and exactly suit that pattern of use.
3. Assign a named flavor such that instances deployed with that flavor are preferentially deployed to that collection of hardware.
4. Subtract the cluster from the metrics, and repeat from (1).

The collection of hardware for each flavor would be sized in proportion to the amount of usage in past metrics.
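The loop above can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: each metric sample is a hypothetical (vCPUs, RAM) pair per instance, and "clustering" is reduced to rounding RAM into buckets, a crude stand-in for a real clustering algorithm. The `Usage`, `bucket`, and `plan_flavors` names are illustrative only, and the actual design of hardware for each flavor is left to a human.

```python
from collections import Counter
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Usage:
    """One observed instance: vCPUs and RAM in GiB (hypothetical metric shape)."""
    vcpus: int
    ram_gib: int


def bucket(u: Usage, ram_step: int = 4) -> tuple[int, int]:
    """Crude stand-in for real clustering: round RAM up to a multiple of ram_step GiB."""
    return (u.vcpus, ceil(u.ram_gib / ram_step) * ram_step)


def plan_flavors(samples: list[Usage], total_hosts: int, max_flavors: int = 3):
    """Greedy loop: take the biggest cluster, give it a flavor, subtract it, repeat.

    Returns (flavor_name, (vcpus, ram_gib), share_of_samples, hosts) tuples.
    Hosts are sized in proportion to each cluster's share of all samples; rounding
    means they may not sum exactly to total_hosts, and clusters beyond max_flavors
    are left unplanned.
    """
    remaining = Counter(bucket(s) for s in samples)
    total = sum(remaining.values())
    plan = []
    for i in range(max_flavors):
        if not remaining:
            break
        shape, count = remaining.most_common(1)[0]  # step 1: biggest cluster
        share = count / total
        hosts = round(share * total_hosts)          # how much hardware, in proportion to usage
        plan.append((f"flavor-{i + 1}", shape, share, hosts))  # step 3: a named flavor
        del remaining[shape]                        # step 4: subtract, then repeat
    return plan


if __name__ == "__main__":
    observed = [Usage(2, 4)] * 6 + [Usage(8, 32)] * 3 + [Usage(4, 15)]
    for row in plan_flavors(observed, total_hosts=20):
        print(row)
```

Weighting clusters by instance count is one choice among several; weighting by total vCPUs or RAM consumed would size the hardware differently.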

If I were adding to an existing datacenter, I would look for clusters of usage not efficiently served by the existing flavors.
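For an existing datacenter, the same idea becomes a search for usage that the current flavors serve badly. A minimal sketch, assuming hypothetical flavors and (vCPUs, RAM) samples; the "waste" measure (average unused share of CPU and RAM on the best-fitting flavor) and the 25% threshold are assumptions chosen for illustration, not anything OpenStack defines.

```python
from collections import Counter

# Hypothetical existing flavors: name -> (vcpus, ram_gib).
FLAVORS = {"small": (1, 2), "medium": (2, 4), "large": (4, 8)}


def waste(usage: tuple[int, int], flavor: tuple[int, int]) -> float:
    """Average unused fraction of the flavor's vCPUs and RAM (0.0 = perfect fit)."""
    (used_cpu, used_ram), (flv_cpu, flv_ram) = usage, flavor
    return 1 - (used_cpu / flv_cpu + used_ram / flv_ram) / 2


def poorly_served(samples, flavors=FLAVORS, max_waste=0.25):
    """Count usage shapes that no flavor fits, or whose best fit wastes > max_waste.

    Returns (shape, count) pairs, biggest under-served cluster first.
    """
    misfits = Counter()
    for usage in samples:
        # Flavors big enough to hold this usage in both dimensions.
        fits = [f for f in flavors.values() if f[0] >= usage[0] and f[1] >= usage[1]]
        best = min((waste(usage, f) for f in fits), default=None)
        if best is None or best > max_waste:
            misfits[usage] += 1
    return misfits.most_common()


if __name__ == "__main__":
    observed = [(2, 4)] * 5 + [(2, 16)] * 3 + [(1, 6)] * 2
    print(poorly_served(observed))
```

With the example data, (2, 16) fits no flavor at all, and (1, 6) fits only `large`, leaving half of it idle; both would be candidates for a new flavor and matching hardware.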

If you are Amazon - with excellent metrics - building out new hardware is a fairly exact exercise.

Building a first cloud

If you are building the first cloud at your site, you do not have metrics. The exact/Amazon approach does not apply. You know only roughly what applications will first land in your cloud, and in what size. You know even less about how usage will change over time.

This is a truly hairy problem. While setting up OpenStack becomes easier every year, building an OpenStack cloud is still a formidable exercise. Add to this selecting and configuring the hardware for your cloud, and finding (and debugging) the appropriate drivers.

Do it yourself

You could try to build out a cloud using your (limited) in-house expertise. A lot of folk go down this path - and quite a few get badly burned.

The best advice here is to keep your first cloud as simple and vanilla as possible. No exotic hardware. No vendor-specific drivers. No second-class hypervisors. No complex networks. As simple and as close to a mainstream configuration as is possible.

Only once your first cloud is in production should you think about adding complexity.

A specific recipe for a low-risk first OpenStack cloud:

- KVM as the hypervisor.
- Cinder storage as LVM volumes.
  - Flash drives to fill the Cinder storage nodes (for Cinder volumes).
  - Flash drive(s) in each Nova compute node (for ephemeral volumes).
- Network as simple as possible. (As suits your existing in-house expertise.)
  - 1Gb/s network for "public" access (general access to your site's network).
  - 1Gb/s network for "management" (between OpenStack nodes only).
  - 10Gb/s network for Cinder volume I/O (between OpenStack nodes only).

(This could be an article in itself. Maybe one exists, somewhere?)
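The gigabits-versus-gigabytes distinction in those network speeds matters, since a byte is eight bits. A little arithmetic shows why the recipe gives Cinder volume I/O its own 10Gb/s network. This sketch assumes the link speeds are gigabits per second, uses decimal gigabytes, and computes best-case line-rate times that ignore protocol overhead and disk speed; the network labels and the 100 GB volume size are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    name: str
    gbit_per_s: float      # link speed in gigaBITS per second (Gb/s)
    openstack_only: bool   # True if only OpenStack nodes are attached


# The three networks from the recipe (labels are illustrative).
RECIPE = [
    Network("public", 1, openstack_only=False),
    Network("management", 1, openstack_only=True),
    Network("cinder-volume-io", 10, openstack_only=True),
]


def gbyte_per_s(net: Network) -> float:
    """Convert Gb/s to GB/s: 8 bits per byte, so 1 Gb/s is only 0.125 GB/s."""
    return net.gbit_per_s / 8


def seconds_to_move(net: Network, gigabytes: float) -> float:
    """Best-case time to move `gigabytes` at full line rate (no protocol overhead)."""
    return gigabytes / gbyte_per_s(net)


if __name__ == "__main__":
    for net in RECIPE:
        print(f"{net.name:>17}: {gbyte_per_s(net):.3f} GB/s, "
              f"100 GB volume in ~{seconds_to_move(net, 100):.0f} s")
```

At line rate, moving a 100 GB volume takes roughly 800 seconds on a 1Gb/s link and 80 seconds on a 10Gb/s link, which is the gap the dedicated storage network is meant to close.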

Personally, I would spend my time deploying OpenStack-Ansible (OSA), first as an all-in-one, later on the full rack. (There are other worthy alternatives, but at present I would choose OSA.)

This approach, which might spend the least money, can also easily consume the most time.

Buy your first cloud

If you want to get a cloud up in the shortest time, with the least risk - spend money. (This is one of those "you get what you pay for" moments.) Use a vendor with a lot of OpenStack expertise to build your first cloud.

Keep in mind that OpenStack is currently in the early, hockey-stick phase of rapid growth. That means OpenStack expertise is limited and spread pretty thin - even, and especially, among vendors.

Keep your first cloud as simple and vanilla as possible - even when buying from a vendor. Better not to be tempted into a more complex setup. (You could easily run into unexpected behaviors that will eat time.)

Where do you go from here?

Once your first cloud is in production, collect metrics. Look at your site's usage.

- Is memory heavily used, but not CPU? (Add memory.)
- Is CPU heavily used? (Add CPUs.)
- Is the network heavily used? (Add more/faster network links.)
- Is Cinder storage a bottleneck? (Add more/faster storage.)
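The checklist above can be turned into a small rule table. A minimal sketch, assuming you already export average utilization per resource as a fraction between 0 and 1 (from whatever monitoring you run); the resource keys and the 80% "heavily used" threshold are hypothetical. Each resource is judged independently, so combined cases (say, both memory and CPU saturated) are left to your judgment.

```python
# Hypothetical threshold: treat sustained utilization above 80% as "heavily used".
HEAVY = 0.80

# Each measured resource mapped to the remedy from the checklist (keys are illustrative).
REMEDIES = {
    "memory": "add memory",
    "cpu": "add CPUs",
    "network": "add more/faster network links",
    "cinder_storage": "add more/faster storage",
}


def recommend(utilization: dict[str, float], heavy: float = HEAVY) -> list[str]:
    """Return remedies for resources whose utilization (0.0-1.0) exceeds `heavy`,
    most heavily used first. Unknown resource names are ignored."""
    hot = sorted(
        (r for r, u in utilization.items() if u > heavy and r in REMEDIES),
        key=lambda r: utilization[r],
        reverse=True,
    )
    return [REMEDIES[r] for r in hot]


if __name__ == "__main__":
    print(recommend({"memory": 0.92, "cpu": 0.35, "network": 0.85, "cinder_storage": 0.40}))
```

Sorting by utilization puts the worst bottleneck first, which is usually where the next hardware budget should go.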

In essence, you are using the Amazon approach (above), but with a lot less data. :)

Get used to this practice, as you are going to repeat it a lot over time.

See also: [futures](/weblog/2016/building-a-cloud-2)

[^1]: As you might guess, my university degree was in Physics, and quite a number of my professors were German.