There is still a good deal of confusion, even among the IT literate, about what “The Cloud” really is. It doesn’t help that some consumer-oriented companies use the term to publicise their own services: HTC has a cloud, Apple has iCloud, Dropbox lets you store your files in the cloud. The list is almost endless…
To the end user it appears as if all this stuff vanishes into the mythical ether called “the cloud”, but no one actually tells you what the cloud is.
During my time at Microsoft as a PM on Windows Azure I learnt a good bit about clouds, and in this blog post I’m going to try to clear up some of this confusion by giving a quick, high-level overview of what a cloud is, why clouds are going to be more relevant in the future and roughly how they work.
The cloud in two sentences:
Clouds are a way of organising servers which makes them cheaper and more efficient to run. Because this way of organising computers makes launching new services cheaper, clouds are increasingly becoming the backend muscle which powers many of the new consumer services being offered.
Peering into the Mist: What is this new organisation?
Traditionally, as an online service provider we would try to work out how successful we thought our new service was going to be, do some load and stress testing of our development system, and based on this work out how many servers we needed. In this calculation we’d err on the side of caution, allowing for spikes in users and traffic. Then we’d have to go and buy or rent these machines, and configure and set them up. As a company this meant funding the purchase or rental of these machines up front – a large amount of cash for a service which may or may not take off!
The cloud approach is different. It allows us to rent just the machines we need for launch and no more. If the service takes off we can add more machines. In fact nearly all cloud services allow us to add new machines programmatically. An example of an application which makes use of this is SmugMug, an Amazon Cloud hosted service. They experience more users on Sunday night than at any other time, so their service automatically adds new servers when needed and removes them when no longer required. This means SmugMug gives its end users a responsive web site which can cope with demand, while saving as much money as possible by only using the computers it needs, when it needs them.
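An autoscaling rule of this kind can be surprisingly simple at heart. The sketch below is purely illustrative – the capacity figure and the two-server floor are invented numbers, and a real service would go on to call the provider's API to actually start or stop machines:

```python
import math

def desired_server_count(requests_per_minute, capacity_per_server=1000, minimum=2):
    """How many servers the current load calls for.

    capacity_per_server and minimum are made-up example values, not
    anything a real provider prescribes.
    """
    needed = math.ceil(requests_per_minute / capacity_per_server)
    return max(minimum, needed)
```

On a Sunday-night spike of 5,000 requests a minute this asks for five servers; back at 100 requests a minute it drops to the two-server floor.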
Making the Cloud Work
It sounds easy, doesn’t it: “just add the computers you need when you need them”. Getting that working automatically is more challenging than you might at first think. Each of the major cloud hosting companies provides a basic set of tools which allows developers to cope with these challenges.
To allow cloud services to offer new servers at a moment’s notice, each server is offered as a virtual machine. If we want another server the cloud will give us another virtual machine. This machine could be on the same physical computer, or it might not – it could be on any physical computer inside the provider’s cloud. These virtual servers can be set up and torn down automatically via an API, and almost all cloud providers offer such an API today.
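A toy model of that API-driven lifecycle helps make it concrete. Everything here is invented for illustration – real providers expose this over HTTP rather than a local class – but the start/stop shape is the essence of it:

```python
import itertools

class ToyCloud:
    """Invented, in-memory stand-in for a provider's provisioning API."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.running = set()

    def start_vm(self):
        """Provision a fresh virtual machine and return its identifier."""
        vm_id = "vm-{}".format(next(self._ids))
        self.running.add(vm_id)
        return vm_id

    def stop_vm(self, vm_id):
        """Tear the virtual machine down again."""
        self.running.discard(vm_id)
```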
On occasion virtual servers may crash or fail. Most cloud providers allow us to cope with this by watching for errors and automatically kicking off replacement servers.
Cloud providers generally charge at a “per CPU hour” rate, which means we only pay for the computing power we use. This makes it cheaper to run our service: we only have to scale up temporarily when we have a spike in traffic, and we no longer need to own all the machines required to cope with peak traffic.
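A back-of-envelope comparison makes the saving concrete. The rate and the traffic shape below are invented for illustration, not any provider's real prices:

```python
HOURS_PER_YEAR = 24 * 365
RATE = 0.12  # illustrative $ per CPU-hour, not a real price

# Owning for peak: keep 10 servers running all year round.
own_for_peak = 10 * HOURS_PER_YEAR * RATE

# Cloud: 2 servers all year, plus 8 extra during a 6-hour spike each week.
pay_per_hour = (2 * HOURS_PER_YEAR + 8 * 6 * 52) * RATE
```

Under these made-up numbers the elastic bill comes to under a quarter of the fixed one – and the gap only widens the spikier the traffic is.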
Cloud providers make this cost-effective by running lots of virtual servers on the physical servers within large data centres, with many clients sharing the same data centre. Effectively, they gamble that not every customer will hit peak demand at the same time – an unlikely event.
So let’s say we have a spike in traffic and we start up a bunch of web servers – how do we direct new web traffic to each of the new machines? Most cloud operators provide load balancing solutions. In Windows Azure these are VIPs (virtual IP addresses); in Amazon they are Elastic IPs. Different names, but conceptually they offer the same thing: a single static public IP address on the internet, behind which we can have any number of web servers.
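Conceptually, the balancer behind that single address does something like the following. This is a deliberately naive round-robin sketch – real balancers also track which servers are healthy and cope with the pool changing underneath them:

```python
import itertools

class RoundRobinBalancer:
    """One public address in front of many servers, rotating requests."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def route(self):
        """Pick the backend that will serve the next incoming request."""
        return next(self._cycle)
```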
From a developer’s point of view this imposes a number of design constraints. A client web page may make several requests to our static public IP address, but due to load balancing each individual request could be serviced by a different virtual server. Additionally, these virtual servers can be brought up and torn down at a moment’s notice – any server we’re running can vanish at any time. Got data stored on just that one server when it vanishes? Bye bye data! Well that sucks! So how do we store data?
Unified Large Scale Storage
Each cloud provides the idea of a single large storage repository which is accessible from every server within the cloud. This means you can store all your application’s data in a single logical location and access it from any machine. This is great, as it solves the problem of data storage on vanishing servers. It also helps us scale: every new server we create can access the same data, and when we need more space we can just use it – we don’t have to purchase new hard drives or build our own storage area networks (SANs).
Generally each of the cloud providers offers a pretty competitive price per gigabyte for data stored within their storage solution.
Cloud Storage Implementation
To the developer this storage is exposed through a set of provider-specific APIs: Amazon has a set, as do Rackspace and Windows Azure. Each of these companies goes to some lengths to make sure your data is safe. Most will replicate any data you store a number of times in different locations, so that a physical disk or data centre failure will not destroy it.
Most of the data storage APIs on offer provide basic non-SQL access. Typically they offer large tables of name/value pairs (think giant INI files). While not SQL, this type of storage is quick, covers most of the scenarios you are likely to need and is an effective structure for data replication within the cloud infrastructure.
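A dict-backed sketch of that name/value table idea looks like this. The class and method names are invented; the real thing is a remote HTTP service, not a local object:

```python
class ToyTableStore:
    """Stand-in for a cloud name/value table: rows addressed by key, no SQL."""

    def __init__(self):
        self._tables = {}

    def put(self, table, key, value):
        """Store (or overwrite) the value under the given key."""
        self._tables.setdefault(table, {})[key] = value

    def get(self, table, key, default=None):
        """Fetch a value by key; no joins, no queries, just lookup."""
        return self._tables.get(table, {}).get(key, default)
```

Because every operation is a single-key read or write, this shape is easy for the provider to partition and replicate – which is exactly why the big clouds favoured it over SQL.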
Queuing and Messaging
So we have a bunch of servers which can be set up at any time and torn down whenever we don’t need them, all storing data in the same logical location. But how do we communicate between them? We could just use cloud storage, but often this doesn’t suit what we want to do, and with virtual machines being created and torn down rapidly it is conceivable that we could ask a particular machine to process a job only to have that machine fail or be programmatically removed – in which case we may lose the job! To cope with this, most cloud providers offer a message queuing system; both Azure and Amazon do. This allows servers to pass messages between one another in a reliable way, and is often used to submit jobs from front-end web servers to back-end processing servers and back again. If a server falls over, the job can be recovered from the message queue and picked up by another virtual server, resulting in a small delay but no data loss.
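The crash-recovery property is the interesting bit, so here is a small in-memory sketch of it. The names are invented and timeouts are simplified away, but the receive-then-delete shape mirrors how cloud queues behave: a received message stays hidden but undeleted until the worker confirms success, so a crashed worker's job reappears for someone else:

```python
import collections

class ToyQueue:
    """In-memory stand-in for a cloud message queue."""

    def __init__(self):
        self._visible = collections.deque()
        self._in_flight = {}
        self._next_receipt = 0

    def send(self, message):
        self._visible.append(message)

    def receive(self):
        """Take a message; it stays hidden (in flight) until deleted."""
        if not self._visible:
            return None
        message = self._visible.popleft()
        self._next_receipt += 1
        self._in_flight[self._next_receipt] = message
        return self._next_receipt, message

    def delete(self, receipt):
        """Acknowledge success: the job is gone for good."""
        self._in_flight.pop(receipt, None)

    def requeue_expired(self):
        """Simulate a crashed worker: in-flight messages become visible again."""
        for message in self._in_flight.values():
            self._visible.append(message)
        self._in_flight.clear()
```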
Designing Applications for Scaling in the Cloud
At this point it is worth considering some of the design implications mentioned above. Designing applications for deployment within a cloud requires a bit of upfront consideration. We need to design for:
· A highly concurrent environment where machines can be set up and pulled down at any time:
o Don’t store data on the Virtual machine, use cloud storage
· Static IP addresses for groups of servers (VIP / Elastic IP) and load balancing
o Don’t store state on servers; create jobs which can be considered atomic – once a job completes, its data can be considered consistent and persisted in the large store
· Message Queues
o Structure servers into groups behind static IP addresses and load balancers
o Place atomic jobs on a queue and let many consumers pick them up, or submit them to groups of servers via their static IP address
· Large, cheap centralised storage
o Store all application data in the large centralised storage where it can be backed up and made secure by the cloud provider
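Put together, the pattern for a worker is: pull an atomic job, persist the result to central storage, and only then let the job go. A deliberately bare sketch, with a plain Python list and dict standing in for the real queue and store:

```python
def process_one(queue, store):
    """Run one atomic job: take it off the queue, persist the result centrally.

    `queue` is a list of (job_id, payload) pairs and `store` a dict —
    stand-ins for a cloud message queue and cloud storage.
    """
    if not queue:
        return False
    job_id, payload = queue.pop(0)
    store[job_id] = payload.upper()  # the "work": anything deterministic
    return True
```

Because the worker keeps no state of its own, any number of identical copies can run behind the load balancer, and any of them can vanish without losing data.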
At the top of this blog post I said I was going to cover what a cloud is, why clouds are going to be more relevant in the future and an overview of how they work, and I have:
· Shown the way clouds offer a new way of organising computers which makes them cheaper and more efficient to run.
· Shown the economic benefit of clouds, and hence why they will become more relevant in the future.
· Briefly illustrated the main tools clouds provide (virtual machines, load balancing, storage, message queues), when they are necessary and roughly how the cloud works.
I hope you find this blog post useful.