One of the early advantages of migrating off traditional shared hosting, a VPS or dedicated hosting and onto a platform such as AWS is the ability to autoscale. This sounds like a bit of a magical concept, so what do we mean?
Well, the basic concept is that the compute resources available to you can increase when demand increases (meaning you can serve more people) and decrease when demand drops (meaning you can spend less money).
The ability to scale on demand is a key cloud benefit - no longer does an online retailer need to pay for a 12-month contract on some huge servers that sit idle most of the year apart from the run-up to Christmas. Instead, deploy just the servers you need to handle expected load - most cloud providers charge by the hour or minute. Autoscaling takes this a step further by automating the process based on some metric from your servers.
It’s a bit of a jump from being told “this service can autoscale” to you actually being able to utilise it, especially if you’re migrating onto AWS from a more basic environment. As much as AWS might like to pretend there’s a magic button that just makes your servers never fall over again (there is, but it’s called rewrite your entire application to work with AWS Lambda) you’ll still need to use some good old fashioned configuring, along with some trial & error, to get it set up.
Before we can autoscale, we need to be able to scale at all, and that’s out of scope for this article. What you’ll need to start with is:
- You’re running EC2 instances via an Autoscaling group
- You have the ability for your application to run on 2 or more servers simultaneously. This will require that:
  - The application doesn’t rely on data saved to the server’s hard drive (an easy way to fix this if it does is to use Elastic File System and have the server symlink content directories onto an EFS volume at boot using user data)
  - Traffic is served through a load balancer and the target group is hooked up to the autoscaling group
  - Session storage uses a separate service (such as a key-value store like Redis, Memcached or DynamoDB) or sticky sessions are enabled on the target group of the load balancer
To test your setup you should be able to, with a running web application, launch a new server, wait until it’s marked “healthy” by the load balancer, then terminate an existing server, and the application should remain reachable throughout (you might want to check this on a staging setup if you’re not sure about terminating a production server - but there’s quite a freedom in knowing that a production server can go away and users won’t notice).
So, where’s that magic we talked about? Your site gets linked by Obama’s Twitter account, top page of Hacker News, or you happen to sell the hottest toy of the season and it’s nearly Christmas - how do you start scaling automatically? Not just up (to sustain increased load) but down (to keep your costs as low as possible) too!
It all starts with CloudWatch alarms: we create an alarm that watches a specific “metric”. A metric is a series of datapoints collected by CloudWatch, often with different “dimensions”. This all sounds a bit vague, so here’s an example:
- A metric named “CPUUtilization” in the “AWS/EC2” namespace
- It has dimensions such as individual instance IDs, or Autoscaling Group Names
Metrics don’t allow access to the raw data but instead provide aggregate values. E.g. a CPUUtilization series for an instance can give average, max, min and various percentile values down to the minute or (for extra cost) down to the second. You can also see aggregates as a sum (useful for discrete metrics such as errors logged or requests served) and a count of samples in a given time range.
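To make “aggregate values” a bit more concrete, here’s a minimal Python sketch of the kind of summary you get back for one time range. The `aggregate` function and the sample numbers are made up for illustration, and the percentile uses a simple nearest-rank rule, which is an assumption on my part rather than a description of how CloudWatch computes its own percentiles.

```python
import math
from statistics import mean


def aggregate(samples, percentile=90):
    """Summarise raw samples for one time range (illustrative only)."""
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` percent of all samples at or below it.
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return {
        "Average": mean(ordered),
        "Minimum": ordered[0],
        "Maximum": ordered[-1],
        "Sum": sum(ordered),
        "SampleCount": len(ordered),
        f"p{percentile}": ordered[rank - 1],
    }


# Hypothetical CPU readings (percent) collected within one period.
cpu_samples = [12.0, 15.0, 11.0, 90.0, 14.0]
stats = aggregate(cpu_samples)
```

Notice how differently the same five readings look depending on the statistic: the average sits at 28.4 whilst the maximum is 90, which matters a lot once an alarm is attached to one or the other.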
Metrics are sent automatically to CloudWatch by a range of AWS services, and for a monthly cost it is possible for applications to send custom metrics as well.
It is these metrics that can be used to examine the health of an autoscaling group serving a web application.
The most obvious metric to jump at for autoscaling (and subject of many tutorials) is CPU usage. By a broad definition, if the CPU usage of an instance is too high for too long, it’s probably time to use more instances.
Whilst this can be true, it ignores the multifaceted nature of what our servers might be doing. For example, an instance screaming along at 90% CPU usage might just be doing work as fast as it can - if lots of requests are coming in and we have the capacity to serve those as fast as possible, high CPU usage isn’t a problem. Furthermore, instances may have other causes for high load - unattended upgrades for security patches, compressing a file that’s sent down to a user, encoding an image.
If we peg an alarm to the max CPU load of an autoscaling group we might find servers starting up for brief spikes that then settle down - but if we require a long period of high CPU we might already have degraded response time before we scale up.
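That trade-off is easier to see with numbers. The sketch below is a simplified model of an alarm that fires after a given number of consecutive readings over a threshold; `first_alarm_index` and the two reading series are hypothetical, and real CloudWatch alarms have more options than this.

```python
def first_alarm_index(datapoints, threshold, evaluation_periods):
    """Return the index at which the alarm would first fire, or None.

    The alarm fires once `evaluation_periods` consecutive datapoints
    are above `threshold`.
    """
    streak = 0
    for index, value in enumerate(datapoints):
        streak = streak + 1 if value > threshold else 0
        if streak >= evaluation_periods:
            return index
    return None


# Hypothetical one-minute max CPU readings (percent).
brief_spike = [30, 95, 35, 30, 32, 31]   # one noisy minute
sustained = [30, 85, 90, 92, 95, 96]     # genuine, lasting load

# A single period reacts to the noise...
spike_short = first_alarm_index(brief_spike, 80, 1)
# ...three periods ignore it...
spike_long = first_alarm_index(brief_spike, 80, 3)
# ...but also delay the reaction to real load by two extra minutes.
load_short = first_alarm_index(sustained, 80, 1)
load_long = first_alarm_index(sustained, 80, 3)
```

With one evaluation period the brief spike triggers a scale-up at minute 1; with three it never does, but the sustained load isn’t acted on until minute 3 rather than minute 1.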
Latency sits at the other end of the spectrum - latency is the thing you want to avoid, and generally running more application servers is the way to do so. Application Load Balancers provide a TargetResponseTime metric, so if the average here goes too high, new servers could be helpful. Unfortunately this falls down for very low traffic periods - a single user uploading a 10MB profile image over a poor connection might mean 10s or more for a response, and if that’s the only request in a few minutes suddenly the average shoots up. This metric is also a lagging indicator - by the time latency increases the application is already degraded, and given new servers take minutes to provision this could be a noticeable impact for users.
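A quick bit of arithmetic shows how badly a single slow request skews a quiet period. The request durations here are invented for the example.

```python
from statistics import mean

# Hypothetical response times in seconds for one period.
quiet_period = [0.2, 0.3, 10.0]          # two normal requests, one slow upload
busy_period = [0.2] * 299 + [10.0]       # the same slow upload amongst 299 others

quiet_average = mean(quiet_period)
busy_average = mean(busy_period)
```

The quiet period averages 3.5 seconds, which looks like an emergency, whilst the busy period averages well under a quarter of a second despite containing exactly the same slow upload.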
There’s no right answer, but I’ve found that a reasonable choice for sites served using Apache is the size of the current worker pool. This reflects traffic directly, as the pool per server expands with inbound requests, and unlike CPU utilization it is less likely to be affected by other services running on the machine. It’s also a less laggy metric than response time, as the pool will increase in size before requests become slow to fulfil (requests tend to slow once the pool hits its max size).
By default, of course, AWS has no idea what you’re running on the application server. So to make this work we’ll need a custom metric, which is relatively easy to grab with a simple bash script:
Set this on a schedule (e.g. using cron, or supervisord to run it more regularly) and AWS can now track your Apache workers.
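For anyone who’d rather not use bash, here’s a rough Python sketch of the same idea: count the Apache processes and push the number to CloudWatch. Treat the details as assumptions - the process name `apache2` (it’s `httpd` on some distributions), the `Custom/Apache` namespace and the helper names are all mine, the count includes Apache’s parent process, and `publish` needs boto3 installed plus credentials allowed to call PutMetricData.

```python
import subprocess


def count_workers(ps_output, process_name="apache2"):
    """Count processes with the given command name in `ps -eo comm=` output."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == process_name)


def build_metric_data(worker_count, group_name):
    """Shape one datapoint as a MetricData entry for PutMetricData."""
    return [{
        "MetricName": "ApacheWorkers",
        # Tagging by group means an alarm can average across all its servers.
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group_name}],
        "Value": worker_count,
        "Unit": "Count",
    }]


def publish(group_name, namespace="Custom/Apache"):
    """Read the process table and send the current count to CloudWatch."""
    import boto3  # imported here so the helpers above work without it

    ps_output = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    ).stdout
    count = count_workers(ps_output)
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=build_metric_data(count, group_name),
    )
    return count
```

Calling `publish("my-asg-name")` from a scheduled job would send one datapoint per run; the group name is a placeholder for whatever yours is called.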
All that remains is to consider the alarms to trigger autoscaling. We’ll generally need two alarms. One triggers under the “things are going wrong” condition, and tells the Autoscaling Group to launch a new instance. The other is an “everything’s OK” alarm, which sits in the alarm state for as long as load is comfortably low and tells the group it can remove an instance.
It’s common to use the same metric for scaling up as down, but usually leaving a gap between the thresholds - so that we don’t just bounce between scaling up and down as traffic spikes and dips. Autoscaling actions can also have a cooldown on them, and setting longer cooldowns on scaling down means we don’t just remove a bunch of servers too quickly for a temporary lull.
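Here’s a small simulation of why the gap and the longer scale-down cooldown matter. It’s a deliberately simplified model (real Auto Scaling cooldowns have more nuance, and adding servers would itself lower the per-server reading); the `simulate` function, thresholds and readings are all hypothetical.

```python
def simulate(readings, out_above, in_below, out_cooldown, in_cooldown,
             start=2, minimum=1):
    """Return the instance count after each per-minute metric reading."""
    instances = start
    last_change = None
    history = []
    for minute, value in enumerate(readings):
        # Minutes since the last scaling action (infinite if none yet).
        since = float("inf") if last_change is None else minute - last_change
        if value > out_above and since >= out_cooldown:
            instances += 1
            last_change = minute
        elif value < in_below and since >= in_cooldown and instances > minimum:
            instances -= 1
            last_change = minute
        history.append(instances)
    return history


# Average workers per server: a spike, then traffic wobbling around 50.
wobbly = [65, 45, 55, 45, 55, 45]
# One shared threshold and short cooldowns: we flap every minute.
flapping = simulate(wobbly, out_above=50, in_below=50, out_cooldown=1, in_cooldown=1)
# A gap between thresholds: we scale up once and then stay put.
steady = simulate(wobbly, out_above=60, in_below=30, out_cooldown=1, in_cooldown=5)

# A spike followed by a lull: the longer scale-down cooldown holds
# the extra server for five minutes before letting it go.
lull = [65, 20, 20, 20, 20, 20, 20]
patient = simulate(lull, out_above=60, in_below=30, out_cooldown=1, in_cooldown=5)
```

The flapping run bounces between 3 and 2 instances on every reading, the gapped run goes to 3 and stays there, and the lull run only drops back to 2 at minute 5.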
These alarms can be added via the console but the best way to manage your AWS resources is via an Infrastructure as Code (IaC) tool such as Terraform. I’ve written about Terraform before and won’t go into the setup now, but here are the basics of some alarm code we can use to control our autoscaling group using the aforementioned ApacheWorkers metric (see inline comments for what some things do):
# Using a locals block lets us declare some strings
Building this into a larger Terraform configuration would allow us to deploy these alarms to begin working as soon as our metric is being sent.
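If you want to experiment outside Terraform, the same pair of alarms can be roughed out in Python as arguments for boto3’s `put_metric_alarm`. This is an illustration of the shape rather than the Terraform configuration referred to above: the `Custom/Apache` namespace, the thresholds of 60 and 30, the alarm names, the group name and the policy ARN placeholders are all assumptions you’d swap for your own.

```python
def alarm_kwargs(name, comparison, threshold, periods, policy_arn, group_name):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": name,
        "Namespace": "Custom/Apache",      # hypothetical custom namespace
        "MetricName": "ApacheWorkers",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group_name}],
        "Statistic": "Average",            # average workers per server
        "Period": 60,                      # seconds per datapoint
        "EvaluationPeriods": periods,
        "ComparisonOperator": comparison,
        "Threshold": threshold,
        "AlarmActions": [policy_arn],      # the scaling policy to trigger
    }


# Scale up quickly when the pool is busy...
scale_out = alarm_kwargs(
    "apache-workers-high", "GreaterThanThreshold", 60, 2,
    "SCALE_OUT_POLICY_ARN", "web-asg",
)
# ...and scale down slowly, with a lower threshold leaving a gap.
scale_in = alarm_kwargs(
    "apache-workers-low", "LessThanThreshold", 30, 10,
    "SCALE_IN_POLICY_ARN", "web-asg",
)
```

Each dictionary could then be passed along as `boto3.client("cloudwatch").put_metric_alarm(**scale_out)`, assuming boto3 is installed and the scaling policies already exist.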
Of course if the servers break completely the metric may also stop sending, in which case scale ups won’t trigger - for this reason it’s good to ensure that if the servers are attached to a load balancer then “ELB health checks” rather than “EC2 health checks” are used by the autoscaling group. This means that if the server stops responding to web requests for long enough to be considered “unhealthy” by a load balancer, it will be terminated by the autoscaling group. It’s worth noting that this could result in premature termination if the code serving the health check breaks - it may be worth having health checks sent to a separate code base to prevent application errors taking servers down continually.
It may not be the case that you’re ready to roll with this straight away - you might be further back in terms of running across multiple servers, or using different server software. But hopefully some of the principles here will be useful for those who are already able to scale manually, are running Apache as their web server, and want to start autoscaling.
If you’re interested in more about the basics of Terraform, or the basics of web serving on EC2 (covering AMIs, autoscaling and load balancing) then let me know and I’ll try writing some more articles on the topics!