In our previous post, we covered the work done to make the Major League Soccer (MLS Digital) platform use its infrastructure efficiently. Much of that work was actually done after we helped bring Amazon Web Services (AWS) autoscaling back online. A key factor in autoscaling is the time it takes to bring capacity online, because it determines how far in advance you need to confidently predict that you’ll need the additional capacity.
On the surface, EC2 autoscaling is a very simple concept. You create an autoscaling group and a launch configuration, and in those two actions you specify some key pieces of information.
In Launch Configuration:
- The Amazon Machine Image (AMI) you would like to use
- The type of instance you’d like to launch (in our case c4.xlarge)
- Networking information (Security groups, VPC, IPs and subnets, etc)
- Name, comments, etc.
- Public Keys for connecting with SSH
In the Autoscaling Group:
- The number of instances you would like to spin up (minimum, maximum, and ‘desired’)
- Scaling policies (essentially, which CloudWatch metrics and threshold values trigger scaling up and down)
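To make these fields concrete, here is a minimal Python sketch that assembles them as keyword arguments in the shape that boto3's `create_launch_configuration` and `create_auto_scaling_group` calls accept. The function names, resource names, and IDs are hypothetical and are not MLS's actual configuration.

```python
def launch_configuration_params(name, ami_id, key_name, security_group_ids):
    """Build kwargs for boto3's autoscaling create_launch_configuration call."""
    return {
        "LaunchConfigurationName": name,
        "ImageId": ami_id,                 # the AMI to launch
        "InstanceType": "c4.xlarge",       # instance type named in this post
        "KeyName": key_name,               # key pair used for SSH access
        "SecurityGroups": list(security_group_ids),
    }


def autoscaling_group_params(name, launch_configuration_name, subnet_ids,
                             min_size, max_size, desired):
    """Build kwargs for boto3's autoscaling create_auto_scaling_group call."""
    if not min_size <= desired <= max_size:
        raise ValueError("desired capacity must sit between min and max")
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_configuration_name,
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
        # The API expects subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }
```

With an autoscaling client from `boto3.client("autoscaling")`, each dictionary could then be passed along as `client.create_launch_configuration(**params)` or `client.create_auto_scaling_group(**params)`.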
Refactoring the Deployment System
For a complete, in-depth look at configuring autoscaling, visit AWS’ getting started with Autoscaling. The biggest impediment to autoscaling from MLS’s perspective was their codebase. MLS ran every club site as an individual docroot. This has advantages and disadvantages. The biggest advantage was being able to pin a club site to a specific build number. A major downside was a 2500MB deployed footprint. At the time, we deployed code based on tarball builds, and a compressed version of this footprint was still approximately 600MB.
In MLS’s first iteration of autoscaling, each instance would come online and run SaltStack to provision itself and deploy the code. Configuring the instance and deploying the code took too long and was error-prone, so the autoscaling processes would wind up terminating the instance before it could fully come online. As a result, MLS had disabled autoscaling and was manually spinning up additional instances before big events.
After examining the autoscaling failures, we decided to refactor the deployment system. The deployment workflow was:
- A Jenkins job that built a compressed code tarball and pushed it to S3
- A Salt command, targeting the web servers, to trigger the deploy. Using Jinja templates, this command would generate and execute a bash script to download the tarball and deploy it.
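The second step, rendering a deploy script from a template, can be sketched roughly as follows. This sketch uses Python's standard-library `string.Template` as a stand-in for Jinja, and the bucket name, S3 key layout, and docroot path are assumptions made for illustration rather than details of the original system.

```python
from string import Template

# Hypothetical deploy script; the S3 key layout and paths are illustrative.
DEPLOY_SCRIPT = Template("""\
#!/bin/bash
set -euo pipefail
aws s3 cp "s3://${bucket}/builds/${build_id}.tar.gz" "/tmp/${build_id}.tar.gz"
mkdir -p "${docroot}"
tar -xzf "/tmp/${build_id}.tar.gz" -C "${docroot}"
rm -f "/tmp/${build_id}.tar.gz"
""")


def render_deploy_script(bucket, build_id, docroot):
    """Fill in the template; raises KeyError if a placeholder is left unset."""
    return DEPLOY_SCRIPT.substitute(
        bucket=bucket, build_id=build_id, docroot=docroot
    )
```

Every new instance had to download and unpack the full tarball this way at boot, which is the cost the refactoring set out to remove.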
Boosting Site Efficiency
We decided to switch to building our own AMIs during the build process so that new instances would spin up with the codebase ready to go. We chose Packer to do this job. Packer is a tool for creating machine and container images for multiple platforms from a single source configuration. It is quite versatile, and it would allow us to create images for both AWS and VirtualBox (to be used in our developer Vagrant machines). Packer uses a single JSON configuration file. Within the configuration file you can configure multiple AWS availability zones, security groups, etc.
Using Packer, we did the following:
- Deploy a temporary surrogate instance based on a prebuilt AMI (taken from a running web server)
- Provision surrogate instance using SaltStack
- Deploy the codebase to the surrogate (again with SaltStack)
- Perform housekeeping on surrogate (delete logs, bash histories, temporary SaltStack ID, etc)
- Create AMI using MLS BuildID as its identifier.
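The steps above map fairly directly onto a Packer template. The following Python sketch generates one, assuming the `amazon-ebs` builder, the `salt-masterless` provisioner, and a `shell` provisioner for housekeeping. The AMI naming scheme, SSH user, state tree path, and cleanup commands are hypothetical, and the original setup may have driven SaltStack differently.

```python
import json


def packer_template(build_id, source_ami, region, security_group_ids):
    """Return a Packer template (as a dict) that bakes the code into an AMI."""
    return {
        "builders": [{
            "type": "amazon-ebs",
            "region": region,
            "source_ami": source_ami,          # prebuilt AMI for the surrogate
            "instance_type": "c4.xlarge",
            "ssh_username": "ec2-user",        # assumption: depends on the AMI
            "security_group_ids": list(security_group_ids),
            "ami_name": f"web-build-{build_id}",  # build ID as the identifier
        }],
        "provisioners": [
            # Provision the surrogate and deploy the code with Salt states.
            {"type": "salt-masterless", "local_state_tree": "salt/states"},
            # Housekeeping before the image is captured.
            {"type": "shell", "inline": [
                "sudo find /var/log -type f -name '*.log' -delete",
                "rm -f ~/.bash_history",
                "sudo rm -f /etc/salt/minion_id",
            ]},
        ],
    }


def write_template(path, template):
    """Serialize the template so it can be passed to `packer build`."""
    with open(path, "w") as handle:
        json.dump(template, handle, indent=2)
```

Writing the result to a file and running `packer build` against it would then produce an AMI named after the build.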
We used the AWS CLI to create a launch configuration that included the newly built AMI, and then updated the autoscaling group to use it. Since all of the heavy lifting of provisioning was handled in the surrogate VM, Packer enabled us to spin up new instances with the latest codebase in less than thirty seconds.
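As a rough illustration of that rollout, here is the same two-call sequence written against boto3 rather than the AWS CLI. The `client` argument is assumed to be an autoscaling client such as `boto3.client("autoscaling")`, and the launch configuration naming scheme is hypothetical.

```python
def roll_out_ami(client, build_id, ami_id, group_name,
                 security_group_ids, key_name):
    """Point an existing autoscaling group at a freshly built AMI."""
    launch_config_name = f"web-{build_id}"

    # Launch configurations are immutable, so each build gets a new one.
    client.create_launch_configuration(
        LaunchConfigurationName=launch_config_name,
        ImageId=ami_id,
        InstanceType="c4.xlarge",
        KeyName=key_name,
        SecurityGroups=list(security_group_ids),
    )

    # Instances launched from now on use the new AMI; running ones are untouched.
    client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        LaunchConfigurationName=launch_config_name,
    )
    return launch_config_name
```

Because the group is only re-pointed, existing instances keep running the old build until they are replaced.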
Fine Tuning Infrastructure and Reducing Hosting Costs
The next step was to tune the autoscaling group properly. We wanted to maintain a balance and ensure that a new instance would be spawned only when it was actually needed. Thankfully, we had plenty of empirical data from CloudWatch and from the instances running in the legacy environment.
Using this historical data, we configured the group to autoscale when CPU was above 40% for more than ten seconds. This setting was vetted during a weekend of matches. After the weekend, we analyzed the number of autoscaling events that had occurred and determined that we should raise the threshold to 50% CPU over the same ten-second duration.
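One simple way to reason about a threshold change like this is to replay historical CPU samples and count how often each candidate threshold would have fired. The sketch below does that under simplifying assumptions: samples are evenly spaced, and a trigger fires once each time CPU stays above the threshold for a given number of consecutive samples. The function name and the sample values in the usage note are made up for illustration and are not MLS data.

```python
def count_scale_out_triggers(cpu_samples, threshold, sustained_samples):
    """Count sustained breaches of `threshold` in a series of CPU percentages."""
    triggers = 0
    run = 0  # length of the current streak of samples above the threshold
    for cpu in cpu_samples:
        if cpu > threshold:
            run += 1
            # Fire once per streak, at the moment it becomes long enough.
            if run == sustained_samples:
                triggers += 1
        else:
            run = 0
    return triggers
```

Calling it on a series such as `[30, 45, 46, 47, 30, 55, 56, 57, 20]` with thresholds of 40 and 50 shows how raising the threshold filters out the milder of the two spikes.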
Packer worked very well, but we did run into one major issue. After an upgrade, we noticed that our surrogate instances couldn’t be reached by SSH. After some thorough troubleshooting, we determined that a bug caused Packer to accept only the first security group specified by the “security_group_ids” parameter. A member of our team analyzed the code and came up with a fix. Once we verified that it worked, he submitted his patch, and the bug was fixed.
Overall, 2015 was a year of significant improvement for MLS Digital. By the end of 2015, everything was running in AWS and their monthly AWS bills were a quarter of what they once were. In early 2016, they were able to shut off their legacy hosting provider and achieve a considerable reduction in their monthly hosting costs. The infrastructure is also more stable, and the number and severity of alerts that the MLS Digital tech team receives have dropped.
If you’d like to learn more about Phase2’s work with MLS Digital, check out this video.