Sports-related sites have a variety of properties that make them very interesting from a DevOps perspective. They tend to see dramatic swings in traffic during events, which gives rise both to a need for automated scaling and a need to tune performance aggressively to keep scaling costs under control. They also tend to have features, such as live scoring and stats updates, that complicate common performance strategies such as caching.
A great example of this is the migration work we did with Major League Soccer’s Digital team (MLS Digital) in 2015. The migration took them from a Drupal 6-based solution hosted at an ISP (MP6) to a Drupal 7-based solution running in AWS (MP7). You can read more about why MLS chose AWS in their blog post on the subject.
To understand some of the challenges that MLS faced, it is important to know that each soccer club operates independently, managing their own content, and has their own instance of Drupal. In aggregate, during peak utilization, the platform averaged approximately one million page views daily. Additionally, the 2015 MLS schedule featured some interesting innovations: the playoff bracket expanded to 12 teams and an event called Decision Day capped off the end of the regular season.
On Decision Day, every team in the league faced off against one another over the course of four hours. The results of these matches would determine the final entrants into the top levels of the Audi MLS Cup playoffs, and we knew that traffic would be considerably higher than on normal game days.
High-Functioning Autoscaling Platform
Getting to a high-functioning autoscaling platform requires, at minimum, three things.
First, you must be able to determine the health of your infrastructure, so you know when additional capacity needs to be added.
Second, capacity needs to come online quickly to keep your site responsive to visitors.
Finally, you need to make efficient use of your capacity to keep costs from skyrocketing.
Determining the Health of the Infrastructure
Our first task was to determine the root cause of observed platform instability. MLS suffered seemingly random outages in which a server would stop responding to end users while its health checks continued to pass. That combination, passing health checks but failed responses for end users, meant there were at least two issues to find. First, why were the health checks passing when they should have failed? Second, what was causing the server to stop responding to end users?
Real Time Sports Data
Sports sites frequently need score and stat data that is as close to real time as possible. As visitor volume increases, however, truly real-time data can become very expensive to deliver. One technique often used to reduce costs is microcaching: caching content for a very short period of time, perhaps as little as one second.
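In NGINX, a microcache can be sketched with the `fastcgi_cache` directives. The cache path, zone name, backend address, and one-second TTL below are illustrative assumptions, not the actual MLS configuration:

```nginx
# Illustrative microcache: a small shared-memory zone backed by disk.
fastcgi_cache_path /var/cache/nginx/microcache levels=1:2
                   keys_zone=microcache:10m max_size=256m inactive=10s;

server {
    listen 80;

    location ~ \.php$ {
        fastcgi_pass 127.0.0.1:9000;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;

        fastcgi_cache microcache;
        fastcgi_cache_key "$scheme$request_method$host$request_uri";
        # Cache successful responses for just one second.
        fastcgi_cache_valid 200 1s;
        # Keep serving the stale copy while one request refreshes it.
        fastcgi_cache_use_stale updating;
    }
}
```

Even a one-second TTL means that, under heavy load, the vast majority of requests never reach PHP at all, while visitors still see scores that are at most a second old.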
After extensive troubleshooting, we discovered that the NGINX microcache configuration was being applied to the health check page. Worse, it was caching content for two hours rather than the intended few seconds. This explained why health checks were passing when they should have failed, and correcting it would allow failing servers to be taken out of rotation. Simply fixing this, however, would not be sufficient: with autoscaling enabled, costs would rise dramatically as replacement servers were spun up and added to the pool, and with autoscaling disabled, every server would eventually be pulled due to its failing status.
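One way to correct this kind of problem is to exempt the health check endpoint from the cache entirely, so the load balancer always sees a live response. The path and backend address here are hypothetical:

```nginx
# Illustrative fix: the health check must never be served from cache.
location = /health-check.php {
    fastcgi_pass 127.0.0.1:9000;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;

    # Skip both cache lookup and cache storage for this endpoint.
    fastcgi_cache_bypass 1;
    fastcgi_no_cache 1;
}
```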
Now that we could reliably identify a failing instance, we were able to isolate one for further analysis into why the failure was occurring. With an isolated failing instance in hand, we determined that PHP-FPM processes were terminating and respawning over and over. Using the cgi-fcgi utility, we sent raw FastCGI requests directly to PHP-FPM and confirmed that it was the source of the issue.
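Talking to PHP-FPM directly over FastCGI, bypassing NGINX entirely, looks roughly like this. This sketch assumes PHP-FPM is listening on 127.0.0.1:9000 and that the pool has a status page enabled via `pm.status_path = /status`; both are illustrative:

```shell
# Query PHP-FPM's status page over raw FastCGI, with no web server involved.
SCRIPT_NAME=/status \
SCRIPT_FILENAME=/status \
REQUEST_METHOD=GET \
cgi-fcgi -bind -connect 127.0.0.1:9000
```

If this request hangs or fails while NGINX is removed from the picture, the problem is in PHP-FPM itself rather than in the web server in front of it.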
We weren’t able to determine a definitive cause, but we had seen similar cases before related to how PHP-FPM was communicating with NGINX. We switched PHP-FPM from listening on a Unix socket to listening on a TCP port to see whether this prevented the issue from occurring. We then ran some small-scale load tests against the instance, and after some time felt confident that the random outage issue was resolved.
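The socket-to-TCP switch is a two-line change, one in the PHP-FPM pool configuration and one in NGINX. The socket path and port shown are illustrative, not the actual MLS values:

```ini
; PHP-FPM pool config (e.g. www.conf)
; Before: listen = /var/run/php-fpm.sock
listen = 127.0.0.1:9000
```

```nginx
# Matching NGINX change
# Before: fastcgi_pass unix:/var/run/php-fpm.sock;
fastcgi_pass 127.0.0.1:9000;
```

Unix sockets avoid TCP overhead and are usually slightly faster, but loopback TCP can behave more predictably under heavy connection churn, which is why it was worth trying here.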
With that work done we had achieved the first requirement of a high functioning autoscaling platform. We could reliably determine the platform’s health so that we knew when we needed to add capacity.
In my next post, I’ll talk about the work we did to make sure that the capacity was being used efficiently.