Avoiding Flat Tires in your Web Application

Steven Merrill, Director of Devops

This Monday, the Citi Bike bike share launched in New York City. The website was beautiful and responsive, and more than 15,000 registrations were processed through it before the launch.

But then, a funny thing happened. The maps on the website and in the mobile apps started coming up blank, and they stayed blank for more than 12 hours. What follows should not be characterized as a failing of the technical team. Launches are tough, and I don't mean to pile on in what was obviously a difficult situation. Instead, I would like to look at a few choices that were made and how the system might have been better architected for scalability.

Architecture Spelunking

I have no intimate knowledge of the architecture of the Citi Bike NYC application and hosting infrastructure, but we can find out a couple of things very quickly.

The misbehaving path in question was http://www.citibikenyc.com/stations/json. With that information in hand, we can use the dig and curl commands to get a lot of information.

  ┌┤smerrill@lilliputian-resolution [May 30 18:23:35] ~
  └╼ dig +short www.citibikenyc.com
  70.32.89.47
  70.32.83.162

It looks like there are two web servers, and the site is using DNS-based round-robin load balancing. Let's take a look at the (now-fixed) JSON feed with curl -v. Using curl -v instead of curl -I ensures that we send an actual GET request rather than a HEAD request, keeping the behavior as close to a real browser's as possible.

  ┌┤smerrill@lilliputian-resolution [May 31 12:11:59] ~
  └╼ curl -v www.citibikenyc.com/stations/json > /dev/null
  * About to connect() to www.citibikenyc.com port 80 (#0)
  *   Trying 70.32.83.162...
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* connected
  * Connected to www.citibikenyc.com (70.32.83.162) port 80 (#0)
  > GET /stations/json HTTP/1.1
  > User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
  > Host: www.citibikenyc.com
  > Accept: */*
  >
  < HTTP/1.1 200 OK
  < Server: nginx
  < Date: Fri, 31 May 2013 16:11:59 GMT
  < Content-Type: text/html
  < Transfer-Encoding: chunked
  < Connection: keep-alive
  < Vary: Accept-Encoding
  < Set-Cookie: ci_session=5mh4HirmjGbrsFaoFoNf5I3MOe5%2FgpGTtterFR5ATyZIoSQewcZTqEk8CmTo1Y6A%2Bv29mRsaV7wmtMDot42z3Qo5Om5MBEVIWhVCLsBGjSWWNmyFXc4UVbNpLKIUYM3LjeP08uWGQCw642sOgQaaZURYQlUoyBx%2F6KffPECV7IPgT7Lw8G%2FUJYzMDUHXdQDzfRenxAuMZmLpt%2BBWxUEZqCs87VWaYzhQEFnXwxSAcm4VtowNMdZZHfc8Rcw%2FSWzL4z6zJZlDhzYG0Lp%2B%2F7rBv1wwBpqsCcSS8cXBNSW0XZc7VHJ2AuB%2BOvXzoLWwKMMH%2FHd%2FTI%2B%2BIH5Ec4O8Jjup%2Bg%3D%3D; expires=Fri, 31-May-2013 16:21:54 GMT; path=/
  < Vary: Accept-Encoding,User-Agent
  < X-Apache-Server: www1.citibikenyc.com
  < MS-Author-Via: DAV
  <
  { [data not shown]
  100  116k    0  116k    0     0   853k      0 --:--:-- --:--:-- --:--:--  993k
  * Connection #0 to host www.citibikenyc.com left intact
  * Closing connection #0

Just based on that, it looks like we can deduce several salient points about the architecture:

  • There are two backend web servers with their own DNS names, www1.citibikenyc.com and www2.citibikenyc.com.
  • The www.citibikenyc.com DNS entry contains both the www1 and www2 addresses, giving round-robin load balancing.
  • The servers appear to be running nginx on port 80, proxying to a local Apache server, based on the X-Apache-Server header.
  • The site is built with the CodeIgniter PHP framework (the main site appears to be using the Fuel CMS atop CodeIgniter).
  • A quick trip to ARIN tells us that those two IP addresses belong to machines in a Media Temple datacenter, so no CDN is in play.
  • Each request to the /stations/json path returns a Set-Cookie header that sets a CodeIgniter session cookie with a 10-minute lifetime.

For comparison, here is what the response looked like during the outage. The main difference is that the feed was still returning a 200 HTTP response code, but with a Content-Length of 0 and, indeed, a blank response body:

  ┌┤smerrill@lilliputian-resolution [May 27 11:34:46] ~
  └╼ curl -v www.citibikenyc.com/stations/json > /dev/null
  * About to connect() to www.citibikenyc.com port 80 (#0)
  *   Trying 70.32.83.162...
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* connected
  * Connected to www.citibikenyc.com (70.32.83.162) port 80 (#0)
  > GET /stations/json HTTP/1.1
  > User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
  > Host: www.citibikenyc.com
  > Accept: */*
  >
  < HTTP/1.1 200 OK
  < Server: nginx
  < Date: Mon, 27 May 2013 15:34:47 GMT
  < Content-Type: text/html
  < Content-Length: 0
  < Connection: keep-alive
  < Vary: Accept-Encoding
  < Set-Cookie: ci_session=x7ymouLWLEeY6%2Fo%2BudFoQHOixWyP3b9Ygp8Ocv94roUESql6Gwet1nuCVBcILmqt9DQzbFsSLLkXyOZ5qL%2Fl%2FD88F0Q0uXeLptE3zlGHxP0EISPGk5gW91SVscxi1klVRYv5Mt5zTO0KzB4obwc%2FY1AUFEodhplKXeaSURPXAw7roZVumXkmM1ALGbWQx5FF6LKm%2FtzudHm8NQPJYXDx3s3sUdVNWvWQpWe3iKEE5Su0TzqCKZcBxWYcssuPNVGEx8c5SpijHw6iR7sqnTMBnMdv7m4jsuj9ZweGk6JfGEp3G5%2BXAMqdsWfE%2Fa8449o92%2BlLug0NCFRdxH7ViZHXBA%3D%3D; expires=Mon, 27-May-2013 15:44:47 GMT; path=/
  < Vary: Accept-Encoding,User-Agent
  < X-Apache-Server: www1.citibikenyc.com
  < MS-Author-Via: DAV
  <
  { [data not shown]
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  * Connection #0 to host www.citibikenyc.com left intact
  * Closing connection #0

The Case of the Missing JSON

It seems the root of the problem was that the Apache server behind the nginx server was overloaded. It appears that at one point, the /stations/json endpoint was throwing errors based on this tweet:

@stevenmerrill That @citibikenyc feed has been returning Apache "too many connections" errors for a couple of hrs. Shows strong demand.

— codeline telemetry (@codelinegeekery) May 27, 2013

 

And then it was changed to this odd 200 response with no data. Naturally, that led me to a simple question.

Y u no cache?

Why not cache or pre-generate the JSON? Even with a TTL of 30 or 60 seconds, serving cached, rendered JSON instead of going back to the dynamic application for every visitor would help protect the backend LAMP stack from spikes in traffic.

I can only speculate, but I imagine that the design of this feed went something like the following:

"Our CitiBike stations will be pinging in every 30 seconds with new information about how many bikes and docks are available, so we have to be sure that the information we send to our mobile apps is as fresh as possible."

Keeping your customers up to date on whether they'll be able to get or return a bike is certainly of paramount importance to a system like this, but that doesn't have to translate into dynamically generating the JSON data on every request.

Cache Rules Everything Around Me

My first thought upon seeing this situation was a hasty tweet:

Oddly, the misbehaving @citibikenyc JSON feed returns a 200 and an empty body. Maybe the team should have written flat files and used a CDN.

— Steven Merrill (@stevenmerrill) May 27, 2013

 

There are several possible ways that the JSON response could be cached to lower the load on the backend web servers.

The lowest-risk option would be to have an out-of-band process, such as cron or Jenkins, generate a static JSON file on the filesystem every 30 or 60 seconds, and then have nginx deliver that static file to clients who visit the /stations/json URL. This is a great option, since the load generated to service the endpoint would be both consistently timed (every 30 or 60 seconds) and predictable. For even more scalability beyond the two nginx servers, the file could be pushed to a CDN, or served from Fastly, a Varnish-powered CDN with instant purging.
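That flow could be sketched roughly like this. To be clear, the internal endpoint URL, the docroot path, and the function name are assumptions for illustration, not CitiBike's actual setup:

```python
# Sketch of the cron-driven approach (hypothetical paths and URLs).
import os
import tempfile
import urllib.request

SOURCE_URL = "http://localhost:8080/stations/json"  # internal app endpoint (assumed)
DOCROOT = "/var/www/static"                         # nginx document root (assumed)

def publish_stations(source_url: str, docroot: str) -> None:
    """Fetch fresh JSON and atomically publish it as a static file."""
    body = urllib.request.urlopen(source_url, timeout=10).read()
    if not body:
        return  # never clobber the last good copy with an empty response
    fd, tmp_path = tempfile.mkstemp(dir=docroot, prefix=".stations.")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(body)
    # A rename within one filesystem is atomic, so nginx always serves either
    # the old complete file or the new complete file, never a partial write.
    os.replace(tmp_path, os.path.join(docroot, "stations.json"))

# cron (or Jenkins) would invoke publish_stations(SOURCE_URL, DOCROOT)
# every 30 or 60 seconds.
```

The atomic rename is the important detail: a naive `open(...).write(...)` straight onto the published path would briefly expose truncated JSON to clients mid-write.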

Another option, which allows for more finely-tuned updates, is to use a message-queuing system. With this design there would be no need for a cron or Jenkins task. Instead, the bike docking stations would send a message to the queue whenever an action changed the number of bikes at a station (a bike checked out or returned). In a naive version of the system, each of these state-change messages would regenerate the whole JSON file. In a smarter version, each state-change message would update a single cached JSON fragment representing the station, and the new JSON file would be generated from this cache, lowering database load whenever the file had to be regenerated.
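The smarter variant might look something like the following sketch, with an in-memory dict standing in for the fragment cache and entirely hypothetical message fields (a real system would consume these messages from a broker such as RabbitMQ):

```python
# Sketch of the queue-driven approach (message schema is hypothetical).
import json

station_cache: dict = {}  # station id -> latest cached station fragment

def handle_message(msg: dict) -> None:
    """Apply one state-change message (a bike checked out or returned)."""
    station = station_cache.setdefault(msg["station_id"], {"id": msg["station_id"]})
    station["available_bikes"] = msg["available_bikes"]
    station["available_docks"] = msg["available_docks"]

def render_feed() -> str:
    """Regenerate the full feed from the per-station cache alone,
    so no database query is needed when the file is rebuilt."""
    stations = sorted(station_cache.values(), key=lambda s: s["id"])
    return json.dumps({"stations": stations})
```

The output of `render_feed` would then be published atomically to the docroot, exactly as in the cron-driven approach.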

Both of these approaches are preferable because the system load is controllable and is not tied to how many requests for the JSON file are coming in. A third server that doesn't respond to web requests could even do this processing.

Finally, since the application already has nginx at the frontend, another option is to use nginx's proxy_cache and proxy_cache_use_stale directives to cache the output of /stations/json for a certain TTL. This is slightly less ideal, since the TTLs can drift between the two web servers and someone might get data a few seconds staler depending on which server they hit, but the business would probably agree that this beats outright downtime.

For this to work, the Set-Cookie header must be stripped or ignored, since by default nginx will not cache a response that sets a cookie. As far as I can tell, no personalized data is served from that endpoint, so the session cookie serves no useful purpose there.
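A minimal sketch of what that nginx configuration might look like; the cache zone name, cache path, and backend port here are illustrative guesses, not the site's actual configuration:

```nginx
# Illustrative only: zone name, cache path, and backend port are assumptions.
proxy_cache_path /var/cache/nginx/stations keys_zone=stations:1m max_size=10m;

server {
    listen 80;
    server_name www.citibikenyc.com;

    location = /stations/json {
        proxy_pass http://127.0.0.1:8080;  # the local Apache backend (assumed port)
        proxy_cache stations;
        proxy_cache_valid 200 30s;         # serve each cached copy for 30 seconds
        # Keep serving the last good copy if the backend errors out or times out.
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503;
        proxy_ignore_headers Set-Cookie;   # the session cookie would otherwise defeat caching
        proxy_hide_header Set-Cookie;      # and clients don't need it for this endpoint
    }
}
```

The proxy_cache_use_stale line is what would have prevented the blank-map failure mode: when Apache hit "too many connections," nginx could have kept answering with the last good JSON instead of an empty 200.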

In Summary

The launch of a web application can be tough. Load patterns emerge in the real world that your load testing didn't take into account. Regardless, if you know that any part of your system may get a lot of traffic or very bursty traffic, you can engineer your application for resilience by moving data generation out of uncached web requests and into cron- or queue-backed update processes.

Phase2 recently faced a similar challenge in helping the Robin Hood Foundation architect a website for the 12-12-12 Concert for Hurricane Sandy relief. If you're interested in learning more, you can listen to the panel talk we just gave at DrupalCon Portland.

Steven Merrill

Director of Devops