Avoiding Flat Tires in your Web Application

This Monday, the Citi Bike bike share launched in New York City. The website was beautiful and responsive, and it processed more than 15,000 registrations before launch.

But then, a funny thing happened. The website and the mobile apps’ maps started coming up blank, and they stayed blank for more than 12 hours. What follows should not be characterized as a failing of the technical team. Launches are tough, and I don’t mean to pile on what was obviously a tough situation. Instead, I would like to look at a few choices that were made and how the system might have been better architected for scalability.

Architecture Spelunking

I have no inside knowledge of the architecture of the Citi Bike NYC application and hosting infrastructure, but we can find out a couple of things very quickly.

The misbehaving path in question was http://www.citibikenyc.com/stations/json. With that information in hand, we can use the dig and curl commands to get a lot of information.

It looks like there are two web servers, and the site is using DNS-based load balancing. Let’s take a look at the (now-fixed) JSON feed with curl -v. Using curl -v instead of curl -I ensures that we actually send a GET request instead of a HEAD request, so that the behavior is as close to a real browser’s as possible.

Just based on that, it looks like we can deduce several salient points about the architecture:

  • There are two DNS addresses for the backend web servers, www1.citibikenyc.com and www2.citibikenyc.com.
  • The www.citibikenyc.com DNS entry contains both the www1 and www2 addresses.
  • The servers appear to be running nginx on port 80 and then connecting to a local Apache server based on the X-Apache-Server header.
  • The site is built using the CodeIgniter PHP framework (the main site appears to be using the Fuel CMS atop CodeIgniter).
  • A quick trip to ARIN lets us know that those two IP addresses point to machines in a MediaTemple datacenter, so no CDN is in play.
  • Each request to the /stations/json path returns a Set-Cookie header that sets a CodeIgniter session cookie with a 10-minute lifetime.
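For readers who would rather script this kind of inspection than run dig and curl by hand, here is a minimal Python sketch using only the standard library. The function names are my own, and the live site may no longer respond the way it did at launch, so treat this as an illustration of the technique rather than a reproduction of the output I saw. The X-Apache-Server header name comes from the response described above.

```python
import socket
import urllib.request


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses behind a hostname.

    More than one address for a single name hints at DNS-based load balancing.
    The resolver is injectable so the function can be exercised offline.
    """
    infos = resolver(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


def fetch_headers(url):
    """Send a real GET (not a HEAD) and return the response headers as a dict."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return dict(response.headers.items())


def summarize_headers(headers):
    """Pick out the headers discussed above from a name -> value mapping."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "front_server": lowered.get("server"),
        "backend_server": lowered.get("x-apache-server"),
        "sets_cookie": "set-cookie" in lowered,
        "content_length": lowered.get("content-length"),
    }
```

Calling summarize_headers(fetch_headers(url)) against each backend address in turn would show whether both servers are configured the same way.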

For comparison, consider what the response looked like during the outage. The feed was still returning a 200 HTTP response code, but the Content-Length was 0, and indeed the response body was blank.

The Case of the Missing JSON

It seems the root of the problem was that the Apache server behind the nginx server was overloaded. Based on a tweet from the time, it appears that at one point the /stations/json endpoint was throwing errors, and that it was later changed to this odd 200 response with no data. Naturally, that led me to a simple question.

Y u no cache?

Why not cache or pre-generate the JSON? Even with a TTL of 60 or 30 seconds, caching the rendered JSON instead of going back to a dynamic application for every visitor could help protect the backend LAMP stack from spikes in traffic.

I can only speculate, but I imagine that the design of this feed went something like the following:

“Our CitiBike stations will be pinging in every 30 seconds with new information about how many bikes and docks are available, so we have to be sure that the information we send to our mobile apps is as fresh as possible.”

Keeping your customers up to date about whether they’ll be able to get or return a bike is certainly of paramount importance to a system like this, but that doesn’t have to translate into dynamically generating JSON data on every request.

Cache Rules Everything Around Me

My first reaction upon seeing this situation was a hasty tweet.

There are several possible ways that the JSON response could be cached to lower the load on the backend web servers.

The lowest-risk option would be to have an out-of-band process like cron or Jenkins generate a static JSON file on the filesystem every 30 or 60 seconds. Then have nginx deliver that static JSON file to clients when they visit the /stations/json URL. This is a great option, since the load generated to service that endpoint would be both consistently timed (every 30 or 60 seconds) and predictable. For even more scalability beyond the two nginx servers, this file could then be pushed to a CDN, or the file could be served from Fastly, a Varnish-powered CDN with instant purging.
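As a rough illustration of the out-of-band approach, here is a minimal Python sketch of the script such a job might run. The fetch_station_status function is a hypothetical stand-in for whatever query the real application uses, and the file path is whatever location nginx is configured to serve. The important detail is that the file is written to a temporary name and then renamed into place, so nginx never serves a half-written file.

```python
import json
import os
import tempfile


def write_json_atomically(payload, destination):
    """Serialize payload to destination without ever exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(destination))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem and is atomic.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # mkstemp creates files readable only by their owner; relax that
        # so a web server running as another user can read the result.
        os.chmod(temp_path, 0o644)
        os.replace(temp_path, destination)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


def fetch_station_status():
    """Hypothetical placeholder for the real database or API query."""
    return {"stations": [{"id": 1, "bikes": 7, "docks": 12}]}


def regenerate(destination):
    """One scheduled run: query once, then publish the static file."""
    write_json_atomically(fetch_station_status(), destination)
```

One caveat: cron’s finest schedule is one minute, so a 30-second interval would need a small loop with a sleep, or a scheduler with finer granularity.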

Another option that would allow for more finely tuned updates is to use a message queueing system. With this design, running a task on cron or with Jenkins would not be needed. Instead, the bike docking stations would send messages to the queue whenever an action occurred that changed the number of bikes at the station (a bike checked out or a bike returned). In a naive version of the system, each of these state-change messages would regenerate the whole JSON file. In a smarter version, the state-change messages would update a single cached JSON fragment representing the station, and the new JSON file could be generated from this cache so that database load would be lowered when the file had to be regenerated.
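To make the “smarter version” concrete, here is a small Python sketch. It uses the standard library’s in-process queue as a stand-in for a real message broker, and the message fields (station_id, bikes, docks) are my own invention, since I don’t know what the docking stations actually send. Each message rewrites only its own station’s cached fragment, and the full document is assembled from the fragments without touching a database.

```python
import json
import queue


class StationFeed:
    """Holds one pre-serialized JSON fragment per station."""

    def __init__(self):
        self._fragments = {}  # station id -> serialized JSON fragment

    def apply(self, message):
        """Re-serialize only the station named in a state-change message."""
        station_id = message["station_id"]
        self._fragments[station_id] = json.dumps(
            {"id": station_id, "bikes": message["bikes"], "docks": message["docks"]},
            sort_keys=True,
        )

    def render(self):
        """Join the cached fragments into the full feed, ordered by station id."""
        ordered = (self._fragments[key] for key in sorted(self._fragments))
        return '{"stations": [' + ", ".join(ordered) + "]}"


def drain(messages, feed):
    """Apply every message currently queued and return how many were applied."""
    applied = 0
    while True:
        try:
            message = messages.get_nowait()
        except queue.Empty:
            return applied
        feed.apply(message)
        applied += 1
```

A worker would call drain and then publish the result of render to wherever the web servers read it from, so the work done scales with the number of bike events rather than the number of visitors.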

Both of these approaches are much preferred because the system load is controllable and is not tied to how many requests for the JSON file are coming in. A third server that didn’t respond to web requests could do this processing.

Finally, since the application already has nginx at the frontend, another option is to use nginx’s proxy_cache and proxy_cache_use_stale directives and cache the output of /stations/json for a certain TTL. This is less ideal, because each web server keeps its own cache and the TTLs can drift apart, so someone might get data that is a few seconds stale depending on which server they hit. Still, the business would probably agree that this is better than outright downtime.

For this to work, the Set-Cookie header must be removed, because by default nginx will not cache a response that carries one. As far as I can tell, there is no personalized data being sent from that endpoint, so this session cookie does not serve a useful purpose.
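The behavior I’m after from proxy_cache plus proxy_cache_use_stale can be sketched in a few lines of Python. To be clear, this is not nginx and not how nginx implements it; it is a toy model, with names of my own choosing, of the two rules that matter: serve the cached copy while it is within its TTL, and if the backend fails after that, keep serving the old copy instead of an error.

```python
import time


class StaleOnErrorCache:
    """Cache a single value for a TTL and fall back to it if a refresh fails."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # callable that hits the (slow) backend
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._value = None
        self._stored_at = None     # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        fresh = self._stored_at is not None and now - self._stored_at < self._ttl
        if fresh:
            return self._value
        try:
            value = self._loader()
        except Exception:
            if self._stored_at is None:
                raise  # nothing cached yet, so there is no stale copy to serve
            # Backend is failing: serve the expired copy rather than an error.
            # The timestamp is left alone, so the next call retries the backend.
            return self._value
        self._value = value
        self._stored_at = now
        return value
```

With a 30-second TTL, the backend sees at most one request per TTL per web server while it is healthy, no matter how many visitors arrive. This sketch is not thread-safe and retries the backend on every request while it is down, which a real deployment would want to limit.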

In Summary

The launch of a web application can be tough. Load patterns emerge in the real world that your load testing didn’t take into account. Regardless, if you know that any part of your system may get a lot of traffic or very bursty traffic, you can engineer your application for resilience by moving data generation out of uncached web requests and into cron- or queue-backed update processes.

Phase2 recently experienced a challenge like this in helping the Robin Hood Foundation architect a website for the 12-12-12 Concert for Hurricane Sandy. If you’re interested in learning more about that, you can listen to the panel talk we just gave at DrupalCon Portland.

  • Fredric Mitchell

    interesting points.

    how would this work if you needed dynamic json files, maybe by a query service?

    would you still generate flat files?

  • Bengie25

    How do people keep creating web services that are not IO, CPU, nor memory bound? Number of connections is an artificial limitation, why artificially limit your service?

    The biggest issue these days seems to be poor systems designs by using blocking calls and spawning threads.

    I have read some blogs in the past of good system designers getting a single quad-core Xeon web-server to push 10Gb/s with 300,000 active connections slamming it with requests.

    It took some kernel compilations with network stack parameter tweaks, but it can be done.

  • joon hyong

    Good points..