Fault Tolerance For An Improved User Experience

Adam Ross, Software Architect

A close cousin of high availability, fault tolerance is the ability of a software system to cope when bad things happen.

What would happen to your site if…

  • Google disappeared?
  • Your database ran out of disk space?
  • Your load balancer forgot about half its servers?
  • Two pieces of code disagreed on the definition of "numeric"?

When your system is fault tolerant, the damage from any of these events is minimal. If your system also offers a good user experience in its fault tolerance, these errors should cause no frustration, confusion, or fear on the part of the system administrators.

Of course, faults are the rare exception in good software. We are talking about how to handle the very small percentage of software interactions that go bad for one reason or another.

The extra effort you spend to build a fault tolerant system will lead to numerous benefits for the various groups of people who interact with your site:

  • Site visitors, who will never know the problem happened or, if they do, will find out quickly and know what to do.
  • The support team, who can easily check the system status to see whether a problem persists.
  • The development team, who can identify what went wrong and how to fix it.
  • Product owners, who can be confident that even if the infrastructure fails, the application will be robust enough to keep going or fail as gracefully as possible.

So how do I build for fault tolerance?

Great question!

There’s no silver bullet here: fault tolerance is an architectural principle rather than a trick. Here are a few things you can do to look after the people-oriented aspects of fault tolerant design.

Use Local Copies of Content

Caching is often thought of in a performance-minded way: cache this data so the server does not have to build it, or cache that data to avoid network latency slowing down the next page load. From the fault tolerant perspective, you cache some data so that if the original source disappears for a while, you still have content to present to your users.

For example, suppose you are embedding a Twitter widget in your site. If Twitter went away, would you rather have a hole in your page layout, or a snapshot of the last fresh data you pulled from the Twitter API?

Even if you cache content aggressively, there is still a risk that the upstream provider will be gone at the moment the content expires. You minimize this risk if you make it a point to retrieve new data before purging the old data from your cache. (Use this with care: sometimes stale data is almost as bad as none, so inform your users!)
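Here is a minimal sketch of that "fetch before you purge" idea, written in Python purely for illustration. The class name StaleFallbackCache, the fetch callable, and the is_stale flag are all hypothetical names invented for this example; they are not part of Drupal or the Twitter API.

```python
import time
from typing import Any, Callable, Optional, Tuple


class StaleFallbackCache:
    """Keeps the last good value until a fresh one has actually been fetched."""

    def __init__(self, fetch: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical callable that hits the upstream
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Tuple[Any, bool]:
        """Return (value, is_stale). Raises only if no good value was ever cached."""
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._value, False            # still fresh, no upstream call
        try:
            new_value = self._fetch()
        except Exception:
            if self._fetched_at is None:
                raise                            # nothing to fall back on yet
            return self._value, True             # upstream is gone: serve the snapshot
        # Only now, with new data in hand, do we replace the old entry.
        self._value, self._fetched_at = new_value, now
        return new_value, False
```

The is_stale flag is there so the page can tell visitors that they are looking at an older snapshot instead of silently presenting it as current.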

Never Keep People Waiting

These days the web moves quickly. If any system is a few seconds off its usual behavior, something is very wrong. If something is very wrong, the best case scenario is mind-numbing slowness. The worst case is failed operations, data corruption, or unwitting participation in a denial-of-service attack. When a too-slow operation is ended quickly, users can get on with their activities, whether that means trying the task again or moving on to something else. Fast is always more forgivable than slow.

Tune a time-out for every service interaction (such as querying the database or making a Salesforce API request). Don’t make your user wait 30 seconds to find out they twiddled their thumbs for nothing. When in doubt, 3 seconds is a decent limit for most web services. Remember the mantra: fail fast.
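As a rough sketch of what "fail fast" can look like in code, the Python helper below puts a deadline around any slow call. The names call_with_timeout and ServiceTimeout are hypothetical, and the 3-second default simply mirrors the rule of thumb above. Note that this approach only stops the caller from waiting; the abandoned call keeps running in its worker thread, so wherever your database or HTTP client offers its own timeout setting, use that as well.

```python
import concurrent.futures
from typing import Any, Callable

# A small shared pool; the size is an arbitrary choice for this sketch.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class ServiceTimeout(Exception):
    """Raised when a service interaction exceeds its time budget."""


def call_with_timeout(operation: Callable[[], Any],
                      timeout_seconds: float = 3.0) -> Any:
    """Run operation() and give up waiting after timeout_seconds."""
    future = _POOL.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the task has not started yet; a running
        # task cannot be interrupted this way and finishes in the background.
        future.cancel()
        raise ServiceTimeout(
            f"operation did not finish within {timeout_seconds} seconds"
        ) from None
```

The caller can catch ServiceTimeout and immediately show the user a clear message or fall back to cached content.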

Provide Clear Guidance to the User

When something goes wrong, tell the user what happened, what it means to them, and what they should do about it.

Silence or opaque error messages in the wake of flaky or unpredictable behavior lead to confusion. Confusion leads to impatience and distrust in your site’s reliability. Distrust leads to loss of reputation and customers.

Let the user know if and when they should try again, and either reassure them that their data is safe or tell them which data might be missing. If the system can retry the operation on its own, inform the user so they do not spend the effort to repeat the action. When something that seems bad happens, there should be no mysteries.

(If you can completely manage the particular fault that occurs, you’ve almost transcended fault tolerance. At that point, the mystery of perfection hidden behind your queuing system is perfectly fine!

Example: A user loads a transit pass online. The system takes an extra second then informs them (as usual) that it will take up to 24 hours for the transaction to be applied to their account. Under the hood, the main database is down but the transaction was stored for later processing.)
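One way to picture the transit pass example is the Python sketch below. Everything in it is an assumption made for illustration: the DeferredWriter class, the write callable standing in for the main database, and the in-memory queue. A real system would need durable storage for the pending transactions, since an in-memory queue is lost on restart.

```python
from collections import deque
from typing import Callable, Deque, Dict


class DeferredWriter:
    """Accepts transactions even when the main store is down, applies them later."""

    def __init__(self, write: Callable[[Dict], None]) -> None:
        self._write = write                      # hypothetical main-database write
        self._pending: Deque[Dict] = deque()     # assumption: in-memory, not durable

    def submit(self, transaction: Dict) -> str:
        try:
            self._write(transaction)
        except Exception:
            # Main store is unavailable: keep the transaction for later.
            self._pending.append(transaction)
        # The user sees the same (truthful) message either way.
        return "Your pass will be updated within 24 hours."

    def flush(self) -> int:
        """Retry pending transactions in order; return how many were applied."""
        applied = 0
        while self._pending:
            try:
                self._write(self._pending[0])
            except Exception:
                break                            # still down, try again later
            self._pending.popleft()
            applied += 1
        return applied

    def pending_count(self) -> int:
        return len(self._pending)
```

Because the usual message already promises processing within 24 hours, the fallback path does not require any special wording for the user.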

Record What Happened for the Technical Team

When something goes wrong, there needs to be an audit trail. Logs can signal emergencies to the systems team. The development team can also use them to reproduce a sequence of sad events while troubleshooting. Good logs are clear and informative, and they record key variables such as the user who took the action or the parameters of the transaction. Effective log monitoring can even give the technical folks early warning of a problem before users and administrators run into something truly noteworthy. Pre-empting problems is much better than reacting to them after a user has suffered.
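A small sketch of such a log entry, using Python's standard logging module. The logger name "payments", the record_failure helper, and the field names are hypothetical choices for this example; the point is only that the user, the parameters, and the error type all land in one searchable line.

```python
import logging

# Hypothetical logger name; pick whatever matches your application.
logger = logging.getLogger("payments")


def record_failure(user_id: str, params: dict, error: Exception) -> None:
    """Log a failed transaction with the key variables needed to reproduce it."""
    logger.error(
        "transaction failed user=%s params=%r error=%s: %s",
        user_id,
        params,
        type(error).__name__,
        error,
    )
```

Be careful about what goes into the parameters: passwords, card numbers, and similar secrets should be left out or masked before logging.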

Communicate the Current System Status

A tell-tale status board lined up with reassuring green lights is a great confidence booster. However, when something goes wrong, there’s nothing more reassuring than a few yellow or red warning signs indicating that the system (or even some humans) is aware of the problem. When something has gone awry, the worst feeling is ignorance: users feel orphaned, the support staff has nothing to say, and the technical team is starting at square one to solve the problem.

Don’t forget our fail-fast mantra: slow response times are also cause for a yellow alert! In Drupal, this means extending the status board for any non-standard integrations or infrastructure you are using.
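The green, yellow, and red idea can be sketched as a simple health probe. This is generic Python for illustration, not Drupal's status report API; the check_status function, the probe callable, and the one-second threshold are all assumptions made for this example.

```python
import time
from typing import Callable

GREEN, YELLOW, RED = "green", "yellow", "red"


def check_status(probe: Callable[[], None],
                 slow_after_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> str:
    """Run a probe against a dependency and classify the result.

    RED    - the probe raised an exception (dependency is failing)
    YELLOW - the probe worked but took longer than slow_after_seconds
    GREEN  - the probe worked within the time budget
    """
    start = clock()
    try:
        probe()
    except Exception:
        return RED
    elapsed = clock() - start
    return YELLOW if elapsed > slow_after_seconds else GREEN
```

A status page would run one such probe per integration (database, search, third-party API) and show the results side by side.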

Parting Thoughts

By building your system in a fault-tolerant way, you reduce your users’ pain when things go wrong. The technical tricks discussed in this post can even be considered “good scalability citizenship” for your upstream infrastructure and web-service providers. A little bit of planning in your architecture goes a long way!

Learn more about caching in Chris Johnson's blog post "Caching in Drupal." Read more on web-service integrations with Tobby Haggler in "Displaying Tweets on High Traffic Websites."
