When I was a kid, I watched the movie Christine, based on a lesser-known Stephen King novel of the same name. It’s been a few years, but I vaguely remember the plot went something like this: nerdy loser guy fixes up classic car, evil punk kids trash car (because, well, they’re evil), car gets possessed by evil spirits/magic lightning/whatever, car repairs itself and starts killing evil punk kids. The movie was mediocre at best and has surely not aged well, but one part that has always stuck in my memory was how cool it was to see the car automatically repair itself each time the punks bashed it in. Metal untwists, windows pop back into place, floor mats straighten out, paint buffs itself, life is good.
I’m not sure why this (minus the evil Satan car killing people part) appealed to me so much. Maybe it was the cool special effects (it was 1983 after all) or maybe it was some personal rebellion against all those long, cold nights holding a flashlight in the garage while my father fixed some obscure part of the family car (“the fan belt combobulator is engaging the brake line manifold…”). I think one thing this did teach me (other than a total lack of interest in repairing cars) was a complete disdain for fixing the same problem again and again. A car that can fix itself, now that’s pretty darn cool.
Of course, we don’t have self-repairing cars or even self-fixing computer hardware yet (and no, exploiting Jed the intern doesn’t count), but in the world of “self-healing software”, we’re further along than you may think. Sure, it’s not quite as cool as watching a punctured tire reinflate itself, but it’s still pretty darn neat and saves just as much time.
Coding for Failure
Let’s first take a step back. When an aspiring software developer takes a programming course or reads through a programming book, odds are they will learn what I call “coding for success”. In other words, you have a problem to solve, and you write code to provide a solution. Make me an application that can calculate how many bananas you can carry in a wheelbarrow (the monkeys are hungry). If you run out of bananas, flash up an error dialog and ask the user to provide more bananas. Every line of code is written with the assumption that the machine is up and running and working just fine.
In this wacky world of 24/7 e-commerce, though, the rules are not so simple. Say I want to push 3,000 listings over to eBay in five minutes or less. No problem, eBay has an API, I’ll just write some code to push the data. But wait, even if my code is as fast as possible, I still can’t possibly push that much data in that little time; the computer is just not that fast. Alright, so I’ll change my code so it can run in parallel on the same machine (think 1 person in a harness pushing 2 wheelbarrows side by side, a lot tougher but still possible). Hmm, but that’s still not fast enough, and there’s now smoke shooting out of the back of my machine. Okay, so maybe I need more than one machine to do the work, but how does that work? Somehow I need to orchestrate the work so it can run on many machines (e.g. 5 people pushing 10 wheelbarrows). Okay, so now I’ve figured that out, and the boss comes back saying “by next year I want to push 30,000 listings out in 5 minutes”. Guess I need to buy more servers…
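To make the “two wheelbarrows on one machine” step a bit more concrete, here is a rough Python sketch. The function `push_listing` is a made-up stand-in for a single network call (it is not a real eBay API call), and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def push_listing(listing_id):
    """Hypothetical stand-in for one network call that pushes a single listing."""
    return f"pushed {listing_id}"


def push_all(listing_ids, workers=8):
    """Push many listings in parallel on one machine using a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(push_listing, listing_ids))


if __name__ == "__main__":
    print(push_all(range(5), workers=2))
```

Threads suit this kind of work because each push mostly waits on the network. Spreading the same work across many machines needs some form of shared work queue, which this sketch does not attempt.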
As you grow your software to run on more and more machines, handling more and more work and running longer and longer between breaks, an interesting phenomenon occurs: things that should always “just work” don’t always “just work”. The database sitting on the Mac-Daddy “Hal-9000 would be jealous” hardware sometimes says “you know, I’m kind of busy right now. Would you mind coming back a bit later?” Or maybe you’re connecting to eBay, but eBay’s network says “gosh, there sure are a lot of people coming in right now, I’m going to have to ask you to come back later.” Wait a minute, this isn’t supposed to happen!
With this level of scale and interdependency, your approach to software must evolve. You can no longer “code for success”; you must now “code for failure”. Each and every line of software must be written with the challenge “If this line failed, how would the system recover?” If the Hal-9000 database stopped working (“What are you doing…Mark?”), or your Internet connection decided to “take 5”, or evil, crazed monkeys started shoving bananas into the server, how would the system recover without losing the work?
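One common way to answer “how would the system recover?” for the “come back later” kind of failure is to retry with a growing pause between attempts. The Python below is only a minimal sketch of that idea: `TransientError`, `call_with_retry`, and the delay values are all hypothetical names and numbers picked for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (busy database, throttled API)."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what happens next
            # wait 0.5s, 1s, 2s, ... before asking again
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    attempts = {"count": 0}

    def busy_database():
        # Pretend the database is busy for the first two calls
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("kind of busy right now")
        return "row saved"

    print(call_with_retry(busy_database, base_delay=0.01))
```

Only failures marked as transient are retried here; anything else is raised immediately, since retrying a genuine bug just repeats the bug.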
These are lessons they don’t teach you in college, but they are oh so critical to keeping software running 24/7/365 across hundreds of servers. Remember, even if a failure only happens once in a million times, if you run over a million transactions each day, you can expect to see a failure pretty much every day. Imagine if you walked up to an ATM and tried to withdraw $500, and the ATM software failed after the $500 debit from your account was made but before the cash was dispensed into your eager hand. Yikes, you’re out $500, not cool! And a response from the bank of “oh that only happens once in a million times. Sorry, better luck next time!” doesn’t really fly very well. Release the monkeys!
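Here is a toy Python sketch of the ATM case written with failure in mind. Everything in it (`Account`, `DispenseError`, `withdraw`) is hypothetical and kept in memory for illustration; a real bank would also need a durable record so it could recover if the machine lost power halfway through.

```python
class DispenseError(Exception):
    """Hypothetical failure raised when the cash hardware cannot pay out."""


class Account:
    def __init__(self, balance):
        self.balance = balance


def withdraw(account, amount, dispense):
    """Debit the account, then dispense; undo the debit if dispensing fails."""
    if amount > account.balance:
        raise ValueError("insufficient funds")
    account.balance -= amount
    try:
        dispense(amount)
    except DispenseError:
        # The cash never came out, so put the money back before reporting the failure
        account.balance += amount
        raise


if __name__ == "__main__":
    def jammed(amount):
        raise DispenseError("cash drawer jammed")

    account = Account(1000)
    try:
        withdraw(account, 500, jammed)
    except DispenseError:
        print("withdrawal failed, balance is still", account.balance)
```

The point is the shape of the code: the step that can fail sits inside a block that knows how to undo the work already done.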