Evil Monkeys and Self Healing Software
When I was a kid, I remember watching the movie Christine, based on a lesser-known Stephen King novel of the same name. It’s been a few years, but I vaguely remember the plot went something like this: nerdy loser guy fixes up classic car, evil punk kids trash car (because, well, they’re evil), car gets possessed by evil spirits/magic lightning/whatever, car repairs itself and starts killing evil punk kids. The movie was mediocre at best, and surely has not aged well, but one part that has always stuck in my memory was how cool it was to see the car automatically repair itself each time the punks bashed it in. Metal untwists, windows pop back into place, floor mats straighten out, paint buffs itself, life is good.
I’m not sure why this (minus the evil Satan car killing people part) appealed to me so much. Maybe it was the cool special effects (it was 1983 after all) or maybe it was some personal rebellion against all those long, cold nights holding a flashlight in the garage while my father fixed some obscure part of the family car (“the fan belt combobulator is engaging the brake line manifold…”). I think one thing this did teach me (other than a total lack of interest in repairing cars) was a complete disdain for fixing the same problem again and again. A car that can fix itself, now that’s pretty darn cool.
Of course we don’t have self-repairing cars or even self-fixing computer hardware yet (and no, exploiting Jed the intern doesn’t count), but in the world of “self-healing software”, we’re further along than you may think. Sure, it’s not quite as cool as watching a punctured tire reinflate itself, but it’s still pretty darn neat and saves just as much time.
Coding for Failure
Let’s first take a step back. When an aspiring software developer takes a programming course or reads through a programming book, odds are they will learn what I call “coding for success”. In other words, you have a problem to solve, and you write code to provide a solution. Make me an application that can calculate how many bananas you can carry in a wheelbarrow (the monkeys are hungry). If you run out of bananas, flash up an error dialog and ask the user to provide more bananas. Every line of code is written with the assumption that the machine is up and running and working just fine.
In this wacky world of 24/7 e-commerce, though, the rules are not so simple. Say I want to push 3,000 listings over to eBay in five minutes or less. No problem, eBay has an API, I’ll just write some code to push the data. But wait, even if my code is as fast as possible, I still can’t possibly push that much data in that little time; the computer is just not that fast. Alright, so I’ll change my code so it can run in parallel on the same machine (think 1 person in a harness pushing 2 wheelbarrows side by side, a lot tougher but still possible). Hmm, but that’s still not fast enough, and there’s now smoke shooting out of the back of my machine. Okay, so maybe I need more than one machine to do the work, but how does that work? Somehow I need to orchestrate the work so it can run on many machines (e.g. 5 people pushing 10 wheelbarrows). Okay, so now I figured that out, and the boss comes back saying “by next year I want to push 30,000 listings out in 5 minutes”. Guess I need to buy more servers…
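For the curious, the “run in parallel on the same machine” step might look roughly like the sketch below. It is only an illustration: `push_listing` is a made-up stand-in for whatever API call actually sends one listing, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def push_listing(listing_id: int) -> str:
    """Hypothetical stand-in for one API call that pushes a single listing."""
    return f"pushed-{listing_id}"


def push_all(listing_ids: Iterable[int], max_workers: int = 10) -> List[str]:
    """Push many listings at once using a pool of worker threads.

    Results come back in the same order as the input IDs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(push_listing, listing_ids))
```

This only covers one person pushing several wheelbarrows. Spreading the work over many machines usually means putting the listings on a shared queue that every machine pulls from, which is where the failure handling discussed next starts to matter.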
As you grow your software to run on more and more machines, handling more and more work and running longer and longer between breaks, an interesting phenomenon occurs: things that should always “just work”, don’t always “just work”. The database sitting on the Mac-Daddy “Hal-9000 would be jealous” hardware sometimes says “you know, I’m kind of busy right now. Would you mind coming back a bit later?” Or maybe you’re connecting to eBay, but eBay’s network says “gosh, there sure are a lot of people coming in right now, I’m going to have to ask you to come back later.” Wait a minute, this isn’t supposed to happen!
With this level of scale and interdependency, your approach to software must become more evolved. No longer can you “code for success”, but rather now you must “code for failure”. Each and every line of software must be written with the challenge “If this line failed, how would the system recover?”. If the Hal-9000 database stopped working (“What are you doing…Mark?”), or your Internet connection decided to “take 5”, or evil, crazed monkeys started shoving bananas into the server, how would the system recover without losing the work?
These are lessons they don’t teach you in college, but they are oh so critical to keeping software running 24/7/365 across hundreds of servers. Remember, even if a failure only happens once in a million times, if you run over a million transactions each day, you can expect about one failure every single day. Imagine if you walked up to an ATM and tried to withdraw $500, but the ATM software failed after the $500 debit from your account was made, but before the cash was dispensed into your eager hand. Yikes, you’re out $500, not cool! And a response from the bank of “oh that only happens once in a million times. Sorry, better luck next time!” doesn’t really fly very well. Release the monkeys!
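To put rough numbers on the “once in a million” point, here is a small sketch. It assumes every transaction fails independently with the same probability, which is a simplification, and the function names are invented for illustration.

```python
def expected_failures(failure_rate: float, transactions: int) -> float:
    """Average number of failures you should expect over the given volume."""
    return failure_rate * transactions


def chance_of_any_failure(failure_rate: float, transactions: int) -> float:
    """Probability that at least one transaction fails, assuming independence."""
    return 1.0 - (1.0 - failure_rate) ** transactions


one_in_a_million = 1e-6
per_day = 1_000_000

average_per_day = expected_failures(one_in_a_million, per_day)       # about 1
bad_day_odds = chance_of_any_failure(one_in_a_million, per_day)      # roughly 0.63
```

Under those assumptions, a million transactions a day works out to about one failure a day on average, and roughly a 63% chance that any given day has at least one. Push ten million a day and a failure-free day becomes vanishingly rare.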
Self Healing Software
So what can you do about it? This is where we’ve invested a lot of effort over the last 7 years here at ChannelAdvisor. If the failure can happen, we’ve seen it. Fix it once, shame on you (Hal), fix it twice, shame on me. The way out: self-healing software and “coding for failure”. Still not following? Let me elaborate.
There are several strategies we use to cope with the random “once in a million” problem. For those glitches that occur entirely within our own systems (databases timing out, network connectivity hiccuping, etc.), this is where transactional coding and retry come into play. What does that mean? Well, I could throw out some technobabble like two-phase commit, transactional rollback, queued retry, exponential backoff, and hashed string superconductor capacitance (just kidding on that last one, or am I?), but I think an analogy would make this easier to understand.
Say you got your hands on an advanced screening of the new Battlestar Galactica season finale (best show on TV). This is pure nerdvana, but what good is that if you can’t brag to your nerdy friends? So you decide to invite over your buds Geddy, Alex and Neil to watch the show. You give Geddy a phone call, but his line is busy (Geddy is a lamo who lives in the 70s so he doesn’t have call waiting or voice mail, but he does play a mean bass guitar). Okay, so you call back in 5 minutes. Drat still busy. Okay you call back in 30 minutes. Now he doesn’t answer. Two hours later you try again, now he picks up: “Does 9pm tonight work?”. “Sure”. “Okay, let me call Alex and Neil and see if that works for them too.”
So now you call Alex, he answers right away. “Sorry, can’t do tonight, how about tomorrow?”. “Dunno, let me call Geddy again”. So back to Geddy: “Yo G, Alex can’t make tonight. How about tomorrow night?”. Sure, Geddy says. You then call Neil, he answers right away with “yep, tomorrow at 9pm is fine”. Finally you call Alex and confirm 9pm works for everyone. Done, your plan is set - just don’t forget the beer and nacho cheese Combos.
Believe it or not, software systems can work in a similar manner. A request for work (say closing out a completed listing from eBay) gets queued up in our system, and when executed needs to communicate with several different internal servers (call them Geddy, Alex, and Neil). If one of the servers is not responding, the system tries again later. If still not responding, it waits even longer (phone still busy). These retries continue with longer and longer delays until a connection is made. But that’s just the beginning - all the different servers need to be on the same page (“tonight at 9 doesn’t work, how about tomorrow?”). If designed correctly, the software can be smart enough to coordinate this transaction across all the participating servers. Once all the servers respond and agree to the same plan of action, the transaction is committed and the state of the system is changed in one fell swoop. Until that agreement occurs, though, nothing is allowed to change (Geddy, Alex and Neil can’t make any plans until they are all in agreement, even if somebody else tries to invite them to see “24” instead).
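The “call back later, and wait longer each time” part is what is usually called retry with exponential backoff. A minimal sketch follows, assuming a busy server shows up as a `ConnectionError`; the function name, the delay values, and the attempt limit are all illustrative choices, not a description of any particular production system.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller (or the queue) decide
            # Wait 1s, 2s, 4s, ... capped at max_delay, then try again.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the waiting can be swapped out in tests. In a queued system, the wait would more likely be handled by putting the work item back on the queue with a “not before” time than by blocking a thread.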
99.99% of the time this all works the first time (unlike Geddy, who can be kind of flaky, the server is a good friend and is almost always waiting by the phone for your call), but for that 0.01% of the time when nobody is home, your systems must be smart enough to handle that event and recover in an automated manner. Otherwise your credibility suffers, one disgruntled customer at a time, day after day after day.
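The “nobody makes plans until everyone agrees” coordination can be sketched as a toy two-phase commit, using the friends from the analogy. This is a simplified in-memory illustration under my own assumptions (the `Participant` class and its methods are invented here); real implementations also have to survive crashes between the two phases.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Participant:
    name: str
    free_slots: Set[str]
    held: Optional[str] = None    # tentatively reserved, not yet final
    booked: Optional[str] = None  # final, committed plan

    def prepare(self, slot: str) -> bool:
        """Phase 1: vote yes and hold the slot, or vote no."""
        if slot in self.free_slots and self.held is None:
            self.free_slots.remove(slot)
            self.held = slot
            return True
        return False

    def commit(self) -> None:
        """Phase 2: make the held slot final."""
        self.booked, self.held = self.held, None

    def rollback(self) -> None:
        """Release a held slot so nothing has changed."""
        if self.held is not None:
            self.free_slots.add(self.held)
            self.held = None


def two_phase_commit(participants: List[Participant], slot: str) -> bool:
    prepared: List[Participant] = []
    for p in participants:
        if p.prepare(slot):
            prepared.append(p)
        else:
            # One "no" vote cancels the plan for everyone who said yes.
            for q in prepared:
                q.rollback()
            return False
    for p in participants:
        p.commit()
    return True
```

With Geddy and Neil free tonight and tomorrow but Alex free only tomorrow, proposing “tonight” returns `False` and leaves everyone’s calendar untouched, while proposing “tomorrow” books all three at once.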
Compensators
For those problems that occur outside of our own system, say eBay or PayPal doesn’t respond to an API call, transactional coding is not enough as you don’t have direct control over those external software systems. It’s their software and they set all the rules. There are still alternatives, though - in this case we use what’s called a “compensator” pattern. Basically all this means is this: if you try to do something with a partner and the partner says “nope”, you can’t tell the partner what to do, but you can change your own system to work around or redefine the problem.
Here’s a real-world example from our system. When a buyer visits our checkout system and selects PayPal as a payment option (assuming the standard PayPal payments option is configured), they are redirected to PayPal to complete that transaction at the end of the checkout process. Upon submitting the payment at PayPal, the buyer is redirected back to our checkout system where they see a completed order summary page. In 99.9% of cases, PayPal will also send us a secure confirmation of the payment alongside the redirected buyer, signaling to our system that payment has cleared and the seller can fulfill the order. Once in a while, though, that confirmation does not arrive from PayPal before the buyer lands on our confirmation page. Since we have no direct control over the PayPal system, we can’t force them to send the confirmation a second time, but what we can do is compensate for this known behavior in our own system, where we set the rules. In this case, we put the checkout into an intermediate “on hold” state, preventing the buyer from paying twice, and also preventing the seller from fulfilling the order before payment has been verified. In a manner similar to the queued retry mechanism discussed above, minutes or even hours later PayPal will send us the notification that the payment has cleared (or failed), and we then update the records in our system to inform the seller they can now fulfill the item (payment success) or inform the buyer that they need to resubmit payment (payment failure). This is all orchestrated in an automated manner - no need for a developer to manually fix this problem every time it happens. This is not just better for the developer, it’s better for business - machines can fix problems a lot faster than humans can. If the problem occurred at 2 in the morning, why wait until 9am to fix it?
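The “on hold” compensator boils down to a small state machine. The sketch below is a guess at the general shape rather than our actual checkout code: the state names, the `Checkout` class, and its methods are all made up for illustration.

```python
from enum import Enum


class CheckoutState(Enum):
    AWAITING_PAYMENT = "awaiting_payment"
    ON_HOLD = "on_hold"
    PAID = "paid"
    PAYMENT_FAILED = "payment_failed"


class Checkout:
    def __init__(self) -> None:
        self.state = CheckoutState.AWAITING_PAYMENT

    def buyer_returned(self) -> None:
        """Buyer lands back on the order summary page after the redirect."""
        if self.state is CheckoutState.AWAITING_PAYMENT:
            # No confirmation yet: hold the order rather than guess.
            self.state = CheckoutState.ON_HOLD

    def payment_notification(self, cleared: bool) -> None:
        """Confirmation from the payment provider, possibly hours late."""
        if self.state in (CheckoutState.AWAITING_PAYMENT, CheckoutState.ON_HOLD):
            self.state = (
                CheckoutState.PAID if cleared else CheckoutState.PAYMENT_FAILED
            )

    def can_fulfill(self) -> bool:
        """Seller may ship only once payment is verified."""
        return self.state is CheckoutState.PAID

    def can_pay(self) -> bool:
        """Buyer may (re)submit payment only when no payment is in flight."""
        return self.state in (
            CheckoutState.AWAITING_PAYMENT,
            CheckoutState.PAYMENT_FAILED,
        )
```

While the order is on hold, both `can_fulfill()` and `can_pay()` return `False`, which is exactly the pair of guarantees described above: no double payment and no shipping before the money is verified.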
Here’s another example: when pushing out new listings to eBay, every once in a blue moon (say 1 in 10,000 times), we’ll get an ambiguous response back from their API that does not confirm success or failure of the posting. Well, that’s bad: if we try to post the item again we may double post it (overselling the item), but if we do not post the item again, we risk missing a window to sell the item.
Although I suppose we could say “oh sorry, that was a problem with eBay”, we see that as the easy way out. Whenever possible we try to build our systems to compensate for those random failures (“self-healing”: reinflate the tire when you run over a tack, don’t blame the tack for being sharp!). In this specific example, eBay’s API can take a unique identifier to be associated with that posting action. If we submit a retry with that same unique identifier, eBay will then look up in their system to see if the item has already been posted. If it has, they respond with “Nope, already posted it. Have a nice day”. If it has not, they respond with “Thanks for the posting, I’ll put it up now”. Either way, the customer wins - what was once a hiccup in the system is now self-corrected with no human intervention necessary. Those pesky humans only get in the way anyway, just ask the monkeys.
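Retrying safely with a unique identifier is often called an idempotent retry. Here is a self-contained sketch of the idea. The `FlakyMarketplace` class is a toy stand-in I made up to simulate a lost response; it is not eBay’s actual API, and the response strings are invented.

```python
import uuid
from typing import Dict


class FlakyMarketplace:
    """Toy remote API that deduplicates by request ID and loses one response."""

    def __init__(self, drop_first_response: bool = True) -> None:
        self.posted: Dict[str, str] = {}  # request_id -> item
        self._drop_next = drop_first_response

    def post_listing(self, request_id: str, item: str) -> str:
        if request_id in self.posted:
            result = "already_posted"
        else:
            self.posted[request_id] = item
            result = "posted"
        if self._drop_next:
            # The listing was recorded, but the caller never hears back.
            self._drop_next = False
            raise TimeoutError("response lost")
        return result


def post_exactly_once(api: FlakyMarketplace, item: str, max_attempts: int = 3) -> str:
    # Generate the ID once and reuse it on every retry; a fresh ID per
    # attempt would defeat the deduplication and risk a double post.
    request_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return api.post_listing(request_id, item)
        except TimeoutError:
            continue  # ambiguous outcome: safe to ask again with the same ID
    raise TimeoutError("no definite answer after retries")
```

Either answer, “posted” or “already_posted”, tells the caller the item is live exactly once, so the ambiguous first response no longer forces a choice between double posting and not posting at all.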
Anyway, that’s about it for now. I crave feedback, so please feel free to send it my way. Was this article too technical or not technical enough? Too long-winded? Too short? Completely irrelevant? Let me have it! Otherwise, I’ll just blabber on about several other topics I have on my mind down the road. And remember: self-healing software today, self-fixing cars tomorrow, and soon…hover cars!
