Banking on Scalability
Grocery shopping…not a big fan. Cut-throat parking, endless waiting at the deli counter, squeaky shopping carts. No thanks. The checkout line is especially fun - I like to play a little game I call “find the fastest line”. The rules are simple: pick the one line that will get you the heck out in the least amount of time. Novice players oversimplify: find the shortest line and get in it. Amateurs. The pros know better: there are a dozen factors to consider. How full are the shopping carts? Any produce? How quick is the clerk? (Bonus points for managers.) Is there a bagger? Is anyone paying by check? And beware the ultimate evil: the customer with a stack of coupons. There will *always* be a dispute.
As amusing as this game is, it pains me to have to play it. Banks figured this out years ago: one line per clerk is wasteful; one line feeding all clerks is fair to all. If little Jimmy is depositing $50 all in pennies, no problem – you skip his teller and move to the next. No games, everyone gets a fair shot.
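For the curious, here is a toy simulation of the two checkout styles. It is a rough sketch under simplifying assumptions that are mine, not anything measured: everyone arrives at the same moment, customers in the grocery-style setup are dealt evenly across the lines, and the function names and service times are made up for illustration.

```python
import heapq


def average_wait_separate_lines(service_times, clerk_count):
    """One line per clerk; customers are dealt round-robin across the lines."""
    line_busy_until = [0] * clerk_count
    total_wait = 0
    for position, service_time in enumerate(service_times):
        line = position % clerk_count
        # A customer waits for everyone already ahead in the same line.
        total_wait += line_busy_until[line]
        line_busy_until[line] += service_time
    return total_wait / len(service_times)


def average_wait_shared_line(service_times, clerk_count):
    """One line feeding all clerks; the next customer takes the first free clerk."""
    clerk_free_at = [0] * clerk_count
    heapq.heapify(clerk_free_at)
    total_wait = 0
    for service_time in service_times:
        free_at = heapq.heappop(clerk_free_at)  # earliest available clerk
        total_wait += free_at
        heapq.heappush(clerk_free_at, free_at + service_time)
    return total_wait / len(service_times)


# Little Jimmy and his pennies (50 minutes), followed by seven quick customers.
customers = [50, 1, 1, 1, 1, 1, 1, 1]
separate = average_wait_separate_lines(customers, clerk_count=2)
shared = average_wait_shared_line(customers, clerk_count=2)
print(f"separate lines: {separate} min average wait")
print(f"shared line:    {shared} min average wait")
```

In this made-up scenario, the three people stuck behind Jimmy drag the average up in the separate-line setup, while the shared line simply routes everyone to the other clerk.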
Years ago at CA we made a fundamental design decision in our software that has paid off tenfold: use centralized “bank style” queuing to scale out our load efficiently. Back in the dawn of CA history (circa 2000) our very own product manager Rick Watson, then a “light side” disciple of our engineering group (bet you didn’t know that, huh?), introduced an architecture we refer to internally as “Generic Services”. What the name lacks in flair it more than makes up for in capability. Generic Services was our own homegrown version of Grid Computing, before the phrase “Grid Computing” got hip. There are many overly complex definitions of what Grid Computing is, but I’ll oversimplify with this one statement: “Grid computing lets you spread lots of work across lots of machines in an automated manner that uses all of these machines as efficiently as possible.” The enemy of a grid is an idle server – every minute your CPU sits at zero you’re wasting capacity. If you buy new servers while others are sitting idle, you’re wasting money.
Our Generic Services architecture is in many ways a simplified grid and works like this: there are many day-to-day tasks that our systems need to perform as part of doing business – post auctions, upload inventory, send emails, etc. When a developer writes code to solve these problems, they have to create software that can handle hundreds, thousands, or even hundreds of thousands of these tasks over the course of a day. Generic Services simplifies the solutions considerably: rather than writing complex code that can handle all of those thousands of tasks, you instead write code to handle just one task at a time, that’s it. The Generic Services infrastructure then takes this one solution and duplicates it – dozens or even hundreds of times within and across many different servers.
Okay, so now you have all of these copies running on all of these servers, but how does each “copy” know what to do? That’s where the “bank line” comes in. For whatever work needs to be done, say uploading 10,000 inventory items, the system splits all that work into the smallest self-contained unit of work possible. In this example, we’d put 10,000 messages in a queue, one for each inventory item. Each server pulls off work from the queue, first-come first-served. If one item takes longer than another, no problem - the other servers (or even other worker copies on that same server) will pick up the slack. This is no different than a group of bank tellers servicing a line of customers: some customers take longer than others, but when complete they will always say “next please” to the first person in line.
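To make the “bank line” a little more concrete, here is a minimal single-machine sketch of the pattern. It is not the actual Generic Services code: threads stand in for the worker copies spread across servers, an in-memory queue stands in for the central queue, and `upload_item` and `run_workers` are hypothetical names invented for this illustration.

```python
import queue
import threading


def upload_item(item_id):
    """Hypothetical handler: knows how to process exactly one inventory item."""
    return f"uploaded-{item_id}"


def run_workers(item_ids, worker_count=8):
    """Queue one message per item, then let identical workers drain the queue."""
    work = queue.Queue()
    for item_id in item_ids:
        work.put(item_id)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                # "Next please": take whatever is at the front of the line.
                item_id = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do, this worker goes idle
            outcome = upload_item(item_id)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


processed = run_workers(range(10_000), worker_count=8)
print(len(processed), "items processed")
```

The handler only ever sees one item, which is the whole point: scaling up is a matter of raising `worker_count` (or, in the real thing, adding servers), not rewriting the handler.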
The Big Picture
So this is fascinating and all, but you may be asking “why should I care?” There are several benefits our customers indirectly get from this architecture. First and foremost, this solution leaves us well equipped to handle sharp increases in load. For example, as everyone knows, eBay runs pricing promotions from time to time that encourage new listings through their marketplace. The mother of all of these promotions is a rare event called “Free Listing Day”, last held in 2003 I believe. Because this is such a great promotion for our customers, they obviously want to maximize their benefits by pushing out as many listings as possible within a 24-hour period. As you can probably guess, this results in a sharp spike in the posting load flowing through our systems, historically as high as 10x the daily average.
Handling spikes like this is where Generic Services and the whole “grid” concept shine. As load ratchets up, we tune our systems to handle that load. This can be done in two ways: if the increase is anticipated to be permanent (e.g. adding more customers to the system), then we purchase new servers and plug them into the grid. If the load is a temporary spike (like Free Listing Day), we can temporarily reallocate extra capacity for those affected services (say posting items) at the expense of less time-critical services (say filing Unpaid Item requests with eBay), all without purchasing additional hardware or changing our code. This type of tradeoff gives us the flexibility to respond instantly to changing market conditions on a day-by-day, or even minute-by-minute, basis, based on where the needs warrant.
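The temporary-spike case boils down to shuffling worker copies between services while the total stays fixed. Here is a small sketch of that idea; the service names, the worker counts, and the `reallocate` function are all invented for illustration and say nothing about how our real tuning is configured.

```python
def reallocate(allocation, donor, recipient, workers):
    """Move worker copies from one service to another on the same hardware."""
    if workers < 0:
        raise ValueError("workers must be non-negative")
    if allocation[donor] < workers:
        raise ValueError(f"{donor} only has {allocation[donor]} workers to give")
    updated = dict(allocation)  # leave the original allocation untouched
    updated[donor] -= workers
    updated[recipient] += workers
    return updated


# Hypothetical everyday split of 60 worker copies across three services.
normal_day = {"post_items": 20, "send_emails": 20, "unpaid_item_requests": 20}

# Free Listing Day: borrow from the less time-critical service.
spike_day = reallocate(normal_day, "unpaid_item_requests", "post_items", 15)
print(spike_day)
```

Total capacity is the same before and after; only the split changes, which is why no new hardware or code changes are needed for a temporary spike.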
Other benefits include quicker implementation of new features (just plug into the framework and you get instant scalability), standardized quality control (no need to write new monitors and error correction routines for each new feature, just plug into the existing framework and benefit instantly), and an efficient model to allow us to anticipate the growing capacity needs of our customers.
Cool stuff. So remember, next time you chat with Rick Watson or see him at one of our upcoming conferences, make sure to thank him for “Generic Services”.
