I hope everyone is enjoying their Q4 right now. I know all of our servers are appreciating the extra love and attention from the holiday shoppers.

While everyone is frantically fulfilling orders, and maybe placing a few of their own, I felt this would be the perfect time to talk a bit about how all of that data is stored in our system. Fascinating holiday reading I know, so I’ll try my best to keep this entertaining and stick to the high level points of why your business should care.

Data is King
In my last 2 posts, I’ve focused on how all the “work” in our system gets distributed by bank style queues and self-heals from “evil monkey” errors through rollbacks and retries. Now it’s time to focus on the data itself. Every single scrap of information that powers your business: from the titles of your inventory to the buyers who purchase your products to the pagination setting you use in your views, all of it is stored in a database. Specifically at ChannelAdvisor, it’s a Microsoft SQL Server 2000 database (yeah yeah, we know we need to upgrade to 2005, but we are always conservative in upgrading our servers until several Microsoft patches have burned in first). If all of the application servers that processed the “work” were the linemen, the database would be the quarterback. The database holds all of the cards – it knows the score, the play, and where to pass the ball. Data is king!

As Spider-Man would say, with great power comes great responsibility (although I still think Batman could whoop him). With such crucial information stored in the database, many steps must be taken to make sure that data stays safe. The first of these steps we take at CA is to distribute this information across many servers – last I counted, I believe we had 21 quad-CPU, 32 gig Dell database servers actively serving the Merchant and Pro products, at which this article is geared. These are not your typical MP3 servers; these suckers have some honkin’ horsepower available to service our customers’ needs.

So with all of these servers, we have a lot of customer data to store. To service this, we assign each and every account to a specific server. One server will handle many accounts, but no single account will live on more than one server. Sounds risky? It’s not; bear with me on that until the next section. Why only one server? A simple reason: it’s fast. Spreading one account’s data across multiple servers has merits and is an approach we are considering for some future architectures, but it has the disadvantage of slowing things down when all of that data has to be aggregated over the network. Aggregating on a single server is much faster, which is good for you and your customers. So which server are you on? The good news is you don’t need to care. We handle all the globitty-gack for you under the covers. Still want to know? Well, a magician doesn’t share all of his secrets, but I’ll give you enough knowledge to be dangerous.

Each and every account in our system has a randomly assigned unique number associated with it. Think of this as your social security number, but not as confidential. This number uniquely identifies your account, and follows your data everywhere – through the bowels of the “bank-like” queues, past the gates of the evil monkeys, and onto the promised land of your database server. Whenever our software does anything with your data, it first takes this number and looks up which server it should be using. Think of this like using the phone book in reverse – given a phone number, you look up an address. The software performs this lookup with every call, and it does so very fast.

Given the results of this lookup, the software then knows exactly where to go. All the data is written out to that one server, and read back as needed. This architecture allows us to scale – as we and our customers grow, we just allocate more servers, defining more and more unique IDs mapping to those servers. All of the data is insulated and shielded by this number, so there is never a danger of one account’s data bleeding into another’s. (And if you’re worried about hackers knowing your number, don’t: all of these safeguards are hardwired into our internal network. Even if a hacker knew your number, there’s nothing they could do with it, since they can’t change the code in our internal network systems – I can talk about the considerable security measures in a later blog perhaps.)
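For the code-minded, the reverse phone book can be sketched in a few lines of Python. This is only an illustration, not our actual code: the class name `ServerDirectory`, the server names, and the account numbers are all made up for the example.

```python
class ServerDirectory:
    """Hypothetical reverse phone book: account number -> database server."""

    def __init__(self):
        self._server_for_account = {}

    def assign(self, account_id, server):
        # An account lives on exactly one server; assigning again moves it.
        self._server_for_account[account_id] = server

    def lookup(self, account_id):
        # Done on every call, before any data is read or written.
        return self._server_for_account[account_id]

    def accounts_on(self, server):
        # One server handles many accounts.
        return sorted(
            account
            for account, home in self._server_for_account.items()
            if home == server
        )


directory = ServerDirectory()
directory.assign(48213, "db07")
directory.assign(90155, "db07")  # second account on the same server
directory.assign(17764, "db12")  # growth: map new IDs to a newly added server
```

Here `directory.lookup(48213)` returns `"db07"`, and every read or write for that account would then go to that one server. Adding capacity only means adding new entries that point at new servers.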

So what’s your number? I’m not going to tell you. If you’re industrious you can probably figure it out; it’s not a big trade secret. No, I’m being obscure here not because of a big security risk, but because it’s risky for you to get too attached to a specific number. As our system grows, we need to reassign these numbers from time to time to rebalance the load, and becoming dependent on a specific number can get you into trouble.

The way our system is designed right now, we have all the hooks in place to forward traffic from old numbers to new numbers whenever we have to change them. Think of this like postal address forwarding. This approach depends on our customers not hard-wiring their own URLs to the old IDs, though. This has never been a problem in the past because we do not advertise this number, and we make it clear this number should be considered “internal” and subject to change. For this reason I ask that you respect that convention for your own best interest, and just rest assured you have a unique ID in the system that will help your business grow.
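The address-forwarding idea looks roughly like the Python sketch below. Treat it as an assumption-laden illustration rather than our real hooks: `AccountForwarding`, `reassign`, and `resolve` are hypothetical names.

```python
class AccountForwarding:
    """Hypothetical forwarding table: retired account number -> current number."""

    def __init__(self):
        self._forward = {}

    def reassign(self, old_id, new_id):
        # Record that traffic for old_id should now go to new_id.
        self._forward[old_id] = new_id

    def resolve(self, account_id):
        # Follow forwarding entries until we reach a number that is current.
        seen = set()
        while account_id in self._forward:
            if account_id in seen:
                raise ValueError("forwarding loop detected")
            seen.add(account_id)
            account_id = self._forward[account_id]
        return account_id
```

A number that has been reassigned twice still resolves to its latest value, which is exactly why nobody outside the system should hold on to an old one: the forwarding table is the only thing keeping it alive.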

Preventing Single Point of Failure
So I mentioned above that the account data lives on a specific server. So what if that server dies? That’s bad, right? Well yes, but not too bad. This falls into the “design for failure” mantra I was talking about in my earlier blog. Failures will happen, so it behooves us to design a system that handles them. For this particular problem, we use two well-established, industry-standard mechanisms to protect your data.

The first is a Storage Area Network, or SAN as all the hip engineers call it. SANs are what made EMC rich and famous (maybe a little too rich in the late 90s). Think of a SAN as a massive network of hard-drives with many computers tied into that network. So rather than the single hard drive living in your PC, you now have many of these suckers running side by side in a big scary black cabinet with lots of blinking lights. When you write out your data, instead of going to one hard drive, it can go to several, each with separate copies of that data. Therefore, if one hard drive dies, you still have a copy of your data on another, ready to be served up.
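As a toy illustration of why separate copies help, here is a small Python sketch of mirrored writes. It is not how a real SAN is implemented; `MirroredVolume` and its methods are invented for this example, and each “drive” is just a dictionary.

```python
class MirroredVolume:
    """Toy model: every write goes to all healthy drives."""

    def __init__(self, copies=2):
        self.drives = [dict() for _ in range(copies)]
        self.failed = set()

    def write(self, key, value):
        # Store a separate copy on each drive that is still healthy.
        for index, drive in enumerate(self.drives):
            if index not in self.failed:
                drive[key] = value

    def fail_drive(self, index):
        # Simulate a dead hard drive: its contents are gone.
        self.failed.add(index)
        self.drives[index].clear()

    def read(self, key):
        # Serve the data from the first healthy drive that has it.
        for index, drive in enumerate(self.drives):
            if index not in self.failed and key in drive:
                return drive[key]
        raise KeyError(key)
```

Write an order, kill drive 0, and the read still succeeds from drive 1. Only when every copy is gone does the read fail.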

While the massive redundancy in a SAN is of course a huge benefit, you can also get redundancy in cheaper solutions like direct attached RAID storage. What makes a SAN nice is the “network” part I mentioned above. In SAN terminology they call this a “fabric”. Basically it just means more than one server can tie into all this great available storage, and do it very fast. (1 gigabit per second transfer rates are not uncommon, faster if you break up the data into “chunks”, “striping” across many different hard drive “spindles” all running in parallel – buzzwords aplenty).
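If the “striping” buzzword is new to you, this short Python sketch shows the basic idea: cut the data into chunks and deal them out round-robin across the spindles. The function names and the chunk size are assumptions for illustration only, and real storage hardware does this well below the level of Python.

```python
def stripe(data: bytes, spindles: int, chunk_size: int):
    """Deal fixed-size chunks of data round-robin across the spindles."""
    layout = [[] for _ in range(spindles)]
    for n, start in enumerate(range(0, len(data), chunk_size)):
        layout[n % spindles].append(data[start:start + chunk_size])
    return layout


def unstripe(layout):
    """Reassemble the original bytes by reading the spindles row by row."""
    pieces = []
    rows = max(len(spindle) for spindle in layout)
    for row in range(rows):
        for spindle in layout:
            if row < len(spindle):
                pieces.append(spindle[row])
    return b"".join(pieces)
```

Because each spindle holds only a fraction of the chunks, they can all be read or written at the same time, which is where the extra speed comes from.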

The network aspect is extremely beneficial for database servers. Now you can configure your database to write all of its information to the SAN, and if for whatever reason that database server dies (circuit board fries, network cards flake out, asteroid hits, whatever), you can use software to allow another server to kick in automatically and handle the load. In our case, we use Microsoft SQL Server Clustering Software. I’m not a clustering expert so I won’t even try to go into the boring details on this, but from a mile-high level it works like this: for every 3 “active” database servers that we have, all handling customer requests as described above, we also have 1 “passive” database server that is just sitting there waiting for something to fail. If any one of the “active” servers hits a problem, the cluster software automatically takes that server offline, and replaces it with that “passive” warm standby.

This entire process takes less than 30 seconds, and does not require any human intervention. This means even if something fails at 6 am on Christmas morning (the on-call technician’s worst nightmare), the failover will happen automatically with minimal disruption to our live systems (and those that are disrupted will rollback and retry like I discussed in the earlier blog). After this occurs, an alert is sent out to a technician, who then starts working on fixing the hardware in the system that failed.
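The 3-active, 1-passive arrangement can be sketched in Python as follows. To be clear, this is a simplified model of the behavior described above and not Microsoft’s clustering software or its API; the `Cluster` class, the server names, and the alert text are all hypothetical.

```python
class Cluster:
    """Toy model of a failover cluster: active servers plus warm standbys."""

    def __init__(self, active, passive):
        self.active = list(active)
        self.passive = list(passive)
        self.alerts = []

    def report_failure(self, server):
        if server not in self.active:
            raise ValueError(f"{server} is not an active server")
        if not self.passive:
            # No standby left: take the server offline and page a human.
            self.active.remove(server)
            self.alerts.append(f"{server} failed; no standby available")
            return None
        # Swap the failed server for the standby, no human involved.
        standby = self.passive.pop(0)
        self.active[self.active.index(server)] = standby
        self.alerts.append(f"{server} failed; {standby} took over")
        return standby
```

With `Cluster(["db01", "db02", "db03"], ["db04"])`, a failure of `db02` promotes `db04` automatically and queues an alert for the technician. A second failure before the hardware is fixed finds no standby left, which is the two-failures scenario.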

Now sure, I can already see at least one wise guy out there getting ready to say this: “so what if 2 servers failed at the same time?” Yes, this is theoretically possible, but the odds are incredibly low. And even if this did occur, we could still bring one of the other passive cluster nodes online to help out; it would just take a little longer to set up. The good news is the data would never be lost; there would just be a delay of service. This is the type of tradeoff all enterprise service companies must face – how much redundancy do you build in? The more the better of course, but more redundancy comes at a high cost, and with increasing cost to us there is increasing cost to our customers. At the end of the day you must balance cost with acceptable risk, and have backup strategies in place for all contingencies. In this case, I feel we struck a reasonable balance.

Anyway, hopefully I didn’t put everyone to sleep with this one. Databases are so key to all of our systems, but can definitely be a bit of a dry topic. If you have any questions, feel free to post a reply or drop me a note directly at isham@channeladvisor.com. Happy holiday shopping. Keep doing your part for the economy!