CA Development Lifecycle Episode 3…Chucky Lives! (or something)
Welcome back! Today I’m going to keep grinding through the ChannelAdvisor Product Development lifecycle I introduced earlier in my first and second posts on this topic. If you have not already read those posts, I highly recommend you take a moment to read through those as well as this other required reading.
When we last left our heroes, we had finished reviewing Software as a Service, the different “environments” we utilize at CA for deploying code, what a Change Request (CR) is, and how CRs are classified into defects, enhancements, work orders, and architecture issues. Today I’m going to delve into our actual processes for creating and deploying software.
CRT Release
CRT stands for “Change Request Team” and is a concept originally introduced by Ralph Kasuba, our VP of Engineering (aka “chief squire”), way back in mid-2005 (trust me, 1.5 years is “way back” for us). The CRT process was created as an alternate process to streamline deployment of “smaller” changes to our production environment in a timeline of weeks rather than months. Prior to these “CRT releases”, we found ourselves continually frustrated because even the smallest, most trivial changes to the system (content changes, a new setting here or there, etc.) would sometimes have to wait 2-3 months to deploy to our production environment, since they first had to be synchronized with all of the other large code changes going into the next release. This was akin to the trap Microsoft frequently found itself in when many of their products had to wait for the new version of Windows to ship.
Of course, 2-3 months to wait for a small change is too long, so we realized that was something that had to be addressed sooner rather than later. Fortunately, the SaaS model gave us some good options for flexibility here (since, remember, you can deploy fixes immediately), and this is where the CRT process came in to shine.
So here’s how it works: As I mentioned in the last blog post, all of these small changes to the system are tagged with a Change Request ticket, be it an enhancement, defect, or whatever. Every other week, the Product Manager assigned to each product team (and you’ve already gotten to know a few of them through this blog, including Rick Watson, Max Leisten, Joe Brown, and Kevin McCarthy), works with the corresponding Product Development Team Leader to assemble a list of which CRs should be tackled in the next “CRT Release”.
Each development team then devotes the time equivalent of one full-time developer (although that work may be split up across several developers in practice) to work through the prioritized list of CRs defined in the prior step. These CRs go through an accelerated development schedule: the Product Manager provides any requirements needed for the change (“the new shipping setting must do X, Y, and Z”), the developer implements it in a timespan generally no greater than 1 week, and then our crack Quality Assurance team dives in to make sure the change (A) works and (B) did not mess up anything else in the system (“sorry, you can’t post any new auctions now, but you can now highlight your shipping icons in Candy Apple Red!”). Once all the changes are validated and flagged for deployment, they are also merged into our “long term” code branch, which is queueing up for the next big release (more on that in my next post).
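For the code-minded, here is a rough sketch of how that planning step could be modeled. It is only an illustration, not our actual tooling: the ChangeRequest fields, the priority scale, the 10 developer-day capacity (one full-time developer over a two-week cycle), and the 5-day cap per CR are all assumptions I made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    cr_id: str
    kind: str             # "defect", "enhancement", "work order", or "architecture"
    priority: int         # hypothetical scale: 1 = most important
    estimate_days: float  # hypothetical field: developer-days of work


def plan_crt_release(backlog, capacity_days=10.0, max_days_per_cr=5.0):
    """Pick CRs in priority order until one developer's time is used up.

    capacity_days and max_days_per_cr are assumptions for this sketch:
    roughly one developer for two weeks, and no single change over a week.
    """
    selected = []
    remaining = capacity_days
    for cr in sorted(backlog, key=lambda c: (c.priority, c.estimate_days)):
        if cr.estimate_days > max_days_per_cr:
            continue  # too big for a CRT release; better suited to a larger release
        if cr.estimate_days <= remaining:
            selected.append(cr)
            remaining -= cr.estimate_days
    return selected


backlog = [
    ChangeRequest("CR-1", "defect", 2, 3),
    ChangeRequest("CR-2", "enhancement", 1, 4),
    ChangeRequest("CR-3", "architecture", 1, 8),
    ChangeRequest("CR-4", "work order", 3, 4),
    ChangeRequest("CR-5", "defect", 3, 2),
]
release = plan_crt_release(backlog)
print([cr.cr_id for cr in release])
```

In real life the Product Manager and Team Leader make this call with a lot more judgment than a greedy loop, but the basic constraint is the same: a fixed slice of developer time, filled from the top of a prioritized list.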
Our CRT releases generally go out twice a month, usually on the first and third Wednesday of each month, but that is subject to variance from time to time and not something you should count on. Since the changes are usually small, we generally do not have to bring the systems offline for this deployment. Instead, we can toggle half of the servers into an “offline” mode while they receive the code changes, and then flip-flop: those servers come back online while the other half is taken offline to receive the same changes. This is actually a pretty fascinating topic in itself (well, at least to nerds like me), so maybe that’s something I can talk about in a future post.
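Here is a minimal sketch of that flip-flop idea, assuming a simple list of server names and a caller-supplied apply_change function. Both are hypothetical stand-ins; the real deployment involves far more than this.

```python
def flip_flop_deploy(servers, apply_change):
    """Update half of the servers at a time so the other half keeps serving.

    Returns one (updated_half, still_online) pair per pass.
    apply_change is a hypothetical callback that pushes code to one server.
    """
    if len(servers) < 2:
        raise ValueError("need at least two servers to keep one half online")

    midpoint = len(servers) // 2
    halves = [servers[:midpoint], servers[midpoint:]]
    online = set(servers)
    passes = []

    for half in halves:
        online.difference_update(half)  # toggle this half "offline"
        for server in half:
            apply_change(server)        # push the code changes
        passes.append((tuple(half), tuple(sorted(online))))
        online.update(half)             # bring the updated half back online

    return passes


updated = []
passes = flip_flop_deploy(["web1", "web2", "web3", "web4"], updated.append)
for half, still_online in passes:
    print("updating", half, "while serving from", still_online)
```

The point of the design is that at no moment is the whole farm offline: each pass leaves the other half of the servers taking traffic.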
All in all our CRT process has been a great success for us! (and I’m not just saying that because of the shameless plug for my boss above…ahem *raise* ahem. cough.). In actuality, planning for quick turnaround on smaller, more tactical changes is a requirement for an agile company, and something that benefits our customers as well as ourselves. But what do we do when even 2 weeks is too long? It happens, so read on…
Hotfix
I’m not even sure where we originally came up with this term. Maybe it comes from the hardware world of firmware upgrades, or maybe somebody was just angry when they had to make a quick fix at 3am some morning long in the past. Dunno, but somehow it stuck. Regardless, a “hotfix” is how we internally refer to defect fixes that “just can’t wait.” These are a reality of SaaS - a partner may change their APIs overnight (won’t name names there) or a system that was working yesterday just decided to go “bump” in the night. It happens, and we need to be prepared to handle it.
Fortunately we have a process for dealing with this and it flows like such:
1. The problem is identified by our monitoring system, internal staff, or sometimes our customers (we consider this a failure on our part if a customer has to tell us first, but I won’t lie, sometimes it happens. When it does happen, though, we take it pretty seriously that we need to review the monitors watching that system and make sure that does not happen again - I may go into our monitoring processes in a future blog).
2. The problem is entered into our Change Management system as a severity 1 or 2 problem. This system in turn sends out pages to the appropriate on-call staff, day or night, any day of the week.
3. If the problem is indeed a software/code problem, the developer who owns the affected component area will analyze and make a fix to the problem. This fix is then deployed to our staging environment for validation. Generally these fixes are optimized to provide the least amount of risk, with a more elegant solution going out in a later release.
4. An on-call QA Engineer will validate the changes in the staging environment, and then (if passed) signal the change is ready for deployment to our release engineering team.
5. The release team deploys the change to our production environment, and the QA engineer validates the change once again, this time in production. Assuming all is well, the ticket is closed and any affected parties are notified.
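If it helps to see those five steps as a pipeline, here is a small sketch. The stage names and the hotfix_path function are invented for illustration, and what happens when a check fails (the sketch simply stops at that stage) is my simplifying assumption rather than a description of the real system.

```python
from enum import Enum


class Stage(Enum):
    REPORTED = 1                # step 1: problem identified
    PAGED = 2                   # step 2: severity 1 or 2 pages on-call staff
    FIX_ON_STAGING = 3          # step 3: developer's fix deployed to staging
    QA_PASSED_STAGING = 4       # step 4: on-call QA validates in staging
    DEPLOYED_TO_PRODUCTION = 5  # step 5: release team deploys to production
    CLOSED = 6                  # step 5: QA validates in production, ticket closed


def hotfix_path(severity, is_code_problem=True, staging_ok=True, production_ok=True):
    """Return the stages a problem passes through, stopping at the first gate it fails."""
    path = [Stage.REPORTED]
    if severity not in (1, 2):
        return path  # assumption: lower severities do not go down the hotfix path
    path.append(Stage.PAGED)
    if not is_code_problem:
        return path  # not a software problem, so no code fix is made
    path.append(Stage.FIX_ON_STAGING)
    if not staging_ok:
        return path  # QA did not pass the fix in staging
    path.append(Stage.QA_PASSED_STAGING)
    path.append(Stage.DEPLOYED_TO_PRODUCTION)
    if production_ok:
        path.append(Stage.CLOSED)
    return path


print([stage.name for stage in hotfix_path(severity=1)])
```

Notice that QA appears twice, once in staging and once in production. That double check is what lets a fix move this quickly without flying blind.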
The key here is speed, and this whole process is sometimes performed in as little as 15-30 minutes, even at 3am on a Sunday morning. The engineering staff managing this process is among our greatest assets, and definitely the unsung heroes of the company. The team takes great pride in keeping the systems up 24/7, and busts their butts whenever a critical problem comes up, no matter what time of day, or what day of the year. It’s a fantastic team that I’m very proud to be working with…Group Hug!
In our next Episode…
Alrighty, time to put away the Kleenex, we’ve made some great progress here. I warned you there was a lot to cover here, but we’re getting close to the grand finale. Fonzie is strapping on his water skis and getting ready to show those pesky sharks what it takes to be cool…heeyyyy!
In my next post I’ll delve into our “Long Term Releases” - the bigger releases that crank out some of our larger features like Matrix Inventory, and hopefully I’ll have some time left to talk about Agile/Scrum methodologies and wrap this series up in a big bow.
Same bat time…
