Canary Builds for Split Testing and Continuous Delivery

When we made the switch to Continuous Delivery in 2012, one challenge we needed to overcome was how to maintain active split tests while still allowing the team to deploy as soon as a story was finished.

Some of the things we wanted to achieve with our Continuous Delivery model:

  1. Increase flow in our delivery process
  2. Catch any bugs that slipped through
  3. Identify any usability issues
  4. Realise value from our code as soon as it is generated

We were intent on measuring our A/B tests and multivariate tests rigorously to ensure we had appropriate confidence in our conclusions. And this is where things can get tricky. If we’re running an A/B test whilst at the same time continually deploying changes to production, how can we avoid polluting our test results? We could try managing this in code, keep track of things in our heads, or keep a register of live tests and try to steer clear of anything that might have an impact. All of those approaches seemed contrary to helping a team generate flow.

Instead, the approach we took was to introduce “Canary Builds”. From what I can tell, the term was coined by the Chromium team for the edge builds they release for developers and early adopters. Our usage of the term is a little different in that we’re releasing to an advance party of users, but those users are randomly selected.

We handled A/B tests in many different ways. Some tests were run using client-side JavaScript tools like Google’s Website Optimizer or Visual Website Optimizer, which do a great job of presenting results and make small changes easier to test. However, some of the tests we wanted to run required more fundamental changes to the product offering. For example, if we wanted to test what happened when we reranked the results of our hotel availability search, or changed the “hero” image selection criteria, we needed to make server-side changes.
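For illustration, the bucketing behind a server-side test can be as simple as hashing a visitor identifier into a variant. Here’s a minimal Ruby sketch with invented variant names; this isn’t our production code:

```ruby
require 'digest'

# Variants for a hypothetical search-reranking experiment.
VARIANTS = %w[control rerank_by_price].freeze

# Hash the visitor id so the same visitor lands in the same
# variant on every request, without any shared state.
def variant_for(visitor_id)
  bucket = Digest::MD5.hexdigest(visitor_id.to_s).to_i(16) % VARIANTS.size
  VARIANTS[bucket]
end
```

Deterministic bucketing like this matters because a visitor who flips between variants mid-session pollutes the results just as surely as an untracked deploy does.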

In the process of solving our server-side split testing challenge, we adopted an approach to deployment and infrastructure configuration that deploys different sets of code to different servers. We created a Chef cookbook that sets up a load balancer to drive A/B split tests, splitting traffic between experiment variants. The variant name was passed through to the application, and from there to StatsD and Graphite, Omniture, and other usage-data repositories such as our Real User Metrics datastore and our hotel pricing and availability datastore.
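Reporting a variant-namespaced metric to StatsD takes little more than a UDP packet. A rough sketch, assuming StatsD on localhost:8125 and an invented metric name:

```ruby
require 'socket'

STATSD = UDPSocket.new

# Emit a StatsD counter namespaced by variant so Graphite can
# chart each experiment arm separately.
def track(variant, metric)
  STATSD.send("experiments.#{variant}.#{metric}:1|c", 0, 'localhost', 8125)
end

track('rerank_by_price', 'hotel_search.conversion')
```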

So A/B testing for server-side changes was solved for the time being. The next challenge was how to keep doing Continuous Delivery (and continuous deployment for some of our apps) without disrupting our split tests. Given the low volume of data we were getting, some of the tests needed to run for a week or more. During that time we still needed to deploy our new code, keep flow in the team, get user feedback fast, and catch any bugs that slipped through.

Our Tech Lead Dave Nolan provided an ingenious solution that turned out to deliver value above and beyond just side-stepping the conflict between A/B testing and continuous deployment. Our Chef cookbook was modified to optionally send a configurable proportion of our traffic to a Canary Build. We adopted 5% of traffic as a working default. The result was that the tagged versions of our apps kept serving the split tests until we had enough data to be confident in our conclusions, while the latest version of master was presented to real users after every commit. Our feedback loops stayed short, and we had the added benefit of only subjecting a small percentage of users to the new software, which worked for us because we had adopted an attitude that prioritises MTTR over MTBF.
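In Chef terms, the change amounts to rendering the load balancer config with a weight attribute. A minimal sketch, assuming an HAProxy load balancer and hypothetical attribute and template names (our real cookbook differs):

```ruby
# Send a configurable slice of traffic (default 5%) to the canary
# pool; the remainder goes to the pinned, tagged experiment variants.
canary_weight = (node['experiments'] || {})['canary_weight'] || 5

template '/etc/haproxy/haproxy.cfg' do
  source 'haproxy.cfg.erb'
  variables(canary_weight: canary_weight)
  notifies :reload, 'service[haproxy]', :delayed
end

service 'haproxy' do
  supports reload: true
  action [:enable, :start]
end
```

The ERB template would then give the canary backend that weight and the stable pool the rest, so dialling the canary up or down is a one-attribute change.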

Of course, all of this introduced some new challenges around how we persist our application data, handle migrations, and the like, but I’ll talk about that in another post. We also still needed to address how we would capture value from our code as soon as possible, and I hope to write something about our early work with Myna.

I’d love to hear how other teams manage concurrent split testing and Continuous Delivery.