[LRUG] Deployment approach to rolling out huge changes

John Arundel john at bitfieldconsulting.com
Fri Sep 7 05:56:21 PDT 2012


On 7 Sep 2012, at 10:28, Ed James (Alt) wrote:
> 
> Again, thanks Paul. Some sound advice.
> 
> Once we've decided on an approach I'll post again to let people know what we've done. Might be useful for the next guy.

Some very good points made in this thread. It's of particular interest to me since it's the kind of thing I've been involved in throughout my career in infrastructure work. I've seen some very large, even global deployments, which haven't managed to get this right. Here are a few ideas I've picked up, often from seeing it done wrong.

Essentially, it's a wicked problem: each deployment situation is unique, there are no hard-and-fast rules, no right answer, and generally you only get one shot. You can make things as idiot-proof as you like, and the Universe will just come up with better idiots.

However, maybe we can describe some general principles which will help us. There are several reasons why things go wrong on deployment, including:

* The code didn't work the way it was supposed to
* The server configuration was different to what we expected
* Real data was different to our test data

Although you can never eliminate these problems entirely, you can reduce them asymptotically, depending on your time and money budget, and you can mitigate their worst consequences.

To deploy with confidence, you need good tests, which are as close as possible to what happens in production. So not just unit tests, but behavioural tests that exercise the full stack: a user hits the login page, submits the form, sees her account page, changes her details, sees the results, browses some products, adds a product to her cart, adds another product, removes the first product, checks out, pays, gets a confirmation email. As Paul pointed out, people click random things in unexpected ways, so ideally your tests would include replayed sample data from production log files.
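
For instance, a full-stack test of that kind might look roughly like this sketch (assuming RSpec, Capybara and FactoryGirl; the paths, labels and factory are invented for illustration):

    # spec/features/checkout_spec.rb -- a rough sketch, not a drop-in spec
    require 'spec_helper'

    feature "Checkout" do
      scenario "a user logs in, shops and pays" do
        user = FactoryGirl.create(:user)    # assumes a :user factory exists

        visit "/login"
        fill_in "Email",    with: user.email
        fill_in "Password", with: "secret"
        click_button "Log in"

        visit "/products/widget"
        click_button "Add to cart"
        click_button "Checkout"

        page.should have_content("Thanks for your order")
      end
    end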

Ed mentioned testing payments, which is usually a tricky one because it involves a third party. Mocking can help, either at the network or the API level, and some payment providers have a dummy API which only differs from the real one in the URL. Alternatively, sometimes you can pass them a test callback instead of a live one. But testing only gets you so far, and I think it's a good idea to also do real payments with real money (hopefully your own). 
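
At the network level, that kind of stub might look something like this (using WebMock; the gateway URL and response body are made up):

    # A rough sketch: fake the payment gateway's HTTP endpoint with WebMock
    require 'webmock/rspec'

    stub_request(:post, "https://payments.example.com/charges").
      to_return(status: 200,
                body: '{"status":"paid","id":"ch_123"}',
                headers: { "Content-Type" => "application/json" })

    # The code under test can now "charge" the card without any money moving.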

You also need to be sure that your test / staging / acceptance environment is as close as possible to production, which in practice means using automated builds. Could you wipe and rebuild your production servers from configuration management every night and be confident they'll work in the morning? If not, why not? If you do much work in the cloud you'll find the Universe often runs this test for you, so you'd better pass.
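
As a simplified example of what I mean by configuration management, with a tool like Chef a server's setup lives in code, so rebuilding it is just a matter of running the same recipes against a fresh machine (resource and template names here are illustrative only):

    # A minimal Chef-style sketch of a web server's configuration
    package "nginx"

    template "/etc/nginx/nginx.conf" do
      source "nginx.conf.erb"
      notifies :reload, "service[nginx]"
    end

    service "nginx" do
      action [:enable, :start]
    end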

Ed isn't alone in having important cron jobs which run in production and mustn't be duplicated. This isn't just an issue for testing: where you have multiple servers for redundancy or load balancing, you also need to ensure that jobs don't run in more than one place. When you run these jobs in testing, you need to mock or stub out anything which changes the world (a fake SMTP server, for example). When you're running multiple copies of production, or transitioning between releases, you need some way to enforce a critical section. A database token is one way to do this: provided you have an atomic test-and-set operation (e.g. using a transaction), you can grab the token and block all other copies of yourself from running.
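
A sketch of the database-token idea, assuming ActiveRecord and a job_locks table with a unique name column and a locked_at timestamp (the model and job names are made up):

    class JobLock < ActiveRecord::Base
      # A single UPDATE is atomic, so only one process can win the token.
      def self.acquire(name, stale_after = 1.hour)
        where(name: name).
          where("locked_at IS NULL OR locked_at < ?", stale_after.ago).
          update_all(locked_at: Time.now) == 1
      end

      def self.release(name)
        where(name: name).update_all(locked_at: nil)
      end
    end

    # In the cron job itself:
    if JobLock.acquire("nightly_billing")
      begin
        run_nightly_billing
      ensure
        JobLock.release("nightly_billing")
      end
    end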

The data problem is, I think, the hardest. Test data isn't real enough and, as Paul says, live data is too real. Also, real data is sometimes very big. A good approach is often to mix test data, carefully constructed to exercise edge cases and regressions, with a random sample of live data, to break that obscure bit of code which can't handle invalid UTF-8 characters.
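
In practice that can be as simple as dumping a random slice of production and scrubbing it before loading it into staging; a rough sketch, assuming ActiveRecord on PostgreSQL (the model, the columns and the use of RANDOM() are all assumptions):

    # Pull a random sample of live records and anonymise the personal bits
    User.order("RANDOM()").limit(500).each do |user|
      user.email = "user#{user.id}@example.com"   # scrub anything identifying
      puts user.attributes.to_json                # dump for loading into staging
    end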

Arguably, the smaller the change you're deploying, the less chance of it breaking something, which is where continuous deployment can be helpful. That is to say, every time you check in, complete tests are run, and if they pass, your change is pushed to production. This requires excellent test coverage and comprehensive automation, and it continually tests not only your app, but your deployment and configuration management stack.
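
Stripped of the CI server, the core of it is no more than "deploy if the tests pass"; something like this Rakefile fragment, assuming RSpec and Capistrano (the task name and commands are illustrative):

    # Rakefile -- a bare-bones "deploy on green" sketch
    task :deploy_if_green do
      if system("bundle exec rspec")              # run the full test suite
        sh "bundle exec cap production deploy"    # only then push to production
      else
        abort "Tests failed; not deploying."
      end
    end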

However, some changes just can't be broken down into lots of little, incremental, reversible steps. In this case, you have to do one of the following:

* do a big-bang deployment and hope (maybe with a feature flag, as sketched after this list, so you can quickly turn it off if you have a problem)
* run the old and new systems side by side for a while and gradually migrate users over
* roll out the upgrade to one data centre / geographic region at a time and pause to catch any issues
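
On the feature-flag option, the simplest version is just a database-backed switch that the risky code path checks; a sketch, assuming ActiveRecord and a features table (the flag and method names are invented, and a real project might use a gem or Redis instead):

    class Feature < ActiveRecord::Base
      def self.enabled?(name)
        where(name: name, enabled: true).exists?
      end
    end

    # At the call site for the big change:
    if Feature.enabled?("new_checkout")
      new_checkout_flow(cart)   # the risky new code
    else
      old_checkout_flow(cart)   # the old path, one UPDATE away if things go wrong
    end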

I'd underline Paul's point that there must always be a way back - ideally, at each stage. Also, you should have rehearsed your rollback procedure, found that actually it doesn't work (RBS, anyone?) and fixed it before relying on it for real. You should also have excellent real-time monitoring so that you know something has gone wrong before your customers do.

All this might seem a bit discouraging, especially for small companies short on resources, but bear in mind that most big companies haven't solved these problems either. The successful companies try harder, that's all. Usually people don't bother taking most of these precautions anyway, at least until they have their first deployment disaster. As Emerson said, we learn geology the morning after the earthquake.

Regards,
John
-- 
Bitfield Consulting: we make software that makes things work
http://bitfieldconsulting.com/