[LRUG] Deployment approach to rolling out huge changes

Paul Robinson paul at 32moves.com
Fri Sep 7 02:17:02 PDT 2012


On 7 Sep 2012, at 09:02, "Ed James (Alt)" <ed.james.spam at gmail.com> wrote:

> The scripts do sometimes fail, but they are restarted by a monitoring daemon. The issue is not that they sometimes fail and restart (which they do), but that once our changes are deployed the way these jobs process data will change.


Can I suggest, then, a dual setup writing into separate databases? One doing it the old way, one doing it the new way. That means that if you ever need to get back to the old way, your old processes will have been busily working away, keeping your old DB ticking along. A few weeks after you're sure you'll never need the old processes again, you can clear them - and their data - out of the way.
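
Very roughly - and this is only a sketch with made-up class and database names, assuming you're on ActiveRecord - the split could be as simple as two connection bases, one per database, with each set of models hanging off the appropriate one:

  require "active_record"

  # Sketch only - adapter, database names and model classes are illustrative.
  class OldRecord < ActiveRecord::Base
    self.abstract_class = true
    establish_connection adapter: "postgresql", database: "app_old"
  end

  class NewRecord < ActiveRecord::Base
    self.abstract_class = true
    establish_connection adapter: "postgresql", database: "app_new"
  end

  # Old-style processing writes through OldRecord subclasses,
  # new-style processing through NewRecord subclasses.
  class OldJobResult < OldRecord; end
  class NewJobResult < NewRecord; end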


> The Beta environment allows us to see application changes in a production environment using live data. It means we don't always have to put new beta features behind things like beta flags or introduce conditions into the code. This is quite common practice as far as I know (Facebook do something similar). 


To me it feels... odd. It might be common practice, it might even be useful, but it sounds like you're using it as a means of getting around the fact that you don't have UI test coverage (indicated below), which seems strange.

There has to be UX testing, product owner sign-off and all that, but doing it with live data, against the same production data set as your live production deployment, seems to me a little like trying to juggle chainsaws: it's pretty clever, until it all goes wrong, and then you're just left with a bloody mess... :-)


> Using production snapshots in a staging environment gives us real data to play with. Test data can only take you so far, because you have to create that test data.


Completely agree with this, and there are only so many tests you can write to emulate people clicking random things in ways you never anticipated.


> There are also parts of our system (mainly the UI) that are not covered by automated tests (I know…). Our application is several years old and the database snapshot is over 2GB. Using snapshots also means when we deploy to our staging environment we are testing the deployment itself.


That seems reasonable, provided you've got measures in place to protect live users from getting confused by it. I've been wary of this approach before because I've seen situations where staging servers populated with live data suddenly start firing off emails to live users: observers in the staging environment are firing, the box has valid SMTP access, and there are real-world email addresses in the DB :-)
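
For what it's worth, if it's a Rails app the cheap safeguard is to make staging incapable of delivering anything real - something along these lines in config/environments/staging.rb (sketch only, "YourApp" is a placeholder for your application's module):

  YourApp::Application.configure do
    # Collect mail in memory rather than handing it to an SMTP server.
    config.action_mailer.delivery_method = :test
    # Belt and braces: don't perform deliveries at all in staging.
    config.action_mailer.perform_deliveries = false
  end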


> Yes, while it would be technically possible to roll back, it's just not feasible. We would lose data (i.e. money) and our users would also lose statistical data. We also have a large number of tables and a fairly complex schema, so restoring data integrity post-rollback would also be a very tricky task, and something we want to avoid having to do.


In this particular circumstance I'd go with a dual database setup if possible - one mimicking the old behaviour, one running your new behaviour. Have your background workers write to both for a little while; that way you've got a bit of security if you need to roll back, because your data is still being written the old way for at least a couple of weeks, until you're happy.
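
As a rough sketch of what a worker's write step might look like while both sides are running (the model names are the illustrative ones from the dual-database idea above, transform_old_way / transform_new_way stand in for whatever your processing actually does, and logger is whatever your workers normally log to):

  class ProcessingWorker
    def process(payload)
      # The new-style result is the primary record going forward.
      NewJobResult.create!(transform_new_way(payload))

      # Keep the old pipeline's data flowing too, so rolling back just
      # means switching reads back to the old database.
      begin
        OldJobResult.create!(transform_old_way(payload))
      rescue => e
        # A failed old-style write shouldn't block the new pipeline;
        # log it and move on.
        logger.warn("old-style dual write failed: #{e.message}")
      end
    end
  end

Once you're confident the new pipeline is doing the right thing, the old-style write (and the old database) can simply be dropped.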

That doesn't stop you doing all the other stuff you're doing with staging/beta, etc. - it just gives you an extra level of safety.

Paul



