[LRUG] Deployment approach to rolling out huge changes

Ed James (Alt) ed.james.spam at gmail.com
Mon Nov 26 01:53:29 PST 2012


Hi all  

I thought I'd post a follow-up to this thread now that our deployment is done. I got some very good advice here and appreciate the effort of those who contributed. So, here's what we did…

We made a fundamental change to our application that, after careful review, we realised would be extremely difficult and dangerous to roll out incrementally. We were therefore faced with a rather big, waterfall-ish deployment.

We decided not to use our beta environment (see the link further down this thread). In this case, running two different code bases against the same database would have introduced too much complexity.

We made heavy use of our staging environment and spent a lot of time testing both the application changes and the deployment itself. There were also significant changes to be made to the database, both adding new columns to many of our tables and changing the table indexes. A lot of these staging deployments required a full database restore prior to deploying, which, while quite time-consuming (our db is around 4GB), was without doubt worthwhile: we caught several important issues as a result. We also made changes to our staging environment so that it more closely resembled our production setup (specifically the database), which proved extremely valuable.
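By way of illustration, the schema changes were of this shape (a sketch only - the real table, column and index names are different):

    # Illustrative only -- real table/column/index names differ.
    class AddBillingAccountToVideos < ActiveRecord::Migration
      def up
        add_column :videos, :billing_account_id, :integer
        # Swap the old index for one matching the new query patterns.
        remove_index :videos, :user_id
        add_index :videos, [:billing_account_id, :created_at]
      end

      def down
        remove_index :videos, [:billing_account_id, :created_at]
        add_index :videos, :user_id
        remove_column :videos, :billing_account_id
      end
    end

Having a real down migration for every change is what made the repeated staging restore-and-redeploy cycle practical.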

We then ran through the deployment plan several times over about two weeks, making sure we hadn't missed anything and that our sequence of steps was valid. We also had estimated times for each step (based on our staging runs). Every developer on our team reviewed the deployment plan.

We use AWS, with 4 applications in production across 19 servers. Several of these servers run background jobs (we use Resque (scheduled and ad-hoc), DJ, cron and daemons), and while the obvious ones can be scaled across multiple servers, some processes can only have a single instance running. I think the most important aspect of this was planning for failure. It's easy to roll back our application (change a git tag and deploy), but we also spun up a new cluster of app servers that were kept on standby should we need a faster rollback, as switching over would only require a config change at the ELB level.
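To make the tag-based rollback concrete (assuming a Capistrano 2-style setup here purely for illustration - the important bit is only that the deployed ref is a tag):

    # config/deploy.rb -- sketch; application and repo names are placeholders.
    set :application, "ourapp"
    set :repository,  "git@github.com:example/ourapp.git"

    # Deploy whichever tag is passed on the command line:
    #   cap production deploy -s tag=v2.0.0
    # Rolling back is then just deploying the previous tag:
    #   cap production deploy -s tag=v1.9.3
    set :branch, fetch(:tag, "master")

The standby cluster was for the case where even that redeploy would be too slow.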

We scheduled the deployment for a Sunday morning. We took our site down for maintenance and cleanly shut down all the background processes, while ensuring that video playback was not affected (we're a video hosting platform). We then created a read replica of our RDS database and waited for the replica lag to reach zero, at which point we promoted the replica. We now had a full RDS backup instance that was ready for immediate use should we need it.

We deployed our apps to all the servers, closely monitoring activity on New Relic, then ran a bunch of rake tasks that prepped the database - this took about 45 mins. While the main site was in maintenance mode, we made sure that admin users (us) were still able to log in via a backdoor, so we all logged in and did a few quick checks to make sure everything was good. Then, one by one, we brought each background service back online, still monitoring both the logs and New Relic.

Once everything was up, we took the main site out of maintenance mode and monitored activity on New Relic for about an hour. There were a few minor issues, but nothing that warranted rolling back. We kept our stand-by application servers and database for about 24 hours, then deleted them.
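The data-prep rake tasks were mostly batched backfills, roughly this shape (again illustrative, reusing the made-up column from the migration sketch above; the model names and the backfill rule are invented):

    # lib/tasks/billing.rake -- sketch only.
    namespace :billing do
      desc "Backfill billing_account_id in batches"
      task :backfill => :environment do
        scope = Video.where(:billing_account_id => nil)
        scope.find_in_batches(:batch_size => 1000) do |batch|
          batch.each do |video|
            # Hypothetical rule: charges move from the uploader to the
            # account that owns the video.
            video.update_column(:billing_account_id, video.account_id)
          end
          puts "#{scope.count} rows left to backfill"
        end
      end
    end

Batching like this meant we could watch progress in the logs and stop/resume safely, which mattered over a 45-minute run.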

So our deployment was successful. It wasn't perfect, and the problems that we experienced were things that we did anticipate but could not test given the difference in scale between our staging and production environments. The key lesson we took out of this was that careful planning was essential, and lots and lots of testing was absolutely vital. When we thought we had done enough testing, we did some more and found things that needed fixing. We also involved the entire team in our testing - not just developers, but designers, sales and marketing, and management. Everyone found at least one issue that would have been a serious bug had it reached production. It was a team effort and therefore a success shared by the team.

Thanks again for everyone's help and advice.  

--  
Ed James
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, 11 September 2012 at 17:42, Ed James (Alt) wrote:

> Just came across this regarding the Beta environment I described previously:  
>  
> http://37signals.com/svn/posts/3251-running-beta-in-production  
>  
> --  
> Ed James (Spam)
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>  
>  
> On Monday, 10 September 2012 at 10:30, Chris Parsons wrote:
>  
> >  
> > On 10 Sep 2012, at 09:35, Ed James (Alt) <ed.james.spam at gmail.com> wrote:
> > > The changes to the code, while far-reaching, are not actually that difficult. It's mainly moving a whole bunch of logic from one model into another, and then changing the various relationships between models. So the coding is mainly refactoring, rather than coding brand new functionality. But the models we're changing are the central models in our system, hence my original post.
> >  
> > Fair enough. Any more comment would probably hit the barrier of "don't know the details" so good luck with the change!
> >  
> > Chris
> >  
> >  
> > >  
> > > Thanks.
> > >  
> > > --  
> > > Ed James (Spam)
> > > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > >  
> > >  
> > > On Monday, 10 September 2012 at 09:21, Chris Parsons wrote:
> > >  
> > > >  
> > > > On 10 Sep 2012, at 09:04, "Ed James (Alt)" <ed.james.spam at gmail.com> wrote:
> > > > > Unfortunately the feature in question involves a fundamental change to our system, one which affects almost every part of the codebase in that it changes a key relationship between our models. Because it also changes the way we charge users it's not something that can be easily "phased in" to our production environment (we are not changing our prices, but rather how we allocate the charges).
> > > > Just picking up on this, with a few options:
> > > >  
> > > > 1) is there any way you can run both relationships together? For example, can you have both code paths running for different people, with a feature flag to test the new way the code is being used?
> > > >  
> > > > 2) can you make the old code write to the database in the new way using a translation layer, and then phase in the new code selectively by serving a new copy of your app to a few users?
> > > >  
> > > > HTH
> > > > Chris
> > > >  
> > > > --
> > > > Chris Parsons
> > > > chris.p at rsons.org
> > > > http://twitter.com/chrismdp
> > > > http://pa.rsons.org
> > > >  
> > >  
> >  
> > --
> > Chris Parsons
> > chris.p at rsons.org
> > http://twitter.com/chrismdp
> > http://pa.rsons.org
> >  
>  
