[LRUG] Queue-related war stories

Philip Stevens phil at retechnica.com
Mon Mar 16 03:20:58 PDT 2015


Hi Ali,

Polling/Cron:

I am always very wary of polling for async progress: states are often unclear, and timing can change as traffic increases.

What you want to know is whether a worker has crashed, taken forever, etc. For me, I'd push that logic into the worker itself. It should have clear failure states and defined actions for what to do when it fails: dump data to disk, retry from the beginning, send you an email with a sad emoticon.

Handling timeouts can be harder for a worker, but it can be possible to tell it to give up on a task that is taking too long. (I haven't explored this much myself.)
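
A rough sketch of what I mean, assuming Sidekiq (FailedJob, build_report and the 300-second budget are all made-up):

    require 'sidekiq'
    require 'timeout'

    class ReportWorker
      include Sidekiq::Worker
      sidekiq_options retry: false # make the retry policy explicit

      TASK_TIMEOUT = 300 # seconds

      def perform(report_id)
        Timeout.timeout(TASK_TIMEOUT) { build_report(report_id) }
      rescue Timeout::Error
        # Defined failure action: record it rather than retrying blindly
        FailedJob.create!(kind: 'report', payload: report_id, reason: 'timeout')
      rescue StandardError => e
        FailedJob.create!(kind: 'report', payload: report_id, reason: e.message)
      end
    end

(Bear in mind that Ruby's Timeout can interrupt code at awkward points, so treat it as a blunt instrument.)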

In terms of workers being dead for a day, this is more of a maintenance/monitoring situation. At most companies I've worked at, we have had a "health of the nation" page open on a small monitor, showing key metrics either as graphs or simple counts, typically by polling the DB or the command line for worker numbers. This needs occasional viewing to check things appear normal, but it also enables non-technical staff to observe that graph B looks HUGE today.
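
If you're on Sidekiq, the numbers for such a page are easy to pull (a minimal sketch using Sidekiq's built-in API):

    require 'sidekiq/api'

    stats = Sidekiq::Stats.new
    puts "enqueued:  #{stats.enqueued}"
    puts "failed:    #{stats.failed}"
    puts "processes: #{Sidekiq::ProcessSet.new.size}" # 0 means your workers are dead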

Email sending:

At a previous company, sending multiple emails in a single job (in their case in the thousands) could also fall foul of dropped messages.

We experimented with a single job for each mail, but found we lost the collective information and often overwhelmed machines: the scaffolding around preparing each email job to run, running it and closing it up far exceeded the CPU/memory cost of the email sending itself.

For a long time we settled on aggressive timeouts and pushing failed emails into a "failed" queue/db state, which would email the team every morning with a count of failed deliveries for each client. However, as you can imagine, some days we'd get 5, other days thousands. We never tried re-sending after a failure, as a failed email send will usually keep failing - often due to an invalid address.
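
Roughly, the pattern was the following (ActiveRecord/ActiveSupport assumed; FailedEmail, TeamMailer, deliver and client_for are all made-up names):

    require 'sidekiq'
    require 'timeout'

    class EmailWorker
      include Sidekiq::Worker
      sidekiq_options retry: false # failed sends usually keep failing

      def perform(recipient, subject, body)
        Timeout.timeout(10) { deliver(recipient, subject, body) } # aggressive timeout
      rescue StandardError => e
        FailedEmail.create!(recipient: recipient,
                            client_id: client_for(recipient),
                            reason: e.message)
      end
    end

    # Morning cron: count yesterday's failures per client and mail the team
    counts = FailedEmail.where(created_at: 1.day.ago..Time.current)
                        .group(:client_id).count
    TeamMailer.failure_summary(counts).deliver_now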

Another approach is to use one of the many companies that specialise in email sending: instead of sending emails yourself, you push data to them to process.

If you prefer to keep your email sending in house, then I'd approach it with aggressive timeouts, no automatic retries, and persistent "failed" queues that require manual action to clean/clear/run - BUT do send notifications to the team on their current size.

Server/Worker restarts:

Versioning can really help here. If you treat the data/code/queue you are sending to as an API, it becomes a little easier to reason about. Your worker is on version 2, so it only reads from queues that either carry that version (e.g. a RabbitMQ routing key) or match it exactly. The server deploys version 3 and pushes its data to a version-3 queue that your workers don't care about. Messages start to stack up, and then you deploy version 3 of the workers to handle that queue. This also helps with rollbacks on the server/workers, as you can roll the server back to push to version 2 and roll the workers back afterwards.
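
With RabbitMQ routing keys, that might look like this (a sketch using the Bunny gem; the exchange/queue names and `handle` are made up):

    require 'bunny'
    require 'json'

    conn = Bunny.new
    conn.start
    ch = conn.create_channel
    jobs = ch.topic('jobs')

    # App server on version 3 stamps every message with its version
    jobs.publish({ email_id: 42 }.to_json, routing_key: 'v3.email.send')

    # Worker still on version 2 binds only to version-2 messages
    queue = ch.queue('jobs.v2', durable: true)
    queue.bind(jobs, routing_key: 'v2.#')
    queue.subscribe do |_delivery_info, _properties, body|
      handle(JSON.parse(body))
    end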

Otherwise you can also tell the workers to push a message into another queue if they cannot understand what to do with it - but here you need to separate checked from unchecked exceptions.
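
RabbitMQ can do much of this for you with a dead-letter exchange - a variation on the Bunny sketch above (UnknownMessageError stands in for whatever "can't understand this" failure you raise):

    dead = ch.fanout('jobs.dead')
    ch.queue('jobs.dead.store', durable: true).bind(dead)

    queue = ch.queue('jobs.v2', durable: true,
                     arguments: { 'x-dead-letter-exchange' => 'jobs.dead' })
    queue.subscribe(manual_ack: true) do |delivery_info, _properties, body|
      begin
        handle(JSON.parse(body))
        ch.ack(delivery_info.delivery_tag)
      rescue UnknownMessageError
        # Reject without requeueing: RabbitMQ routes it to jobs.dead instead
        ch.nack(delivery_info.delivery_tag, false, false)
      end
    end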

Kind regards,

Phil

> On 16 Mar 2015, at 09:12, Najaf Ali <ali at happybearsoftware.com> wrote:
> 
> Hi all,
> 
> I'm trying to identify some general good practices (based on real-life problems) when it comes to working with async job queues (think DJ, Resque and Sidekiq).
> 
> So far I've been doing this by collecting stories of how they've failed catastrophically (e.g. sending thousands of spurious SMS's to your customers) and seeing if I can identify any common themes based on those.
> 
> Here are some examples of what I mean (anonymised to protect the innocent):
> 
> * Having a (e.g. hourly) cron job that checks if a job has been done and then enqueues the job if it hasn't. It knows this because the successfully completed job would leave some sort of evidence of completion in e.g. the database. If your workers go down for a day, this means the same job would be enqueued over and over again superfluously.
> 
> * Sending multiple emails (hundreds) in a single job led to a problem where if just one of those emails (say the 24th) fails to be delivered, the entire job fails and emails 1-23 get sent again when your worker retries it again and again and again.
> 
> * With the workers/app running the same codebase but on different virtual servers, deploying only to the application server (and not the server running the workers) resulted in the app servers queueing jobs that the workers didn't know how to process.  
> 
> It would be great to hear what sort of issues/incidents you've come across while using async job queues like the above. I don't think I have enough examples to make any generalisations about the "right way" to use them yet, so more interested in just things that went wrong and how you fixed them at the moment.
> 
> Feel free to reply off-list if you'd rather not share with everyone, I intend to put the findings together in a blog post with a few guesses as to how to avoid these sorts of problems.
> 
> All the best,
> 
> -Ali
> _______________________________________________
> Chat mailing list
> Chat at lists.lrug.org
> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
> Manage your subscription: http://lists.lrug.org/options.cgi/chat-lrug.org
> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
