[LRUG] Queue-related war stories

Garry Shutler garry at robustsoftware.co.uk
Mon Mar 16 02:40:00 PDT 2015


Hi Ali/all,

Possibly the worst one, which obviously happened to a friend-of-a-friend...
Task A has a path that can create a task B, and task B has a path that,
when the moons align, can create a task A. Logical infinite loops within
your async queues are fun!

We do a lot of background processing at Cronofy, and we've often used Redis
to avoid (or resolve) the kinds of problems you've mentioned. This is
particularly easy if you're using Resque or Sidekiq, as you'll already have
a Redis instance available.

The top technique we use is setting a "touch" value against a key that
identifies the task, often with an expiry of a few hours or so, and then
checking for that key when executing the task (and maybe in the cron job
too, if there is one, to avoid needless queuing). That means the task can
only run once within the relevant window, and the expiry means the flag
only stays around for that period before Redis cleans it up, so your Redis
database doesn't grow continually. For some tasks you don't care, but for
some, like sending emails or SMS, you really do!
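
As a rough illustration of the "touch" key idea, here is a minimal Ruby
sketch. It is not our production code. The names OncePerWindow and
FakeStore and the "touch:" key prefix are made up for the example, and
FakeStore is an in-memory stand-in so the sketch runs without a server. It
mimics only the one call used here, set(key, value, nx: true, ex: seconds),
which is how the redis-rb client exposes SET with the NX and EX options, so
a real Redis client should be able to take its place.

```ruby
# Hypothetical in-memory stand-in for a Redis client. It mimics only
# set(key, value, nx:, ex:), returning true if the value was stored and
# false if nx was requested and the key already exists.
class FakeStore
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @data = {}
  end

  def set(key, value, nx: false, ex: nil)
    now = @clock.call
    entry = @data[key]
    # Drop the entry if its expiry has passed, as Redis would.
    @data.delete(key) if entry && entry[:expires_at] && entry[:expires_at] <= now
    return false if nx && @data.key?(key)

    @data[key] = { value: value, expires_at: ex ? now + ex : nil }
    true
  end
end

# Hypothetical helper: runs a block at most once per window for a task id.
class OncePerWindow
  def initialize(store, window_seconds)
    @store = store
    @window_seconds = window_seconds
  end

  # Returns true if the block ran, false if the task was already touched.
  def run(task_id)
    key = "touch:#{task_id}"
    # Set-if-absent with an expiry: checking and touching in one step.
    fresh = @store.set(key, Time.now.to_i, nx: true, ex: @window_seconds)
    return false unless fresh

    yield
    true
  end
end

guard = OncePerWindow.new(FakeStore.new, 3 * 60 * 60)
guard.run("reminder-email:42") { puts "sending email" }
guard.run("reminder-email:42") { puts "never printed inside the window" }
```

Note that this sketch touches the key before doing the work, so a task that
fails will not run again until the key expires. Whether that is what you
want depends on the task.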

Sometimes you also need to use a distributed lock
<http://redis.io/topics/distlock> if the same task can be queued multiple
times within your system at the same time, but that's overkill in many
situations.
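
For the locking case, a minimal Ruby sketch of the single-instance pattern
that page starts from might look like the following: set a key to a random
token only if it is absent, with a timeout, and on release delete it only
if it still holds your token. The names TaskLock, FakeLockStore and
delete_if_equal are made up for the example, and the store is an in-memory
stand-in. Against real Redis, the acquire would be a SET with NX and PX and
the release would need to be atomic (that page uses a small Lua script for
it).

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for Redis, so the sketch runs anywhere.
class FakeLockStore
  def initialize
    @data = {}
  end

  # Mimics SET key value NX PX ttl_ms: true if stored, false if held.
  def set(key, value, nx: false, px: nil)
    purge(key)
    return false if nx && @data.key?(key)

    @data[key] = { value: value, expires_at: px ? now_ms + px : nil }
    true
  end

  # Hypothetical helper: delete the key only if it still holds `value`.
  # With real Redis this compare-and-delete must be done atomically.
  def delete_if_equal(key, value)
    purge(key)
    return false unless @data[key] && @data[key][:value] == value

    @data.delete(key)
    true
  end

  private

  def now_ms
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) * 1000).to_i
  end

  def purge(key)
    entry = @data[key]
    @data.delete(key) if entry && entry[:expires_at] && entry[:expires_at] <= now_ms
  end
end

# Hypothetical helper: runs a block only while holding the named lock.
class TaskLock
  def initialize(store, ttl_ms: 30_000)
    @store = store
    @ttl_ms = ttl_ms
  end

  # Returns true if the block ran, false if someone else holds the lock.
  def with_lock(resource)
    key = "lock:#{resource}"
    token = SecureRandom.hex(16) # identifies this holder
    return false unless @store.set(key, token, nx: true, px: @ttl_ms)

    begin
      yield
    ensure
      # Only release the lock if it is still ours.
      @store.delete_if_equal(key, token)
    end
    true
  end
end

lock = TaskLock.new(FakeLockStore.new)
lock.with_lock("calendar-sync:account-1") { puts "syncing" }
```

The timeout is there so that a worker that dies while holding the lock
doesn't block the task forever, and the random token stops a slow worker
from deleting a lock that has since been taken by someone else.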

Hope that's useful.

Cheers,
Garry

*Garry Shutler*
@gshutler <http://twitter.com/gshutler>
gshutler.com

On 16 March 2015 at 09:12, Najaf Ali <ali at happybearsoftware.com> wrote:

> Hi all,
>
> I'm trying to identify some general good practices (based on real-life
> problems) when it comes to working with async job queues (think DJ, Resque
> and Sidekiq).
>
> So far I've been doing this by collecting stories of how they've failed
> catastrophically (e.g. sending thousands of spurious SMSes to your
> customers) and seeing if I can identify any common themes based on those.
>
> Here are some examples of what I mean (anonymised to protect the innocent):
>
> * Having a (e.g. hourly) cron job that checks if a job has been done and
> then enqueues the job if it hasn't. It knows this because the successfully
> completed job would leave some sort of evidence of completion in e.g. the
> database. If your workers go down for a day, this means the same job would
> be enqueued over and over again superfluously.
>
> * Sending multiple emails (hundreds) in a single job led to a problem
> where if just one of those emails (say the 24th) fails to be delivered, the
> entire job fails and emails 1-23 get sent again each time your worker
> retries it, again and again and again.
>
> * With the workers/app running the same codebase but on different virtual
> servers, deploying only to the application server (and not the server
> running the workers) resulted in the app servers queueing jobs that the
> workers didn't know how to process.
>
> It would be great to hear what sort of issues/incidents you've come across
> while using async job queues like the above. I don't think I have enough
> examples to make any generalisations about the "right way" to use them yet,
> so more interested in just things that went wrong and how you fixed them at
> the moment.
>
> Feel free to reply off-list if you'd rather not share with everyone, I
> intend to put the findings together in a blog post with a few guesses as to
> how to avoid these sorts of problems.
>
> All the best,
>
> -Ali