[LRUG] Queue-related war stories

Mon Mar 16 02:37:42 PDT 2015

Hi Najaf

Over the years I’ve done quite a lot of this kind of large-batch message delivery, and we’ve typically solved the problem as follows:

A single job is added to your queue to broadcast your messages (emails or SMS).  This is done synchronously from the user interface used to issue the instruction, so it either succeeds or fails there and then.  On success, you can safely confirm to the user that the job is queued.
That single job is then responsible for setting up individual jobs to deliver each of the messages. As you have stated, you want this job to be able to recover from where it left off after a catastrophic failure, so ideally it would run within a transaction of some sort.  However, most job queues sit in a different transactional context from your database, so it’s very difficult to get the two to roll back together.  So instead, this single job iterates over the list of recipients and creates a new database record for each message that is queued to be sent - for example, if you had the tables users and email_campaigns, you would have a join table email_campaigns_users, and for every email queued for a user you would add a record referencing the user and the email campaign.  You could then either have a before_commit hook on that model that inserts a job into the job queue to send a message to that user, or you could do this as part of the job that creates the record.  Either way, if the job is not inserted into the queue successfully, the commit will fail, which is exactly what you want.  Finally, if this job fails at any point it will be retried, so you need to ensure that the query that builds the list of users to message excludes any users who have already been queued to receive it (that is, who already have a record for this campaign).
The final job, which delivers the actual message, is now incredibly simple, so the job queue can easily distribute it across all workers.
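
To make the retry behaviour of the broadcast job concrete, here is a minimal sketch in plain Ruby. It is only an illustration under stated assumptions: CampaignStore, run_broadcast_job and the flaky enqueue lambda are hypothetical names, and an in-memory Set and Array stand in for the email_campaigns_users table and the job queue. A real app would use its database transaction and its queue library instead.

```ruby
require "set"

# Hypothetical in-memory stand-in for the email_campaigns_users join table
# and the job queue. Not a real library API.
class CampaignStore
  attr_reader :queued_pairs, :delivery_jobs

  def initialize
    @queued_pairs = Set.new   # [campaign_id, user_id] pairs already recorded
    @delivery_jobs = []       # per-recipient delivery jobs that were enqueued
  end

  # Users who do not yet have a join record for this campaign.
  def pending_user_ids(campaign_id, all_user_ids)
    all_user_ids.reject { |id| @queued_pairs.include?([campaign_id, id]) }
  end

  # Insert the join record and enqueue the delivery job as one unit.
  # If the enqueue raises, the record is removed again, which mimics
  # a commit that fails because the hook could not insert the job.
  def record_and_enqueue(campaign_id, user_id, enqueue)
    pair = [campaign_id, user_id]
    @queued_pairs.add(pair)
    begin
      enqueue.call(@delivery_jobs, pair)
    rescue StandardError
      @queued_pairs.delete(pair)
      raise
    end
  end
end

# The single broadcast job. It is safe to retry because each run only
# considers users without a record for this campaign.
def run_broadcast_job(store, campaign_id, all_user_ids, enqueue)
  store.pending_user_ids(campaign_id, all_user_ids).each do |user_id|
    store.record_and_enqueue(campaign_id, user_id, enqueue)
  end
end

store = CampaignStore.new
users = [1, 2, 3, 4]

# An enqueue that fails once on user 3, simulating a queue outage mid-run.
state = { failed: false }
flaky_enqueue = lambda do |jobs, pair|
  if pair[1] == 3 && !state[:failed]
    state[:failed] = true
    raise "queue unavailable"
  end
  jobs << pair
end

begin
  run_broadcast_job(store, 42, users, flaky_enqueue)
rescue RuntimeError
  # A queue library would normally schedule the retry; here we retry by hand.
  run_broadcast_job(store, 42, users, flaky_enqueue)
end

# Each user ends up with exactly one delivery job despite the failed first run.
p store.delivery_jobs
```

The first run enqueues users 1 and 2 and then fails on user 3 without leaving a record behind, so the retry picks up from user 3 and nobody is queued twice.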

Hope that helps somewhat. 

Matt
https://ably.io

> On 16 Mar 2015, at 09:12, Najaf Ali <ali at happybearsoftware.com> wrote:
> 
> Hi all,
> 
> I'm trying to identify some general good practices (based on real-life problems) when it comes to working with async job queues (think DJ, Resque and Sidekiq).
> 
> So far I've been doing this by collecting stories of how they've failed catastrophically (e.g. sending thousands of spurious SMS's to your customers) and seeing if I can identify any common themes based on those.
> 
> Here are some examples of what I mean (anonymised to protect the innocent):
> 
> * Having a (e.g. hourly) cron job that checks if a job has been done and then enqueues the job if it hasn't. It knows this because the successfully completed job would leave some sort of evidence of completion in e.g. the database. If your workers go down for a day, this means the same job would be enqueued over and over again superfluously.
> 
> * Sending multiple emails (hundreds) in a single job led to a problem where if just one of those emails (say the 24th) fails to be delivered, the entire job fails and emails 1-23 get sent again when your worker retries it again and again and again.
> 
> * With the workers/app running the same codebase but on different virtual servers, deploying only to the application server (and not the server running the workers) resulted in the app servers queueing jobs that the workers didn't know how to process.  
> 
> It would be great to hear what sort of issues/incidents you've come across while using async job queues like the above. I don't think I have enough examples to make any generalisations about the "right way" to use them yet, so more interested in just things that went wrong and how you fixed them at the moment.
> 
> Feel free to reply off-list if you'd rather not share with everyone, I intend to put the findings together in a blog post with a few guesses as to how to avoid these sorts of problems.
> 
> All the best,
> 
> -Ali
> _______________________________________________
> Chat mailing list
> Chat at lists.lrug.org
> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
> Manage your subscription: http://lists.lrug.org/options.cgi/chat-lrug.org
> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
