[LRUG] A general question about exception handling in services

Wed Apr 10 13:01:01 PDT 2013

I get the impression there is a pattern for doing this and probably someone
on this list has some good input into it.

We've been thinking about how to handle failures in internal services and
in integrations with third-party services. We want to trade off robustness
against the ability to debug complex requests, while still noticing genuine
errors in our codebase (e.g. by avoiding things like 'try', 'rescue nil'
or 'rescue Exception').

Let's say we have three internal services A, B, C and some external API
providers X, Y, Z.

Some object may be responsible for communicating with Z, but this object
doesn't have access to the original incoming request.
Also, it's absolutely critical that if this request to Z fails, the rest of
the request can still complete, the failure is hidden from our external API
user, and some manual or separate automated process resolves the issue.

To illustrate how critical this is: a user has paid for a service, and one
part of fulfilling the customer's purchase is an API call to external
provider Z. If that call doesn't happen we'd have angry customers, so we
make sure the request is fulfilled by any means possible (manually if
necessary), while still assuring the customer that their order has been
fulfilled.

We've been toying with the idea of generating unique identifiers for our
incoming requests and sending these in to all other internal services, then
we'd be able to log these ids in all our log statements. We'd also ideally
use these ids in communications to Airbrake.

We could pretty easily create middleware that can generate the ids and
send/receive them in headers to our other services, but the issue comes
with having access to this info in our models.
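
As a rough sketch of the middleware part only, assuming a Rack-style app
and an "X-Request-Id" header (the class name, header name and env key are
my own choices for illustration, not something we have settled on):

```ruby
require "securerandom"

# Hypothetical Rack-style middleware: reuses an incoming request id or
# generates a new one, stores it in the Rack env and echoes it on the
# response so callers and logs can correlate on it.
class RequestId
  HEADER  = "X-Request-Id"        # assumed header name
  ENV_KEY = "HTTP_X_REQUEST_ID"   # how Rack exposes that header in env

  def initialize(app)
    @app = app
  end

  def call(env)
    id = env[ENV_KEY].to_s
    id = SecureRandom.uuid if id.empty?   # no id supplied upstream
    env[ENV_KEY] = id
    status, headers, body = @app.call(env)
    [status, headers.merge(HEADER => id), body]
  end
end
```

This would be mounted with `use RequestId` in config.ru, and outgoing
calls to A, B and C would copy the same header. It does nothing for the
model-level problem: the id still only lives in the request env.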

Sinatra route/Rails controller code
 --   some long
 --   stack frame
 --  model code communicating with Z

One solution that would get us to our controller/route code, where we can
access the request info, would be throwing or raising exceptions, but this
then prevents us from continuing the request in the normal way and
completing the required tasks after a call to Z fails. Also, it's
horrendous goto-style flow control.

Other undesirable hacks would be sticking something on the thread itself,
or a global variable.

The other thing is to actually ensure we can pass down request info all the
way through a stack, but this completely breaks single responsibility and
is going to result in complex spaghetti.

There has to be some kind of intelligent solution to this that is elegant,
readable and maintainable and isn't any of the things I've mentioned.

Oh, and another idea is to swallow all exceptions in our protective blocks
around API calls only in production mode, and not in dev or test. I.e.
surface programming errors, but give as cast-iron a guarantee as we can
that no failure of Z can possibly result in non-completion of the rest of
the request in production.
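
A minimal sketch of what such a protective block might look like, assuming
the environment name comes from RACK_ENV (the Protect module and its
arguments are hypothetical names for illustration):

```ruby
require "logger"

# Hypothetical helper wrapping a call to an external provider such as Z.
# In production it logs and swallows StandardError, returning nil so the
# rest of the request can carry on; in any other environment it re-raises
# so programming errors surface in dev and test.
module Protect
  def self.call(env: ENV.fetch("RACK_ENV", "development"),
                logger: Logger.new($stderr))
    yield
  rescue StandardError => e
    raise unless env == "production"
    logger.error("external call failed: #{e.class}: #{e.message}")
    nil
  end
end
```

Usage would be along the lines of `Protect.call { z_client.fulfil(order) }`,
where `z_client` and `fulfil` are placeholders. It rescues StandardError
rather than Exception, so things like Interrupt and SystemExit still
propagate even in production.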

Will appreciate hearing your thoughts,

Mark