[LRUG] Puma, CLOSE_WAIT. Arg.

Thu Feb 18 04:17:35 PST 2016

Actually puma docs suggest doing that when using preload_app and
ActiveRecord...

https://github.com/puma/puma#clustered-mode

Simon Morley

Big Chief | PolkaSpots Supafly Wi-Fi
Bigger Chief | Cucumber Tony

Got an unlicensed Meraki? Set it free with Cucumber
cucumberwifi.io/meraki

On 18 February 2016 at 12:05, Frederick Cheung <frederick.cheung at gmail.com>
wrote:

>
>
>
> On 18 February 2016 at 11:17:34, Simon Morley (simon at polkaspots.com)
> wrote:
>
>
> class RadiusDatabase
>   self.abstract_class = true
>   establish_connection "radius_#{Rails.env}".to_sym
> end
>
> class Radacct < RadiusDatabase
> end
>
> Then I decreased our database pool from 20 to 5 and added a wait_timeout
> of 5 (since there seems to be some discrepancies with this). Things got
> much better (but weren't fixed).
>
> I tried querying differently, including using
> connection_pool.with_connection. I've tried closing the connections
> manually and also used ActiveRecord::Base.clear_active_connections!
> periodically. No joy.
>
> By this point, we were running 2-4 instances - handling around very little
> traffic in total (about 50rpm). Every few hours, they'd block, all of them.
> At the same time, we'd see a load of rack timeouts - same DB. I've checked
> the connections - they were each opening only a few to MySQL and MySQL was
> looking good.
>
> One day, by chance, I reduced the 4 instances to 1. *And the problem is
> solved!!! WHAT*? Obviously the problem isn't solved, we can only use a
> single server.
>
>
> Are you using puma in the mode where it forks workers? if so, then you
> want to reconnect post fork or multiple processes will share the same file
> descriptor and really weird shit will happen.
>
> The puma readme advises to do this:
>
> before_fork do
>   ActiveRecord::Base.connection_pool.disconnect!
> end
>
> I don't know off the top of my head whether that  will do the job for
> classes that have established a connection to a different db - presumably
> they have a separate connection pool
>
> Fred
>
> I don't know what's going on here. Have I been staring at this for too
> long (yes)?
>
> Our other servers are chugging along happily now, using a connection pool
> of 20, no errors, no timeouts (different db though).
>
> Has anyone got any suggestions / seen this? Is there something
> fundamentally wrong with the way we're establishing a connection to the
> external dbs? Surely this is MySQL related
>
> Thanks for listening,
>
> S
>
>
> Simon Morley
>
> Got an unlicensed Meraki? Set it free with Cucumber
> cucumberwifi.io/meraki
>
>
> On 15 January 2016 at 13:58, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>
>> The understanding of difficult problems/bugs and the learning that comes
>> with it cannot be rushed. Each and every one of us has his / her own pace,
>> and all "speeds" are perfectly fine. The only question that really matters
>> is whether it's worth it (a.k.a. the cost of lost opportunity). If the
>> answer is yes, plough on. If not, look for alternatives.
>>
>> Not everyone likes or wants to run their own infrastructure. The monthly
>> savings on the PaaS, IaaS advertised costs are undisputed, but few like to
>> think - never mind talk - about how many hours / days / weeks have been
>> spent debugging obscure problems which "solve themselves" on a managed
>> environment. Don't get me started on those that are building their own
>> Docker-based PaaS-es without even realising it...
>>
>> As a side-note, I've been dealing with a similar TCP-related problem for
>> a while now, so I could empathise with your struggles the second I've seen
>> your post. One of us is bound to solve it first, and I hope it will be you
>> ; )
>>
>> Have a good one, Gerhard.
>>
>> On Fri, Jan 15, 2016 at 10:01 AM, Simon Morley <simon at polkaspots.com>
>> wrote:
>>
>>> You must be more patient that I am. It's been a long month - having said
>>> that, I'm excited to find the cause.
>>>
>>> I misunderstood you re. file descriptors. We checked the kernel limits /
>>> files open on the systems before and during and there's nothing untoward.
>>>
>>> Since writing in, it's not happened as before - no doubt it'll take
>>> place during our forthcoming office move today.
>>>
>>> I ran a strace (thanks for that suggestion John) on a couple of
>>> processes yesterday and saw redis blocking. Restarted a few redis servers
>>> to see if that helped. Can't be certain yet.
>>>
>>> As soon as it's on, I'll run a tcpdump. How I'd not thought about that I
>>> don't know...
>>>
>>> Actually, this is one thing I dislike about Rails - it's so nice and
>>> easy to do everything, one forgets we're dealing with the real servers /
>>> components / connections. It's too abstract in ways, but that's a whole
>>> other debate :)
>>>
>>> S
>>>
>>>
>>>
>>> Simon Morley
>>>
>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>> Bigger Chief | Cucumber Tony
>>>
>>> simon at PolkaSpots.com
>>> Linkedin: I'm on it again and it still sucks
>>> 020 7183 1471 <020%207183%201471>
>>>
>>> 🚀💥
>>>
>>> On 15 January 2016 at 06:53, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>
>>>> File descriptors, for traditional reasons, include TCP connections.
>>>>
>>>> Are you logging all requests to a central location? When the problem
>>>> occurs, it might help taking a closer look at the type of requests you're
>>>> receiving.
>>>>
>>>> Depending on how long the mischief lasts, a tcpdump to pcap, then
>>>> wireshark might help. Same for an strace on the Puma processes, similar to
>>>> what John suggested . Those are low level tools though, verbose, complex
>>>> and complete, it's easy to get lost unless you know what you're looking for.
>>>>
>>>> In summary, CLOSE_WAITs piling up from haproxy (client role) to Puma
>>>> (server role) indicates the app not closing connections in time (or maybe
>>>> ever) - why? It's a fun one to troubleshoot ; )
>>>>
>>>> On Thu, Jan 14, 2016 at 11:35 PM, Simon Morley <simon at polkaspots.com>
>>>> wrote:
>>>>
>>>>> Right now, none of the servers have any issues. No close_waits.
>>>>>
>>>>> All is well. Seemingly.
>>>>>
>>>>> When it occurs ALL the servers end up going. Sometimes real fast.
>>>>> That's why I thought we had a db bottleneck. It happens pretty quickly,
>>>>> randomly, no particular times.
>>>>>
>>>>> We don't ever really get spikes of traffic, there's an even load
>>>>> inbound throughout.
>>>>>
>>>>> I thought we had someone running a slow loris style attack on us. So I
>>>>> added some rules to HA Proxy and Cloudflare ain't seen nofin honest guv.
>>>>>
>>>>> Will find a way to chart it and send a link over.
>>>>>
>>>>> Will see if we're not closing any files - not much of that going on.
>>>>> There's some manual gzipping happening - we've had that in place for over a
>>>>> year though - not sure why it'd start playing up now. Memory usage is high
>>>>> but consistent and doesn't increase.
>>>>>
>>>>> S
>>>>>
>>>>>
>>>>>
>>>>> Simon Morley
>>>>>
>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>> Bigger Chief | Cucumber Tony
>>>>>
>>>>> simon at PolkaSpots.com
>>>>> Linkedin: I'm on it again and it still sucks
>>>>> 020 7183 1471 <020%207183%201471>
>>>>>
>>>>> 🚀💥
>>>>>
>>>>> On 14 January 2016 at 22:14, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>>>
>>>>>> That sounds like a file descriptor leak. Are the CLOSE_WAITs growing
>>>>>> over time?
>>>>>>
>>>>>> You're right, New Relic is too high level, this is a layer 4-5 issue.
>>>>>>
>>>>>> The simplest thing that can plot some graphs will work. Throw the
>>>>>> dirtiest script together that curls the data out if it comes easy, it
>>>>>> doesn't matter how you get those metrics as long as you have them.
>>>>>>
>>>>>> This is a great blog post opportunity ; )
>>>>>>
>>>>>> On Thu, Jan 14, 2016 at 8:40 PM, Simon Morley <simon at polkaspots.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I would ordinarily agree with you about the connection however they
>>>>>>> hang around for hours sometimes.
>>>>>>>
>>>>>>> The 500 in the hyproxy config was actually left over from a previous
>>>>>>> experiment. Realistically I know they won't cope with that.
>>>>>>>
>>>>>>> Using another server was to find any issues with puma. I'm still
>>>>>>> going to try unicorn just in case.
>>>>>>>
>>>>>>> Will up the numbers too - thanks for that suggestion.
>>>>>>>
>>>>>>> I'll look at a better monitoring tool too. So far new relic hasn't
>>>>>>> helped much.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> S
>>>>>>>
>>>>>>> Simon Morley
>>>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'm doing it with Cucumber Tony. Are you?
>>>>>>>
>>>>>>> On 14 Jan 2016, at 20:30, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>>>>>
>>>>>>> Hi Simon,
>>>>>>>
>>>>>>> CLOSE_WAIT suggests that Puma is not closing connections fast
>>>>>>> enough. The client has asked for the connection to be closed, but Puma is
>>>>>>> busy.
>>>>>>>
>>>>>>> Quickest win would be to increase your Puma instances. Unicorn won't
>>>>>>> help - or any other Rack web server for the matter.
>>>>>>>
>>>>>>> Based on your numbers, start with 10 Puma instances. Anything more
>>>>>>> than 100 connections for a Rails instance is not realistic. I would
>>>>>>> personally go with 50, just to be safe. I think I saw 500 conns in your
>>>>>>> haproxy config, which is way too optimistic.
>>>>>>>
>>>>>>> You want metrics for detailed CPU usage by process, connections open
>>>>>>> with state by process, and memory usage, by process. Without these, you're
>>>>>>> flying blind. Any suggestions anyone makes without real metrics - including
>>>>>>> myself - are just guesses. You'll get there, but you're making it far too
>>>>>>> difficult for yourself.
>>>>>>>
>>>>>>> Let me know how it goes, Gerhard.
>>>>>>>
>>>>>>> On Thu, Jan 14, 2016 at 3:16 PM, Simon Morley <simon at polkaspots.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello All
>>>>>>>>
>>>>>>>> We've been battling with Puma for a long while now, I'm looking for
>>>>>>>> some help / love / attention / advice / anything to prevent further hair
>>>>>>>> loss.
>>>>>>>>
>>>>>>>> We're using it in a reasonably typical Rails 4 application behind
>>>>>>>> Nginx.
>>>>>>>>
>>>>>>>> Over the last 3 months, our requests have gone from 500 rpm to a
>>>>>>>> little over 1000 depending on the hour. Over this period, we've been seeing
>>>>>>>> weird CLOSE_WAIT conns appearing in netstat, which eventually kill the
>>>>>>>> servers.
>>>>>>>>
>>>>>>>> We have 3 Rails servers behind Haproxy running things. Load is
>>>>>>>> generally even.
>>>>>>>>
>>>>>>>> Running netstat on the servers shows a pile of connections in the
>>>>>>>> CLOSE_WAIT state with varying recv-q values as so:
>>>>>>>>
>>>>>>>> tcp      2784    0 localhost:58786         localhost:5100
>>>>>>>>  CLOSE_WAIT
>>>>>>>> tcp      717      0 localhost:35794         localhost:5100
>>>>>>>>  CLOSE_WAIT
>>>>>>>> tcp      784      0 localhost:55712         localhost:5100
>>>>>>>>  CLOSE_WAIT
>>>>>>>> tcp        0        0 localhost:38639         localhost:5100
>>>>>>>>    CLOSE_WAIT
>>>>>>>>
>>>>>>>> That's just a snippet. A wc reveals over 400 of these on each
>>>>>>>> server.
>>>>>>>>
>>>>>>>> Puma is running on port 5100 btw. We've tried puma with multiple
>>>>>>>> threads and a single one - same result. Latest version as of today.
>>>>>>>>
>>>>>>>> I've checked haproxy and don't see much lingering around.
>>>>>>>>
>>>>>>>> Only a kill -9 can stop Puma - otherwise, it says something like
>>>>>>>> 'waiting for requests to finish'
>>>>>>>>
>>>>>>>> I ran GDB to see if I could debug the process however I can't claim
>>>>>>>> I knew what I was looking at. The processes that seemed apparent were event
>>>>>>>> machine and mongo.
>>>>>>>>
>>>>>>>> We then ditched EM (we were using the AMQP gem) in favour of Bunny.
>>>>>>>> That made zero difference.
>>>>>>>>
>>>>>>>> So we upgraded Mongo and Mongoid to the latest versions, neither of
>>>>>>>> which helped.
>>>>>>>>
>>>>>>>> I thought we might have a bottleneck somewhere - Mongo, ES or
>>>>>>>> MySQL. But, none of those services seem to have any issues / latencies.
>>>>>>>>
>>>>>>>> It's also 100% random. Might happen 10 times in an hour, then not
>>>>>>>> at all for a week.
>>>>>>>>
>>>>>>>> The puma issues on github don't shed much light.
>>>>>>>>
>>>>>>>> I don't really know where to turn at the moment or what to do next?
>>>>>>>> I was going to resort back to Unicorn but I don't think the issue is that
>>>>>>>> side and I wanted to fix the problem, not just patch it up.
>>>>>>>>
>>>>>>>> It's starting to look like a nasty in my code somewhere but I don't
>>>>>>>> want to go down that route just yet...
>>>>>>>>
>>>>>>>> Sorry for the long email, thanks in advance. Stuff.
>>>>>>>>
>>>>>>>> I hope someone can help!
>>>>>>>>
>>>>>>>> S
>>>>>>>>
>>>>>>>> Simon Morley
>>>>>>>>
>>>>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>>>>> Bigger Chief | Cucumber Tony
>>>>>>>>
>>>>>>>> simon at PolkaSpots.com <simon at polkaspots.com>
>>>>>>>> Linkedin: I'm on it again and it still sucks
>>>>>>>> 020 7183 1471 <020%207183%201471>
>>>>>>>>
>>>>>>>> 🚀💥
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Chat mailing list
>>>>>>>> Chat at lists.lrug.org
>>>>>>>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>>>>>>>> Manage your subscription:
>>>>>>>> http://lists.lrug.org/options.cgi/chat-lrug.org
>>>>>>>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>>>>>>>
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Chat mailing list
>>>>>>> Chat at lists.lrug.org
>>>>>>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>>>>>>> Manage your subscription:
>>>>>>> http://lists.lrug.org/options.cgi/chat-lrug.org
>>>>>>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Chat mailing list
>>>>>>> Chat at lists.lrug.org
>>>>>>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>>>>>>> Manage your subscription:
>>>>>>> http://lists.lrug.org/options.cgi/chat-lrug.org
>>>>>>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Chat mailing list
>>>>>> Chat at lists.lrug.org
>>>>>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>>>>>> Manage your subscription:
>>>>>> http://lists.lrug.org/options.cgi/chat-lrug.org
>>>>>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Chat mailing list
>>>>> Chat at lists.lrug.org
>>>>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>>>>> Manage your subscription:
>>>>> http://lists.lrug.org/options.cgi/chat-lrug.org
>>>>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Chat mailing list
>>>> Chat at lists.lrug.org
>>>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>>>> Manage your subscription:
>>>> http://lists.lrug.org/options.cgi/chat-lrug.org
>>>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Chat mailing list
>>> Chat at lists.lrug.org
>>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>>> Manage your subscription:
>>> http://lists.lrug.org/options.cgi/chat-lrug.org
>>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>>
>>>
>>
>> _______________________________________________
>> Chat mailing list
>> Chat at lists.lrug.org
>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>> Manage your subscription: http://lists.lrug.org/options.cgi/chat-lrug.org
>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>
>>
> _______________________________________________
> Chat mailing list
> Chat at lists.lrug.org
> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
> Manage your subscription: http://lists.lrug.org/options.cgi/chat-lrug.org
> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lrug.org/pipermail/chat-lrug.org/attachments/20160218/4a4058c2/attachment.html>