[LRUG] Puma, CLOSE_WAIT. Arg.

Thu Feb 18 04:24:53 PST 2016

Which version of Ruby are you using?

On 18 February 2016 at 12:17, Simon Morley <simon at polkaspots.com> wrote:

> Actually puma docs suggest doing that when using preload_app and
> ActiveRecord...
>
> https://github.com/puma/puma#clustered-mode
>
>
>
> Simon Morley
>
> Big Chief | PolkaSpots Supafly Wi-Fi
> Bigger Chief | Cucumber Tony
>
> Got an unlicensed Meraki? Set it free with Cucumber
> cucumberwifi.io/meraki
>
>
> On 18 February 2016 at 12:05, Frederick Cheung <frederick.cheung at gmail.com> wrote:
>
>>
>>
>>
>> On 18 February 2016 at 11:17:34, Simon Morley (simon at polkaspots.com)
>> wrote:
>>
>>
>> class RadiusDatabase < ActiveRecord::Base
>>   self.abstract_class = true
>>   establish_connection "radius_#{Rails.env}".to_sym
>> end
>>
>> class Radacct < RadiusDatabase
>> end
>>
>> Then I decreased our database pool from 20 to 5 and added a wait_timeout
>> of 5 (since there seem to be some discrepancies with this). Things got
>> much better (but weren't fixed).
>>
>> I tried querying differently, including using
>> connection_pool.with_connection. I've tried closing the connections
>> manually and also used ActiveRecord::Base.clear_active_connections!
>> periodically. No joy.
>>
>> By this point, we were running 2-4 instances, handling very
>> little traffic in total (about 50 rpm). Every few hours, they'd block, all
>> of them. At the same time, we'd see a load of rack timeouts - same DB. I've
>> checked the connections - they were each opening only a few to MySQL and
>> MySQL was looking good.
>>
>> One day, by chance, I reduced the 4 instances to 1. *And the problem is
>> solved!!! WHAT*? Obviously the problem isn't solved, we can only use a
>> single server.
>>
>>
>> Are you using puma in the mode where it forks workers? If so, then you
>> want to reconnect post fork or multiple processes will share the same file
>> descriptor and really weird shit will happen.
>>
>> The puma readme advises to do this:
>>
>> before_fork do
>>   ActiveRecord::Base.connection_pool.disconnect!
>> end
>>
>> I don't know off the top of my head whether that will do the job for
>> classes that have established a connection to a different db - presumably
>> they have a separate connection pool
>>
>> Fred
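
On the open question above about classes with their own connection: a rough sketch of a config/puma.rb that disconnects every pool before forking, not just the one on ActiveRecord::Base. This is untested against the app in this thread and assumes Rails 4.x with preload_app!; the worker count is a made-up value, and RadiusDatabase is the abstract class quoted earlier.

```ruby
# config/puma.rb -- a sketch under the assumptions stated above.
workers 2        # hypothetical value, pick what suits the box
preload_app!

before_fork do
  # clear_all_connections! disconnects every pool the handler knows about,
  # so the separate RadiusDatabase pool is covered as well and no socket
  # is inherited by the forked workers.
  ActiveRecord::Base.connection_handler.clear_all_connections!
end

on_worker_boot do
  ActiveSupport.on_load(:active_record) do
    # Each worker opens its own connections after the fork.
    ActiveRecord::Base.establish_connection
    RadiusDatabase.establish_connection "radius_#{Rails.env}".to_sym
  end
end
```

The explicit establish_connection calls may be redundant, since the pools should reconnect lazily after a disconnect, but they make the intent obvious.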
>>
>> I don't know what's going on here. Have I been staring at this for too
>> long (yes)?
>>
>> Our other servers are chugging along happily now, using a connection pool
>> of 20, no errors, no timeouts (different db though).
>>
>> Has anyone got any suggestions / seen this? Is there something
>> fundamentally wrong with the way we're establishing a connection to the
>> external dbs? Surely this is MySQL related
>>
>> Thanks for listening,
>>
>> S
>>
>>
>> Simon Morley
>>
>> Got an unlicensed Meraki? Set it free with Cucumber
>> cucumberwifi.io/meraki
>>
>>
>> On 15 January 2016 at 13:58, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>
>>> The understanding of difficult problems/bugs and the learning that comes
>>> with it cannot be rushed. Each and every one of us has his / her own pace,
>>> and all "speeds" are perfectly fine. The only question that really matters
>>> is whether it's worth it (a.k.a. the cost of lost opportunity). If the
>>> answer is yes, plough on. If not, look for alternatives.
>>>
>>> Not everyone likes or wants to run their own infrastructure. The monthly
>>> savings on the PaaS, IaaS advertised costs are undisputed, but few like to
>>> think - never mind talk - about how many hours / days / weeks have been
>>> spent debugging obscure problems which "solve themselves" on a managed
>>> environment. Don't get me started on those that are building their own
>>> Docker-based PaaS-es without even realising it...
>>>
>>> As a side-note, I've been dealing with a similar TCP-related problem for
>>> a while now, so I could empathise with your struggles the second I saw
>>> your post. One of us is bound to solve it first, and I hope it will be you
>>> ; )
>>>
>>> Have a good one, Gerhard.
>>>
>>> On Fri, Jan 15, 2016 at 10:01 AM, Simon Morley <simon at polkaspots.com>
>>> wrote:
>>>
>>>> You must be more patient than I am. It's been a long month - having
>>>> said that, I'm excited to find the cause.
>>>>
>>>> I misunderstood you re. file descriptors. We checked the kernel limits
>>>> / files open on the systems before and during and there's nothing untoward.
>>>>
>>>> Since writing in, it's not happened as before - no doubt it'll take
>>>> place during our forthcoming office move today.
>>>>
>>>> I ran a strace (thanks for that suggestion John) on a couple of
>>>> processes yesterday and saw redis blocking. Restarted a few redis servers
>>>> to see if that helped. Can't be certain yet.
>>>>
>>>> As soon as it's on, I'll run a tcpdump. How I'd not thought about that
>>>> I don't know...
>>>>
>>>> Actually, this is one thing I dislike about Rails - it's so nice and
>>>> easy to do everything, one forgets we're dealing with the real servers /
>>>> components / connections. It's too abstract in ways, but that's a whole
>>>> other debate :)
>>>>
>>>> S
>>>>
>>>>
>>>>
>>>> Simon Morley
>>>>
>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>> Bigger Chief | Cucumber Tony
>>>>
>>>> simon at PolkaSpots.com
>>>> Linkedin: I'm on it again and it still sucks
>>>> 020 7183 1471
>>>>
>>>> 🚀💥
>>>>
>>>> On 15 January 2016 at 06:53, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>>
>>>>> File descriptors, for traditional reasons, include TCP connections.
>>>>>
>>>>> Are you logging all requests to a central location? When the problem
>>>>> occurs, it might help taking a closer look at the type of requests you're
>>>>> receiving.
>>>>>
>>>>> Depending on how long the mischief lasts, a tcpdump to pcap, then
>>>>> wireshark might help. Same for an strace on the Puma processes, similar to
>>>>> what John suggested. Those are low level tools though: verbose, complex
>>>>> and complete. It's easy to get lost unless you know what you're looking for.
>>>>>
>>>>> In summary, CLOSE_WAITs piling up from haproxy (client role) to Puma
>>>>> (server role) indicate the app is not closing connections in time (or maybe
>>>>> ever) - why? It's a fun one to troubleshoot ; )
>>>>>
>>>>> On Thu, Jan 14, 2016 at 11:35 PM, Simon Morley <simon at polkaspots.com>
>>>>> wrote:
>>>>>
>>>>>> Right now, none of the servers have any issues. No close_waits.
>>>>>>
>>>>>> All is well. Seemingly.
>>>>>>
>>>>>> When it occurs ALL the servers end up going. Sometimes real fast.
>>>>>> That's why I thought we had a db bottleneck. It happens pretty quickly,
>>>>>> randomly, no particular times.
>>>>>>
>>>>>> We don't ever really get spikes of traffic, there's an even load
>>>>>> inbound throughout.
>>>>>>
>>>>>> I thought we had someone running a slow loris style attack on us. So
>>>>>> I added some rules to HA Proxy and Cloudflare ain't seen nofin honest guv.
>>>>>>
>>>>>> Will find a way to chart it and send a link over.
>>>>>>
>>>>>> Will see if we're not closing any files - not much of that going on.
>>>>>> There's some manual gzipping happening - we've had that in place for over a
>>>>>> year though - not sure why it'd start playing up now. Memory usage is high
>>>>>> but consistent and doesn't increase.
>>>>>>
>>>>>> S
>>>>>>
>>>>>>
>>>>>>
>>>>>> Simon Morley
>>>>>>
>>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>>> Bigger Chief | Cucumber Tony
>>>>>>
>>>>>> simon at PolkaSpots.com
>>>>>> Linkedin: I'm on it again and it still sucks
>>>>>> 020 7183 1471
>>>>>>
>>>>>> 🚀💥
>>>>>>
>>>>>> On 14 January 2016 at 22:14, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>>>>
>>>>>>> That sounds like a file descriptor leak. Are the CLOSE_WAITs growing
>>>>>>> over time?
>>>>>>>
>>>>>>> You're right, New Relic is too high level, this is a layer 4-5 issue.
>>>>>>>
>>>>>>> The simplest thing that can plot some graphs will work. Throw the
>>>>>>> dirtiest script together that curls the data out if it comes easy, it
>>>>>>> doesn't matter how you get those metrics as long as you have them.
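
Something along these lines would do as the quick and dirty script: it tallies TCP states for one port from netstat-style text. It assumes the usual Linux column order (proto, recv-q, send-q, local, foreign, state); the method name is made up, and the sample lines are the ones from the original post further down.

```ruby
# Tally TCP connection states for one port from `netstat -tan`-style text.
# Assumed column layout: proto, recv-q, send-q, local, foreign, state.
def tally_states(netstat_output, port)
  suffix = ":#{port}"
  counts = Hash.new(0)
  netstat_output.each_line do |line|
    proto, _recv_q, _send_q, local, foreign, state = line.split
    # Skip headers, blank lines and anything that is not a TCP row.
    next unless proto.to_s.start_with?("tcp") && state
    # Keep only rows where either end of the connection is on our port.
    next unless local.end_with?(suffix) || foreign.end_with?(suffix)
    counts[state] += 1
  end
  counts
end

# The four rows quoted in the original post.
SAMPLE = <<~OUT
  tcp     2784      0 localhost:58786         localhost:5100          CLOSE_WAIT
  tcp      717      0 localhost:35794         localhost:5100          CLOSE_WAIT
  tcp      784      0 localhost:55712         localhost:5100          CLOSE_WAIT
  tcp        0      0 localhost:38639         localhost:5100          CLOSE_WAIT
OUT

p tally_states(SAMPLE, 5100)
```

Feeding it the live netstat output once a minute and appending the counts to a file with a timestamp should be enough to see whether the CLOSE_WAITs grow over time.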
>>>>>>>
>>>>>>> This is a great blog post opportunity ; )
>>>>>>>
>>>>>>> On Thu, Jan 14, 2016 at 8:40 PM, Simon Morley <simon at polkaspots.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I would ordinarily agree with you about the connections; however, they
>>>>>>>> hang around for hours sometimes.
>>>>>>>>
>>>>>>>> The 500 in the haproxy config was actually left over from a
>>>>>>>> previous experiment. Realistically I know they won't cope with that.
>>>>>>>>
>>>>>>>> Using another server was to find any issues with puma. I'm still
>>>>>>>> going to try unicorn just in case.
>>>>>>>>
>>>>>>>> Will up the numbers too - thanks for that suggestion.
>>>>>>>>
>>>>>>>> I'll look at a better monitoring tool too. So far new relic hasn't
>>>>>>>> helped much.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> S
>>>>>>>>
>>>>>>>> Simon Morley
>>>>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm doing it with Cucumber Tony. Are you?
>>>>>>>>
>>>>>>>> On 14 Jan 2016, at 20:30, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>>>>>>
>>>>>>>> Hi Simon,
>>>>>>>>
>>>>>>>> CLOSE_WAIT suggests that Puma is not closing connections fast
>>>>>>>> enough. The client has asked for the connection to be closed, but Puma is
>>>>>>>> busy.
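
To make the mechanism concrete, here is a small standalone sketch, not taken from anyone's app, using only Ruby's socket library on loopback. Once one side closes, the other side's socket sits in CLOSE_WAIT until that application calls close itself. The method name is made up for illustration.

```ruby
require "socket"

# One side closes; the other sees EOF and must close too. Between the
# peer's close and our own close, our socket is in CLOSE_WAIT.
def half_close_demo
  server = TCPServer.new("127.0.0.1", 0)   # port 0: let the OS pick one
  port = server.addr[1]

  client = TCPSocket.new("127.0.0.1", port)
  conn = server.accept

  client.close                   # the peer sends its FIN

  # read(1) blocks until data or EOF arrives; nil means the peer closed.
  saw_eof = conn.read(1).nil?

  # Until this line runs, conn would show up as CLOSE_WAIT in netstat.
  conn.close
  server.close
  saw_eof
end

puts "peer closed, EOF seen: #{half_close_demo}"
```

If the process is blocked elsewhere and never reaches its close, the socket stays in that state, which is how they can hang around for hours.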
>>>>>>>>
>>>>>>>> Quickest win would be to increase your Puma instances. Unicorn
>>>>>>>> won't help - or any other Rack web server for that matter.
>>>>>>>>
>>>>>>>> Based on your numbers, start with 10 Puma instances. Anything more
>>>>>>>> than 100 connections for a Rails instance is not realistic. I would
>>>>>>>> personally go with 50, just to be safe. I think I saw 500 conns in your
>>>>>>>> haproxy config, which is way too optimistic.
>>>>>>>>
>>>>>>>> You want metrics for detailed CPU usage by process, connections
>>>>>>>> open with state by process, and memory usage, by process. Without these,
>>>>>>>> you're flying blind. Any suggestions anyone makes without real metrics -
>>>>>>>> including myself - are just guesses. You'll get there, but you're making it
>>>>>>>> far too difficult for yourself.
>>>>>>>>
>>>>>>>> Let me know how it goes, Gerhard.
>>>>>>>>
>>>>>>>> On Thu, Jan 14, 2016 at 3:16 PM, Simon Morley <simon at polkaspots.com> wrote:
>>>>>>>>
>>>>>>>>> Hello All
>>>>>>>>>
>>>>>>>>> We've been battling with Puma for a long while now, I'm looking
>>>>>>>>> for some help / love / attention / advice / anything to prevent further
>>>>>>>>> hair loss.
>>>>>>>>>
>>>>>>>>> We're using it in a reasonably typical Rails 4 application behind
>>>>>>>>> Nginx.
>>>>>>>>>
>>>>>>>>> Over the last 3 months, our requests have gone from 500 rpm to a
>>>>>>>>> little over 1000 depending on the hour. Over this period, we've been seeing
>>>>>>>>> weird CLOSE_WAIT conns appearing in netstat, which eventually kill the
>>>>>>>>> servers.
>>>>>>>>>
>>>>>>>>> We have 3 Rails servers behind Haproxy running things. Load is
>>>>>>>>> generally even.
>>>>>>>>>
>>>>>>>>> Running netstat on the servers shows a pile of connections in the
>>>>>>>>> CLOSE_WAIT state with varying recv-q values as so:
>>>>>>>>>
>>>>>>>>> tcp     2784      0 localhost:58786         localhost:5100          CLOSE_WAIT
>>>>>>>>> tcp      717      0 localhost:35794         localhost:5100          CLOSE_WAIT
>>>>>>>>> tcp      784      0 localhost:55712         localhost:5100          CLOSE_WAIT
>>>>>>>>> tcp        0      0 localhost:38639         localhost:5100          CLOSE_WAIT
>>>>>>>>>
>>>>>>>>> That's just a snippet. A wc reveals over 400 of these on each
>>>>>>>>> server.
>>>>>>>>>
>>>>>>>>> Puma is running on port 5100 btw. We've tried puma with multiple
>>>>>>>>> threads and a single one - same result. Latest version as of today.
>>>>>>>>>
>>>>>>>>> I've checked haproxy and don't see much lingering around.
>>>>>>>>>
>>>>>>>>> Only a kill -9 can stop Puma - otherwise, it says something like
>>>>>>>>> 'waiting for requests to finish'
>>>>>>>>>
>>>>>>>>> I ran GDB to see if I could debug the process however I can't
>>>>>>>>> claim I knew what I was looking at. The processes that seemed apparent were
>>>>>>>>> event machine and mongo.
>>>>>>>>>
>>>>>>>>> We then ditched EM (we were using the AMQP gem) in favour of
>>>>>>>>> Bunny. That made zero difference.
>>>>>>>>>
>>>>>>>>> So we upgraded Mongo and Mongoid to the latest versions, neither
>>>>>>>>> of which helped.
>>>>>>>>>
>>>>>>>>> I thought we might have a bottleneck somewhere - Mongo, ES or
>>>>>>>>> MySQL. But, none of those services seem to have any issues / latencies.
>>>>>>>>>
>>>>>>>>> It's also 100% random. Might happen 10 times in an hour, then not
>>>>>>>>> at all for a week.
>>>>>>>>>
>>>>>>>>> The puma issues on github don't shed much light.
>>>>>>>>>
>>>>>>>>> I don't really know where to turn at the moment or what to do
>>>>>>>>> next? I was going to revert to Unicorn but I don't think the issue is
>>>>>>>>> that side and I wanted to fix the problem, not just patch it up.
>>>>>>>>>
>>>>>>>>> It's starting to look like a nasty in my code somewhere but I
>>>>>>>>> don't want to go down that route just yet...
>>>>>>>>>
>>>>>>>>> Sorry for the long email, thanks in advance. Stuff.
>>>>>>>>>
>>>>>>>>> I hope someone can help!
>>>>>>>>>
>>>>>>>>> S
>>>>>>>>>
>>>>>>>>> Simon Morley
>>>>>>>>>
>>>>>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>>>>>> Bigger Chief | Cucumber Tony
>>>>>>>>>
>>>>>>>>> simon at PolkaSpots.com
>>>>>>>>> Linkedin: I'm on it again and it still sucks
>>>>>>>>> 020 7183 1471
>>>>>>>>>
>>>>>>>>> 🚀💥
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Chat mailing list
>>>>>>>>> Chat at lists.lrug.org
>>>>>>>>> Archives: http://lists.lrug.org/pipermail/chat-lrug.org
>>>>>>>>> Manage your subscription:
>>>>>>>>> http://lists.lrug.org/options.cgi/chat-lrug.org
>>>>>>>>> List info: http://lists.lrug.org/listinfo.cgi/chat-lrug.org
>>>>>>>>>

-- 
Riccardo Tacconi

http://github.com/rtacconi
http://twitter.com/rtacconi