[LRUG] Puma, CLOSE_WAIT. Arg.

Simon Morley simon at polkaspots.com
Thu Feb 18 04:28:00 PST 2016


Ruby 2.2.2
Rails 4.2.5.1
mysql2 0.4.2 (tried a few)
Puma 2.16.0



Simon Morley

Big Chief | PolkaSpots Supafly Wi-Fi
Bigger Chief | Cucumber Tony

Got an unlicensed Meraki? Set it free with Cucumber
cucumberwifi.io/meraki


On 18 February 2016 at 12:24, Riccardo Tacconi <rtacconi at gmail.com> wrote:

> Which version of Ruby are you using?
>
> On 18 February 2016 at 12:17, Simon Morley <simon at polkaspots.com> wrote:
>
>> Actually puma docs suggest doing that when using preload_app and
>> ActiveRecord...
>>
>> https://github.com/puma/puma#clustered-mode
>>
>>
>>
>> Simon Morley
>>
>> Big Chief | PolkaSpots Supafly Wi-Fi
>> Bigger Chief | Cucumber Tony
>>
>> Got an unlicensed Meraki? Set it free with Cucumber
>> cucumberwifi.io/meraki
>>
>>
>> On 18 February 2016 at 12:05, Frederick Cheung <
>> frederick.cheung at gmail.com> wrote:
>>
>>>
>>>
>>>
>>> On 18 February 2016 at 11:17:34, Simon Morley (simon at polkaspots.com)
>>> wrote:
>>>
>>>
>>> class RadiusDatabase < ActiveRecord::Base
>>>   self.abstract_class = true
>>>   establish_connection "radius_#{Rails.env}".to_sym
>>> end
>>>
>>> class Radacct < RadiusDatabase
>>> end
>>>
>>> Then I decreased our database pool from 20 to 5 and added a wait_timeout
>>> of 5 (since there seem to be some discrepancies with this). Things got
>>> much better (but weren't fixed).
>>>
>>> I tried querying differently, including using
>>> connection_pool.with_connection. I've tried closing the connections
>>> manually and also used ActiveRecord::Base.clear_active_connections!
>>> periodically. No joy.
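>>>
>>> For reference, the with_connection pattern I mean looks roughly like
>>> this (the query is only an illustrative example, not the real one):
>>>
>>> RadiusDatabase.connection_pool.with_connection do |conn|
>>>   # runs against the radius pool; the connection is checked back into
>>>   # the pool when the block returns
>>>   conn.select_value("SELECT COUNT(*) FROM radacct")
>>> end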
>>>
>>> By this point, we were running 2-4 instances - handling very little
>>> traffic in total (about 50 rpm). Every few hours, they'd block, all
>>> of them. At the same time, we'd see a load of rack timeouts - same DB. I've
>>> checked the connections - they were each opening only a few to MySQL and
>>> MySQL was looking good.
>>>
>>> One day, by chance, I reduced the 4 instances to 1. *And the problem is
>>> solved!!! WHAT*? Obviously the problem isn't solved, we can only use a
>>> single server.
>>>
>>>
>>> Are you using puma in the mode where it forks workers? If so, then you
>>> want to reconnect post-fork or multiple processes will share the same file
>>> descriptor and really weird shit will happen.
>>>
>>> The puma readme advises to do this:
>>>
>>> before_fork do
>>>   ActiveRecord::Base.connection_pool.disconnect!
>>> end
>>>
>>> I don't know off the top of my head whether that will do the job for
>>> classes that have established a connection to a different db - presumably
>>> they have a separate connection pool.
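>>>
>>> Untested, but something along these lines might cover both pools -
>>> RadiusDatabase being the abstract class from your snippet:
>>>
>>> # sketch for config/puma.rb, assuming preload_app! is on
>>> before_fork do
>>>   # drop every pool the master process holds, including the radius one
>>>   ActiveRecord::Base.connection_pool.disconnect!
>>>   RadiusDatabase.connection_pool.disconnect!
>>> end
>>>
>>> on_worker_boot do
>>>   # each worker then opens its own fresh connections after the fork
>>>   ActiveRecord::Base.establish_connection
>>>   RadiusDatabase.establish_connection "radius_#{Rails.env}".to_sym
>>> end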
>>>
>>> Fred
>>>
>>> I don't know what's going on here. Have I been staring at this for too
>>> long (yes)?
>>>
>>> Our other servers are chugging along happily now, using a connection
>>> pool of 20, no errors, no timeouts (different db though).
>>>
>>> Has anyone got any suggestions / seen this? Is there something
>>> fundamentally wrong with the way we're establishing a connection to the
>>> external dbs? Surely this is MySQL related?
>>>
>>> Thanks for listening,
>>>
>>> S
>>>
>>>
>>> Simon Morley
>>>
>>> Got an unlicensed Meraki? Set it free with Cucumber
>>> cucumberwifi.io/meraki
>>>
>>>
>>> On 15 January 2016 at 13:58, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>
>>>> The understanding of difficult problems/bugs and the learning that
>>>> comes with it cannot be rushed. Each and every one of us has his / her own
>>>> pace, and all "speeds" are perfectly fine. The only question that really
>>>> matters is whether it's worth it (a.k.a. the cost of lost opportunity). If
>>>> the answer is yes, plough on. If not, look for alternatives.
>>>>
>>>> Not everyone likes or wants to run their own infrastructure. The
>>>> monthly savings against advertised PaaS / IaaS costs are undisputed, but few
>>>> like to think - never mind talk - about how many hours / days / weeks have
>>>> been spent debugging obscure problems which "solve themselves" on a managed
>>>> environment. Don't get me started on those that are building their own
>>>> Docker-based PaaS-es without even realising it...
>>>>
>>>> As a side-note, I've been dealing with a similar TCP-related problem
>>>> for a while now, so I empathised with your struggles the second I saw
>>>> your post. One of us is bound to solve it first, and I hope it will be
>>>> you ; )
>>>>
>>>> Have a good one, Gerhard.
>>>>
>>>> On Fri, Jan 15, 2016 at 10:01 AM, Simon Morley <simon at polkaspots.com>
>>>> wrote:
>>>>
>>>>> You must be more patient than I am. It's been a long month - having
>>>>> said that, I'm excited to find the cause.
>>>>>
>>>>> I misunderstood you re. file descriptors. We checked the kernel limits
>>>>> / files open on the systems before and during, and there was nothing untoward.
>>>>>
>>>>> Since writing in, it's not happened as before - no doubt it'll take
>>>>> place during our forthcoming office move today.
>>>>>
>>>>> I ran a strace (thanks for that suggestion John) on a couple of
>>>>> processes yesterday and saw redis blocking. Restarted a few redis servers
>>>>> to see if that helped. Can't be certain yet.
>>>>>
>>>>> As soon as it's on, I'll run a tcpdump. How I'd not thought about that
>>>>> I don't know...
>>>>>
>>>>> Actually, this is one thing I dislike about Rails - it's so nice and
>>>>> easy to do everything, one forgets we're dealing with the real servers /
>>>>> components / connections. It's too abstract in ways, but that's a whole
>>>>> other debate :)
>>>>>
>>>>> S
>>>>>
>>>>>
>>>>>
>>>>> Simon Morley
>>>>>
>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>> Bigger Chief | Cucumber Tony
>>>>>
>>>>> simon at PolkaSpots.com
>>>>> Linkedin: I'm on it again and it still sucks
>>>>> 020 7183 1471
>>>>>
>>>>> 🚀💥
>>>>>
>>>>> On 15 January 2016 at 06:53, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>>>
>>>>>> File descriptors, for traditional reasons, include TCP connections.
>>>>>>
>>>>>> Are you logging all requests to a central location? When the problem
>>>>>> occurs, it might help taking a closer look at the type of requests you're
>>>>>> receiving.
>>>>>>
>>>>>> Depending on how long the mischief lasts, a tcpdump to pcap, then
>>>>>> wireshark might help. Same for an strace on the Puma processes, similar to
>>>>>> what John suggested. Those are low-level tools though - verbose, complex
>>>>>> and complete - and it's easy to get lost unless you know what you're looking for.
>>>>>>
>>>>>> In summary, CLOSE_WAITs piling up from haproxy (client role) to Puma
>>>>>> (server role) indicates the app not closing connections in time (or maybe
>>>>>> ever) - why? It's a fun one to troubleshoot ; )
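>>>>>>
>>>>>> To make the state concrete, here's a toy, self-contained illustration
>>>>>> of how a socket ends up in CLOSE_WAIT (throwaway port, nothing to do
>>>>>> with the real app):
>>>>>>
>>>>>> require 'socket'
>>>>>>
>>>>>> server = TCPServer.new(0)           # any free port
>>>>>> port   = server.addr[1]
>>>>>> client = TCPSocket.new('localhost', port)
>>>>>> sock   = server.accept
>>>>>> client.close                        # the client sends its FIN...
>>>>>> sleep 1
>>>>>> # ...but we never call sock.close, so the accepted socket now sits in
>>>>>> # CLOSE_WAIT - exactly the state that piles up when an app holds
>>>>>> # connections it should have closed
>>>>>> puts `netstat -tan | grep #{port}`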
>>>>>>
>>>>>> On Thu, Jan 14, 2016 at 11:35 PM, Simon Morley <simon at polkaspots.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Right now, none of the servers have any issues. No close_waits.
>>>>>>>
>>>>>>> All is well. Seemingly.
>>>>>>>
>>>>>>> When it occurs ALL the servers end up going. Sometimes real fast.
>>>>>>> That's why I thought we had a db bottleneck. It happens pretty quickly,
>>>>>>> randomly, no particular times.
>>>>>>>
>>>>>>> We don't ever really get spikes of traffic, there's an even load
>>>>>>> inbound throughout.
>>>>>>>
>>>>>>> I thought we had someone running a slow loris style attack on us. So
>>>>>>> I added some rules to HA Proxy and Cloudflare ain't seen nofin honest guv.
>>>>>>>
>>>>>>> Will find a way to chart it and send a link over.
>>>>>>>
>>>>>>> Will see if we're not closing any files - not much of that going on.
>>>>>>> There's some manual gzipping happening - we've had that in place for over a
>>>>>>> year though - not sure why it'd start playing up now. Memory usage is high
>>>>>>> but consistent and doesn't increase.
>>>>>>>
>>>>>>> S
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Simon Morley
>>>>>>>
>>>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>>>> Bigger Chief | Cucumber Tony
>>>>>>>
>>>>>>> simon at PolkaSpots.com
>>>>>>> Linkedin: I'm on it again and it still sucks
>>>>>>> 020 7183 1471
>>>>>>>
>>>>>>> 🚀💥
>>>>>>>
>>>>>>> On 14 January 2016 at 22:14, Gerhard Lazu <gerhard at lazu.co.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That sounds like a file descriptor leak. Are the CLOSE_WAITs
>>>>>>>> growing over time?
>>>>>>>>
>>>>>>>> You're right, New Relic is too high level, this is a layer 4-5
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> The simplest thing that can plot some graphs will work. Throw the
>>>>>>>> dirtiest script together that curls the data out if it comes easy; it
>>>>>>>> doesn't matter how you get those metrics as long as you have them.
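>>>>>>>>
>>>>>>>> As a rough sketch of what I mean (port 5100 and the interval are
>>>>>>>> just guesses from your netstat output, adjust as needed):
>>>>>>>>
>>>>>>>> # crude metrics loop: count CLOSE_WAITs on the Puma port every few
>>>>>>>> # seconds and append them to a file you can graph later
>>>>>>>> PORT = 5100
>>>>>>>> loop do
>>>>>>>>   count = `netstat -tan`.lines.count do |line|
>>>>>>>>     line.include?("CLOSE_WAIT") && line.include?(":#{PORT}")
>>>>>>>>   end
>>>>>>>>   File.open("close_wait.log", "a") { |f| f.puts "#{Time.now.to_i} #{count}" }
>>>>>>>>   sleep 10
>>>>>>>> end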
>>>>>>>>
>>>>>>>> This is a great blog post opportunity ; )
>>>>>>>>
>>>>>>>> On Thu, Jan 14, 2016 at 8:40 PM, Simon Morley <simon at polkaspots.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> I would ordinarily agree with you about the connections, however
>>>>>>>>> they hang around for hours sometimes.
>>>>>>>>>
>>>>>>>>> The 500 in the haproxy config was actually left over from a
>>>>>>>>> previous experiment. Realistically I know they won't cope with that.
>>>>>>>>>
>>>>>>>>> The point of trying another server was to see whether the issue lies
>>>>>>>>> with puma. I'm still going to try unicorn just in case.
>>>>>>>>>
>>>>>>>>> Will up the numbers too - thanks for that suggestion.
>>>>>>>>>
>>>>>>>>> I'll look at a better monitoring tool too. So far new relic hasn't
>>>>>>>>> helped much.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> S
>>>>>>>>>
>>>>>>>>> Simon Morley
>>>>>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm doing it with Cucumber Tony. Are you?
>>>>>>>>>
>>>>>>>>> On 14 Jan 2016, at 20:30, Gerhard Lazu <gerhard at lazu.co.uk> wrote:
>>>>>>>>>
>>>>>>>>> Hi Simon,
>>>>>>>>>
>>>>>>>>> CLOSE_WAIT suggests that Puma is not closing connections fast
>>>>>>>>> enough. The client has asked for the connection to be closed, but Puma is
>>>>>>>>> busy.
>>>>>>>>>
>>>>>>>>> Quickest win would be to increase your Puma instances. Unicorn
>>>>>>>>> won't help - nor will any other Rack web server, for that matter.
>>>>>>>>>
>>>>>>>>> Based on your numbers, start with 10 Puma instances. Anything more
>>>>>>>>> than 100 connections for a Rails instance is not realistic. I would
>>>>>>>>> personally go with 50, just to be safe. I think I saw 500 conns in your
>>>>>>>>> haproxy config, which is way too optimistic.
>>>>>>>>>
>>>>>>>>> You want metrics for detailed CPU usage by process, connections
>>>>>>>>> open with state by process, and memory usage, by process. Without these,
>>>>>>>>> you're flying blind. Any suggestions anyone makes without real metrics -
>>>>>>>>> including myself - are just guesses. You'll get there, but you're making it
>>>>>>>>> far too difficult for yourself.
>>>>>>>>>
>>>>>>>>> Let me know how it goes, Gerhard.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 14, 2016 at 3:16 PM, Simon Morley <
>>>>>>>>> simon at polkaspots.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello All
>>>>>>>>>>
>>>>>>>>>> We've been battling with Puma for a long while now, I'm looking
>>>>>>>>>> for some help / love / attention / advice / anything to prevent further
>>>>>>>>>> hair loss.
>>>>>>>>>>
>>>>>>>>>> We're using it in a reasonably typical Rails 4 application behind
>>>>>>>>>> Nginx.
>>>>>>>>>>
>>>>>>>>>> Over the last 3 months, our requests have gone from 500 rpm to a
>>>>>>>>>> little over 1000 depending on the hour. Over this period, we've been seeing
>>>>>>>>>> weird CLOSE_WAIT conns appearing in netstat, which eventually kill the
>>>>>>>>>> servers.
>>>>>>>>>>
>>>>>>>>>> We have 3 Rails servers behind Haproxy running things. Load is
>>>>>>>>>> generally even.
>>>>>>>>>>
>>>>>>>>>> Running netstat on the servers shows a pile of connections in the
>>>>>>>>>> CLOSE_WAIT state with varying recv-q values as so:
>>>>>>>>>>
>>>>>>>>>> tcp     2784      0 localhost:58786      localhost:5100      CLOSE_WAIT
>>>>>>>>>> tcp      717      0 localhost:35794      localhost:5100      CLOSE_WAIT
>>>>>>>>>> tcp      784      0 localhost:55712      localhost:5100      CLOSE_WAIT
>>>>>>>>>> tcp        0      0 localhost:38639      localhost:5100      CLOSE_WAIT
>>>>>>>>>>
>>>>>>>>>> That's just a snippet. A wc reveals over 400 of these on each
>>>>>>>>>> server.
>>>>>>>>>>
>>>>>>>>>> Puma is running on port 5100 btw. We've tried puma with multiple
>>>>>>>>>> threads and a single one - same result. Latest version as of today.
>>>>>>>>>>
>>>>>>>>>> I've checked haproxy and don't see much lingering around.
>>>>>>>>>>
>>>>>>>>>> Only a kill -9 can stop Puma - otherwise, it says something like
>>>>>>>>>> 'waiting for requests to finish'
>>>>>>>>>>
>>>>>>>>>> I ran GDB to see if I could debug the process, but I can't
>>>>>>>>>> claim I knew what I was looking at. The processes that seemed apparent were
>>>>>>>>>> EventMachine and Mongo.
>>>>>>>>>>
>>>>>>>>>> We then ditched EM (we were using the AMQP gem) in favour of
>>>>>>>>>> Bunny. That made zero difference.
>>>>>>>>>>
>>>>>>>>>> So we upgraded Mongo and Mongoid to the latest versions, neither
>>>>>>>>>> of which helped.
>>>>>>>>>>
>>>>>>>>>> I thought we might have a bottleneck somewhere - Mongo, ES or
>>>>>>>>>> MySQL. But, none of those services seem to have any issues / latencies.
>>>>>>>>>>
>>>>>>>>>> It's also 100% random. Might happen 10 times in an hour, then not
>>>>>>>>>> at all for a week.
>>>>>>>>>>
>>>>>>>>>> The puma issues on github don't shed much light.
>>>>>>>>>>
>>>>>>>>>> I don't really know where to turn at the moment or what to do
>>>>>>>>>> next. I was going to revert to Unicorn but I don't think the issue is
>>>>>>>>>> on that side, and I wanted to fix the problem, not just patch it up.
>>>>>>>>>>
>>>>>>>>>> It's starting to look like a nasty in my code somewhere but I
>>>>>>>>>> don't want to go down that route just yet...
>>>>>>>>>>
>>>>>>>>>> Sorry for the long email, thanks in advance. Stuff.
>>>>>>>>>>
>>>>>>>>>> I hope someone can help!
>>>>>>>>>>
>>>>>>>>>> S
>>>>>>>>>>
>>>>>>>>>> Simon Morley
>>>>>>>>>>
>>>>>>>>>> Big Chief | PolkaSpots Supafly Wi-Fi
>>>>>>>>>> Bigger Chief | Cucumber Tony
>>>>>>>>>>
>>>>>>>>>> simon at PolkaSpots.com
>>>>>>>>>> Linkedin: I'm on it again and it still sucks
>>>>>>>>>> 020 7183 1471
>>>>>>>>>>
>>>>>>>>>> 🚀💥
>>>>>>>>>>
>
>
> --
> Riccardo Tacconi
>
> http://github.com/rtacconi
> http://twitter.com/rtacconi
>
>