Unusable instance -- 100% CPU from Passenger RubyApp -- Foreman 1.12, CentOS 7

From a fresh restart, after a random amount of time (typically within 10
minutes), my Foreman / Katello instance becomes unusable: two "Passenger
RubyApp: /usr/share/foreman" processes spike to 100% CPU and stay there.
I'm not seeing anything obvious in the logs, and I've tried attaching
strace to the Ruby processes, but nothing looks out of the ordinary
(although I'm not the sharpest with strace output).
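
For the record, this is roughly how I was attaching strace (the PID and
output path are just examples, taken from top):

# find the busy Passenger workers
top -b -n 1 | grep -i ruby
# follow the process and its threads, with timestamps
strace -f -tt -p 12345 -o /tmp/ruby-12345.trace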

Any help would be appreciated, as we're dead in the water. I even tried
using the backup/restore functionality to migrate to a new server (which
worked after a bit of massaging), but ended up with the same two Ruby
processes at 100% CPU.

Justin

From another thread yesterday, lzap wrote:

Can you guys try the foreman-tracer utility (SystemTap-based; it only
works on CentOS 7 or higher, or other SystemTap-enabled kernels) on your
production instance? No changes are required in Foreman, and setup is
quite easy:

https://github.com/lzap/foreman-tracer
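
Roughly this, assuming a stock CentOS 7 box (from memory; see the README
for the exact steps):

# SystemTap provides the tracing machinery foreman-tracer drives
yum -y install systemtap systemtap-runtime
git clone https://github.com/lzap/foreman-tracer
cd foreman-tracer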

Interesting statistics would be

foreman-tracer rails objects-total

and

foreman-tracer rails objects

It gives a "top"-like experience; please pastebin the bottlenecks.


I wasn't sure how long to run them, so I ran each for 60 seconds.

foreman-tracer rails objects-total
http://pastebin.com/QdZePcWQ

foreman-tracer rails objects
http://pastebin.com/jVbDRm3c
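
A fixed-length run like that is easy to do with timeout(1):

timeout 60 foreman-tracer rails objects-total
timeout 60 foreman-tracer rails objects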

Ok, one minute is fine; the counters reset after 5 minutes anyway.

The problem is in the setup_clone / setup_object_clone method, which
creates a deep copy of each record for comparison. But I wonder how it is
possible that you have 20k calls of this clone after just one minute.

Tell me more about your infrastructure. How many hosts? What is the
average count of NICs associated with a host? Do you have some kind of
broken host with 20k NICs associated? Remember, a Puppet fact upload will
create a NIC record for each NIC reported, so you could have some broken
host reporting an "ethXYZ_address" fact on each Puppet run, causing the
NIC table to grow.
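
You can look for a runaway host directly in the database. Assuming a
default installation (PostgreSQL database "foreman", interfaces stored in
the "nics" table), something like:

# top ten hosts by NIC record count
su - postgres -c "psql foreman -c 'SELECT host_id, COUNT(*) FROM nics GROUP BY host_id ORDER BY count DESC LIMIT 10;'"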

Also, can you tell whether the 100% CPU utilization is from the Ruby
process itself, or is it caused by swapping? If it really is the Ruby
process doing the work, please also run

foreman-tracer rails calls

for a minute or two to see where it is looping. Then pastebin again, thanks.
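
A quick way to tell the two apart is to watch swap traffic while the CPU
is pegged:

# si/so columns show swap-in/out; us/sy is real CPU work
vmstat 1 5
# confirm it is the Ruby workers burning CPU rather than I/O wait
top -b -n 1 | head -n 20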

LZ


Okay, now we're getting somewhere. This is an environment with five oVirt
nodes in two clusters.
Cluster1: 2 nodes
Cluster2: 3 nodes

Each node hosts a handful of VMs, some manually installed and some
provisioned by Foreman, across both clusters.

The oVirt nodes each appear to have maybe 10-50 NICs (mostly VLAN
interfaces and bridges for VMs). When I run the Puppet agent on any of the
three nodes in Cluster2, the Nic::Managed count shoots through the roof.
Even if I kill the Puppet agent, the count continues to rise.
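
I was watching the count with something along these lines, re-run every
few seconds; the number kept growing:

# print the total number of managed NIC records
echo 'puts Nic::Managed.count' | foreman-rake console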

Here's a list of the NICs on one of the offending hosts:

http://pastebin.com/DxZup68B

Honestly, the NIC information for these hosts isn't very useful. As a
temporary workaround, is there a way to exclude gathering NIC information
during this process?

Thanks!
Justin


foreman-tracer rails calls:
http://pastebin.com/80ziGqXr


I set the 'ignore interfaces with matching identifier' option (Josh's
suggestion, below) to ignore the offending interface names, but the
Nic::Managed count was still climbing. It turned out the offending hosts
had hundreds of duplicate interfaces, so I wrote a little hammer script
(sketched after this message) to delete all interfaces from those hosts.
Now all appears to be well.
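
The script was essentially this. It's reconstructed from memory, so the
exact hammer flags may differ between versions; also, hammer refuses to
delete a host's primary/provision interface, which is what the || true
covers:

#!/bin/bash
# Usage: ./delete-nics.sh host.example.com
# Deletes every deletable interface record from the given host.
HOST="$1"
hammer --csv host interface list --host "$HOST" |
  tail -n +2 |    # skip the CSV header row
  cut -d, -f1 |   # first column is the interface Id
  while read -r ID; do
    hammer host interface delete --host "$HOST" --id "$ID" || true
  done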

Thanks for all the help!

Justin


See the 'ignore interfaces with matching identifier' option under Settings
-> Provisioning.

I had the same problem with Docker network interfaces.
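
If you'd rather flip it from the CLI, the same setting can probably be
changed with hammer. The internal setting name below is a guess from the
UI label, so verify it with the list command first:

# find the exact setting name
hammer settings list | grep -i interface
# illustrative value only; append your patterns rather than replacing the defaults
hammer settings set --name ignored_interface_identifiers --value '[lo, usb*, vnet*, macvtap*, _vdsmdummy_, veth*, docker*]'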

Josh


Glad you sorted it out. Thank you so much for the input data; we will
work on performance improvements in this area. The 20k dup calls do not
seem right, so there must be a snag somewhere.



Later,
Lukas @lzap Zapletal