Trends:clean cron job makes AWS instance unresponsive

Fergus_Nelson · April 30, 2014, 9:21am

http://projects.theforeman.org/issues/3983

We upgraded to 1.4 yesterday, and at 0800 UTC this morning the instance
became unresponsive. The symptom was a complete loss of network
connectivity. The instance could only be recovered by a stop and start.

I have disabled the cron job until I find a solution.

Has anyone else experienced this?
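
A possible stopgap, short of disabling the job outright, is to run the task under a memory cap so that a runaway process fails on its own instead of starving the whole instance. The following is only a sketch under assumptions: run_with_mem_limit is a hypothetical helper name, the 2 GB figure is arbitrary, "ulimit -v" is assumed to be supported by the shell (bash and dash support it), and the limit is assumed to survive whatever user switching foreman-rake does internally, which has not been verified here.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a cap on virtual memory.
# Usage: run_with_mem_limit LIMIT_KB COMMAND [ARGS...]
run_with_mem_limit() {
    limit_kb=$1
    shift
    (
        # The limit applies only inside this subshell and its children.
        ulimit -v "$limit_kb" || exit 125
        exec "$@"
    )
}

# Example (assumption: a 2 GB cap is enough for a healthy run):
# run_with_mem_limit 2097152 foreman-rake trends:clean
```

If the task exceeds the cap, memory allocation fails and the command exits with a non-zero status, which cron can then report by mail.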

Fergus_Nelson · April 30, 2014, 9:50am

Further investigation suggests a memory leak of some kind. I was
able to get the following from top before the machine died.

 PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
4127 foreman 20  0 6207m 5.1g 1176 R   12 70.6  0:56.29 ruby
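
To record the memory growth instead of catching it by hand in top, the resident set size of the process can be sampled while it runs. This is a minimal sketch, not a tested Foreman procedure: peak_rss_kb is a hypothetical helper, it assumes a procps-style ps, and it only watches the PID it starts, so if foreman-rake launches ruby as a child process the child's memory is not counted.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background, sample its
# resident set size once per second, and print the peak in kB.
# Spikes shorter than the sampling interval can be missed.
peak_rss_kb() {
    "$@" >/dev/null 2>&1 &
    pid=$!
    peak=0
    while kill -0 "$pid" 2>/dev/null; do
        # "rss=" prints the value without a header line.
        rss=$(ps -o rss= -p "$pid" 2>/dev/null | tr -d ' ')
        if [ -n "$rss" ] && [ "$rss" -gt "$peak" ]; then
            peak=$rss
        fi
        sleep 1
    done
    wait "$pid"
    echo "$peak"
}

# Example (assumes foreman-rake is on PATH):
# peak_rss_kb foreman-rake trends:clean
```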

Sean_Alderman · April 30, 2014, 1:06pm

You may find this thread interesting.

https://groups.google.com/d/msg/foreman-users/hi7TagIRE8A/5ZRzlI8ChxEJ

In my case, I have 5 trends and about 115 hosts in Foreman. I
have increased the RAM allocated to my Foreman server to 12GB, and so far
I have not run into the trouble again. That said, I still feel this is an
unreasonable amount of resources to allocate for such a small environment,
and if the issue resurfaces I may have to migrate the Foreman server to
physical hardware.

lzap · May 5, 2014, 8:05am

I am willing to help track down the Ruby memory leak with SystemTap.

Are you guys on RHEL6 or a clone?

Would you mind probing the process with:

http://lukas.zapletalovi.com/2012/01/probing-ruby-apps-with-systemtap-in.html

In our case that would be:

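yum -y install systemtap systemtap-runtime ruby
cd /root
# -O saves the download under a usable file name
wget -O rubyfuntop.stp "https://sourceware.org/systemtap/wiki/RubyMarker?action=AttachFile&do=get&target=rubyfuntop.stp"
# -c takes the command to run and probe as a single argument
stap -c "foreman-rake trends:clean" rubyfuntop.stp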

Then watch the top-like output for the functions that are called most
often.

You can also do something similar with the rubycount.stp SystemTap script
from the blog post above, which gives you the top X function calls. Paste
the top 100 lines to me for starters.
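
One way to collect those lines is to save the full output of the run and then cut out the first 100 lines. This is a sketch under assumptions: capture_top_lines and the output file names are hypothetical, and it assumes rubycount.stp prints its most-called functions first, which has not been checked against the script.

```shell
#!/bin/sh
# Hypothetical helper: run a command, keep its full output in
# OUTFILE.full, and write the first N lines to OUTFILE.
# Usage: capture_top_lines N OUTFILE COMMAND [ARGS...]
capture_top_lines() {
    n=$1
    outfile=$2
    shift 2
    "$@" > "$outfile.full" 2>&1
    status=$?
    head -n "$n" "$outfile.full" > "$outfile"
    # Pass the command's own exit status back to the caller.
    return "$status"
}

# Example (assumes rubycount.stp is in the current directory):
# capture_top_lines 100 rubycount.txt stap -c "foreman-rake trends:clean" rubycount.stp
```

Keeping the full output alongside the excerpt means the run does not have to be repeated if more than the first 100 lines turn out to be needed.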

This way we can trace what happens inside Ruby. If you get this working, I
can create a simple SystemTap script that tracks object creation and garbage
collection, so that we can hopefully track down the memory leak.

I created a ticket; let's continue the discussion there:

http://projects.theforeman.org/issues/5568

LZ

–
Later,

Lukas “lzap” Zapletal
irc: lzap #theforeman