General question about RAM usage

Thank you! Except when I’m working on something specific, I do this maybe once or twice a year, so the knowledge gets completely flushed :slight_smile:

/K

There you go!
https://github.com/theforeman/foreman_maintain/pull/146

/K


Hey man!

I’m working on getting ‘statsd_exporter’ going with Prometheus and it seems I’m having issues getting anything into Prometheus. If I go to ‘http://#{address}:9102/metrics’ I can see all of the data:

# HELP fm_rails_activerecord_instances Number of instances of ActiveRecord models
# TYPE fm_rails_activerecord_instances counter
fm_rails_activerecord_instances{class="ActiveRecord__SessionStore__Session"} 13
fm_rails_activerecord_instances{class="AuthSource"} 6
fm_rails_activerecord_instances{class="Bookmark"} 0
...

But if I go to ‘http://#{address}:9090’ there’s nothing at all!? I mean, I have the Prometheus interface and all, but no data. There’s a blue button called ‘Add Graph’, but nothing happens when you click it. I can go to ‘Status -> Targets’ and see my ‘statsd_exporter’ instance running; it says it’s ‘UP’, last scraped seconds ago, but no graphs show up.
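One thing I can try is querying the Prometheus HTTP API directly to check whether the data is actually being scraped, something along these lines (assuming the default port):

curl 'http://localhost:9090/api/v1/query?query=fm_rails_activerecord_instances'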

Any ideas?

TIA
/K

Aaand for some reason it just started working… But now I’m at work and SSH-tunneling to my Puppet/Foreman server…

/K

So, now that I have Prometheus up and running, what do I look for?

/K

We are having a deep dive on this next week; I’d suggest hopping in if you can so I can answer your questions right away:

https://community.theforeman.org/t/foreman-deep-dives-telemetry-in-foreman-with-lukas-zapletal-wed-14-mar-2pm-gmt/8464/3

Anyway, since this thread is all about RAM usage, you are looking for allocation spikes. There are two interesting metrics:

  • fm_rails_activerecord_instances (per class)
  • fm_rails_ruby_gc_allocated_objects (per controller/action)

So construct Prometheus queries for those; both are counters, so you can simply average them over N-minute windows (5, 10 or 30 minutes depending on your load). Spikes in allocations mean that we are creating too many objects; this way you should be able to identify the ActiveRecord classes responsible, or at least the controller/action.
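For example, queries along these lines should give you a per-class and per-controller/action breakdown (the label names depend on how the statsd mapping is set up, so treat these as sketches):

topk(10, sum by (class) (rate(fm_rails_activerecord_instances[5m])))

topk(10, sum by (controller, action) (rate(fm_rails_ruby_gc_allocated_objects[5m])))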

It’s hard to tell which numbers are not normal; we are all just starting with telemetry, so we don’t yet have relevant data from the field. Any deviation you observe is interesting.

What’s also interesting is fm_rails_http_request_total_duration, which is a duration histogram; there you will find slow controllers and actions. If there is a memory leak, these actions are usually slow as well. Feel free to share graphs, numbers and tables from Prometheus once you find interesting bits.
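For example, an average duration per controller/action could be sketched like this (again assuming controller and action labels from the mapping):

topk(10, sum by (controller, action) (rate(fm_rails_http_request_total_duration_sum[5m])) / sum by (controller, action) (rate(fm_rails_http_request_total_duration_count[5m])))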

There you have something to compare with :smile:

The first part of the graph is without ‘passenger-recycler’ and after that with it “on”. A nice before and after :wink:

Ok, 300 creations of Nic::Base in 30 minutes is not an interesting spike at all, and 320k Ruby object creations in the hosts/facts action over 30 minutes is nothing really interesting either. We know that the fact importing code is our biggest bottleneck so far.

It is worth aggregating all allocations across all classes and controllers/actions and watching the rate of that: are there any spikes? It is useful to pair those with the memory consumption of the Rails processes as well, so you can compare the graphs when you see a sudden increase.

One important thing: the Ruby VM is known to never return memory back to the OS, so Ruby processes only grow, up to the point where they must be restarted. Foreman suffers from this as well, we all know it; Passenger is set to restart a worker process after 10k requests. If that’s not enough, use passenger-recycler or change that setting.
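If you want to change that limit, it is the PassengerMaxRequests directive in the Apache/Passenger configuration, something like (the value here is just an example):

PassengerMaxRequests 10000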

Seeing as I’m a total noob at this, could you tell me what I should type in as a graph command? That way you would be able to see what you are most interested in seeing, and then I’ll post a screenshot of it.

/K

To get the total number of AR allocations:

sum(rate(fm_rails_activerecord_instances[5m]))

the same for Ruby object allocations:

sum(rate(fm_rails_ruby_gc_allocated_objects[5m]))

You want to watch the request processing rate:

sum(rate(fm_rails_http_requests[5m])) 

Here is the average request processing time graph:

sum(fm_rails_http_request_total_duration_sum / fm_rails_http_request_total_duration_count)

Those are the basic ones; more at the deep dive.

You want to run node_exporter to also get Linux OS information, and you must also run a Passenger exporter to get data from Passenger (RSS and memory info).
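A minimal scrape configuration covering all three could look roughly like this in prometheus.yml (ports are the usual defaults and may differ in your setup):

scrape_configs:
  - job_name: 'foreman'
    static_configs:
      - targets: ['localhost:9102']   # statsd_exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']   # node_exporter
  - job_name: 'passenger'
    static_configs:
      - targets: ['localhost:9149']   # passenger exporter; the port depends on which exporter you use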

Then you can watch everything on one dashboard and make some decisions.

Thank you very much, I’ve started looking at them and can see spikes every 15 mins going up and down. I’ve also enabled all the debugging I can without Passenger refusing to start for unknown reasons, and I thought this might be interesting to you:

2018-03-08T13:35:00 08095d8e [app] [I] Completed 500 Internal Server Error in 359479ms (Views: 32.8ms | ActiveRecord: 2924.8ms)

There are lots of those:

# grep -E '[0-9]{6,}ms' log/production.log | wc -l
5958

Am I on to something there? Would it be interesting to you to see a full flow of that, like grep 08095d8e log/production.log?

/K

Should I start an issue instead of discussing this on the forum?

/K

No, this is fine, I was on vacation.

So you are doing your investigation right: do the grep again and find the transaction id, which is the second column of the log (e.g. 08095d8e in your example). Then grep for this id again and pastebin the whole transaction. I bet this is puppet or report import; that’s quite a slow operation with many bottlenecks.
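Something along these lines should pull out the ids of the slow transactions (a sketch; adjust the pattern to your log format):

grep -E '[0-9]{6,}ms' log/production.log | awk '{print $2}' | sort | uniq -c | sort -rn | head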

Oh, ok, I see.

Here you go:
https://pastebin.com/EHTYD4yF

/K

The super slow performance is perhaps due to swapping. You obviously hit some memory leak.

Can you restart the whole app and then monitor the memory of the Passenger processes, and see how it grows? Is it constant or sudden? Correlate that with requests to identify what is causing it, if possible.
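A crude way to do that is to log the Passenger memory stats periodically after the restart, for example (just a sketch):

while true; do
  date >> passenger-mem.log
  passenger-memory-stats >> passenger-mem.log   # prints per-process memory usage of the Passenger workers
  sleep 60
done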

Unfortunately, most of us are on PostgreSQL, and while fact upload is demanding and not super fast, it is possible to scale it and we do not see such big mem leaks.

lzap wrote:

> The super slow performance is perhaps due to swapping.

No, no swapping at all at that point; there was even unused RAM, as I recall.

> You obviously hit some memory leak.
> Can you restart the whole app and then monitor the memory of the Passenger
> processes, and see how it grows? Is it constant or sudden?

I would say that the increases in RAM usage come in bursts: it stays at a
level and holds there until the next increase, and so on.
Like a staircase with broad steps :slight_smile:

> Correlate that with requests to identify what is causing it, if possible.

I would say it comes in waves, possibly correlated with the host in
question, but I can try to shut the Puppet agent down on that server and
see if that changes anything.

> Unfortunately, most of us are on PostgreSQL, and while fact upload is
> demanding and not super fast, it is possible to scale it and we do
> not see such big mem leaks.

I’m open to trying things out; the Puppet/Foreman server is using a
local MariaDB, so it can be replaced if need be. But it would be nicer
to identify the underlying problem instead of just switching to
something else that merely hides it.

/K

I confirmed my suspicion that it is the two very similar servers that I use as hypervisors in my oVirt infra that are causing this!

I disabled ‘passenger-recycler’ and the Puppet agent on the two hypervisors, then started collecting the total amount of RAM that the Passenger processes are using, and let it run for about an hour:
[Screenshot: total Passenger RAM usage, 2018-03-21 10:18]

Then I restarted the Puppet agent on the two hypervisors and let it run for another hour:
[Screenshot: total Passenger RAM usage, 2018-03-21 11:03]

As you can see, the difference is staggering! Then there’s a dip in the RAM usage, which quickly goes back up again. I think that might be an idle Passenger process that was killed because of a setting I have in Apache: PassengerPoolIdleTime 30

So it’s definitely got something to do with the two hypervisor servers (game and hive). Where do we go from here?
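In case it helps, I’m collecting the Passenger RAM totals with something along these lines (roughly; the exact command may differ):

# sum the RSS of all Passenger processes, in MB
# process names vary by Passenger version (e.g. 'Passenger RackApp'), so the pattern may need adjusting
ps aux | grep '[P]assenger' | awk '{sum += $6} END {printf "%.0f MB\n", sum/1024}'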

/K