I have a Foreman/Puppet server that is constantly eating all of the RAM on the server, plus all of the swap. I've bumped it up as far as I can on that hardware; it now has 16GB of RAM and 32GB of swap. Here's the passenger-status output:
----------- General information -----------
Max pool size : 3
App groups : 1
Processes : 3
Requests in top-level queue : 0
----------- Application groups -----------
/usr/local/share/foreman:
App root: /usr/local/share/foreman
Requests in queue: 84
* PID: 9929 Sessions: 1 Processed: 2 Uptime: 10h 54m 24s
CPU: 5% Memory : 4401M Last used: 10h 24m
* PID: 11282 Sessions: 1 Processed: 464 Uptime: 10h 24m 35s
CPU: 100% Memory : 2848M Last used: 12m 35s
* PID: 11968 Sessions: 1 Processed: 486 Uptime: 10h 9m 28s
CPU: 96% Memory : 3109M Last used: 12m 29s
On this VM I am running Foreman (16.0), MariaDB, Puppetserver, Puppet client, Apache24 and Foreman Proxy.
Is this normal for a Foreman server with 16 clients? If so, how much RAM would it need to stop swapping?
It's not normal at all. Which plugins do you have installed?
Good! That is at least good to know :)
I'm not sure which plugins I have installed; it's basically a default installation. Is there a rake command or something that lists which plugins are installed? The only one I know I installed afterwards is the foreman_discovery plugin.
We rely on the open-source version of Passenger, which has poor options for recycling worker processes. The newest version of Foreman now ships a small passenger-recycle script which does the job (part of the rubygem-foreman_maintain package). Here is a slightly outdated version of the script (the final one is a little different around the configuration, I think):
Use this to recycle your processes until you find the root cause.
We don't yet provide any tools for diagnosis, so you need to figure out yourself which requests cause memory to grow. Ideally, pair your monitoring data with production.log and try to find those requests. Slow queries are also a good measure - these often allocate lots of Ruby objects - so try to find slow queries in the production log (pattern \d\d\d\dms, or more digits).
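For example, something like grep -E '[0-9]{4}ms' production.log (adjust the path to wherever your production.log lives on your install) should pull out every request or query that took a second or longer.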
I am working on Foreman telemetry, which will instantly give you more answers - which actions are slow, where the Ruby and ActiveRecord allocation peaks are, and other data. It's not merged yet, but backporting it into any version should be easy if you want to try it (most of the code is brand new).
Awe! Some! Thanks for pointing me towards the telemetry stuff - you answered my question before I even asked it.
And thanks for the cleaning thingie as well, a lot more elegant than the hourly Apache reloading I've resorted to, heh! This environment is my home lab that I've set up just for fun and giggles, but I'm trying to use it to its fullest; I've got it hooked up to my Samba AD with users, roles and stuff like that. It wouldn't surprise me if this turns out to be related somehow. I'll look into the telemetry stuff and report back!
So it took a while to get it running since I'm using FreeBSD, and I had to look around for some help. Fortunately, there are very helpful people on the FreeBSD mailing lists.
Once I eventually got it running to the point where it just said "No, this stuff only works on Linux!", I started hacking on it. It now runs and does what you'd expect. Since "private dirty RSS" isn't available, I used the process's RSS instead, which is far from perfect but better than nothing. It now kills the Passenger processes if they go out of control. I've attached a patch that applies cleanly against foreman_maintain-0.1.3.
Now I can start looking at the telemetry stuff you pointed me towards. Thank you!
“Sorry, new users can not upload attachments.” Well, screw it then…
I'm working on getting 'statsd_exporter' going with Prometheus, and it seems I'm having issues getting anything into Prometheus. If I go to 'http://#{address}:9102/metrics' I can see all of the data:
# HELP fm_rails_activerecord_instances Number of instances of ActiveRecord models
# TYPE fm_rails_activerecord_instances counter
fm_rails_activerecord_instances{class="ActiveRecord__SessionStore__Session"} 13
fm_rails_activerecord_instances{class="AuthSource"} 6
fm_rails_activerecord_instances{class="Bookmark"} 0
...
But if I go to 'http://#{address}:9090' there's nothing at all!? I mean, I have the Prometheus interface and all, but no data. There's a blue button called 'Add Graph', but nothing happens when you click it. I can go to 'Status -> Targets' and see my 'statsd_exporter' instance; it says it's 'UP' and was last scraped seconds ago, but no graphs show up.
So construct Prometheus queries for those; both are counters, so you can simply average them over N-minute windows (5, 10 or 30, depending on your load). Spikes in allocations mean that we are creating too many objects; this way you should be able to identify the offending ActiveRecord classes if that's the case, or at least the controller/action.
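For example (just a sketch - check the exact metric and label names against your own /metrics output), a per-class allocation rate averaged over a 5-minute window looks like this in PromQL:

# per-second rate of ActiveRecord instances created, per class, averaged over 5m
sum by (class) (rate(fm_rails_activerecord_instances[5m]))

# top 5 classes by allocation rate - handy for spotting the worst offenders
topk(5, sum by (class) (rate(fm_rails_activerecord_instances[5m])))

Paste each query separately into the expression field at 'http://#{address}:9090' and switch to the Graph tab to see it over time.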
It's hard to tell which numbers are abnormal; we are all just starting with telemetry, so we don't yet have relevant data from the field. Any deviation you observe is interesting.
What's also interesting is fm_rails_http_request_total_duration, which is a duration (histogram); there you will find slow controllers and actions. If there is a memory leak, those actions are usually slow as well. Feel free to share graphs, numbers and tables from Prometheus once you find interesting bits.
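Assuming statsd_exporter exposes that one as a Prometheus histogram (so you get _bucket, _sum and _count series) and that the samples carry controller/action labels - both assumptions, so verify against your /metrics output - queries along these lines will surface the slow actions:

# average request duration per controller/action over a 5-minute window
  sum by (controller, action) (rate(fm_rails_http_request_total_duration_sum[5m]))
/ sum by (controller, action) (rate(fm_rails_http_request_total_duration_count[5m]))

# 95th percentile request duration per controller/action
histogram_quantile(0.95,
  sum by (le, controller, action) (rate(fm_rails_http_request_total_duration_bucket[5m])))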