Foreman 2.0.0 - memory leak?

Hi everybody,
we’ve built Foreman 2.0.0 from source and run it with Puma behind an Apache reverse proxy on SLES12 SP5 (ruby2.5 and nodejs10/npm10).
All our servers have 4 CPUs and 10 GB of memory. Puma is configured to use 2 workers with 0,16 threads (min 0, max 16). “preload_app!” is activated.
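For reference, that boils down to roughly the following Puma configuration (a sketch; the file location depends on your setup, the values are the ones described above):

# Puma config sketch of the setup described above
workers 2          # 2 worker processes
threads 0, 16      # per-worker thread pool: min 0, max 16
preload_app!       # load the application once, then fork the workers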
A typical memory consumption graph looks like this: [graph: foreman_memory_consumption]

Is this normal behaviour?
If you need any further data, please let me know.

best regards

I started to look at the same thing but I haven’t got around to graphing it.

Can you take a look at which services are taking up the memory? You can use systemd for this by setting DefaultMemoryAccounting=yes in /etc/systemd/system.conf. Note that if it was previously set to false, a service needs to be restarted before its memory usage starts being measured.

Then you can get output like:

# systemctl show --property=Names,MemoryCurrent httpd postgresql foreman* dynflow*
MemoryCurrent=11235328
Names=httpd.service

MemoryCurrent=209670144
Names=postgresql.service

MemoryCurrent=253784064
Names=dynflow-sidekiq@orchestrator.service

MemoryCurrent=431247360
Names=foreman.service

MemoryCurrent=123834368
Names=foreman-proxy.service

MemoryCurrent=252919808
Names=dynflow-sidekiq@worker.service

OK, I had to set DefaultMemoryAccounting to yes and restart the services. Now we have to wait a little bit :/.

Since we built Foreman from source, we do not use dynflow-* yet. PostgreSQL and the foreman-proxy are installed on different servers. Foreman and Apache are the only services running on this server.

Memory consumption so far, still growing slowly:
[graph]

Systemd output:

systemctl show --property=Names,MemoryCurrent httpd foreman* dynflow*
MemoryCurrent=18722816
Names=httpd.service apache2.service

MemoryCurrent=5175435264
Names=foreman.service

Can you also share which plugins you have installed, if any?

@tbrisker @lzap any insights?

Currently we have not yet installed any plugins.

Hey,

can you pull the same graph but per service? I am particularly interested in foreman (puma) and dynflow (sidekiq).

Compared to what we had previously (Passenger), there is one big difference in your setup. We had a single thread per Passenger process; by default our installer configured 2-N concurrent Passenger workers depending on the number of cores, IIRC. That might have been as low as 4-8 in your case.

What I see now is that you have 2 processes with 32 threads in total. A thread is lightweight compared to a process; however, some memory resources are still significant - the stack, for example, and thread locals.

I would not expect it to be that big, though. Therefore, once you have those graphs, I’d be interested in correlating them with a setup of 2 workers and 8 threads per worker, just to compare against the current configuration.

Also, let’s look at dynflow too - it now runs as three separate processes and I think they can also contribute to the total memory consumption. Graph those as well.

He stated before that there is no dynflow:

You can also see that it’s just foreman.service taking up 5 GB of memory:

Then try configurations with 1 worker and 1, 8 and 16 threads to correlate how it’s growing. IMHO 16 is too many for Ruby with the GIL; I haven’t tested it, but I expect the sweet spot somewhere between 3 and 10 threads.

Just to be sure, I am curious and I’d like to rule out this new setup.

It might be a memory leak introduced in 2.0.0 too.

This is the default:

That in turn is copied from Puma’s defaults: https://github.com/puma/puma#thread-pool

I’d be curious anyway.

What I usually use is SystemTap or dtrace. We have some examples for the former: https://github.com/lzap/foreman-tracer

Or you can use any Ruby memory analyzer to see object allocations; the advantage of tracing is that it’s quite efficient and does not slow down your deployment too much.
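For example, a rough sketch with the memory_profiler gem (just one possible analyzer, not a specific recommendation; the exercised block is only a placeholder):

# sketch: report object allocations around a suspect code path
require 'memory_profiler'

report = MemoryProfiler.report do
  # replace this placeholder with the code you suspect, e.g. rendering one ENC
  10_000.times { 'foreman' * 10 }
end

# prints allocated/retained memory and objects grouped by gem, file and class
report.pretty_print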

OK, I started with the default configuration from the Foreman Puppet module (workers=0, threads_min=0, threads_max=16).

In parallel I am trying different Puma configurations (workers, threads_min, threads_max). It seems these have an effect on memory consumption.
My experience so far:
Node 1:
1.1 used the defaults, 1.2 workers: 2, threads_min: 8, threads_max: 16
[graph]

Node 2:
2.1 used the defaults, 2.2 workers: 4, threads_min: 4, threads_max: 8
[graph]

Node 3:
3.1 used the defaults, 3.2 workers: 4, threads_min: 8, threads_max: 16
[graph]

All setups have in common that threads_min is configured to != 0. I will keep watching the different setups.


The differences are insignificant.

Before we dig further, can you double check the list of plugins you have enabled?

Also can you run this simple tool to analyze your rails log and share which controllers and actions are the slowest and mostly used? https://github.com/pmoravec/rails-load-stats

No, we really have no plugins activated or installed.

Yes, I can do that. Our logging type was set to syslog + JSON, which caused many errors in the script. I changed the setting and now have to wait until there are some more log entries.

Uh, that’s surprisingly high memory usage then! Let’s wait and see what you hit the most.

Oh, there is SystemTap in SUSE; you could use our examples to track allocations.

So, here are some results so far from my “production-test-node”:
Current Puma configuration: workers: 4, threads: 8,8
[graph]

Results from analyze.sh:
there were 15107 requests taking 21401325 ms (i.e. 5.94 hours, i.e. 0.25 days) in summary

type                                            count   min     max     avg     mean    sum             percentage
--------------------------------------------------------------------------------------------------------------------
ConfigReportsController#create                  3213    58      1165    90      71      291307          1.36 %
HostsController#facts                           3309    23      17043   1811    1555    5995471         28.01 %
HostsController#index                           9       51194   105400  72737   64195   654640          3.06 %
PingController#ping                             1237    7       273     13      10      17166           0.08 %
PuppetclassesController#index                   450     173     432     210     188     94622           0.44 %
SmartClassParametersController#index            8       183     332     215     195     1727            0.01 %
SmartClassParametersController#show             9       188     641     298     211     2682            0.01 %
AuditsController#index                          1       2109    2109    2109    2109    2109            0.01 %
ConfigReportsController#index                   3       88      93      90      89      270             0.00 %
ConfigReportsController#show                    1       63      63      63      63      63              0.00 %
DashboardController#index                       6       40      126     60      46      362             0.00 %
DashboardController#show                        17      54      6088    504     88      8580            0.04 %
FactValuesController#index                      2       104     1425    764     104     1529            0.01 %
HostgroupsController#auto_complete_search       4       18      32      21      18      86              0.00 %
HostgroupsController#edit                       2       2923    3225    3074    2923    6148            0.03 %
HostgroupsController#index                      6       180     1040    517     193     3106            0.01 %
HostgroupsController#update                     3       307     406     350     339     1052            0.00 %
HostsController#auto_complete_search            13      10      28      12      11      166             0.00 %
HostsController#edit                            3       1996    2630    2258    2149    6775            0.03 %
HostsController#externalNodes                   6416    1172    28292   2227    1713    14292067        66.78 %
HostsController#index                           3       202     274     248     270     746             0.00 %
HostsController#multiple_puppetrun              1       94      94      94      94      94              0.00 %
HostsController#nics                            5       31      108     65      66      327             0.00 %
HostsController#overview                        2       54      96      75      54      150             0.00 %
HostsController#resources                       2       73      100     86      73      173             0.00 %
HostsController#runtime                         2       632     674     653     632     1306            0.01 %
HostsController#show                            2       307     533     420     307     840             0.00 %
HostsController#templates                       5       154     255     197     167     986             0.00 %
HostsController#update                          3       427     774     577     531     1732            0.01 %
NotificationRecipientsController#index          359     7       129     16      14      5969            0.03 %
PuppetclassesController#auto_complete_search    1       10      10      10      10      10              0.00 %
PuppetclassesController#index                   1       229     229     229     229     229             0.00 %
PuppetclassesController#update                  1       331     331     331     331     331             0.00 %
SubnetsController#index                         2       100     120     110     100     220             0.00 %
UsersController#login                           6       7       4140    1380    25      8284            0.04 %

concurrent requests:
- MAX: 14 when processing request with ID 3b7da576
- AVG: 2
- MEAN: 2
- 90%PERCENTILE: 4

First, performance. We have heard that 1.24/2.0 do not perform ENC well - that’s 2/3 of your time spent and it is relatively slow. I’ve made a patch for 2.1 which shows good improvements there on my instance; your mileage may vary depending on what you have in your database, but try to apply it and see if it helps:

Now, to the memory. Our experience is that a worker process with some plugins installed, including Katello, can max out at 2 GB on heavy deployments. This is where you should target an auto-restart. This brings up a question: @ekohl, do we have an automatic worker restart bound to some memory limit in the new Puma deployment? This was one of the Passenger limitations - the feature was only available in the paid version of the product. We should definitely deploy something like this now that we have a report that memory can grow this fast.
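If nothing like that is wired in yet, one option (just a sketch from me, not something the installer ships today, and all the values are illustrative) would be the puma_worker_killer gem, configured from the Puma config file:

# sketch only: recycle the largest worker once memory crosses a threshold
before_fork do
  require 'puma_worker_killer'

  PumaWorkerKiller.config do |config|
    config.ram           = 8192   # total RAM budget for Puma in MB (illustrative)
    config.frequency     = 60     # how often to check, in seconds
    config.percent_usage = 0.80   # act when workers exceed this share of the budget
  end
  PumaWorkerKiller.start
end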

The next step would be finding those leaks, but before you do that, do you have any reason not to use Ruby 2.7? That version has a new feature - GC compaction, or defragmentation. The Ruby VM is known for fragmenting memory, creating “gaps” on the heap which are never returned to the OS. The latest stable version should be better. Try it - the codebase should be 2.7 compatible if I am not mistaken. Compare the graph with what you have here (4 workers, 8-8 threads).
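To give an idea, compaction in 2.7 is triggered manually; a minimal sketch (where you call it - an initializer, a Puma hook, a periodic job - is up to you):

# Ruby 2.7+ only: GC.compact moves live objects together to reduce heap fragmentation
if GC.respond_to?(:compact)
  GC.compact
end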

OK, we can try this patch :slight_smile: but it may take some time to apply.

Next steps are finding what are those leaks, but before you do that, do you have any reason why not to use Ruby 2.7?

Because our company is using SLES :face_vomiting: …all jokes aside, I followed your “Install from Source” instructions and they say “Ruby 2.5 or newer”, so I started with ruby2.5 :slight_smile: I can try to build Foreman with ruby2.7, but this will take some time and it’s not possible to do that on our production system.