I’m sure this is the result of some forgotten tweak made in an attempt at tuning, but I’m at a loss. I’m running Foreman 3.1 on a CentOS 8 VM. The VM has 8 cores and 48 GB of RAM. Within an hour of booting, all of the RAM is consumed and the system starts to swap. The Puma cluster workers are each using huge amounts of memory. These are samples from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32738 foreman 20 0 11.1g 10.1g 9816 S 99.3 21.4 38:08.33 rails
26020 foreman 20 0 13.1g 10.7g 7428 S 99.0 22.6 74:22.56 rails
32666 foreman 20 0 10.0g 9.1g 9684 S 99.0 19.2 40:18.07 rails
29342 foreman 20 0 11.5g 10.1g 8108 S 28.9 21.3 57:11.50 rails
# free -g
total used free shared buff/cache available
Mem: 47 46 0 0 0 0
Swap: 31 8 23
As I said, I’m sure this is something I did (and forgot I did). Can someone point me in the right direction as to what to check?
We had some memory consumption regressions in the past, but everything we had seen was fixed in 3.1. I am afraid we will need to investigate this further; there is not much to tell from top output alone.
What is your workload? Which controllers are being hit the most? Any plugins enabled? Are you running Katello as well?
From the output it looks like all of the processes are quite busy. What are they doing? What is hammering them?
I am running Katello as well. There are about 300 clients. According to the documentation, 8 cores and 48 GB should be overkill for this setup. The following plugins are installed:
I included the lines from top just to show the VIRT/RES of each rails (puma worker) process. After another hour, swap is also almost exhausted, and the VIRT/RES of each rails process has increased by 1.5-2x.
Can you tell from production.log which controllers are being used? What is the workload? If you are running a monitoring system, can you ideally tell when a process saw a spike, so we can correlate it with entries in production.log?
We really need to narrow this down a bit. Foreman, like many other Ruby apps, does have memory leaks, and we have been pretty successful at finding and fixing them. But there must be something special about your setup; 3.1 is going through heavy testing at the moment and we have not seen problems this big so far.
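To get a rough picture of which controllers are being hit, you can count the standard Rails “Processing by” request lines in production.log, something along these lines (just a sketch; the log path assumes a default install):
# grep -oE 'Processing by [A-Za-z:]+Controller#[a-z_0-9]+' /var/log/foreman/production.log | sort | uniq -c | sort -rn | head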
A cheap trick to find out which controller/action is slow (when there is a problem, e.g. accidentally loading all records from the database, the offender is typically also very slow) is to restart, tail production.log and grep for records with NNNNNms (five or more digits of milliseconds). Those are typically the offenders consuming all the memory.
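For example, something like this should surface them, assuming the default log location (the “Completed ... in NNNNNms” lines are the standard Rails request summaries):
# grep -E 'Completed .* in [0-9]{5,}ms' /var/log/foreman/production.log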
I looked at the 100 log entries that took the most time; virtually all have something to do with network interfaces, with the most common message being “Identifier has already been taken.”
I took the 100 lines with the highest ms, pulled out the request identifier (e.g. 45ccd5b1 from [W|app|45ccd5b1]), then grepped for those, roughly as in the sketch below; the (truncated) output for the request with the highest ms follows. In all, 2957 lines contain “45ccd5b1”.
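Roughly what I ran, give or take exact flags and paths:
# grep -E 'Completed .* in [0-9]{5,}ms' /var/log/foreman/production.log | less
# grep -c 45ccd5b1 /var/log/foreman/production.log
2957
# grep 45ccd5b1 /var/log/foreman/production.log | less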
… skip 2900+ similar “Nic::Managed (XXXX) update event on [identifier|attached_to]” lines …
2022-01-25T05:14:17 [I|aud|45ccd5b1] Nic::Managed (3841) update event on attached_to 10f46009b3, eth4
2022-01-25T05:14:17 [I|aud|45ccd5b1] Nic::Managed (3841) update event on identifier 10f46009b3.28, eth4.28
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Saving vif10.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:20 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Saving vif12.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:20 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:21 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:21 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:21 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:21 [W|app|45ccd5b1] Saving vif14.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:21 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Saving vif3.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:22 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Saving vif4.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:22 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Saving vif7.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:22 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:55 [I|app|45ccd5b1] Completed 201 Created in 4909255ms (Views: 102.3ms | ActiveRecord: 352203.4ms | Allocations: 553340631)
This is a known problem: when you have hosts with many interfaces, the facts coming in from RHSM/Puppet/Ansible can actually kill your instance. These are typically hypervisors or container hosts with thousands of NICs.
While we would like to rearchitect how we import facts, the solution we have today is to filter unwanted facts:
Follow the article, set up your filter to drop the unwanted facts, restart, and you should be fine.
For the record, the filtering mechanism only supports * at the moment, but in 3.2 you will be able to use ? as well (similar to shell globs). Internally, Foreman converts this into a regexp.
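If you prefer the CLI over the web UI, something along these lines should do it with hammer. Treat this as a sketch: ignored_interface_identifiers and excluded_facts are, to my best knowledge, the internal names behind “Ignore interfaces with matching identifier” and “Exclude pattern for facts stored in Foreman”, the values shown are only examples (keep whatever your instance already lists and append vif*), and the exact array value format may differ between versions, so double-check under Administer > Settings or with hammer settings list first:
# hammer settings set --name ignored_interface_identifiers --value '[lo, usb*, vnet*, macvtap*, vif*]'
# hammer settings set --name excluded_facts --value '[lo, usb*, vnet*, macvtap*, vif*]'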
Thanks so much for your help! That certainly did something. I added vif* (the Oracle VM hypervisor’s virtual interfaces) to the “Ignore interfaces with matching identifier” and “Exclude pattern for facts stored in Foreman” arrays.
No problem, let’s add this to the list of defaults:
In the future, we would like to store facts more efficiently. I also think that the default behavior of automatically adding all NICs to the database is bad design; this should have required human interaction (a single click or command to add one or more NICs from facts).