I’m sure this is the result of some forgotten tweak made in an attempt at tuning, but I’m at a loss. I’m running Foreman 3.1 on a CentOS 8 VM. The VM has 8 cores and 48 GB of RAM. Within an hour of booting, all of the RAM is consumed and the system starts to swap. The Puma cluster workers are each using huge amounts of memory. These are samples from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32738 foreman 20 0 11.1g 10.1g 9816 S 99.3 21.4 38:08.33 rails
26020 foreman 20 0 13.1g 10.7g 7428 S 99.0 22.6 74:22.56 rails
32666 foreman 20 0 10.0g 9.1g 9684 S 99.0 19.2 40:18.07 rails
29342 foreman 20 0 11.5g 10.1g 8108 S 28.9 21.3 57:11.50 rails
# free -g
total used free shared buff/cache available
Mem: 47 46 0 0 0 0
Swap: 31 8 23
As I said, I’m sure this is something I did (and forgot I did). Can someone point me in the right direction as to what to check?
We had some memory consumption regressions in the past, but everything we had seen was fixed in 3.1. I am afraid we will need to investigate this further; there is not much to tell from top output alone.
What is your workload? Which controllers are being hit the most? Any plugins enabled? Are you running Katello as well?
From the output it looks like all of the processes are quite busy. What are they doing? What is hammering them?
I am running Katello as well. There are about 300 clients. According to the documentation, 8 cores and 48 GB should be overkill for this setup. The following plugins are installed:
I included the lines from top just to show the VIRT/RES of each rails (puma worker) process. After another hour, swap is also almost exhausted, and the VIRT/RES of each rails process has increased by 1.5-2x.
Can you tell from production.log which controllers are being used? What is the workload? If you are running a monitoring system, can you ideally tell when a process saw a spike, so we can correlate it with entries in production.log?
We really need to narrow this down a bit. Foreman, like many other Ruby apps, does have memory leaks, and we have been pretty successful at finding and fixing them. But there must be something special about your setup; 3.1 is going through heavy testing at the moment and we have not seen problems this big so far.
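To get a rough picture of which controllers are being hit, you can count the standard Rails “Processing by” request lines in production.log, something along these lines (just a sketch; the log path assumes a default install):
# grep -oE 'Processing by [A-Za-z:]+Controller#[a-z_0-9]+' /var/log/foreman/production.log | sort | uniq -c | sort -rn | head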
A cheap trick to find out which controller/action is slow (when there is a problem, e.g. accidentally loading all records from the database, the offender is typically also very slow) is to restart, tail production.log and grep for records with NNNNNms (five or more digits of milliseconds). Those are typically the offenders consuming all the memory.
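For example, something like this should surface them, assuming the default log location (the “Completed ... in NNNNNms” lines are the standard Rails request summaries):
# grep -E 'Completed .* in [0-9]{5,}ms' /var/log/foreman/production.log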
I looked at the 100 log entries that took the most time; virtually all have something to do with network interfaces, with the most common message being “Identifier has already been taken.”
I took the 100 lines with the highest ms, pulled out the request identifier (e.g. 45ccd5b1 from [W|app|45ccd5b1]), then grepped for those, roughly as in the sketch below; the (truncated) output for the request with the highest ms follows. In all, 2957 lines contain “45ccd5b1”.
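Roughly what I ran, give or take exact flags and paths:
# grep -E 'Completed .* in [0-9]{5,}ms' /var/log/foreman/production.log | less
# grep -c 45ccd5b1 /var/log/foreman/production.log
2957
# grep 45ccd5b1 /var/log/foreman/production.log | less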
… skip 2900+ similar “Nic::Managed (XXXX) update event on [identifier|attached_to]” lines …
2022-01-25T05:14:17 [I|aud|45ccd5b1] Nic::Managed (3841) update event on attached_to 10f46009b3, eth4
2022-01-25T05:14:17 [I|aud|45ccd5b1] Nic::Managed (3841) update event on identifier 10f46009b3.28, eth4.28
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Saving vif10.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:20 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:20 [W|app|45ccd5b1] Saving vif12.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:20 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:21 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:21 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:21 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:21 [W|app|45ccd5b1] Saving vif14.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:21 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Saving vif3.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:22 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Saving vif4.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:22 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Not queueing Nic::Managed: ["Identifier has already been taken"]
2022-01-25T05:14:22 [W|app|45ccd5b1] Saving vif7.0 NIC for host amrabqvh02.albmolecular.com failed, skipping because:
2022-01-25T05:14:22 [W|app|45ccd5b1] Identifier has already been taken
2022-01-25T05:14:55 [I|app|45ccd5b1] Completed 201 Created in 4909255ms (Views: 102.3ms | ActiveRecord: 352203.4ms | Allocations: 553340631)
This is a known problem: when you have hosts with many interfaces, the facts coming in from RHSM/Puppet/Ansible can actually kill your instance. These are typically hypervisors or container hosts with thousands of NICs.
While we would like to rearchitect how we import facts, the solution we have today is to filter unwanted facts:
Follow the article, set up your filter to drop the unwanted facts, restart, and you should be fine.
For the record, the filtering mechanism only supports * at the moment, but in 3.2 you will be able to use ? as well (similar to shell globs). Internally, Foreman converts this into a regexp.
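If you prefer the CLI over the web UI, something along these lines should do it with hammer. Treat this as a sketch: ignored_interface_identifiers and excluded_facts are, to my best knowledge, the internal names behind “Ignore interfaces with matching identifier” and “Exclude pattern for facts stored in Foreman”, the values shown are only examples (keep whatever your instance already lists and append vif*), and the exact array value format may differ between versions, so double-check under Administer > Settings or with hammer settings list first:
# hammer settings set --name ignored_interface_identifiers --value '[lo, usb*, vnet*, macvtap*, vif*]'
# hammer settings set --name excluded_facts --value '[lo, usb*, vnet*, macvtap*, vif*]'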
Thanks so much for your help! That certainly did something. I added vif* (the Oracle VM hypervisor’s virtual interfaces) to the “Ignore interfaces with matching identifier” and “Exclude pattern for facts stored in Foreman” arrays.
No problem, let’s add this to the list of defaults:
In the future, we would like to store facts more efficiently. I also think that the default behavior of automatically adding all NICs to the database is bad design; this should have required human interaction (a single click or command to add one or more NICs from facts).