Goferd memory leak

Problem:
We are currently experiencing what looks like a memory leak on most of our non-production hosts.
Our Katello stack got updated last weekend from 3.4 to 3.9, but clients were still on Katello agent 3.4.
Since the symptoms looked exactly the same, we assumed we were hit by https://access.redhat.com/solutions/3198642 , but an update of the non-production hosts to Foreman Client 1.20 did not change the behaviour.
Our environment looks like this:
We have 3 lifecycle environments (dev, test and production), each with its own content proxy. Dev and test are currently on the same CV version; production is a little behind. All systems are RHEL7.
If it helps, here is a list of packages updatable in production compared to the related dev host:

 bind-libs.x86_64                    32:9.9.4-74.el7_6.1       rhel-7-server-rpms
 bind-libs-lite.x86_64               32:9.9.4-74.el7_6.1       rhel-7-server-rpms
 bind-license.noarch                 32:9.9.4-74.el7_6.1       rhel-7-server-rpms
 bind-utils.x86_64                   32:9.9.4-74.el7_6.1       rhel-7-server-rpms
 device-mapper.x86_64                7:1.02.149-10.el7_6.8     rhel-7-server-rpms
 device-mapper-event.x86_64          7:1.02.149-10.el7_6.8     rhel-7-server-rpms
 device-mapper-event-libs.x86_64     7:1.02.149-10.el7_6.8     rhel-7-server-rpms
 device-mapper-libs.x86_64           7:1.02.149-10.el7_6.8     rhel-7-server-rpms
 glib2.x86_64                        2.56.1-4.el7_6            rhel-7-server-rpms
 glibc.i686                          2.17-260.el7_6.6          rhel-7-server-rpms
 glibc.x86_64                        2.17-260.el7_6.6          rhel-7-server-rpms
 glibc-common.x86_64                 2.17-260.el7_6.6          rhel-7-server-rpms
 java-1.8.0-openjdk.x86_64           1:1.8.0.222.b10-0.el7_6   rhel-7-server-rpms
 java-1.8.0-openjdk-headless.x86_64  1:1.8.0.222.b10-0.el7_6   rhel-7-server-rpms
 kernel.x86_64                       3.10.0-957.21.3.el7       rhel-7-server-rpms
 kernel-tools.x86_64                 3.10.0-957.21.3.el7       rhel-7-server-rpms
 kernel-tools-libs.x86_64            3.10.0-957.21.3.el7       rhel-7-server-rpms
 libgudev1.x86_64                    219-62.el7_6.7            rhel-7-server-rpms
 libteam.x86_64                      1.27-6.el7_6.1            rhel-7-server-rpms
 lvm2.x86_64                         7:2.02.180-10.el7_6.8     rhel-7-server-rpms
 lvm2-libs.x86_64                    7:2.02.180-10.el7_6.8     rhel-7-server-rpms
 microcode_ctl.x86_64                2:2.1-47.5.el7_6          rhel-7-server-rpms
 perf.x86_64                         3.10.0-957.21.3.el7       rhel-7-server-rpms
 python.x86_64                       2.7.5-80.el7_6            rhel-7-server-rpms
 python-libs.x86_64                  2.7.5-80.el7_6            rhel-7-server-rpms
 python-perf.x86_64                  3.10.0-957.21.3.el7       rhel-7-server-rpms
 python2-qpid-proton.x86_64          0.28.0-1.el7              EPEL7
 qpid-proton-c.x86_64                0.28.0-1.el7              EPEL7
 systemd.x86_64                      219-62.el7_6.7            rhel-7-server-rpms
 systemd-libs.x86_64                 219-62.el7_6.7            rhel-7-server-rpms
 systemd-sysv.x86_64                 219-62.el7_6.7            rhel-7-server-rpms
 teamd.x86_64                        1.27-6.el7_6.1            rhel-7-server-rpms
 tzdata-java.noarch                  2019b-1.el7               rhel-7-server-rpms

I could not find any differences in configuration between the smart proxies for dev and test compared to the one for production, so I assume that is probably not the cause. Also, the memory leak seems to only occur overnight, when the Katello offline backup is performed.
The leak grows to the point where the OOM killer kicks in and kills the actual application.
On an affected system, top looked like this this morning, after goferd had been restarted yesterday:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                                                                                   
 71499 root      20   0 4764120   2.5g    964 S   0.0 68.3   3:28.78 python

After a systemctl restart goferd, memory usage goes back to sane values and stays there until nightfall:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
[....]
43659 root      20   0  956124  37228   8908 S   0.0  1.0   0:18.21 python
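In case it helps with pinning down the timing, the growth can be tracked with something as simple as the loop below (a rough sketch; the interval and log path are arbitrary, and the PID has to be looked up again after a goferd restart):

 # Rough sketch: record the goferd resident set size (KiB) every 5 minutes
 # so the growth can be correlated with the backup window.
 # Log path and interval are just examples; re-run after restarting goferd.
 PID=$(systemctl show goferd -p MainPID | cut -d= -f2)
 while sleep 300; do
     echo "$(date '+%F %T') $(ps -o rss= -p "$PID")" >> /var/tmp/goferd-rss.log
 done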

Expected outcome:
Goferd not consuming absurd amounts of memory.

Foreman and Proxy versions:
1.20.3
Foreman Client 1.20.3

Foreman and Proxy plugin versions:
rubygem-algebrick.noarch 0.7.3-4.el7
rubygem-dynflow.noarch 0.8.34-2.fm1_17.el7
rubygem-faraday.noarch 0.9.1-6.el7
rubygem-faraday_middleware.noarch 0.10.0-2.el7
rubygem-sequel.noarch 4.20.0-6.el7
rubygem-smart_proxy_dhcp_infoblox.noarch 0.0.13-1.fm1_18.el7
rubygem-smart_proxy_dynflow.noarch 0.2.1-1.el7
rubygem-smart_proxy_remote_execution_ssh.noarch 0.2.0-2.el7
tfm-rubygem-angular-rails-templates.noarch 1:1.0.2-4.el7
tfm-rubygem-bastion.noarch 6.1.16-1.fm1_20.el7
tfm-rubygem-deface.noarch 1.3.2-1.el7
tfm-rubygem-diffy.noarch 3.0.1-5.el7
tfm-rubygem-docker-api.noarch 1.28.0-4.el7
tfm-rubygem-foreman-tasks.noarch 0.14.3-1.fm1_20.el7
tfm-rubygem-foreman-tasks-core.noarch 0.2.5-2.fm1_20.el7
tfm-rubygem-foreman_docker.noarch 4.1.0-2.fm1_20.el7
tfm-rubygem-foreman_hooks.noarch 0.3.15-1.fm1_20.el7
tfm-rubygem-foreman_remote_execution.noarch 1.6.7-1.fm1_20.el7
tfm-rubygem-foreman_remote_execution_core.noarch 1.1.4-1.el7
tfm-rubygem-foreman_snapshot_management.noarch 1.5.1-1.fm1_20.el7
tfm-rubygem-foreman_templates.noarch 6.0.3-2.fm1_20.el7
tfm-rubygem-git.noarch 1.2.5-9.el7
tfm-rubygem-hammer_cli_foreman_bootdisk.noarch 0.1.3-7.el7
tfm-rubygem-hammer_cli_foreman_docker.noarch 0.0.4-4.el7
tfm-rubygem-hammer_cli_foreman_tasks.noarch 0.0.13-1.fm1_20.el7
tfm-rubygem-parse-cron.noarch 0.1.4-4.fm1_20.el7
tfm-rubygem-polyglot.noarch 0.3.5-2.el7
tfm-rubygem-rainbow.noarch 2.2.1-3.el7
tfm-rubygem-smart_proxy_dynflow_core.noarch 0.2.1-1.fm1_20.el7
tfm-rubygem-wicked.noarch 1.3.3-1.el7

Other relevant data:
During the backup downtime of the main Katello instance (the proxies should still be up during that window), I see these messages repeated every 10 seconds on all hosts, regardless of environment:

 Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://contentproxy.example.com:5647
 Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://contentproxy.example.com:5647
 Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://contentproxy.example.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
 Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.proton.connection:92 - opened: proton+amqps://contentproxy.example.com:5647
 Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.connect:30 - connected: proton+amqps://contentproxy.example.com:5647
 Jul 31 00:01:00 host.example.com goferd[79015]: [ERROR][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.proton.reliability:47 - receiver 1c85acb1-93d4-4d3c-bac2-0d145fe10a64 from pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf closed due to: Condition('qd:no-route-to-dest', 'No route to the   destination node')
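For a sense of scale, the reconnect attempts during the backup window can be counted from the journal, roughly like this (the timestamps are just examples for last night's window):

 # Count goferd reconnect attempts during the backup window (example times).
 journalctl -u goferd --since "2019-07-31 00:00" --until "2019-07-31 04:00" \
     | grep -c 'gofer.messaging.adapter.connect:28 - connecting'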

If someone could tell me what goferd is actually used for, that would help me a lot, too. If we do not find a solution for this soon, I am thinking about disabling goferd on all our hosts.

@areyus: Goferd is the service that reports the state of the clients, typically information like installed packages, back to the Satellite server.
Based on what I have been told by experienced folks on the team, you can safely turn off goferd for clients that have katello-host-tools installed, because host-tools will take care of the package status reporting.
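On a single client that roughly boils down to the following (just a sketch; it checks for katello-host-tools first so package reporting does not silently stop):

 # Only stop/disable goferd where katello-host-tools is present,
 # otherwise the host would stop reporting its package status entirely.
 if rpm -q katello-host-tools >/dev/null 2>&1; then
     systemctl stop goferd
     systemctl disable goferd
 fi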

We recently had the same problem and decided the best route was to go goferless (available since Katello 3.4).

https://community.theforeman.org/t/katello-3-4-client-release-introducing-agent-gofer-less-host-tools/7096
https://access.redhat.com/articles/3154811
https://access.redhat.com/articles/3875321

It saves a LOT of hits on your API (we have 10k content hosts), and we also got back a lot of space that was wasted by the qpid journals.

We use Chef instead of Puppet for most of our system management, and Katello/Foreman is mostly just Pulp with a nice lifecycle GUI plus a PXE build system for us, so we didn’t lose any functionality by going goferless.

There was an update of goferd in the latest client repository, if you want to check whether it is still leaking.
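If you want to try that, a plain update of the agent packages from the client repository on a test host should be enough; the package names below are from memory, so double-check them against your repo:

 # Update the agent bits from the client repository on a test host
 # (package names from memory - verify they match your repository).
 yum update gofer python-gofer\* katello-agent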

But as the others said, goferd is now optional and you will only lose the ability for clients to receive instructions pushed from Katello. I am not sure whether 3.9 already provides the option to use remote execution instead to push instructions from Katello to the agents, or whether that was introduced in a later version, but it would completely remove the need for goferd.

Thanks for the heads-up everyone. Since we are already using rex for every kind of remote execution task I know of, it is probably best to just get rid of it. I will check for the katello-host-tools package being installed across our systems and then try to set up a plan forward along the lines of the document @indygwyn outlined, if no major roadblocks pop up.
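For a first pass, something as simple as the loop below will probably do (the host list file is just a made-up example, and in practice I might run the check through rex instead):

 # Quick inventory: which hosts are missing katello-host-tools?
 # /tmp/hosts.txt is a made-up example file with one hostname per line;
 # hosts that are unreachable over ssh will also show up as "missing".
 while read -r host; do
     ssh -o BatchMode=yes "$host" 'rpm -q katello-host-tools' >/dev/null 2>&1 \
         || echo "$host: katello-host-tools missing"
 done < /tmp/hosts.txt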

If anyone else has tips on what to look out for (like potential dragons I might encounter :wink: ) I would be happy to hear them.
