Problem:
We are currently experiencing what looks like a memory leak on most of our non-production hosts.
Our Katello stack got updated last weekend from 3.4 to 3.9, but clients were still on Katello agent 3.4.
Since the symptoms looked exactly the same, we assumed we were hit by https://access.redhat.com/solutions/3198642 , but updating the non-production hosts to Foreman Client 1.20 did not change the behaviour.
Our environment looks like this:
We have three lifecycle environments (dev, test and production), each with its own content proxy. Dev and test are currently on the same CV version; production is a little behind. All systems are RHEL7.
If it helps, here is the list of packages updatable in production compared to the corresponding dev host:
bind-libs.x86_64 32:9.9.4-74.el7_6.1 rhel-7-server-rpms
bind-libs-lite.x86_64 32:9.9.4-74.el7_6.1 rhel-7-server-rpms
bind-license.noarch 32:9.9.4-74.el7_6.1 rhel-7-server-rpms
bind-utils.x86_64 32:9.9.4-74.el7_6.1 rhel-7-server-rpms
device-mapper.x86_64 7:1.02.149-10.el7_6.8 rhel-7-server-rpms
device-mapper-event.x86_64 7:1.02.149-10.el7_6.8 rhel-7-server-rpms
device-mapper-event-libs.x86_64 7:1.02.149-10.el7_6.8 rhel-7-server-rpms
device-mapper-libs.x86_64 7:1.02.149-10.el7_6.8 rhel-7-server-rpms
glib2.x86_64 2.56.1-4.el7_6 rhel-7-server-rpms
glibc.i686 2.17-260.el7_6.6 rhel-7-server-rpms
glibc.x86_64 2.17-260.el7_6.6 rhel-7-server-rpms
glibc-common.x86_64 2.17-260.el7_6.6 rhel-7-server-rpms
java-1.8.0-openjdk.x86_64 1:1.8.0.222.b10-0.el7_6 rhel-7-server-rpms
java-1.8.0-openjdk-headless.x86_64 1:1.8.0.222.b10-0.el7_6 rhel-7-server-rpms
kernel.x86_64 3.10.0-957.21.3.el7 rhel-7-server-rpms
kernel-tools.x86_64 3.10.0-957.21.3.el7 rhel-7-server-rpms
kernel-tools-libs.x86_64 3.10.0-957.21.3.el7 rhel-7-server-rpms
libgudev1.x86_64 219-62.el7_6.7 rhel-7-server-rpms
libteam.x86_64 1.27-6.el7_6.1 rhel-7-server-rpms
lvm2.x86_64 7:2.02.180-10.el7_6.8 rhel-7-server-rpms
lvm2-libs.x86_64 7:2.02.180-10.el7_6.8 rhel-7-server-rpms
microcode_ctl.x86_64 2:2.1-47.5.el7_6 rhel-7-server-rpms
perf.x86_64 3.10.0-957.21.3.el7 rhel-7-server-rpms
python.x86_64 2.7.5-80.el7_6 rhel-7-server-rpms
python-libs.x86_64 2.7.5-80.el7_6 rhel-7-server-rpms
python-perf.x86_64 3.10.0-957.21.3.el7 rhel-7-server-rpms
python2-qpid-proton.x86_64 0.28.0-1.el7 EPEL7
qpid-proton-c.x86_64 0.28.0-1.el7 EPEL7
systemd.x86_64 219-62.el7_6.7 rhel-7-server-rpms
systemd-libs.x86_64 219-62.el7_6.7 rhel-7-server-rpms
systemd-sysv.x86_64 219-62.el7_6.7 rhel-7-server-rpms
teamd.x86_64 1.27-6.el7_6.1 rhel-7-server-rpms
tzdata-java.noarch 2019b-1.el7 rhel-7-server-rpms
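For reference, a plain yum check-update diff between the two hosts should reproduce such a comparison; a rough sketch with placeholder host names:
# Dump pending updates on both hosts ('yum -q check-update' prints name.arch,
# version and repo) and keep what is pending only in production.
ssh prodhost.example.com 'yum -q check-update' | sort > prod-pending.txt
ssh devhost.example.com 'yum -q check-update' | sort > dev-pending.txt
comm -23 prod-pending.txt dev-pending.txt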
I could not find any configuration differences between the smart proxies for dev and test and the one for production, so I assume that is probably not the cause. Also, the memory leak seems to occur only over night, while the Katello offline backup is performed.
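To pin the growth to the backup window, something like this quick-and-dirty loop should do for logging goferd's RSS over night (interval and log path are arbitrary):
# Quick sketch: log goferd's resident set size every 5 minutes.
# MainPID is 0 while the service is down, hence the n/a fallback.
while true; do
    pid=$(systemctl show -p MainPID goferd | cut -d= -f2)
    rss=$(ps -o rss= -p "$pid" 2>/dev/null)
    echo "$(date -Is) pid=$pid rss_kb=${rss:-n/a}" >> /var/tmp/goferd-rss.log
    sleep 300
done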
The leak eventually grows to the point where the OOM killer steps in and kills the actual application.
On an affected system, this is what top showed this morning, after goferd had been restarted yesterday:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
71499 root 20 0 4764120 2.5g 964 S 0.0 68.3 3:28.78 python
After systemctl restart goferd it goes back to sane values and stays that way until the following night:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[....]
43659 root 20 0 956124 37228 8908 S 0.0 1.0 0:18.21 python
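As a stopgap until the leak itself is understood, a systemd drop-in could at least cap goferd so that only goferd gets killed and restarted instead of random processes on the host; a sketch with an arbitrary 512M limit (RHEL7 ships systemd 219, hence MemoryLimit= rather than the newer MemoryMax=):
# Stopgap only, does not fix the leak: cap goferd and let systemd restart it
# when the kernel kills it for exceeding the limit.
mkdir -p /etc/systemd/system/goferd.service.d
cat > /etc/systemd/system/goferd.service.d/memory.conf <<'EOF'
[Service]
MemoryLimit=512M
Restart=on-failure
RestartSec=30
EOF
systemctl daemon-reload
systemctl restart goferd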
Expected outcome:
Goferd not consuming absurd amounts of memory.
Foreman and Proxy versions:
1.20.3
Foreman Client 1.20.3
Foreman and Proxy plugin versions:
rubygem-algebrick.noarch 0.7.3-4.el7
rubygem-dynflow.noarch 0.8.34-2.fm1_17.el7
rubygem-faraday.noarch 0.9.1-6.el7
rubygem-faraday_middleware.noarch 0.10.0-2.el7
rubygem-sequel.noarch 4.20.0-6.el7
rubygem-smart_proxy_dhcp_infoblox.noarch 0.0.13-1.fm1_18.el7
rubygem-smart_proxy_dynflow.noarch 0.2.1-1.el7
rubygem-smart_proxy_remote_execution_ssh.noarch 0.2.0-2.el7
tfm-rubygem-angular-rails-templates.noarch 1:1.0.2-4.el7
tfm-rubygem-bastion.noarch 6.1.16-1.fm1_20.el7
tfm-rubygem-deface.noarch 1.3.2-1.el7
tfm-rubygem-diffy.noarch 3.0.1-5.el7
tfm-rubygem-docker-api.noarch 1.28.0-4.el7
tfm-rubygem-foreman-tasks.noarch 0.14.3-1.fm1_20.el7
tfm-rubygem-foreman-tasks-core.noarch 0.2.5-2.fm1_20.el7
tfm-rubygem-foreman_docker.noarch 4.1.0-2.fm1_20.el7
tfm-rubygem-foreman_hooks.noarch 0.3.15-1.fm1_20.el7
tfm-rubygem-foreman_remote_execution.noarch 1.6.7-1.fm1_20.el7
tfm-rubygem-foreman_remote_execution_core.noarch 1.1.4-1.el7
tfm-rubygem-foreman_snapshot_management.noarch 1.5.1-1.fm1_20.el7
tfm-rubygem-foreman_templates.noarch 6.0.3-2.fm1_20.el7
tfm-rubygem-git.noarch 1.2.5-9.el7
tfm-rubygem-hammer_cli_foreman_bootdisk.noarch 0.1.3-7.el7
tfm-rubygem-hammer_cli_foreman_docker.noarch 0.0.4-4.el7
tfm-rubygem-hammer_cli_foreman_tasks.noarch 0.0.13-1.fm1_20.el7
tfm-rubygem-parse-cron.noarch 0.1.4-4.fm1_20.el7
tfm-rubygem-polyglot.noarch 0.3.5-2.el7
tfm-rubygem-rainbow.noarch 2.2.1-3.el7
tfm-rubygem-smart_proxy_dynflow_core.noarch 0.2.1-1.fm1_20.el7
tfm-rubygem-wicked.noarch 1.3.3-1.el7
Other relevant data:
During the backup downtime of the main Katello instance (the proxies should still be up during that window), I see these messages repeated every 10 seconds on all hosts, regardless of environment:
Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://contentproxy.example.com:5647
Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://contentproxy.example.com:5647
Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://contentproxy.example.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.proton.connection:92 - opened: proton+amqps://contentproxy.example.com:5647
Jul 31 00:01:00 host.example.com goferd[79015]: [INFO][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.connect:30 - connected: proton+amqps://contentproxy.example.com:5647
Jul 31 00:01:00 host.example.com goferd[79015]: [ERROR][pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf] gofer.messaging.adapter.proton.reliability:47 - receiver 1c85acb1-93d4-4d3c-bac2-0d145fe10a64 from pulp.agent.fed10944-3f4d-414a-9c8e-1da588777dbf closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')
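To see how often this reconnect cycle actually runs, the journal can simply be grepped for the failed-receiver message; a rough sketch (the time range is just an example):
# Count the failed receivers for last night's backup window.
journalctl -u goferd --since "2019-07-31 00:00" --until "2019-07-31 06:00" \
    | grep -c 'qd:no-route-to-dest'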
If someone could tell me what goferd is actually used for, that would help me a lot, too. If we do not find a solution for this soon, I am thinking about disabling goferd on all our hosts.
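If it comes to that, it would just be the usual two commands per host:
# Stop goferd and keep it from starting again at boot; the package can stay installed.
systemctl stop goferd
systemctl disable goferd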