Bug in latest goferd client and Foreman 1.24.3 with latest patches

Problem:
The latest yum upgrade on the Foreman server renders all goferd clients useless and triggers a memory leak.

Expected outcome:
goferd resumes normal operation as soon as the Foreman server receives the updates and reboots successfully.

Foreman and Proxy versions:
1.24.3

Foreman and Proxy plugin versions:
rpm -qa | egrep -E 'gofer|proton'
python-gofer-2.11.9-1.el6.noarch
python-gofer-proton-2.11.9-1.el6.noarch
gofer-2.11.9-1.el6.noarch
python2-qpid-proton-0.32.0-2.el6.x86_64
qpid-proton-c-0.32.0-2.el6.x86_64

Distribution and version:
CentOS 7 (latest)

Other relevant data:
The goferd process goes into a timeout loop and leaks memory, rendering the underlying system unusable after 4-5 hours: it initially consumes all RAM and then fills up all swap space.
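
For reference, here is a minimal sketch of how the RSS growth can be tracked over time while a client sits in that loop. It is only an illustration: it assumes a Linux /proc filesystem, the 60-second interval and script name are arbitrary, and the goferd PID is passed on the command line (e.g. python watch_rss.py 20435).

#!/usr/bin/env python
# Sketch: print the resident set size of a given PID once a minute so
# the goferd leak can be plotted over several hours.
import sys
import time


def rss_kb(pid):
    # Read VmRSS (resident set size, in kB) from /proc/<pid>/status.
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return -1


if __name__ == "__main__":
    pid = int(sys.argv[1])   # PID of the goferd process to watch
    while True:
        print("%s  rss_kb=%d" % (time.strftime("%F %T"), rss_kb(pid)))
        time.sleep(60)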

[root@client1 ~]# pstree -a -h -p 20435
python,20435 /usr/bin/goferd
  ├─{python},20440
  ├─{python},20441
  ├─{python},20442
  ├─{python},20443
  ├─{python},20444
  ├─{python},20445
  ├─{python},20446
  └─{python},20447

[root@client1 ~]# strace -p 20441
Process 20441 attached
select(0, NULL, NULL, NULL, {0, 42106}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 25308}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)

[root@client1 ~]# tail -f /var/log/messages
Mar  8 11:23:43 client1 goferd: [INFO][pulp.agent.f410fbc3-aac6-4384-98b7-49f18b4d823d] gofer.messaging.adapter.proton.connection:92 - opened: proton+amqps://foreman.example.com:5647
Mar  8 11:23:43 client1 goferd: [INFO][pulp.agent.f410fbc3-aac6-4384-98b7-49f18b4d823d] gofer.messaging.adapter.connect:30 - connected: proton+amqps://foreman.example.com:5647
Mar  8 11:24:01 client1 goferd: [WARNING][pulp.agent.f410fbc3-aac6-4384-98b7-49f18b4d823d] gofer.messaging.adapter.proton.reliability:53 - Connection amqps://foreman.example.com:5647 disconnected: Condition('amqp:resource-limit-exceeded', 'local-idle-timeout expired')
...

Yes, same here. Please fix this bug, because all of our machines run out of ram after some hours!

Hello,
Foreman 1.24.3 is no longer supported. Please upgrade to a newer version (2.3.3 is the latest at the time of writing). If that is not possible right now, I would suggest downgrading the goferd client to a previous known-good version until an upgrade can be done.

  1. Why would you knowingly introduce something that does not work into a General Availability release?
  2. The issue is with the server being upgraded to the latest packages, not with the goferd client.

No one is introducing something that does not work on purpose, and 1.24 is not GA; it is EOL and has been for the past 9 months.
In fact, the package versions you listed are all over a year old as far as I can tell; is it possible that something else you upgraded is incompatible with them?
We only support the last two releases (right now, 2.2 and 2.3) and cannot guarantee that an older version which has reached EOL will forever remain compatible with the dependencies provided by the base OS.
If you require longer-term support, I would suggest looking into one of the commercial offerings based on Foreman.

Additionally, you mentioned that you are running CentOS 7, but the packages you listed appear to be el6 packages. Is it possible you have the wrong repo configured?

Correct, I got them from a different client (CentOS 6).

CentOS 7 client packages below:

# rpm -qa | egrep -E 'gofer|proton'
python2-qpid-proton-0.33.0-1.el7.x86_64
gofer-2.12.5-3.el7.noarch
python-gofer-2.12.5-3.el7.noarch
python-gofer-proton-2.12.5-3.el7.noarch
qpid-proton-c-0.33.0-1.el7.x86_64

The thing is that we made sure to update all the clients with the latest qpid-proton-c and python2-qpid-proton released last month, and then performed a full OS update on the server:

  1. stop the foreman services
  2. yum update
  3. run foreman-installer --scenario katello --upgrade
  4. restart the server

Maybe something is wrong in my commands, so let me know. I appreciate you looking into this, even though it is an older version.

This sounds a lot like https://issues.apache.org/jira/browse/PROTON-2187, which should be fixed in those package versions of qpid-proton-c.

Did you restart goferd after upgrading all the clients?

Yup, restarting was the first thing I did, even a full client OS restart, but the issue still persisted.
Thanks

I have seen the same problem with the versions you mentioned.
For me the fix was to set an increased heartbeat timeout for the gofer katello plugin:

# cat /etc/gofer/plugins/katello.conf | grep -v ^#

[main]
enabled=1
latency=1
plugin=katello.agent.goferd.plugin

[messaging]
url=
uuid=
cacert=/etc/rhsm/ca/candlepin-local.pem
clientcert=/etc/pki/consumer/bundle.pem
heartbeat=120 # <---- this line

The value of 120 is arbitrary; the default seems to be something around 15s.
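
If it helps, below is a small sketch to double-check which heartbeat value gofer will actually pick up before restarting goferd. This is only a sketch: the two file paths are an assumption based on the configs mentioned in this thread, and it simply scans every section of each file for a heartbeat option.

#!/usr/bin/env python
# Sketch: report any heartbeat option found in the gofer config files.
try:
    from configparser import RawConfigParser   # Python 3
except ImportError:
    from ConfigParser import RawConfigParser   # Python 2

# Assumed locations, taken from the paths discussed in this thread.
FILES = [
    "/etc/gofer/agent.conf",
    "/etc/gofer/plugins/katello.conf",
]

for path in FILES:
    parser = RawConfigParser()
    if not parser.read(path):
        print("%s: not readable" % path)
        continue
    found = False
    for section in parser.sections():
        if parser.has_option(section, "heartbeat"):
            print("%s [%s] heartbeat=%s"
                  % (path, section, parser.get(section, "heartbeat")))
            found = True
    if not found:
        print("%s: no heartbeat set, gofer default applies" % path)

Remember that goferd still needs a restart after editing the file so the new value takes effect.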

I still see this problem after restarting goferd with:

Name        : qpid-proton-c
Arch        : x86_64
Version     : 0.33.0
Release     : 1.el7
Size        : 706 k
Repo        : installed
From repo   : FD-PS-LX_EPEL_EPEL_7_x86_64

The workaround provided by @laugmanuel works fine at the moment :+1:

Yup, I don't see the error anymore (I mean in the logs), but I still need to do some more strace'ing.
Thank you guys

Hello, I can reliably reproduce it:

  • use qpid-proton-c-0.33.0-1.el7.x86_64 / python2-qpid-proton-0.33.0-1.el7.x86_64 from EPEL (the problem is not reproducible with python-qpid-proton-0.28.0-3.el7.x86_64 from RHEL)

  • set heartbeat to 7:

    /etc/gofer/plugins/katello.conf:heartbeat=7
    /etc/gofer/agent.conf:heartbeat=7

  • restart goferd service

  • simulate network outages so that heartbeats are missed:

    a="-I"
    while true; do
    echo “$(date): setting $a”
    iptables $a OUTPUT -p tcp --dport 5647 -j DROP
    if [ $a = “-I” ]; then
    a="-D"
    else
    a="-I"
    fi
    sleep 10
    done

  • check that goferd repeatedly logs disconnections with the "amqp:resource-limit-exceeded" error, followed by a successful reconnect

  • monitor goferd memory usage

This is reproducible with any foreman/qdrouterd version (goferd connects to qdrouterd, so foreman is irrelevant here). The problem lies inside the proton library, in some regression between the two versions.

If I have time, I will try to 1) come up with a simplified reproducer outside goferd, using the proton library only, and 2) narrow down the regression to a smaller range of proton library versions.
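
In case it is useful as a starting point, here is a rough sketch of what such a goferd-free reproducer might look like. It is only an outline under several assumptions: the URL is a placeholder, the TLS/client-certificate setup goferd really uses is omitted, the heartbeat of 7 mirrors the settings above, and it relies on the reactor's default reconnect behaviour. It holds a single proton connection open with a short idle timeout and prints its own RSS on every open and transport error, so memory growth can be correlated with the disconnect/reconnect cycles.

#!/usr/bin/env python
# Sketch of a proton-only probe (no gofer involved).
from proton.handlers import MessagingHandler
from proton.reactor import Container


def rss_kb():
    # Resident set size of this process in kB, read from /proc.
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return -1


class HeartbeatProbe(MessagingHandler):
    def __init__(self, url, heartbeat):
        super(HeartbeatProbe, self).__init__()
        self.url = url
        self.heartbeat = heartbeat

    def on_start(self, event):
        # heartbeat is the idle timeout in seconds; 7 matches the
        # reproducer settings above.
        event.container.connect(self.url, heartbeat=self.heartbeat)

    def on_connection_opened(self, event):
        print("opened: %s  rss_kb=%d" % (self.url, rss_kb()))

    def on_transport_error(self, event):
        print("transport error: %s  rss_kb=%d"
              % (event.transport.condition, rss_kb()))


if __name__ == "__main__":
    # Placeholder URL; point it at the qdrouterd goferd normally talks to.
    Container(HeartbeatProbe("amqp://qdrouterd.example.com:5647", 7)).run()

Combined with the iptables toggling above, the RSS printed on each reconnect should keep growing if the regression really is inside the proton library itself.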

https://issues.apache.org/jira/browse/PROTON-2344 raised.

Thank you for creating PROTON-2344