We have about 350 RHEL nodes connected to a Foreman environment consisting
of a Foreman/Katello server and two capsules. Both the Foreman server and
the capsules are virtual, with the Foreman server running 6 vCPUs with 24GB
RAM, and the capsules running 6 vCPUs and 16GB RAM. Every few days, we
start to receive lots of Puppet errors, starting with errors from node.rb:

    Could not retrieve catalog from remote server: Error 400 on SERVER: Failed
    when searching for node test.example.com: Failed to find test.example.com
    via exec: Execution of '/etc/puppet/node.rb test.example.com' returned 1:

After a number of these, we start to receive errors that the server is
under heavy load. These continue for a while, and then may or may not clear
up. Doing a 'katello-service restart' on the Foreman server cleans
everything up for a few days to a week, then it starts again.
We're running Foreman 1.12.4 and Katello 3.1.0. We've noticed two processes
that seem to grow in memory usage, dynflow_executor and tomcat, until they
take up ~60% of the available memory. Restarting the services obviously
frees up the memory for a while. Looking at the tomcat log, I'm seeing a
lot of errors from /candlepin:

    SEVERE: Servlet.service() for servlet [default] in context with path [/candlepin] threw exception
    java.lang.IllegalStateException: Cannot call reset() after response has been committed
        at org.apache.catalina.connector.ResponseFacade.reset(ResponseFacade.java:341)

Searching for Candlepin errors, I found some old messages about the
katello_event_queue not being drained, and checked for that:

    qpid-stat --ssl-certificate /etc/pki/katello/qpid_client_striped.crt -b amqps://test.example.com:5671 -q | grep katello
    katello_event_queue  Y  15.5k  15.5k  0  95.3m  95.3m  0  0  2

From this, it looks like I've got about 15,500 messages in the queue, none
of them ever delivered, and no consumers to drain it. I can see how that
might lead to excessive memory consumption, but have no clue where to go
from here.
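
To watch for this condition between restarts, the check could be scripted. The following is only a rough sketch, not something we run: it assumes the qpid-stat -q columns are queue, dur, autoDel, excl, msg, msgIn, msgOut, bytes, bytesIn, bytesOut, cons, bind, and that the k/m/g suffixes are decimal multipliers. The function names and the backlog threshold of 1000 are made up for illustration.

```python
# Sketch: interpret one data row of `qpid-stat -q` output.
# Assumed column order: queue, dur, autoDel, excl, msg, msgIn, msgOut,
# bytes, bytesIn, bytesOut, cons, bind. The three flag columns may be
# blank, so the eight numeric columns are read from the right-hand end.

# Assumption: suffixes are decimal multipliers; counts are approximate anyway.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

NUMERIC_FIELDS = ["msg", "msg_in", "msg_out",
                  "bytes", "bytes_in", "bytes_out",
                  "cons", "bind"]


def parse_count(text):
    """Convert an abbreviated count such as '15.5k' to an approximate int."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(round(float(text[:-1]) * SUFFIXES[text[-1]]))
    return int(text)


def parse_queue_row(row):
    """Return a dict with the queue name and its eight numeric columns."""
    tokens = row.split()
    if len(tokens) < len(NUMERIC_FIELDS) + 1:
        raise ValueError("not a qpid-stat queue row: %r" % row)
    values = [parse_count(t) for t in tokens[-len(NUMERIC_FIELDS):]]
    stats = dict(zip(NUMERIC_FIELDS, values))
    stats["queue"] = tokens[0]
    return stats


def is_stuck(stats, max_backlog=1000):
    """True when messages are piling up and nothing is consuming them."""
    return stats["cons"] == 0 and stats["msg"] > max_backlog


if __name__ == "__main__":
    row = "katello_event_queue Y 15.5k 15.5k 0 95.3m 95.3m 0 0 2"
    stats = parse_queue_row(row)
    print(stats["queue"], stats["msg"], stats["cons"], is_stuck(stats))
```

Fed the row above, this flags the queue as stuck, since the consumer count is 0 and the backlog is far over the threshold.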
Another issue we've been living with, which might be related: no errata
show up as applicable to any of our hosts except the Foreman server itself,
which still lists the errata from the day it was installed, and those won't
go away.
Does anyone have any ideas where to look to fix these issues? I don't want
to have to restart the services every other day or so to keep resources in
check.
James