Foreman CPU usage post upgrade

Problem:

After upgrading from 1.22 to 1.23, Foreman seems to drive itself into a bit of a tizzy.
It usually takes a good 10-20 minutes for the problem to become noticeable, but by that point the VM Foreman is running on has a high load average, with all its cores at 100%.

Expected outcome:

Regular/normal CPU usage.

Foreman and Proxy versions:

Foreman and primary Smart Proxy - 1.23
Installed using foreman-installer
Upgraded from 1.21.1 to 1.22 to 1.23
Plus an additional two Smart Proxies still running 1.22

Foreman and Proxy plugin versions:

Foreman plugin: foreman-tasks, 0.16.2, Ivan Nečas, The goal of this plugin is to unify the way of showing task statuses across the Foreman instance.
It defines Task model for keeping the information about the tasks and Lock for assigning the tasks
to resources. The locking allows dealing with preventing multiple colliding tasks to be run on the
same resource. It also optionally provides Dynflow infrastructure for using it for managing the tasks.
Foreman plugin: foreman_ansible, 3.0.5, Daniel Lobato Garcia, Ansible integration with Foreman
Foreman plugin: foreman_default_hostgroup, 5.0.0, Greg Sutcliffe, Adds the option to specify a default hostgroup for new hosts created from facts/reports
Foreman plugin: foreman_discovery, 15.1.0, Aditi Puntambekar, alongoldboim, Alon Goldboim, amirfefer, Amit Karsale, Amos Benari, Avi Sharvit, Bryan Kearney, bshuster, Daniel Lobato, Daniel Lobato Garcia, Daniel Lobato García, Danny Smit, David Davis, Djebran Lezzoum, Dominic Cleal, Eric D. Helms, Ewoud Kohl van Wijngaarden, Frank Wall, Greg Sutcliffe, ChairmanTubeAmp, Ido Kanner, imriz, Imri Zvik, Ivan Nečas, Joseph Mitchell Magen, June Zhang, kgaikwad, Lars Berntzon, ldjebran, Lukas Zapletal, Lukáš Zapletal, Marek Hulan, Marek Hulán, Martin Bačovský, Matt Jarvis, Michael Moll, Nick, odovzhenko, Ohad Levy, Ondrej Prazak, Ondřej Ezr, Ori Rabin, orrabin, Partha Aji, Petr Chalupa, Phirince Philip, Rahul Bajaj, Robert Antoni Buj Gelonch, Scubafloyd, Sean O'Keeffe, Sebastian Gräßl, Shimon Shtein, Shlomi Zadok, Stephen Benjamin, Swapnil Abnave, Thomas Gelf, Timo Goebel, Tomas Strych, Tom Caspy, Tomer Brisker, and Yann Cézard, MaaS Discovery Plugin engine for Foreman
Foreman plugin: foreman_remote_execution, 1.8.2, Foreman Remote Execution team, A plugin bringing remote execution to the Foreman, completing the config management functionality with remote management functionality.
Foreman plugin: foreman_templates, 6.0.3, Greg Sutcliffe, Engine to synchronise provisioning templates from GitHub

Other relevant data:

This is an example from before the CPU gets locked up and the load average shoots up past 10…
Here, there are already two processes that look like they are stuck.

logs
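
For anyone wanting to reproduce that kind of snapshot, a generic procps invocation along these lines will list the top CPU consumers (this is just a sketch, not the exact command used here):

ps -eo pid,user,%cpu,etime,args --sort=-%cpu | head -n 10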

Any suggestions on where to look next?
As far as I can tell, the production log looks fine.

Hello and welcome. First, you need to identify which endpoints are slow: search production.log for slow requests. You can probably grep for XXXXms. Alternatively, you can deploy and integrate monitoring, which will give you more insight into the instance:

(We are in the process of upstreaming the documentation; this guide will work with Foreman 1.23 for sure.)
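
For what it's worth, a concrete grep along those lines might look like this — the four-or-more-digit millisecond pattern and the log path are assumptions about the standard Rails log format and a foreman-installer layout, so adjust to taste:

grep -E 'Completed .* in [0-9]{4,}ms' /var/log/foreman/production.log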

Hi @lzap, I might be being silly, but that link doesn’t seem to take me anywhere useful.

What should I be installing?

I can see that some hosts seem to take a little while, but I’m not sure what I should be doing with them…

Thanks,
R

Oh sorry wrong link, it should be this:

Just an update on this.
So I wasn’t able to drive the monitoring tool well enough for it to highlight what the problem was.
The short-term solution I came up with was a cron job that ran this every 10 minutes:

kill -9 $(ps axu | grep "Passenger RackApp: /usr/share/foreman" | grep -E '^foreman' | awk '{print $2}')
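
As an aside, assuming pkill from procps is available, a roughly equivalent one-liner would be the following; pkill matches against the full command line with -f and skips its own process:

pkill -9 -u foreman -f "Passenger RackApp: /usr/share/foreman"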

I’ve got a suspicion the problem is related to errors in the Puppet manifests. There are teams of people working on the manifests, and we have a couple of development environments. When the team went through and cleaned up typos and other errors, some of which had caused stack traces to be thrown into the Puppet server log file, I was able to reduce the cron job to only run once an hour.
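
If anyone else wants to watch the same trend, a crude error count on the Puppet server log worked as a rough proxy; the path and pattern here assume a stock puppetserver install, so adjust for your setup:

grep -ci 'error' /var/log/puppetlabs/puppetserver/puppetserver.log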