Foreman 2.5.3 performance issues (Memory loss, timeouts)

cody_c · October 11, 2021, 3:41pm

This thread: Foreman install 2.5, HTTP: attempt to connect to 127.0.0.1:3000 (127.0.0.1) failed started getting lost on these issues so creating a new thread to report on performance issues we are seeing with Foreman 2.5.3. After some troubles getting Foreman 2.5.3 running (see linked thread) we’ve seen dramatic memory usage compared to 2.2. We’ve upped the memory on a couple of VMs running Foreman in various environments without much change. I’ve attached some performance graphs on what we are seeing. We have a server running the default cluster mode, one running single mode, and one running single mode with the max/min threads set to 15. In all cases we see the gradual loss of available memory to the point the boxes start swapping, and restarting the container recovers the memory, which then continues it’s gradual decline.

In the first attachement,

The yellow (orange?) line is the box that was restarted from 2.5 to 2.2 for about a day, then re-started to 2.5.3 with single mode, and you can see that as soon as 2.5.3 was started (using puma instead of passenger) the memory slowly decreases over time.

The blueish line is a box that is just running the default configuration, we upped it’s memory as well but it just creeps down until it’s restarted.

The Green line was a box that was running the min/max threads set to 15 and single mode, but we ran into an issue where if we hit the box with more than 6 threads at a time doing a VM sync, it would spin up 130 http threads, and all of them would time out eventually and nothing would get synced. If I run just 6 threads against that box it would work mostly, but I’d still get 12 timeouts and about 30+ http threads, vs the two other boxes that would sit at 24 http threads and could process all the api calls (this was all done via api calls)

The second attachments

shows the last 7 days, and you can see how they just keep loosing memory over time. We never saw this with Foreman 2.2 using passenger.

I’ve included my proxypass settings as thats what I’ve been messing with, as well as my two Foreman ENV settings:

  ProxyPass / http://127.0.0.1:3000/ retry=1 acquire=3000 timeout=600 Keepalive=On
  ProxyPassReverse / http://127.0.0.1:3000/

I had to include the keepalive and timeout as building into EC2 to run the finish script would timeout before it could finish. I am unsure on the retry setting.

To run in single mode I specify:

    export FOREMAN_ENV=production
    export FOREMAN_BIND=tcp://0.0.0.0:3000
    export FOREMAN_PUMA_WORKERS=0

Before running. And to specify the workers min/max threads

    export FOREMAN_PUMA_THREADS_MIN=15
    export FOREMAN_PUMA_THREADS_MAX=15

cody_c · October 11, 2021, 3:43pm

As this is impacting our production environments, going to roll those back to 2.2 to maintain stability, but we do have 2 test environments we can run whatever we need and continue to gather metrics.

ekohl · October 11, 2021, 4:30pm

We’ve identified Bug #33640: Improve caching of SettingRegistry#load_values - Foreman and it’s already cherry picked to what will be Foreman 3.0.1. The pick to 2.5.4 has some code conflicts so it needs a bit more attention, but I’ve set the target version.

ekohl · October 11, 2021, 5:13pm

@Jonathon_Turel has opened a PR:

github.com/theforeman/foreman

[2.5] Fixes #33640 - Leverage caching of SettingRegistry#load_values

theforeman:2.5-stable ← jturel:pick_settings_performance

opened 04:53PM - 11 Oct 21 UTC

jturel

+43 -13

You can try to apply it manually: https://github.com/theforeman/foreman/pull/8834.patch and see if it improves things. Graphs from real environments are very useful and welcome here.

cody_c · October 11, 2021, 7:34pm

pulled the .rb file and re-ran a build. In the attachement the spike up is when I rolled the change ( the container restarted with the change) and everything after has been the instance running with this rb file. Ran a bunch of sync settings through the api (VMs, settings, compute resources, provision templates etc) It’s for sure a different behavior, where before the patched .rb file it was a steady decline of memory, this is much more up and down around the 5.59 GB range. It does take a day or two to really see a trend, but thought I’d add this here as an update.

cody_c · October 11, 2021, 7:36pm

Zoomed out the metric a bit to the last 6 hours and it’s pretty solid so far:

Jonathon_Turel · October 11, 2021, 9:11pm

That’s a nice result. Thank you for sharing!

cody_c · October 12, 2021, 3:02pm

Yesterday Foreman seemed to lock up / stop responding to all http requests. Unable to hit it via api or via the web UI with errors in the foreman-ssl_access_ssl.log like:

==> /var/log/httpd/foreman-ssl_access_ssl.log <==
::1 - - [12/Oct/2021:14:48:42 +0000] "GET /users/login HTTP/1.1" 502 401 "-" "curl/7.29.0"

==> /var/log/httpd/foreman-ssl_error_ssl.log <==
[Tue Oct 12 14:58:42.236928 2021] [proxy_http:error] [pid 313] (70007)The timeout specified has expired: [client ::1:36744] AH01102: error reading status line from remote server 127.0.0.1:3000
[Tue Oct 12 14:58:42.237001 2021] [proxy:error] [pid 313] [client ::1:36744] AH00898: Error reading from remote server returned by /users/login

The httpd processes would also start climbing when hitting it with either api calls or the ui, spinning up more and more http processes but not responding / returning any traffic. I restarted the process (you can see the bump in memory where it spikes up then goes down again) and it didn’t quite recover all the memory back. About a couple of hours later it’s locked up again, with httpd processes climbing with each request but nothing making it out of Foreman and the logs just repeating those 503 errors.

I will restart it again today but it doesn’t look like foreman can run more than a few hours without totally locking up.

cody_c · October 12, 2021, 3:11pm

This looks to be the same issue as Frequent web service timeouts requiring restart - #22 by therokh

Though I’m not running the same kind of config.

ekohl · October 12, 2021, 3:23pm

We saw the same issues and for us they went away when we switched from TCP to a unix socket between httpd and Puma.

cody_c · October 12, 2021, 7:00pm

I seem unable to start this using a unix socket. I have the socket on disk, and it appears to be the correct permissions, but instead I just get:

[Tue Oct 12 18:46:29.698355 2021] [proxy:error] [pid 3594] (111)Connection refused: AH02454: HTTP: attempt to connect to Unix domain socket /run/foreman.sock (foreman) failed
[Tue Oct 12 18:46:29.698393 2021] [proxy:error] [pid 3594] AH00959: ap_proxy_connect_backend disabling worker for (foreman) for 1s
[Tue Oct 12 18:46:29.698399 2021] [proxy_http:error] [pid 3594] [client ::1:50166] AH01114: HTTP: failed to make connection to backend: httpd-UDS

with the proxy pass configured as:

ProxyPass / unix:///run/foreman.sock|http://foreman/ retry=1 acquire=3000 timeout=600 Keepalive=On
ProxyPassReverse / unix:///run/foreman.sock|http://foreman/

The socket on disk setup:

srwxrwxrwx 1 root root 0 Oct 12 18:27 /run/foreman.sock

I’ve tried a couple of different permissions and users, but this is pretty permissive as is.

Which is why I switched to the tcp connections. If you got any ideas on how to get this to work on a unix socket I’m very open to switch this back and see how a couple of hours of running this patched file behaves, but I’m at a loss on how to get this to work.

ehelms · October 12, 2021, 7:09pm

How do you setup and configure your Foreman? Our installer and the underlying puppet module will configure all of this with the right setup for you. If you are doing this by hand, I think we’d need to see the permissions on the socket, and the systemd configuration for the foreman.service as well to try to compare it to how a system installed by the installer sets it all up.

cody_c · October 12, 2021, 9:25pm

Well the good news is I was able to get this to bind on the sock. I reverted my changes to the proxy pass, so that they have the

  ProxyPass / unix:///run/foreman.sock|http://foreman/ retry=0 timeout=900
  ProxyPassReverse / unix:///run/foreman.sock|http://foreman/

in both the ssl and non ssl configs (/etc/httpd/conf.d/05-foreman(-ssl).conf )

then I needed my startup script that the docker container runs to drop the socket file down, as this didn’t happen automatically, likely due to the fact that systemd doesn’t run in the docker container:

    python -c "import socket as s; sock = s.socket(s.AF_UNIX); sock.bind('/run/foreman.sock')"
    chmod 0777 /run/foreman.sock

Then starting the puma with:

    export FOREMAN_ENV=production
    export FOREMAN_BIND=unix:///run/foreman.sock
    export FOREMAN_PUMA_WORKERS=0
    /usr/share/foreman/bin/rails server --environment $FOREMAN_ENV &

And that got it up and bound to the socket file. now to just run it for a day, ensure that I don’t see those auth timeout issues and monitor the memory consumption (as this still has the patched .rd file in it’s config)

cody_c · October 15, 2021, 4:06pm

Well after about 3 days, the httpd process stops working. Checked it this morning and it had 253 httpd processes spun up, and the Web UI was unresponsive with

[Fri Oct 15 12:10:54.410960 2021] [proxy_http:error] [pid 4338] (70007)The timeout specified has expired: [client ::1:51274] AH01102: error reading status line from remote server httpd-UDS:0
[Fri Oct 15 12:10:54.411037 2021] [proxy:error] [pid 4338] [client ::1:51274] AH00898: Error reading from remote server returned by /users/login

In the logs getting repeated for a while. Had to kill the puma and httpd proccesses to get it to recover.
This is with it running against the unix socket file for puma and apache. The memory profile looks fine, so I don’t think it’s that:

While it’s still going down, it mostly stabilizes, and still has plenty of memory to work with (over 5GB) where as before the patched ruby file it would drop to the 100MB of free and a whole lot of swap.

cody_c · October 20, 2021, 9:29pm

I see that a 2.5.4 release dropped yesterday. So going to try running that and see if it helps with the http thread exhaustion issue, as well as the memory profiling and keep an eye on this over the next couple of days. We’ve had to roll back to 2.2 in just about all the locations as we kept seeing the http thread lock and memory issues, and have 2.5.4 just running in our three testing environments.

lzap · October 25, 2021, 1:04pm

So how was it?

I saw several other threads (libvirt issue) which might be related.

cody_c · October 26, 2021, 7:31pm

Memory usage I feel is solved, and I got a handful of threads dealing with other issues, so I’d say that the patch to 2.5.4 resolved this issue. I haven’t seen any memory issues on the boxes and they have been running for a while now. (only dealing with the thread/ process lockup / exhaustion issue and a few other minor UI problems. )