Foreman 2.5.4 httpd threads lock out

Over time, one of our Foreman instances spins up a bunch of httpd threads, and as a result all other requests (API and UI) stop working until the httpd threads are purged or the services are restarted. I've been pounding my head against this since our upgrade from 2.2 (which was using Passenger) to 2.5 (which uses Puma). I can roll back to 2.2 and the problem does not occur, yet with every version of 2.5 so far (up to 2.5.4) this keeps happening.

When it occurs I see a lot of HTTPS connections from Foreman to Foreman, all with a CLOSE_WAIT status; here's a snippet of the netstat output:

tcp6       0      0 ::1:443                 ::1:55912               ESTABLISHED 21155/httpd
tcp6       0      0 ::1:55904               ::1:443                 FIN_WAIT2   -
tcp6       0      0 ::1:55910               ::1:443                 FIN_WAIT2   -
tcp6       0      0 ::1:55912               ::1:443                 ESTABLISHED 22674/curl
tcp6       0      0 ::1:55920               ::1:443                 FIN_WAIT2   -
tcp6       0      0 :::443                  :::*                    LISTEN      396/httpd
tcp6      32      0 10.4.26.100:443         10.240.144.22:49564     CLOSE_WAIT  10414/httpd
tcp6      32      0 ::1:443                 ::1:55676               CLOSE_WAIT  3821/httpd
tcp6      32      0 ::1:443                 ::1:55682               CLOSE_WAIT  396/httpd
tcp6      32      0 ::1:443                 ::1:55684               CLOSE_WAIT  10140/httpd
tcp6      32      0 ::1:443                 ::1:55694               CLOSE_WAIT  10869/httpd
tcp6      32      0 ::1:443                 ::1:55714               CLOSE_WAIT  11247/httpd
tcp6      32      0 ::1:443                 ::1:55720               CLOSE_WAIT  11304/httpd
tcp6      32      0 ::1:443                 ::1:55730               CLOSE_WAIT  11528/httpd
tcp6      32      0 ::1:443                 ::1:55754               CLOSE_WAIT  16158/httpd
tcp6      32      0 ::1:443                 ::1:55760               CLOSE_WAIT  16373/httpd
...

When I captured this there were 55 connections in this state. Killing all the CLOSE_WAIT connections just causes them to come back, as it seems some other connection is spinning these out of control.
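
For anyone curious, I'm counting them with something like this (a rough sketch; adjust the port/filter for your own setup):

# count CLOSE_WAIT connections on the Apache side (port 443)
ss -tnp state close-wait '( sport = :443 )' | tail -n +2 | wc -l

# or with netstat, matching the output above
netstat -tnp | awk '$6 == "CLOSE_WAIT" && $4 ~ /:443$/' | wc -l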

The logs at the time show Puppet facts being uploaded from the Puppet servers, and then things just stopped working:

==> /var/log/foreman/production.log <==
2021-10-21T16:50:17 [I|app|e72ba227] Imported report for waffle6-house.b1-prv.qops.net in 154.9 ms, status refreshed in 38.8 ms
2021-10-21T16:50:17 [I|app|e72ba227]   Rendering api/v2/config_reports/create.json.rabl
2021-10-21T16:50:17 [I|app|e72ba227]   Rendered api/v2/config_reports/create.json.rabl (Duration: 9.3ms | Allocations: 2467)
2021-10-21T16:50:17 [I|app|e72ba227] Completed 201 Created in 257ms (Views: 10.1ms | ActiveRecord: 162.8ms | Allocations: 58062)
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address has already been taken"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address has already been taken"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address has already been taken"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Saving br0 NIC for host streaming-kafka1-app.b1-prv.qops.net failed, skipping because:
2021-10-21T16:50:17 [W|app|fae9b4f3]  IP address has already been taken
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address can't be blank"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address can't be blank"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address can't be blank"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Saving eth0 NIC for host streaming-kafka1-app.b1-prv.qops.net failed, skipping because:
2021-10-21T16:50:17 [W|app|fae9b4f3]  IP address can't be blank
2021-10-21T16:50:17 [I|app|fae9b4f3] Completed 201 Created in 621ms (Views: 7.1ms | ActiveRecord: 156.2ms | Allocations: 104891)
2021-10-21T16:50:17 [I|app|f1745841] Started POST "/api/hosts/facts" for 10.4.28.217 at 2021-10-21 16:50:17 +0000

==> /var/log/httpd/foreman-ssl_access_ssl.log <==
10.4.28.217 - - [21/Oct/2021:16:50:16 +0000] "POST /api/reports HTTP/1.1" 201 661 "-" "Ruby"
10.4.28.217 - - [21/Oct/2021:16:50:16 +0000] "POST /api/hosts/facts HTTP/1.1" 201 1331 "-" "Ruby"



10.4.28.217 - - [21/Oct/2021:16:50:17 +0000] "POST /api/hosts/facts HTTP/1.1" 502 410 "-" "Ruby"
::1 - - [21/Oct/2021:16:50:18 +0000] "GET /users/login HTTP/1.1" 502 401 "-" "Consul Health Check"

==> /var/log/httpd/foreman-ssl_error_ssl.log <==
[Thu Oct 21 17:05:17.608176 2021] [proxy_http:error] [pid 2230] (70007)The timeout specified has expired: [client 10.4.28.217:38194] AH01102: error reading status line from remote server httpd-UDS:0
[Thu Oct 21 17:05:17.608256 2021] [proxy:error] [pid 2230] [client 10.4.28.217:38194] AH00898: Error reading from remote server returned by /api/hosts/facts
[Thu Oct 21 17:05:18.037131 2021] [proxy_http:error] [pid 557] (70007)The timeout specified has expired: [client ::1:54148] AH01102: error reading status line from remote server httpd-UDS:0
[Thu Oct 21 17:05:18.037257 2021] [proxy:error] [pid 557] [client ::1:54148] AH00898: Error reading from remote server returned by /users/login

It seems like the POSTs to /api/reports and /api/hosts/facts worked and got a 201 back, and then the very next call to /api/hosts/facts stopped working; no subsequent calls return (they all show 502s in the logs after that).
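
A rough way to pin down when it tipped over (same log paths as above; adjust if yours differ) is to find the first proxy timeout and count the 502s that follow:

# first occurrence of the proxy read timeout (AH01102)
grep AH01102 /var/log/httpd/foreman-ssl_error_ssl.log | head -1

# how many requests Apache then answered with a 502
grep -c '" 502 ' /var/log/httpd/foreman-ssl_access_ssl.log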

10.4.28.217 is one of our Puppet servers.

Not sure if it's at all related to something like https://github.com/puma/puma/issues/2352; I'm just thinking of that because all the CLOSE_WAIT connections are from Foreman to Foreman.
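
If it is the same kind of problem, I'd expect requests to be piling up in front of Puma's listener. Next time it happens I'll also look at the unix socket we bind Puma to (/run/foreman.sock); roughly:

# is the Puma listener still there, and is anything queued on it?
ss -xlp | grep foreman.sock

# rough count of connections currently open against that socket
ss -xp | grep -c foreman.sock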

Is there some timed process or job where Foreman talks to itself? After restarting the httpd and Puma services it runs fine for a while (about 3 to 5 hours) before it re-exhibits this behavior of piling up a whole lot of CLOSE_WAIT connections that look to be coming from Foreman to Foreman internally. Is there any setting on the ProxyPass directives that could be contributing to this, or other configuration I can take a look at? The logs (or the ones I'm looking at, at least) are not terribly helpful: they look normal and then the 502s start showing up.
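
For context, this is roughly how I'm reviewing what httpd is actually configured with on our side (paths assume the usual /etc/httpd layout):

# dump the proxy/timeout-related directives httpd is running with
grep -riE 'proxypass|proxytimeout|retry=|^ *timeout' /etc/httpd/conf /etc/httpd/conf.d 2>/dev/null

# and the vhost layout httpd actually loaded
apachectl -S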

That issue appears to only describe the development environment. I think you built containers, so perhaps you forgot to set RAILS_ENV=production?

Can you paste a full transaction from the logs? Your snippet is incomplete. Simply grep the transaction number from the log to see the whole picture; with the timestamps we should see where it hangs. Also, enabling debug might help.
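
For example, for the fact upload in your snippet above, something like this should show the whole request:

# fae9b4f3 is the request id from the production.log lines you pasted
grep fae9b4f3 /var/log/foreman/production.log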

I fixed a bug last week where discovered hosts were also delaying responses, and the root cause was our notification system: the user had too many unread notifications and the discovery process was (incorrectly) creating too many of them. It was obvious from the debug output that that was the offender. Probably not this case, but it is worth taking a look.

Before starting the application I run:
export FOREMAN_ENV=production
export FOREMAN_BIND=unix:///run/foreman.sock
export FOREMAN_PUMA_WORKERS=0

And the puma output says production mode.
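
If it helps, this is a quick way to double-check the environment the running Puma workers actually inherited (a sketch; the pgrep pattern may need adjusting):

# dump RAILS_ENV / FOREMAN_ENV from each running puma process
for pid in $(pgrep -f puma); do
  echo "== pid $pid"
  tr '\0' '\n' < /proc/$pid/environ | grep -E 'RAILS_ENV|FOREMAN_ENV'
done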

I can get 2.5.4 re-deployed in our environment and set the logs to debug, then grab them all. Is it all the content in /var/log/foreman/production.log, or are there other log files you need as well (any of the httpd ones)?

For enabling debug logs, that's nothing more than

:logging:
  :level: debug

in settings.yaml, correct? (This was logging_level: debug before.) I can capture all the logs for when this happens. There will be quite a lot of them, as, other than the VM sync thing, it's been hard to find a way to reproduce this.
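
Concretely, what I plan to run before the next capture (assuming the packaged /etc/foreman/settings.yaml path and a systemd-managed foreman service; adjust for other layouts):

# append the debug logging stanza (or edit the existing :logging: block if there is one) and restart
sudo tee -a /etc/foreman/settings.yaml <<'EOF'
:logging:
  :level: debug
EOF
sudo systemctl restart foreman httpd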

AFAIK these variables are only used in our systemd service unit.

Unless you're using that service or somehow implemented it yourself, setting them has no effect (that I know of).
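
You can check quickly with something like this (assuming the packaged foreman.service name):

# see whether the unit that starts Puma references these variables at all
systemctl cat foreman.service | grep -E 'FOREMAN_|Environment'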

It took me a bit to disable the keepalive checks we have, so that the logs are not just full of login attempts. I think the threads started misbehaving about an hour before the end of the logs. I had run a compute sync, and you can see it attempting to sync and issuing the PUT commands, but it never really recovered or processed them. I restarted all the services, ran the same command again, and it processed them fine.

foreman_logs.tar.gz (43.5 KB)

At this point we see no way to upgrade Foreman, at least not until 3.1 is released with the libvirt patch, as the system is unusable after a couple of hours of running, always with the same issue of too many httpd threads that will not close. If there is anything else to try, I can re-deploy the 2.5.4 version and capture any other logs, but the activity is really always the same: Puppet runs, Puppet facts, and some libvirt lookups to sync VMs. These are issues we just don't have with the Passenger version in 2.2.