Over time, the Foreman instance on one of our servers spins up a bunch of httpd threads, and as a result all other requests (API and UI) stop resolving until the httpd threads are purged or the services are restarted. I've been pounding my head against this since our upgrade from 2.2 (which used Passenger) to 2.5 (which uses Puma). If I roll back to 2.2 the problem does not occur, yet with every 2.5 release so far (up to 2.5.4) it keeps happening.
When it occurs I see a lot of HTTPS connections from Foreman to itself, all in CLOSE_WAIT status. Here's a snippet of the netstat output:
tcp6 0 0 ::1:443 ::1:55912 ESTABLISHED 21155/httpd
tcp6 0 0 ::1:55904 ::1:443 FIN_WAIT2 -
tcp6 0 0 ::1:55910 ::1:443 FIN_WAIT2 -
tcp6 0 0 ::1:55912 ::1:443 ESTABLISHED 22674/curl
tcp6 0 0 ::1:55920 ::1:443 FIN_WAIT2 -
tcp6 0 0 :::443 :::* LISTEN 396/httpd
tcp6 32 0 10.4.26.100:443 10.240.144.22:49564 CLOSE_WAIT 10414/httpd
tcp6 32 0 ::1:443 ::1:55676 CLOSE_WAIT 3821/httpd
tcp6 32 0 ::1:443 ::1:55682 CLOSE_WAIT 396/httpd
tcp6 32 0 ::1:443 ::1:55684 CLOSE_WAIT 10140/httpd
tcp6 32 0 ::1:443 ::1:55694 CLOSE_WAIT 10869/httpd
tcp6 32 0 ::1:443 ::1:55714 CLOSE_WAIT 11247/httpd
tcp6 32 0 ::1:443 ::1:55720 CLOSE_WAIT 11304/httpd
tcp6 32 0 ::1:443 ::1:55730 CLOSE_WAIT 11528/httpd
tcp6 32 0 ::1:443 ::1:55754 CLOSE_WAIT 16158/httpd
tcp6 32 0 ::1:443 ::1:55760 CLOSE_WAIT 16373/httpd
...
When I captured this there were 55 connections in this state. Killing all the CLOSE_WAIT connections just causes them to come back, as it seems something else keeps spinning them up out of control.
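For anyone trying to watch the leak grow, here's a quick way to summarize the same data as the netstat output above (assumes net-tools and iproute2 are installed, and sudo since -p needs root to see other users' processes):

# count CLOSE_WAIT sockets per owning PID/program, largest offenders first
sudo netstat -ntp | awk '$6 == "CLOSE_WAIT" {print $7}' | sort | uniq -c | sort -rn

# or with ss, which can filter by state directly: count how many are stuck on port 443
sudo ss -tnp state close-wait '( sport = :443 )' | wc -l

The first command groups the stuck sockets by process so you can see which httpd workers are holding them; the second just gives a running count to watch over time.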
The logs at the time show facts being uploaded from the Puppet servers, and then the uploads just stop working:
==> /var/log/foreman/production.log <==
2021-10-21T16:50:17 [I|app|e72ba227] Imported report for waffle6-house.b1-prv.qops.net in 154.9 ms, status refreshed in 38.8 ms
2021-10-21T16:50:17 [I|app|e72ba227] Rendering api/v2/config_reports/create.json.rabl
2021-10-21T16:50:17 [I|app|e72ba227] Rendered api/v2/config_reports/create.json.rabl (Duration: 9.3ms | Allocations: 2467)
2021-10-21T16:50:17 [I|app|e72ba227] Completed 201 Created in 257ms (Views: 10.1ms | ActiveRecord: 162.8ms | Allocations: 58062)
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address has already been taken"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address has already been taken"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address has already been taken"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Saving br0 NIC for host streaming-kafka1-app.b1-prv.qops.net failed, skipping because:
2021-10-21T16:50:17 [W|app|fae9b4f3] IP address has already been taken
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address can't be blank"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address can't be blank"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Not queueing Nic::Bridge: ["IP address can't be blank"]
2021-10-21T16:50:17 [W|app|fae9b4f3] Saving eth0 NIC for host streaming-kafka1-app.b1-prv.qops.net failed, skipping because:
2021-10-21T16:50:17 [W|app|fae9b4f3] IP address can't be blank
2021-10-21T16:50:17 [I|app|fae9b4f3] Completed 201 Created in 621ms (Views: 7.1ms | ActiveRecord: 156.2ms | Allocations: 104891)
2021-10-21T16:50:17 [I|app|f1745841] Started POST "/api/hosts/facts" for 10.4.28.217 at 2021-10-21 16:50:17 +0000
==> /var/log/httpd/foreman-ssl_access_ssl.log <==
10.4.28.217 - - [21/Oct/2021:16:50:16 +0000] "POST /api/reports HTTP/1.1" 201 661 "-" "Ruby"
10.4.28.217 - - [21/Oct/2021:16:50:16 +0000] "POST /api/hosts/facts HTTP/1.1" 201 1331 "-" "Ruby"
10.4.28.217 - - [21/Oct/2021:16:50:17 +0000] "POST /api/hosts/facts HTTP/1.1" 502 410 "-" "Ruby"
::1 - - [21/Oct/2021:16:50:18 +0000] "GET /users/login HTTP/1.1" 502 401 "-" "Consul Health Check"
==> /var/log/httpd/foreman-ssl_error_ssl.log <==
[Thu Oct 21 17:05:17.608176 2021] [proxy_http:error] [pid 2230] (70007)The timeout specified has expired: [client 10.4.28.217:38194] AH01102: error reading status line from remote server httpd-UDS:0
[Thu Oct 21 17:05:17.608256 2021] [proxy:error] [pid 2230] [client 10.4.28.217:38194] AH00898: Error reading from remote server returned by /api/hosts/facts
[Thu Oct 21 17:05:18.037131 2021] [proxy_http:error] [pid 557] (70007)The timeout specified has expired: [client ::1:54148] AH01102: error reading status line from remote server httpd-UDS:0
[Thu Oct 21 17:05:18.037257 2021] [proxy:error] [pid 557] [client ::1:54148] AH00898: Error reading from remote server returned by /users/login
It seems like the POSTs to /api/reports and /api/hosts/facts worked and got back a 201, and then the very next call to /api/hosts/facts stopped working; no subsequent calls return (they all show 502s in the logs after that).
10.4.28.217 is one of our Puppet servers.
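Since the AH01102 errors point at httpd-UDS (the reverse-proxy path from Apache to the Puma app server), one thing worth checking when it wedges is whether Puma still answers on its unix socket at all. A sketch, assuming the stock Foreman 2.5 setup where Puma binds to /run/foreman.sock (adjust the path if your foreman.socket unit binds elsewhere):

# ask Puma for the login page directly over the unix socket, bypassing httpd;
# a hang here means the app server itself is wedged, not the proxy
sudo curl --unix-socket /run/foreman.sock -s -o /dev/null -w '%{http_code}\n' http://localhost/users/login

# and compare with what systemd thinks of the service
systemctl status foreman

If curl over the socket hangs the same way the proxied requests do, the CLOSE_WAITs on the httpd side are just a symptom and the real blockage is in the Puma workers.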