I am working on a long-standing issue we have: Bug #33974: Getting http 500 internal server error due to "ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds" - Installer - Foreman
The bottom line there is that a Puma worker process exhausts the connection pool configured in Rails and then has to wait too long to obtain a connection.
I have started digging into this issue and would like to arrive at the proper connection pool size we should run with.
What I know so far:
According to Rails, each thread has to have its own connection, hence I need to count the active threads per process.
- Get thread stack dumps using rbtrace. This path has failed, since I was unable to properly attach rbtrace to the process. Any help in generating and analyzing process dumps would be much appreciated; the analyzers I have found so far were either very old or based on rbtrace, which is a problem.
- Take a look at the database during the load and work backwards from the executed query to the action that triggered it.
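As an alternative to attaching rbtrace from the outside, the thread count can be inspected from inside the process with plain `Thread.list`. A minimal sketch, assuming you can run code inside the worker (e.g. via a hypothetical debug endpoint or signal handler; the names here are illustrative only):

```ruby
# Sketch: enumerate live threads in the current Ruby process.
# In a real Foreman/Puma worker this would be exposed via some
# debug hook; here we simulate a listener thread for illustration.

def thread_report
  Thread.list.map do |t|
    {
      name: t.name,                              # nil unless explicitly set
      status: t.status,                          # "run", "sleep", etc.
      backtrace_top: (t.backtrace || []).first,  # where the thread currently is
    }
  end
end

# Simulate an extra listener-style thread, like the Stomp listener would be.
worker = Thread.new { sleep }
worker.name = "fake-listener"
sleep 0.1 until worker.status == "sleep"

report = thread_report
puts "live threads: #{report.size}"
report.each { |r| puts "#{r[:name].inspect} #{r[:status]} #{r[:backtrace_top]}" }
worker.kill
```

Each entry's `backtrace_top` usually makes it obvious whether a thread belongs to Puma's pool or to a gem-spawned listener.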
We have executed a set of register/unregister commands on a luna-based system (downstream Satellite to be precise).
What I have seen:
- Most of the queries were in the COMMIT phase - expected, these are the threads that were doing actual work. Unfortunately I didn’t see the history of those queries
- One dynflow listener per worker
- A couple of katello event polling queries
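Observations like the ones above can be gathered from `pg_stat_activity` during the load. A sketch of the queries involved; the database name `foreman` is an assumption for illustration:

```sql
-- Count backend connections per database/user/state during load.
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Show what each busy backend is currently doing.
SELECT pid, state, wait_event_type, query
FROM pg_stat_activity
WHERE datname = 'foreman' AND state <> 'idle';
```

Note that `query` only shows the last statement per backend, which matches the observation that the history of the COMMIT-phase queries was not visible.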
From looking into the Katello code, I have seen a Candlepin event listener that uses the Stomp gem underneath, and that gem spawns a new thread for its listener. In our case the listener will perform database queries along the way. Additionally, there is a Katello event polling thread that also issues database queries.
This leaves us with PUMA_THREAD_POOL_SIZE + 3 connections accounted for: the Puma worker threads, plus the dynflow listener, the Candlepin event listener, and the Katello event poller.
Another option is counting backwards from the available PG connections. If we know how many connections will be used by Candlepin and Pulp, we can reserve that number and then derive a formula for the pool size:
PUMA_CONNECTIONS = MAX_PG_CONNECTIONS - CANDLEPIN_CONNECTIONS - PULP_CONNECTIONS - BUFFER_CONNECTIONS
MAX_POOL_SIZE = PUMA_CONNECTIONS / PUMA_WORKERS
This should give us the maximum connection pool size we can afford on the system, given that increasing the pool size does not affect the Puma processes that much (according to our tests).
This way we will be left with calculating only one parameter: MAX_PG_CONNECTIONS. The buffer will be used to make sure we are still able to perform console tasks in parallel with a high load.
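Plugging hypothetical numbers into the formula, it works out like this (every value below is an assumption for illustration, not a recommendation):

```ruby
# Sketch of the proposed sizing formula with made-up example values.
max_pg_connections    = 500  # PostgreSQL max_connections (assumed)
candlepin_connections = 50   # reserved for Candlepin (assumed)
pulp_connections      = 50   # reserved for Pulp (assumed)
buffer_connections    = 20   # headroom for console tasks under load (assumed)
puma_workers          = 4    # number of Puma worker processes (assumed)

# PUMA_CONNECTIONS = MAX_PG_CONNECTIONS - CANDLEPIN - PULP - BUFFER
puma_connections = max_pg_connections - candlepin_connections -
                   pulp_connections - buffer_connections

# MAX_POOL_SIZE = PUMA_CONNECTIONS / PUMA_WORKERS
max_pool_size = puma_connections / puma_workers

puts "connections left for Puma: #{puma_connections}"  # 380
puts "per-worker pool size:      #{max_pool_size}"     # 95
```

The integer division is intentional: rounding down keeps the sum of all worker pools at or below the reserved Puma share.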
@ekohl raised a good concern about this workaround; I will quote it here:

> I still think the blunt "just give it more connections" is not an acceptable answer for the reasons I laid out yesterday. Configuring the max PG connections in the installer by adding those values together is hard and can have significant performance implications.
PostgreSQL implements a “process per user” client/server model. In this model, every client process connects to exactly one backend process. As we do not know ahead of time how many connections will be made, we have to use a “supervisor process” that spawns a new backend process every time a connection is requested. This supervisor process is called postmaster and listens at a specified TCP/IP port for incoming connections. Whenever it detects a request for a connection, it spawns a new backend process. Those backend processes communicate with each other and with other processes of the instance using semaphores and shared memory to ensure data integrity throughout concurrent data access.
So the tuning of those other settings should be taken into account.
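To make that concern concrete: since every connection is a separate backend process, max_connections is usually tuned together with the memory settings. A postgresql.conf sketch (placeholder values for illustration, not recommendations):

```
# postgresql.conf sketch -- placeholder values only
max_connections = 500    # each connection gets its own backend process
shared_buffers  = 2GB    # shared memory, sized relative to RAM
work_mem        = 4MB    # per sort/hash operation *per backend*, so a high
                         # max_connections multiplies worst-case memory use
```

This is why simply raising max_connections to cover a bigger Rails pool can degrade the whole system rather than fix it.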
Tests that were performed by our Performance team:
We have set PUMA_THREAD_POOL=5. I think these are the default values, and we don't play with them at this point.
- DB_POOL_SIZE < 8: generates connection pool exhaustion errors
- DB_POOL_SIZE = 8: works almost perfectly; a couple of reconnections from Katello were observed
- DB_POOL_SIZE = 10: works perfectly
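For reference, DB_POOL_SIZE ends up as the Rails `pool` setting. A database.yml sketch assuming the usual Foreman layout (the exact path and keys may differ per version):

```yaml
# /etc/foreman/database.yml (sketch -- verify against your installation)
production:
  adapter: postgresql
  database: foreman
  pool: 10   # DB_POOL_SIZE; 10 was the value that worked perfectly in the tests
```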
This looks like PUMA_THREAD_POOL + 3 is indeed the number of threads used by each process (5 + 3 = 8, matching the results above). I think we can attribute the reconnections to the fact that connections are not released immediately.
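The slow-release hypothesis could be checked at runtime: Rails 5.1+ exposes `ActiveRecord::Base.connection_pool.stat`, which reports checked-out versus idle connections. A Rails-console sketch (the output shown is made up for illustration):

```ruby
# In a Rails console attached to a Foreman worker (sketch):
pool = ActiveRecord::Base.connection_pool
pool.stat
# => { size: 10, connections: 8, busy: 5, idle: 3, dead: 0,
#      waiting: 0, checkout_timeout: 5.0 }   # hypothetical example output

# Connections held by threads that have finished are reclaimed lazily;
# reaping can be forced to see whether "busy" drops:
pool.reap
```

If `busy` stays high after the load stops and drops only after `reap`, that would support the theory that the reconnections come from connections not being released immediately.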
Any additional thoughts would be welcome.