I am working on a long-standing issue we have: Bug #33974: Getting http 500 internal server error due to "ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds" - Installer - Foreman
The bottom line there is that a Puma worker process exhausts the connection pool configured in Rails and then has to wait too long to obtain a connection.
I have started digging into this issue and would like to arrive at the proper connection pool size we should run with.
What I know so far:
According to Rails, each thread has to have its own connection, hence I need to count the active threads per process.
- Get thread stack dumps using rbtrace. This path has failed, since I was unable to properly attach rbtrace to the process. Any help in generating and analyzing process dumps would be much appreciated; the analyzers I have found so far were either very old or based on rbtrace, which is a problem.
- Take a look at the database during the load and work backwards from the executed query to the action that triggered it.
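As an alternative to attaching rbtrace from the outside, the thread count can be inspected from inside the process with plain `Thread.list`. A minimal sketch, assuming you can run code inside the worker (e.g. via a hypothetical debug endpoint or signal handler; the names here are illustrative only):

```ruby
# Sketch: enumerate live threads in the current Ruby process.
# In a real Foreman/Puma worker this would be exposed via some
# debug hook; here we simulate a listener thread for illustration.

def thread_report
  Thread.list.map do |t|
    {
      name: t.name,                              # nil unless explicitly set
      status: t.status,                          # "run", "sleep", etc.
      backtrace_top: (t.backtrace || []).first,  # where the thread currently is
    }
  end
end

# Simulate an extra listener-style thread, like the Stomp listener would be.
worker = Thread.new { sleep }
worker.name = "fake-listener"
sleep 0.1 until worker.status == "sleep"

report = thread_report
puts "live threads: #{report.size}"
report.each { |r| puts "#{r[:name].inspect} #{r[:status]} #{r[:backtrace_top]}" }
worker.kill
```

Each entry's `backtrace_top` usually makes it obvious whether a thread belongs to Puma's pool or to a gem-spawned listener.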
We have executed a set of register/unregister commands on a luna-based system (downstream Satellite to be precise).
What I have seen:
- Most of the queries were in the COMMIT phase - expected, these are the threads that were doing actual work. Unfortunately I didn’t see the history of those queries
- One dynflow listener per worker
- A couple of katello event polling queries
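Observations like the ones above can be gathered from `pg_stat_activity` during the load. A sketch of the queries involved; the database name `foreman` is an assumption for illustration:

```sql
-- Count backend connections per database/user/state during load.
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Show what each busy backend is currently doing.
SELECT pid, state, wait_event_type, query
FROM pg_stat_activity
WHERE datname = 'foreman' AND state <> 'idle';
```

Note that `query` only shows the last statement per backend, which matches the observation that the history of the COMMIT-phase queries was not visible.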
From looking into the Katello code, I have seen a Candlepin event listener that uses the Stomp gem underneath, and that gem spawns a new thread for its listener. In our case the listener will perform database queries along the way. Additionally, there is a Katello event polling thread that also issues database queries.
This leaves us with PUMA_THREAD_POOL_SIZE + 3 connections accounted for: the Puma worker threads, plus the dynflow listener, the Candlepin event listener, and the Katello event poller.
Another option is counting backwards from the available PG connections. If we know how many connections will be used by Candlepin and Pulp, we can reserve that number and then derive a formula for the pool size:
PUMA_CONNECTIONS = MAX_PG_CONNECTIONS - CANDLEPIN_CONNECTIONS - PULP_CONNECTIONS - BUFFER_CONNECTIONS
MAX_POOL_SIZE = PUMA_CONNECTIONS / PUMA_WORKERS
This should give us the maximum connection pool size we can afford on the system, given that increasing the pool size does not affect the Puma processes that much (according to our tests).
This way we will be left with calculating only one parameter: MAX_PG_CONNECTIONS. The buffer will be used to make sure we are still able to perform console tasks in parallel with a high load.
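Plugging hypothetical numbers into the formula, it works out like this (every value below is an assumption for illustration, not a recommendation):

```ruby
# Sketch of the proposed sizing formula with made-up example values.
max_pg_connections    = 500  # PostgreSQL max_connections (assumed)
candlepin_connections = 50   # reserved for Candlepin (assumed)
pulp_connections      = 50   # reserved for Pulp (assumed)
buffer_connections    = 20   # headroom for console tasks under load (assumed)
puma_workers          = 4    # number of Puma worker processes (assumed)

# PUMA_CONNECTIONS = MAX_PG_CONNECTIONS - CANDLEPIN - PULP - BUFFER
puma_connections = max_pg_connections - candlepin_connections -
                   pulp_connections - buffer_connections

# MAX_POOL_SIZE = PUMA_CONNECTIONS / PUMA_WORKERS
max_pool_size = puma_connections / puma_workers

puts "connections left for Puma: #{puma_connections}"  # 380
puts "per-worker pool size:      #{max_pool_size}"     # 95
```

The integer division is intentional: rounding down keeps the sum of all worker pools at or below the reserved Puma share.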
@ekohl raised a good concern about this workaround; I will quote it here:

> I still think the blunt "just give it more connections" is not an acceptable answer for the reasons I laid out yesterday. Configuring the max PG connections in the installer by adding those values together is hard and can have significant performance implications.
PostgreSQL implements a “process per user” client/server model. In this model, every client process connects to exactly one backend process. As we do not know ahead of time how many connections will be made, we have to use a “supervisor process” that spawns a new backend process every time a connection is requested. This supervisor process is called postmaster and listens at a specified TCP/IP port for incoming connections. Whenever it detects a request for a connection, it spawns a new backend process. Those backend processes communicate with each other and with other processes of the instance using semaphores and shared memory to ensure data integrity throughout concurrent data access.
So the tuning of those other settings should be taken into account.
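To make that concern concrete: since every connection is a separate backend process, max_connections is usually tuned together with the memory settings. A postgresql.conf sketch (placeholder values for illustration, not recommendations):

```
# postgresql.conf sketch -- placeholder values only
max_connections = 500    # each connection gets its own backend process
shared_buffers  = 2GB    # shared memory, sized relative to RAM
work_mem        = 4MB    # per sort/hash operation *per backend*, so a high
                         # max_connections multiplies worst-case memory use
```

This is why simply raising max_connections to cover a bigger Rails pool can degrade the whole system rather than fix it.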
Tests that were performed by our Performance team:
We have set PUMA_THREAD_POOL=5. I think these are the default values, and we don't play with them at this point.
- DB_POOL_SIZE < 8: generates connection pool exhaustion errors
- DB_POOL_SIZE = 8: works almost perfectly; a couple of reconnections from Katello were observed
- DB_POOL_SIZE = 10: works perfectly
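For reference, DB_POOL_SIZE ends up as the Rails `pool` setting. A database.yml sketch assuming the usual Foreman layout (the exact path and keys may differ per version):

```yaml
# /etc/foreman/database.yml (sketch -- verify against your installation)
production:
  adapter: postgresql
  database: foreman
  pool: 10   # DB_POOL_SIZE; 10 was the value that worked perfectly in the tests
```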
This looks like PUMA_THREAD_POOL + 3 is indeed the number of threads used by each process (5 + 3 = 8, matching the results above). I think we can attribute the reconnections to the fact that connections are not released immediately.
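The slow-release hypothesis could be checked at runtime: Rails 5.1+ exposes `ActiveRecord::Base.connection_pool.stat`, which reports checked-out versus idle connections. A Rails-console sketch (the output shown is made up for illustration):

```ruby
# In a Rails console attached to a Foreman worker (sketch):
pool = ActiveRecord::Base.connection_pool
pool.stat
# => { size: 10, connections: 8, busy: 5, idle: 3, dead: 0,
#      waiting: 0, checkout_timeout: 5.0 }   # hypothetical example output

# Connections held by threads that have finished are reclaimed lazily;
# reaping can be forced to see whether "busy" drops:
pool.reap
```

If `busy` stays high after the load stops and drops only after `reap`, that would support the theory that the reconnections come from connections not being released immediately.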
Any additional thoughts would be welcome.