Postgres insert queries hanging and consuming all the CPU

Problem:

This started after I accidentally left a new Foreman/Katello VM running for a couple of weeks. Due to a lack of disk space on the hardware running this VM, the virtual disk file was hosted temporarily on an NFS share. During those two weeks, I had to reboot the server hosting the NFS share and forgot that the VM was still running. I believe this is what caused the problem, which is postgres insert queries hanging and consuming all the cores and most of the RAM. I’ve rebooted the server several times; it appears to work properly for a few hours before these processes manage to bring it to a crawl. One process at a time consumes a core until there is one process for each core and nearly all the RAM is used.

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 14209 99.6 0.2 778368 21608 ? Rs 22:34 47:11 postgres: foreman foreman [local] INSERT
postgres 17752 98.3 0.2 778096 17592 ? Rs 23:09 37:34 postgres: foreman foreman [local] INSERT

I can bring the CPU usage down to normal by stopping the foreman-proxy service, restarting the postgresql service, and then starting foreman-proxy again, but these processes that never terminate then come back one at a time until the services need to be restarted again.
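
Concretely, the workaround amounts to something like this (assuming the standard systemd unit names on this CentOS 7 box):

# stop the proxy so nothing keeps opening new connections
systemctl stop foreman-proxy
# restarting postgres kills off the stuck INSERT backends
systemctl restart postgresql
# bring the proxy back up
systemctl start foreman-proxy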

Expected outcome:

I’d like to fix whatever has gone wrong with the database to stop having to restart these services.

Foreman and Proxy versions:

1.24.2-1.el7

Foreman and Proxy plugin versions:

foreman-tasks 0.17.5
foreman_remote_execution 2.0.8
katello 3.14.1

Distribution and version:

CentOS Linux release 7.8.2003 (Core)

Other relevant data:

I’m happy to provide additional data but I’m not sure what is relevant as I’ve had a look at the logs and nothing jumps out at me.

These queries never seem to finish.

foreman=# SELECT max(now() - xact_start) FROM pg_stat_activity
                       WHERE state IN ('idle in transaction', 'active');
   max
-----------------
 00:52:51.458709
(1 row)

The longest-running transaction from the query above turned out to be pid 1582; the full row from pg_stat_activity is below.
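
For anyone following along, mapping that longest transaction to a backend pid is just a matter of ordering the active backends by transaction age; something like this (run as root, connecting as the postgres superuser) does it:

# list the oldest active / idle-in-transaction backends first
sudo -u postgres psql foreman -c "SELECT pid, now() - xact_start AS age, state, query FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active') ORDER BY age DESC LIMIT 5;"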

foreman=# select * from pg_stat_activity where pid=1582;
 datid            | 16387
 datname          | foreman
 pid              | 1582
 usesysid         | 16386
 usename          | foreman
 application_name | Passenger RackApp: /usr/share/foreman
 client_addr      |
 client_hostname  |
 client_port      | -1
 backend_start    | 2020-07-15 01:27:59.933182+00
 xact_start       | 2020-07-15 01:27:59.970731+00
 query_start      | 2020-07-15 01:27:59.972241+00
 state_change     | 2020-07-15 01:27:59.972242+00
 waiting          | f
 state            | active
 query            | INSERT INTO "sessions" ("session_id", "data", "created_at", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id"

I’ve cleared the 1199 rows from the sessions table with ‘foreman-rake db:sessions:clear’, cleared the browser cookies for the foreman server, and vacuumed the DB. Let’s see if that fixes it.
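
For the record, the cleanup boiled down to something like the following (the rake task is the one mentioned above; the vacuum call is just one way of doing it and assumes the default foreman database name):

# clear the stale Rails session rows
foreman-rake db:sessions:clear
# vacuum and analyze the foreman database as the postgres superuser
sudo -u postgres vacuumdb --analyze foreman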


That seems to have resolved it.

I am curious how on Earth 1k rows in a session table would kill a VM that dramatically? We are talking about one thousand records. Was there some huge row or something? Maybe it was some kind of session storage attack.

A missing NFS share could have caused some corruption or bad indexes. Perhaps a full vacuum would have been sufficient.
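
If it comes back, something along these lines against the sessions table might be worth a try (VACUUM FULL as suggested above, plus a REINDEX as a guess for the bad-index case; run as the postgres superuser):

# rewrite the sessions table from scratch to discard any bloat
sudo -u postgres psql foreman -c "VACUUM FULL VERBOSE sessions;"
# rebuild the table's indexes in case any of them were corrupted
sudo -u postgres psql foreman -c "REINDEX TABLE sessions;"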

I don’t think it was having 1k rows. I’m up to 500+ rows now and there’s no problem. I think that sessions weren’t being cleaned up or added due to some bad juju from the NFS share dropping. Before clearing that table and vacuuming, I had rows in it from as far back as 7/6, which was the day the NFS share went down. Now there is nothing older than a day.


Don’t we have a session cleanup rake task cron job? Maybe that one was not working? Is cron turned on?

The crond service is running and I found this in /etc/cron.d/foreman

# Clean up the session entries in the database
15 23 * * *     foreman    /usr/sbin/foreman-rake db:sessions:clear >>/var/log/foreman/cron.log 2>&1

This was running but must have been failing. It appears to be working properly now, as there aren’t any sessions over 24 hours old.
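
As a quick sanity check, something like this should stay at or near zero for as long as the cron job keeps working (just counting session rows older than a day):

# any rows here would suggest the cleanup task is failing again
sudo -u postgres psql foreman -c "SELECT count(*) FROM sessions WHERE updated_at < now() - interval '1 day';"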