/var partition is taking up way more space than it should be

Greetings to the community,

I hope this transmission finds you well.

Problem:

Our /var partition has been growing very rapidly. This is a brand new install of katello 4.13.1-1 and foreman 3.11.2-1 running on Rocky 9. We have only two products, Rocky 8 and Rocky 9, and while we are using the “Immediate” download policy rather than “On Demand”, this partition is now taking up:

root@katello01 /etc/cron.d # df -hP /var
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_sys-lv_var 4.5T 1.7T 2.9T 37% /var

1.7 T is a lot!

Further, it has been growing rapidly. I checked it about a week ago and it was sitting at 27% utilization.

None of our mirroring policies are “Additive”.

We mirror both x86_64 and aarch64 repos from the upstream mirrors.

Expected outcome:

That the /var partition doesn’t grow this quickly and that deduplication of packages is in working order. We do have multiple Orgs on this server, both mirroring Rocky 8 and Rocky 9, so hopefully there is a way to make sure de-dupe is functioning correctly.

Foreman and Proxy versions:

foreman-3.11.2-1.el9.noarch

Foreman and Proxy plugin versions:

Distribution and version:

Rocky 9.4 (Blue Onyx)

Other relevant data:

We also have a script that purges old versions of our CVs, so I am not sure what is eating up all this space. Is there a way to tell what is?

Thank you for any insights / methods to verify things you may be able to suggest,

Eledor

Wow, I just went digging further and I am seeing this for pgsql (1.2 TB!):

pulp is hovering where I would expect it to be (472 GB); that still seems a bit high, but it's more reasonable than Postgres taking up 1.2 TB.

root@katello01 /var/lib # du -h --max-depth=1
4.4M ./dnf
92M ./rpm
0 ./games
0 ./misc
0 ./rpm-state
80K ./alternatives
18M ./selinux
180K ./systemd
0 ./private
0 ./authselect
4.0K ./logrotate
12K ./NetworkManager
0 ./os-prober
4.0K ./rsyslog
2.5M ./sss
4.0K ./chrony
0 ./tpm2-tss
24K ./rhsm
4.0K ./unbound
0 ./polkit-1
0 ./gssproxy
0 ./rpcbind
0 ./initramfs
0 ./dhclient
4.0K ./nfs
0 ./ipsec
0 ./tuned
0 ./vmware
0 ./postfix
100M ./foreman
0 ./httpd
102M ./tomcat
0 ./tomcats
0 ./geoclue
0 ./flatpak
23M ./candlepin
4.0K ./foreman-maintain
1.2T ./pgsql
0 ./foreman-proxy
472G ./pulp
256K ./redis
1.6T .
[23:44]-[freemem: 40GB]-[load: 1.54 1.92 1.82]-[0]
root@katello01 /var/lib #

The logs are huge

root@katello01 /var/lib/pgsql/data/log # pwd
/var/lib/pgsql/data/log
[23:49]-[freemem: 40GB]-[load: 1.50 1.69 1.75]-[0]
root@katello01 /var/lib/pgsql/data/log # du -h
1.2T .
[23:49]-[freemem: 40GB]-[load: 1.70 1.73 1.77]-[0]
root@katello01 /var/lib/pgsql/data/log # ll
total 1209154564
-rw-------. 1 postgres postgres 490250313066 Nov 15 23:49 postgresql-Fri.log
-rw-------. 1 postgres postgres 23498468891 Nov 11 23:58 postgresql-Mon.log
-rw-------. 1 postgres postgres 9417550081 Nov 9 22:47 postgresql-Sat.log
-rw-------. 1 postgres postgres 31551285389 Nov 10 23:56 postgresql-Sun.log
-rw-------. 1 postgres postgres 505975307001 Nov 14 23:59 postgresql-Thu.log
-rw-------. 1 postgres postgres 52860304326 Nov 12 23:48 postgresql-Tue.log
-rw-------. 1 postgres postgres 122957126717 Nov 13 23:59 postgresql-Wed.log
[23:49]-[freemem: 40GB]-[load: 1.70 1.73 1.77]-[0]
root@katello01 /var/lib/pgsql/data/log #

Is this the way things should be?

Looks like the problem with the rapidly growing logfiles began Wednesday:

root@katello01 /var/lib/pgsql/data/log # date
Sat Nov 16 12:08:16 AM UTC 2024
[00:08]-[freemem: 40GB]-[load: 1.36 1.76 1.87]-[0]
root@katello01 /var/lib/pgsql/data/log # ls -lh
total 1.2T
-rw-------. 1 postgres postgres 460G Nov 15 23:59 postgresql-Fri.log
-rw-------. 1 postgres postgres 22G Nov 11 23:58 postgresql-Mon.log
-rw-------. 1 postgres postgres 2.0G Nov 16 00:08 postgresql-Sat.log (just rolled this log over)
-rw-------. 1 postgres postgres 30G Nov 10 23:56 postgresql-Sun.log
-rw-------. 1 postgres postgres 472G Nov 14 23:59 postgresql-Thu.log
-rw-------. 1 postgres postgres 50G Nov 12 23:48 postgresql-Tue.log
-rw-------. 1 postgres postgres 115G Nov 13 23:59 postgresql-Wed.log
[00:08]-[freemem: 40GB]-[load: 1.36 1.76 1.87]-[0]
root@katello01 /var/lib/pgsql/data/log #

Besides trying to figure out what in blazes is causing the Postgres DB to log 472 GB of data, does anyone know if I can safely just remove an old one, like the one from Thursday? Or does Postgres need to be shut down first to release the file handle?

Only the current weekday log should be open and in use. You can check with lsof or similar tools:

# lsof /var/lib/pgsql/data/log/postgresql-*.log
COMMAND    PID     USER   FD   TYPE DEVICE SIZE/OFF      NODE NAME
postmaste 1179 postgres    6w   REG  253,0    21304 205529411 /var/lib/pgsql/data/log/postgresql-Sat.log

Without knowing what is logged that excessively, it’s hard to tell. Check the log files and look for the lines with the most repetitions. It could be some problem with the database schema, long-running queries, locks, etc. The logs are usually quite self-explanatory.
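Something along these lines is usually enough to spot the offenders (sampling only part of the file so sort doesn’t have to chew through hundreds of gigabytes; the line width to group on is arbitrary):

# take a sample, keep only the start of each line, and count the most common prefixes
head -n 1000000 /var/lib/pgsql/data/log/postgresql-Thu.log | cut -c1-120 | sort | uniq -c | sort -rn | head -20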

Hi @gvde,

Thanks for the reply.

I ended up doing the following:

cp /dev/null postgresql-Wed.log
cp /dev/null postgresql-Thu.log

etc., and it cleared the space fine. You are correct that I didn’t need to shut the application down to do this, as Postgres only had the current day’s log open.
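(For anyone who finds this later: truncate does the same thing in place; since Postgres only holds the current day’s log open, nothing else has a handle to the old files:)

# zero out an old log file without removing it
truncate -s 0 /var/lib/pgsql/data/log/postgresql-Thu.log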

I did end up shutting everything down and then starting it back up after clearing space, to see if it would fix the excessive logging, but it didn’t.

Next, I checked for “paused tasks” with: foreman-maintain health check

and it reported none.

Then I went into the GUI and looked for long-running tasks, and saw one from about a month ago: “Discover Hosts”, I believe it was, and I killed that.
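(In case it helps anyone else, the CLI should be able to surface those as well; this assumes the hammer foreman_tasks plugin is installed, and I’m guessing at the search syntax:)

# list tasks that are still in the running state
hammer task list --search "state = running"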

That seemed to do the trick.

The problem seemed to be this:

root@katello01 /var/lib/pgsql/data/log # tail -n100 postgresql-Sat.log

2024-11-16 07:12:35 UTC LOG: duration: 1021.169 ms statement: UPDATE “dynflow_actions” SET “output” = '\x83a97265706f5f75726c7390a7637261776c6564dd0004f4aed928687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616dd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd920687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd920687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd920687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd929687474703a2f2f6d6972726f722e63656e746f732e6f72672f63656e746f732f382d73747265616d2fd920687474703a2f2f6d6972726f72

Any ideas on that? When I was trying to troubleshoot last week by tailing the logfile it was impossible to see anything as it was a constant stream of ASCII text looking like that.

Well it seems to me as if the task was running and producing a lot of strange output. The extract you have posted says:

ƒ©repo_urls§crawledÝô®Ù(http://mirror.centos.org/centos/8-streamÙ)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù http://mirror.centos.org/centos/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù http://mirror.centos.org/centos/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù http://mirror.centos.org/centos/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù)http://mirror.centos.org/centos/8-stream/Ù http://mirror

At the end of the statement there should be the execution plan UUID and ID identifying the task. But given the effect you’re seeing, it’s most likely the task you killed. Can you see the output of the task in the task view?

As the update took more than 1 second, it was logged. If that happens a lot, you’ll get a lot of output.
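You can check the threshold that triggers this logging like so (I’m assuming the usual superuser access via the postgres system account):

# statements running longer than this many milliseconds get logged; -1 disables it
sudo -u postgres psql -c "SHOW log_min_duration_statement;"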

But for details, a developer would have to check…

That comes from this line:

In the past we added it so we could figure out what makes an instance slow, but if it’s not analyzed then it doesn’t really help.

If updating a dynflow task takes more than a second, that’s really long IMHO. It should be a trivial query.

The simplest question to ask yourself is whether the storage is fast enough. I’d strongly recommend making sure at least /var/lib/pgsql is on SSD/NVMe.
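A rough way to verify that from the shell, using the device from your df output:

# ROTA=0 usually means SSD/NVMe, ROTA=1 a spinning disk
lsblk -o NAME,ROTA,MOUNTPOINT /dev/mapper/vg_sys-lv_var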

Another is to make sure you vacuum the DB. The PostgreSQL documentation on routine vacuuming (PostgreSQL: Documentation: 17: 24.1. Routine Vacuuming) is pretty long and mentions the SQL statements. The vacuumdb application is IMHO easier to use, so something like:

sudo -u postgres vacuumdb --all

That doesn’t lock tables, but also doesn’t give the maximum effect. You can use --full for that.
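For completeness, the two variants I’d consider (adding --analyze is just my habit, it refreshes planner statistics; --full rewrites tables and takes exclusive locks, so save it for a maintenance window):

# routine vacuum plus planner statistics refresh; safe while services are running
sudo -u postgres vacuumdb --all --analyze

# reclaims space aggressively, but locks each table while it is rewritten
sudo -u postgres vacuumdb --all --full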

Hi @ekohl, just out of curiosity, is vacuuming something that should be run periodically by the foreman/katello user? Or is it performed automatically by some daemon/cron task?

Thank you very much

Hi @gvde && @ekohl,

Thanks for the further information.

To answer your one query: yes, this /var partition is on SSD drives. We have noticed we get far better performance from Foreman/Katello when they are on faster disks than SATA.

I will see if I can look in the remaining logfiles for any more clues. Is that weird string of characters encoded or something? From the way it appears, I don’t see any mention of trying to sync CentOS 8 Stream.

Then, to echo what @JendaVodka asked: is this vacuuming of the DB something we should be running manually?

I don’t see any entry for it in /etc/cron.d/katello or /etc/cron.d/foreman
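(I suppose I can at least check whether PostgreSQL’s built-in autovacuum is enabled, assuming standard psql access for the postgres user:)

# "on" means the autovacuum launcher handles routine vacuuming in the background
sudo -u postgres psql -c "SHOW autovacuum;"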

And then is it safe to run?

Cheers

It’s a hex-encoded string. It starts with \x to indicate that. After that it’s all hex digits. The string itself is probably UTF-8 or similarly encoded, but this, for instance: 72 65 70 6f 5f 75 72 6c is repo_url.
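If you want to decode one of those blobs yourself, something like this works (drop the leading \x first; this assumes xxd is installed):

# convert the hex digits back into bytes; any non-printable framing bytes show up as odd characters
echo '7265706f5f75726c73' | xxd -r -p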