Smartproxy database size

Problem:
The database on the smart proxy is much bigger than on the main foreman/katello instance.
The worst offenders are the tables core_publishedartifact and core_task; is it safe to clean these tables?

Can I safely install the foreman RPM on the smart proxy? That RPM includes the foreman-rake tool. Can I use it to clean the old jobs from the smart proxy? I did that on the main instance some months ago.

Expected outcome:
Smart proxy’s database with a size similar to the foreman/katello one.

Foreman and Proxy versions:

  • foreman: 3.8

Foreman and Proxy plugin versions:

  • foreman_tasks: 8.2.0
  • foreman_ansible: 12.07
  • foreman_openscap: 7.0.0
  • foreman_remote_execution: 11.1.0
  • foreman_statistics: 2.1.0
  • katello: 4.10.0

Distribution and version:
Rocky Linux 8.8

Other relevant data:

Some months ago the foreman/katello database was getting huge.
After the correction of some bugs related to the orphan cleanup, some vacuum operations, and deleting old jobs (using foreman-rake…), the database now has an acceptable size. Unfortunately the database of the smart proxy we have at another location (which only manages fewer than 10 machines; the main instance has several hundred) is huge.

I tried Reclaim Space from the interface, but it appears to be stuck at 73% for several hours.
This query appears to be the problematic one:

SELECT DISTINCT ON ("core_content"."pulp_type") "core_content"."pulp_id", 
"core_content"."pulp_created", 
"core_content"."pulp_last_updated", 
"core_content"."pulp_type", 
"core_content"."upstream_id", 
"core_content"."timestamp_of_interest", 
"core_content"."pulp_domain_id" 
   FROM "core_content" 
   INNER JOIN "core_repositorycontent" 
     ON ("core_content"."pulp_id" = "core_repositorycontent"."content_id") 
WHERE ("core_repositorycontent"."repository_id" 
   IN ('018c62f018d0741f8a173ebe5f24079b'::uuid, 
       '018c62bd9d137c52b780c778bc187d0b'::uuid, 
       '1794145a4d7246ee9c69dd56d2f68328'::uuid, 
(...)

I even added some extra space to the partition so a VACUUM FULL operation had enough room available, but it did nothing.
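For anyone following along, the manual vacuum attempt looked roughly like this (a sketch, run as the postgres user against the pulpcore database; table names are the two worst offenders from the size listing below):

```sql
-- VACUUM FULL rewrites the whole table, so it takes an ACCESS EXCLUSIVE
-- lock and temporarily needs roughly as much free disk as the table itself.
\c pulpcore
VACUUM (FULL, VERBOSE) core_publishedartifact;
VACUUM (FULL, VERBOSE) core_task;
-- Refresh planner statistics afterwards.
ANALYZE core_publishedartifact;
ANALYZE core_task;
```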

smartproxy:

foreman=# \c pulpcore;
You are now connected to database "pulpcore" as user "postgres".
pulpcore=# select schemaname as table_schema,
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as data_size,
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid))
      as external_size
from pg_catalog.pg_statio_user_tables
order by pg_total_relation_size(relid) desc,
         pg_relation_size(relid) desc
limit 10;
 table_schema |          table_name           | total_size | data_size | external_size
--------------+-------------------------------+------------+-----------+---------------
 public       | core_publishedartifact        | 55 GB      | 22 GB     | 33 GB
 public       | core_task                     | 16 GB      | 979 MB    | 15 GB
 public       | rpm_package                   | 2737 MB    | 671 MB    | 2065 MB
 public       | core_remoteartifact           | 745 MB     | 484 MB    | 261 MB
 public       | core_userrole                 | 712 MB     | 160 MB    | 552 MB
 public       | core_repositorycontent        | 658 MB     | 171 MB    | 487 MB
 public       | guardian_userobjectpermission | 284 MB     | 77 MB     | 207 MB
 public       | core_contentartifact          | 244 MB     | 83 MB     | 161 MB
 public       | core_progressreport           | 209 MB     | 142 MB    | 67 MB
 public       | rpm_updatecollectionpackage   | 207 MB     | 165 MB    | 42 MB
(10 rows)

foreman/katello:

foreman=# \c pulpcore;
You are now connected to database "pulpcore" as user "postgres".
pulpcore=# select schemaname as table_schema,
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as data_size,
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid))
      as external_size
from pg_catalog.pg_statio_user_tables
order by pg_total_relation_size(relid) desc,
         pg_relation_size(relid) desc
limit 10;
 table_schema |          table_name           | total_size | data_size | external_size
--------------+-------------------------------+------------+-----------+---------------
 public       | core_publishedartifact        | 18 GB      | 6268 MB   | 12 GB
 public       | core_repositorycontent        | 7720 MB    | 1731 MB   | 5989 MB
 public       | core_task                     | 4901 MB    | 612 MB    | 4289 MB
 public       | rpm_package                   | 4412 MB    | 1125 MB   | 3287 MB
 public       | rpm_updatecollectionpackage   | 1461 MB    | 1134 MB   | 326 MB
 public       | rpm_updatereference           | 541 MB     | 374 MB    | 167 MB
 public       | core_remoteartifact           | 491 MB     | 303 MB    | 188 MB
 public       | rpm_updaterecord              | 457 MB     | 285 MB    | 172 MB
 public       | guardian_userobjectpermission | 397 MB     | 108 MB    | 289 MB
 public       | core_content                  | 320 MB     | 110 MB    | 210 MB
(10 rows)

Hello,

The Reclaim Space task finally ended, but running a VACUUM FULL only freed 5 GB…


pulpcore=# select schemaname as table_schema,
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as data_size,
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid))
      as external_size
from pg_catalog.pg_statio_user_tables
order by pg_total_relation_size(relid) desc,
         pg_relation_size(relid) desc
limit 10;
 table_schema |          table_name           | total_size | data_size | external_size
--------------+-------------------------------+------------+-----------+---------------
 public       | core_publishedartifact        | 50 GB      | 22 GB     | 28 GB
 public       | core_task                     | 16 GB      | 695 MB    | 15 GB
 public       | rpm_package                   | 2763 MB    | 679 MB    | 2084 MB
 public       | core_remoteartifact           | 762 MB     | 505 MB    | 257 MB
 public       | core_repositorycontent        | 571 MB     | 163 MB    | 408 MB
 public       | core_userrole                 | 543 MB     | 163 MB    | 380 MB
 public       | guardian_userobjectpermission | 284 MB     | 77 MB     | 207 MB
 public       | rpm_updatecollectionpackage   | 205 MB     | 165 MB    | 40 MB
 public       | core_contentartifact          | 191 MB     | 83 MB     | 109 MB
 public       | core_progressreport           | 174 MB     | 123 MB    | 51 MB

Anything you can do with foreman-rake will only touch the Foreman database, not the Pulpcore one, which is the only one present on a proxy, so that wouldn’t be of any help.

@iballou do you know of any tooling to tame the Pulpcore DB and/or understand why it’s bigger than on the main server?


We’ve had complaints about this in the past. We’ve actually recently updated our docs to mention that the smart proxy DB for Pulp can get larger than the main Katello one.

Smart proxies aren’t as “smart” as the main Katello server when it comes to storing content. On the Katello server we can reuse Pulp constructs like repositories and publications because we index so much Pulp data. However, on smart proxies, we really just sync the repositories 1:1.

For example, let’s say you have a repo called Zoo published in 3 content view versions that each have one lifecycle environment. On Katello, assuming the CVVs were published without Zoo having had a change in content, that means one single repository version and one single publication. I’m also assuming no content filtering here.

If you sync all that content to the smart proxy, that would be at least 4 repositories – one per CVV/LCE combo (the fourth comes from Library/Default Org View).

One setting that can reduce this is “distribute_archived_cvv”, which shows up as “Distribute archived content view versions” in the UI. It’s on by default but can be turned off if you don’t need to verify what content is distributed to the base content view versions. I think turning it off should reduce the number of published artifacts in Pulp.

Maybe we should turn that off by default.
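If anyone wants to try that, the setting can also be flipped from the CLI; a sketch, assuming a standard hammer setup on the Katello server:

```shell
# Turn off distribution of archived content view versions; a smart proxy
# sync afterwards should let the unused distributions be cleaned up.
hammer settings set --name distribute_archived_cvv --value false
```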


As for the reclaim space issue, it should reduce your disk usage if you’re using on-demand content, but I’m not sure if it would reduce the Postgres DB size.

@dralley / @ggainey I feel like I remember a Pulp issue about this, does it ring any bells for you?

If you have a lot of content, though, I could see reclaim space taking a good amount of time.

And for your specific questions:

foreman-rake is meant to be run from Foreman. We have tasks like orphan cleanup that talk to the smart proxies. I don’t believe we have anything for cleaning tasks in Pulp.
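For reference, the Foreman-side task cleanup mentioned earlier in the thread looks roughly like this (it only touches Foreman’s own database on the Katello server, not the Pulp one on the proxy):

```shell
# Delete stopped Foreman tasks older than 30 days (adjust AFTER as needed).
# This runs on the Foreman/Katello server, not the smart proxy.
foreman-rake foreman_tasks:cleanup AFTER='30d' VERBOSE=true
```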

Thanks all for the suggestions and input.

For some reason (and with no changes made to the repo configuration; we have almost all of them on demand) I was able to free a lot of space.

After the following operations:
Two days ago, in the afternoon:

  • manually sync the smartproxy
  • manually run the foreman-rake command to clean up orphans (we left it running)
  • manually reclaim space on the smartproxy

Yesterday (the usual operations ran during the night: syncs, orphan cleanup, publishing new CVs):

  • the /var/lib/pulp was much smaller
  • after running a VACUUM FULL, the database is about the same size as the main instance:
postgres=# \c pulpcore ;
You are now connected to database "pulpcore" as user "postgres".
pulpcore=# select schemaname as table_schema,
pulpcore-#     relname as table_name,
pulpcore-#     pg_size_pretty(pg_total_relation_size(relid)) as total_size,
pulpcore-#     pg_size_pretty(pg_relation_size(relid)) as data_size,
pulpcore-#     pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid))
pulpcore-#       as external_size
pulpcore-# from pg_catalog.pg_statio_user_tables
pulpcore-# order by pg_total_relation_size(relid) desc,
pulpcore-#          pg_relation_size(relid) desc
pulpcore-# limit 10;
 table_schema |          table_name           | total_size | data_size | external_size
--------------+-------------------------------+------------+-----------+---------------
 public       | core_task                     | 17 GB      | 734 MB    | 16 GB
 public       | core_publishedartifact        | 16 GB      | 7128 MB   | 9172 MB
 public       | rpm_package                   | 2550 MB    | 628 MB    | 1922 MB
 public       | core_remoteartifact           | 702 MB     | 464 MB    | 238 MB
 public       | core_userrole                 | 583 MB     | 174 MB    | 409 MB
 public       | core_repositorycontent        | 497 MB     | 141 MB    | 356 MB
 public       | guardian_userobjectpermission | 284 MB     | 77 MB     | 207 MB
 public       | core_progressreport           | 176 MB     | 124 MB    | 52 MB
 public       | rpm_updatecollectionpackage   | 174 MB     | 139 MB    | 34 MB
 public       | core_content                  | 145 MB     | 62 MB     | 83 MB
(10 rows)

This is interesting. Why is the task table so huge? And why is the data such a small percentage of it?

https://pgpedia.info/p/pg_total_relation_size.html

“pg_total_relation_size() returns the sum of the values that pg_table_size() and pg_indexes_size() would return for the relation.”

Maybe “data_size” ought to be using pg_table_size()? Can you tell us whether it’s the table or the indexes that are so large?

@dralley It’s mostly data…

pulpcore=# WITH sizes AS (
pulpcore(#  SELECT pg_table_size('core_task'),
pulpcore(#         pg_indexes_size('core_task'),
pulpcore(#         pg_total_relation_size('core_task')
pulpcore(#  )
pulpcore-# SELECT pg_total_relation_size as relation_size,
pulpcore-#        pg_table_size as table_size,
pulpcore-#        pg_indexes_size as indexes_sizes,
pulpcore-#        (pg_table_size + pg_indexes_size) as table_plus_indexes
pulpcore-# FROM sizes;
 relation_size | table_size  | indexes_sizes | table_plus_indexes
---------------+-------------+---------------+--------------------
   23928561664 | 23388291072 |     540270592 |        23928561664

Did it really increase from 17 GB to 23 GB in the past 3 weeks? Wow.

Could you provide some statistics on how many task records there are, and break them down by type of task?


Hello @dralley,

Is this the info you’re looking for?

pulpcore=# select count (*) as count,name from core_task group by name order by count desc;
  count  |                      name
---------+------------------------------------------------
 2013258 | pulpcore.app.tasks.base.general_update
  147110 | pulp_rpm.app.tasks.synchronizing.synchronize
   29550 | pulp_rpm.app.tasks.publishing.publish
   16652 | pulpcore.app.tasks.repository.delete_version
     579 | pulpcore.app.tasks.base.general_delete
     516 | pulpcore.app.tasks.base.general_create
     120 | pulpcore.app.tasks.orphan.orphan_cleanup
       3 | pulpcore.app.tasks.reclaim_space.reclaim_space
(8 rows)

pulpcore=# select count (*) as count from core_task;
  count
---------
 2207788
(1 row)
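A side note worth checking: recent Pulpcore versions expose a task purge that deletes old finished tasks from core_task. With pulp-cli installed and configured against the proxy’s Pulp API, it would look roughly like this (a sketch; verify the command exists in your pulpcore/pulp-cli version before relying on it):

```shell
# Purge finished Pulp tasks that completed before the given timestamp.
pulp task purge --finished-before 2024-01-01T00:00:00
```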

@mindo, yep.

@iballou, do you know what’s kicking off all those general_update tasks? I’m unsure if all of that is necessary; if we can minimize it, I think it would go a long way toward addressing this issue.


I’m not sure what would be calling them, but the dynflow task output from Foreman would show which tasks trigger pulpcore.app.tasks.base.general_update.