Huge SELECT statements

I have just looked in the postgresql log files for long running queries because it took pretty long until a content view publish began to start after sitting on planning for a few minutes.

I didn’t find anything related to that there, but I have noticed another thing which made me wonder. There are occasional SELECT statements like this:

2025-03-29 07:31:25 CET LOG:  duration: 8521.372 ms  execute <unnamed>: SELECT "katello_repository_rpms".* FROM "katello_repository_rpms" WHERE "katello_repository_rpms"."repository_id" = $1 AND NOT (rpm_id in (44973251,44973235,44973207,44973177,44973176,44973232,44973212,44973199,44973206,44973246,44973234,44973185,44973226,44973195,44973221,44973179,44973215,44973213,449...
2025-03-29 07:31:25 CET DETAIL:  parameters: $1 = '227251'

The first line is 717.132 bytes long!

I started looking at line lengths in the logs and find some others even longer:

2025-03-29 02:59:50 CET LOG:  duration: 2487.740 ms  execute <unnamed>: SELECT "katello_rpms"."id", "katello_rpms"."pulp_id" FROM "katello_rpms" WHERE (pulp_id in ('/pulp/api/v3/content/rpm/packages/0195df97-9871-7d12-95a0-7cf150f58328/','/pulp/api/v3/content/rpm/packages/0195df95-95c2-7e37-baa4-329d3af3c13a/','/pulp/api/v3/content/rpm/packages/0195df95-95c0-769a-8f22-a29c1...
2025-03-29 02:59:57 CET LOG:  duration: 5281.817 ms  execute <unnamed>: SELECT "katello_repository_rpms".* FROM "katello_repository_rpms" WHERE "katello_repository_rpms"."repository_id" = 1971 AND "katello_repository_rpms"."rpm_id" NOT IN (42726595, 18385409, 43387778, 1507875, 31453554, 31100335, 8168606, 11398189, 28568674, 12159919, 48625246, 1509134, 1507261, 21820778, ...
2025-03-29 03:00:04 CET LOG:  duration: 1377.383 ms  execute <unnamed>: INSERT INTO "katello_repository_rpms" ("repository_id","rpm_id","created_at","updated_at") VALUES (1971, 42726595, '2025-03-29 01:59:51.165391', '2025-03-29 01:59:51.165426'), (1971, 18385409, '2025-03-29 01:59:51.165430', '2025-03-29 01:59:51.165433'), (1971, 43387778, '2025-03-29 01:59:51.165436', '20...

Line lengths of these three: 6.015.773, 798.930, 6.326.853

I am not sure what the duration time of postgresql includes, i.e. whether it includes parsing time or only execution. I am amazed that it takes megabyte long statements that well.

Still I just wonder in general, if there isn’t something going fundamentally wrong if you have statements listing some 80.000 pulp or rpm ids for processing. I guess those lists of ids came from another SELECT statement before, thus shouldn’t be there a much more efficient way for that?

I have just updated to 3.13/4.15 latest this week but logs show these kinds of statements before running with 3.12/4.14.

2 Likes

Statements like these are a result of the way Rails Active Record composes SQL, in combination with the way code is written in Katello. For example: katello/app/models/katello/concerns/pulp_database_unit.rb at master · Katello/katello · GitHub

I agree that it makes the SELECT statements hard to read and seems not to make much sense, especially if you’re used to writing SQL directly. But the issue has been around forever, as far as I can tell. I can only assume it is not causing enough of a measurable performance issue to make a difference.

I think I recall us solving issues in the past where SQL queries become so long that they no longer work. The fact that this query is being fed IDs from some other query is suspicious. You’d think that the queries could simply be combined to avoid selecting so many RPMs.

Well, the question now is which lines of code are spawning these long bits of SQL.
I think Jeremy was close to it: katello/app/models/katello/concerns/pulp_database_unit.rb at master · Katello/katello · GitHub

It looks like our ContentUnitIndexer would require a refactor to stop using content unit IDs in select statements.