Publish and Promote slow with 3.11/4.13

Problem:
After the upgrade from 4.12 to 4.13 with the fixed candlepin package, I have noticed that our daily publish and promote script takes significantly more time to finish.

Comparing task times of the sync yesterday with 4.12 to the sync this morning with 4.13:

| Content View | Promote 4.12 | Promote 4.13 | Publish 4.12 | Publish 4.13 | Package Count |
|---|---|---|---|---|---|
| CentOS 7 | 00:01:30 | 00:03:08 | 00:06:29 | 00:11:20 | 54487 |
| CentOS 7 & EPEL 7 | 00:02:40 | 00:06:02 | 00:06:38 | 00:18:14 | 73702 |
| AlmaLinux 8 | 00:02:18 | 00:05:33 | 00:07:27 | 00:17:04 | 33313 |
| AlmaLinux 8 & EPEL 8 | 00:01:55 | 00:05:09 | 00:04:59 | 00:14:24 | 52447 |
| AlmaLinux 9 | 00:02:07 | 00:05:19 | 00:10:22 | 00:15:17 | 31975 |
| AlmaLinux 9 & EPEL 9 | 00:02:13 | 00:05:38 | 00:06:08 | 00:17:22 | 95542 |

In total it's 70 minutes for yesterday and the days before, compared to 140 minutes today.

The times in the promote columns are for promoting the cv version in testing to production.
The times in the publish columns are for publishing a new version and promoting it to testing.

As you can see, the simple promote of a published version to a new environment already takes twice as much time.

Looking at the dynflow of the promote runs of "AlmaLinux 9 & EPEL 9", I can see that the times for the Actions::Pulp3::Repository::RefreshDistribution step increased from ~30s to ~140s for the faster ones and from ~70s to ~290s for the slowest ones.

Expected outcome:
Same performance.

Foreman and Proxy versions:
foreman-3.11.1-1.el8.noarch
katello-4.13.1-1.el8.noarch

Distribution and version:
AlmaLinux 8.10

Thanks @gvde for the data on this.

Is this Actions::Pulp3::Repository::RefreshDistribution also the same action that's seeing the performance degradation in publish?

Also, for the RefreshDistribution step in the publish/promote task, could you check the step's output in the Tasks UI and grab the pulp task output from the output section for that step?

Yes. It seems to. In the first block after SaveVersions etc. it's from

357: Actions::Pulp3::Orchestration::Repository::CopyAllUnits (success) [ 0.16s / 0.16s ]
360: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.15s / 0.15s ]
362: Actions::Pulp3::Repository::RefreshDistribution (success) [ 14.14s / 2.24s ]
364: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 0.30s / 0.30s ]
366: Actions::Katello::Repository::IndexContent (success) [ 0.96s / 0.96s ]
370: Actions::Katello::Repository::CheckMatchingContent (success) [ 0.17s / 0.17s ]
373: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.05s / 0.05s ]
375: Actions::Pulp3::Repository::RefreshDistribution (success) [ 54.07s / 5.56s ]
377: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 0.80s / 0.80s ]
379: Actions::Katello::Repository::IndexContent (success) [ 0.53s / 0.53s ]

to

357: Actions::Pulp3::Orchestration::Repository::CopyAllUnits (success) [ 0.13s / 0.13s ]
360: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.11s / 0.11s ]
362: Actions::Pulp3::Repository::RefreshDistribution (success) [ 252.39s / 23.89s ]
364: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 0.17s / 0.17s ]
366: Actions::Katello::Repository::IndexContent (success) [ 0.22s / 0.22s ]
370: Actions::Katello::Repository::CheckMatchingContent (success) [ 0.07s / 0.07s ]
373: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.03s / 0.03s ]
375: Actions::Pulp3::Repository::RefreshDistribution (success) [ 221.98s / 28.35s ]
377: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 2.59s / 2.59s ]
379: Actions::Katello::Repository::IndexContent (success) [ 0.17s / 0.17s ] 

All the others after that aren't any better. Even at the end:

2306: Actions::Katello::Repository::CheckMatchingContent (success) [ 0.18s / 0.18s ]
2309: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.03s / 0.03s ]
2311: Actions::Pulp3::Repository::RefreshDistribution (success) [ 91.85s / 12.05s ]
2313: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 1.19s / 1.19s ]
2315: Actions::Katello::Repository::IndexContent (success) [ 0.22s / 0.22s ]

to

2306: Actions::Katello::Repository::CheckMatchingContent (success) [ 0.23s / 0.23s ]
2309: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.21s / 0.21s ]
2311: Actions::Pulp3::Repository::RefreshDistribution (success) [ 327.13s / 28.79s ]
2313: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 1.82s / 1.82s ]
2315: Actions::Katello::Repository::IndexContent (success) [ 0.09s / 0.09s ] 

You mean the details for the refresh in step 362 in my previous post? That would be:

 362: Actions::Pulp3::Repository::RefreshDistribution (success) [ 252.39s / 23.89s ]

Queue: default

Started at: 2024-08-08 03:53:32 UTC

Ended at: 2024-08-08 03:57:44 UTC

Real time: 252.39s

Execution time (excluding suspended state): 23.89s

Input:

---
repository_id: 173004
smart_proxy_id: 1
remote_user: admin
remote_cp_user: admin
current_request_id: ce3604e1-f4d4-4d15-8991-e22d2914ee36
current_timezone: UTC
current_organization_id: 
current_location_id: 
current_user_id: 4

Output:

---
pulp_tasks:
- pulp_href: "/pulp/api/v3/tasks/0191301e-d740-722e-982d-95a849301f3d/"
  pulp_created: '2024-08-08T03:53:33.249+00:00'
  pulp_last_updated: '2024-08-08T03:53:33.249+00:00'
  state: completed
  name: pulpcore.app.tasks.base.general_create
  logging_cid: ce3604e1-f4d4-4d15-8991-e22d2914ee36
  created_by: "/pulp/api/v3/users/1/"
  unblocked_at: '2024-08-08T03:57:35.662+00:00'
  started_at: '2024-08-08T03:57:36.697+00:00'
  finished_at: '2024-08-08T03:57:37.113+00:00'
  worker: "/pulp/api/v3/workers/01912d3f-5c85-7dfa-8c29-c6e8b8848feb/"
  child_tasks: []
  progress_reports: []
  created_resources:
  - "/pulp/api/v3/distributions/rpm/rpm/01913022-8fc1-7d34-9a76-f66194e67017/"
  reserved_resources_record:
  - "/api/v3/distributions/"
  - shared:/pulp/api/v3/domains/018d10eb-22f7-72a8-ab19-6f036c81f631/
task_groups: []
poll_attempts:
  total: 25
  failed: 0

Chunked output:

--- []

Looking at the pulp task, it seems the task was created around 03:53 and the unblocked_at is 03:57. So it seems the task itself was blocked for some time before it was able to run.

Do you see similar blockage times in other long-running RefreshDistribution steps? If that's the case, we can check with pulp whether there are any particular resource locks that were added to the task here…
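
If it helps, something like this against the pulpcore database should show the queueing delay directly. This is only a rough sketch; it uses the task name and the columns visible in the output you posted above:

-- sketch: queueing delay (unblocked_at - pulp_created) for recent general_create tasks,
-- which is the pulp task the RefreshDistribution step triggers according to the output above
SELECT name, state,
       unblocked_at - pulp_created AS blocked_for,
       finished_at - started_at    AS run_time
  FROM core_task
 WHERE name = 'pulpcore.app.tasks.base.general_create'
   AND pulp_created > now() - interval '1 day'
 ORDER BY blocked_for DESC NULLS LAST
 LIMIT 20;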

I don't think those *_at timestamps really mean what they appear to. For step 362 above, it's 0.413s between started_at and finished_at.

The next step, 364, looks similar with regard to the timestamps: also 4 minutes between create and unblock, and then less than a second between start and finish:

364: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 0.17s / 0.17s ]

Queue: default

Started at: 2024-08-08 03:57:46 UTC

Ended at: 2024-08-08 03:57:46 UTC

Real time: 0.17s

Execution time (excluding suspended state): 0.17s

Input:

---
repository_id: 173004
smart_proxy_id: 1
tasks:
  pulp_tasks:
  - pulp_href: "/pulp/api/v3/tasks/0191301e-d740-722e-982d-95a849301f3d/"
    pulp_created: '2024-08-08T03:53:33.249+00:00'
    pulp_last_updated: '2024-08-08T03:53:33.249+00:00'
    state: completed
    name: pulpcore.app.tasks.base.general_create
    logging_cid: ce3604e1-f4d4-4d15-8991-e22d2914ee36
    created_by: "/pulp/api/v3/users/1/"
    unblocked_at: '2024-08-08T03:57:35.662+00:00'
    started_at: '2024-08-08T03:57:36.697+00:00'
    finished_at: '2024-08-08T03:57:37.113+00:00'
    worker: "/pulp/api/v3/workers/01912d3f-5c85-7dfa-8c29-c6e8b8848feb/"
    child_tasks: []
    progress_reports: []
    created_resources:
    - "/pulp/api/v3/distributions/rpm/rpm/01913022-8fc1-7d34-9a76-f66194e67017/"
    reserved_resources_record:
    - "/api/v3/distributions/"
    - shared:/pulp/api/v3/domains/018d10eb-22f7-72a8-ab19-6f036c81f631/
  task_groups: []
  poll_attempts:
    total: 25
    failed: 0
contents_changed: 
remote_user: admin
remote_cp_user: admin
current_request_id: ce3604e1-f4d4-4d15-8991-e22d2914ee36
current_timezone: UTC
current_organization_id: 
current_location_id: 
current_user_id: 4

Output:

--- {}

Chunked output:

--- []

Checking random pulp tasks for this one particular publish, I always see 3-5 minutes between the pulp_created timestamp and unblocked_at, and not only for RefreshDistribution but for all steps that have that information, even those that show a short real time like SaveDistributionReferences.

I don't quite understand the meaning of those timestamps.

From what I heard from the pulp folks, the distribution creation tasks in pulp are always serialized, even though Katello starts them concurrently, because of this lock they acquire here:

reserved_resources_record:
    - "/api/v3/distributions/"

…So there's likely some distribution refresh task that blocked the following distribution creation tasks in pulp, increasing the overall time for all of them.
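
To see whether that is happening, a query along these lines against the pulpcore database should list the tasks that are queued or running while holding that lock. This is just a sketch based on the resource string shown above:

-- sketch: tasks currently queued or running that hold the serializing lock on
-- "/api/v3/distributions/" (the resource string from the reserved_resources_record above)
SELECT pulp_id, name, state, pulp_created, unblocked_at
  FROM core_task
 WHERE state IN ('waiting', 'running')
   AND reserved_resources_record @> ARRAY['/api/v3/distributions/']
 ORDER BY pulp_created;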

I don't think that has changed compared to before. Still, the total execution time increased. If you look at the execution times above, it increased from 2.24s to 23.89s, from 5.56s to 28.35s, and from 12.05s to 28.79s.

If that "Execution time (excluding suspended state)" is the real execution time, it definitely increased.

I have looked into the PostgreSQL logs. They log all statements taking longer than 1s. Just counting lines already shows a much larger number:

# wc -l /var/lib/pgsql/data*/log/postgresql-Thu.log 
    34931 /var/lib/pgsql/data/log/postgresql-Thu.log
    12447 /var/lib/pgsql/data-old/log/postgresql-Thu.log

Counting only unique statements in yesterday's log, I already find

SELECT "core_task"."pulp_id", "core_task"."pulp_created", "core_task"."pulp_last_updated", "core_task"."state", "core_task"."name", "core_task"."logging_cid", "core_task"."unblocked_at", "core_task"."started_at", "core_task"."finished_at", "core_task"."error", "core_task"."enc_args", "core_task"."enc_kwargs", "core_task"."worker_id", "core_task"."parent_task_id", "core_task"."task_group_id", "core_task"."reserved_resources_record", "core_task"."pulp_domain_id", "core_task"."versions" FROM "core_task" WHERE "core_task"."state" IN ('waiting', 'running', 'canceling') ORDER BY "core_task"."pulp_created" ASC
SELECT "core_task"."pulp_id", "core_task"."pulp_created", "core_task"."pulp_last_updated", "core_task"."state", "core_task"."name", "core_task"."logging_cid", "core_task"."unblocked_at", "core_task"."started_at", "core_task"."finished_at", "core_task"."error", "core_task"."enc_args", "core_task"."enc_kwargs", "core_task"."worker_id", "core_task"."parent_task_id", "core_task"."task_group_id", "core_task"."reserved_resources_record", "core_task"."pulp_domain_id", "core_task"."versions" FROM "core_task" WHERE ("core_task"."state" IN ('waiting', 'running', 'canceling') AND "core_task"."unblocked_at" IS NOT NULL) ORDER BY "core_task"."pulp_created" ASC

The first one appears 3566 times, the second one 28835 times, with durations ranging from 1s to 3.7s. Those statements appear during pulp actions, i.e. synchronize, publish, promote, etc.

The old logs only show the first statement, ~10,000 times, but not the second (searching for unblocked_at).

If I sum up the durations for those statements in last week's log I get 19495754 ms, roughly 325 minutes. Yesterday it was 56706922 ms, roughly 945 minutes. Currently I seem to have 30 PostgreSQL worker processes running for the pulpcore database.
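
(Roughly, the durations can be summed from the slow-query log like this; the sketch assumes the default format where the "duration: N ms" prefix and the statement end up on the same log line:)

# sum the logged durations of the unblocked_at query
# (assumes duration and statement are logged on one line, as with log_min_duration_statement)
grep 'unblocked_at" IS NOT NULL' /var/lib/pgsql/data/log/postgresql-Thu.log \
  | grep -oP 'duration: \K[0-9.]+' \
  | awk '{ms+=$1} END {printf "%.0f ms (~%.1f min)\n", ms, ms/60000}'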

Assuming the statements spread ideally over those 30 workers, this would mean a net 10.8 minutes of waiting time last week and a net 31.5 minutes yesterday. That would already explain 20 minutes. And looking at the load in top and at the pg_stat_activity output, I can see that not all 30 workers are used at all times, which could explain why in the end it's more than 60 additional minutes now compared to last week.

Here is an EXPLAIN ANALYZE output from that statement which I just ran. Of course, there is currently no pulpcore operation active, so it's faster than 1s, but it is still obvious that this statement is expensive, as even now it takes 400ms. Now run this 28,000 times during the day…

pulpcore=# explain analyze SELECT "core_task"."pulp_id", "core_task"."pulp_created", "core_task"."pulp_last_updated", "core_task"."state", "core_task"."name", "core_task"."logging_cid", "core_task"."unblocked_at", "core_task"."started_at", "core_task"."finished_at", "core_task"."error", "core_task"."enc_args", "core_task"."enc_kwargs", "core_task"."worker_id", "core_task"."parent_task_id", "core_task"."task_group_id", "core_task"."reserved_resources_record", "core_task"."pulp_domain_id", "core_task"."versions" FROM "core_task" WHERE ("core_task"."state" IN ('waiting', 'running', 'canceling') AND "core_task"."unblocked_at" IS NOT NULL) ORDER BY "core_task"."pulp_created" ASC;
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=253248.77..253248.78 rows=1 width=2177) (actual time=400.898..403.104 rows=0 loops=1)
   Sort Key: pulp_created
   Sort Method: quicksort  Memory: 25kB
   ->  Gather  (cost=1000.00..253248.76 rows=1 width=2177) (actual time=400.892..403.097 rows=0 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on core_task  (cost=0.00..252248.66 rows=1 width=2177) (actual time=397.755..397.756 rows=0 loops=3)
               Filter: ((unblocked_at IS NOT NULL) AND (state = ANY ('{waiting,running,canceling}'::text[])))
               Rows Removed by Filter: 795002
 Planning Time: 0.130 ms
 Execution Time: 403.129 ms
(11 rows)

@gvde: This looks related to the unblocked logic that went in for pulp tasks. The 2nd query seems to be coming from here: Add unblocked_at to tasks · pulp/pulpcore@51c29b5 · GitHub

Pulp has some purge methods which have gone in to clean up the tasks table. I'd assume having too many tasks in the table would slow down those task queries substantially. We recently added that to Katello, and it made it into 4.13: Bug #37521: Pulp never purge the completed tasks - Katello - Foreman

That runs during orphan cleanup. We have a new setting, 'Completed pulp task protection days', which controls how many days completed tasks are kept in the system. I'd think that would help here.
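
If you want to keep fewer days of completed tasks before the next orphan cleanup, the setting can also be changed from the CLI, something like the following. I'm assuming the internal setting name here is completed_pulp_task_protection_days, so please verify it under Administer > Settings or with hammer settings list first:

# assumption: the setting's internal name; verify with `hammer settings list` before changing it
hammer settings set --name completed_pulp_task_protection_days --value 7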

When does the orphan cleanup run? Can I start it manually to check if it makes a difference? It's currently set to 30 days.

You could trigger a manual run with: foreman-rake katello:delete_orphaned_content

O.K. I thought that only handles content cleanup. I ran

# foreman-rake katello:delete_orphaned_content RAILS_ENV=production

as it is started from cron on Sunday evening.

The pulp task seems to be working in the background much longer than the foreman task. I have checked the pulpcore.core_task table and it brought down the number of rows from

pulpcore=# select count(*),min(started_at) from core_task;
 count   |              min              
---------+-------------------------------
 2385487 | 2022-07-13 19:05:10.345702+02
(1 row)

to

pulpcore=# select count(*),min(started_at) from core_task;
 count |              min              
-------+-------------------------------
 96096 | 2022-07-19 15:24:09.422494+02
(1 row)

However, that doesn't really change the performance of the query above. It's actually worse, although it's only using one worker now instead of two.

pulpcore=# explain analyze SELECT "core_task"."pulp_id", "core_task"."pulp_created", "core_task"."pulp_last_updated", "core_task"."state", "core_task"."name", "core_task"."logging_cid", "core_task"."unblocked_at", "core_task"."started_at", "core_task"."finished_at", "core_task"."error", "core_task"."enc_args", "core_task"."enc_kwargs", "core_task"."worker_id", "core_task"."parent_task_id", "core_task"."task_group_id", "core_task"."reserved_resources_record", "core_task"."pulp_domain_id", "core_task"."versions" FROM "core_task" WHERE ("core_task"."state" IN ('waiting', 'running', 'canceling') AND "core_task"."unblocked_at" IS NOT NULL) ORDER BY "core_task"."pulp_created" ASC;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Sort  (cost=239975.67..239975.67 rows=1 width=2160) (actual time=565.910..565.911 rows=0 loops=1)
   Sort Key: pulp_created
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on core_task  (cost=0.00..239975.66 rows=1 width=2160) (actual time=565.902..565.903 rows=0 loops=1)
         Filter: ((unblocked_at IS NOT NULL) AND (state = ANY ('{waiting,running,canceling}'::text[])))
         Rows Removed by Filter: 96096
 Planning Time: 0.128 ms
 Execution Time: 565.934 ms
(8 rows)

The load on the server is low right now, so I am starting a publish task now and will post a follow-up soon.


O.K. Here is the follow-up: performance has improved. I could only test a publish and not the usual publish & promote, as that would mess up my lifecycle before the weekend.

The publish of "AlmaLinux 9 & EPEL 9" took 6m 21s compared to 6m 38s for the last publish & promote before the upgrade.

For the part I posted before, it now looks like this:

357: Actions::Pulp3::Orchestration::Repository::CopyAllUnits (success) [ 0.10s / 0.10s ]
360: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.10s / 0.10s ]
362: Actions::Pulp3::Repository::RefreshDistribution (success) [ 17.99s / 1.60s ]
364: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 0.21s / 0.21s ]
366: Actions::Katello::Repository::IndexContent (success) [ 0.37s / 0.37s ]
370: Actions::Katello::Repository::CheckMatchingContent (success) [ 0.12s / 0.12s ]
373: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.08s / 0.08s ]
375: Actions::Pulp3::Repository::RefreshDistribution (success) [ 164.02s / 7.66s ]
377: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 1.11s / 1.11s ]
379: Actions::Katello::Repository::IndexContent (success) [ 0.34s / 0.34s ] 

The first RefreshDistribution 362 is back to almost the same (14.14s before, 17.99s now).

The second RefreshDistribution 375 is still much longer (54.07s before, 164.02s now) although better than the 221.98s I had before the cleanup.

So at the moment it seems the cleanup has improved performance, but it's not back to what it was before the upgrade. I guess I'll have more accurate numbers to compare on Monday morning when the next publish & promote cycle runs.


O.K. The weekday cron job ran again this morning. The situation hasn't improved much.

| Content View | Promote 4.12 | Promote 4.13 | Promote 4.13 (2nd) | Publish 4.12 | Publish 4.13 | Publish 4.13 (2nd) |
|---|---|---|---|---|---|---|
| CentOS 7 | 00:01:30 | 00:03:08 | 00:02:59 | 00:06:29 | 00:11:20 | 00:10:44 |
| CentOS 7 & EPEL 7 | 00:02:40 | 00:06:02 | 00:05:26 | 00:06:38 | 00:18:14 | 00:16:32 |
| AlmaLinux 8 | 00:02:18 | 00:05:33 | 00:05:55 | 00:07:27 | 00:17:04 | 00:15:47 |
| AlmaLinux 8 & EPEL 8 | 00:01:55 | 00:05:09 | 00:05:00 | 00:04:59 | 00:14:24 | 00:13:01 |
| AlmaLinux 9 | 00:02:07 | 00:05:19 | 00:04:48 | 00:10:22 | 00:15:17 | 00:13:54 |
| AlmaLinux 9 & EPEL 9 | 00:02:13 | 00:05:38 | 00:05:41 | 00:06:08 | 00:17:22 | 00:15:28 |

So it's still far from the 4.12 times.

The execution times from the block I have posted before don't look better either:

357: Actions::Pulp3::Orchestration::Repository::CopyAllUnits (success) [ 0.08s / 0.08s ]
360: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.14s / 0.14s ]
362: Actions::Pulp3::Repository::RefreshDistribution (success) [ 236.21s / 6.39s ]
364: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 0.11s / 0.11s ]
366: Actions::Katello::Repository::IndexContent (success) [ 0.28s / 0.28s ]
370: Actions::Katello::Repository::CheckMatchingContent (success) [ 0.19s / 0.19s ]
373: Actions::Pulp3::Orchestration::Repository::GenerateMetadata (success) [ 0.07s / 0.07s ]
375: Actions::Pulp3::Repository::RefreshDistribution (success) [ 157.67s / 6.48s ]
377: Actions::Pulp3::Repository::SaveDistributionReferences (success) [ 0.84s / 0.84s ]
379: Actions::Katello::Repository::IndexContent (success) [ 0.22s / 0.22s ]

It's still those two RefreshDistribution steps.

For a quick test, I have looked at the query

SELECT "core_task"."pulp_id", "core_task"."pulp_created", "core_task"."pulp_last_updated", "core_task"."state", "core_task"."name", "core_task"."logging_cid", "core_task"."unblocked_at", "core_task"."started_at", "core_task"."finished_at", "core_task"."error", "core_task"."enc_args", "core_task"."enc_kwargs", "core_task"."worker_id", "core_task"."parent_task_id", "core_task"."task_group_id", "core_task"."reserved_resources_record", "core_task"."pulp_domain_id", "core_task"."versions"
  FROM "core_task"
 WHERE ("core_task"."state" IN ('waiting', 'running', 'canceling') AND "core_task"."unblocked_at" IS NOT NULL)
ORDER BY "core_task"."pulp_created" ASC

core_task doesn't have an index on either state or unblocked_at.

pulpcore=#  \d core_task
                               Table "public.core_task"
          Column           |           Type           | Collation | Nullable | Default 
---------------------------+--------------------------+-----------+----------+---------
 pulp_id                   | uuid                     |           | not null | 
 pulp_created              | timestamp with time zone |           | not null | 
 pulp_last_updated         | timestamp with time zone |           |          | 
 state                     | text                     |           | not null | 
 name                      | text                     |           | not null | 
 started_at                | timestamp with time zone |           |          | 
 finished_at               | timestamp with time zone |           |          | 
 error                     | jsonb                    |           |          | 
 worker_id                 | uuid                     |           |          | 
 parent_task_id            | uuid                     |           |          | 
 task_group_id             | uuid                     |           |          | 
 logging_cid               | text                     |           | not null | 
 reserved_resources_record | text[]                   |           |          | 
 pulp_domain_id            | uuid                     |           | not null | 
 versions                  | hstore                   |           | not null | 
 enc_args                  | jsonb                    |           |          | 
 enc_kwargs                | jsonb                    |           |          | 
 unblocked_at              | timestamp with time zone |           |          | 
Indexes:
    "core_task_pkey" PRIMARY KEY, btree (pulp_id)
    "core_task_logging_cid_0bc78a42" btree (logging_cid)
    "core_task_parent_task_id_07cf4230" btree (parent_task_id)
    "core_task_pulp_cr_10223f_idx" btree (pulp_created)
    "core_task_pulp_domain_id_ad2d9a92" btree (pulp_domain_id)
    "core_task_task_group_id_a45c142c" btree (task_group_id)
    "core_task_worker_id_ca31e694" btree (worker_id)
Foreign-key constraints:
    "core_task_parent_task_id_07cf4230_fk_core_task_pulp_id" FOREIGN KEY (parent_task_id) REFERENCES core_task(pulp_id) DEFERRABLE INITIALLY DEFERRED
    "core_task_pulp_domain_id_ad2d9a92_fk_core_domain_pulp_id" FOREIGN KEY (pulp_domain_id) REFERENCES core_domain(pulp_id) DEFERRABLE INITIALLY DEFERRED
    "core_task_task_group_id_a45c142c_fk_core_taskgroup_pulp_id" FOREIGN KEY (task_group_id) REFERENCES core_taskgroup(pulp_id) DEFERRABLE INITIALLY DEFERRED
    "core_task_worker_id_ca31e694_fk_core_worker__id" FOREIGN KEY (worker_id) REFERENCES core_worker(pulp_id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "ansible_collectionimport" CONSTRAINT "ansible_collectionimport_task_id_9d08fd06_fk_core_task_pulp_id" FOREIGN KEY (task_id) REFERENCES core_task(pulp_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "core_createdresource" CONSTRAINT "core_createdresource_task_id_acf70fb7_fk_core_task__id" FOREIGN KEY (task_id) REFERENCES core_task(pulp_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "core_export" CONSTRAINT "core_export_task_id_4947760b_fk_core_task_pulp_id" FOREIGN KEY (task_id) REFERENCES core_task(pulp_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "core_import" CONSTRAINT "core_import_task_id_b927da56_fk_core_task_pulp_id" FOREIGN KEY (task_id) REFERENCES core_task(pulp_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "core_progressreport" CONSTRAINT "core_progressreport_task_id_0c3fbc3b_fk_core_task_pulp_id" FOREIGN KEY (task_id) REFERENCES core_task(pulp_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "core_task" CONSTRAINT "core_task_parent_task_id_07cf4230_fk_core_task_pulp_id" FOREIGN KEY (parent_task_id) REFERENCES core_task(pulp_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "core_taskschedule" CONSTRAINT "core_taskschedule_last_task_id_5c1ee058_fk_core_task_pulp_id" FOREIGN KEY (last_task_id) REFERENCES core_task(pulp_id) DEFERRABLE INITIALLY DEFERRED

Adding an index on unblocked_at speeds it up, but with an index on state it gets really fast.

With no index on either state or unblocked_at:

pulpcore=# explain analyze SELECT "core_task"."pulp_id", "core_task"."pulp_created", "core_task"."pulp_last_updated", "core_task"."state", "core_task"."name", "core_task"."logging_cid", "core_task"."unblocked_at", "core_task"."started_at", "core_task"."finished_at", "core_task"."error", "core_task"."enc_args", "core_task"."enc_kwargs", "core_task"."worker_id", "core_task"."parent_task_id", "core_task"."task_group_id", "core_task"."reserved_resources_record", "core_task"."pulp_domain_id", "core_task"."versions" FROM "core_task" WHERE ("core_task"."state" IN ('waiting', 'running', 'canceling') AND "core_task"."unblocked_at" IS NOT NULL) ORDER BY "core_task"."pulp_created" ASC;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Sort  (cost=239962.49..239962.50 rows=1 width=2174) (actual time=636.459..636.461 rows=0 loops=1)
   Sort Key: pulp_created
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on core_task  (cost=0.00..239962.48 rows=1 width=2174) (actual time=636.451..636.452 rows=0 loops=1)
         Filter: ((unblocked_at IS NOT NULL) AND (state = ANY ('{waiting,running,canceling}'::text[])))
         Rows Removed by Filter: 93708
 Planning Time: 0.148 ms
 Execution Time: 636.478 ms
(8 rows)

Adding an index on state:

pulpcore=# create index on core_task ( state );
CREATE INDEX
pulpcore=# explain analyze SELECT "core_task"."pulp_id", "core_task"."pulp_created", "core_task"."pulp_last_updated", "core_task"."state", "core_task"."name", "core_task"."logging_cid", "core_task"."unblocked_at", "core_task"."started_at", "core_task"."finished_at", "core_task"."error", "core_task"."enc_args", "core_task"."enc_kwargs", "core_task"."worker_id", "core_task"."parent_task_id", "core_task"."task_group_id", "core_task"."reserved_resources_record", "core_task"."pulp_domain_id", "core_task"."versions" FROM "core_task" WHERE ("core_task"."state" IN ('waiting', 'running', 'canceling') AND "core_task"."unblocked_at" IS NOT NULL) ORDER BY "core_task"."pulp_created" ASC;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=16.92..16.93 rows=1 width=2174) (actual time=0.067..0.068 rows=0 loops=1)
   Sort Key: pulp_created
   Sort Method: quicksort  Memory: 25kB
   ->  Index Scan using core_task_state_idx on core_task  (cost=0.29..16.91 rows=1 width=2174) (actual time=0.062..0.062 rows=0 loops=1)
         Index Cond: (state = ANY ('{waiting,running,canceling}'::text[]))
         Filter: (unblocked_at IS NOT NULL)
 Planning Time: 0.169 ms
 Execution Time: 0.084 ms
(8 rows)

That is down from 636.478 ms to 0.084 ms. Considering the number of those queries logged because they took longer than 1s (e.g. 51498 queries so far today), I suspect this may help. Of course, if the state changes quickly and often, this may add a burden on index maintenance.
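
If index maintenance turns out to be a concern, a partial index over just the three states the query filters on might be worth testing instead. This is only a sketch of an alternative, not what pulpcore ships upstream:

-- sketch: a partial index covering only the states the scheduler query filters on;
-- it stays small no matter how many completed/failed tasks accumulate in the table
CREATE INDEX core_task_active_state_idx
    ON core_task (state)
 WHERE state IN ('waiting', 'running', 'canceling');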

I'll try some more tests using a copy of one of my content views and run a publish & promote to compare numbers.


I'd have imagined cleaning up tasks would have improved the query performance more than it did. We should file an issue at Issues · pulp/pulpcore · GitHub with your findings here.

This will likely impact sync and smart proxy sync performance as well, if the distribution creation/refresh is running into the long-running query.

O.K. I did the definitive test: I created a copy of the "AlmaLinux 9 & EPEL 9" content view for which I gave the detailed execution times above.

To have comparable numbers, I tested during a time with not too much traffic. Without any additional indexes, I ran the first two "Publish & Promote" runs to publish a new version and promote it to the testing environment. Between both runs and the following actions I waited for everything else to "settle" (sync to the content proxy, update content counts) until the server load was back close to 0.

  1. The first publish and promote of the very first version of this CV took 588s, almost 10 minutes.
  2. The second publish and promote of this CV took 722s, i.e. 12 minutes.

Now I added one index:

pulpcore=# CREATE INDEX ON core_task ( state );
CREATE INDEX

and ran

  3. The third publish and promote of this CV (with the state index) took 98s! 1.5 minutes!

All those long-running queries in the logs were gone. Only three Katello queries were logged. That's it. And unlike before, the server load stayed much lower and I didn't see all those busy SELECT queries in pg_stat_activity or with top. As I didn't really believe those numbers, I did another one:

  4. The fourth publish and promote of this CV (with the state index) took 114s, i.e. 2 minutes.

So I have dropped the index again

pulpcore=# DROP INDEX "core_task_state_idx";
DROP INDEX

and did the final

  5. The fifth publish and promote (without the state index) took 729s, i.e. a little over 12 minutes.

I then went into the pulpcore GitHub repository. It seems I am not the first to notice this:

This adds an index for the state column.

However, it has only been added as of pulpcore 3.54.0.

So maybe it can be backported?
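
Until then, I assume the index could be recreated manually as an interim workaround along these lines. This is only a sketch: CONCURRENTLY avoids blocking the pulp workers, and the manual index should probably be dropped again once a pulpcore version that ships its own state index is installed:

-- interim workaround sketch until a backport is available:
-- create the index without blocking writers, and drop it again after
-- upgrading to a pulpcore release that applies its own state index migration
CREATE INDEX CONCURRENTLY core_task_state_manual_idx ON core_task (state);
-- later, once the official migration has been applied:
-- DROP INDEX CONCURRENTLY core_task_state_manual_idx;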
