Slow content proxy sync

I’d like to test out increasing work_mem before we try to revert anything. The patch did make significant improvements in many other contexts, and the 4MB default value is known to be a bit low for the majority of users.
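
For testing, something along these lines should work; the 64MB value below is only a placeholder, not a recommendation:

-- check the current value first
SHOW work_mem;

-- try a larger value for the current session only
SET work_mem = '64MB';

-- or change it instance-wide for new sessions and reload the config
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();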

We should also start a conversation about upgrading postgresql.

@dralley We can also test increasing work_mem on one of our affected systems. Do you have any suggestions for what a good test value might be?

What exactly is responsible for upgrading (or not upgrading) postgresql within the Foreman context? Do we just get whatever is currently available for EL8?

@quba There’s an entry in the postgresql changelog that near-perfectly matches the situation:

https://www.postgresql.org/docs/release/13.0/

Allow hash aggregation to use disk storage for large aggregation result sets (Jeff Davis)

Previously, hash aggregation was avoided if it was expected to use more than work_mem memory. Now, a hash aggregation plan can be chosen despite that. The hash table will be spilled to disk if it exceeds work_mem times hash_mem_multiplier.

This behavior is normally preferable to the old behavior, in which once hash aggregation had been chosen, the hash table would be kept in memory no matter how large it got — which could be very large if the planner had misestimated. If necessary, behavior similar to that can be obtained by increasing hash_mem_multiplier.
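
For reference, the two settings involved can be checked on the database side like this (hash_mem_multiplier only exists on PostgreSQL 13 and later, which is part of why 12 behaves differently):

SELECT name, setting, unit FROM pg_settings WHERE name IN ('work_mem', 'hash_mem_multiplier');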

I am not sure I fully understood that changelog entry, but I am also not convinced this “matches near-perfectly”. In our case the affected systems never had problems with memory usage (or am I misunderstanding the implication of “the hash table would be kept in memory no matter how large it got”?).

Note also we are currently testing an affected system using 40MB for work_mem, which does not appear to have helped. If anything it has made things a little worse. By comparison, we have another Smart Proxy attached to the same Foreman instance where we reverted https://github.com/pulp/pulpcore/pull/4275, and this appears to have fixed the symptoms. I will know more tomorrow once the two smart proxy syncs have had a chance to finish completely.

I am not sure I fully understood that changelog entry, but I am also not convinced this “matches near-perfectly”. In our case the affected systems never had problems with memory usage (or am I misunderstanding the implication of “the hash table would be kept in memory no matter how large it got”?).

You’re focusing on the wrong part of the change. I believe we found that for the new query on Postgresql 13 a hash aggregation is chosen, whereas on Postgresql 12 a nested loop (if I recall correctly) was chosen. The assumption is that Postgresql 12 was choosing an inefficient query plan due to its inability to spill to disk, which was resolved in Postgresql 13.
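
If someone wants to poke at that theory on a PG12 box, one purely diagnostic option (not a fix) is to steer the planner away from nested loops for a single session and compare the resulting plan and runtime, roughly like this:

-- diagnostic only, in a single psql session on the PG12 instance
SET enable_nestloop = off;
-- ... re-run the slow query here with EXPLAIN ANALYZE and compare ...
RESET enable_nestloop;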

The friction is that https://github.com/pulp/pulpcore/pull/4275 was a real (and significant) optimization in a lot of situations in our testing, and only seems to be a regression on Postgresql 12, and even then perhaps only in specific circumstances. So while we may end up reverting it (or asking Katello to patch it out specifically in their builds) I’m not super eager to do so without covering our bases first.

Thanks for clearing up my reading of the changelog.

Regarding whether or not to revert a pulpcore change that improves performance on Postgresql 13 while causing problems (at least in some cases) on Postgresql 12, I personally don’t have any immediate opinion. I am just trying to share what we found during our internal testing.

What I can share is that in the combination foreman-3.5.3, katello-4.7.6, python39-pulpcore-3.21.17 on EL8, which in turn implies postgresql-12.15, sync to smart proxy becomes reliably unusable for many of the systems we tested unless we revert the aforementioned pulpcore commit. This means we will be reverting that commit for our internal systems for now.

I also believe that every released version of Foreman is using Postgresql 12 out of the box. Based on our experience, I am surprised there aren’t more people affected by this.

Given that issues like this one apparently exist, I feel like it is a general problem that Pulp tests exclusively on Postgresql 13 while Foreman uses Postgresql 12 by default. My opinion is that this is something that the Foreman/Katello and Pulp communities need to discuss, but I don’t have any recommendation on how exactly to address it. For starters, I have no idea how hard it would be for Foreman/Katello to sync up their postgresql version with the one used by default for Pulp installations.


A few questions:

  • does this impact all repos, or does it seem to impact a few repos selectively?
  • can you confirm that this only impacts capsule / smart proxy sync, and not the syncs that the main instance performs?

I am not 100% sure. On the system where we did most of the testing, triggering a full capsule/smart proxy sync implies queuing up something like 600 Pulp syncs and always running n of them at a time in parallel (where n is the number of Pulp workers). For our tests we had 4 Pulp workers. With pulpcore/pull/4275 reverted, all 600 syncs (plus any resultant publish tasks etc.) would successfully complete after roughly 13 hours for an initial sync to a new proxy. When pulpcore/pull/4275 was NOT reverted, we would usually see that around 30 of 600 syncs had completed after the first 6 hours. If we waited until the next day it might have been at 60 of 600. Eventually we would simply give up and kill the remaining tasks. In other words, I was never sure whether progress had actually died completely or had only slowed to a crawl. It also seemed like the first couple of syncs would happen at a reasonable speed, but eventually the queue would always hit four badly affected syncs, with the result that all available Pulp workers were effectively stalled.

On the server itself, syncs are not nearly so “bunched up”. And most individual syncs do complete at a more or less reasonable speed. We do have some individual syncs and repos that seem to take significantly longer than other similarly sized repos, but we are not certain if those are affected by the same issue as the proxy syncs, or something else entirely.

So my best guess is that this does not universally impact all syncs (at least not to the same extent), in the sense that individual syncs do often succeed at a reasonable speed. My suspicion is that the capsule sync reliably reveals the problem by systematically queuing up so many syncs. Or perhaps it is something about running multiple large syncs in parallel that increases the likelihood of this issue? I can say that we never observed a memory bottleneck or swapping. Whenever we had the slow/stuck syncs, we could see one postgres process per worker, each consuming one CPU core and running a SELECT during the content association stage, which strikes me as consistent with this being caused by pulp/pulpcore/pull/4275.
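
For anyone who wants to check the same thing on their system, the long-running statements can be listed from pg_stat_activity with something like this (run against the pulpcore database; the one-minute threshold is arbitrary):

SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 120) AS query
  FROM pg_stat_activity
 WHERE state = 'active'
   AND now() - query_start > interval '1 minute'
 ORDER BY runtime DESC;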

Call them out anyway, might be relevant. And if not, it’s something to look into later.

I have just reinstalled python39-pulpcore-3.22.15-1.el8.noarch on my content proxy to see how fast it is after upgrading to 3.7.0/4.9.2.

This is a normal sync from the main server to the content proxy after a new content view version has been published and promoted to an environment which is synced to the content proxy.

The following query just took approx. 8 minutes to finish:

pulpcore=# explain SELECT "core_content"."pulp_id" FROM "core_content" WHERE ("core_content"."pulp_id" IN ('554f55cb-e4db-4844-8f7a-fdb279caeaec'::uuid) AND NOT ("core_content"."pulp_id" IN (SELECT V0."pulp_id" FROM "core_content" V0 WHERE V0."pulp_id" IN (SELECT U0."content_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '43e7b651-8399-4bda-b54e-900dea17af7e'::uuid AND U2."number" <= 262 AND NOT (U3."number" <= 262 AND U3."number" IS NOT NULL))))));
QUERY PLAN
Index Only Scan using core_content_pkey on core_content  (cost=51572.39..51580.41 rows=1 width=16)
  Index Cond: (pulp_id = '554f55cb-e4db-4844-8f7a-fdb279caeaec'::uuid)
  Filter: (NOT (hashed SubPlan 1))
  SubPlan 1
    ->  Nested Loop Semi Join  (cost=4254.16..51571.96 rows=1 width=16)
          Join Filter: (v0.pulp_id = u0.content_id)
          ->  Seq Scan on core_content v0  (cost=0.00..7271.37 rows=289037 width=16)
          ->  Materialize  (cost=4254.16..39965.04 rows=1 width=16)
                ->  Nested Loop  (cost=4254.16..39965.03 rows=1 width=16)
                      Join Filter: (u0.version_added_id = u2.pulp_id)
                      ->  Hash Left Join  (cost=4254.16..39884.38 rows=1 width=32)
                            Hash Cond: (u0.version_removed_id = u3.pulp_id)
                            Filter: ((u3.number > 262) OR (u3.number IS NULL))
                            ->  Bitmap Heap Scan on core_repositorycontent u0  (cost=4178.87..39687.73 rows=46235 width=48)
                                  Recheck Cond: (repository_id = '43e7b651-8399-4bda-b54e-900dea17af7e'::uuid)
                                  ->  Bitmap Index Scan on core_repositorycontent_repository_id_1946e61b  (cost=0.00..4167.31 rows=46235 width=0)
                                        Index Cond: (repository_id = '43e7b651-8399-4bda-b54e-900dea17af7e'::uuid)
                            ->  Hash  (cost=48.46..48.46 rows=2146 width=20)
                                  ->  Seq Scan on core_repositoryversion u3  (cost=0.00..48.46 rows=2146 width=20)
                      ->  Seq Scan on core_repositoryversion u2  (cost=0.00..53.83 rows=2146 width=16)
                            Filter: (number <= 262)
(21 rows)

The repository 43e7b651-8399-4bda-b54e-900dea17af7e is EPEL9.

The package 554f55cb-e4db-4844-8f7a-fdb279caeaec is this:

pulpcore=# select content_ptr_id,name,epoch,version,release,arch from rpm_package where content_ptr_id = '554f55cb-e4db-4844-8f7a-fdb279caeaec';
            content_ptr_id            |   name   | epoch | version | release |  arch  
--------------------------------------+----------+-------+---------+---------+--------
 554f55cb-e4db-4844-8f7a-fdb279caeaec | corectrl | 0     | 1.3.6   | 1.el9   | x86_64
(1 row)

This is another one which takes longer:

pulpcore=# select content_ptr_id,name,epoch,version,release,arch from rpm_package where content_ptr_id = 'ba0300b8-705a-462c-ba0a-b3c20103914e';
            content_ptr_id            |       name       | epoch | version | release |  arch  
--------------------------------------+------------------+-------+---------+---------+--------
 ba0300b8-705a-462c-ba0a-b3c20103914e | python3-autopage | 0     | 0.5.2   | 1.el9   | noarch
(1 row)

At least based on these random checks, it seems as if only the new packages cause the long queries, not the ones which have been synced before.
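
To check that more systematically, one idea would be to pick a package that has been in the repository since an early version and re-run the query above with its uuid instead of the corectrl one (ideally with EXPLAIN ANALYZE, which actually executes the query and reports real timings). A rough sketch, with an arbitrary version cut-off:

-- find a content unit that was added in an early repository version
SELECT U0.content_id
  FROM core_repositorycontent U0
  JOIN core_repositoryversion U2 ON U0.version_added_id = U2.pulp_id
 WHERE U0.repository_id = '43e7b651-8399-4bda-b54e-900dea17af7e'::uuid
   AND U2.number < 10
 LIMIT 1;
-- then substitute that uuid into the EXPLAIN above and compare the runtime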

Having shifted our test focus from the proxy sync to repos on the server, it looks like it is not specific repos that are affected, but individual syncs. Looking at two repos of similar size that are regularly synced, the Katello sync takes around 1h most of the time, but individual syncs take much longer. In one extreme case, a sync completed successfully after 23h instead of the usual 1h. In each case it was the Pulp sync, and not the publish, that accounted for the long run time. I can’t easily tell now whether the system was running lots of parallel syncs at the time when these outliers happened.
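
In case it helps to quantify those outliers, the durations could probably also be pulled out of Pulp’s own task table. A rough sketch, assuming a core_task table with name, started_at and finished_at columns (worth double-checking against the actual schema of the installed pulpcore version):

SELECT name,
       started_at,
       finished_at - started_at AS duration
  FROM core_task
 WHERE name LIKE '%sync%'
   AND finished_at IS NOT NULL
 ORDER BY duration DESC
 LIMIT 20;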

Could you try going back to the most recent version of Pulpcore (the one with the slow query) and then making this adjustment? Works around a sync-time performance regression on PG12 by dralley · Pull Request #4609 · pulp/pulpcore · GitHub

New versions are posted which seem to resolve the issue. If you can apply that patch and let us know if it works, that’s great; otherwise, if you can upgrade eventually and let us know if it works, that’s great too.

Which RPMs exactly are the new versions?

@iballou Since the patch was added to the RPM directly, would that be 3.22.15-2?

Otherwise 3.22.18 definitely has it

Hi,

We just released the RPM for Pulpcore 3.28.19; this RPM brings the fix. It’s published in our repo now.

The fix for 3.22 should be released later today or early tomorrow.

Thanks


I can add to this: we use postgresql14 (from postgresql.org) on RHEL9, and after applying the 4275 patch we started to get very slow sync times for the RHEL8 EPEL repo. Other repos, for example RHEL9 EPEL, seem to work fine though.

After applying the PG12 patch the sync is down from 1-2hrs to 10min.

So this does not seem to apply only to PG12.


I should add that this was on vanilla pulp, not katello
