Slow content proxy sync

How many “versions” of the EPEL repo do you have? The slowdown could scale primarily with that factor rather than with the size of the repo itself.


O.K. I have reverted the change from pull request 4275, restarted everything on the content proxy, then published and promoted a new version of that CV which contains the repo 43e7b651-8399-4bda-b54e-900dea17af7e. EPEL 9 included a few new packages, so I suppose it is a comparable situation.

The sync after the promote finished took 6 minutes:

smart_proxy_history_id: 27081
pulp_tasks:
- pulp_href: "/pulp/api/v3/tasks/0fcc1838-9fa1-476c-a8d1-05238125df88/"
  pulp_created: '2023-09-25T15:03:04.556+00:00'
  state: completed
  name: pulp_rpm.app.tasks.synchronizing.synchronize
  logging_cid: b76c167a-0e17-4ce4-9ee2-6490599f0630
  started_at: '2023-09-25T15:03:04.637+00:00'
  finished_at: '2023-09-25T15:09:04.001+00:00'
  worker: "/pulp/api/v3/workers/975cceef-38ee-45d7-a358-a9f4ccdee680/"
  child_tasks: []
  progress_reports:
  - message: Un-Associating Content
    code: unassociating.content
    state: completed
    done: 14
  - message: Parsed Comps
    code: sync.parsing.comps
    state: completed
    total: 22
    done: 22
  - message: Parsed Advisories
    code: sync.parsing.advisories
    state: completed
    total: 6421
    done: 6421
  - message: Downloading Metadata Files
    code: sync.downloading.metadata
    state: completed
    done: 6
  - message: Downloading Artifacts
    code: sync.downloading.artifacts
    state: completed
    done: 0
  - message: Associating Content
    code: associating.content
    state: completed
    done: 30
  - message: Skipping Packages
    code: sync.skipped.packages
    state: completed
    total: 0
    done: 0
  - message: Parsed Packages
    code: sync.parsing.packages
    state: completed
    total: 37461
    done: 37461
  created_resources:
  - "/pulp/api/v3/repositories/rpm/rpm/43e7b651-8399-4bda-b54e-900dea17af7e/versions/240/"
  - "/pulp/api/v3/publications/rpm/rpm/bb39ffd8-eb4a-4b35-bc05-82b21d133240/"
  reserved_resources_record:
  - "/pulp/api/v3/repositories/rpm/rpm/43e7b651-8399-4bda-b54e-900dea17af7e/"
  - shared:/pulp/api/v3/remotes/rpm/rpm/6732d168-717d-47d1-9406-5fbb1fb52efd/
task_groups: []
poll_attempts:
  total: 43
  failed: 0

It was difficult to capture any selects with pg_stat_activity during that time because they finished so quickly. Here are two that I did capture:

SELECT "core_content"."pulp_id" FROM "core_content" WHERE ("core_content"."pulp_id" IN ('17a51ead-67c3-4bb1-9e9d-fa54cddcbb10'::uuid, 'd9330405-c55c-419a-bfe5-8c69e12737d6'::uuid) AND NOT ("core_content"."pulp_id" IN (SELECT V0."pulp_id" FROM "core_content" V0 INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id") WHERE V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '43e7b651-8399-4bda-b54e-900dea17af7e'::uuid AND U2."number" <= 240 AND NOT (U3."number" <= 240 AND U3."number" IS NOT NULL))))))

SELECT "core_content"."pulp_id" FROM "core_content" WHERE ("core_content"."pulp_id" IN ('04f64a3d-4053-4a04-a188-cba6457cd47e'::uuid, 'e28159a1-af0e-4793-acee-4eaa81e529b2'::uuid, '5ecb1955-fc1e-469e-a818-468682c8bd7e'::uuid, 'c74d591a-1717-4f32-ae89-2c6d4ea7cf05'::uuid, '9eae4837-a1e1-45db-9e31-d42f5ff8bc49'::uuid, '36df29b8-78b5-4065-b886-6f6fc405ee93'::uuid, 'f4be72a0-73c3-46af-97c9-5e20ea9d7242'::uuid, 'd74f6195-9569-44cb-b5da-4958dee4c289'::uuid) AND NOT ("core_content"."pulp_id" IN (SELECT V0."pulp_id" FROM "core_content" V0 INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id") WHERE V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '43e7b651-8399-4bda-b54e-900dea17af7e'::uuid AND U2."number" <= 240 AND NOT (U3."number" <= 240 AND U3."number" IS NOT NULL))))))

The EXPLAIN output of the first one is:

                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on core_content  (cost=37676.04..37683.94 rows=1 width=16)
   Recheck Cond: (pulp_id = ANY ('{17a51ead-67c3-4bb1-9e9d-fa54cddcbb10,d9330405-c55c-419a-bfe5-8c69e12737d6}'::uuid[]))
   Filter: (NOT (hashed SubPlan 1))
   ->  Bitmap Index Scan on core_content_pkey  (cost=0.00..8.86 rows=2 width=0)
         Index Cond: (pulp_id = ANY ('{17a51ead-67c3-4bb1-9e9d-fa54cddcbb10,d9330405-c55c-419a-bfe5-8c69e12737d6}'::uuid[]))
   SubPlan 1
     ->  Nested Loop  (cost=37663.86..37667.18 rows=1 width=16)
           ->  Nested Loop  (cost=37663.43..37666.72 rows=1 width=16)
                 ->  HashAggregate  (cost=37663.01..37663.02 rows=1 width=16)
                       Group Key: u0.pulp_id
                       ->  Nested Loop  (cost=4300.33..37663.00 rows=1 width=16)
                             Join Filter: (u0.version_added_id = u2.pulp_id)
                             ->  Hash Left Join  (cost=4300.33..37592.75 rows=1 width=32)
                                   Hash Cond: (u0.version_removed_id = u3.pulp_id)
                                   Filter: ((u3.number > 240) OR (u3.number IS NULL))
                                   ->  Bitmap Heap Scan on core_repositorycontent u0  (cost=4234.60..37403.04 rows=47232 width=48)
                                         Recheck Cond: (repository_id = '43e7b651-8399-4bda-b54e-900dea17af7e'::uuid)
                                         ->  Bitmap Index Scan on core_repositorycontent_repository_id_1946e61b  (cost=0.00..4222.79 rows=47232 width=0)
                                               Index Cond: (repository_id = '43e7b651-8399-4bda-b54e-900dea17af7e'::uuid)
                                   ->  Hash  (cost=43.10..43.10 rows=1810 width=20)
                                         ->  Seq Scan on core_repositoryversion u3  (cost=0.00..43.10 rows=1810 width=20)
                             ->  Seq Scan on core_repositoryversion u2  (cost=0.00..47.62 rows=1810 width=16)
                                   Filter: (number <= 240)
                 ->  Index Scan using core_repositorycontent_pkey on core_repositorycontent v1  (cost=0.43..3.71 rows=1 width=32)
                       Index Cond: (pulp_id = u0.pulp_id)
           ->  Index Only Scan using core_content_pkey on core_content v0  (cost=0.42..0.46 rows=1 width=16)
                 Index Cond: (pulp_id = v1.content_id)
(27 rows)
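
For reference, statements like the two above can be caught while a sync is running with a query along these lines against the pulpcore database (nothing Pulp-specific, just pg_stat_activity; the one-second threshold is arbitrary). Actual rather than estimated timings then come from running EXPLAIN (ANALYZE, BUFFERS) on the captured statement:

-- show statements that have been running for more than a second
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 120) AS query_text
  FROM pg_stat_activity
 WHERE datname = 'pulpcore'
   AND state = 'active'
   AND now() - query_start > interval '1 second'
 ORDER BY runtime DESC;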

Define “versions”. EPEL 9 is in four CVs, each CV is in two LEs, so a total of 8 “versions” of EPEL 9 are on the content proxy.

The repository 43e7b651-8399-4bda-b54e-900dea17af7e is the first one that gets published and promoted during my daily sync/publish/promote runs, i.e. that is the sync which “introduces” the new content to the proxy. The ones that follow seem to be O.K.; they even overlap with the “big” sync and finish before it ends. The content proxy is set to “On Demand”.


With the patch reverted, the sync is fast again. Yesterday and this morning, it took only a matter of minutes. Here is the output of the Pulp task from this morning that synced the repo to the content proxy:

smart_proxy_history_id: 27392
pulp_tasks:
- pulp_href: "/pulp/api/v3/tasks/856bb6ff-ebce-43d2-84ae-6fb25a2cde88/"
  pulp_created: '2023-09-27T02:37:24.115+00:00'
  state: completed
  name: pulp_rpm.app.tasks.synchronizing.synchronize
  logging_cid: d332f87c-0d68-4334-8162-fe554f069da4
  started_at: '2023-09-27T02:37:24.208+00:00'
  finished_at: '2023-09-27T02:45:41.275+00:00'
  worker: "/pulp/api/v3/workers/3a526597-8d15-422a-ac75-eb5dce815e81/"
  child_tasks: []
  progress_reports:
  - message: Downloading Metadata Files
    code: sync.downloading.metadata
    state: completed
    done: 6
  - message: Downloading Artifacts
    code: sync.downloading.artifacts
    state: completed
    done: 0
  - message: Associating Content
    code: associating.content
    state: completed
    done: 322
  - message: Skipping Packages
    code: sync.skipped.packages
    state: completed
    total: 0
    done: 0
  - message: Parsed Packages
    code: sync.parsing.packages
    state: completed
    total: 37768
    done: 37768
  - message: Parsed Comps
    code: sync.parsing.comps
    state: completed
    total: 22
    done: 22
  - message: Parsed Advisories
    code: sync.parsing.advisories
    state: completed
    total: 6432
    done: 6432
  - message: Un-Associating Content
    code: unassociating.content
    state: completed
    done: 4
  created_resources:
  - "/pulp/api/v3/repositories/rpm/rpm/43e7b651-8399-4bda-b54e-900dea17af7e/versions/241/"
  - "/pulp/api/v3/publications/rpm/rpm/b42a6709-6f5c-4ab0-a89d-9307110fc600/"
  reserved_resources_record:
  - "/pulp/api/v3/repositories/rpm/rpm/43e7b651-8399-4bda-b54e-900dea17af7e/"
  - shared:/pulp/api/v3/remotes/rpm/rpm/6732d168-717d-47d1-9406-5fbb1fb52efd/
task_groups: []
poll_attempts:
  total: 48
  failed: 0

So it seems to be related to the patch. I guess the new query either doesn’t account for some scenario where it can take a long time and requires a lot of sequential scans, or an index is missing?
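
For the record, the durations of these syncs can be compared over time directly from the proxy’s pulpcore database with something like the following. This is only a sketch: it assumes pulpcore’s core_task table as shipped with pulpcore 3.21, and column names may differ between releases:

-- rough history of recent RPM sync tasks and their durations
SELECT started_at,
       finished_at,
       finished_at - started_at AS duration,
       state
  FROM core_task
 WHERE name = 'pulp_rpm.app.tasks.synchronizing.synchronize'
 ORDER BY started_at DESC
 LIMIT 20;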


@gvde When @dralley (or other Pulp devs) ask about the number of versions in the repo, they mean the number of versions in the underlying Pulp repo. Your previous post actually gives a hint:

  created_resources:
  - "/pulp/api/v3/repositories/rpm/rpm/43e7b651-8399-4bda-b54e-900dea17af7e/versions/241/"

The versions/241/ part suggests that this RPM repo is on its 241st version within Pulp. That does not necessarily mean 241 repo versions still exist, since older versions may long since have been deleted, so 241 is a ceiling rather than a floor. If you do not remove old CV versions, however, it is entirely possible that Katello needs to keep all 241 versions around within the Pulp repository. I am not sure whether that is part of the problem, though.


@quba42 If you tell me the SQL query to check, I can run it. My guess is this:

pulpcore=# select * from core_repositoryversion where repository_id = '43e7b651-8399-4bda-b54e-900dea17af7e';
               pulp_id                |         pulp_created          |       pulp_last_updated       | number | complete | base_version_id |            repository_id             | info 
--------------------------------------+-------------------------------+-------------------------------+--------+----------+-----------------+--------------------------------------+------
 30c14099-7e98-41e0-bd97-318f1190731e | 2023-09-25 17:03:04.831039+02 | 2023-09-25 17:08:45.265643+02 |    240 | t        |                 | 43e7b651-8399-4bda-b54e-900dea17af7e | {}
 03eaf1d7-0042-4e0f-9e27-e7aa3d09f88e | 2023-09-27 04:37:24.502317+02 | 2023-09-27 04:45:20.588415+02 |    241 | t        |                 | 43e7b651-8399-4bda-b54e-900dea17af7e | {}
 19c601b1-ede7-4083-810c-ae4a944a16ac | 2022-09-28 15:08:13.525584+02 | 2022-09-28 15:08:13.5256+02   |      0 | t        |                 | 43e7b651-8399-4bda-b54e-900dea17af7e | {}
 374e64bb-f0e4-48db-962b-5fc91f0a98e1 | 2023-09-25 04:35:05.281957+02 | 2023-09-25 08:24:34.560124+02 |    239 | t        |                 | 43e7b651-8399-4bda-b54e-900dea17af7e | {}
 804a78e0-e288-49d2-ada8-00c906ed4306 | 2023-09-22 04:41:25.028418+02 | 2023-09-22 06:36:51.144996+02 |    238 | t        |                 | 43e7b651-8399-4bda-b54e-900dea17af7e | {}
(5 rows)

Is that what you were looking for? @dralley


I think this is what @dralley was looking for, and I think “5 rows” proves that the number of repository versions is not the issue here. So yes, you are on your 241st repository version in this Pulp repository, but all but the latest four have been deleted at some point. Apparently the 0th version, which does not contain any content, has also been kept. Don’t ask me why every Pulp repository comes with an empty 0th repository version. :wink:


Isn’t that obvious? It does an incremental sync, which needs to be seeded to get started, unless you want to implement condition checks everywhere to distinguish the initial sync from a normal sync.

And keeping it around would be useful if you want to do a new, full sync, for which you would simply delete all versions except version 0.

So to me the empty 0 version is a quite elegant solution… :wink:


So to me the empty 0 version is a quite elegant solution… :wink:

IIRC, version zero is always created because it made the logic easier if every repository always had at least one version, even if it is empty. In fact, you can delete the 0th version if you have other versions and don’t need an empty one, but it is helpful to use as a base for new versions with the copy API.

@gvde Can you tell me which version of PostgreSQL your system has installed? I’ve been trying to replicate your issue, but I am seeing completely different query plans on PostgreSQL 13. It might be that our optimization doesn’t work well on systems running older PostgreSQL versions.

It’s the standard PostgreSQL version that comes with EL8 and is installed by foreman-installer:

postgresql-server-12.15-1.module_el8.8.0+3600+38330172.x86_64
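
The same can also be confirmed from inside the database itself, which is plain PostgreSQL and needs nothing Foreman-specific:

SHOW server_version;
-- or, including build details:
SELECT version();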


What is the value of work_mem (see the PostgreSQL documentation, section 20.4 “Resource Consumption”, for the work_mem parameter)? Can we increase it? With this, we might enable PostgreSQL to make the optimization we want.

I have the default value:

foreman=# show work_mem;
 work_mem 
----------
 4MB
(1 row)
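
In case it helps anyone following along: work_mem can be raised for a single session while experimenting, or instance-wide without a restart. A minimal sketch (the 64MB figure is only an illustration, not a recommendation from this thread):

-- session-only, e.g. to re-run one EXPLAIN ANALYZE with more memory
SET work_mem = '64MB';

-- instance-wide, picked up after a configuration reload (no restart needed)
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();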

Yeah, that seems to be a standard value. I have the same value on a machine used for performance testing. I think we will be forced to revert the change if we cannot experiment more with PostgreSQL versions.

@lubosmj @iballou FYI: We at ATIX now have at least two internal systems that appear to be affected by this issue.

The systems in question are running foreman-3.5.3, katello-4.7.6, python39-pulpcore-3.21.17, and postgresql-12.15.

I have confirmed that this includes the change from pulp/pulpcore pull request #4275 (“Enhance get_content() by omitting the version_membership lookup” by lubosmj).
We are testing whether reverting this change solves the issue.

Feel free to reach out if you have any questions or ideas.

I’d like to test increasing work_mem before we try to revert anything. The patch did make significant improvements in many other contexts, and the 4 MB default value is known to be a bit low for the majority of users.

We should also start a conversation about upgrading PostgreSQL.

@dralley We can also test increasing work_mem on one of our affected systems. Do you have any suggestions for what a good test value might be?

What exactly is responsible for upgrading (or not upgrading) PostgreSQL within the Foreman context? Do we just get whatever is currently available for EL8?

@quba There’s an entry in the PostgreSQL 13 changelog that near-perfectly matches the situation:

https://www.postgresql.org/docs/release/13.0/

Allow hash aggregation to use disk storage for large aggregation result sets (Jeff Davis)

Previously, hash aggregation was avoided if it was expected to use more than work_mem memory. Now, a hash aggregation plan can be chosen despite that. The hash table will be spilled to disk if it exceeds work_mem times hash_mem_multiplier. This behavior is normally preferable to the old behavior, in which once hash aggregation had been chosen, the hash table would be kept in memory no matter how large it got — which could be very large if the planner had misestimated. If necessary, behavior similar to that can be obtained by increasing hash_mem_multiplier.

I am not sure I fully understood that changelog entry, but I am also not sure it “matches near-perfectly”. In our case, the affected systems never had problems with memory usage (or am I misunderstanding the implication of “the hash table would be kept in memory no matter how large it got”)?

Note also that we are currently testing an affected system with work_mem set to 40 MB, which does not appear to have helped; if anything, it has made things a little worse. By comparison, we have another Smart Proxy attached to the same Foreman instance where we reverted https://github.com/pulp/pulpcore/pull/4275, and that appears to have fixed the symptoms. I will know more tomorrow once the two Smart Proxy syncs have had a chance to finish completely.
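
One way to check whether work_mem is even the deciding factor for the captured statement, without touching the instance-wide setting, is a transaction-local experiment. A minimal sketch (the value is purely illustrative; note that the hash_mem_multiplier mentioned in the release notes only exists from PostgreSQL 13 onwards, so it is not an option on these PostgreSQL 12 systems):

BEGIN;
SET LOCAL work_mem = '256MB';   -- only affects this transaction
-- then re-run the captured statement with actual timings, e.g.:
--   EXPLAIN (ANALYZE, BUFFERS) SELECT "core_content"."pulp_id" FROM "core_content" WHERE ...;
ROLLBACK;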