Content proxy complete sync out-of-memory

gvde · February 18, 2026, 7:45am

I think it may be the size of metadata. The current filelists.xml file (unpacked) size for 8.x is 2.3G and references more than 21 million files. Some of the rpms are large and contain a lot of files.
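To check how file entries are distributed across packages, a streaming count over filelists.xml is one option. The following is only a sketch: it assumes the standard repodata filelists layout (`package` elements containing `file` children), and the inline sample data is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by the standard RPM repodata filelists.xml format.
NS = "{http://linux.duke.edu/metadata/filelists}"

# Made-up sample: two versions of the same package, each listing its files.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="2">
  <package pkgid="aaa" name="demo" arch="x86_64">
    <version epoch="0" ver="8.0.0" rel="1"/>
    <file>/usr/share/demo/a</file>
    <file>/usr/share/demo/b</file>
  </package>
  <package pkgid="bbb" name="demo" arch="x86_64">
    <version epoch="0" ver="8.0.1" rel="1"/>
    <file>/usr/share/demo/a</file>
  </package>
</filelists>
"""


def count_files_per_package(stream):
    """Stream a filelists document and count file entries per package name."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "package":
            counts[elem.get("name")] += len(elem.findall(NS + "file"))
            # Drop the already-counted children to keep memory use low.
            elem.clear()
    return counts


counts = count_files_per_package(io.BytesIO(SAMPLE))
for name, total in counts.most_common(10):
    print(name, total)
```

For a real repository, the same function can be fed a file object, for example one opened with `gzip.open(path, "rb")` on the compressed filelists file. Because counts are keyed by package name, many versions of one package add up, which is the effect described above.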

dralley · February 18, 2026, 9:12pm

The filelists are definitely the main contributor to the data explosion for this particular repo. There are basically a bunch of different versions of the same packages which each reference a huge number of files.

The sync of e.g. the Elastic 8.x repo uses about 6gb, and the publish about 5gb (on my system). That’s definitely “too much” but is still a lot less than 80gb.

Is that 80gb a peak system number? Do you have an idea what the contribution is from the worker processes vs. the API processes? And does that include multiple simultaneous syncs, or just the one?

Also Ian’s question is relevant:

Also - do you see the same or similar large memory consumption when syncing the Elastic repositories on the main Foreman server?

re: Ian, it sounds like it’s maybe not the “sync” itself that is setting it off, but rather the indexing that Katello does after the sync finishes, which requests all the metadata back from Pulp’s API. And then Pulp’s API is maybe leaking that memory or holding onto it for too long.

As a workaround I do heavily recommend using retain_package_versions since this repo has so many different old versions of elasticsearch which are likely not useful to you in any case. Would also save a lot of time on sync/publish.
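To illustrate what a retention limit does to a repo like this, here is a rough sketch. It is not Pulp's implementation: the grouping by name and architecture is an assumption, and versions are simplified to integer tuples, whereas real RPM version comparison uses epoch, version, and release with rpm's own ordering rules.

```python
from collections import defaultdict


def retain_newest(packages, keep):
    """Keep only the `keep` newest versions per (name, arch).

    `packages` is an iterable of (name, arch, version_tuple) entries.
    """
    groups = defaultdict(list)
    for pkg in packages:
        name, arch, _version = pkg
        groups[(name, arch)].append(pkg)

    retained = []
    for group in groups.values():
        # Newest first, using plain tuple ordering as a stand-in for rpm's rules.
        group.sort(key=lambda p: p[2], reverse=True)
        retained.extend(group[:keep])
    return retained


# Made-up package list for illustration.
packages = [
    ("elasticsearch", "x86_64", (8, 0, 0)),
    ("elasticsearch", "x86_64", (8, 2, 0)),
    ("elasticsearch", "x86_64", (8, 1, 0)),
    ("kibana", "x86_64", (8, 0, 0)),
]
print(retain_newest(packages, keep=1))
```

Every dropped version also drops its file list from the metadata, which is where the savings on sync and publish come from.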

(of course we should still deal with scenarios like this better anyway!)

re: “No declared artifact with relative path”, I think that happens when Pulp ends up renaming a package that isn’t using the standard naming scheme. In any event it’s coming from here

dralley · February 19, 2026, 5:55am

So I’ve just published this PR which reduces the worst-case memory consumption quite dramatically for syncs (and moderately for publishes)

Syncing the Elastic 8.x repo for example dropped from 6.1gb max RSS to around 450mb max RSS, and the performance overhead seems very low.

That doesn’t resolve the API memory issues, if they exist, but it’s a nice step and not a complex patch.

github.com/pulp/pulp_rpm

Reduce worst-case memory consumption for sync and publish

main ← dralley:memory-use-worst-case

opened 05:22AM - 19 Feb 26 UTC

dralley

+24 -11

Use string internment / caching to improve worst-case memory consumption during sync, by exploiting refcounting. Reduce the batch size for publish operations to likewise improve worst-case memory consumption. closes #4086
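The string interning idea from the PR description can be illustrated in isolation. This is a generic sketch, not the code from the PR: the function names and sample paths are made up. It shows that equal strings produced by parsing are normally separate objects, while interning makes them share one object, so the duplicates lose their last reference and are freed by refcounting.

```python
import sys


def parse_dirs(paths, intern=False):
    """Return the directory part of each path, optionally interned."""
    dirs = []
    for path in paths:
        dirname = path.rpartition("/")[0]  # a fresh str object per call
        dirs.append(sys.intern(dirname) if intern else dirname)
    return dirs


def distinct_objects(strings):
    """Count how many separate str objects the list actually holds."""
    return len({id(s) for s in strings})


# Made-up file list: many files sharing one directory, as in a large RPM.
paths = [f"/usr/share/elasticsearch/modules/file{i}.jar" for i in range(1000)]

plain = parse_dirs(paths)
shared = parse_dirs(paths, intern=True)

print("objects without interning:", distinct_objects(plain))
print("objects with interning:   ", distinct_objects(shared))
```

The two lists compare equal; only the number of underlying objects differs. The saving grows with how often the same directory or file name repeats across packages and versions.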

gvde · February 19, 2026, 7:14am

I am syncing two LEs with 9 CVs each; each contained Elastic 7, 8 and 9 repositories. (I have removed 7.x now.)

gvde · February 19, 2026, 7:52am

I have just run a complete sync of elastic 8.x on the main server.

A pulpcore-worker process went up to ~6 GB RSS during Sync, ~5 GB RSS during CreatePublication.

In addition, a pulpcore-api process went up to ~7.5 GB RSS during IndexContent.

The 80 GB peak was seen with top on my content proxy during a complete sync. I would have to start another run to monitor closely which processes use how much memory at what time. Do you want me to? It would take some preparation, and the picture is obviously not as clear there since a lot of syncs run in parallel.
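For a closer look at which process uses how much memory, per-process numbers can be read from /proc on Linux. The following is a minimal sketch, assuming the relevant processes have "pulpcore" in their command line; the function names are made up. VmRSS is the current resident set size and VmHWM the peak, both in kB.

```python
import os


def parse_status(text):
    """Extract Name, VmRSS and VmHWM (in kB) from /proc/<pid>/status contents."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            info["name"] = value.strip()
        elif key in ("VmRSS", "VmHWM"):
            info[key] = int(value.split()[0])
    return info


def snapshot(match="pulpcore"):
    """Return (pid, cmdline, rss_kb, peak_kb) for processes whose cmdline contains `match`."""
    rows = []
    if not os.path.isdir("/proc"):
        return rows  # not a Linux-style system
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as fh:
                cmdline = fh.read().replace(b"\0", b" ").decode(errors="replace")
            if match not in cmdline:
                continue
            with open(f"/proc/{entry}/status") as fh:
                info = parse_status(fh.read())
        except OSError:
            continue  # process exited or is not readable
        rows.append(
            (int(entry), cmdline.strip(), info.get("VmRSS", 0), info.get("VmHWM", 0))
        )
    return rows


if __name__ == "__main__":
    for pid, cmdline, rss_kb, peak_kb in snapshot():
        print(f"{pid} rss={rss_kb / 1024:.0f}MB peak={peak_kb / 1024:.0f}MB {cmdline[:60]}")
```

Running this from a loop or a timer during a sync and logging the output with timestamps would show which worker or API process grows at which stage.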

dralley · February 20, 2026, 5:26pm

That’s about what I would have expected then (in line with what I was seeing, I mean, not that it’s not too high)

That also matches what Ian was seeing

No need, multiple parallel repository syncs + indexing being that expensive would explain the 80gb. I’m working on a big improvement for the sync memory use now. I think we have enough info to work with.

dralley · February 24, 2026, 4:02pm

Here are some numbers for the memory use measured before / after the PR for various repos (a few normal, a few pathological).

The API mem leak concern is still an issue I suppose but we can take a look at that separately, and maybe it will be less of an issue once the task overhead is decreased

github.com/pulp/pulp_rpm

Comment by dralley - Reduce worst-case memory consumption for sync and publish

main ← dralley:memory-use-worst-case

Final numbers:

(pathological cases)

OL9 Sync: 12.7gb => 820mb
OL9 Publish: 10.4gb => 3.5gb
Elastic 8.x Sync: 6.2gb => 580mb
Elastic 8.x Publish: 5gb => 3gb
Gcloud SDK Sync: 5.1gb => 423mb
Gcloud SDK Publish: 3.9gb => 1.86gb

(minimal difference expected)

Fedora 42 release Sync: 765mb => 660mb
Fedora 42 release Publish: 925mb => 862mb
Fedora 42 updates Sync: 1.2gb => 1.2gb
Fedora 42 updates Publish: 800mb => 600mb
EL8 BaseOS Sync: 1.8gb => 1.5gb
EL8 BaseOS Publish: 2.38gb => 1.04gb
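To compare the before / after figures on one scale, the relative reduction can be computed as below. This is just arithmetic on the numbers quoted above; the helper names are made up, and treating 1 gb as 1024 mb is an assumption, since the post does not say which convention was used.

```python
def to_mb(value):
    """Convert strings like '6.2gb' or '580mb' to megabytes (assuming 1 gb = 1024 mb)."""
    value = value.strip().lower()
    number, unit = value[:-2], value[-2:]
    factor = {"mb": 1, "gb": 1024}[unit]
    return float(number) * factor


def reduction_percent(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100 * (1 - to_mb(after) / to_mb(before))


# Sync figures as quoted in the PR comment.
SYNC_RESULTS = {
    "OL9": ("12.7gb", "820mb"),
    "Elastic 8.x": ("6.2gb", "580mb"),
    "Gcloud SDK": ("5.1gb", "423mb"),
    "Fedora 42 release": ("765mb", "660mb"),
    "Fedora 42 updates": ("1.2gb", "1.2gb"),
    "EL8 BaseOS": ("1.8gb", "1.5gb"),
}

for name, (before, after) in SYNC_RESULTS.items():
    print(f"{name} Sync: {reduction_percent(before, after):.0f}% lower")
```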