Content proxy complete sync out-of-memory

I think it may be the size of the metadata. The unpacked filelists.xml for 8.x is currently 2.3G and references more than 21 million files. Some of the RPMs are large and contain a lot of files.

The filelists are definitely the main contributor to the data explosion for this particular repo. There are basically a bunch of different versions of the same packages which each reference a huge number of files.
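A rough way to check this against the metadata is to stream filelists.xml and count the file entries per package name. This is only a sketch: it assumes an already-decompressed file in the standard repodata filelists format, and the function name `count_files_per_package` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by repodata filelists.xml documents.
NS = "{http://linux.duke.edu/metadata/filelists}"


def count_files_per_package(source):
    """Stream a filelists.xml (path or binary file object) and
    count <file> entries per package name, summed over all versions."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "package":
            counts[elem.get("name")] += len(elem.findall(NS + "file"))
            # Drop the <file> children we just counted. The emptied
            # <package> element itself stays attached to the root.
            elem.clear()
    return counts
```

Something like `count_files_per_package("filelists.xml").most_common(10)` (the path is a placeholder) should show which package names account for most of the entries.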

The sync itself of e.g. the elastic 8.x repo uses about 6 GB, and the publish about 5 GB (on my system). That’s definitely “too much” but is still a lot less than 80 GB.

Is that 80 GB a system-wide peak? Do you have an idea what the contribution is from the worker processes vs. the API processes? And does that include multiple simultaneous syncs, or just the one?

Ian’s question is also relevant:

“Also - do you see the same or similar large memory consumption when syncing the Elastic repositories on the main Foreman server?”

re: Ian, it sounds like it may not be the “sync” itself that sets it off, but rather the indexing that Katello does after the sync finishes, when it requests all the metadata back from Pulp’s API. Pulp’s API may then be leaking that memory or holding onto it for too long.

As a workaround I strongly recommend using retain_package_versions, since this repo has so many old versions of elasticsearch which are likely not useful to you in any case. It would also save a lot of time on sync/publish.

(of course we should still deal with scenarios like this better anyway!)
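To illustrate what a setting like retain_package_versions does conceptually, here is a minimal sketch that keeps only the newest N versions of each package. It is not Pulp’s implementation: the function name `retain_newest` is hypothetical, grouping by name and architecture is an assumption, and versions are assumed to be purely numeric dotted strings, whereas real RPM comparison (epoch, version, release) is more involved.

```python
from collections import defaultdict


def retain_newest(packages, keep):
    """Keep the `keep` newest versions per (name, arch).

    `packages` is an iterable of (name, arch, version) tuples, where
    version is a purely numeric dotted string such as "8.11.3".
    """
    grouped = defaultdict(list)
    for name, arch, version in packages:
        grouped[(name, arch)].append(version)

    def version_key(version):
        # Simplified ordering: compare dotted components as integers.
        return tuple(int(part) for part in version.split("."))

    retained = []
    for (name, arch), versions in sorted(grouped.items()):
        newest_first = sorted(versions, key=version_key, reverse=True)
        for version in newest_first[:keep]:
            retained.append((name, arch, version))
    return retained
```

With a repo that carries many old releases of the same package, a small `keep` value shrinks the package set, and with it the filelists metadata, considerably.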

re: “No declared artifact with relative path”, I think that happens when Pulp ends up renaming a package that isn’t using the standard naming scheme. In any event it’s coming from here

So I’ve just published this PR which reduces the worst-case memory consumption quite dramatically for syncs (and moderately for publishes)

Syncing the Elastic 8.x repo, for example, dropped from 6.1 GB max RSS to around 450 MB max RSS, and the performance overhead seems very low.

That doesn’t resolve the API memory issues, if they exist, but it’s a nice step and not a complex patch.

I am syncing two lifecycle environments (LE) with 9 content views (CVs) each, each containing the elastic 7, 8 and 9 repositories. (I have removed 7.x now.)

I have just run a complete sync of elastic 8.x on the main server.

A pulpcore-worker process went up to ~6 GB RSS during Sync, ~5 GB RSS during CreatePublication.

In addition, a pulpcore-api process went up to ~7.5 GB RSS during IndexContent.

The 80 GB peak was seen with top on my content proxy during a complete sync. I would have to start another run to monitor closely which processes use what memory at what time. Do you want me to? It would take some preparation, and it would obviously not be as clear since there are a lot of syncs in parallel.
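For anyone who does want to track which processes use what memory, a small script reading /proc can record current and peak RSS per process. This is a Linux-only sketch; the helper names `parse_status` and `snapshot` are made up here, and matching on the string "pulpcore" in the command line is an assumption about how the processes are named on a given system.

```python
import os


def parse_status(text):
    """Extract Name, VmRSS and VmHWM (peak RSS) from the text of
    /proc/<pid>/status. Sizes are returned in kB."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            info["name"] = value.strip()
        elif key in ("VmRSS", "VmHWM"):
            # The value looks like "   123456 kB".
            info[key] = int(value.split()[0])
    return info


def snapshot(match="pulpcore"):
    """Return status info for every process whose command line
    contains `match`."""
    results = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                raw = f.read().replace(b"\0", b" ")
            cmdline = raw.decode(errors="replace").strip()
            if match not in cmdline:
                continue
            with open(f"/proc/{pid}/status") as f:
                info = parse_status(f.read())
        except OSError:
            continue  # process exited or is not readable
        info.update(pid=int(pid), cmdline=cmdline)
        results.append(info)
    return results
```

Calling `snapshot()` periodically during a sync and logging the result with a timestamp would show how much the worker and API processes each contribute at any given moment.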

That’s about what I would have expected then (I mean it’s in line with what I was seeing, not that it isn’t too high).

That also matches what Ian was seeing.

No need. Multiple parallel repository syncs plus indexing being that expensive would explain the 80 GB. I’m working on a big improvement for the sync memory use now. I think we have enough info to work with.

Here are some numbers for the memory use measured before and after the PR for various repos (a few normal, a few pathological).

The API memory leak concern is still an issue, I suppose, but we can take a look at that separately, and it may be less of an issue once the task overhead is decreased.
