Content proxy complete sync out-of-memory

Problem:

Because I had some 404s on the content proxy due to some deleted repositories, I started a complete sync for my content proxy. This, however, ended badly, running out of memory on my content proxy VM, which had 32 GB of memory assigned. Even after a reboot, the task continued and took the server down again. I have increased memory to 64 GB and so far it now seems to keep running. Still, I can see that used memory goes up to ~50 GB at times.

ps showed that I have a couple of pulpcore-worker processes which are very busy, each allocating 6-7 GB at peak times.
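(If anyone wants to double-check the same thing, a plain ps sorted by resident set size is enough; this is a generic command, nothing specific to my setup:)

# ps -eo pid,rss,comm,args --sort=-rss | grep -i pulp | head -n 20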

What I found confusing, though, is the number of pulpcore-worker processes: 17. This comes from the pulpcore-content.service unit file which has:

# systemctl cat pulpcore-content.service
# /etc/systemd/system/pulpcore-content.service
[Unit]
Description=Pulp Content App
Requires=pulpcore-content.socket
After=network.target
Wants=postgresql.service
...
ExecStart=/usr/bin/pulpcore-content \
          --preload \
          --timeout 90 \
          --workers 17 \
          --access-logfile -
...

It’s the same on my other content proxy and on my main server. Somehow I had it in my mind that there would usually be only 8 pulpcore workers, not 17.

I have checked the answers file as well as the foreman-installer --full-help output:

  pulpcore_worker_count: 8

It’s set to 8 everywhere I look. So why does it start 17 workers if it’s only supposed to use 8? Or is that a different option? With a maximum of 8, I guess the 32 GB might have been enough…
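For completeness, this is roughly how I checked it; I’m assuming the usual answers-file location under /etc/foreman-installer/scenarios.d/, which may differ on other setups:

# grep -R pulpcore_worker_count /etc/foreman-installer/scenarios.d/
# foreman-installer --full-help | grep -i worker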

Seeing my thread from September, “Content Proxy out of memory”, where I mention it’s only 8 workers, I’m starting to think that something is going wrong.

I think it comes from the template in theforeman/puppet-pulpcore on GitHub, templates/pulpcore-content.service.erb (at commit 7f81e6b9ae5cf033226d5b1dc4c0407b4fc566f2), which initializes the worker count to 17 if you have more than 8 CPUs on the server (which I do).
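As a rough sketch of what I think the template does (my reading of the ERB, not verified line by line): the usual 2*CPUs+1 gunicorn heuristic seems to be applied with a hard cap, which is why every host with enough CPUs ends up with 17:

# Rough sketch of the assumed calculation, not the actual ERB code
cpus=$(nproc)
workers=$(( cpus * 2 + 1 ))
[ "$workers" -gt 17 ] && workers=17
echo "content app workers: $workers"   # 17 on hosts with 8 or more CPUs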

Expected outcome:
No OOM and I guess only 8 pulpcore-workers running.

Foreman and Proxy versions:
Running 3.13/4.15 with the current release.

foreman-installer-3.13.0-1.el9.noarch
foreman-installer-katello-3.13.0-1.el9.noarch
foreman-proxy-3.13.0-1.el9.noarch
foreman-proxy-content-4.15.0-1.el9.noarch
katello-certs-tools-2.10.0-1.el9.noarch
katello-client-bootstrap-1.7.9-2.el9.noarch
katello-common-4.15.0-1.el9.noarch
katello-host-tools-4.4.0-2.el9.noarch
katello-host-tools-tracer-4.4.0-2.el9.noarch
pulpcore-obsolete-packages-1.2.0-1.el9.noarch
pulpcore-selinux-2.0.1-1.el9.x86_64
python3.11-pulp-ansible-0.22.4-1.el9.noarch
python3.11-pulp-container-2.22.1-1.el9.noarch
python3.11-pulpcore-3.63.11-1.el9.noarch
python3.11-pulp-deb-3.5.1-1.el9.noarch
python3.11-pulp-glue-0.31.0-1.el9.noarch
python3.11-pulp-python-3.12.6-1.el9.noarch
python3.11-pulp-rpm-3.27.2-1.el9.noarch
rubygem-foreman_maintain-1.8.1-2.el9.noarch
rubygem-smart_proxy_pulp-3.4.0-1.fm3_13.el9.noarch

Distribution and version:
AlmaLinux 9

I can confirm that with 32 GB of RAM, the sync of the smart proxy (even an optimized sync) sometimes ends in an out-of-memory kill. I recently added swap to two smart proxies to mitigate the situation. I think 32 GB should be enough, too, but there is probably something going wrong.

Which version on which platform? How many running pulpcore workers do you see?

The official system requirements are:

A minimum of 12 GB RAM is required for Smart Proxy server to function. In addition, a minimum of 4 GB RAM of swap space is also recommended. Smart Proxy running with less RAM than the minimum value might not operate correctly.

Last time it happened was on 3.12/4.14 on RHEL9. I see 4 pulpcore-workers.

O.K. I confused the content service workers with the pulpcore workers.

# systemctl status pulpcore-worker@*.service

shows me the 8 configured pulpcore workers.

# systemctl status pulpcore-content.service

shows me the pulpcore-content “app” workers. Those are 17 and that number doesn’t seem to be configurable with foreman-installer at the moment.

But those are the processes which use up all the memory, because each one can use 6 GB or more, which is far too much if you are running 17 worker processes.

So it’s the content worker processes, not the pulpcore workers, as I wrote initially…
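For anyone else who mixes them up, this is the quickest way I found to count both kinds; the second command simply counts the children of the content app’s main process, assuming gunicorn forks its workers that way:

# systemctl list-units 'pulpcore-worker@*' --no-legend | wc -l
# pgrep -c -P "$(systemctl show -p MainPID --value pulpcore-content.service)"

The first line counts the task workers configured via pulpcore_worker_count, the second the pulpcore-content app workers.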

I think that when processes are dying, you should see the large memory usage with top or ps aux.
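Something along these lines should work, i.e. watching the resident sizes live and, after a crash, checking the kernel log to confirm which process the OOM killer picked:

# watch -n5 "ps -eo pid,rss,args --sort=-rss | grep -i pulp | head"
# journalctl -k | grep -i 'out of memory'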

@katello Do any of the developers have an idea? I have no way to make a complete sync to the proxy, so I just have to hope that the standard sync fills in any gaps there may be…

The number 17 is an intentional upper limit on the number of content workers that the installer sets. I’m not sure why the content workers are eating so much memory during capsule syncs.

@dralley Any pointers? Would reducing the number of workers help?

We just merged a handful of patches to reduce memory consumption, especially for AppStream repos (anything with modules, really) and cases where ACS is being used. The peak memory use was significantly reduced.

Hello, I would like to ask in which release of Katello these patches will be available? Thank you.

@iballou ^

If the patch mentioned above is “Large memory spike at the very end of repository sync” (pulp/pulp_rpm issue #3311 on GitHub), then Katello 4.17+ has the fix.

@Odilhao note for later: we should probably bump pulp-rpm in the 3.63 repo to 3.27.6+


I tried to run a full sync to my proxy today and, again, had to reboot the proxy a couple of times, desperately trying to cancel the sync somehow, because within a minute after each reboot the proxy had used all available 32 GB of RAM and become unresponsive…

foreman-3.16.2-1.el9.noarch
katello-4.18.1-1.el9.noarch

I’ve increased the RAM of the VM from 32 to 128 GB, and now the full sync was able to finish. During the sync, I noticed memory usage of up to 80 GB.

I have also noticed that it seems to be related to syncing the Elastic repositories for 7.x, 8.x, and 9.x (https://artifacts.elastic.co/packages/7.x/yum). At least, I have identified those repository_ids in the task, and after it finished there were a lot of messages like this in /var/log/messages:

Feb 12 18:26:20 foreman8-content pulpcore-worker-2[53485]: pulp [fba33e89-b21c-45bb-9a93-3233c7d6ba29]: pulpcore.plugin.stages.artifact_stages:WARNING: No declared artifact with relative path 'metricbeat-8.4.2-1.aarch64.rpm' for content '(UUID('018d10fb-8027-7374-9c16-e28faff37f8a'), 'metricbeat', '0', '8.4.2', '1', 'aarch64', 'sha1', '2af1508ea363330c3f2203b6236abc03c876b642')' from remote '1-alma10-Testing-f2ba9ce4-60c1-474f-9b9c-b65d519da357'. Using last from available-paths : 'metricbeat-8.4.2-aarch64.rpm'
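(Counting those warnings gives a rough idea of how many packages were affected; the message text and path are taken from the log line above:)

# grep -c 'No declared artifact with relative path' /var/log/messages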

Since you’re on Katello 4.18, you should already have the patch mentioned above. It was included in pulp-rpm 3.29.2 and above.

Is it the first time you’ve seen this elastic repo capsule sync take up so much memory?

The best we could probably do for now is follow up with a Pulp bug. Are those packages available generally, or are they locked down? If I try to browse https://artifacts.elastic.co/packages/ at least I get 404s.

Also - do you see the same or similar large memory consumption when syncing the Elastic repositories on the main Foreman server?

In case you have memory profiling of your Foreman server/proxy set up, we could potentially check the memory peaks and match them up with certain actions in the Dynflow output. That might tell us if it’s syncing, distribution, etc.

I haven’t done a full capsule sync since I started this thread. I noticed something missing a few days ago and then started the full sync.

The packages are available; you just cannot browse through the directories. For instance, the repomd for Elastic 7.x is at https://artifacts.elastic.co/packages/7.x/yum/repodata/repomd.xml

Replace 7.x with 8.x or 9.x for the other versions. I hope you can reproduce it. Otherwise, I could start another full sync with 128 GB of RAM assigned; that seems to be enough to get it working, and I could check during the sync what is running when the memory usage is high.
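A quick way to check that the repodata is reachable without browsing is to hit those URLs directly:

# for v in 7.x 8.x 9.x; do curl -sI "https://artifacts.elastic.co/packages/$v/yum/repodata/repomd.xml" | head -n1; done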

Does Pulp have some monitoring API calls where I could check exactly which processes are running, what they are doing, and how much memory they use?


Glad to see the packages are public, I’ll try syncing them.

For diagnostics, there is this: https://pulpproject.org/pulpcore/docs/dev/learn/tasks/diagnostics/

It does seem to add quite a bit of overhead to the actions and requires passing in X-Task-Diagnostics, so it’s better used in development instead of production.
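As a rough illustration only: dispatching a sync directly against the Pulp API with that header could look something like the following. The certificate paths and the “memory” diagnostic name are assumptions about a typical Katello deployment, and <repo-uuid> is a placeholder, so double-check everything against the docs above.

# curl -s --cert /etc/pki/katello/certs/pulp-client.crt \
       --key /etc/pki/katello/private/pulp-client.key \
       -H 'Content-Type: application/json' \
       -H 'X-Task-Diagnostics: memory' \
       -X POST "https://$(hostname -f)/pulp/api/v3/repositories/rpm/rpm/<repo-uuid>/sync/" \
       -d '{}'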

If I do reproduce the issue, I’ll poke around and see if I can figure out which Pulp action is responsible for the majority of the memory use. Then a productive Pulp bug can be filed.

I just tried an initial sync of the 3 repositories at once. There is something suspicious about these repositories, because they are quite small: the largest has only 2,500 packages. Yet syncing Elastic 8.x alone took about 5 minutes with what appeared to be decently high memory usage.

In the end, I was able to sync them on a nightly system with only 23 GB of RAM by using the on-demand download policy and content-only repodata mirroring.

The smart proxy story is a bit different since there is a separate repo for each lifecycle environment. How many lifecycle environments are you syncing to your smart proxy? Within those lifecycle environments, how many content views? I’m looking for the number of elastic repositories that may be getting synced simultaneously.

Interesting: I synced all 3 repositories and noticed that pulpcore-api was holding on to about 30 percent of my system’s memory. Afterwards, I synced just the 8.x Elastic repo, and my system is now running out of memory and becoming unstable.

Syncing these small repositories should be no issue. I wonder if their repodata is in some format that is really disagreeing with Pulp. Anyway, I think I have enough to file an issue.

I will add that I’m testing with a newer Pulp than what you have (pulp-rpm 3.32). I’ll note that performance issues were also noticed on pulp-rpm 3.29 (I realize I don’t have your z-version).

Edit: the OOM killer was provoked after just syncing the 8.x repo after syncing all 3. Interesting.

Edit 2: Ah, it turns out there is already a GitHub bug specifically about Oracle repositories that includes a note about Elastic and some others. I’ll add more details there.

It looks like the suggested workaround is to use the additive mirroring policy with “retain package versions” set to some value.
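If someone wants to try that per repository from the CLI, it would presumably look something like the line below; I have not verified the exact option names, so check them against hammer repository update --help first:

# hammer repository update --id <repository-id> --mirroring-policy additive --retain-package-versions-count 2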