Katello 4.0 RC2 OOM-kill during large Yum repo sync

Problem:
When Katello is synchronising a large Yum repository (Oracle Linux 7 ‘latest’), the server runs out of memory, and the ‘rq’ process is killed by the OOM killer
Expected outcome:
The sync process completes without consuming excessive amounts of memory.
Foreman and Proxy versions:
foreman-2.4.0-0.4.rc2.el8.noarch
Foreman and Proxy plugin versions:
katello-4.0.0-0.9.rc2.el8.noarch
Distribution and version:
Oracle Linux 8.3
Other relevant data:
The ‘rq’ process performing the sync reaches ~9.3 GiB of RAM usage on the 20 GiB VM, at which point the OOM killer kills it and the sync does not complete.
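
For reference, the kill itself is visible in the kernel log; something along these lines pulls out the relevant entries (plain commands, nothing Katello-specific assumed):

journalctl -k --since "2 hours ago" | grep -iE 'out of memory|oom-kill'
# or, if the journal isn't persistent:
dmesg -T | grep -iE 'out of memory|oom-kill'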

Hi @John_Beranek,

Can we see your Pulp versions? pip3 list | grep pulp

Sure

pulp-certguard (1.1.0)
pulp-container (2.2.1)
pulp-deb (2.8.0)
pulp-file (1.5.0)
pulp-rpm (3.9.0)
pulpcore (3.9.1)

Can I see some information about your sync from the dynflow console? I’m wondering if we can see at what point specifically it OOM’d. Any information from the expanded action in dynflow would be great. It looks like it was a pulpcore worker (running as ‘rq’) that was taking up the memory.
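
If it’s easier than digging through dynflow, watching the worker processes while the sync runs should also show which one is growing. A rough sketch (exact process names can vary between installs):

# show the biggest processes by resident memory, filtered to rq/pulpcore workers
watch -n5 'ps -eo pid,rss,etime,args --sort=-rss | grep -E "[r]q|[p]ulpcore" | head'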

Also, can I see your createrepo_c versions please?

rpm -qa | grep createrepo_c

So, it’s dying in action 4 below:

---
pulp_tasks:
- pulp_href: "/pulp/api/v3/tasks/b3d232dd-9617-4bbb-b0ca-a22d1e8ef695/"
  pulp_created: '2021-03-24T22:00:13.867+00:00'
  state: failed
  name: pulp_rpm.app.tasks.synchronizing.synchronize
  logging_cid: da6daf5ef4c04911bc137aa769846b77
  started_at: '2021-03-24T22:00:14.019+00:00'
  finished_at: '2021-03-24T22:01:40.910+00:00'
  error:
    traceback: ''
    description: None
  worker: "/pulp/api/v3/workers/db5703bf-2b17-4ac9-9c6b-33d2b2d28bc6/"
  child_tasks: []
  progress_reports:
  - message: Downloading Metadata Files
    code: downloading.metadata
    state: running
    done: 5
  - message: Downloading Artifacts
    code: downloading.artifacts
    state: running
    done: 0
  - message: Associating Content
    code: associating.content
    state: running
    done: 3999
  - message: Un-Associating Content
    code: unassociating.content
    state: running
    done: 0
  - message: Parsed Comps
    code: parsing.comps
    state: completed
    total: 96
    done: 96
  - message: Parsed Advisories
    code: parsing.advisories
    state: completed
    total: 4914
    done: 4914
  - message: Parsed Packages
    code: parsing.packages
    state: running
    total: 22179
    done: 0
  created_resources:
  - ''
  reserved_resources_record:
  - "/pulp/api/v3/repositories/rpm/rpm/e3989c09-ead9-4b08-8173-356bacb6c306/"
  - "/pulp/api/v3/remotes/rpm/rpm/08aab7ad-6ca9-48a1-b6b0-8e72d0bdb98b/"
create_version: true
task_groups: []
poll_attempts:
  total: 26
  failed: 1
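
In case it’s useful to anyone else debugging this, the same task record can be pulled straight from the Pulp API. This is only a sketch: the client certificate paths below are an assumption about where Katello keeps them on my box.

# query the failed Pulp task directly (cert/key paths are an assumption)
curl -s --cert /etc/pki/katello/certs/pulp-client.crt \
     --key /etc/pki/katello/private/pulp-client.key \
     "https://$(hostname -f)/pulp/api/v3/tasks/b3d232dd-9617-4bbb-b0ca-a22d1e8ef695/" \
  | python3 -m json.tool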

Also, can I see your createrepo_c versions please?

$ rpm -qa | grep createrepo_c
python3-createrepo_c-0.17.1-1.el8.x86_64
createrepo_c-libs-0.17.1-1.el8.x86_64

Out of interest I threw extra RAM at the server and, with only one Pulp worker running at the time, the initial repo sync did complete. The worker topped out at around 9.8 GiB of RAM usage.
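
“Only one worker running” was just a matter of stopping the extra worker units for the duration of the sync. A sketch, assuming the stock pulpcore-worker@ template units and eight workers:

# stop all workers except @1 while the sync runs (worker count is an assumption)
systemctl stop pulpcore-worker@{2..8}
systemctl --all | grep pulpcore-worker   # confirm only @1 is still active
# start them again afterwards: systemctl start pulpcore-worker@{2..8}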

I’m having the same issue, though I’m on Katello 3.18.2 and Foreman 2.3.3 and can’t throw more memory at the problem just yet. Trying to sync the RHEL 7 base repo, even with 8 GB of free memory and no other sync running, it runs out of memory and the process is killed as John described, so the sync fails.


20 GB is a bit low for that. I have OL7/OL8 and some custom repos around. My Katello/Foreman VM requires around 32 GB of RAM to survive for at least a few days; after that it goes OOM and services need to be restarted. Give it more memory and let Foreman restart every 3-4 days.
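
If you go that route, a cron entry is enough to keep it ticking over. A sketch only (adjust the schedule to taste; I’m assuming foreman-maintain is available, which handles stopping and starting the services for you):

# /etc/cron.d/foreman-restart -- restart the stack early on Sunday mornings
0 3 * * 0  root  foreman-maintain service restart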

Synchronising a repository shouldn’t take almost 10 GiB of RAM, and no service should need frequent restarts to keep it from running out of memory due to leaks…

Publishing the CV containing this large repository is also having issues: the publish pauses part-way through with a proxy error:

[Mon Mar 29 07:53:44.461278 2021] [proxy_http:error] [pid 547921] (20014)Internal error (specific information not available): [client 10.12.34.226:41282] AH01102: error reading status line from remote server httpd-UDS:0
[Mon Mar 29 07:53:44.461370 2021] [proxy:error] [pid 547921] [client 10.12.34.226:41282] AH00898: Error reading from remote server returned by /pulp/api/v3/content/rpm/packages/
[Mon Mar 29 08:18:22.757667 2021] [proxy_http:error] [pid 1301] (20014)Internal error (specific information not available): [client 10.12.34.226:60306] AH01102: error reading status line from remote server httpd-UDS:0
[Mon Mar 29 08:18:22.768418 2021] [proxy:error] [pid 1301] [client 10.12.34.226:60306] AH00898: Error reading from remote server returned by /pulp/api/v3/content/rpm/packages/
[Mon Mar 29 08:22:30.693893 2021] [proxy_http:error] [pid 1297] (20014)Internal error (specific information not available): [client 10.12.34.226:60392] AH01102: error reading status line from remote server httpd-UDS:0
[Mon Mar 29 08:22:30.693964 2021] [proxy:error] [pid 1297] [client 10.12.34.226:60392] AH00898: Error reading from remote server returned by /pulp/api/v3/content/rpm/packages/

10.12.34.226 - - [29/Mar/2021:07:52:51 +0000] "GET /pulp/api/v3/content/rpm/packages/?arch__ne=src&fields=pulp_href&limit=2000&offset=0&repository_version=%2Fpulp%2Fapi%2Fv3%2Frepositories%2Frpm%2Frpm%2Fe3989c09-ead9-4b08-8173-356bacb6c306%2Fversions%2F1%2F HTTP/1.1" 502 341 "-" "OpenAPI-Generator/3.8.0/ruby"

The /pulp/api/v3/content/rpm/packages/ call is timing out at the 600 s Apache reverse-proxy timeout, so I’m going to try increasing that timeout.
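
For the record, this is roughly what I’m trying. ProxyTimeout is a stock mod_proxy directive, but the exact vhost file is a guess on my part, and foreman-installer may well revert manual edits on its next run:

grep -rln 'pulp/api' /etc/httpd/conf.d/   # find the vhost proxying the Pulp API
# add e.g. "ProxyTimeout 1200" inside that vhost, then:
systemctl restart httpd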

The CVV publish eventually worked…

@John_Beranek I tested it myself on a small test system (2 cores, ~15GB RAM, puppetserver disabled :slight_smile: ) and I noticed rq maxing out at 8.2 GB. I’ve created a Pulp issue (Issue #8467: RQ worker takes up a lot of RAM when syncing OL7 repository - RPM Support - Pulp) so it can get some investigation.