Katello 4.2.x repository synchronisation issues

Thanks for the error and RAM usage info @John_Beranek, I’ll ask the Pulp team about it.

@John_Beranek do you find that your baseline RAM usage (i.e. not during syncing) decreases, even temporarily, if you restart the Katello or Pulpcore services (systemctl restart pulpcore* --all), particularly Pulpcore?

The Pulp team is going to start looking into the performance issues with the OL repositories soon.

Although…looking at the Katello server this morning…

foreman   930281  0.0  0.8 827968 281196 ?       Ssl  Nov21   2:11 puma 5.3.2 (unix:///run/foreman.sock) [foreman]
foreman   930977  0.6  1.5 1109044 534964 ?      Sl   Nov21  27:34 puma: cluster worker 0: 930281 [foreman]
foreman   930978  0.5  1.5 1046404 524068 ?      Sl   Nov21  24:45 puma: cluster worker 1: 930281 [foreman]
foreman   930979  0.6  1.5 1160408 534444 ?      Sl   Nov21  27:40 puma: cluster worker 2: 930281 [foreman]
foreman   930983  0.6  1.5 1106172 533892 ?      Sl   Nov21  27:45 puma: cluster worker 3: 930281 [foreman]
foreman   930991  0.5  1.5 1137104 550864 ?      Sl   Nov21  22:58 puma: cluster worker 4: 930281 [foreman]
foreman   931001  0.6  1.5 1152152 547760 ?      Sl   Nov21  26:16 puma: cluster worker 5: 930281 [foreman]
foreman   931007  0.6  1.5 1064404 550272 ?      Sl   Nov21  26:07 puma: cluster worker 6: 930281 [foreman]
foreman   931013  1.6 44.3 16621128 15350872 ?   Sl   Nov21  69:33 puma: cluster worker 7: 930281 [foreman]
foreman   931016  0.5  1.5 1080860 535256 ?      Sl   Nov21  22:26 puma: cluster worker 8: 930281 [foreman]
foreman   931021  0.5  1.6 1222988 558556 ?      Sl   Nov21  24:41 puma: cluster worker 9: 930281 [foreman]
foreman   931026  0.4  1.5 1194788 547252 ?      Sl   Nov21  21:30 puma: cluster worker 10: 930281 [foreman]
foreman   931033  0.6  1.5 1119580 542744 ?      Sl   Nov21  27:52 puma: cluster worker 11: 930281 [foreman]

One worker using 44% of the server’s RAM, 15 GiB? (And that’s while there are no running tasks.)

@John_Beranek regarding the puma worker that’s using a lot of memory, can you check this from your server via foreman-rake console: Katello::EventDaemon::Runner.pid? Let me know if the PID it reports matches the high-memory puma worker on your system.

Yes, it matches - after a week (Edit: I say a week, but it looks like the worker restarted yesterday):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3026536 foreman   20   0 9079132   7.8g  11196 S   4.3  23.7  30:06.60 rails
# foreman-rake console
Loading production environment (Rails 6.0.3.7)
irb(main):001:0> Katello::EventDaemon::Runner.pid
=> 3026536
irb(main):002:0>

Thank you. Next time you see the memory going high, please share the output of hammer ping, especially the *_events parts. Maybe I can tell which event handler is causing the high usage.

So, unfortunately I didn’t catch it before another OOM-kill event during this morning’s daily sync; I’ll try to catch it in the next few days…

(Gap in netdata stats as the OOM-killer decided to kill netdata first)

Jonathon, can you explain to me why the process is named just rails? Puma processes should always be named puma, since the puma app server calls setproctitle to modify the process title.

I am trying to identify all the processes we have on a Katello deployment:

I believe it should be pretty easy for the Katello Event Daemon to set its process title to something reasonable.
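For illustration, a minimal sketch of what that could look like, assuming the daemon can call Ruby’s built-in Process.setproctitle early in its startup (the title string and placement are made up for the example; note this only changes what ps/top display, not the short name the kernel keeps for the process):

# hypothetical placement inside the event daemon startup code:
# rename the process so ps/top show a descriptive title instead of the generic one
Process.setproctitle("katello: event daemon")

After that, ps would show "katello: event daemon" for the process instead of the generic rails/puma title.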

OK, so I have a puma process which has reached 11GiB resident, and:

# hammer ping
database:
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:
    Status:          FAIL
    message:         Not running
    Server Response: Duration: 0ms
candlepin:
    Status:          ok
    Server Response: Duration: 17ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 17ms
candlepin_events:
    Status:          ok
    message:         57949 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:
    Status:          ok
    message:         40 Processed, 0 Failed
    Server Response: Duration: 0ms
pulp3:
    Status:          ok
    Server Response: Duration: 68ms
foreman_tasks:
    Status:          ok
    Server Response: Duration: 36ms

Can you please share, once again, the full list of puma processes and the output of this command (the PID of the event daemon)?

# foreman-rake console
Loading production environment (Rails 6.0.3.7)
irb(main):001:0> Katello::EventDaemon::Runner.pid
=> 1805266

# ps auxww|grep puma
root      606352  0.0  0.0 221924  1052 pts/0    S+   10:37   0:00 grep --color=auto puma
foreman  1804501  0.0  0.6 828500 229288 ?       Ssl  Nov24   6:37 puma 5.3.2 (unix:///run/foreman.sock) [foreman]
foreman  1805234  0.6  1.5 1207352 545512 ?      Sl   Nov24  86:07 puma: cluster worker 0: 1804501 [foreman]
foreman  1805242  0.6  1.5 1101376 537516 ?      Sl   Nov24  87:50 puma: cluster worker 2: 1804501 [foreman]
foreman  1805248  0.5  1.6 1208756 560996 ?      Sl   Nov24  73:39 puma: cluster worker 3: 1804501 [foreman]
foreman  1805254  0.7  1.6 1179856 556028 ?      Sl   Nov24  93:12 puma: cluster worker 4: 1804501 [foreman]
foreman  1805260  0.7  1.5 1128044 552488 ?      Sl   Nov24  90:49 puma: cluster worker 5: 1804501 [foreman]
foreman  1805266  0.8 34.4 13024232 11916008 ?   Sl   Nov24 110:08 puma: cluster worker 6: 1804501 [foreman]
foreman  1805269  0.7  1.6 1087216 554368 ?      Sl   Nov24  92:15 puma: cluster worker 7: 1804501 [foreman]
foreman  1805275  0.6  1.6 1219604 576884 ?      Sl   Nov24  88:13 puma: cluster worker 8: 1804501 [foreman]
foreman  1805277  0.7  1.6 1202848 558220 ?      Sl   Nov24  95:13 puma: cluster worker 9: 1804501 [foreman]
foreman  1805283  0.6  1.5 1105260 551788 ?      Sl   Nov24  81:42 puma: cluster worker 10: 1804501 [foreman]
foreman  1805287  0.7  1.6 1196852 558356 ?      Sl   Nov24  92:54 puma: cluster worker 11: 1804501 [foreman]
foreman  4036611  0.5  1.5 1140960 546892 ?      Sl   Dec01  19:13 puma: cluster worker 1: 1804501 [foreman]

Can you tell what made worker no. 1 restart? Did you kill it? Puma, by default, does not recycle processes. Was it the OOM killer?

On the same day, but not the same hour…

# journalctl --no-hostname -S "7 days ago" |fgrep "Out of memory"
Nov 28 02:00:46 kernel: Out of memory: Killed process 1710529 (ebpf.plugin) total-vm:394528kB, anon-rss:153800kB, file-rss:0kB, shmem-rss:560kB, UID:987 pgtables:604kB oom_score_adj:1000
Nov 28 02:01:34 kernel: Out of memory: Killed process 1710320 (netdata) total-vm:243972kB, anon-rss:120264kB, file-rss:0kB, shmem-rss:560kB, UID:987 pgtables:412kB oom_score_adj:1000
Dec 01 03:03:12 kernel: Out of memory: Killed process 3026521 (netdata) total-vm:237880kB, anon-rss:121628kB, file-rss:0kB, shmem-rss:560kB, UID:987 pgtables:400kB oom_score_adj:1000
Dec 01 03:04:23 kernel: Out of memory: Killed process 4035985 (netdata) total-vm:168932kB, anon-rss:77416kB, file-rss:0kB, shmem-rss:556kB, UID:987 pgtables:264kB oom_score_adj:1000
Dec 01 03:04:23 kernel: Out of memory: Killed process 3026536 (rails) total-vm:16762388kB, anon-rss:15458220kB, file-rss:0kB, shmem-rss:0kB, UID:994 pgtables:32396kB oom_score_adj:0

[Edit: In fact, the killed PID above matches the last time I provided the EventDaemon’s PID.]

So, it’s just ‘rails’ when the kernel’s OOM-killer comes along - presumably the kernel only outputs the original process name, and not the dynamically set one?
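If you want to confirm that theory on a live worker, comparing the two proc entries for the high-memory PID should show the difference (the <pid> below is a placeholder): the kernel’s short name (comm), which the OOM-killer log line uses, is kept separately from the rewritten command line that ps/top display.

# cat /proc/<pid>/comm
# tr '\0' ' ' < /proc/<pid>/cmdline; echo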

It feels weird to me that the events are handled by one of the worker processes (selected randomly?). I’ve created a lengthy post about this:

That is a larger problem on its own, which is why I created a separate post. I will let the Katello devs investigate and help you in this one.

@Jonathon_Turel remember we have a nice telemetry API in Foreman core which is not utilized at all for Katello events. It would be extremely easy to add statements to monitor the number of events, events per minute, queue lengths and other important metrics which could help.

It would also be possible to read the Ruby VM stats before and after each event is processed and take a reading of the number of allocated objects. This can be reported per event and easily aggregated later.
Here is a reminder to use our telemetry API for such information: Foreman telemetry API for developers
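As a rough sketch of both ideas (the handle_event call and metric names are hypothetical, and the telemetry helper names are as I recall them from the telemetry developer docs, so treat the exact signatures as assumptions; GC.stat(:total_allocated_objects) is a standard Ruby VM counter):

# count objects allocated while one event is handled
before = GC.stat(:total_allocated_objects)
handle_event(event)                                   # hypothetical event dispatch
allocated = GC.stat(:total_allocated_objects) - before

# report the event count and the allocation cost per handler
telemetry_increment_counter(:katello_events_processed, 1, handler: event.class.name)
telemetry_observe_histogram(:katello_event_allocations, allocated, handler: event.class.name)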

Yes, titles set via setproctitle are only read by some tools like ps or top.

I noticed that katello_agent shows FAIL in your hammer ping output. Do you use the katello agent in your infrastructure? If not, you should run the installer with --foreman-proxy-content-enable-katello-agent=false

We have no installations of the Katello agent, and yet: Constant qpidd errors in journal

Hi, I just found this topic.
I face the same problem as John, with errors like these when syncing:

Nov 17 09:17:41 katello.example.com pulpcore-worker-7[3775518]: django.db.models.deletion.ProtectedError: ("Cannot delete some instances of model 'ReservedResource' because they are referenced through a protected foreign key: 'TaskReservedResource.resource'",>
Nov 17 09:17:42 katello.example.com pulpcore-api[3277945]: pulp [89e5995c-3b2b-437a-b367-4faf9c981c39]:  - - [17/Nov/2021:09:17:42 +0000] "GET /pulp/api/v3/tasks/039e13b0-5a99-4f8c-9906-14818f48a8d6/ HTTP/1.1" 200 1411 "-" "OpenAPI-Generator/3.14.1/ruby"

I get the same hammer ping result with “katello_agent” failed even though we’ve never used the Katello agent, and the same qpidd errors in the journal.

This message helped me: Oracle Linux 8 Appstream sync fails - #9 by John_Beranek
But sadly I think you already know about it :confused:

I suspect that you and the OP have upgraded from an older version of Katello where the Katello Agent infrastructure was enabled by default; new installations do not enable it. If you aren’t using it, then I absolutely recommend running the installer as I mentioned, to remove all traces of Katello Agent from your server.

That installer option is --foreman-proxy-content-enable-katello-agent=false
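For reference, a full invocation would look roughly like this (the --scenario katello part is an assumption based on a standard Katello install; the agent flag is the one mentioned above):

# foreman-installer --scenario katello --foreman-proxy-content-enable-katello-agent=false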
