@John_Beranek regarding the puma worker that’s using a lot of memory, can you check this from your server via foreman-rake console: Katello::EventDaemon::Runner.pid? Let me know if the PID it reports matches the high-memory puma worker on your system.
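For reference, a minimal sketch of what I mean (the ps comparison at the end is just one way to check it):

# run foreman-rake console on the Katello/Foreman server, then:
Katello::EventDaemon::Runner.pid   # returns the event daemon's recorded PID
# compare that PID against the high-memory puma worker, e.g.:
#   ps -o pid,rss,cmd -p <that PID>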
Thank you. Next time you see the memory going high, please share the output of hammer ping - especially the *_events parts. Maybe I can tell which event handler is causing the high usage.
Jonathon, can you explain to me why the process is named just rails? Puma processes should always be named puma, since the puma app server calls setproctitle to modify the process title.
I am trying to identify all the processes we have on a Katello deployment:
I believe it should be pretty easy for the Katello Event Daemon to set its process title to something reasonable.
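Just to illustrate: a minimal sketch, assuming the daemon would use Ruby’s Process.setproctitle (the title string here is only an example):

# somewhere early in the event daemon's startup code (illustrative only)
Process.setproctitle("katello-event-daemon")
# ps and top would then show "katello-event-daemon" instead of a generic rails/puma title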
OK, so I have a puma process which has reached 11GiB resident, and:
# hammer ping
database:
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:
    Status:          FAIL
    message:         Not running
    Server Response: Duration: 0ms
candlepin:
    Status:          ok
    Server Response: Duration: 17ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 17ms
candlepin_events:
    Status:          ok
    message:         57949 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:
    Status:          ok
    message:         40 Processed, 0 Failed
    Server Response: Duration: 0ms
pulp3:
    Status:          ok
    Server Response: Duration: 68ms
foreman_tasks:
    Status:          ok
    Server Response: Duration: 36ms
# journalctl --no-hostname -S "7 days ago" |fgrep "Out of memory"
Nov 28 02:00:46 kernel: Out of memory: Killed process 1710529 (ebpf.plugin) total-vm:394528kB, anon-rss:153800kB, file-rss:0kB, shmem-rss:560kB, UID:987 pgtables:604kB oom_score_adj:1000
Nov 28 02:01:34 kernel: Out of memory: Killed process 1710320 (netdata) total-vm:243972kB, anon-rss:120264kB, file-rss:0kB, shmem-rss:560kB, UID:987 pgtables:412kB oom_score_adj:1000
Dec 01 03:03:12 kernel: Out of memory: Killed process 3026521 (netdata) total-vm:237880kB, anon-rss:121628kB, file-rss:0kB, shmem-rss:560kB, UID:987 pgtables:400kB oom_score_adj:1000
Dec 01 03:04:23 kernel: Out of memory: Killed process 4035985 (netdata) total-vm:168932kB, anon-rss:77416kB, file-rss:0kB, shmem-rss:556kB, UID:987 pgtables:264kB oom_score_adj:1000
Dec 01 03:04:23 kernel: Out of memory: Killed process 3026536 (rails) total-vm:16762388kB, anon-rss:15458220kB, file-rss:0kB, shmem-rss:0kB, UID:994 pgtables:32396kB oom_score_adj:0
[Edit: in fact, the PID of the killed rails process above matches the EventDaemon PID I reported the last time I checked it.]
So, it’s just ‘rails’ when the kernel’s OOM killer comes along - presumably the kernel only outputs the original process name, not the dynamically set one?
The fact that one of the worker processes (selected randomly?) is the one handling the events feels weird to me. I’ve created a lengthy post about this:
That is a larger problem on its own, which is why I created a separate post. I will let the Katello devs investigate and help you with this one.
@Jonathon_Turel remember we have a nice telemetry API in Foreman core which is not utilized at all for Katello events. It is extremely easy to add statements to monitor the number of events, events per minute, queue lengths and other important metrics which could help.
It would also be possible to call Ruby VM stat before and after each event is processed and take a reading of the number of allocated objects. This can be reported per event and easily aggregated later.
Here is a reminder to use our telemetry API for such information: Foreman telemetry API for developers
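A rough sketch of the kind of instrumentation I have in mind; GC.stat is standard Ruby, while the telemetry helper names in the comments are only assumptions to be checked against the guide above:

# hypothetical wrapper around handling a single event, for illustration only
before = GC.stat(:total_allocated_objects)
handle_event(event)   # placeholder for the actual Katello event handler call
allocated = GC.stat(:total_allocated_objects) - before
# then report via the Foreman telemetry helpers (names assumed, see the guide):
#   telemetry_increment_counter(:katello_events_processed, 1, handler: event.class.name)
#   telemetry_observe_histogram(:katello_event_allocated_objects, allocated, handler: event.class.name)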
Yes, titles set via setproctitle are only read by some tools like ps or top.
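One quick way to see the difference, assuming a Linux /proc filesystem:

# Process.setproctitle only rewrites the argv-based title (what ps/top read),
# while the kernel's own short name (comm) - the one the OOM killer logs - stays as it was:
Process.setproctitle("my-nice-title")
puts File.read("/proc/#{Process.pid}/comm")     # still the original executable name, e.g. "ruby"
puts File.read("/proc/#{Process.pid}/cmdline")  # now begins with the new title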
I noticed that katello_agent is FAIL in your hammer ping output. Do you use the katello agent in your infrastructure? If not, you should run the installer with --foreman-proxy-content-enable-katello-agent=false
Hi, I just found this topic,
I’m facing the same problem as John, including errors like these when syncing:
Nov 17 09:17:41 katello.example.com pulpcore-worker-7[3775518]: django.db.models.deletion.ProtectedError: ("Cannot delete some instances of model 'ReservedResource' because they are referenced through a protected foreign key: 'TaskReservedResource.resource'",>
Nov 17 09:17:42 katello.example.com pulpcore-api[3277945]: pulp [89e5995c-3b2b-437a-b367-4faf9c981c39]: - - [17/Nov/2021:09:17:42 +0000] "GET /pulp/api/v3/tasks/039e13b0-5a99-4f8c-9906-14818f48a8d6/ HTTP/1.1" 200 1411 "-" "OpenAPI-Generator/3.14.1/ruby"
I get the same hammer ping result with “katello_agent” failed even though we’ve never used the katello agent, and the same qpidd errors in the journal.
I suspect that you and the OP have upgraded from an older version of Katello where the Katello Agent infrastructure was enabled by default. New installations do not enable it. If you aren’t using it, then I absolutely recommend running the installer as I mentioned to remove all traces of Katello Agent from your server.
That installer option is --foreman-proxy-content-enable-katello-agent=false
If you are on pulpcore 3.14 and are not explicitly opting into the old tasking system, it is safe to delete the records in the “core_taskreservedresource” table (that is - DELETE FROM core_taskreservedresource; ), which should resolve the issue. They are leftovers from the old tasking system which for some reason have not been cleaned up properly.
@John_Beranek did this a few days ago; he should be able to confirm.
So, this may be of interest… after running foreman-installer --foreman-proxy-content-enable-katello-agent=false, the apparent memory leak in the puma process appears to have stopped.
Services have been running for a week, and look like this: