While digging into increased memory usage on my VM, I realized I need help understanding the design of the Katello event listeners.
I am trying to understand how these Event Monitor and Event Daemon threads are supposed to work in production and what they do. There is a global (singleton) registry that is created and initialized in `Katello::EventDaemon::Runner.initialize`:

```ruby
Katello::EventDaemon::Runner.register_service(:candlepin_events, Katello::CandlepinEventListener)
Katello::EventDaemon::Runner.register_service(:katello_events, Katello::EventMonitor::PollerThread)
Katello::EventDaemon::Runner.register_service(:katello_agent_events, Katello::EventDaemon::Services::AgentEventReceiver) if ::Katello.with_katello_agent?
```
Let me explain my concerns. I noticed that these services/runners/monitors spawn several threads:
```
[lzap@nuc katello]$ ag Thread.new
app/services/katello/event_monitor/poller_thread.rb
86:      @thread = Thread.new do

app/lib/katello/event_daemon/runner.rb
77:      @monitor_thread = Thread.new do

app/lib/katello/event_daemon/services/agent_event_receiver.rb
33:      @thread = Thread.new do
```
Every time I see a new thread spawned in a web application, it rings a bell for me. This is not something that should be done: web application servers are designed around a concurrency model that usually does not account for additional user-spawned threads, and, more importantly for the Ruby environment, we run multiple instances (processes) of the application.
I do not understand Ruby on Rails initialization and Puma well enough to see where these threads end up, but it is obviously not deterministic, and I fail to understand the intended deployment behavior of these services. Are the threads meant to run in the master process (the one from which workers are forked)? Or in all of the workers?
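For context, Puma's own configuration DSL shows the two places code can land relative to fork(): `before_fork` runs once in the master, while `on_worker_boot` runs in every worker after it is forked. This is a generic illustrative `puma.rb` fragment, not Foreman's shipped configuration:

```ruby
# Illustrative puma.rb fragment (not Foreman's actual config).
workers 2
threads 1, 16

before_fork do
  # Runs once in the master process, before any worker is forked.
  # A thread started here would be dead in every forked worker.
end

on_worker_boot do
  # Runs in each worker process after fork. A thread started here
  # exists in every worker, i.e. N copies, one per worker process.
end
```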
See, in most operating systems including Linux, when a process with multiple threads forks, only the thread that called fork() carries over to the child process. Data created by the other threads is copied into the child too, but those threads are no longer active there and will never run again. So having this in the master process could work, although data created by the threads would be wasted. But that is not what happens, as you will see.
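This is easy to demonstrate with a minimal standalone Ruby sketch (MRI on Linux, nothing Katello-specific): a background thread started before fork() is still visible as an object in the child, but it is dead there, while it keeps running in the parent.

```ruby
# Demo: only the thread that called fork() survives in the child.
background = Thread.new { loop { sleep 0.05 } }
sleep 0.1 # let the background thread start running

reader, writer = IO.pipe
child = fork do
  reader.close
  sleep 0.1
  # In the child, the Thread object exists but the thread is dead:
  writer.write(background.alive?.to_s) # writes "false"
  writer.close
end
writer.close
child_view = reader.read
Process.wait(child)

# In the parent, the background thread is still running:
parent_view = background.alive?

puts "child sees background thread alive:  #{child_view}"
puts "parent sees background thread alive: #{parent_view}"
```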
A worse scenario is that all workers spawn these threads and only one "wins" the PID-file-writing "fight". It might then appear that only one process handles the events, while in fact all of them would be processing them (or the queues, or whatever these threads are doing). If this was the intended design, then we should probably write the PIDs in a correct way, but it feels weird.
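If "every worker races and one wins" really were the intended design, the election could at least be made explicit and deterministic. A hypothetical sketch using an exclusive file lock (the lock path and helper name are my invention, not Katello's); flock is released automatically when the holding process dies, so a recycled worker could take over:

```ruby
require "tmpdir"

# Hypothetical lock path, for illustration only.
LOCK_PATH = File.join(Dir.tmpdir, "event_daemon.lock")

# Returns true in exactly one process per election: the one that grabs
# the exclusive lock. That process records its PID and should start the
# event threads; everyone else backs off.
def elect_event_daemon
  lock = File.open(LOCK_PATH, File::RDWR | File::CREAT, 0o644)
  if lock.flock(File::LOCK_EX | File::LOCK_NB)
    lock.truncate(0)
    lock.write(Process.pid.to_s)
    lock.flush
    true  # we won; keep the fd open to hold the lock
  else
    lock.close
    false # another process already runs the daemon
  end
end
```

The kernel releases the lock when the winner exits, unlike a stale PID file, which keeps pointing at a dead (or recycled) process.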
This is my instance after restart:
```
foreman    9282  0.0  2.9  947964 293884 ?  Ssl  Dec02  0:49 puma 5.3.2 (unix:///run/foreman.sock) [foreman]
foreman    9308  0.0  4.6  992672 463960 ?  Sl   Dec02  0:42 puma: cluster worker 0: 9282 [foreman]
foreman    9314  0.1  5.3 1144712 537400 ?  Sl   Dec02  1:23 puma: cluster worker 1: 9282 [foreman]
```
What we see is puma app server with two worker processes. So far so good. Now, let’s find out which process is handling those events:
```
[root@zzzap ~]# foreman-rake console
Loading production environment (Rails 18.104.22.168)
irb(main):001:0> Katello::EventDaemon::Runner.pid
=> 9314
```
That is puma cluster worker 1, which feels incorrect. I would expect this to run either in the master process or (preferably) in a totally separate process that is not a webserver (sidekiq, perhaps). The reason I think this is incorrect is that Puma can recycle worker processes (during a rolling restart). We do not use rolling restarts as they need some further configuration changes, and luckily Puma currently does not support recycling workers after X requests the way Passenger does, so worker processes can only be recycled manually. Let's test what happens:
```
[root@zzzap ~]# kill 9314
[root@zzzap ~]# ps axuwww | grep puma
foreman    9282  0.0  2.9  947964 293884 ?  Ssl  Dec02  0:49 puma 5.3.2 (unix:///run/foreman.sock) [foreman]
foreman    9308  0.0  4.6  992672 463960 ?  Sl   Dec02  0:42 puma: cluster worker 0: 9282 [foreman]
foreman   24466  0.0  4.4  976840 449840 ?  Sl   11:24  0:00 puma: cluster worker 1: 9282 [foreman]
[root@zzzap ~]# foreman-rake console
Loading production environment (Rails 22.214.171.124)
irb(main):002:0> Katello::EventDaemon::Runner.pid
=> nil
```
As you can see, Puma immediately restarted worker 1, but the Katello event daemon is gone. There is a Katello event daemon runner class, which is some kind of odd monitor that watches the event daemon thread. I am not sure what this is supposed to achieve: the event daemon and the event monitor share the same process, so if the process is killed there is no monitor left to resume anything. And monitoring threads is weird anyway; a thread can only terminate when its block ends or when an exception is raised that is not caught, and both cases can be handled in code pretty easily.
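To illustrate what "handled in code" means, here is a hypothetical self-supervising worker thread (my sketch, not the actual Katello code): an uncaught exception is rescued inside the thread itself and the work loop is retried, so no external monitor thread is needed.

```ruby
# Sketch: a thread that supervises itself. If the work block raises,
# the rescue logs the error and retries instead of letting the thread
# die silently.
def start_supervised(name, max_restarts: 5)
  Thread.new do
    restarts = 0
    begin
      yield
    rescue => e
      restarts += 1
      warn "#{name} crashed (#{e.class}: #{e.message}), restart #{restarts}"
      retry if restarts < max_restarts
    end
  end
end

# Usage: a job that fails twice, then succeeds on the third attempt.
attempts = 0
worker = start_supervised("demo") do
  attempts += 1
  raise "transient failure" if attempts < 3
end
worker.join
```

This handles crashed threads; it does not help with a killed process, which is exactly why an in-process monitor cannot replace proper process supervision.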
As I said, this behavior is not deterministic: there is a report (Katello 4.2.x repository synchronisation issues - #32 by John_Beranek) where worker number 6 out of 12 is the one running the events. In my case it was always the second worker.
This design is not only bad from the administrative perspective, it also cannot scale well. Puma effectively uses round-robin to distribute load across the workers, and since one of the workers also has to handle the Katello events, it can get overloaded, causing slower processing of events.
So my question and the purpose of this post is: What are these event listeners doing and can we move them out of the webserver process into a separate process(es)?