Changes to event handling in Katello

Hi there,

I’ve been working on the Katello modifications needed to support the upcoming changes to Dynflow. The PR [1] is close to merging, and my goal is for it to be included in Katello 3.14.

The purpose of this thread is to raise awareness of the change, even though it will likely be merged today or otherwise very soon. Let’s keep broader discussion of the change in this thread, and specific change requests on the PR for as long as it remains unmerged.

Summary
Katello’s event handling will no longer be managed by Dynflow; instead it will run in background threads directly in the Rails process (specifically, the ‘preloader’ process, so that there is only a single instance). The resiliency and ability to recover from errors have been preserved.
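To make that concrete, here is a minimal sketch of the general pattern: a background thread started from a Rails initializer that polls for events in a loop. The file, class, and method names are purely illustrative and are not the actual Katello code.

    # config/initializers/event_handling.rb -- illustrative sketch, not the actual Katello code
    Rails.application.config.after_initialize do
      Thread.new do
        Thread.current.name = 'katello-event-listener'
        loop do
          begin
            EventQueue.poll_and_process   # hypothetical entry point into the preserved handler code
          rescue => e
            Rails.logger.error("Event handling failed: #{e.message}")
          end
          sleep 1
        end
      end
    end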

Testing
I’ve done general load testing in a production environment and haven’t seen any anomalies. My strategy was to register several thousand hosts with an activation key that provides a subscription and, after all of the registration activity, remove the product providing that subscription; this exercises much of the general behavior the event handlers would see in a real environment. I did not collect any performance metrics.

What is Katello event handling?

If you’re not familiar with it, you’ve probably at least seen the ‘Monitor Event Queue’ and ‘Listen on Candlepin Events’ tasks under Monitor -> Tasks. Those actions are now gone, but the underlying code has been preserved and is executed by the mechanism described above. The event handlers are responsible for the following (a rough dispatch sketch follows the list):

  • Auto-publish of composite content views
  • Errata applicability calculation
  • Updating of host subscription compliance & system purpose status
  • Updating and removal of subscription pools in Katello’s database
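Conceptually, each incoming event is matched to one of those responsibilities by its type. The sketch below shows the idea only; the event type strings and handler bodies are made up for illustration and do not match the actual Katello code.

    # Illustrative only -- real Katello event names and handler code differ.
    EVENT_HANDLERS = {
      'content_view.needs_publish' => ->(data) { puts "would auto-publish composite #{data['id']}" },
      'compliance.created'         => ->(data) { puts "would update statuses for host #{data['uuid']}" },
      'pool.deleted'               => ->(data) { puts "would remove pool #{data['id']} from the database" }
    }.freeze

    def handle(event)
      handler = EVENT_HANDLERS[event['type']]
      handler ? handler.call(event['data']) : puts("ignoring #{event['type']}")
    end

    handle('type' => 'pool.deleted', 'data' => { 'id' => '42' })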

Visibility
Since visibility into the event handlers is no longer possible through Monitor -> Tasks, I’ve exposed some useful data about them on the ‘About’ page: the processed count, error count, and queue depth for each of the two handlers. The same data is also exposed via the Ping API. hammer ping still needs to be updated to expose the new reports, which are called katello_events and candlepin_events.
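For a rough idea, the new data in the Ping API response could look something like the Ruby hash below; the key names here are just an illustration, so check the actual output on a running system.

    # Hypothetical shape of the event handler data -- key names are illustrative only.
    {
      'katello_events'   => { 'running' => true, 'processed' => 482,  'failed' => 0, 'queue_depth' => 3 },
      'candlepin_events' => { 'running' => true, 'processed' => 1290, 'failed' => 2, 'queue_depth' => 0 }
    }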

Let me know of any questions, concerns, or anything else regarding this change.

[1] https://github.com/Katello/katello/pull/8366

The PR is now merged. I got some feedback from Eric and filed Feature #28141: Notifications for large event queue depth - Katello - Foreman to provide helpful notifications at a later time.

Have you tested memory consumption after several days of use?

The whole purpose of Dynflow (or any background job processor) is that Ruby is unable to return fragmented memory back to the OS, so long-running processes will eventually grow well beyond memory limits and must be restarted. That’s when the background processor kicks in and does this automatically. For worker processes, the web server (or a plugin like Passenger) handles this.

What exactly do you mean by the preloader instance? Do you mean the initial Rails instance from which the copy-on-write children are forked? If that’s the case, then I believe this is a bad idea: those threads will make memory usage grow, and all children will then fork from that parent as well.

Also, how is the parent process supposed to restart itself? We use Passenger, which automatically recycles worker processes after 1000 requests, but it does not do this for the parent process. I believe this will ultimately taint all worker processes with garbage generated by those threads.

Please consider using our telemetry framework to expose these. There’s a bunch of resources available on how to add new metrics, and I can assist. I’m not sure how much Katello exposes today, but I think this should be available from day one so you can easily troubleshoot when things go bad.
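For reference, registering and incrementing a counter through Foreman’s telemetry layer looks roughly like the sketch below. I’m writing the helper names from memory, so treat them as assumptions to verify against the telemetry developer docs.

    # Rough sketch, written from memory -- verify the exact Foreman telemetry API before use.
    # Register the metric once during initialization:
    Foreman::Telemetry.instance.add_counter(:katello_event_queue_processed,
                                            'Number of Katello events processed',
                                            [:handler])

    # Then increment it wherever an event finishes processing:
    class EventMetrics
      include Foreman::TelemetryHelper   # assumed helper module providing the telemetry_* methods

      def record(handler_name)
        telemetry_increment_counter(:katello_event_queue_processed, 1, handler: handler_name)
      end
    end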

I didn’t test memory consumption over a long period of time. I hadn’t considered that Dynflow was doing anything that clever, but perhaps it makes sense, since I’ve never encountered any memory bloat from our two event-handling actions. Or maybe it’s not actually an issue in our case?

That’s exactly what I meant. While I was surprised by how well it worked, you raise good points about the potential dangers. I had actually made some attempts at first to run the threads in one of the forked processes, but had bad luck. I’ve revisited that since your feedback and came up with this: https://github.com/Katello/katello/pull/8403

The approach in my PR uses File#flock to take an exclusive lock before starting the threads in the worker processes, so only one forked process will handle events. It seems to work well with Puma, and I’m about to test it with Passenger. Please take a look at that PR and let me know if this strategy sits better with you.
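In outline, the idea looks like the sketch below (this is not the actual PR code, and the lock file path is made up): each forked worker tries to take an exclusive, non-blocking lock on a shared file, and only the winner starts the event threads; the OS releases the lock if that worker exits, so another worker can take over.

    # Minimal sketch of a flock-based guard -- not the actual PR code.
    lock_file = File.open('/tmp/katello_event_daemon.lock', File::CREAT | File::WRONLY, 0o644)
    if lock_file.flock(File::LOCK_EX | File::LOCK_NB)
      puts 'this worker won the lock and would start the event threads here'
    else
      lock_file.close
      puts 'another worker already owns event handling'
    end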

I’m interested to learn more about it, and I’ll make a note to look into this!

Well, I believe that a web server process should not be spawning any threads or processes that do any kind of background processing. That’s a design that will not scale for containers and is generally more challenging to troubleshoot and debug. However, I don’t understand all the details and reasons why you are moving away from Dynflow, so I’m not the best one to comment on that deeply.