Saying goodbye to the Katello Event Daemon

Hey folks! Katello has been on my mind lately - particularly some of the parts that I tended to gravitate toward (or implement entirely) when I was on the team full-time. I have some thoughts around what I think are real improvements in those areas when it comes to reducing complexity.

The component I’ve got my sights on is the Katello Event Daemon which was introduced/shared via another (admittedly after the fact) RFC several years ago. It manages a few other subsystems in Katello by spawning threads within the Puma process and ensuring they are started and kept running. Today, it enables the operation of:

  • Katello Agent and its messaging in and out of Qpid
  • Candlepin Events receiving messages from Candlepin’s embedded Artemis message broker
  • Katello Event Queue which is a “simple” mechanism enabling deduplication, scheduling, and retrying of certain actions

This RFC is about removing the Katello Event Daemon completely.

Since Katello Agent is finally being removed, only the last two items would be managed by the Daemon. Can they be moved elsewhere to remove the additional complexity added by KED? I think so.

Candlepin Events
This is an easy one. My proposal is to add an internal API to Katello and stand up a new, (very) small service that connects to the Artemis broker within Candlepin - just as Katello does now from inside Puma. This service (managed via systemd) will not handle events beyond forwarding them to the internal API, so it’s very lightweight.
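To make the shape concrete, here’s a minimal sketch of the forwarding half of such a service. The endpoint path, header, and class names are all assumptions for illustration (not an existing Katello route), and the broker subscription itself is elided:

```ruby
require "net/http"
require "uri"

# Hypothetical forwarder for a standalone candlepin-events service.
# The subscription to Candlepin's Artemis broker is elided; this only
# shows the "relay the raw event to Katello's internal API" half.
class EventForwarder
  def initialize(endpoint: URI("https://localhost/katello/api/internal/candlepin_events"))
    @endpoint = endpoint
  end

  # Builds the POST that relays one raw event body unchanged.
  def build_request(subject, event_json)
    req = Net::HTTP::Post.new(@endpoint)
    req["Content-Type"] = "application/json"
    req["X-Candlepin-Event"] = subject # e.g. "compliance.created"
    req.body = event_json
    req
  end

  # Would be called from the broker subscription callback.
  def forward(subject, event_json)
    Net::HTTP.start(@endpoint.host, @endpoint.port, use_ssl: true) do |http|
      http.request(build_request(subject, event_json))
    end
  end
end
```

The broker side would just call `forward` from its subscription callback; all the actual event handling stays in Katello behind the internal API.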

Katello Event Queue
This is more nuanced because it has certain requirements (dedup, rescheduling, retrying) and a greater number of potential solutions.

The most basic solution would also use the internal API, with its own endpoint to be called by another small external service (systemd again) which polls every few seconds. No business logic here - just a place to trigger the queue draining from. In practice this would generate a lot of log noise. Perhaps it could connect to the database to see if it should trigger the queue drain over the API, resolving the noise issue. I like this for the sake of simplicity and least (zero) disturbance to the Event Queue system.
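A rough sketch of that poller, assuming hypothetical hooks for the database check and the API call (neither exists today):

```ruby
# Hypothetical polling loop for a small systemd-managed service.
# `pending_check` would be a cheap direct DB query ("any queued events?"),
# and `trigger_drain` would hit the internal API endpoint that drains the
# queue. Both are injected here because the real queries/endpoints are
# assumptions, not existing Katello code.
class QueuePoller
  def initialize(pending_check:, trigger_drain:, interval: 5)
    @pending_check = pending_check
    @trigger_drain = trigger_drain
    @interval = interval
  end

  # Runs one polling cycle; returns true if a drain was triggered.
  def poll_once
    return false unless @pending_check.call # cheap check avoids noisy no-op API calls
    @trigger_drain.call
    true
  end

  def run
    loop do
      poll_once
      sleep @interval
    end
  end
end
```

The cheap `pending_check` (a single DB query) gates the API call, so an idle system produces no request or log noise at all.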

Another solution I’m fond of is reworking the handful of events that run via the Event Queue into Sidekiq workers. The advantages there would be: no ‘new’ external service, and actual removal of the “Katello Event Queue” construct. Out of the box, or with a plugin like sidekiq-unique-jobs, all of the dedup, reschedule, and retry requirements can be addressed. I think this is a big win, but I cannot speak to, for example, where Katello’s queue would be run. Sidekiq running in Dynflow? Adding a separate Sidekiq process for this would increase memory requirements, which is a concern of mine.
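As a sketch of that direction, here’s roughly what one Event Queue event could look like as a Sidekiq job, with dedup via sidekiq-unique-jobs’ `lock: :until_executed` option (a real option of that plugin). The job and queue names are hypothetical, and a tiny shim stands in for the gems so the snippet is self-contained; a real app would `require "sidekiq"` instead:

```ruby
# Minimal stand-in for Sidekiq's job DSL so this sketch runs without the gem.
module Sidekiq
  module Job
    def self.included(base)
      base.extend(ClassMethods)
    end

    module ClassMethods
      def sidekiq_options(opts)
        @sidekiq_options = opts
      end

      def get_sidekiq_options
        @sidekiq_options
      end
    end
  end
end

# Hypothetical replacement for one Event Queue event type.
class GenerateHostApplicabilityJob
  include Sidekiq::Job
  # `lock: :until_executed` is sidekiq-unique-jobs' dedup mechanism: a second
  # enqueue for the same host_id is dropped while one is already pending.
  # Sidekiq's built-in `retry` covers the retry requirement.
  sidekiq_options queue: :katello_events, retry: 5,
                  lock: :until_executed, on_conflict: :log

  def perform(host_id)
    # Would run the same code the Event Queue handler runs today,
    # e.g. recalculating errata applicability for the host.
  end
end
```

Scheduling would come for free via `perform_in`/`perform_at`, which Sidekiq provides out of the box.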

I’ve got some of the above working to a high degree, and I recently had an in-depth technical discussion on this PR which ultimately brought the focus back here.

Please share your thoughts here and let’s see if any of this can become reality.


Hi,
thank you for writing this up, definitely a +1 for dropping the whole “puma spawns additional threads for stuff” thing.

That sounds good in theory, although I have a gut feeling there might be dragons. Especially if we introduce sidekiq plugins we might start seeing some weird things happening, unless we run a separate sidekiq instance for it which sounds like a bit of an overkill.

What is the volume of the things passing through Katello event queue? If the volume was reasonably low, we could turn the events that run via the event queue into “proper” dynflow actions instead.

Thanks for the feedback. I wouldn’t advocate for this solution if it means bringing up a separate sidekiq process, unless there was a need across several other plugins or something like that. It’s also a bigger (riskier) change for Katello. It would be a very clean solution though.

I’m not sure of the volume. Some of the events scale linearly with number of content hosts so it could be thousands a day (minimal guess). I think it would be too noisy in terms of showing them in the foreman-tasks UI especially since they aren’t concrete actions taken by a user.

I’m not sure I’d advocate for additional systemd services right now. Have you put any thought into running it as a Smart Proxy feature?

Those can launch background daemons; for example, the ISC DHCP provider uses this to monitor the DHCP lease file. https://github.com/theforeman/smart-proxy/blob/06092bb1ac536f35f710ba589ad4f0eed31ba8a6/modules/dhcp_isc/configuration_loader.rb#L28-L40 & https://github.com/theforeman/smart-proxy/blob/06092bb1ac536f35f710ba589ad4f0eed31ba8a6/modules/dhcp_isc/dhcp_isc_plugin.rb#L18 are how that’s set up.

You have a connection back to the Foreman instance, authenticated by HTTPS client certificates. It can use a better API, but https://github.com/theforeman/smart-proxy/blob/develop/lib/proxy/request.rb does what you’d expect from a REST client.

If you do that, you could even use the Smart Proxy exposed settings to find the Candlepin service instead of coding it in katello.yaml (like we do today: https://github.com/theforeman/puppet-katello/blob/2c9a8e9371a587d95be5ec3a1b940a15dcfc4393/templates/katello.yaml.erb#L7-L16). Doing so would allow us to fully remove the katello.yaml file, simplifying the installer. https://github.com/theforeman/smart_proxy_pulp/blob/cd52ba90d5be8a5bbf8bc1c4d41b6625ac0dbc7d/lib/smart_proxy_pulp_plugin/pulpcore_plugin.rb#L22-L28 is how those settings are exposed. Foreman :: Foreman Proxy Registration Protocol v2 explained is the longer version.

There are cases you should think about: what if there isn’t a Smart Proxy with that feature? But that’s already the same as with Pulp today.

Why’s that?

Briefly. I recall you mentioned it to me quite a while back. The spirit of my RFC is to reduce layers of complexity and I perceive burying this in a plugin to be counter to that. I don’t see a fit beyond it being a place to execute the necessary code.

Katello still needs to talk to candlepin for reasons beyond handling async events (the candlepin portion) and my proposal also implies removal of the candlepin_events portion, so it’s something. I think you’re describing a Candlepin focused smart proxy feature with the events mechanism being a part of that. If that fits in with the long-term goals of the project maybe there’s something to it.

I forgot to mention in the first post that maybe the current state is perfectly fine. It works and is a stable piece of Katello that has seen very few bugs since it was introduced. It’s OK to say that what’s there is Good Enough.

Additional complexity, overhead for admins. Every service adds a lot of complexity: you need to secure it somehow, keep it running & updated. Make sure you have logging set up well, especially in multi-server environments.

All current projects suffer from a lack of maintainers and a new service always has a base layer of maintenance.

It already provides all the common core functionality and the Smart Proxy has always been a glue layer between services. At least in Foreman. Katello never adopted that, which may be explained by the complexity of integrations (Pulp & Candlepin are big). But it’s also service discovery. Foreman only talks to Smart Proxies and compute resources. There have been suggestions to push out compute resources to Smart Proxies so Foreman only talks to Smart Proxies.

Now I understand this isn’t directly feasible because Katello is actually a reverse proxy for the RHSM API, but let’s take the bigger-picture view.

Today if a client wants to talk to RHSM it can be:

Client → Content Proxy → Foreman → Candlepin

If you implement it as a Smart Proxy feature it can be:

Client → Smart Proxy → Candlepin

If there are actions it needs to do (today implemented in Katello), it should call the Foreman API to do so.

This can also help with scalability: instead of all clients going through Foreman (meaning you need to scale that up), you could scale up multiple Smart Proxies located closer to your clients. They would still talk to a single Candlepin instance, but that’s already the case today.

Indeed, it was part of a bigger vision. That vision may be flawed, but this RFC may be a good chance to explore the idea.


At least Sidekiq is a known quantity: we know how to create and manage instances, compared to some net new thing. I guess what I am saying is, if most things are equal, something we know would be better than something we don’t know.

I know this has always been true, but I would venture that most have not understood why it needs to be true in all cases (and that this is part of the Katello divergence). Why require some other service to talk to just to reach the service you really want? I can understand creating abstractions over lesser-known quantities, but why do so for a well-known quantity?

This would only work for a small subset of actions (primarily a couple of read operations) that are “proxied directly” to Candlepin. Most actions need to talk to Foreman as it is the source of truth for identity and auth.

If the smart-proxy can spawn a background daemon that would monitor the event queue and react, why can’t Foreman also do this and avoid a go-between?

Fair points, but I mostly wanted to make clear that introducing a new service has downsides. Or in your words: an unknown quantity.

I’ll admit immediately that I didn’t look at the exact code.

I’m mostly worried about the model that Puma has. Though I saw there was work ongoing to implement Sidekiq running inside Puma (https://github.com/puma/puma/blob/master/6.0-Upgrade.md#sidekiq-7-capsules) and that may be a good solution. Even if we don’t implement it using Sidekiq, rewriting the event listener to use the same mechanism instead of Rack middleware may also be an option.

Following up on my own suggestion, I’ve opened Start Katello event daemon in Puma after a worker boots by ekohl · Pull Request #10099 · theforeman/foreman · GitHub but right now that’s completely untested.