I opened a PR with a new “Templates Rendering” host status. This can be useful, especially when migrating from unsafemode to safemode rendering. The first iteration is ready to be tested. Any comments and suggestions are welcome.
Thanks, I have some performance concerns, more in the PR.
First of all, this is a good idea! I have a few questions:

- How do we know which templates to render? Is it only the provisioning template kind, or also finish templates, cloud-init templates, etc.?
- If I make a typo in a provisioning template, does it mean all of my hosts will turn red? I’m wondering whether the host status is the right indicator for this. Did you consider generating a notification for the host owner instead?
- How does it achieve the every-5-minutes run? It seems there’s a rake task, but I also see the initializer scheduling the job.
- How long does it take to render the templates for one host in your environment? How long does it take for the whole inventory (and how big is your inventory)?
Thanks for the PR and thanks for opening this RFC!
Yes, this is the goal of the PR. We want to figure out whether the templates are broken for a specific host. It can be as simple as a changed host parameter that causes the templates to break. In my opinion the host status is the only place where this makes sense. Generating a notification every time this breaks sounds super annoying; I’d want to be able to disable it ASAP.
Yeah, the notification drawer would be swamped quickly. However, a daily summary email notification would do better. The status would still be a good representation, and also a good way to search for all hosts that have a “template” issue.
The reason I didn’t like the host status in the first place is that the result can change on every render. If a parameter contains ERB, or if the template relies on some external system that has an outage, the host turns red and stores a potentially no-longer-valid state. If it’s refreshed in 5 minutes, that’s not a big deal, I guess. However, if the rendering doesn’t take much time, it could be evaluated on demand and not stored at all. What I like about a status is the searching capability, though. So I’m supportive of this.
I’d like to consider one more thing. I’m not sure which templates we render, but I assume the provisioning ones. Would it make sense to extend the Build status instead of introducing a new one?
I believe we render all templates that are directly associated with the host, so all the templates you see in the “Templates” tab on the host detail page.
The Build status is not updated after a successful build, so it could add something like “(rebuildable)”. A separate status would have the benefit that a setting could be introduced to control whether this sub-status affects the global status or not.
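For illustration, such a sub-status with a configurable global mapping could look roughly like this; the class and setting names below are made up, but the shape follows Foreman’s existing HostStatus sub-statuses:

```ruby
# Sketch only: HostStatus::RenderingStatus and the setting name are
# hypothetical; the structure mirrors existing sub-statuses.
class HostStatus::RenderingStatus < HostStatus::Status
  OK = 0
  ERROR = 1

  def self.status_name
    N_('Templates rendering')
  end

  def to_label(_options = {})
    status == OK ? N_('Templates rendered') : N_('Rendering failed')
  end

  def to_global(_options = {})
    return HostStatus::Global::OK if status == OK

    # The hypothetical setting decides whether a broken template
    # turns the whole host red or only yellow.
    Setting[:rendering_failure_is_error] ? HostStatus::Global::ERROR : HostStatus::Global::WARN
  end
end
```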
So it’s determined by the OS. I’m wondering whether there’s a valid scenario where, e.g., hosts with the same OS are provisioned through the network in one location but through cloud-init in another. In such a case it may be expected that some templates will fail to render.
I thought rebuildable would basically be one of the successful values of the Build status. I like the suggestion of a custom mapping between this new state and the global status; in fact, we should do this for all sub-statuses. I know we have that use case for OpenSCAP, where sometimes people want errors to cause just a warning.
I assume this is out of scope for this feature. If that’s so, I’ll open an RFE for it.
I am pretty sure Foreman won’t be able to re-render all templates for 10,000+ hosts deployments.
Well, that is why we should reduce the number of re-renders as much as possible. I think there is no need to re-render all the templates assigned to the host, but only those that have changed.
We tested it in a test environment, where we have only 10 hosts, each with 9 templates assigned. It took around 1.5 sec to refresh the status of a single host.
Yes, indeed. The problem is that what I suggested in the PR is, I think, clunky - we would need to somehow track changes to the templates AND to the hosts, plus all data associated with hosts, e.g. host or host group parameters. This will quickly become a nightmare, and the newly added feature can be useless if we make a slight mistake. Honestly, I don’t like it at all on second thought.
We cannot afford such mass-rendering, which is a slow operation on its own, on a regular basis. Our biggest customers have well over 50,000 hosts and their deployments are fine-tuned to handle the load. This would kill their deployments straight-up after the upgrade. Setting the interval to 2 hours instead of 5 minutes makes the feature useless; I am pretty sure this would be one of those “turn it off after install” features.
Counter proposal: when a host, a host/hostgroup parameter, or a template is edited, there could be a subscriber of our notification API performing the rendering check. Instead of doing this in a loop, we would perform it shortly after something changed. This opens up doors for better usability - we can immediately deliver a notification to the user who made the change, because we would do the validations moments - or minutes - after the change was made, not sometime around Sunday midnight.
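Roughly what I have in mind, as a sketch; the event names and the job class are assumptions, but Foreman events are dispatched through ActiveSupport::Notifications, so a subscriber could look like this:

```ruby
# Illustrative subscriber: reacts to change events and enqueues a
# background check instead of rendering inline, so the request that
# made the change is not slowed down.
%w[host_updated.event.foreman template_updated.event.foreman].each do |event|
  ActiveSupport::Notifications.subscribe(event) do |_name, _start, _finish, _id, payload|
    # The (hypothetical) job resolves which hosts are affected and
    # re-renders only their templates.
    TemplatesRenderingCheck.perform_later(payload[:object])
  end
end
```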
We can still provide a rake task for a regular check, and customers can run it from cron if they want.
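Something like this, runnable from cron; the task name and batching are illustrative, reusing the sub-status sketched above:

```ruby
# Possible shape of the opt-in task, e.g. scheduled via:
#   0 3 * * * foreman-rake templates:refresh_rendering_status
namespace :templates do
  desc 'Re-render templates and refresh the rendering status of all hosts'
  task refresh_rendering_status: :environment do
    Host::Managed.find_each(batch_size: 100) do |host|
      host.refresh_statuses([HostStatus::RenderingStatus])
    rescue StandardError => e
      warn "#{host.name}: #{e.message}"
    end
  end
end
```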
Sure, the “big deal” referred to the fact that the status may be invalid until the next refresh. I’m also concerned about the performance, hence my questions about data.
But the result may change after any parameter value change. It’s not just the template modification that can break rendering.
Now that is concerning. At this pace it would take over 4 hours for 10k machines, while the default interval would be 5 minutes.
Can we even list all the changes that may affect the rendering? It’s potentially any touch of the OS, a parameter, host group, host, org, location, subnet, domain and many more. However, if we manage to do that, I think this would behave much better even on slightly bigger setups. With the numbers above, even 500 hosts would take over 12 minutes just to re-render.
I guess we can’t, because there are macros like dns_lookup that are totally out of our control. That’s why I believe the only chance we have is to render all templates regularly. A five-minute interval might be a bit too narrow. How about doing this e.g. once a day and making sure just one task is running?
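Making sure only one task runs at a time could be as simple as a non-blocking file lock around the loop, as a sketch:

```ruby
# Illustrative single-run guard: take an exclusive, non-blocking
# file lock; if a previous run still holds it, skip this run.
File.open('/tmp/templates_rendering_check.lock', File::RDWR | File::CREAT) do |lock|
  unless lock.flock(File::LOCK_EX | File::LOCK_NB)
    warn 'Previous check still running, skipping this run'
    next
  end
  # ... the rendering loop goes here ...
end
```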
Note that we didn’t gather any data from a production environment yet.
My counter-question then is: with a regular check, now talking about once per 24 hours, how can we make sure the information is actually up to date when the user opens the page? Neither of the two proposals is perfect.
If we are going to do that, then I suggest making this a rake task and opt-in. We need to be very careful; there aren’t many spare CPU cycles or much memory on Katello installations.
I still don’t like this; it’s a sledgehammer solution to the problem. Another proposal: what if we calculate the status the moment the user opens the page? Since there will be some lag, we can immediately show the last known state and trigger a separate HTTP request to fetch up-to-date info (no background processing involved). Since we have JS/React, the component can show that it’s still working and refresh the state a few seconds later.
This way the user will always get fresh data. It could work hand in hand with the current proposal, but I’d prefer a rake task over a recurring background job. Still, I don’t think checking all hosts every day is useful; I’d put some reasonable constraints on it - for example, only active hosts (with recent check-ins, last_modified or something).
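A rough sketch of the refresh endpoint backing that component; the controller and route are hypothetical, while get_status and refresh! follow the existing HostStatus API:

```ruby
# Hypothetical endpoint: the host page shows the stored status
# immediately, then the React component calls this action to
# recompute it on demand.
class RenderingStatusesController < ApplicationController
  def refresh
    host = Host::Managed.authorized(:view_hosts).find(params[:host_id])
    status = host.get_status(HostStatus::RenderingStatus)
    status.refresh! # re-renders the host's templates and persists the result
    render json: { label: status.to_label, reported_at: status.reported_at }
  end
end
```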
One issue with this is the API/CLI; there we would need to calculate this during the main request, or at least indicate that the value could be outdated. I don’t know, I still like the reactive solution better.
The main advantage of this feature is that you are informed about an error within some reasonable timeframe, without needing to visit the host. If you have 10k hosts, you can’t be visiting all of them after every template/parameter change. I think the rake task (or whatever other repeated mechanism) is good, but we should make the interval large enough. It’s better to know within a day (or a week) that I broke a provisioning template than to find out 6 months later, when I immediately need to rebuild the host. Chances are I still remember what I did yesterday (or this week). If we also had a mail notification for this, I would definitely connect it to recent modifications; I could also quickly look at the Audits page to see whether someone else changed something.
We should do some more performance testing first; right now it feels like guessing. We could throttle the task with a sleep between renderings if necessary. We can make the interval configurable, though I wouldn’t be against 24h as a default.
I also like the idea of checking (by the same code) on host or template save. That solves the other use case, and is a good thing to have anyway.
The first iteration was that if any change affected the rendering (the hard part is monitoring all the changes that may affect it), the status would be set to PENDING, so you’d have to wait for the status to refresh. However, I’m afraid that too long an interval would make the status PENDING most of the time.
Another solution we can consider is a new Report Template, so we can generate a report with a list of hosts and all the problems that occur during rendering.
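As a sketch, assuming the rendering result gets exposed to templates through a hypothetical templates_rendering_error helper (load_hosts, report_row and report_render are the standard report macros):

```erb
<%#
name: Template rendering errors
snippet: false
model: ReportTemplate
-%>
<%- load_hosts.each_record do |host| -%>
<%-   report_row(host: host.name, error: host.templates_rendering_error) -%>
<%- end -%>
<%= report_render -%>
```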
I still can’t get behind the idea of re-rendering all templates in a loop. But if you guys want this feature, let’s make it opt-in, and then I am fine with it.
I opened a new PR [1]. There is no extra rendering nor background processing. Instead, there are events safemode_rendered, safemode_rendering_error, unsafemode_rendered and unsafemode_rendering_error. This should make it possible to observe all rendered templates and detect errors.
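A consumer could then watch for the error events, roughly like this; the payload keys and the exact event namespacing are assumptions based on how other Foreman events are dispatched:

```ruby
# Sketch of a consumer of the new events: log every failed render.
%w[safemode_rendering_error unsafemode_rendering_error].each do |event|
  ActiveSupport::Notifications.subscribe("#{event}.event.foreman") do |_name, _start, _finish, _id, payload|
    Rails.logger.warn(
      "Rendering failed for template #{payload[:template]} on host #{payload[:host]}"
    )
  end
end
```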