RFC: Don't apply the compute profile when updating the host

lstejska · August 28, 2024, 9:17am

Customer’s situation:

The customer has two compute profiles for VMware:

Small (2 CPUs, 2048 RAM)
Large (4 CPUs, 4086 RAM)

They encounter an issue when attempting to update the host from Large to Small using the API / Hammer. The following error is displayed:


Failed to update a compute env-vcenter8.infra.sat.rdu2.redhat.com (VMware) instance lucy-selkirk.vmware.local: 
CpuHotPlugNotSupported: CPU hot plug is not supported for this virtual machine.

Observations

The error is in the API ONLY, works fine in UI form
Setting {Memory, CPU} hot add on Compute profile has no effect.
In Libvirt, the VM does not change at all. Once the numbers are set, they are set. Changing CP in Foreman has no effect.
The documentation [0] describes computing profiles as a tool for creating hosts, with nothing about updating.
Changing CP on the host group does not affect the hosts in the host group. Which might be (un)expected by users.

Possible solution

I suggest not updating compute profiles on host updates. This change can be confusing and may result in undesired expectations from the user’s side. The POC PR [1] is small, and it is straightforward to cherry-pick it to the older versions, which we need for the customer.

Later, we should update the documentation, clearly state the functionality, add some tooltips to the forms, deprecate the APIs, etc.

What do you think?

Links

[0] Provisioning hosts

[1] WIP Fixes #TODO - Don't apply the compute profile when updating host by stejskalleos · Pull Request #10291 · theforeman/foreman · GitHub

areyus · August 28, 2024, 1:17pm

I am all for this change, since we have been affected by this, too.
A bit of additional context/info I have found while researching this:

There is this long-standing issue on the bugtracker that I believe to describe the same problem just from a different view point.
The workflow via the UI seems to be working only because the UI always sends all the host info with the PUT request, including the complete the complete compute_attributes. That overrides anything that would be merged from the compute_profile, resulting in an accidental workaroung to the problem.

I also agree with your approach of removing the re-application of CP data on each save. I do not see any real-world benefit to this behaviour and instead it causes a lot of headache.
If compute-profiles are intended to work like hostgroups (i.e. you can change a compute-profile and that automatically alters all hosts with that CP) a lot of things would need to work differently than they do now (e.g. changes being applied on CP change and not on host change, behaviour being consistent with the UI, CPs being able to only be filled partially, etc). So I agree that removing this unexpected behavior is the better solution that fits the current purpose and real-world use-cases of CPs more.

ekohl · August 28, 2024, 1:35pm

If we go this route, I’d prefer to not save the compute profile at all for a host. I’m not sure if it’s feasible, but having a reference that is then not used again doesn’t make sense to me.

That effectively makes it more like the compute attributes that you can send via the API, with the exception that it does have a UI element to it.

AFAIK Foreman doesn’t have a way to recreate a VM it couldn’t be used in that case.

Lennonka · August 29, 2024, 7:09am

Sounds reasonable to me.

lstejska · August 29, 2024, 8:34am

Yop, we could remove the reference and use the values from the profile only when the host is created. We’ll update the documentation to explain that it’s used only for creating hosts. We can add tooltips to the form and disable compute profiles on host edit. That could be, let’s say, follow-up phase two.

Now, I need to find a fix that can be easily cherry-picked to (very) old releases with minimal changes; one customer is blocked on that.
The POC PR is a one-line change. While I agree it’s probably not the best PR in history, it does the job and fixes the problem with a one-line change.

@ekohl thoughts?

ekohl · August 29, 2024, 8:53am

I always find this difficult, because such a change in behavior may be desired for one while undesired for another.

For this specific customer, isn’t it viable to unset the compute profile for the host when updating? I’d expect that doesn’t affect the actual VM configuration and also effectively work around the problem.

lstejska · August 29, 2024, 10:35am

Well, we could be the judges

I’ll ask in the ticket.

areyus · August 29, 2024, 12:19pm

While I agree in general with this statement, for this specific case I tend to disagree. From my personal user perspective, I would consider the current behavior a bug rather than a feature since:

As mentioned in the RFC, the current behavior contradicts the documentation
The current behavior is inconsistent between the UI experience and the API experience (if only because the UI always sends the whole host object instead of only the changes)
If the current behavior was actually the intended behavior, selecting a Compute Profile should disallow other changes to the VM hardware. Alternatively, Forman would need to save the intended VM configuration separately with the host if changes from the compute profile configuration were allowed.

In any case, I believe there should be some change to at least make the behavior consistent and properly documented. If the conclusion is that the current behavior is the intended behavior, that should be properly documented and be consistent with the UI behavior.

ekohl · August 29, 2024, 12:42pm

Perhaps it wasn’t clear enough, but we combined 2 things in 1 topic: short term fix for a customer and long term fix.

The above comment was explicitly about the short term fix. Changing this kind of behavior in a patch can be surprising.

Long term I agree we should change something, but then we can clearly introduce it in the release notes as a change in behavior.

As an experienced sysadmin I’m sure you have you have been bitten by an unexpected change. Those always annoyed me so I’m trying my best to avoid this for Foreman users when I can.

Marek_Hulan · August 29, 2024, 1:14pm

Well, when you clone the host, that reference may still be useful. But if it cloned also the vm attributes, I’m fine with that.

I think we should go with the solution that comes with minimal surprise. Imagine you change a host group of a host and meanwhile the compute profile has been changed. Would you expect all your production servers have now less RAM or CPUs, especially if you didn’t see any warning that this may happen? That’s the reality of how it works today. I think that shouldn’t be the default behavior.

And I agree with all arguments that @areyus made.

I think the logic still applies even to the short term fix. This is a bug and bug need to be fixed before it causes problems to more users. In this case, the correction does not make it worse and is not destructive. One could always modify the vm attributes manually afterwards. Given though that vm attributes don’t change when compute profile is changed on host group level or compute profile itself gets updated, I think this is not something people would use or rely on.

ekohl · August 29, 2024, 1:50pm

Warning: this is more of a meta-discussion about how I see what we’re talking about here.

I think it should clone the actual VM attributes (if possible). As you noted before: if the memory was increased outside of the profile you want to clone that too.

I think we’re all in absolute agreement on the goal. It can be totally acceptable to include this change in a stable release if accompanied by clear release notes. All about managing expectations.

This reminds me of the “delete VM when host is removed” discussion. It’s hard to find a common expectation among users.

My suggestion to remove the reference from the model would IMHO make it obvious: if there’s no property on the host, users don’t expect it to be honored.

It has worked like this for a long time. It may be urgent to one customer, but if we can provide an easy workaround then that’s even faster to deliver than an actual code patch. My mind always goes through this hierarchy:

Workaround: an immediate solution to unblock the user
Hotfix: minimal code change that unblocks the user, usually applied by patching the local installation
Bugfix: code delivered through the regular channels
Change: long term solution that really resolves the issue

We’re discussing multiple layers at the same time in this thread. Note that none of these exclude each other and in fact can complement each other. We’re now exploring the problem space and potential solutions.

Marek_Hulan · August 30, 2024, 7:41am

It seems we agree on the need and the long term solution, so I’ll focus more on the short term and overall resolution process.

I agree, that’s a good long term solution.

It also reminded me the delete the VM issue from the past. Back that we were concerned with changing something that behaved one way for a long time even though it caused some real problems in production. The change turned out to be only positive, I don’t recall any negative reaction. If the system is buggy and behaves against the documentation, I think it’s OK to change the behavior quickly, even if it’s like that for a long time. Of course, it’s not a universal rule, but it’s a thinking guide.

I think that’s a great approach.

The workaround has been thought about prior the RFC was open. It was suggested to use UI but the user is searching for something, that can be automated and used from scripts. Disassociation of compute profile was thought about too, but they need the association on the host group level (for provisioning the new hosts in that host group). This is still something they could do in case the amount of affected host groups is not huge and they would stop provisioning until the change is performed. It’s a one time operation workaround, not for until this is fixed properly. Manual changes through console or DB seemed bad.

Hotfix would be possible for sure, but I think we should avoid it if we don’t have a fix ready for “very soon” release. The problem with the hotfix is, you need to reapply that after every update. It seems the desired solution would not be available before 3.13.

Hence we are at the bugfix stage.

lstejska · September 4, 2024, 8:58am

@Marek_Hulan @ekohl, do we agree on how to fix the bug? In the PR, I’m removing a line that doesn’t work, not changing the behavior much, except not raising an error for something that doesn’t work anyway.

Marek_Hulan · September 5, 2024, 8:40am

I think it is correcting the behavior that was wrong for a long time. Given how long, there was a valid concern raised. I’m personally fine with the fix, because the workaround nor hotfix seemed like a viable option. @ekohl would you agree?

ekohl · September 10, 2024, 12:19pm

It sounds valid to me, but I’d like to make sure we don’t forget about the proper solution. Otherwise we’re just stacking hacks on top of hacks.

lstejska · September 11, 2024, 1:20pm

Created a tracker issue: Tracker #37806: Compute profiles & host creation - Foreman