RFC: Turning off auto updates from facts for NICs and OS

lzap · April 26, 2021, 7:05am

Hello,

I deal with orchestration problems pretty regularly, and today I found out that Foreman breaks orchestration via the “Update subnet/domain from facts” feature.

When a managed interface with DHCP/DNS orchestration is updated from facts, no orchestration is actually triggered. I do not think this is a bug: triggering orchestration on every Puppet fact upload would be performance suicide. However, once the Subnet, Domain or IP address is updated this way, the inventory database is out of sync with reality (the DHCP and DNS records still hold the old values). All subsequent actions on that host will ultimately fail with conflicts and hard-to-troubleshoot errors that have only problematic workarounds.

I have long thought this was a bad design: provisioning and inventory information should be kept separate. Networking information is not the only thing causing conflicts; the same story applies to the Operating System (e.g. the recent CentOS 8 vs CentOS Stream problem). However, I do not think it is the right time to reengineer this from scratch.

I would like to explore the options we have to solve this, because dealing with orchestration errors is very time consuming for both users and us. One solution that comes to mind is radical, but it makes sense: when a fact would update the Subnet, Domain or IP on a managed interface, Foreman would simply refuse to perform the update, even when these settings are turned on. We would advise users to uncheck the managed flag if they want to override what the host was provisioned with.

From the user perspective, we could present this via a new Host Status field (I have no idea for a name yet):

OK - host subnet/domain/IP is in sync with facts
IP out of sync - change it manually or change NIC to unmanaged
Subnet out of sync - change it manually or change NIC to unmanaged
Domain out of sync - change it manually or change NIC to unmanaged
Unknown - no relevant facts were reported

These statuses could nicely explain what just happened and what users need to do in order to fix the issue. We would keep the current Administer - Settings for users who want to ignore information from facts for all hosts as well, but the default behavior for managed NICs would be to ignore the changes and only update the overall Host status.
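
To make the proposed behavior concrete, here is a minimal sketch in Python rather than Foreman's actual Ruby code. Everything in it is hypothetical: the `Nic` class, the `apply_facts` function and the fact keys are illustration only, not existing Foreman APIs. It also assumes two simplifications: unmanaged NICs are always updated from facts (in reality this would still be governed by the settings), and only the first mismatch is reported when several values differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NicSyncStatus(Enum):
    """Hypothetical status values mirroring the list above."""
    OK = "OK"
    IP_OUT_OF_SYNC = "IP out of sync"
    SUBNET_OUT_OF_SYNC = "Subnet out of sync"
    DOMAIN_OUT_OF_SYNC = "Domain out of sync"
    UNKNOWN = "Unknown"


@dataclass
class Nic:
    """Hypothetical stand-in for a host interface record."""
    ip: str
    subnet: str
    domain: str
    managed: bool = True


# Order decides which mismatch is reported first (an assumption).
CHECKS = (
    ("ip", NicSyncStatus.IP_OUT_OF_SYNC),
    ("subnet", NicSyncStatus.SUBNET_OUT_OF_SYNC),
    ("domain", NicSyncStatus.DOMAIN_OUT_OF_SYNC),
)


def apply_facts(nic: Nic, facts: Optional[dict]) -> NicSyncStatus:
    """Update unmanaged NICs from facts; only report drift for managed ones."""
    relevant = {key: facts[key] for key, _ in CHECKS if facts and key in facts}
    if not relevant:
        return NicSyncStatus.UNKNOWN

    if not nic.managed:
        # Plain inventory use case: keep following the reported values.
        for key, value in relevant.items():
            setattr(nic, key, value)
        return NicSyncStatus.OK

    # Managed NIC: never touch the record, just say what differs.
    for key, status in CHECKS:
        if key in relevant and relevant[key] != getattr(nic, key):
            return status
    return NicSyncStatus.OK
```

With this shape, a managed NIC provisioned as 10.0.0.5 that reports 10.0.0.9 keeps its record untouched and the host ends up with the “IP out of sync” status, while an unmanaged NIC simply follows the reported value.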

This feels like a good compromise. What was very often seen as “mystery inventory changes” would now be well defined and visible through the UI and API/CLI. We would not affect unmanaged hosts: users who like to use Foreman as a plain inventory would see no difference. Only users with managed hosts would be affected, and they would benefit from better usability and no orchestration errors.

lzap · August 25, 2021, 6:05pm

Bump, anyone? Do I take this as “yes, remove these”?

@ekohl @tbrisker @Marek_Hulan

Marek_Hulan · September 3, 2021, 10:41am

If we want to fix it properly, we need to split the reported and user-defined values. That way we could allow the user to “apply reported” after a review. The status is perhaps a good indicator, but I’m not sure I’d like to see a warning or error when these two values differ. It may not be a warning at all. At the same time, if it is just another OK status of the host, I would never notice it. Host status shouldn’t be used to mitigate the design problem we have; users rely on it in their monitoring of Foreman health.

I think the right direction would be to clone IP, MAC, subnet_id, domain_id, operating_system_id and similar attributes to a ReportedData facet and start building on that.

TimoGoebel · September 3, 2021, 11:16am

Or we go the extra mile and move all the data to two facets:

reported data facet as described by Marek
desired state facet / provisioning facet to store user provided desired state data
The host model could then just redirect to either of the facets.
Might be the best way to keep the API stable.
Thoughts on that?
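
A rough sketch of what that redirect could look like, in Python rather than Foreman's Ruby. The `Facet` and `Host` classes, the attribute list and the `apply_reported` helper are all hypothetical, and the lookup rule (prefer the desired value, fall back to the reported one) is just one possible policy, not something decided in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Facet:
    """Hypothetical facet holding one copy of the contested attributes."""
    ip: Optional[str] = None
    mac: Optional[str] = None
    subnet_id: Optional[int] = None
    domain_id: Optional[int] = None
    operating_system_id: Optional[int] = None


class Host:
    """Hypothetical host model that keeps no copy of these attributes."""

    FACET_ATTRIBUTES = (
        "ip", "mac", "subnet_id", "domain_id", "operating_system_id",
    )

    def __init__(self) -> None:
        self.reported = Facet()  # written only by fact imports
        self.desired = Facet()   # written only by users and provisioning

    def __getattr__(self, name: str):
        # Called only when normal lookup fails, so host.ip lands here.
        if name in Host.FACET_ATTRIBUTES:
            desired = getattr(self.desired, name)
            # Assumed policy: desired state wins, reported data fills gaps.
            return desired if desired is not None else getattr(self.reported, name)
        raise AttributeError(name)

    def apply_reported(self, name: str) -> None:
        """Explicit 'apply reported' step after a user has reviewed the value."""
        if name not in Host.FACET_ATTRIBUTES:
            raise AttributeError(name)
        setattr(self.desired, name, getattr(self.reported, name))
```

Callers keep reading `host.ip` as before, which is the API-stability point: the split between the two facets stays an internal detail of the model.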

lzap · September 7, 2021, 9:30am

Why would we need to duplicate any kind of data? We already have the facts available in the database. The change is essentially about when we do the database update: instead of doing it immediately, we would refactor this so that it is done after user review.

All we need is one or more flags to indicate that there is a change pending review.

lzap · January 27, 2022, 8:15am

I would like to bring this to our attention once more, as I was digging into and resolving another problem related to this. If the list of settings we currently have is not a sign of bad design, I don’t know what is:

Administer - Settings - Facts - Update hostgroup from facts
Administer - Settings - Facts - Update subnets from facts
Administer - Settings - Facts - Update environment from facts
Administer - Settings - Facts - Ignore facts for operating system
Administer - Settings - Facts - Ignore facts for domain

There is no need to store reported data in a separate place in the database at all; we just need a flag that indicates that there is a change pending. When a user opens a host with such a flag, Foreman can easily fetch its facts, perform the explicit comparison, show the diff and ask the user to confirm on the fly. The same would apply to a mass action, but this time it would be controlled (a background task): something that can be cancelled and investigated if it goes wild.
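
Roughly, the flow could look like the sketch below. It is Python for illustration only, not Foreman code: the host is a plain dict, and the names `TRACKED`, `fact_diff`, `import_facts`, `remediate` and the `fact_drift` key are all made up. It also uses a single flag, not one per fact type.

```python
from typing import Callable, Dict, Tuple

# Hypothetical list of host attributes that facts are allowed to contest.
TRACKED = ("ip", "subnet", "domain", "operatingsystem")

Diff = Dict[str, Tuple[object, object]]


def fact_diff(host: dict, facts: dict) -> Diff:
    """Return {attribute: (current, reported)} for every tracked mismatch."""
    return {
        key: (host.get(key), facts[key])
        for key in TRACKED
        if key in facts and facts[key] != host.get(key)
    }


def import_facts(host: dict, facts: dict) -> None:
    """On fact upload: record that drift exists, never change the host."""
    host["fact_drift"] = bool(fact_diff(host, facts))


def remediate(host: dict, facts: dict, confirm: Callable[[Diff], bool]) -> Diff:
    """Compute the diff on the fly and apply it only if confirmed."""
    diff = fact_diff(host, facts)
    if diff and confirm(diff):
        for key, (_current, reported) in diff.items():
            host[key] = reported
    # Recompute so a declined review leaves the flag set.
    host["fact_drift"] = bool(fact_diff(host, facts))
    return diff
```

In the single-host case `confirm` would be the user clicking through the diff in the UI; in a mass action or rake task it could be a callable that always returns true, with the background task providing the ability to cancel.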

In other words, unless there are any objections I will go ahead and write a patch:

A new flag indicating that there is fact drift for a host (one per fact type: Puppet, Ansible, RHSM).
A new action for a single host to remediate the problem (from an arbitrary fact type).
A new mass action for multiple hosts to remediate the problem (background job).
A rake task to do the same so people can run this regularly.
Removal of all the above settings.

lzap · February 2, 2022, 9:32am

Relevant read: RFC: Store parsed facts on clients

Thulium-Drake · July 15, 2022, 2:20pm

@lzap Any news on this front?

lzap · August 8, 2022, 8:59am

Nope.