RFE: Host pre-registration feature instead of discovery

lzap · January 16, 2022, 8:58pm

Foreman Discovery is not ideal for either interactive or non-interactive workflows. It works best when a host is discovered as is and provisioned in that “vanilla” configuration: a hostgroup is assigned and no network configuration is changed. The interactive workflow via the Edit Host form does not work well. If you think that auto-provisioning is the answer, well, there is no way to change NICs via that workflow either.

The Discovery Image also has a complicated design. Puppet Facter is the tool that uploads facts; the idea behind it was to leverage Foreman’s Puppet parsing capabilities to discover hosts as STI records. Discovered nodes also run smart-proxy only to expose its API with a plugin that can perform reboot and kexec. And there’s the TUI for PXE-less workflows, which has a completely different codebase.

I took a step back over the weekend and, with pen and paper, figured out what I think is a much superior user experience for the future. It is a massive redesign; however, I want to build on components we already have. Here are the requirements for the new solution:

The new solution must provide a great experience for both interactive (pick your node, assign a hostgroup or create a brand new host from scratch) and automatic provisioning (a node is provisioned as soon as it gets discovered, or via a command).
Users must be able to completely reconfigure host configuration, including NIC changes (changing subnets, setting up VLANs, bonds, bridges).
Reported facts must be easy to modify or add (this is currently pretty tough through Facter’s custom facts).
Communication between nodes and smart-proxies must be simple and secure by default.

The solution I am proposing is all-new because I strongly believe things must change from the ground up. Discovered hosts must not be STI “unmanaged” hosts anymore, provisioning must not be an act of “editing a host”, and I think the workflow I am about to show you can also be useful for registering existing hosts. Therefore, my solution is not called discovery anymore: enter the world of host pre-registration. Oh, and it’s not a plugin anymore; you will see why in a bit.

There’s a new model class called preregistration which represents some hardware or resource that can be created later. Although these could be things other than discovered servers (hardware switches or even virtual resources like subnets or domains), let’s start small. In its simplest form, a preregistration has a timestamp and a list of facts.

Lesson learned from the past: Discovered hosts have names that by default take the form macAABBCCDDEEFF, but all our users eventually rename the hosts. Therefore I am intentionally keeping the name out of the preregistrations table. It’s useless; the MAC address is a fact. In the UI, users should be able to choose which columns to see on the index page, and since it will be extremely easy to create their own facts, a column can even be their very own fact like servertype with values like big-9843294 or tiny-3243234.

A registration has a 1:1 association to the hosts table to track whether it was utilized or not. Registrations that are in the wait queue to be provisioned have the association set to nil, obviously. This way, hosts keep their registration history (when, what facts) and registrations can also work as a “baseline” fact source for hosts that have no fact source available (no rhsm, no puppet, no ansible).
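
To make the shape of this record concrete, here is a minimal sketch in Python. It is only an illustration: the real model would be a Ruby class inside Foreman, and the names `Preregistration`, `host_id`, `pending` and `display_columns` are assumptions made for this example, not existing code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class Preregistration:
    """Hypothetical stand-in for one row of the preregistrations table."""

    # Facts as reported by the node; note there is deliberately no name column.
    facts: Dict[str, str]
    # Timestamp of when the node registered itself.
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # 1:1 link to a host; None (nil) while the record waits to be provisioned.
    host_id: Optional[int] = None

    def pending(self) -> bool:
        """True while no host has been created from this registration."""
        return self.host_id is None


def display_columns(reg, columns):
    """Pick user-chosen fact columns for an index page; missing facts show as ''."""
    return [reg.facts.get(col, "") for col in columns]
```

With this shape, an index page can show any fact as a column, including a user-defined one such as servertype, without the table needing a name field.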

From the UI perspective, there is just an index page of registrations, a detail page showing the list of facts, and searching. Each registration also has a button called “Process”, more about it later. In the first version, there is no interactive provisioning possible - everything must be done with auto-provisioning (more about it later). The plan is that once we get to the New Host form rewrite, it will have an option to create a brand new host based on a registration - users will be able to pick some values from facts (e.g. MAC address or subnet). I would like to focus on seamless auto-provisioning first. But before that, let’s talk about the discovery process.

After a LOT of thinking, I have landed on the following solution for discovery: pull-based HTTPS polling via a Python script. Polling is not only very simple to implement, it can also utilize one of Foreman’s strong features - the templating system. We can actually build everything just around our unattended endpoint. Although polling is not always preferred for scalability, for discovery it is actually fine - there are usually not a whole lot of nodes and intervals can be set longer when necessary - initiation of provisioning does not have to be instant. Finally, it works great with the idea that fact gathering should be a single Python script that can be edited - the script itself can be a Foreman template!

Now, we are Rubyists so why Python? The reason is practical. On Red Hat systems, Python is always present - there is what’s called a platform python which you cannot even uninstall easily. Second, the discovery process can also be integrated with just Anaconda via a simple %pre script, where Bash and Python are the only two options. It could be a shell script, but battling JSON is not what you want to do in Bash.

So, here is how it works. Foreman Discovery Image is built the same way as we do today but there is no Ruby, no Facter, no Smart Proxy, no TUI. Just the OS and a systemd service that starts up. Before we download any bit, let’s talk security. The protocol of choice is HTTPS, so discovery must have a CA certificate available in order to verify the server. This can be done by attaching a USB stick with a properly formatted filesystem (FS label) and a filename. A second option would be to put an X509 fingerprint on the kernel command line - the certificate would be downloaded from the server first and its fingerprint tested prior to any further communication.
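
The fingerprint option could look roughly like the following Python sketch. It assumes a SHA-256 fingerprint over the DER-encoded certificate and a hypothetical kernel parameter named `foreman.ca_fingerprint`; neither detail is decided in this proposal, and the function names are made up for illustration.

```python
import hashlib
import hmac


def fingerprint_from_cmdline(cmdline, key="foreman.ca_fingerprint"):
    """Find key=value on a kernel command line (the key name is hypothetical)."""
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name == key:
            return value
    return None


def normalize_fingerprint(value):
    """Lowercase and drop ':' separators so 'AA:BB' and 'aabb' compare equal."""
    return value.replace(":", "").strip().lower()


def fingerprint_matches(cert_der, expected):
    """Compare the SHA-256 digest of a DER-encoded certificate with the expected one."""
    actual = hashlib.sha256(cert_der).hexdigest()
    return hmac.compare_digest(actual, normalize_fingerprint(expected))
```

On a real node the command line would come from /proc/cmdline, and a PEM certificate could be converted to DER with the standard library's `ssl.PEM_cert_to_DER_cert` before hashing. Only after the fingerprint matches would the script trust the certificate for the HTTPS requests that follow.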

Then the first request would be GET https://smartproxy/unattended/preregister_script?mac=MAC1,MAC2,MAC3. The goal of this request is to render the global “Preregister Script” template. Foreman would ship a simple Python script that prints facts as JSON to the standard output. Only a few facts are really needed; users could define their own when needed. IP and MAC addresses (REMOTE_IP) are available in the rendering context so users can actually provide different scripts for different hosts (e.g. different subnets).
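
As a rough idea of how small such a script could be, here is a sketch that uses only the Python standard library. The fact names (`architecture`, `cpu_count`, `firmware`, `kernel`) are assumptions for illustration; the proposal does not fix a fact vocabulary.

```python
import json
import os
import platform


def gather_facts():
    """Collect a handful of facts using only the standard library."""
    return {
        "architecture": platform.machine(),
        "cpu_count": os.cpu_count(),
        # On Linux this directory exists only when the system booted via UEFI.
        "firmware": "efi" if os.path.isdir("/sys/firmware/efi") else "bios",
        "kernel": platform.release(),
    }


if __name__ == "__main__":
    # The facts are printed as a single JSON document on standard output.
    print(json.dumps(gather_facts(), sort_keys=True))
```

Adding a custom fact would then mean adding one more key to the dictionary in the template, rather than rebuilding an image.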

Lesson learned from the past: Discovered nodes gather facts all over again every time they report back to the server, which is every 15 minutes by default. This is not needed; facts should be evaluated only once, cached and sent unchanged. There is no point in sending the amount of free memory or how much free disk space there is at the moment. Facts are really only needed once; in case of a hot-swap, users should reboot.
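
One way to get the "evaluate once, send unchanged" behaviour is a small file cache, sketched below in Python. The cache location and the function name are assumptions; a cache under the temporary directory of a discovery image that runs from memory would naturally disappear on reboot, which matches the hot-swap advice.

```python
import json
import os
import tempfile

# Hypothetical cache location, chosen only for this example.
DEFAULT_CACHE = os.path.join(tempfile.gettempdir(), "preregistration-facts.json")


def cached_facts(gather, cache_path=DEFAULT_CACHE):
    """Return facts from cache_path, calling gather() only when no cache exists."""
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)
    facts = gather()
    with open(cache_path, "w") as fh:
        json.dump(facts, fh)
    return facts
```

Every later report then reads the same file, so the server always sees identical facts for a given boot.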

After the script is executed and facts are gathered, another request is made, this time a POST to https://smartproxy/unattended/preregister with facts in the request body. At this point, a preregistration record is created and saved, a new webhook called new_preregistration is fired (more about webhooks later) and HTTP 202 (Accepted) is returned.

The sending script is now waiting in a loop: it performs the same request every few minutes until HTTP 200 (Ok) is returned. This is the polling I was talking about - it looks like a dumb design but it is a great fit for discovery.
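
The loop itself can be sketched in a few lines of Python. The `send` callable stands in for the actual POST and is an assumption, as are the `max_attempts` parameter and the choice to raise an error on any status other than 200 or 202; the proposal only specifies those two codes.

```python
import time


def poll_until_accepted(send, facts, interval=300, sleep=time.sleep, max_attempts=None):
    """Repeat the same request until the server answers 200.

    send(facts) must return an HTTP status code; 202 means "keep waiting".
    Returns the number of attempts made, or None if max_attempts ran out.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        status = send(facts)
        if status == 200:
            return attempts
        if status != 202:
            # Assumption: anything else is treated as an error, not retried.
            raise RuntimeError(f"unexpected HTTP status {status}")
        sleep(interval)
    return None
```

In a real script `send` would wrap an HTTPS POST to the /unattended/preregister endpoint, for example with `urllib.request`, and return the response status.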

At this point, the registration appears in the UI/CLI, users can list them, show them, and if they want to actually initiate provisioning they can click on the “Process” button. The only purpose of this button is to support workflows that are called “semi-automated auto-provisioning”. That’s the “auto-provision” button we have today - host may proceed with discovery according to Discovery Rules. What it does in the new design is simple: it just fires another webhook called process_preregistration.

The last bit of the puzzle is the Foreman Webhooks plugin and I think it is pretty obvious at this point. My requirement for the solution is to give our users the much-needed flexibility of creating hosts based on discovered facts. If only we had a good and stable API they could use. Or a CLI maybe? Wait! Foreman has an API and a CLI.

Users can create webhooks written either in their stack of choice or via our Shellhooks plugin (shell script) that would do the host creation. In the input parameters for those webhooks there are registration objects: all facts that were gathered. All the rest is something that our users can do pretty easily and there are no limitations - they can build pretty much any logic into these webhooks.

Lesson learned from the past: Taxonomy and discovery are hard - every shop has different requirements. That’s why designing the discovery process to be as open and flexible as possible is key. Webhooks can be built in a way that, depending on the input (e.g. subnet), different sub-workflows in different departments can be called. We would ship some examples for Shellhooks.
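
For illustration, the decision logic inside such a webhook might look like the Python sketch below. Everything specific here is an assumption: the subnets, the hostgroup names, the fact names `ipaddress` and `macaddress`, and the shape of the returned parameters. The real webhook payload is not defined in this proposal.

```python
import ipaddress

# Hypothetical routing table: subnet -> hostgroup owned by some department.
ROUTES = {
    "10.1.0.0/16": "databases",
    "10.2.0.0/16": "webservers",
}


def pick_hostgroup(facts, routes=ROUTES, default=None):
    """Return the hostgroup whose subnet contains the node's reported IP address."""
    ip = ipaddress.ip_address(facts["ipaddress"])
    for cidr, hostgroup in routes.items():
        if ip in ipaddress.ip_network(cidr):
            return hostgroup
    return default


def host_params(facts, routes=ROUTES):
    """Build the values a webhook might pass on to the Foreman API or CLI."""
    return {
        "mac": facts["macaddress"],
        "ip": facts["ipaddress"],
        "hostgroup": pick_hostgroup(facts, routes),
    }
```

A webhook receiver would call `host_params` with the facts from the registration and then create the host through the existing API or CLI, or skip creation when no hostgroup matches.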

In the current design, Discovery Rules can be used to assign hostgroups based on fact search conditions. The same, and much more, can be done via Webhooks. Some might argue that previously users could create their own rules in the UI/API/CLI, while in the new design they would need to edit some shell scripts or build their own ruleset in their own stack. While I think that choosing flexibility over UI pays off in the long term, we could build a similar Preregistration Rules table that, instead of associating hostgroups, would fire particular Webhooks. Alternatively, users could deploy some kind of UI themselves if needed.

As I said, interactive provisioning would be delivered with the New Host redesign which is coming soon (this year perhaps). But Discovery has always been strongest in the non-interactive mode. And that would be the focus for the first version.

For PXE-less mode, I would like to drop the whole TUI and replace it with just a few questions asked via the Python script when no network connection is found. Again, the ability for users to customize such a script the way they want is the way forward. Some might actually offer a few options: provision as database, web server or load-balancer? Hit enter to create a host.
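
Such a question could be as small as the Python sketch below. The role list mirrors the example above; the function name, the prompt wording and the rule that plain Enter selects the first role are assumptions for illustration.

```python
ROLES = ("database", "web server", "load-balancer")


def choose_role(ask=input, roles=ROLES):
    """Ask until the user picks a valid number; plain Enter selects the first role."""
    menu = ", ".join(f"{i}) {name}" for i, name in enumerate(roles, start=1))
    while True:
        answer = ask(f"Provision as {menu} [1]: ").strip()
        if not answer:
            return roles[0]
        if answer.isdecimal() and 1 <= int(answer) <= len(roles):
            return roles[int(answer) - 1]
```

The chosen role could then be sent along as one more fact, so that a webhook can act on it like on any other fact.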

There you have it, this feels really flexible and easy to understand and work with. Tell me what you think!

ekohl · January 17, 2022, 5:10pm

I’m first going to jump in on the fact handling, which I understand is an implementation detail. However, it did jump out to me.

What is hard about this? Using external facts you can use any language, as long as it’s executable. You can also use plain JSON/YAML/txt files (ini-like):

While distribution may be a concern (it’s harder than a single file that you can download), I don’t think adding facts was ever a concern.

I really would like to avoid building yet another fact gathering tool.

I’m not sure shipping Python code to servers is a good idea. Ansible does that, but compatibility is not trivial. Especially if you need to support RHEL 7 or worse, RHEL 6. On such non-trivial code you want proper testing, likely with actual CI. I would be concerned with implementing that.

I would argue that it should remain possible to reuse the facts as we have them today. Just call facter --json and upload those if present. As you say, call some script and that’s the interface.

How do you intend to implement this? Do you intend to ship a helper script to download the certificate and verify the fingerprint? Will the fingerprint be on the certificate itself or on a signing CA?

These are just some things that jumped out to me. I think the general workflow can make sense.

However, one thing that will remain unchanged is that it is still an open endpoint: anyone with access to the API can (unauthenticated) create a discovered host, right?

lzap · January 18, 2022, 8:40am

Please read up on or try to implement a custom fact for discovery. You either need to create a ZIP file and put it on a TFTP server, or rebuild the whole 0.5 GB image, just to, let’s say, report a fact that represents a hostgroup you want to join. Compare that to just editing a template with a script.

I don’t think adding facts was ever a concern.

No, not really, I may have presented it as a main advantage while that is not the case. The main reason why I want to get rid of facter is that I want to be able to implement discovery directly in an OS installer environment (Debian installer, Anaconda) or even a scriptlet (%pre). There is no Ruby or Facter available there, and it turns out that users do not actually need a ton of facts in the new design.

Facter is not a tool for discovering hosts, it is a reporter for Puppet inventory facts. That is a completely different thing. You don’t take Ferrari to safari.

Good point, but you have to understand that most of our (discovery) users only need a few facts: CPUs, memory, drives, platform (BIOS/EFI), IP/MAC. It can be just a shell script because the OS installer or FDI environment is just Linux. We do not need to work on other platforms like Facter does: just call lscpu and grep the result, collect the results and call curl, work done! Facts do not even need to be JSON or structured. Anaconda today sends MAC addresses as HTTP headers and it has been working reliably for two decades (Foreman uses it when tokens are turned off, many users do this).
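
To show how little parsing this takes, here is a sketch that turns lscpu-style "Key: value" output into facts. It is written in Python rather than shell purely for illustration; the function names are made up, and `run_lscpu` assumes the lscpu binary is present in the environment.

```python
import subprocess


def parse_key_value(text):
    """Parse 'Key:   value' lines, as printed by lscpu, into a dict of strings."""
    facts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines without a colon and keys with an empty value.
        if sep and value.strip():
            facts[key.strip()] = value.strip()
    return facts


def run_lscpu():
    """Run lscpu and return its parsed output (requires lscpu to be installed)."""
    result = subprocess.run(["lscpu"], capture_output=True, text=True, check=True)
    return parse_key_value(result.stdout)
```

A discovery script would keep only the handful of keys it cares about, such as the architecture and CPU count, and drop the rest.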

There must be some kind of a wrapper (helper) script, called after boot (or from Anaconda), that kicks off everything. So yes, that script will check removable media for a CA certificate or the kernel command line for a fingerprint. It could also use a CA from the EFI firmware, although I haven’t done this myself yet.

anyone with access to the API can (unauthenticated) create a discovered host

Indirectly yes, and before you elaborate on how bad this is, let me remind you that PXE is by design remote code execution without any security. With UEFI HTTP boot you can achieve better security by enrolling your CA keys into the firmware and enabling SecureBoot. So the moment FDI or the OS installer is loaded, you are sure it was downloaded from a verified server and the code was signed by Microsoft (or shim - thus it is a Linux kernel signed by Red Hat).

By the way, I say this is indirect creation of a host - there must be either some kind of confirmation by a human to create a host, or a webhook that actually performs the creation based on facts (expected subnet, expected hardware, vendor etc). It’s not like discovery is exposing a “create host” API; it has never been like that.

Marek asked on IRC why this is named “pre-registration” when we already have the “registration” feature. This is because I think the current “registration” works great with this feature and everything could be integrated together. Instead of directly registering a host via a command, users could pre-register it, and then the host would actually be created based on human activity (confirmation with the Process button) or via webhook rules. This is an improvement, I believe.

bhawksfan · January 18, 2022, 4:56pm

This sounds really interesting, with a lot of positives. I particularly like the concept of the web hooks rather than the existing rules engine.

Based on my personal experience, I would caution you against templating the fact generating script.

I’m a python developer, so I definitely welcome the use of python in this, but templates are difficult to get right, especially with languages that are highly sensitive to leading whitespace as python is. I’m not sure what other options would be available, perhaps allowing a static script to be downloaded.

If using foreman’s templates is the only feasible way to implement it, that’s fine. Obviously, a template can be the entire script without any ERB within it.

I’m not quite clear how this would work within %pre, since there’s not really enough information to base the rest of the kickstart template on. For example, the kickstart template wouldn’t know what the machine’s FQDN is supposed to be during the pre-registration phase.

lzap · January 19, 2022, 6:25am

Noted. There was another concern raised that older systems might ship a different Python version (python2 vs python3). So I am now actually considering a shell script to gather facts rather than Python. And instead of JSON facts transferred in the HTTP body, we would use HTTP headers (key-value strings). We really do not need to find much info.
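
As a sketch of what facts-as-headers could look like, the Python snippet below maps a flat dictionary of facts to header names and values. The `X-Prereg-` prefix and the naming rule are assumptions for illustration; no header format has been decided. It is written in Python only to show the mapping, which a shell script could do just as well with curl's -H option.

```python
import re


def facts_to_headers(facts, prefix="X-Prereg-"):
    """Map flat facts to HTTP headers, e.g. cpu_count -> X-Prereg-Cpu-Count."""
    headers = {}
    for key, value in facts.items():
        parts = [p.capitalize() for p in re.split(r"[^A-Za-z0-9]+", key) if p]
        text = str(value)
        # Header values must stay on one line; reject embedded line breaks.
        if "\r" in text or "\n" in text:
            raise ValueError(f"fact {key!r} contains a line break")
        headers[prefix + "-".join(parts)] = text
    return headers
```

This only works for flat key-value facts, which fits the point that very little information is actually needed.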

The main idea behind Anaconda-Pre is here: Idea: Dracut/Anaconda-based discovery image

Now, FQDN is a fact we will definitely not send in the new design - why would you need that? On the discovery image it is always fdi, unless overwritten by DHCP, which I think is turned off anyway. If someone needs some extra information (let’s say a vendor option from DHCP), a custom fact can be made.

viwon · January 26, 2022, 9:10pm

Ok, I like the idea of being able to create a host record ahead of time, then associate it with a system/hardware and kick off the build.

That would make it easier for those people who do static IP assignment and get all their hostnames/IPs pre-assigned; they could enter that sort of data in the host config ahead of time.

This would also be interesting from an automation perspective: use an API to preload a number of host build configuration records, and then kick them off as the hardware becomes available.

And, that workflow maybe lends itself to integration with external end user self-service systems upstream, where the user enters the host info first before the system is built.