Foreman provisioning strategy

One of the most interesting conversations I had at CfgMgmtCamp was about the future of provisioning in Foreman.

Current state

  • Hard to trace and as a consequence hard to debug
  • Transactional
  • Uses model’s lifecycle hooks to manage external services
  • A lot of different paths which contribute to overall complexity of the code
  • Need to support provisioning methods for new operating systems, such as bootable containers

We talked about ways to change this and to form a clear strategy for maintaining the provisioning part of Foreman.

We talked about two options that we currently have on the table:

Option 1: Actor-based provisioning

The idea is to divide the provisioning process into multiple actors, each responsible for a single aspect of provisioning: one for the DHCP record, one for the DNS record, one for compute resource creation, etc.
The user will select which actors will be used to provision a given machine.
The actors will consume and contribute data on the host model. Each actor decides what data it needs in order to run, and at the end of its run contributes the new data to the host object.
For example, the DHCP actor will consume a MAC address and contribute an IP address; the DNS actor will consume the IP address and provide an FQDN, and so on. The orchestration of the actors can be performed as a single task with multiple parallel steps that wait for the required data to appear on the object.
The process ends when no actor can contribute more data.
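The consume/contribute cycle described above can be sketched as a small fixed-point loop. This is a hedged illustration, not actual Foreman code: the `Actor` struct, the actor list and the contributed values are all made up for the example.

```ruby
# Illustrative sketch of data-driven actor orchestration (not Foreman code).
# Each actor declares the host keys it consumes and contributes; the loop
# runs every ready actor until no actor can contribute more data.
Actor = Struct.new(:name, :consumes, :contributes, :action) do
  def ready?(host)
    consumes.all? { |key| host.key?(key) }
  end

  def done?(host)
    contributes.all? { |key| host.key?(key) }
  end
end

ACTORS = [
  # DHCP: consumes the MAC address, contributes an IP address (value is fake)
  Actor.new("dhcp", [:mac], [:ip],   ->(h) { h[:ip]   = "192.0.2.10" }),
  # DNS: consumes the IP address, contributes an FQDN (value is fake)
  Actor.new("dns",  [:ip],  [:fqdn], ->(h) { h[:fqdn] = "host1.example.com" }),
]

def provision(host, actors)
  loop do
    runnable = actors.select { |a| a.ready?(host) && !a.done?(host) }
    break if runnable.empty?                    # nobody can contribute: done
    runnable.each { |a| a.action.call(host) }   # candidates for parallel runs
  end
  host
end
```

Starting from `{ mac: "aa:bb:cc:dd:ee:ff" }`, the loop runs the DHCP actor, then the DNS actor, then stops, since no actor can contribute further data.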

Advantages

  • Asynchronous
  • Traceable (through the checklist of actors)
  • Uses proper tasks
  • Provides a clear view of which tasks are going to be executed
  • A bit more mature

Disadvantages

  • Does not remove complexity from a single provisioning run
  • Does not reduce the number of provisioning paths

Option 2: Image oriented provisioning

The idea is to generate a provisioning image on the Foreman side before the provisioning begins, using image-generating tools like mkosi. Once the image is generated, Foreman will use a specially-crafted provisioning image (similar to what we have in Discovery) to write the operating system image to the disk. The provisioning image can be booted by any means available in the network: PXE, Redfish or even a physical USB stick. Of course, Foreman will be able to help with the task of distributing the image.
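For illustration, a host image in this model might be described by a small mkosi configuration. The section and key names below follow mkosi's documented `mkosi.conf` format, but the concrete values are hypothetical, not a proposal for what Foreman would generate:

```ini
# Hypothetical mkosi.conf for a Foreman-built host image (values illustrative)
[Distribution]
Distribution=fedora
Release=40

[Output]
Format=disk

[Content]
Packages=
        openssh-server
        cloud-init
```

A per-host or per-hostgroup configuration like this is testable and repeatable on its own, independent of Foreman, which is where the "outsourced" support-burden advantage below comes from.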

Advantages

  • Asynchronous
  • The process of creating the image is outsourced, reducing the support burden on Foreman
  • The created image is testable and repeatable
  • No dependency on how the image is built
  • No dependency on how the installer is booted
  • A single way to deploy any operating system

Disadvantages

  • A revolutionary approach
  • A new way to deploy; may require developing a new tool (the provisioning image)
  • Less mature

Next steps

I would like us to converge on a single path and have a clear strategy for the direction we want to take with provisioning. Once we decide on the proper strategy, we can start designing the smaller steps that will get us there eventually.

Thanks to @ekohl, @lstejska, @nofaralfasi, @Jan and @m-bucher for bearing with me on this one!


Adding @goarsna


I guess with Option 2, there would be some time during which we support both the existing workflows and this new one. I think there's more we need to think through if we want to fully transition to an image-based approach, such as how we build the images, how we maintain them, etc. But in general, I'd be in favour of Option 2, as it has a better overall outcome. The first option feels more like a refactoring that solves only some of the issues.

Given how few responses this has got so far, I think for such a topic we need to ask outside of Development as well. Perhaps mention this on the community demo, so we hear from users. For that, we may want to come up with some more user-centric descriptions of the options.


Sure! I’ll think of a way to present the options properly.

And yes, option 2 feels more like a revolutionary change that will require some transition period.

Meaning foreman_tasks or something else?

A few questions about Option 1:

  • Rollback—What if one of the many actors fails? Does each actor know how to roll back itself and the whole process?
  • If multiple actors are waiting for multiple conditions (aka finished actors), you will need a monitoring controller to decide which actor will run and order the execution.
  • As the actor dependency grows, it could get messy quickly. A graph tool visualizing the relationships between the actors might help.
  • Extension - how can the list of actors (and their deps) be extended from the plugin?

Option no. 2 - images
I posted my summarization in the RFC: Support provisioning with bootable images.

We could use an image builder to generate the images. Sadly, one big downside is that it’s host OS-dependent. For example, you can’t create CentOS images on RHEL.

What’s making option 1 so hard to debug? (I’ve had a few problems, but they were mostly down to my lack of understanding.)

Foreman has multiple components that each do a single job: DHCP, DNS, compute resources. They hook into OS-native and industry-standard tooling and methods (kickstart, Ubiquity, preseed), as well as image-based deployments on virtualisation platforms that support them. The logging is fine for most situations: you can see when a DHCP host doesn’t get a record, or when the DNS update doesn’t match the IP, and the on-screen messages from Foreman are normally pretty good. There are some tricky areas, such as when TFTP isn’t working properly, but for the most part it’s not too complex.

Yes, I think so. Although we can switch to ActiveJob if foreman_tasks turns out to be too heavy for this. Let’s have this discussion closer to implementation, since it feels like a premature optimization at this point.

This process is not transactional anyway, so the best we can do is compensation. As a rule of thumb, I would borrow ideas from the Saga pattern: we will send compensation messages to all participants that have already done their part.
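Saga-style compensation for an actor chain could look roughly like the following. This is a hedged sketch with made-up names, not a proposed API: each completed step records an undo action, and a failure triggers the recorded compensations in reverse order.

```ruby
# Illustrative Saga-style compensation (names are made up for the example).
# Each successful step records its undo action; when a later step fails,
# all recorded compensations run in reverse order before the error is re-raised.
class Saga
  def initialize
    @compensations = []
  end

  # Runs the given block; on success, remembers how to undo it.
  # On failure, compensates everything completed so far and re-raises.
  def step(name, undo:)
    yield
    @compensations << [name, undo]
  rescue => e
    rollback
    raise e
  end

  def rollback
    @compensations.reverse_each { |_name, undo| undo.call }
    @compensations.clear
  end
end
```

For example, if a hypothetical DHCP step succeeds and the DNS step then fails, the saga runs the DHCP undo (removing the record that was just created) before surfacing the DNS error.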

I assume each actor only adds information, which means that once the conditions for an actor to run are met, the order in which actors run does not matter. We need an orchestration layer anyway, to understand which actors can run and to record progress. At least theoretically, if more than one actor is ready to run, we can execute those actors in parallel.

As with any other extension points: we will have a registry that any actor from any plugin can register itself into and this is the list that will be iterated for each provisioning.
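Such a registry could be as simple as the sketch below. The module and method names are illustrative assumptions; the real extension point would live in Foreman's plugin API.

```ruby
# Illustrative actor registry for plugins (names are hypothetical).
# Core and plugins both register actors; provisioning iterates the full list.
module ProvisioningRegistry
  def self.actors
    @actors ||= []
  end

  def self.register(actor)
    actors << actor
  end
end

# Core registers its built-in actors...
ProvisioningRegistry.register(:dhcp)
ProvisioningRegistry.register(:dns)
# ...and a plugin adds its own from its engine initializer:
ProvisioningRegistry.register(:my_plugin_actor)
```

The orchestration layer would then iterate `ProvisioningRegistry.actors` for each provisioning run, so plugin actors participate exactly like core ones.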

It’s solvable if we properly wrap it in a smart proxy interface.

For Option 2, is it the intent that the image-building process be agnostic? If Foreman integrates with a particular “image creator”, then the image types, OSes, etc. are limited by the image-creation software and are not within the Foreman project’s control. Granted, mkosi has a huge OS list.

The other aspect is image sprawl. The notion of reproducible image builds is important. Without it, and the appropriate foreman user/administration discipline, I can see users wanting to save images for long periods of time… great for storage providers, bad for everyone else. We have seen this before (and will see it again <cough> container registry </cough>.)

Just a couple of thoughts from the back row of the theatre for a Friday morning. :slight_smile:

As I have mentioned earlier, if we properly wrap the image creation in a smart proxy interface, we should be pretty agnostic to the builder in use. This should even give us the option to use multiple “builders” with a single Foreman instance.

Thanks for bringing this up! My original thought was to have semi-permanent images for hostgroups (used as a base for creating individual host images), with the host images being more transient: for example, removing the image after a successful deployment, since there is no further need for that specific image.

The predictability of the process: it would be possible to see which actor has been invoked and what it contributed (as long as we keep good track of individual actors). I think the main issue would be with understanding why a particular actor was not invoked. It is also solvable, but I think it complicates the debugging process.

I think that option 1 will require a very good understanding of the process from the user. The dynamic character of the process makes it less intuitive to the end user, and will require a careful way of communicating it.

<digression>
That sentiment has been expressed by customers generally with respect to The Foreman/Satellite!
Not necessarily individual components, but how to best take advantage of The Foreman as a whole.
</digression>

I was thinking that was your intention. My point is about what the customer will do: where allowed to store images long term, they will, out of fear that builds are not reproducible, leading inevitably to sprawl.

Questions:
In what deployment configurations do images not work?
In what deployment configurations does assembly(?) not work?

These are other ways to look at the problem space.

I am thinking about bare metal provisioning, which is not properly solved for images. I know that Anaconda should be OK-ish with such deployments, but I am not sure about other operating systems. Since I want this process to be a substitute for “traditional” provisioning, I want a single way that is agnostic to the OS on the image. To put it another way, I want a single mechanism that will deploy any image generated by Foreman to any machine, ideally with as little variance as possible, to reduce the provisioning code base.

I suppose you are referring to Option 1: since it’s very similar to what we have today, it will work with everything we already support. The only thing that will not be supported, unless we put in some effort, is image-based installs like bootable containers.
This option poses a couple of issues: it’s complex for the user, it does not reduce the support burden on us, and it requires extra effort for bootable container deployments. The latter seems even more problematic to me, since from my point of view the world is beginning to shift in that direction. I think that if we want to stay relevant in the near future, we have to offer support for this emerging world of bootable images/containers.

This new image, how similar will it be to https://netboot.xyz/?

How would this model handle something like:

"I need to build 170 system that:

  • have mirrored boot and root drives
  • all drives are luks encrypted
  • all drives are bound to tang + tpm
  • the file system layout needs to meet CIS2 / DISA STIG
    "
    I don’t think an image build can do this.

P.S. this is an actual customer situation.
P.P.S. we built this with Satellite and the customer is speaking with us at Summit :slight_smile:
P.P.P.S we deployed the datacenter in less than 8 hours using satellite ansible modules :slight_smile:
P.P.P.P.S. we built and fully configured the satellite server in about 4h with same.
P.P.P.P.P.S. we actually did it twice, first in PA, then in AZ. PA took us a lot longer (5 days).


We will need to do this again.
These were bare metal systems.
