Breaking provisioning into pieces

Dyrkon · March 28, 2024, 2:04pm

Hi everyone, I would like your opinions on breaking provisioning into smaller pieces (adding more states).

The current state

There are currently two build-state groups:

1. The build phase is done
1.1. BUILT - Provisioning was successful and the host is ready to use.
1.2. BUILD_FAILED - Provisioning failed because some part of it didn’t go as planned.
1.3. TOKEN_EXPIRED - Provisioning took too long, the token used for provisioning expired, and the build is considered failed.
2. The host is being built
2.1. PENDING - Provisioning templates and so on are being executed.

The problems

As you can see, three states indicate that the build is done and how it went. That is useful after the machine is built, but while provisioning is still running we have little visibility into how far it has progressed.

The upcoming changes

In this PR, I am working on moving the post section so that it runs after the built host’s first restart. There, I ran into the limited expressiveness of the states that are currently available.

The solutions

After some internal discussion, we think it would be nice to have multiple build states that indicate which stage the host is in:

PENDING - Same as above.
BUILT - The host has finished building from the provisioning template and is ready for its first restart.
UP - The host has successfully restarted and is performing its post-first-boot tasks.
PROVISIONED - The host is done with all of its post-first-boot tasks and is fully prepared for further use.

The future prospects

This build-state rework would allow for increased granularity of provisioning tracking, with the possibility to split the process into more phases in terms of reporting and possible defined actions for given phases.

The part where you come in

Sadly, the finances for the creative naming department have run out, so I need you to come up with better names for the states. You are also more than welcome to share your thoughts on this change and how to make it better!

Implementation

Here is a link to a PR with a proposed implementation of the state machine.

ekohl · March 28, 2024, 3:18pm

It would be good to have a state diagram for the various states. I think the current diagram is (using https://mermaid.live):

stateDiagram-v2
    [*] --> PENDING
    [*] --> BUILT
    PENDING --> BUILD_FAILED
    PENDING --> BUILT
    PENDING --> TOKEN_EXPIRED
    BUILD_FAILED --> PENDING
    BUILT --> PENDING
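
For anyone who prefers code to diagrams, the same edges can be written as a lookup table. This is only a sketch transcribed from the diagram above; `CURRENT_TRANSITIONS` and `can_transition` are made-up names for illustration, not existing Foreman code.

```python
# Transition table transcribed from the diagram above.
# None stands for the diagram's [*] start marker.
CURRENT_TRANSITIONS = {
    None: {"PENDING", "BUILT"},
    "PENDING": {"BUILD_FAILED", "BUILT", "TOKEN_EXPIRED"},
    "BUILD_FAILED": {"PENDING"},
    "BUILT": {"PENDING"},
    "TOKEN_EXPIRED": set(),  # the diagram shows no outgoing edge here
}


def can_transition(source, target):
    """Return True if the diagram has an edge from source to target."""
    return target in CURRENT_TRANSITIONS.get(source, set())
```

Writing it this way makes gaps easy to spot: for example, `can_transition("TOKEN_EXPIRED", "PENDING")` is false under the diagram as drawn.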

In the past, @TimoGoebel suggested providing fine-grained states. For example, when kickstart is running you have %pre, %post, etc. If you can store those (perhaps even in a free-form substate field) and audit them, then you get more visibility into how far the build got without having to look at the console.
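
A minimal sketch of what such a free-form substate with an audit trail could look like. Everything here (`BuildProgress`, `report`, `audit_log`) is hypothetical and only illustrates the idea; it does not reflect how Foreman stores build status.

```python
from dataclasses import dataclass, field


@dataclass
class BuildProgress:
    """Hypothetical record: a coarse state plus a free-form substate."""
    state: str = "PENDING"
    substate: str = ""
    audit_log: list = field(default_factory=list)

    def report(self, substate):
        """Store the new substate and remember the change for auditing."""
        # Each entry is (state, previous substate, new substate).
        self.audit_log.append((self.state, self.substate, substate))
        self.substate = substate


progress = BuildProgress()
progress.report("%pre")
progress.report("%post")
```

After these two calls, `progress.substate` holds the last reported phase and `progress.audit_log` records how the build got there.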

Thinking out loud: I think your proposal to add UP would essentially be a substate of PENDING, but on the other hand, I do think you need to change some templates like the TFTP/HTTPBoot config to make sure it boots from local disk.

Can you further explain which state transitions you envision?

Dyrkon · March 28, 2024, 5:44pm

This change is mainly linked to the first boot service PR, which will make it possible to run some user-defined commands after the host restarts for the first time. This means that the machine is built and UP (feel free to suggest a better name), as in running, while the setup defined in the first boot service is still executing. When that ends, the host should be PROVISIONED.

The problem with simply marking it pending is that the post-first-boot setup, where package installs, updates and so on might be happening, can run longer than the initial phase (pending to built).

Lennonka · April 4, 2024, 6:28pm

A couple of ideas for the naming.
For the UP status: REBOOTED, POST_BUILD

Marek_Hulan · April 12, 2024, 7:00am

To provide a bit more context for this: the linked PR suffers from one issue. If we move the call home to the first boot service (after the restart), the TFTP configuration won’t change and the machine will boot the installer again during that restart, ending up in an endless loop (well, until the token expires). Therefore we need more granularity. The current “built” state essentially needs to mean: this is not final yet, the provisioning token is still valid for the after-reboot configuration, but TFTP should already be set to boot from the local HDD.
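
The boot decision described above can be sketched as a small function. The name `boot_target` and the return values are made up for illustration, and the state names are the ones from the original proposal; the real logic lives in the TFTP/HTTPBoot templates.

```python
def boot_target(build_state):
    """Pick what the boot config should point at (illustrative only)."""
    # While the installer still has to run, keep booting it.
    if build_state == "PENDING":
        return "installer"
    # Once the OS is installed, the reboot must land on the local disk,
    # even though the first boot service has not finished yet.
    return "local_disk"
```

With only PENDING and a final BUILT available, delaying the call home keeps the host in PENDING across the reboot, which is exactly the loop described above.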

We also discussed adding more states, such as when %post starts, as you mentioned. That will be an easy follow-up after we introduce higher granularity here.

Similarly, we should always be able to switch the state explicitly, instead of calculating it at runtime like we do today for token expiration. We now have the ability to schedule a task for the future. So once the build process starts, we can schedule “switch the state to expired in 60 minutes if the status is still pending”. Of course, the time will be based on settings, and we’ll have to detect that it’s still the same provisioning run based on the provisioning token.
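
As a rough sketch, the body of such a delayed task might look like the following. The `Host` class and `expire_if_still_pending` are hypothetical stand-ins, and the scheduling itself (whatever runs this after the configured delay) is left out.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical stand-in for a host record."""
    state: str
    token: str


def expire_if_still_pending(host, token_when_scheduled):
    """Body of the delayed task; returns True if it switched the state."""
    # A different token means a newer provisioning run has started,
    # so this stale task must not touch the host.
    if host.token != token_when_scheduled:
        return False
    # Only a build that is still pending can expire.
    if host.state != "PENDING":
        return False
    host.state = "TOKEN_EXPIRED"
    return True


stuck = Host(state="PENDING", token="abc")
rebuilt = Host(state="PENDING", token="new")
done = Host(state="BUILT", token="abc")
results = [expire_if_still_pending(h, "abc") for h in (stuck, rebuilt, done)]
```

Only `stuck` is switched to TOKEN_EXPIRED; the host that was rebuilt with a new token and the host that already finished are left alone.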

For the state names, Built suggests that the whole process is done, since the field is called Build status. How about:

Pending → OS Installed → Rebooted → Provisioned
Pending - user started provisioning
OS Installed - today’s call-home point (last command in %post)
Rebooted - the first command in the first boot service
Provisioned - the last command in the first boot service
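
The chain above can be sketched as a strictly linear state machine. `BuildState`, `ORDER` and `advance` are illustrative names only, and failure states such as a failed build or an expired token are deliberately left out of this sketch.

```python
from enum import Enum


class BuildState(Enum):
    PENDING = "Pending"
    OS_INSTALLED = "OS Installed"
    REBOOTED = "Rebooted"
    PROVISIONED = "Provisioned"


# Enum members iterate in definition order, which matches the chain.
ORDER = list(BuildState)


def advance(state):
    """Return the next state in the chain; Provisioned is final."""
    index = ORDER.index(state)
    if index == len(ORDER) - 1:
        raise ValueError(f"{state.value} is the final state")
    return ORDER[index + 1]
```

Each call-home point would then map to exactly one `advance` call, which keeps the state explicit instead of derived.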

We may want to add more states, e.g. when the Kickstart is fetched (Installing OS) and when the first %post command runs (Configuring OS), but that’s IMHO a separate task.