Hi,
We currently have an issue provisioning RHEL 9 machines: provisioned hosts drop into emergency mode after the initial reboot.
The cause is the package upgrade that runs in the %post section of the Kickstart template. Luckily, we have a temporary workaround: disable the package upgrade in %post and run it after the host has rebooted and been provisioned.
This leads us to this RFC. I would like to propose adding a new one-shot systemd service, foreman_first_boot, that would run on the first machine start and execute a user-defined template.
The service template could contain package upgrades and other tasks that we currently do in the %post section.
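To make the proposal a bit more concrete, here is a minimal sketch of how %post could drop such a unit in place. Everything beyond the service name foreman_first_boot is an assumption for illustration: the helper name write_first_boot_unit, the script path, and the marker file are hypothetical and not taken from any existing Foreman template.

```shell
#!/bin/sh
# Illustrative sketch only; paths and names are assumptions.

# write_first_boot_unit UNIT_DIR SCRIPT
# Writes a one-shot unit into UNIT_DIR that runs SCRIPT once and then
# leaves a marker file so the unit is skipped on later boots.
write_first_boot_unit() {
    unit_dir=$1
    script=$2
    cat > "$unit_dir/foreman_first_boot.service" <<EOF
[Unit]
Description=Foreman first boot tasks (illustrative)
Wants=network-online.target
After=network-online.target
ConditionPathExists=!/var/lib/foreman_first_boot.done

[Service]
Type=oneshot
ExecStart=$script
ExecStartPost=/usr/bin/touch /var/lib/foreman_first_boot.done

[Install]
WantedBy=multi-user.target
EOF
}
```

In %post this might be called as `write_first_boot_unit /etc/systemd/system /usr/local/bin/foreman_first_boot.sh`, followed by `systemctl enable foreman_first_boot.service`. The marker file is only touched when the script succeeds, so a failed run would be retried on the next boot; whether that is desirable is a design choice.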
My first impression is that we're building more hacks. Instead, I'd prefer a completely different approach.
I think we should avoid using %post as much as possible. It was always a hack and we can blame RHEL subscriptions for most of it.
The backstory is that when you provision RHEL you need a subscription. However, there was no way to get that subscription enabled in the regular install, so there's a hack in %post. After that you can install additional packages there instead of using %packages. This was never needed for Fedora and CentOS, but for compatibility we used a single approach for all of them.
The good news is that the support has landed in Anaconda:
So things will improve a lot if we start using those macros once they're available.
I think this is the route to take. Once you have that, you don't need the upgrade at all: it simply fetches the latest version on installation.
I took a stab at writing the very initial example:
It's far from useful now and nowhere near complete, but I think it shows a much more elegant way. This has been on my radar for the past 3 years, but I never got around to it.
@ekohl are there other things we do in %post that perhaps don't have native Anaconda support? What I like about the proposal here is that the call-home curl happens only after the host reboots. So in Foreman, the host is considered built only when it has really been provisioned successfully.
The rhsm command looks promising; that would actually solve the issue. My question is whether we should check that the installed Anaconda actually supports the rhsm command. Is there a way to do that? Or are we fine with the expectation that users always have the latest version of Anaconda?
I think this is the route to take. Once you have that, you don't need the upgrade at all: it simply fetches the latest version on installation.
I agree, for RHEL 9 this seems to be a perfect fit, but users provisioning Debian / Ubuntu systems might still find the proposed service useful for running things after the reboot.
Also, should this be in the RFC section?
Well, it should be, but the RFC category is under Development, where it (maybe) won't reach the same number of readers as here.
Plus, for the past few RFCs I've been asked to move the posts to the Community section, so this time I just posted it here right away.
I've reached out via email to the Anaconda developer I had contact with back in 2019 about this. I recall that it was RHEL 8.2 or 8.3 that introduced it. Within Foreman we do have the OS major and minor versions and we can do version checks. That's how we also dealt with RHEL 5 vs 6 vs 7 etc.
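The kind of version check described above could look roughly like this. It is only a sketch: the function name supports_rhsm is made up, and the 8.2 threshold is an assumption based on the recollection above ("8.2 or 8.3"), not a confirmed fact. In a real Foreman template the check would more likely be written in the template language against the host's OS major and minor versions; shell is used here just to show the logic.

```shell
#!/bin/sh
# supports_rhsm MAJOR [MINOR]
# Returns 0 when the given RHEL version is assumed to ship an Anaconda
# that understands the rhsm kickstart command, 1 otherwise.
supports_rhsm() {
    major=$1
    minor=${2:-0}
    min_major=8
    min_minor=2   # assumption: unconfirmed, could also be 3

    if [ "$major" -gt "$min_major" ]; then
        return 0
    fi
    if [ "$major" -eq "$min_major" ] && [ "$minor" -ge "$min_minor" ]; then
        return 0
    fi
    return 1
}
```

For example, `supports_rhsm 9 0` and `supports_rhsm 8 2` succeed, while `supports_rhsm 8 1` and `supports_rhsm 7 9` fail, so older releases would keep the existing %post registration path.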
What you describe sounds a lot like the Ansible Tower provisioning callback:
To avoid going off topic I've opened a new post:
After discussing this further with @Marek_Hulan we agreed that %post is too large. However, it isn't obvious where exactly things should go.
I noted that the yum/dnf update part should move into the proper Kickstart sections where possible, as a more native integration. There are probably more candidates.
However, marking the host as built does make sense at first boot. That ensures the network config is actually correct. Or at least, that the host can route to Foreman so it could be fixed with REX/configuration management. But just moving it isn't everything.
@TimoGoebel has in the past suggested adding more steps in between. Today the host status is either building or built, but it would be nice to have intermediate steps, for example if you could find out that the host made it past %post (or the Debian equivalent) and should have rebooted. I don't recall whether this was made into an RFC or otherwise, but it would give users more insight into how far provisioning has progressed.
This will also solve some issues I had to troubleshoot in the past:
A VM with little memory would not survive dnf update, specifically installations without swap.
User-defined post template code which assumes it is executed on a fully booted system, while that's not the case for %post (e.g. firewall-cmd vs firewall-offline-cmd, if I remember the command name correctly).
Slow provisioning, specifically when there are a lot of updates (this can definitely be done after first boot).
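As an illustration of the second point, a template snippet that has to work both inside %post and on a booted system could pick the tool at runtime. This is a sketch under assumptions: the helper name pick_firewall_cmd is made up, and it relies on `firewall-cmd --state` exiting non-zero when firewalld is not running.

```shell
#!/bin/sh
# pick_firewall_cmd
# Prints "firewall-cmd" when firewalld is installed and running,
# otherwise "firewall-offline-cmd" (e.g. inside the installer chroot).
pick_firewall_cmd() {
    if command -v firewall-cmd >/dev/null 2>&1 \
        && firewall-cmd --state >/dev/null 2>&1; then
        echo firewall-cmd
    else
        echo firewall-offline-cmd
    fi
}
```

The two tools do not accept exactly the same options, so the caller would still need to check the respective man pages before passing arguments through.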
This does not feel like a hack. A first-boot script or action is a well-known term in the industry, and we have seen it implemented in many OSes in the past. There is also systemd-firstboot, which performs additional settings. That probably means the service should start after systemd-firstboot, just in case something would otherwise be uninitialized.
There already is support for kicking off an Ansible playbook callback after provisioning. That is what I do for some final configuration and setup, and that is where we do subscription manager registration and make sure the system is updated.
I don't see cloud-init being mentioned as an option. Foreman already has support for a userdata template, and cloud-init can be leveraged to run that on boot, which could be just a bash script.
I suppose there is a bit of overlap between what Foreman does during provisioning and the features of cloud-init. Bit off-topic, I wish there was a solid image-based provisioning flow that covers bare-metal and VMs. That would be awesome.
I am actually working on a small prototype of exactly this and will show it off at DevConf 2023 in Brno. I have an idea of a separate small project dedicated only to (image-based) provisioning and, in some future, maybe a Foreman plugin for it. It is going to be based heavily around Anaconda (installer) and some of its capabilities (MAC HTTP headers, image-based provisioning from tarballs, UEFI HTTP Boot, SecureBoot, EFI HTTPS x509 enrollment), but any contributions for other installers (perhaps some sort of liveimage with a shellscript/python) will be welcome.
Thanks for the recommendation. I have looked into cloud-init, but I am not very familiar with it, so please tell me if I got something wrong.
In the current state, the built callback happens before the machine restarts. This is solved by my PR, which creates a systemd service. I have tested cloud-init by replacing the post section with this piece of code from an existing cloud-init template.
The result is the same as with my solution, but there is one thing I am a bit worried about.
Cloud-init requires systemd. This is a problem, as we still need to support RHEL 6, which does not have systemd. With the systemd service approach, you can work around that by taking the script used for the callback and making it remove itself, move itself to a different location, or rename itself. It can then be added to a crontab so that it runs after reboot. I am not sure how to solve this issue with cloud-init.
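A rough sketch of that crontab fallback, assuming the cron implementation on the host supports `@reboot` entries. The function names and the script path in the usage note are hypothetical; both functions are pure filters over crontab text so they can be piped between `crontab -l` and `crontab -`.

```shell
#!/bin/sh
# add_reboot_entry SCRIPT
# Reads a crontab on stdin and writes it to stdout with exactly one
# "@reboot SCRIPT" line appended (any older line mentioning SCRIPT is dropped).
add_reboot_entry() {
    script=$1
    grep -v -F "$script" || true   # grep exits 1 when nothing is left; ignore
    printf '@reboot %s\n' "$script"
}

# remove_reboot_entry SCRIPT
# Reads a crontab on stdin and writes it to stdout without lines
# mentioning SCRIPT; meant to be called by the script itself when done.
remove_reboot_entry() {
    grep -v -F "$1" || true
}
```

In %post the entry could be installed with `crontab -l 2>/dev/null | add_reboot_entry /root/foreman_first_boot.sh | crontab -`, and the script's last step after a successful callback would be `crontab -l | remove_reboot_entry "$0" | crontab -` so that it does not run again on later reboots.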
I thought there was a version of cloud-init for EL6, but that lived in EPEL which is now retired.
Technically Red Hat cares about RHEL 6. In the upstream community there is less need since most EL6 is now end of life anyway.
One possible solution is to not support this feature on EL6. Depending on how far you want to take it, it's certainly an option to use version conditionals.
After actually trying to implement cloud-init and talking to @lstejska, I am not sure cloud-init is the way to go anymore. It requires another dependency in the form of the cloud-init package. Also, mixing cloud-init into the kickstart template seems like a really bad idea.
If you are familiar with cloud-init, could you draft the kickstart template changes that would allow for cloud-init to be used?
I have some experience with cloud-init: VMware uses cloud-init in their vRealize Automation tool for spinning up clones. I am very much not impressed with it at all. It's overly complex and not very reliable.