RFC - Systemd first boot service for host provisioning

lstejska · August 11, 2022, 11:31am

Hi,
Right now we are having an issue with provisioning RHEL 9 machines, where provisioned hosts go into emergency mode after initial reboot.

The problem is caused by running the package upgrade in the %post section of the Kickstart template. Luckily we have a temporary workaround: disable the package upgrade there and run it after the host has rebooted and been provisioned.

This leads us to this RFC, where I would like to propose adding a new one-shot systemd service, foreman_first_boot, that would run on the first boot of the machine and execute a user-defined template.

The service template could contain package upgrades and some of the other things that we currently do in the %post section.

SystemD adoption (source)

OS	Version	Year
CentOS	7.0	2014
RHEL	7.0	2014
Debian	8.0	2015
Ubuntu	15.04	2015

Service example

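[Unit]
Description=foreman_first_boot
# Dependency, ordering and condition directives belong in [Unit];
# systemd ignores them when they are placed in [Service].
Wants=basic.target
After=basic.target network-online.target nss-lookup.target
# Skip the service once the first boot script has left its marker file.
ConditionPathExists=!/etc/foreman/foreman_first_boot_done

[Service]
Type=oneshot
ExecStart=/etc/foreman/foreman_first_boot.sh
RemainAfterExit=false

[Install]
# Assumed addition: lets "systemctl enable" start the service at boot.
WantedBy=multi-user.target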
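
The script behind ExecStart= is not shown in this thread. A minimal sketch of how it could cooperate with the ConditionPathExists= check, assuming the script itself creates the marker file and only does so after the user-defined work succeeded (the function name and its arguments are hypothetical):

```shell
#!/bin/sh
# Hypothetical sketch of /etc/foreman/foreman_first_boot.sh.
# Only the marker path comes from the unit above; the rest is assumed.

# run_first_boot MARKER COMMAND [ARGS...]
# Runs COMMAND once and records success by creating MARKER, so that
# ConditionPathExists=!MARKER skips the service on later boots.
run_first_boot() {
    marker=$1
    shift
    # Already completed on an earlier boot: nothing to do.
    if [ -e "$marker" ]; then
        return 0
    fi
    # Run the user-defined work. On failure no marker is written,
    # so the next boot retries.
    "$@" || return 1
    mkdir -p "$(dirname "$marker")"
    : > "$marker"
}
```

The real script would end with a call such as `run_first_boot /etc/foreman/foreman_first_boot_done dnf -y upgrade`. Writing the marker last means a failed upgrade is retried on the next boot rather than silently skipped.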

ekohl · August 11, 2022, 11:56am

My first impression is that we’re building more hacks. Instead I’d prefer a completely different approach.

I think we should avoid using %post as much as possible. It was always a hack and we can blame RHEL subscriptions for most of it.

The backstory is that when you provision RHEL you need a subscription. However, there was no way to get that subscription enabled in the regular install. So there’s a hack in %post. Then you can install additional packages instead of using %packages. This was never needed for Fedora and CentOS, but for compatibility we used a single way.

The good news is that the support has landed in Anaconda:

github.com/rhinstaller/anaconda

Add Satellite registration support

rhinstaller:rhel-9 ← M4rtinK:rhel-9-satellite_support

opened 08:51PM - 01 Jul 21 UTC

M4rtinK

+2046 -756

Add support to Anaconda to register not only to Red Hat run subscription infrastructure (usually called Hosted Candlepin) but also to custom Red Hat Satellite instances run by customers. This is implemented by a couple of new tasks and a new library module called `satellite`. To register to a Satellite instance instead of to Hosted Candlepin, the user sets the Satellite URL in the GUI or the `--server-hostname` option of the `rhsm` kickstart command. Then at registration time Anaconda will try to fetch and execute a Satellite provisioning script from this URL. If this is successful, the registration process will continue but the machine will end up talking to the Satellite instance instead of to Hosted Candlepin. Also at installation time, execute the provisioning script on the target system after installation, so that the installed system is also properly provisioned for the given Satellite instance.

**NOTE:** Most Satellite instances use self-signed certificates, and the chain of trust containing these certificates is established only after the provisioning script has been run. Because of this, it is recommended to only ever register to Satellite via a trusted network where an attacker can't replace the provisioning script in transit, as it by necessity has to be transferred before a secure SSL connection is established (before the script is run, no chain of trust to the Satellite instance exists).

**TODO**
- [x] check registration to Satellite from kickstart works as well
- [x] check the system can still work with Satellite and its repositories after installation
- [x] check Satellite provided repos can be used as installation source (needs content added to Satellite test instance)
- [x] check why install time token transfer fails (likely caused by missing entitlements on Satellite test instance)
- [x] rename "Red Hat CDN" to "Satellite" when system is registered to Satellite, roll back to "Red Hat CDN" if unregistered at installation time
- [x] ~find why source spoke status is not updated after un-registration from Satellite~ actually a race condition in payload thread, see details in a comment below
- [x] adjust and fixup unit tests to cover Satellite functionality
- [x] resolve and remove all `#FIXME` items in the code

So things will improve a lot if we start using those macros when they’re available.

I think this is the route to take. Once you have that, you don’t need upgrade at all: it simply fetches the latest version on installation.

See also “Performing an advanced RHEL 8 installation” (Red Hat Enterprise Linux 8, Red Hat Customer Portal).

Also, should this be in the RFC section?

ekohl · August 11, 2022, 4:32pm

I took a stab at writing the very initial example:

github.com/theforeman/foreman

Demonstrate native RHSM registration support

theforeman:develop ← ekohl:demonstrate-rhsm

opened 04:21PM - 11 Aug 22 UTC

ekohl

+53 -0

The goal of this PR now is to show an alternative solution to https://community.theforeman.org/t/rfc-systemd-first-boot-service-for-host-provisioning/29892. This uses the native support in Anaconda to configure RHSM. Having RHSM configured allows using all repositories in %packages, which avoids the need to do everything in %post. That makes the progress bar accurate and also avoids the need to run dnf update in %post. That also speeds up provisioning since you're not installing certain packages twice. This brings RHEL provisioning much closer to kickstarting Fedora or CentOS. There are still many TODOs, like setting the correct server for Candlepin and Pulp in case of Katello. It also hasn't been tested.

[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/performing_an_advanced_rhel_8_installation/index#register-and-install-from-cdn-using-kickstart_register-and-install-from-cdn

It’s far from useful now and nowhere near complete, but I think it shows a much more elegant way. This has been on my radar for the past 3 years, but I never got around to it.
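
For readers who have not seen the rhsm kickstart command, here is a rough sketch of the kind of lines such a template would emit. It is not taken from the PR. It is written as a shell function that prints the kickstart fragment so the values are easy to substitute; in Foreman this would be an ERB template instead. The function name, the argument order and the `@core` package group are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: prints a kickstart fragment that registers the
# system during installation, so %packages can use the subscribed repos.

# render_rhsm_kickstart ORGANIZATION ACTIVATION_KEY SERVER_HOSTNAME
render_rhsm_kickstart() {
    cat <<EOF
rhsm --organization="$1" --activation-key="$2" --server-hostname="$3"

%packages
@core
%end
EOF
}
```

Calling `render_rhsm_kickstart Example_Org my-key satellite.example.com` prints the fragment. Because registration happens before package installation, no separate upgrade step is needed in %post.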

Marek_Hulan · August 15, 2022, 8:03am

@ekohl are there other things we do in %post that perhaps don’t have native Anaconda support? What I like about the proposal here is that the call-home curl happens only after the host reboots. So in Foreman, the host is considered built only once it has really been provisioned successfully.
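
A minimal sketch of such a call-home step, as it might run from a first boot script. It assumes plain curl with a fixed number of retries; the function name, the default of 10 tries and the delay are hypothetical, and the URL would be filled in by the template.

```shell
#!/bin/sh
# Hypothetical call-home helper for a first boot script.

# call_home URL [TRIES] [DELAY_SECONDS]
# Returns 0 as soon as one request succeeds, 1 if all attempts fail.
call_home() {
    url=$1
    tries=${2:-10}
    delay=${3:-5}
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        # --fail makes curl exit non-zero on HTTP error responses.
        if curl --silent --fail --output /dev/null "$url"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Wait before the next attempt, but not after the last one.
        if [ "$attempt" -le "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

Retrying matters here because the network may not be fully up when the script starts, even with ordering after network-online.target.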

lstejska · August 15, 2022, 9:04am

The rhsm command looks promising; it would actually solve the issue. My question is whether we should somehow check that the installed Anaconda actually supports the rhsm command. Is there a way to do that? Or are we fine with expecting users to always have the latest version of Anaconda?

ekohl wrote: “I think this is the route to take. Once you have that, you don’t need upgrade at all: it simply fetches the latest version on installation.”

I agree. For RHEL 9 it seems to be a perfect fit, but users provisioning Debian / Ubuntu systems might still find the proposed service useful for doing things after the reboot.

ekohl wrote: “Also, should this be in the RFC section?”

Well, it should be, but the RFC category is under Development, where it (maybe) won’t reach the same number of readers as here.
Plus, for the past few RFCs I’ve been asked to move the posts to the Community section, so this time I just posted it here right away.

ekohl · August 15, 2022, 9:34am

I’ve reached out via email to the Anaconda developer I had contact with back in 2019 about this. I recall that it was RHEL 8.2 or 8.3 that introduced it. Within Foreman we do have the OS major and minor versions and we can do version checks. That’s how we also dealt with RHEL 5 vs 6 vs 7 etc.

What you describe sounds a lot like the Ansible Tower provisioning callback:

github.com

theforeman/foreman/blob/develop/app/views/unattended/provisioning_templates/snippet/ansible_provisioning_callback.erb

<%#
kind: snippet
name: ansible_provisioning_callback
model: ProvisioningTemplate
snippet: true
description: |
  Setups the one time run of the Ansible Tower / AWX callback script on a host.
  Supports only RPM based distributions. It is only used if the host param "ansible_tower_provisioning"
  is set to true. The actual callback script content is stored in the "ansible_tower_callback_script"
  snippet.

  See https://docs.theforeman.org/nightly/Configuring_Ansible/index-foreman-el.html#provisioning-a-callback-for-a-host_ansible for more details
-%>
<% if host_param_true?('ansible_tower_provisioning') -%>
<%
  rhel_compatible = @host.operatingsystem.family == 'Redhat' && @host.operatingsystem.name != 'Fedora'
  os_major = @host.operatingsystem.major.to_i
  has_systemd = (@host.operatingsystem.name == 'Fedora' && os_major >= 20) || (rhel_compatible && os_major >= 7) || (@host.operatingsystem.name == 'Ubuntu' && os_major >= 15) || (@host.operatingsystem.name == 'Debian' && os_major >= 8)
-%>
<% if has_systemd -%>

This file has been truncated.

To avoid going off topic I’ve opened a new post:

ekohl · August 16, 2022, 10:47am

After discussing this further with @Marek_Hulan we agreed that %post is too large. However, it isn’t obvious where exactly things should go.

I noted that the yum/dnf update part should move into the proper sections (when possible), i.e. use more native integration. There are probably more such parts.

However, marking as built does make sense at first boot. That ensures the network config is actually correct. Or at least, it can route to Foreman so it could be fixed with REX/configuration management. But just moving it isn’t everything.

@TimoGoebel has in the past suggested adding more steps in between. Today the host status is building or built, but it would be nice to have steps in between, for example if you could somehow find out that the host made it past %post (or the Debian equivalent) and should have rebooted. I don’t recall if this was made into an RFC or otherwise, but it would give users more insight into how far provisioning has progressed.

So in summary:

%post is too large
Some things should move to steps before %post
Some things should move beyond %post
More fine grained statuses would be nice

lzap · August 16, 2022, 10:54am

This will also solve some issues I had to troubleshoot in the past:

A VM with little memory would not survive dnf update, specifically in installations without swap.
User-defined post template code which assumes that it is executed on a fully booted system, while that’s not the case for %post (e.g. firewall-cmd vs firewall-offline-cmd, if I remember the command correctly).
Slow provisioning, specifically when there are a lot of updates (this can definitely be done after first boot).

This does not feel like a hack. A firstboot script or action is a well-known term in the industry, and we have seen it implemented in many OSes in the past. There is also the systemd-firstboot software, which performs additional settings. That probably means the service should start after systemd-firstboot, just in case something would otherwise be uninitialized.
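
The firewall-cmd vs firewall-offline-cmd pitfall mentioned in the list above can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes that `firewall-cmd --state` exits non-zero when firewalld is not running, which is the situation inside %post.

```shell
#!/bin/sh
# Hypothetical helper that works both in %post and on a booted system.

# add_firewall_service SERVICE
add_firewall_service() {
    service=$1
    if firewall-cmd --state >/dev/null 2>&1; then
        # firewalld is running (fully booted system): talk to the daemon.
        firewall-cmd --permanent --add-service="$service"
    else
        # No running daemon (e.g. installer chroot): edit the config offline.
        firewall-offline-cmd --add-service="$service"
    fi
}
```

Moving user-defined code into a first boot service removes the need for this kind of branching, since the code then always runs on a fully booted system.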

lstejska · August 17, 2022, 12:04pm

Thanks everybody for the comments and insights. I created a Redmine tracker for the changes. In summary:

Fix RHEL 9 issue with rhsm command
Implement first boot SystemD service and cleanup %post section (and Debian equivalent)
Implement new host statuses that better reflect a host’s provisioning progress

Dyrkon · April 6, 2023, 9:21am

I have created a PR implementing some of the suggestions mentioned in this thread. Feel free to check it out and give me feedback.

github.com/theforeman/foreman

Fixes #35378 - Add systemd first boot service for host provisioning

theforeman:develop ← Dyrkon:mm/first_boot_service

opened 10:34AM - 04 Apr 23 UTC

Dyrkon

+79 -4

This PR implements some improvements suggested in [RFC - Systemd first boot service for host provisioning](https://community.theforeman.org/t/rfc-systemd-first-boot-service-for-host-provisioning/29892). The main one is reducing the `%post` section and moving some of it to a service which is going to run after the first reboot of the machine. This should also ensure that the callback to Foreman indicating that the build is done is only made from a machine that successfully rebooted and is ready to use. This solution was successfully tested on CentOS 7, Stream 8 and Stream 9 with libvirt.

nixfu · April 20, 2023, 5:47pm

There already is support for kicking off an ansible playbook callback after provisioning. That is what I do for some final configuration and setup and that is where we do subscription manager registration and make sure the system is updated.

lstejska · April 24, 2023, 9:49am

How about users without Ansible? Not everybody has it.

rgp · April 24, 2023, 7:51pm

I don’t see cloud-init being mentioned as an option. Foreman already has support for userdata templates, and cloud-init can be leveraged to run one on boot; it could be just a bash script.

I suppose there is a bit of overlap with what foreman does during provisioning and the features of cloud-init. Bit off-topic, I wish there was a solid image based provisioning flow that covers bare-metal and VMs. That would be awesome.

lzap · April 26, 2023, 1:51pm

I am actually working on a small prototype of exactly this and will show it off at DevConf 2023 in Brno. I have an idea of a separate small project dedicated only to (image-based) provisioning and, at some point in the future, maybe a Foreman plugin for it. It is going to be based heavily around Anaconda (installer) and some of its capabilities (MAC HTTP headers, image-based provisioning from tarballs, UEFI HTTP Boot, SecureBoot, EFI HTTPS x509 enrollment), but any contributions for other installers (perhaps some sort of liveimage with a shellscript/python) will be welcome.

Dyrkon · July 10, 2023, 12:55pm

Thanks for the recommendation. I have looked into cloud-init, but I am not very familiar with it, so please tell me if I got something wrong.

In the current state, the built callback happens before the machine restarts. This is solved by my PR, which creates a systemd service. I have tested cloud-init by replacing the post section with this piece of code from an existing cloud-init template.

phone_home:
  url: <%= foreman_url('built') %>
  post: []
  tries: 10

The result is the same as with my solution, but there is one thing I am a bit worried about.

Cloud-init requires systemd. This is a problem as we still need to support RHEL 6, which does not have systemd. With the systemd service approach, you can solve that by taking the script used for the callback and making it remove itself, move itself to a different location, or rename itself. It can then be added to a crontab entry that runs after reboot. I am not sure how to solve this issue with cloud-init.
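
The crontab fallback for hosts without systemd could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and it relies on the cron daemon supporting `@reboot` entries in a cron.d style file (which carries a user field).

```shell
#!/bin/sh
# Hypothetical cron-based fallback for hosts without systemd.

# install_reboot_job CRON_FILE SCRIPT
# Written during installation: schedules SCRIPT for the next boot.
install_reboot_job() {
    # cron.d entries have a user field; @reboot fires when cron starts at boot.
    printf '@reboot root %s\n' "$2" > "$1"
}

# remove_reboot_job CRON_FILE SCRIPT
# Called at the end of SCRIPT: removes the job and the script itself,
# so the work runs only once.
remove_reboot_job() {
    rm -f "$1" "$2"
}
```

During installation this would be called as, for example, `install_reboot_job /etc/cron.d/foreman_first_boot /etc/foreman/foreman_first_boot.sh`, where the cron file name is an assumption.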

ekohl · July 10, 2023, 6:56pm

I thought there was a version of cloud-init for EL6, but that lived in EPEL which is now retired.

Technically Red Hat cares about RHEL 6. In the upstream community there is less need since most EL6 is now end of life anyway.

One possible solution is to not support this feature on EL6. Depending on how far you want to take it, it’s certainly an option to use version conditionals.

Dyrkon · July 13, 2023, 8:18am

After actually trying to implement cloud-init and talking to @lstejska, I am not sure cloud-init is the way anymore. It requires another dependency in the form of the cloud-init package. Also mixing cloud-init in the kickstart template seems like a really bad idea.

If you are familiar with cloud-init, could you draft the kickstart template changes that would allow for cloud-init to be used?

nixfu · July 24, 2023, 1:10pm

I have some experience with cloud-init. VMware uses cloud-init in their vRealize Automation tool for spinning up clones. I am not impressed with it at all. It’s overly complex and not very reliable.