Improving discovery workflow

lzap · September 6, 2019, 10:29am

Hey,

When I was researching our Grubby default script template, I noticed that the ability to render and download shell scripts from Foreman is a really powerful feature which could help improve discovery. I would like to present an idea to change discovery to fix a few major pain points:

no ability to configure NICs during provisioning (creating bonds, bridges, VLANs)
complicated way of running custom scripts after boot
unreliable kexec

The proposal:

Discovery would require Foreman TLS server certificate fingerprint to be included on the kernel command line both for PXE and for PXE-less (remaster script would not be optional anymore).
After host is successfully discovered, node would request script from Foreman either of Script kind or newly created custom kind. The shell script would be executed. Users could leverage this to do post-boot initializations. The transport would be strictly HTTPS with fingerprint validation. Default script would be probably no operation - ready for users to be customized.
When a host is provisioned, a different script would be triggered in a similar way right after the host is converted to a managed host. The default script would only include a reboot command. We would ship another template which would do the kexec command as well.
The kexec feature and the kexec template kind would be removed from Foreman completely (the new script template that performs kexec could still be used).
Now, rendering ERB after boot and during provisioning would enable users to actually modify discovered and managed hosts using Ruby if we provided macros and allowed some methods in safemode. It should be possible to create bonds, bridges and VLANs or rename identifiers based on facts, the selected hostgroup etc. This is one of the top-voted issues in Foreman core and across Foreman plugins: Feature #13847: Auto-provisioning custom scripts for NIC configurations - Discovery - Foreman
Foreman must do an unused_ip call during provisioning (e.g. when the Subnet is changed; this is WIP and also a highly requested feature, Bug #16143: Discovered host IP address is not changed to fall within the subnet range - Discovery - Foreman). This must be integrated into the helper responsible for saving NICs/host (it must be called after all subnet changes).
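To make the fingerprint step a bit more concrete, here is a minimal sketch of how a discovered node might read such a value from its kernel command line. The parameter name `fdi.fingerprint` is hypothetical, nothing in the proposal fixes it.

```shell
#!/bin/bash
# Sketch only: "fdi.fingerprint" is a hypothetical parameter name.

# Print the value of KEY found in a kernel command line string.
# Usage: cmdline_value KEY "CMDLINE"; returns 1 if KEY is absent.
cmdline_value() {
  local key=$1 cmdline=$2 arg
  local -a args
  # Split on whitespace without glob expansion.
  read -r -a args <<< "$cmdline"
  for arg in "${args[@]}"; do
    case $arg in
      "$key"=*)
        printf '%s\n' "${arg#*=}"
        return 0
        ;;
    esac
  done
  return 1
}
```

On a booted node this would be called as `cmdline_value fdi.fingerprint "$(cat /proc/cmdline)"`, and the result compared against the certificate presented by Foreman before any script is downloaded.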

upadhyeammit · September 6, 2019, 12:53pm

Discovery would require Foreman TLS server certificate fingerprint to be included on the kernel command line both for PXE and for PXE-less (remaster script would not be optional anymore).

Can we have the foreman-discovery image built with the remaster script by default? Something like running a postscript at rpm install? As we have SSL certs at the default location, I feel this could be possible.

If I understand correctly, in PXE-less mode the system will reboot and grubby would add the kernel, vmlinuz and ks file for the client system; this should indeed trigger the installation?

lzap · September 6, 2019, 7:27pm

The remaster script does not touch the image itself, it only changes the wrapper which is the bootloader and its configuration. Therefore all it can actually change is kernel command line arguments. And that is a very precious space with the maximum of 800 characters - any X.509 certificate will not fit there.
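A rough sketch of the size argument, assuming the 800-character figure above: a SHA-256 digest printed as hex is 64 characters, so a fingerprint fits comfortably where a full certificate does not. Both helper names are made up for illustration.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical; 800 is the limit quoted above.

# Return 0 if STRING fits into LIMIT characters (default 800).
fits_on_cmdline() {
  local string=$1 limit=${2:-800}
  [ "${#string}" -le "$limit" ]
}

# Read data on stdin and print its SHA-256 digest as hex.
sha256_hex() {
  sha256sum | cut -d' ' -f1
}
```

For example, `fits_on_cmdline "$(sha256_hex < cert.pem)"` succeeds, while passing the PEM text of a typical certificate would not.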

lzap · September 6, 2019, 7:29pm

It would wipe out initial sectors of the drive, create a small partition, copy kernel and image there, install grub2 bootloader and configure it to boot it. It is destructive indeed.

aruzicka · September 12, 2019, 2:03pm

Just a wild idea. Would it be possible to have anaconda (or any other installer) present on the discovery image and be able to invoke it on demand? So instead of going

Discovery -> reboot -> installer (anaconda, preseed, autoyast...) -> installation -> reboot -> provisioned OS

we’d go straight to

Discovery -> installer -> installation -> reboot -> provisioned OS

lzap · September 12, 2019, 2:09pm

Well, that’s what Full Host Bootdisk does and it does a great job. It creates an image with the SYSLINUX loader and Anaconda (or any supported OS installer) and that’s it.

There is also a different approach - creating new kind of bootdisk image based on iPXE which would ask for network credentials, register the host in foreman and wait orders via some HTTP fetch loop. Then it would load the Anaconda via HTTP directly. It’s cleaner workflow of what PXE-less discovery does however there is one snag - discovery is Linux based (RHEL/CentOS = all certified hardware is guaranteed to boot or Red Hat fixes it) while iPXE is relatively small project of few people. My only concern is support scope - when iPXE works it does fantastic job.

aruzicka · September 12, 2019, 2:24pm

I assume Full Host Bootdisk creates a, well, boot disk which can be used to provision a single os. Or to put it another way, you can’t have a single Full Host Bootdisk from which you could provision RHELs and Debians. If this is wrong, please disregard the rest of the post.

The difference would be you could choose what os you want to provision on the host after it booted into this special image, without the need for having multiple flavor of the image.

The iPXE way sounds good. I never had any problems with hardware support in iPXE, but I’m not your regular enterprise user

lzap · September 12, 2019, 2:30pm

Bootdisk creates three kinds of images and they are very different. I suggest reading https://github.com/theforeman/foreman_bootdisk

Full host bootdisk is for the workflow you have described.

The difference would be you could choose what os you want to provision on the host after it booted into this special image, without the need for having multiple flavor of the image.

Yes, bootdisk does not do that today. It is technically possible to achieve that with the iPXE scripting language.

Marek_Hulan · September 12, 2019, 4:19pm

Could we put the bootdisk functionality into discovery? If we have FDI running, we may not need to generate the bootdisk in advance. It would also be great to get data on which bootdisk flows are mostly used today. The downside is clear: a big FDI image vs small bootdisks. The upside though is that we reduce the number of different provisioning options while still keeping the use cases. Users do the setup once and it would always be the same; they can pick the appropriate method only when they need to decide.

lzap · September 13, 2019, 5:57am

Do you mean provisioning functionality via discovery without - well - the discovery action? This would be confusing.

But that’s one of the great workflows our users can use - generate in advance, stick it to the server, provision without PXE or DHCP “hands free”. There is similar workflow with discovery but one must remaster discovery image manually per individual host.

We won’t reduce anything for the community, bootdisk will be here as long as someone supports it.

I started this thread with generic ideas on how to improve discovery itself and I don’t mind bringing bootdisk to the table but I feel like if we stick to the goal to improve discovery that would be good start. There will always be overlap and it’s up to users and Foreman distributors to decide which way to prefer the most.

Marek_Hulan · September 13, 2019, 7:48am

good catch I agree and wouldn’t mind renaming discovery to foreman_provisioning or merging it to core

yes, and it would still be possible with customizing FDI, but it would be the same stack. Plus the full host bootdisk use case would be the same as with traditional discovery.

I think if the new discovery would be successful and better, users would move to it, of course it may not happen.

And I’m really sorry to side-track it. I agree, let’s improve the discovery. I was just thinking how high we can aim. I’d prefer to focus more on discovery and make it awesome, at the price of spending less time on bootdisk. If there are more people who will further improve bootdisk, great.

lzap · September 13, 2019, 8:34am

After more thinking about this, here is my updated proposal, which is very much based around SSH and the Remote Execution plugin, effectively giving Discovery a hard dependency on ReX. Major pain points remain almost the same:

no ability to configure NICs during provisioning (creating bonds, bridges, VLANs)
complicated way of running custom scripts after boot or before provisioning
unreliable kexec
poor security (discovery communicates HTTPS - certs ignored - and also anyone on the network can freely perform kexec call which is arbitrary remote execution granted that PXE is also the same)

The proposal:

Discovery requires SSH to be enabled by default and initial (bootstrap) root password to be set via kernel command line.
Initial discovery process remains the same - new discovered host is created from facts using HTTPS (cert ignored) call to Foreman.
The moment a host is discovered, new ReX job is scheduled using the existing bootstrap SSH password to initialize the discovered node. The template would basically contain steps to deploy ReX SSH key, disable password login via SSH and deploy Foreman CA HTTPS certificate.
The discovered host is now fully secure - SSH only via keys, communication towards Foreman verified via HTTPS CA certificate.
When user performs Refresh Facts, new ReX job is scheduled to call facter and upload facts via curl or wget. This functionality can be dropped from Smart Proxy.
When user or Foreman initiates provisioning, new ReX job is scheduled. During rendering, user can use new macros to preconfigure host (customize NIC configurations) and then call macro to perform discovery provisioning (convert host to managed and save). There are several pre-defined templates available:
- Image-based provisioning. Write bootloader and image to the disk and simply reboot.
- Destroy bootloader and reboot. Little bit more tricky one to provide a good alternative to kexec users who struggle with hardware driver issues - an attempt to create bootloader and partition with OS installer and initramdisk and rebooting into it. Granted this requires specific boot order in BIOS/EFI and it might not be for all.
- Legacy kexec. For those who like the current approach (e.g. in virtual environments where kexec works the best), a template that performs what discovery does today.
- Just reboot. For normal PXE provisioning where no special action needs to be done.
Since the PR to do unused_ip call during provisioning is almost merged, this enables additional provisioning options since there is all power of ERB during rendering of ReX templates. For starters, we would only ship very basic templates and it’s up to the users what they will come with - we will accept patches or new macros.
Since there will be no communication towards discovered nodes, Smart Proxy can be dropped from the image completely. Foreman discovery image smart proxy plugin can be also deprecated. Smart proxy discovery http proxy plugin can be simplified as only Discovered Node -> Foreman HTTPS communication will be performed (only one way). Firewall requirements can be simplified, there will be new communication however from Foreman (Proxy) -> Discovered nodes (SSH).
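As an illustration of the Refresh Facts job, a minimal sketch of what such a ReX script could run on the node. The endpoint path and the `{"facts": ...}` payload shape are assumptions to be checked against the Discovery plugin API, and the function names are made up.

```shell
#!/bin/bash
# Sketch only: endpoint path and payload shape are assumptions.

# Wrap a JSON object of facts into the upload payload.
facts_payload() {
  local facts_json=$1
  printf '{"facts":%s}\n' "$facts_json"
}

# Collect facts with facter and POST them to Foreman over verified HTTPS.
# FOREMAN_URL and CA_FILE are supplied by the caller.
upload_facts() {
  local foreman_url=$1 ca_file=$2
  facts_payload "$(facter --json)" |
    curl --fail --silent --show-error \
      --cacert "$ca_file" \
      --header 'Content-Type: application/json' \
      --data @- \
      "${foreman_url}/api/v2/discovered_hosts/facts"
}
```

A job template would then end with something like `upload_facts https://foreman.example.com /path/to/foreman-ca.pem`, where the CA file is the one deployed during the bootstrap step.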

I think this is great improvement, ability to run arbitrary scripts on discovered nodes is something that people asked about and we have provided the hacky solution (base64 encoded script on kernel command line). It also greatly simplifies communication and increases security which is probably less relevant for PXE environments but more relevant for PXE-less environments. The cost is having discovery a dependency on Remote Execution which probably can delay our plans to merge Discovery into core.

Marek_Hulan · September 13, 2019, 11:15am

I like this a lot. Few notes and comments:

REX can now even survive host reboot and continue execution after it, so e.g. it should be possible to use it for further customization after reboot if needed
it shouldn’t be much work to enable SSH executions against Host::Discovered; today I think it only supports Host::Managed. Though if we agree we want to improve discovery or even merge it to core, perhaps that can be done first and then we’d finally have only one type of host and REX works out of the box.
the SSH bootstrap seems a bit complicated, why does it not download the public key from Foreman (through SSL) as part of the boot? the same way we deploy the key during provisioning
given that REX always connects through the proxy to the target, this would still work fine even if Foreman can’t talk to the host directly. I’m now only confused why you think we don’t need the same thing from the other side. Do I understand it correctly that you want FDI to talk to the Foreman API directly? Traditionally the puppet-enabled proxy is needed here so that the puppet agent reports facts through it. We can talk to the Foreman API directly, but I think the communication should still be able to flow through the proxy, like we do with the proxy templates feature.

lzap · September 16, 2019, 8:28am

Not sure, if a host is rebooted then SSH keys are lost and it needs to be bootstrapped again. Granted, rediscovery should work just fine as today.

Yes, I considered that as well. It probably needs new template kind because we can’t use finish script (it “calls home”) and we need also to customize it deploying the CA cert. It looks more secure tho, MitM attack is still possible but it’s better than opening up SSH for a short period of time.

Not directly, via the proxy-proxy plugin. We can only simplify it (only one way, one kind of message).

lzap · September 17, 2019, 11:47am

I agree with Marek that the bootstrap phase was a bit complicated, so I am changing my proposal. Major pain-points remain the same:

unreliable kexec needs to be replaced with a new workflow
complicated way of running custom scripts after boot or before provisioning
no ability to configure NICs during provisioning (creating bonds, bridges, VLANs)
poor security (discovery communicates HTTPS - certs ignored - and also anyone on the network can freely perform kexec call which is arbitrary remote execution granted that PXE is also the same)

The proposal:

Discovery boots up with SSH turned off.
Before host is discovered, X509 CA certificate chain is downloaded from Foreman as well as Remote Execution SSH keys.
SSH service is started with key-only authentication.
Initial discovery process remains the same - new discovered host is created from facts using HTTPS call to Foreman.
The moment a host is discovered, new ReX job is scheduled to run custom script on discovered node. By default the script will be empty and fully customizable by users.
When user performs Refresh Facts, new ReX job is scheduled to call facter and upload facts via curl or wget. This functionality replaces the Smart Proxy plugin which can be deprecated.
When user or Foreman initiates provisioning, new ReX job is scheduled. During rendering, user can use new macros to preconfigure host (customize NIC configurations) and then call macro to perform discovery provisioning (convert host to managed and save). There are several pre-defined templates available:
- Image-based provisioning. Write bootloader and image to the disk and simply reboot. Host is created without entering Build mode. This will be the main workflow recommended for existing kexec users.
- Destroy bootloader and reboot. Little bit more tricky one to provide another alternative to kexec users who struggle with hardware driver issues - an attempt to create bootloader and partition with OS installer and initramdisk and rebooting into it. Granted this requires specific boot order in BIOS/EFI and it might not be for all.
- Legacy kexec. For those who like the current approach (e.g. in virtual environments where kexec works the best), a template that performs what discovery does today.
- Just reboot. For normal PXE provisioning where no special action needs to be done.
Since there will be no communication towards discovered nodes, Smart Proxy can be dropped from the image completely. Smart proxy discovery http proxy plugin can be simplified as only Discovered Node -> Foreman HTTPS communication will be performed (only one way). Firewall requirements can be simplified, there will be new communication however from Foreman (Proxy) -> Discovered nodes (SSH) tho.
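The key-only SSH step could look roughly like the sketch below. It assumes the ReX public key has already been downloaded over verified HTTPS; the function names are made up, and how the config fragment gets included in the image's sshd configuration is left open.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, paths are parameters.

# Install PUBKEY as the only authorized key in SSH_DIR (e.g. /root/.ssh).
install_authorized_key() {
  local ssh_dir=$1 pubkey=$2
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  printf '%s\n' "$pubkey" > "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
}

# Write key-only authentication settings to the file CONFIG.
write_sshd_key_only() {
  local config=$1
  cat > "$config" <<'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
EOF
}
```

After both calls the image would start sshd (for example with `systemctl start sshd`), so the service never listens with password login enabled.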

As part of this effort, I would look into other (optional) tasks:

upgrade FDI to facter 3.x
writing a custom fact to report NIC relationships (bond, bridge - master + slaves)
upgrading to CentOS 8 (new building tool “lorax”, hopefully smaller image)
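For the NIC relationship fact, a minimal sketch of one possible approach: on Linux an enslaved interface has a `master` symlink in sysfs, which can be turned into key=value lines in the style of facter external facts. The `nic_master_` prefix and the function name are hypothetical.

```shell
#!/bin/bash
# Sketch only: the fact name prefix "nic_master_" is hypothetical.

# Print one "nic_master_<slave>=<master>" line per enslaved interface.
# SYSFS_NET defaults to /sys/class/net; it is a parameter to ease testing.
nic_master_facts() {
  local sysfs_net=${1:-/sys/class/net} dev link
  for dev in "$sysfs_net"/*; do
    link=$dev/master
    # Only bond or bridge members carry a "master" symlink.
    [ -L "$link" ] || continue
    printf 'nic_master_%s=%s\n' \
      "$(basename "$dev")" \
      "$(basename "$(readlink "$link")")"
  done
}
```

Dropped into an executable file in the external facts directory with a final `nic_master_facts` call, this would report for instance `nic_master_eth0=bond0` for a bond member.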

Tell me what you think, I’d be very interested in @TimoGoebel 's opinion.

ekohl · September 17, 2019, 1:46pm

Overall I think it’s a nice design and it makes sense to me. I do have some details I’m wondering about:

Can we pass a fingerprint via the command line so the CA cert is verifiable?

An alternative could be DANE, but that relies on DNS and without DNSSEC there’s no integrity there. It also moves the problem elsewhere since that still needs a root to be signed somewhere. In short, that’s probably not a realistic alternative.

The SSH keys are in the facts. Can REX use them to ensure SSH logins work securely? Overall that may be a nice REX feature.

One thing to consider is that currently we have permissions that only allow some hosts to create a new host by uploading facts. This is to prevent abuse. With discovery I don’t see an obvious way to limit this. Perhaps this is already an issue in the current discovery design and a potential attack vector for a DoS by creating millions of hosts.

If the script is empty, is it needed to run? I can imagine it’s a good verification of functionality. In that case I’d have a default implementation that’s echo REX functionality verified or similar to make it clear there is a use.

lzap · September 20, 2019, 8:29am

I was thinking either this or a SHA of the whole X509 chain, because a booted discovery node is completely empty (we need both the server cert and the CA). I am not an expert, is the fingerprint of the CA (or server cert) enough to ensure the integrity of the whole chain?

There are no keys on the system, it’s a clean install. But we could generate those, however I aim to improve discovery rather than ReX.

Yes, discovery on PXE is a real security issue. PXE is by design remote execution, then you need to allow untrusted hosts to do updates if you ever want to allow this workflow. I have no other ideas than “don’t use it on untrusted networks”.

We will be definitely able to put something in it, it’s just so many users asked “can I run XYZ after boot” and so far it is very painful to run own code after boot (ZIP extension, remaster with embedded script, rebuild FDI). There are three terribly complicated ways of doing the same, this will enable us to drop them all! Yay.

Thanks for taking a look. Appreciated.

ekohl · September 20, 2019, 9:28am

Not an expert either, but I think that passing the fingerprint of the server cert should be sufficient.

Currently REX implements Trust On First Use. That means after a host reinstall you will need to clear out the keys. Every reboot of the discovery image will have new keys as well. This is a problem that needs to be solved somehow.

We can store SSH keys in Foreman and via facts we already do. This could be extended into a proper storage with the appropriate fact parser. When that’s exposed via the API, REX could easily query this avoiding the need for TOFU when the key is available.

The discovery image will generate SSH keys and if you make the reporting script start after sshd has started, it should have them available. That means you have a secure channel from the discovery image all the way to the proxy running REX.

You’re right this is complicated. Initially I thought you might be able to check with the DHCP server but in DHCP-less environments this breaks. The only thing I can think of is to always register via a smart proxy so users firewall the smart proxy with the discovery feature to the local subnet. Then the smart proxy can still identify itself via the certificate and on the Foreman side we can check that is indeed a smart proxy with the right feature enabled. We do the same thing for Puppet where we check if the proxy is allowed to send facts.

TimoGoebel · September 20, 2019, 9:36am

If there is a secure way to pass the fingerprint (which should use a secure hash algorithm) of the CA cert, that should be enough. The discovery image can then download the CA certificate from Foreman (e.g. via an HTTP endpoint), verify that it matches the hash and install it in the image’s truststore. On EL, that would be

place contents in /etc/pki/ca-trust/source/anchors/my-cert.pem
run /bin/update-ca-trust

After that step is done, we can basically trust all data from Foreman via HTTPS. We don’t need the whole chain for that. The root certificate is enough.
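The verify-then-install step could be sketched as follows. To keep it simple, the hash is assumed to be the SHA-256 of the PEM file exactly as served, not the usual fingerprint over the DER encoding; the function names and the `foreman-ca.pem` file name are made up.

```shell
#!/bin/bash
# Sketch only: the hash is assumed to cover the PEM file as downloaded.

# Return 0 if the SHA-256 of FILE equals EXPECTED (hex, case-insensitive).
verify_sha256() {
  local file=$1 expected=$2 actual
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  [ "$actual" = "${expected,,}" ]
}

# Copy CA_FILE into ANCHORS_DIR only if its hash matches EXPECTED.
install_verified_ca() {
  local ca_file=$1 expected=$2 anchors_dir=$3
  verify_sha256 "$ca_file" "$expected" || {
    echo "CA certificate hash mismatch, refusing to install" >&2
    return 1
  }
  cp "$ca_file" "$anchors_dir/foreman-ca.pem"
}
```

On EL the anchors directory would be /etc/pki/ca-trust/source/anchors, and a successful `install_verified_ca` would be followed by update-ca-trust as in the two steps above.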

I would not download the keys to the image but rather use sshd’s AuthorizedKeysCommand to run a script that fetches (temporary) keys from Foreman and adds audit logging.
So the image basically does a callback to Foreman once somebody tries to login and Foreman can either return (temporary) REX ssh keys or the authorized keys for all Foreman admin users (note we already have a ssh key model) for debugging purposes.

AuthorizedKeysCommand /usr/local/bin/fetch_authorized_keys_for_user.sh
AuthorizedKeysCommandUser nobody

#!/bin/bash
# Retrieves SSH public keys for a user from Foreman.

SSH_USER=$1

logger -t sshd -p info "Fetching SSH public key for user ${SSH_USER} from Foreman"

# we can pass more data about the system here
PUBKEY=$(curl --capath /etc/pki/tls/certs/ --silent "https://foreman.example.com/discovery/ssh_keys/${SSH_USER}")

echo "${PUBKEY}"

This way login authorization is securely delegated to Foreman.

All in all, the proposal looks good. Image based provisioning looks interesting but we need a way to actually generate the image. Can we write a bootloader to disk that starts a normal network based (anaconda) install and just reboot the host?

lzap · September 20, 2019, 2:46pm

I am not sure I am following here. Why would we need to store any SSH keys? In my proposal the only connection is Foreman -> Discovered node and for this we only need to get Foreman’s public key to the discovered node. SSH keys on the discovered node are irrelevant.

However what is an issue is fingerprint which will always be different when a discovered node reboots. Facter should report that so ReX can have that information ready when making SSH calls to verify the identity of the remote side.

This is nice, every login attempt would be checked against Foreman over HTTPS securely. This is probably something we should implement in ReX so users can benefit from it. Not sure how useful this workflow is, but it is interesting.

Yes, it’s in my proposal, bullet named “Destroy bootloader and reboot” is exactly that.