PXE-less Discovery Issues

metalcated · January 26, 2018, 1:43am

Hi all,

I’m having an issue with Foreman PXE-less Discovery. It works for RHEL 6 and 7 x86_64, but for some reason RHEL 5.11 does not. On top of that, I can’t see any debug messages from either the Foreman server or the VM that I boot from the ISO. Once /power/kexec/ kicks off and the sudo command is sent to the VM, console navigation and SSH sessions freeze (on all RHEL versions), so I am unable to troubleshoot. So there are two issues here: 1. RHEL 5.11 is not building out. 2. No visible progress after /power/kexec.

I expected RHEL 5.11 to behave the same as RHEL 6, but it does not. Progress should be shown, but it is not.

Here’s the debugging I’ve done so far: I have tried changing the provisioning template, the disk layout, etc. Nothing seems to make a difference.

Thanks!
metalcated

lzap · January 29, 2018, 8:39am

Hello, unfortunately you are on your own here. We don’t test PXE-less provisioning for RHEL 5. As you said, “it should work”, but so far I’ve seen the very same misbehavior and I simply don’t have time to investigate.

I recommend booting RHEL 7, installing “kexec”, downloading the RHEL 5.11 kernel, and then trying to kexec it on the same hardware/VM to investigate it that way. If you find a bug, file a RHEL 7 bug for the Red Hat RHEL team to investigate. But I think kexec has very limited support in RHEL.
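
A minimal sketch of the manual test described above, assuming the RHEL 5.11 kernel and initrd have already been downloaded to the RHEL 7 machine. The file paths and the command line are placeholders, not values from this thread, and on RHEL the kexec binary comes from the kexec-tools package. The helpers only print the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch only: print the kexec commands for a manual test.
# All paths and the command line passed in are placeholders (assumptions).

# Build the "load" command for a kernel ($1), initrd ($2) and command line ($3).
kexec_load_cmd() {
    printf 'kexec --debug --load %s --initrd=%s --command-line="%s"\n' "$1" "$2" "$3"
}

# Print both steps: load the new kernel, then jump into it.
kexec_plan() {
    kexec_load_cmd "$1" "$2" "$3"
    echo "kexec --exec"
}

# Example with hypothetical file names.
kexec_plan /root/el5/vmlinuz /root/el5/initrd.img "nomodeset"
```

Note that `kexec --exec` jumps into the loaded kernel immediately, without a clean shutdown, so this is something to try on a test VM only.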

metalcated · January 29, 2018, 2:37pm

Ok understood, thank you.

What about the lack of progress once kexec kicks off? If there is any way on 1.15 (current update) to see what is going on in the background, that would help me troubleshoot.

lzap · January 31, 2018, 2:48pm

If the console is “frozen” for the whole provisioning process, make sure to use the latest FDI image. There were a couple of fixes around this, specifically blacklisting KMS drivers and switching to tty2 before kexecing (the newt TUI blocks the terminal). Then you should see Anaconda just fine.

The latest stable FDI, 3.4.4, will work with Foreman 1.15.

metalcated · January 31, 2018, 7:24pm

That is exactly what I am using, and I don’t see anything happen on the target VM side. It sits at the Discovery Status screen until it reboots.

lzap · February 1, 2018, 8:17am

Until it reboots into the provisioned system? Which image version? Which hypervisor version? And what is the kernel command line (kexec template)? Compare your kexec template to what we have in the discovery git.

metalcated · February 1, 2018, 4:24pm

Until the VM reboots to a fully provisioned system, correct.

Image version: 3.4.4
Hypervisor version: 6.0 and 6.5 (both have the same result)
kexec template: https://github.com/theforeman/foreman_discovery/blob/develop/app/views/foreman_discovery/redhat_kexec.erb (this is exactly what I have on the Foreman server I am testing with)
I also confirmed this is identical: https://github.com/theforeman/community-templates/blob/develop/provisioning_templates/snippet/pxelinux_discovery.erb

Thanks for your help on this so far, I do appreciate it.

lzap · February 2, 2018, 9:38am

Damn, I was fixing this bug the other day and could no longer reproduce it. We can work together on identifying the issue. We’ve seen this misbehavior on libvirt/oVirt, and switching to a different graphics driver helped.

First and foremost - do you see the “frozen console” misbehavior only for EL5 or for all EL versions?
Can you double-check you have “nomodeset” on the kernel command line in kexec?
Can you double-check you have “nomodeset” on the kernel command line for discovery? That is a different command line; it’s the one rendered in pxelinux.cfg/default (Global PXELinux Template in Foreman).
When discovery shows up, can you verify the text mode is 80x25?
Which graphics card driver do you use for this VM?
Can you switch to a different one if your VMware offers one?
Can you boot with fdi.rootpw=password fdi.ssh=1 and, once the host is discovered, do the following?
- Connect there via ssh.
- Provide me the output of “lsmod”, please.
- Executing “discovery-debug” won’t hurt.
- Then shut down the menu with “systemctl stop discovery-menu” prior to kexecing.
- Try kexec once again.
- If all fails, the last test is to find your kernel/initramdisk and then run the kexec command via ssh with the --debug option, the same way discovery calls it. You will find the options if you preview the kexec template for the host. If you do this via ssh, you will be left with lots of debugging info before the ssh connection drops/stalls. Pastebin that too, although this is more for CentOS/RHEL engineering than me d=)
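
The “nomodeset” checks above can be scripted once you are logged in over ssh. This is a small sketch, not something shipped with the discovery image: `has_kernel_param` is a hypothetical helper that looks for an exact word in a command line string, shown here against /proc/cmdline of the running kernel.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with FDI): succeed if the kernel
# command line in $2 contains the exact parameter $1.
has_kernel_param() {
    # Normalize tabs/newlines to spaces, then match whole words only.
    line=$(printf '%s' "$2" | tr '\t\n' '  ')
    case " $line " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Check the running kernel (works on any Linux with /proc mounted).
if [ -r /proc/cmdline ]; then
    if has_kernel_param nomodeset "$(cat /proc/cmdline)"; then
        echo "nomodeset is present"
    else
        echo "nomodeset is missing"
    fi
fi
```

The same function can be pointed at the command line taken from a previewed kexec template by passing that string as the second argument.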

metalcated · February 3, 2018, 1:27am

First and foremost - do you see the “frozen console” misbehavior only for EL5 or for all EL versions?
- This happens on all versions of RHEL/CentOS: EL5, 6 and 7.
Can you double-check you have “nomodeset” on the kernel command line in kexec?
- Yes, the kernel parameter is present.
Can you double-check you have “nomodeset” on the kernel command line for discovery? That is a different command line; it’s the one rendered in pxelinux.cfg/default (Global PXELinux Template in Foreman).
- Yes, the kernel parameter is present.
When discovery shows up, can you verify the text mode is 80x25?
- The size of the current window when I check on tty3 is 75 x 100. I am using VMware Remote Console. I am not able to get the Console button to work, if that is what you are referring to?
Which graphics card driver do you use for this VM?
- Whichever is the default (VMware SVGA II)?
Can you switch to a different one if your VMware offers one?
- I don’t see any option to change it to something else.
Can you boot with fdi.rootpw=password fdi.ssh=1 and, once the host is discovered, do the following?
- Connect there via ssh. - OK
- Provide me the output of “lsmod”, please. - OK - https://pastebin.com/JhBMMFiG
- Executing “discovery-debug” won’t hurt. - link: https://cloud.nttcscd.com/s/Q0FEtZjkLiXXwda
- Then shut down the menu with “systemctl stop discovery-menu” prior to kexecing. - OK
- Try kexec once again. - OK - Same result
- If all fails, the last test is to find your kernel/initramdisk and then run the kexec command via ssh with the --debug option, the same way discovery calls it. You will find the options if you preview the kexec template for the host. If you do this via ssh, you will be left with lots of debugging info before the ssh connection drops/stalls. Pastebin that too, although this is more for CentOS/RHEL engineering than me d=) - I’ll try this over the weekend and report back what happens.

lzap · February 5, 2018, 12:04pm

Ok I see an offender here:

drm_kms_helper        159169  1 vmwgfx

The goal is to get rid of this driver, and that should do it! Now the big question for me is how FDI could load this driver, since we remove all KMS drivers from the image. See here:

https://github.com/theforeman/foreman-discovery-image/blob/master/25-minimize.ks#L24-L26

# See https://bugzilla.redhat.com/show_bug.cgi?id=1335830
echo " * remove KMS DRM video drivers to prevent kexec isues"
rm -rf /lib/modules/*/kernel/drivers/gpu/drm /lib/firmware/{amdgpu,radeon}

This is definitely in version 3.4.4; we added it somewhere around 3.4.0, I think. Can you double-check where you got this driver from? I don’t see it on 3.4.4 at all:

find / -name vmwg\*

Zero results. Is it possible that VMware somehow magically “injects” the driver into the kernel? As far as I remember, VMware has these Guest Tools which you need to install explicitly via ISO, but the last time I tried this was 10 years ago. Happy libvirt/RHEV user here.
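
Both checks from this post can be sketched as small shell functions: one lists the modules that lsmod reports as users of drm or drm_kms_helper, the other searches a directory tree for module files by name. The function names are made up for illustration, and the lsmod column layout (name, size, use count, users) is assumed to be the usual one.

```shell
#!/bin/sh
# Hypothetical helper: read lsmod output on stdin and print the modules
# listed as users of "drm" or "drm_kms_helper", one per line, sorted.
kms_users() {
    awk '$1 == "drm" || $1 == "drm_kms_helper" {
             n = split($4, users, ",")
             for (i = 1; i <= n; i++)
                 if (users[i] != "" && users[i] != "drm_kms_helper")
                     seen[users[i]] = 1
         }
         END { for (m in seen) print m }' | sort
}

# Hypothetical helper: list files under directory $1 whose name starts with $2.
find_module_files() {
    find "$1" -name "$2*" 2>/dev/null
}

# Usage on the discovered node:
#   lsmod | kms_users
#   find_module_files /lib/modules vmwgfx
```

An empty result from both commands would match what is expected from an unmodified 3.4.4 image.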

Maybe you were not booting 3.4.4 - it happens to the best of us! I can’t tell, because you pasted foreman-debug but I asked for discovery-debug (you run this on the discovered node itself). Anyway, you can tell from the version in the bottom-left corner.

Now, what you can do is add this to the kernel command line (PXELinux for PXE, manually in Grub/SYSLINUX if you are in PXE-less mode):

vmwgfx.blacklist=yes

That should prevent it from loading. Let’s see. The key question is why the heck you have vmwgfx available (it should not be present).
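
For a hand-edited boot configuration, appending the parameter can be sketched as below. This assumes a SYSLINUX-style file where kernel arguments live on lines starting with `append`; the file name in the usage comment is a placeholder, and `add_kernel_param` is a hypothetical helper, not part of the discovery tooling.

```shell
#!/bin/sh
# Hypothetical helper: append kernel parameter $2 to every "append" line
# of the SYSLINUX-style config file $1, unless it is already there.
add_kernel_param() {
    cfg=$1
    param=$2
    tmp=$(mktemp) || return 1
    awk -v p="$param" '
        tolower($1) == "append" {
            found = 0
            for (i = 2; i <= NF; i++) if ($i == p) found = 1
            if (!found) $0 = $0 " " p
        }
        { print }
    ' "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage (placeholder path):
#   add_kernel_param isolinux/isolinux.cfg vmwgfx.blacklist=yes
```

Running it twice leaves the file unchanged the second time, so it is safe to re-run while experimenting.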

metalcated · February 6, 2018, 2:17pm

Very interesting. Booting from the unmodified ISO worked when I blacklisted the vmwgfx module, but when I did the same on a remastered ISO, blacklisting didn’t work AND the module exists on the ISO. I used 3.4.4 to remaster, so I’m not sure how it was injected. Possibly by discovery-remaster?

https://raw.githubusercontent.com/theforeman/foreman-discovery-image/master/aux/remaster/discovery-remaster

I remastered a new ISO after adding “vmwgfx.blacklist=yes” to the kernel params inside the discovery-remaster script. I am uploading it now and will report back once I have tested it.

Thanks

metalcated · February 6, 2018, 3:08pm

Booting from RHEL5 Kickstart I finally see what the issue is:

MP-BIOS bug: 8254 timer not connected to IO-APIC

I tried a modified boot param using “noapic”, but so far that hasn’t worked.

However, blacklisting the vmwgfx module seems to have done the trick.

metalcated · February 6, 2018, 3:31pm

For anyone else having issues with booting EL5 (if for some reason you need to do so like me), make sure you choose an older version of VMware hardware (like version 4 or 5) when creating the VM.

Thanks

lzap · February 7, 2018, 9:07am

The remaster process is actually just unpacking the ISO, modifying the bootloader config and wrapping it up again. It had to be some human mistake: the vmwgfx file is not present on the 3.4.4 image at all, it’s deleted during the build process along with the other KMS drivers. So this is my take.

Anyway, excellent troubleshooting, and thanks for the info about the older VMware version, that’s very helpful! I don’t use VMware myself. I will put this into the docs or somewhere else.

metalcated · February 8, 2018, 7:52pm

I found that you can go up to VMware hardware version 7 for EL5. With anything newer you get the MP-BIOS error.
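
A quick way to check an existing VM against that limit, as a sketch: it assumes the VM’s .vmx file stores the hardware version in the `virtualHW.version` key, takes the limit of 7 from the observation above, and uses made-up function names and a placeholder path.

```shell
#!/bin/sh
# Print the hardware version found in the .vmx file $1 (empty if none).
vmx_hw_version() {
    grep -i '^[[:space:]]*virtualHW\.version[[:space:]]*=' "$1" \
        | head -n 1 | sed 's/[^0-9]//g'
}

# Succeed if the version is known and no newer than 7 (limit reported
# in this thread for EL5 guests).
el5_hw_version_ok() {
    v=$(vmx_hw_version "$1")
    [ -n "$v" ] && [ "$v" -le 7 ]
}

# Usage (placeholder path):
#   el5_hw_version_ok /path/to/guest.vmx || echo "hardware version too new for EL5"
```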

And I’m more than happy to provide the information I found. As for the remastering, it’s possible I used the 3.4.3 image, but it does say 3.4.4 at the bottom… idk. Either way, the issue has been identified and resolved.

Thanks again!

lzap · February 9, 2018, 12:45pm

Thanks: https://github.com/theforeman/theforeman.org/pull/1013