Discovered hosts have no provisioning token on initial build

I reinstalled and dropped all other patches, applying only the one mentioned. Unfortunately that meant I lost the extra logging, but it seems we still fail to deploy the token to the smart proxies.

This shows the deployed files are fresh:

fmspx1-ob-159 /var/lib/tftpboot # ls -l `find . -name '*70*10*6f*dd*a0'`
-rw-r--r-- 1 foreman-proxy foreman-proxy 1002 Nov 23 04:47 ./grub2/grub.cfg-01-70-10-6f-a9-dd-a0
-rw-r--r-- 1 foreman-proxy foreman-proxy 1002 Nov 23 04:47 ./grub2/grub.cfg-70:10:6f:a9:dd:a0
-rw-r--r-- 1 foreman-proxy foreman-proxy  564 Nov 23 04:47 ./pxelinux.cfg/01-70-10-6f-a9-dd-a0
# This file was deployed via 'MSE Kickstart PXELinux' template


# token value is -->  <--  (note: the empty space between the arrows is the result of <%= @host.token %>)

DEFAULT menu
MENU TITLE Booting into OS installer (ESC to stop)
TIMEOUT 100
ONTIMEOUT installer

LABEL installer
  MENU LABEL MSE Kickstart PXELinux
  KERNEL http://bbrepo.bdns.bloomberg.com/pub/repos/rhel/rhel-server-7.6-x86_64//images/pxeboot/vmlinuz
  APPEND initrd=http://bbrepo.bdns.bloomberg.com/pub/repos/rhel/rhel-server-7.6-x86_64//images/pxeboot/initrd.img ks=http://fmspx1.bdns.bloomberg.com:8440/unattended/provision    BOOTIF=70:10:6f:a9:dd:a0
  IPAPPEND 2
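
For completeness, one way I could confirm whether a token exists on the server side at all is via the Rails console. A rough sketch (assuming the stock token association that <%= @host.token %> reads is unchanged; HOST_FQDN is a placeholder for the real host name):

# hedged sketch: print the server-side token value for a host
# HOST_FQDN is a placeholder; model/association names assume stock Foreman
echo 'puts Host.find_by_name("HOST_FQDN").token&.value' | foreman-rake console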

I’m not sure whether different logging is needed to help diagnose this with the new approach.
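
In case it’s useful, what I could do is raise the server log level and watch the smart proxy while re-triggering the build, roughly:

# on the Foreman server: set ':level: debug' under ':logging:' in /etc/foreman/settings.yaml
# (stock path; ours is overwritten by chef) and restart the service
# on the smart proxy: watch for the incoming TFTP deploy call while re-triggering the build
tail -f /var/log/foreman-proxy/proxy.log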

Which Foreman version are you now on? I am trying to reproduce this on 2.3 RC2 which I have just installed. Any special instructions?

I just tried the discovered-host PXE workflow in 2.3 RC2 and it went smoothly:

[root@rc ~]# cat /var/lib/tftpboot/pxelinux.cfg/01-52-54-00-f0-6c-08
# This file was deployed via 'Kickstart default PXELinux' template
DEFAULT menu
MENU TITLE Booting into OS installer (ESC to stop)
TIMEOUT 100
ONTIMEOUT installer

LABEL installer
  MENU LABEL Kickstart default PXELinux
  KERNEL boot/centos-mirror-ZFYGt3CWaXkr-vmlinuz
  APPEND initrd=boot/centos-mirror-ZFYGt3CWaXkr-initrd.img ks=http://rc.nat.lan/unattended/provision?token=ffff5f4e-f1ba-4c31-bce2-c3f6b56400e5  network ksdevice=bootif ks.device=bootif BOOTIF=00-52-54-00-f0-6c-08 kssendmac ks.sendmac inst.ks.sendmac ip=dhcp nameserver=192.168.199.4
  IPAPPEND 2

You must be configuring something differently.

foreman.noarch                          1.24.3-1.el7

We have our servers operating behind an HA proxy frontend, and our smart proxies behind a DNS-based load distributor.

We use memcached and an external PostgreSQL database. I imagine there are significant differences between our configuration and what you might have available.

Our general setup is that we run foreman-installer to do initial configuration against a foreman-answers file placed onto a new machine via chef. However, we don’t use Puppet; instead we overwrite a few of the configuration files using chef, such as /etc/foreman/settings.yml, and deploy some SSL certs to provide our own PKI.

We’ll start trying to reproduce this in a Docker container. It will be a bit of a challenge for me to figure out, but it should help you help us.

I have been able to repro in a container environment. Unfortunately this still uses internal Docker containers and repositories, so it can’t be reproduced externally yet. We’ll use it to check solutions before attempting them in our production environment. The setup just involves:

  • creating the containers with SSL ports exposed
  • deploying the foreman-answers files onto the server and smartproxy containers
  • generating and deploying SSL certificates
  • running foreman-installer
  • adding the smart proxy to the server config

There are some manual steps to create an organisation/hostgroup/subnet and all the associations, and then I have a script that posts some fake facts to the smart proxy to create a discovered host (roughly along the lines of the sketch below). On “building” this fake host I see a token appear in the server-side “review” of the template, but in the actual template on the smart proxy there is no token.
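
The fact-posting script is more involved than this, but the shape of it is a fact upload against the foreman_discovery API (endpoint and fact names here are from memory; the MAC/IP are the fake values used elsewhere in this thread):

# hedged sketch of the fake-facts upload; my real script goes via the smart proxy,
# but it should end up at the foreman_discovery facts endpoint on the server
curl -k -X POST https://server.container.com/api/v2/discovered_hosts/facts \
  -H 'Content-Type: application/json' \
  -d '{"facts": {"discovery_bootif": "00:0c:29:cf:00:45",
                 "macaddress_eth0": "00:0c:29:cf:00:45",
                 "ipaddress_eth0":  "10.246.195.5",
                 "interfaces":      "eth0"}}'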

If it would be worthwhile, I’m fairly sure I could now do a port that could be used externally; otherwise I thought you might just ask for specifics, as this Docker system is significantly simpler than our production system.

---
- name: provision docker containers
  become: false
  gather_facts: false
  hosts: localhost
  connection: local
  vars_files:
    - vars.yml
  tasks:
    - docker_network:
        name: foreman
    - include_tasks: create_container.yml
      loop: "{{ containers }}"

  tags:
    - cert

- name: Add foreman
  hosts:
    - server.container.com
    - smartproxy.container.com
  connection: docker
  remote_user: root
  gather_facts: false
  tasks:
    - user:
        name: puppet
    - group:
        name: puppet
    - copy:
        src: files/foreman_mse.repo
        dest: /etc/yum.repos.d/foreman_mse.repo
    - package:
        name:
          - foreman-installer
          - vim
          - hostname
          - net-tools
          - less
          - tcpdump
- include: create_pki.yml
- name: Configure server
  hosts:
    - server.container.com
  connection: docker
  remote_user: root
  gather_facts: false
  tasks:
    - copy:
        src: files/server_foreman-answers.yaml
        dest: /etc/foreman-installer/scenarios.d/foreman-answers.yaml
    - package:
        name:
          - tfm-rubygem-foreman_discovery
- name: Configure smartproxy
  hosts:
    - smartproxy.container.com
  connection: docker
  remote_user: root
  gather_facts: false
  tasks:
    - copy:
        src: files/smartproxy_foreman-answers.yaml
        dest: /etc/foreman-installer/scenarios.d/foreman-answers.yaml
    - package:
        name:
          - rubygem-smart_proxy_discovery
- name: Run foreman-installer
  hosts:
    - smartproxy.container.com
    - server.container.com
  connection: docker
  remote_user: root
  gather_facts: false
  tasks:
    - command: foreman-installer
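
The plays are then run in the usual way; something like this (the inventory and playbook file names are just whatever I happened to use locally):

# hypothetical invocation; vars.yml, create_container.yml, create_pki.yml and files/
# referenced above live alongside the playbook
ansible-playbook -i inventory provision.yml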

I am not sure I will be able to set up Docker. However, why don’t you describe to me exactly how you set up your subnet and hostgroup (every single field, to the last detail)? I can test the same setup on my system. There must be something you don’t set up that is obvious to me.

Also to rule out UI issues, could you trigger the provisioning via our CLI?

  • Create operating system “redhat” -> 7.6, Red Hat family, select all architectures, Kickstart default partition table, Debian mirror
  • Provisioning templates -> edit “Preseed default PXEGrub2” -> add an association for the new OS “redhat 7.6”
  • Add the template to the “redhat 7.6” OS
  • Add my smart proxy -> smartproxy.container.com -> https://smartproxy.container.com:8443
  • Create domain bloomberg.com, to match the posted discovered facts
  • Create a subnet for my fake discovered host, 10.246.195.0/24 for host 10.246.195.5. Configure all proxies for this subnet to point at the smart proxy, and associate the domain created above (a rough hammer equivalent of the subnet/hostgroup steps is sketched after this list)
  • Create “myhostgroup” with an association to the bloomberg.com domain. Configure arch i686, OS “redhat 7.6”, media “Debian mirror”, partition table “Kickstart default”, set the Linux password to “password”, and the IPv4 subnet to the one created above
  • Settings -> Discovered -> reboot -> “No” (as our discovered host doesn’t actually exist in this test). This doesn’t seem to stop the test from working, but it does cause the UI to hang
  • Provision the discovered host -> select “myhostgroup” when prompted, otherwise defaults
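
For reference, a rough hammer equivalent of the subnet and hostgroup steps (option names are from memory and not double-checked against hammer --help, so treat this as an approximation of what the UI clicks do):

hammer --verify-ssl 0 -p password subnet create --name "10.246.195.0/24" \
  --network 10.246.195.0 --mask 255.255.255.0 --domains bloomberg.com \
  --tftp-id 1   # proxy id from 'hammer proxy list'; other proxy roles set the same way
hammer --verify-ssl 0 -p password hostgroup create --name myhostgroup \
  --architecture i386 --operatingsystem "redhat 7.6" --medium "Debian mirror" \
  --partition-table "Kickstart default" --domain bloomberg.com \
  --subnet "10.246.195.0/24" --pxe-loader "Grub2 UEFI"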

On reviewing the created host:
https://localhost:8443/unattended/PXEGrub2?hostname=mac000c29cf0045.bloomberg.com

menuentry 'Preseed default PXEGrub2' {
  linux  boot/debian-mirror-stNH1XDWq2I1-vmlinuz interface=auto url=http://server.container.com/unattended/provision?token=e0f07bf9-2b15-4742-95de-020488675b4a ramdisk_size=10800 root=/dev/rd/0 rw auto hostname=mac000c29cf0045.bloomberg.com console-setup/ask_detect=false console-setup/layout=USA console-setup/variant=USA keyboard-configuration/layoutcode=us localechooser/translation/warn-light=true localechooser/translation/warn-severe=true locale=en_US BOOTIF=01-$net_default_mac
  initrd boot/debian-mirror-stNH1XDWq2I1-initrd.img
}

However, on the smart proxy:

[root@smartproxy tftpboot]# cat grub2/grub.cfg-01-00-0c-29-cf-00-45

#
# This file was deployed via 'Preseed default PXEGrub2' template
#
# Supported host/hostgroup parameters:
#
# blacklist = module1, module2
#   Blacklisted kernel modules
#
# lang = en_US
#   System locale
#
set default=0
set timeout=10

menuentry 'Preseed default PXEGrub2' {
  linux  boot/debian-mirror-stNH1XDWq2I1-vmlinuz interface=auto url=http://server.container.com/unattended/provision ramdisk_size=10800 root=/dev/rd/0 rw auto hostname=mac000c29cf0045.bloomberg.com console-setup/ask_detect=false console-setup/layout=USA console-setup/variant=USA keyboard-configuration/layoutcode=us localechooser/translation/warn-light=true localechooser/translation/warn-severe=true locale=en_US BOOTIF=01-$net_default_mac
  initrd boot/debian-mirror-stNH1XDWq2I1-initrd.img
}
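
The comparison itself is scripted roughly as follows, run from the Docker host (container name as Ansible created it in my setup; -k because of our own PKI):

# fetch the server-side render and the file the proxy actually deployed, then diff them;
# the expected difference is just the ?token=... query string, but in our case the
# proxy copy has no token at all
curl -sk "https://localhost:8443/unattended/PXEGrub2?hostname=mac000c29cf0045.bloomberg.com" > /tmp/server.cfg
docker exec smartproxy.container.com cat /var/lib/tftpboot/grub2/grub.cfg-01-00-0c-29-cf-00-45 > /tmp/proxy.cfg
diff -u /tmp/server.cfg /tmp/proxy.cfg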

For your CLI request I repeated using hammer, which I’m less familiar with:

[root@server /]# hammer --verify-ssl 0 -p password discovery list
---|-----------------|-------------------|------|--------|------------|------------|-----------------------------------|--------------------
ID | NAME            | MAC               | CPUS | MEMORY | DISK COUNT | DISKS SIZE | SUBNET                            | LAST REPORT        
---|-----------------|-------------------|------|--------|------------|------------|-----------------------------------|--------------------
2  | mac000c29cf0045 | 00:0c:29:cf:00:45 | 0    | 0      | 0          | 0          | 10.246.195.0/24 (10.246.195.0/24) | 2020/12/10 13:45:43
---|-----------------|-------------------|------|--------|------------|------------|-----------------------------------|--------------------

[root@server /]# hammer --verify-ssl 0 -p password discovery provision --name mac000c29cf0045 --hostgroup myhostgroup
Host created

There seems to be some misconfiguration; I don’t understand why the above hammer commands would result in the following, where the template selected for the OS appears to be ignored:

[root@smartproxy /]# cat /var/lib/tftpboot/grub2/grub.cfg-01-00-0c-29-cf-00-45 



set default=local
set timeout=20
echo Default PXE local template entry is set to 'local'


insmod part_gpt
insmod fat
insmod chain

menuentry 'Chainload Grub2 EFI from ESP' --id local_chain_hd0 {
  echo Chainloading Grub2 EFI from ESP, enabled devices for booting:
  ls
  echo "Trying /EFI/fedora/shim.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/fedora/shim.efi
  if [ -f ($chroot)/EFI/fedora/shim.efi ]; then
    chainloader ($chroot)/EFI/fedora/shim.efi
    echo "Found /EFI/fedora/shim.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/fedora/grubx64.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/fedora/grubx64.efi
  if [ -f ($chroot)/EFI/fedora/grubx64.efi ]; then
    chainloader ($chroot)/EFI/fedora/grubx64.efi
    echo "Found /EFI/fedora/grubx64.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/redhat/shim.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/redhat/shim.efi
  if [ -f ($chroot)/EFI/redhat/shim.efi ]; then
    chainloader ($chroot)/EFI/redhat/shim.efi
    echo "Found /EFI/redhat/shim.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/redhat/grubx64.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/redhat/grubx64.efi
  if [ -f ($chroot)/EFI/redhat/grubx64.efi ]; then
    chainloader ($chroot)/EFI/redhat/grubx64.efi
    echo "Found /EFI/redhat/grubx64.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/centos/shim.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/centos/shim.efi
  if [ -f ($chroot)/EFI/centos/shim.efi ]; then
    chainloader ($chroot)/EFI/centos/shim.efi
    echo "Found /EFI/centos/shim.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/centos/grubx64.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/centos/grubx64.efi
  if [ -f ($chroot)/EFI/centos/grubx64.efi ]; then
    chainloader ($chroot)/EFI/centos/grubx64.efi
    echo "Found /EFI/centos/grubx64.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/debian/grubx64.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/debian/grubx64.efi
  if [ -f ($chroot)/EFI/debian/grubx64.efi ]; then
    chainloader ($chroot)/EFI/debian/grubx64.efi
    echo "Found /EFI/debian/grubx64.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/ubuntu/grubx64.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/ubuntu/grubx64.efi
  if [ -f ($chroot)/EFI/ubuntu/grubx64.efi ]; then
    chainloader ($chroot)/EFI/ubuntu/grubx64.efi
    echo "Found /EFI/ubuntu/grubx64.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/sles/grubx64.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/sles/grubx64.efi
  if [ -f ($chroot)/EFI/sles/grubx64.efi ]; then
    chainloader ($chroot)/EFI/sles/grubx64.efi
    echo "Found /EFI/sles/grubx64.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/opensuse/grubx64.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/opensuse/grubx64.efi
  if [ -f ($chroot)/EFI/opensuse/grubx64.efi ]; then
    chainloader ($chroot)/EFI/opensuse/grubx64.efi
    echo "Found /EFI/opensuse/grubx64.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo "Trying /EFI/Microsoft/boot/bootmgfw.efi "
  unset chroot
  search --file --no-floppy --set=chroot /EFI/Microsoft/boot/bootmgfw.efi
  if [ -f ($chroot)/EFI/Microsoft/boot/bootmgfw.efi ]; then
    chainloader ($chroot)/EFI/Microsoft/boot/bootmgfw.efi
    echo "Found /EFI/Microsoft/boot/bootmgfw.efi at $chroot, attempting to chainboot it..."
    sleep 2
    boot
  fi
  echo Partition with known EFI file not found, you may want to drop to grub shell
  echo and investigate available files updating 'pxegrub2_chainload' template and
  echo the list of known filepaths for probing. Contents of \EFI directory:
  ls ($chroot)/EFI
  echo The system will halt in 2 minutes or press ESC to halt immediately.
  sleep -i 120
  halt --no-apm
}

menuentry 'Chainload into BIOS bootloader on first disk' --id local_chain_legacy_hd0 {
  set root=(hd0,0)
  chainloader +1
  boot
}

menuentry 'Chainload into BIOS bootloader on second disk' --id local_chain_legacy_hd1 {
  set root=(hd1,0)
  chainloader +1
  boot
}


menuentry 'Foreman Discovery Image httpboot efi' --id discoveryefihttpboot {
  linuxefi /httpboot/boot/fdi-image/vmlinuz0 rootflags=loop root=live:/fdi.iso rootfstype=auto ro rd.live.image acpi=force rd.luks=0 rd.md=0 rd.dm=0 rd.lvm=0 rd.bootif=0 rd.neednet=0 nokaslr nomodeset proxy.url=https://server.container.com proxy.type=foreman BOOTIF=01-$net_default_mac
  initrdefi /httpboot/boot/fdi-image/initrd0.img
}

menuentry 'Foreman Discovery Image  efi' --id discoveryefi {
  linuxefi boot/fdi-image/vmlinuz0 rootflags=loop root=live:/fdi.iso rootfstype=auto ro rd.live.image acpi=force rd.luks=0 rd.md=0 rd.dm=0 rd.lvm=0 rd.bootif=0 rd.neednet=0 nokaslr nomodeset proxy.url=https://server.container.com proxy.type=foreman BOOTIF=01-$net_default_mac
  initrdefi boot/fdi-image/initrd0.img
}

menuentry 'Foreman Discovery Image httpboot ' --id discoveryhttpboot {
  linux /httpboot/boot/fdi-image/vmlinuz0 rootflags=loop root=live:/fdi.iso rootfstype=auto ro rd.live.image acpi=force rd.luks=0 rd.md=0 rd.dm=0 rd.lvm=0 rd.bootif=0 rd.neednet=0 nokaslr nomodeset proxy.url=https://server.container.com proxy.type=foreman BOOTIF=01-$net_default_mac
  initrd /httpboot/boot/fdi-image/initrd0.img
}

menuentry 'Foreman Discovery Image  ' --id discovery {
  linux boot/fdi-image/vmlinuz0 rootflags=loop root=live:/fdi.iso rootfstype=auto ro rd.live.image acpi=force rd.luks=0 rd.md=0 rd.dm=0 rd.lvm=0 rd.bootif=0 rd.neednet=0 nokaslr nomodeset proxy.url=https://server.container.com proxy.type=foreman BOOTIF=01-$net_default_mac
  initrd boot/fdi-image/initrd0.img
}

Can you provide me the hammer hostgroup info output for that hostgroup? Feel free to anonymize the output, but only to the degree that I can still read the important parts.

This is all “fake” data just created on the containers:

[root@server /]# hammer --verify-ssl 0 -p password hostgroup list
---|-------------|-------------|------------------|--------------------|------
ID | NAME        | TITLE       | OPERATING SYSTEM | PUPPET ENVIRONMENT | MODEL
---|-------------|-------------|------------------|--------------------|------
1  | myhostgroup | myhostgroup | redhat 7.6       |                    |      
---|-------------|-------------|------------------|--------------------|------
[root@server /]# hammer --verify-ssl 0 -p password hostgroup info --id 1
Id:               1
Name:             myhostgroup
Title:            myhostgroup
Description:      
  
Network:          
    Subnet ipv4: 10.246.195.0/24
    Domain:      bloomberg.com
Operating system: 
    Architecture:     i386
    Operating System: redhat 7.6
    Medium:           Debian mirror
    Partition Table:  Kickstart default
    PXE Loader:       Grub2 UEFI
Puppetclasses:    

Parameters:       

Locations:        
    Default Location
Organizations:    
    Default Organization

This is probably the only difference; all my systems are x86_64.

Full disclosure: I am testing with Foreman 2.3 and CentOS, but whether it’s Debian or Red Hat does not matter; both PXE templates use the same foreman_url macro.

One thing I do see is plugins; foreman_host_extra_validator is the one I suspect, though granted it is fairly small. It is worth trying without these plugins, ideally without all of them.

But the most interesting thing is the last finding: you see the local PXE Grub2 template, meaning that build mode was not engaged when the template was rendered. That’s a difference from the PXELinux case, where the host was in build mode (but the token was missing). Something is going on here that I am unable to understand.
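
If you want to double check that, the build flag is visible on the host API record; something like this (credentials as in your hammer calls, -k for your internal PKI) should say whether the host is currently flagged for build:

# check the "build" attribute of the discovered-then-provisioned host
curl -sk -u admin:password \
  "https://server.container.com/api/hosts/mac000c29cf0045.bloomberg.com" | grep -o '"build":[^,]*'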

Regarding plugins: this was part of the reason I reproduced in Docker, without all the complexity of our production deployment.

I’m able to reproduce in docker and here we’re just using the foreman_discovery plugin.

I will repro again with x86_64, and then I think the only thing left to try is moving to the same version to see whether that resolves it.

I think I’ve managed to conclude that this is a problem with the server-side foreman-discovery gem that we’re using - 16.0.1.

I think I know that because I kept the smart proxy completely the same but upgraded the Foreman server to 2.2 (via a migration). I believe the foreman_discovery plugin was kept at the same version, and I managed to repro the issue. I definitely confirmed that I was on 2.2 and that the smart proxy was unchanged.
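
For the record, I am confirming the versions on each side with a plain package query:

# on the Foreman server
rpm -q foreman tfm-rubygem-foreman_discovery
# on the smart proxy
rpm -q foreman-proxy rubygem-smart_proxy_discovery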

Because that was quite messy, I repeated with a clean install on the server, but accidentally went to 2.3.1 (latest) rather than stable. I kept the smart proxy build exactly the same (no upgrade at all). When I did this and repeated my configuration steps it all works fine (I am not able to reproduce the issue).

So IMO either the problem existed up to 2.2 and was fixed in 2.3.1 (unlikely), or the actual problem was indeed tfm-rubygem-foreman_discovery 16.0.1.

Either way, I think we’re happy and can make some good decisions from this info.

Thanks for getting back to us. I quickly checked the git log and I don’t see any commit relating to tokens; it must have been some incompatibility which slipped through our testing process. Weird.

We only maintain the two most recent releases, currently 2.2/2.3, so I strongly suggest planning your upgrades accordingly. For a longer lifecycle you can get Red Hat Satellite or ATIX.

Thanks,

Agreed; I had a bit of a search through the changes and couldn’t spot anything either. I have now managed a repeat with a 2.2.1 clean install (works), and a repeat holding the plugin version and migrating from 1.24.3 to 2.2.1, which again reproduces the error.

I missed a third possibility: that it’s the migration carrying over some bad config. Unfortunately I don’t think I can easily figure out how to do a clean install and then downgrade, and that seems like a pain for no benefit anyway. As you say, whatever it was, it’s long in the past.


Okay, honestly this thread has been sitting in my stomach for a few weeks now; I am happy you sorted it out. I saw a similar error caused by a hook, but you showed you don’t have any. It could have been a bug; I expect 2.3 to be better than 2.0 or 2.2, honestly.
