Cloud-init Issue with VMware

Problem:
I used HashiCorp Packer and a ks.cfg to build a gold image from the CentOS 8/9 DVD ISO. It has cloud-init baked in, and I followed the instructions from the doc (11.7. Using VMware cloud-init and userdata templates for provisioning) as well.
Once the VM has been converted to a VM template in vSphere, I go to Compute Resource → Image → link the VM template. Everything was fine and worked great during image-based provisioning.
However, I noticed that my VM reboots about 30 seconds after it boots, while the cloud-init and userdata templates are being applied. I think the reboot breaks the configuration process, so the VM is not fully configured with all the snippets applied.
I have no idea why it happened.

I checked /var/log/messages and could not find any hints. A general summary of my cloud-init template:

  1. set hostname
  2. update time
  3. add authorized keys
  4. subscribe to foreman
  5. install some packages from repo
  6. configure sssd and realmd to join the domain for AD authentication
  7. install puppet-agent, update its config file and then run puppet agent -t
    But it always reboots during the last step: the VM gets puppet-agent installed but sometimes cannot update the puppet.conf file, which causes the VM to fail to get its Puppet classes.

I checked my snippet and there is no trigger for the reboot.

The error before the reboot is 'failed unmounting /var', but I think that is triggered by the reboot itself, since we have a separate partition for /var.
I have no idea why it reboots.
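
For reference, a rough way to confirm that nothing in the rendered user data or the cloud-init logs requests a reboot (these are the standard cloud-init paths, nothing specific to my setup):

# search the rendered user data and runcmd scripts for anything that could reboot the box
grep -rniE 'reboot|shutdown|power_state' /var/lib/cloud/instance/user-data.txt /var/lib/cloud/instance/scripts/ 2>/dev/null
# and the cloud-init logs themselves
grep -niE 'reboot|shutdown|power_state' /var/log/cloud-init.log /var/log/cloud-init-output.log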

Expected outcome:
I expected that there would be no reboot after the VM is provisioned via Foreman with the image-based option.

Foreman and Proxy versions:
Foreman 3.10 and Katello 4.12

Info:
Here is the cloud-init template that is supposed to be applied to the VM:

#cloud-config
hostname: devc9reboot2.corp.abc.ca
fqdn: devc9reboot2.corp.abc.ca
manage_etc_hosts: true
users: {}
runcmd:

  - |
    echo "devc9reboot2" > /etc/hostname

    hostname devc9reboot2

    cat > /etc/hosts << EOF
    127.0.0.1 devc9reboot2.corp.abc.ca devc9reboot2 localhost localhost.localdomain
    ::1 ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    EOF

  - |

  - |

    echo "Updating system time"
    systemctl enable --now chronyd
    /usr/bin/chronyc -a makestep
    /usr/sbin/hwclock --systohc

  - |

  - |

    echo "################# SUBSCRIPTION MANAGER #######################"
    echo
    echo "Starting the subscription-manager registration process"

    # Disable online yum repos
    echo "Disabling all online yum repositories"
    yum-config-manager -y --disable '*' > /dev/null

    rpm -Uvh http://foremanvan01.corp.abc.ca/pub/katello-ca-consumer-latest.noarch.rpm

    # On rare occasions, the system has already attached a subscription, which causes
    # subsequent steps to fail.
    subscription-manager clean

    subscription-manager register --name="devc9reboot2.corp.abc.ca" --org='Systems' --activationkey='CentOS Stream 9 Server Production' --force || subscription-manager register --name="devc9reboot2.corp.abc.ca" --org='Systems' --activationkey='CentOS Stream 9 Server Production' --force

  - |

    user_exists=false
    getent passwd root >/dev/null 2>&1 && user_exists=true

    if $user_exists; then

    mkdir -p ~root/.ssh

    cat << EOF >> ~root/.ssh/authorized_keys
    ssh-rsa fake_one_AAAADAQABAAABgQDbZQepP3IFryQ5GHDCYYytoOEvUanHOkmkMBvlC6cnLOqGeXXeLI34S1+HVRwaLiKJrqiJmFdTwpnoBn4eEfqjWH26NY/SwEhFsMMewfewfwefTudAxESvXjkzXHmsKaYLrLnIWn2voK6zrdghX5kousCnAyIQEeJDAD9PiuasdasdasdasdFS1ZUR0DMG4OJCq6JN9HLA1+4Krkq0YofWB5MTGTZw/mBAL8tetQnLMoWiLTN0zmAUBspZKUBPfAA7Gn8ybRd6OoD2rjoRm/AzyXNxRui+LYnqrTsxwJOJLvwKjUTQJqaIDfk= foreman-proxy@smrtprxvan01.corp.abc.ca
    EOF

    chmod 0700 ~root/.ssh
    chmod 0600 ~root/.ssh/authorized_keys
    chown -R root: ~root/.ssh
    chown -R root: ~root

    # Restore SELinux context with restorecon, if it's available:

    command -v restorecon && restorecon -RvF ~root/.ssh || true

    else
    echo 'The remote_execution_ssh_user does not exist and remote_execution_create_user is not set to true. remote_execution_ssh_keys snippet will not install keys'
    fi

  - |

    # Install required packages

    yum install -y sssd realmd oddjob oddjob-mkhomedir adcli samba-common samba-common-tools krb5-workstation openldap-clients fping

    cat > /root/join_realm.sh << 'EOF'
    #!/bin/sh

    DONE=0
    COUNT=0

    if test -f /etc/krb5.keytab; then
    DONE=1
    fi

    while [ $DONE -eq 0 ] && [ $COUNT -le 30 ]; do
    fping -q corp.abc.ca || { sleep 2; COUNT=$(( $COUNT + 1 )); continue; }

    # Attempt to join the domain
    echo wan@abc | kinit jointhedomain@CORP.abc.ca
    realm leave corp.abc.ca
    sleep 10
    realm join corp.abc.ca

    if [ $? -eq 0 ]; then
    if ! grep -q 'case_sensitive = false' /etc/sssd/sssd.conf 2> /dev/null; then
    sed -i 's|\[domain/corp.abc.ca\]|[domain/corp.abc.ca]\ncase_sensitive = false|' /etc/sssd/sssd.conf 2> /dev/null
    fi
    fi

    systemctl enable --now sssd
    systemctl restart sssd
    DONE=1
    

    fi

    COUNT=$(( COUNT + 1 ))
    done

    rm -f /root/join_realm.sh

    EOF

    chown -v root:root /root/join_realm.sh
    chmod -v 0700 /root/join_realm.sh

    cat > /lib/systemd/system/realm-join.service << EOF
    [Unit]
    Description=Realm join

    [Install]
    WantedBy=multi-user.target

    [Service]
    ExecStart=/bin/bash /root/join_realm.sh
    Type=simple
    User=root
    Group=root
    WorkingDirectory=/root
    Restart=on-failure
    EOF

    systemctl daemon-reload
    systemctl enable --now realm-join.service

  - |
    if [ -f /usr/bin/dnf ]; then
    dnf -y install puppet-agent
    else
    yum -t -y install puppet-agent
    fi

    cat > /etc/puppetlabs/puppet/puppet.conf << 'EOF'
    [main]
    vardir = /var/lib/puppet
    logdir = /var/log/puppet
    rundir = /var/run/puppet
    ssldir = $vardir/ssl

    [agent]
    pluginsync = true
    report = true
    ca_server = pptmstrvan01.corp.abc.ca
    certname = devc9reboot2.corp.abc.ca
    server = pptmstrvan01.corp.abc.ca
    environment = production

    EOF

    puppet_unit=puppet
    /usr/bin/systemctl list-unit-files | grep -q puppetagent && puppet_unit=puppetagent
    /usr/bin/systemctl enable ${puppet_unit}

    # export a custom fact called 'is_installer' to allow detection of the installer environment in Puppet modules
    export FACTER_is_installer=true

    # passing a non-existent tag like "no_such_tag" to the puppet agent only initializes the node
    # You can select specific tag(s) with the "run-puppet-in-installer-tags" parameter
    # or set a full puppet run by setting "run-puppet-in-installer" = true
    echo "Performing initial puppet run for --tags no_such_tag"
    /opt/puppetlabs/bin/puppet agent --config /etc/puppetlabs/puppet/puppet.conf --onetime --tags no_such_tag --server pptmstrvan01.corp.abc.ca --no-daemonize
    /opt/puppetlabs/bin/puppet resource service puppet ensure=running

phone_home:
  url: http://smrtprxvan01.corp.abc.ca:8000/unattended/built
  post: []
  tries: 10

i’m not in front of my instance but maybe i can help anyway

the UserData Template is using vmware tools and vCenter guest customization to set up hostname/IP/etc.
it's rebooting after this step
After the vm boots again, it contacts foreman via the cloud-init datasource and applies the cloudInit Template

can you figure out when exactly the vm reboots?
it should only reboot after the UserData Template
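
to narrow that down, something like this on the affected VM should show when the reboot happened relative to the cloud-init run (standard log locations; the vmware-imc path assumes open-vm-tools guest customization logging, so it may differ on your image):

# when did the reboots/shutdowns happen?
last -x shutdown reboot
# how far did cloud-init get, and when did each stage run?
cloud-init analyze show
# cloud-init journal from the previous (interrupted) boot
journalctl -b -1 -u cloud-init -u cloud-final
# guest customization log written by VMware Tools
cat /var/log/vmware-imc/toolsDeployPkg.log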

BTW:
I’m not using cloud-init like you
i’m using ansible for the configuration steps

My cloudInit Template is very small and only generates SSH keys and expands the HDD

i had many troubles when setting this up too, but it's been working very reliably since then, except when a cloud-init update broke cloud-init :wink:

yep, the UserData Template uses vmware tools, and the VM does get the IP, DNS and hostname set up correctly.
The VM contacts Foreman, gets the cloud-init template, and applies it itself.
The reboot is triggered while the cloud-init steps' output is still on the screen (close to the end, after it has installed the puppet agent successfully, and then all of a sudden it reboots).
I assumed it would write puppet.conf as well, but that does not happen every time. It looks like if it reboots earlier, cloud-init does not have sufficient time to apply the config. And I have no idea why it reboots itself. Is it related to cloud-init? Or to Foreman? Or to the provisioning template?

maybe i ran into the same problem as you in the past and decided to do the configuration via Ansible, can't remember

as far as i know
the VM reboots because of the UserData vmware-tools Template

But the doc did not mention that the userdata template will trigger the reboot:

  • The user provisions one or more virtual machines using the Satellite web UI, API, or hammer
  • Satellite calls the VMware vCenter to clone the virtual machine template
  • Satellite userdata provisioning template adds customized identity information
  • When provisioning completes, the Cloud-init provisioning template instructs the virtual machine to call back to Capsule when cloud-init runs
  • VMware vCenter clones the template to the virtual machine
  • VMware vCenter applies customization for the virtual machine’s identity, including the host name, IP, and DNS
  • The virtual machine builds, cloud-init is invoked and calls back Satellite on port 80, which then redirects to 443
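
(For reference, that callback is just cloud-init's phone_home module hitting the 'built' endpoint; with the Capsule URL from my template above, a rough manual equivalent would be the following. Note that it marks the host as built in Foreman, so only use it against a throwaway host.)

# manually trigger the same 'built' callback that phone_home performs
curl -X POST http://smrtprxvan01.corp.abc.ca:8000/unattended/built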

i just tested it again and built a Host

The CloudInit Template is applied before the reboot, so there's the culprit
seems like i ran into the same issue as you
i had many issues setting this up also

i even created a new cloudInit Template with only the following

#cloud-config
hostname: <%= @host.name %>
fqdn: <%= @host %>
manage_etc_hosts: true
users: {}
# use runcmd to grow the VG, LV and root filesystem, as cloud-init
# doesn't handle LVM resizing natively
runcmd:
  - cloud-init-per once grow_part growpart /dev/sda 3
  - cloud-init-per once grow_lv lvextend -r -l +100%FREE /dev/mapper/vg_system-lv_root
  - ssh-keygen -A



phone_home:
  url: <%= foreman_url('built') %>
  post: []
  tries: 10

after that it contacts katello and applies all Ansible Roles which are attached to the Host Group (after a delay)
Ansible is doing the registration and configuration

take a look at this issue

seems like there's also a fix for it

Did you also encounter the ‘out of nowhere’ reboot issue?
I am not sure if the Foreman developers have noticed this issue and have a solution for it.
If it is a common issue, is it possible to extend the time delay before it actually reboots…
It is so frustrating, as it is the last part of my config in the cloud-init template: every snippet can be reused and all configuration management can be done via Puppet.

yes it seems so

and also this might help

https://knowledge.broadcom.com/external/article/311864/how-does-vsphere-guest-os-customization.html

you need to run vmware-toolbox-cmd config set deployPkg wait-cloudinit-timeout 300, for example

seems you need to do it in the template and deploy it again via Packer
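
in a kickstart %post section that could look roughly like this (the 300-second value is just an example, and the command needs a reasonably recent open-vm-tools; the setting ends up in /etc/vmware-tools/tools.conf under [deployPkg]):

%post
# give cloud-init more time to finish before VMware Tools triggers the
# guest-customization reboot (timeout in seconds)
vmware-toolbox-cmd config set deployPkg wait-cloudinit-timeout 300
%end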

Hi dodo,

Yep. I checked out all the links you provided, and this workaround works well for my situation.
So basically I just added vmware-toolbox-cmd config set deployPkg wait-cloudinit-timeout 90 to my Packer ks.cfg, and now everything works perfectly.

Thanks!
