Cloud-init Issue with VMware

Problem:
I used HashiCorp Packer and a ks.cfg to build a gold image from the CentOS 8/9 DVD ISO. It has cloud-init baked in, and I followed the instructions from the doc (11.7. Using VMware cloud-init and userdata templates for provisioning) as well.
Once the VM has been converted to a VM template in vSphere, I go to Compute Resource → Image → link the VM template. Everything was fine and worked great during image-based provisioning.
However, I noticed that my VM reboots about 30 seconds after it boots, while the cloud-init and userdata templates are being applied. I think the reboot breaks the configuration process, so the VM is not fully configured with all the snippets applied.
I have no idea why it happened.

I checked /var/log/messages and could not find any hints. A general summary of my cloud-init template:

  1. set hostname
  2. update time
  3. add authorized keys
  4. subscribe to foreman
  5. install some packages from repo
  6. configure sssd and realmd to join the domain for AD authentication
  7. install puppet-agent, update its config file and then run puppet agent -t
    But it always reboots during the last step: the VM gets puppet-agent installed but sometimes cannot update the puppet.conf file, which causes the VM to fail to get its Puppet classes.

I checked my snippet and there is no trigger for the reboot.

The error before the reboot is 'failed unmounting /var', but I think that is triggered by the reboot itself, since we have a separate partition for /var.
I have no idea why it reboots.
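
For reference, a rough way to confirm that nothing in the rendered user data or the cloud-init logs requests a reboot (these are the standard cloud-init paths, nothing specific to my setup):

# search the rendered user data and runcmd scripts for anything that could reboot the box
grep -rniE 'reboot|shutdown|power_state' /var/lib/cloud/instance/user-data.txt /var/lib/cloud/instance/scripts/ 2>/dev/null
# and the cloud-init logs themselves
grep -niE 'reboot|shutdown|power_state' /var/log/cloud-init.log /var/log/cloud-init-output.log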

Expected outcome:
I expected that there would be no reboot after the VM is provisioned via Foreman with the image-based option.

Foreman and Proxy versions:
Foreman 3.10 and Katello 4.12

Info:
Here is the cloud-init template that is supposed to be applied to the VM:

#cloud-config
hostname: devc9reboot2.corp.abc.ca
fqdn: devc9reboot2.corp.abc.ca
manage_etc_hosts: true
users: {}
runcmd:

  - |
    echo "devc9reboot2" > /etc/hostname

    hostname devc9reboot2

    cat > /etc/hosts << EOF
    127.0.0.1 devc9reboot2.corp.abc.ca devc9reboot2 localhost localhost.localdomain
    ::1 ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    EOF

  - |

  - |

    echo "Updating system time"
    systemctl enable --now chronyd
    /usr/bin/chronyc -a makestep
    /usr/sbin/hwclock --systohc

  - |

  - |

    echo "################# SUBSCRIPTION MANAGER #######################"
    echo
    echo "Starting the subscription-manager registration process"

    # Disable online yum repos
    echo "Disabling all online yum repositories"
    yum-config-manager -y --disable '*' > /dev/null

    rpm -Uvh http://foremanvan01.corp.abc.ca/pub/katello-ca-consumer-latest.noarch.rpm

    # On rare occasions, the system has already attached a subscription, which causes
    # subsequent steps to fail.
    subscription-manager clean

    subscription-manager register --name="devc9reboot2.corp.abc.ca" --org='Systems' --activationkey='CentOS Stream 9 Server Production' --force || subscription-manager register --name="devc9reboot2.corp.abc.ca" --org='Systems' --activationkey='CentOS Stream 9 Server Production' --force

  - |

    user_exists=false
    getent passwd root >/dev/null 2>&1 && user_exists=true

    if $user_exists; then

    mkdir -p ~root/.ssh

    cat << EOF >> ~root/.ssh/authorized_keys
    ssh-rsa fake_one_AAAADAQABAAABgQDbZQepP3IFryQ5GHDCYYytoOEvUanHOkmkMBvlC6cnLOqGeXXeLI34S1+HVRwaLiKJrqiJmFdTwpnoBn4eEfqjWH26NY/SwEhFsMMewfewfwefTudAxESvXjkzXHmsKaYLrLnIWn2voK6zrdghX5kousCnAyIQEeJDAD9PiuasdasdasdasdFS1ZUR0DMG4OJCq6JN9HLA1+4Krkq0YofWB5MTGTZw/mBAL8tetQnLMoWiLTN0zmAUBspZKUBPfAA7Gn8ybRd6OoD2rjoRm/AzyXNxRui+LYnqrTsxwJOJLvwKjUTQJqaIDfk= foreman-proxy@smrtprxvan01.corp.abc.ca
    EOF

    chmod 0700 ~root/.ssh
    chmod 0600 ~root/.ssh/authorized_keys
    chown -R root: ~root/.ssh
    chown -R root: ~root

    # Restore SELinux context with restorecon, if it's available:

    command -v restorecon && restorecon -RvF ~root/.ssh || true

    else
    echo 'The remote_execution_ssh_user does not exist and remote_execution_create_user is not set to true. remote_execution_ssh_keys snippet will not install keys'
    fi

  - |

    # Install required packages

    yum install -y sssd realmd oddjob oddjob-mkhomedir adcli samba-common samba-common-tools krb5-workstation openldap-clients fping

    cat > /root/join_realm.sh << 'EOF'
    #!/bin/sh

    DONE=0
    COUNT=0

    if test -f /etc/krb5.keytab; then
    DONE=1
    fi

    while [ $DONE -eq 0 ] && [ $COUNT -le 30 ]; do
    fping -q corp.abc.ca || { sleep 2; COUNT=$(( $COUNT + 1 )); continue; }

    # Attempt to join the domain
    echo wan@abc | kinit jointhedomain@CORP.abc.ca
    realm leave corp.abc.ca
    sleep 10
    realm join corp.abc.ca

    if [ $? -eq 0 ]; then
    if ! grep -q 'case_sensitive = false' /etc/sssd/sssd.conf 2> /dev/null; then
    sed -i 's|\[domain/corp.abc.ca\]|[domain/corp.abc.ca]\ncase_sensitive = false|' /etc/sssd/sssd.conf 2> /dev/null
    fi
    fi

    systemctl enable --now sssd
    systemctl restart sssd
    DONE=1
    

    fi

    COUNT=$(( COUNT + 1 ))
    done

    rm -f /root/join_realm.sh

    EOF

    chown -v root:root /root/join_realm.sh
    chmod -v 0700 /root/join_realm.sh

    cat > /lib/systemd/system/realm-join.service << EOF
    [Unit]
    Description=Realm join

    [Install]
    WantedBy=multi-user.target

    [Service]
    ExecStart=/bin/bash /root/join_realm.sh
    Type=simple
    User=root
    Group=root
    WorkingDirectory=/root
    Restart=on-failure
    EOF

    systemctl daemon-reload
    systemctl enable --now realm-join.service

  - |
    if [ -f /usr/bin/dnf ]; then
    dnf -y install puppet-agent
    else
    yum -t -y install puppet-agent
    fi

    cat > /etc/puppetlabs/puppet/puppet.conf << 'EOF'
    [main]
    vardir = /var/lib/puppet
    logdir = /var/log/puppet
    rundir = /var/run/puppet
    ssldir = $vardir/ssl

    [agent]
    pluginsync = true
    report = true
    ca_server = pptmstrvan01.corp.abc.ca
    certname = devc9reboot2.corp.abc.ca
    server = pptmstrvan01.corp.abc.ca
    environment = production

    EOF

    puppet_unit=puppet
    /usr/bin/systemctl list-unit-files | grep -q puppetagent && puppet_unit=puppetagent
    /usr/bin/systemctl enable ${puppet_unit}

    # export a custom fact called 'is_installer' to allow detection of the installer environment in Puppet modules
    export FACTER_is_installer=true

    # passing a non-existent tag like "no_such_tag" to the puppet agent only initializes the node
    # You can select specific tag(s) with the "run-puppet-in-installer-tags" parameter
    # or set a full puppet run by setting "run-puppet-in-installer" = true
    echo "Performing initial puppet run for --tags no_such_tag"
    /opt/puppetlabs/bin/puppet agent --config /etc/puppetlabs/puppet/puppet.conf --onetime --tags no_such_tag --server pptmstrvan01.corp.abc.ca --no-daemonize
    /opt/puppetlabs/bin/puppet resource service puppet ensure=running

phone_home:
  url: http://smrtprxvan01.corp.abc.ca:8000/unattended/built
  post: []
  tries: 10

i’m not in front of my instance but maybe i can help anyway

the UserData Template is using vmware tools and vCenter guest customization to set up hostname/IP/etc.
it's rebooting after this step
After the vm boots again, it contacts foreman via the cloud-init datasource and applies the cloudInit Template

can you figure out when exactly the vm reboots?
it should only reboot after the UserData Template
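
to narrow that down, something like this on the affected VM should show when the reboot happened relative to the cloud-init run (standard log locations; the vmware-imc path assumes open-vm-tools guest customization logging, so it may differ on your image):

# when did the reboots/shutdowns happen?
last -x shutdown reboot
# how far did cloud-init get, and when did each stage run?
cloud-init analyze show
# cloud-init journal from the previous (interrupted) boot
journalctl -b -1 -u cloud-init -u cloud-final
# guest customization log written by VMware Tools
cat /var/log/vmware-imc/toolsDeployPkg.log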

BTW:
I’m not using cloud-init like you
i’m using ansible for the configuration steps

My cloudInit Template is very small and only generates SSH keys and expands the HDD

i had many troubles when setting this up too, but it's been working very reliably since then, except when a cloud-init update broke cloud-init :wink:

yep, the UserData Template uses vmware tools, and the VM does get the IP, DNS and hostname set up correctly.
The VM contacts Foreman, gets the cloud-init template, and applies it itself.
The reboot is triggered while the cloud-init steps' output is still on the screen (close to the end, after it has installed the puppet agent successfully, and then all of a sudden it reboots).
I assumed it would write puppet.conf as well, but that does not happen every time. It looks like if it reboots earlier, cloud-init does not have sufficient time to apply the config. And I have no idea why it reboots itself. Is it related to cloud-init? Or to Foreman? Or to the provisioning template?

maybe i ran into the same problem as you in the past and decided to do the configuration via Ansible, can't remember

as far as i know
the VM reboots because of the UserData vmware-tools Template

But the doc did not mention that the userdata template will trigger the reboot:

  • The user provisions one or more virtual machines using the Satellite web UI, API, or hammer
  • Satellite calls the VMware vCenter to clone the virtual machine template
  • Satellite userdata provisioning template adds customized identity information
  • When provisioning completes, the Cloud-init provisioning template instructs the virtual machine to call back to Capsule when cloud-init runs
  • VMware vCenter clones the template to the virtual machine
  • VMware vCenter applies customization for the virtual machine’s identity, including the host name, IP, and DNS
  • The virtual machine builds, cloud-init is invoked and calls back Satellite on port 80, which then redirects to 443
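
(For reference, that callback is just cloud-init's phone_home module hitting the 'built' endpoint; with the Capsule URL from my template above, a rough manual equivalent would be the following. Note that it marks the host as built in Foreman, so only use it against a throwaway host.)

# manually trigger the same 'built' callback that phone_home performs
curl -X POST http://smrtprxvan01.corp.abc.ca:8000/unattended/built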

i just tested it again and built a Host

The CloudInit Template is applied before the reboot, so there's the culprit
seems like i ran into the same issue as you
i had many issues setting this up also

i even created a new cloudInit Template with only the following

#cloud-config
hostname: <%= @host.name %>
fqdn: <%= @host %>
manage_etc_hosts: true
users: {}
# use runcmd to grow the VG, LV and root filesystem, as cloud-init
# doesn't handle LVM resizing natively
runcmd:
  - cloud-init-per once grow_part growpart /dev/sda 3
  - cloud-init-per once grow_lv lvextend -r -l +100%FREE /dev/mapper/vg_system-lv_root
  - ssh-keygen -A



phone_home:
  url: <%= foreman_url('built') %>
  post: []
  tries: 10

after that it contacts katello and applies all Ansible Roles which are attached to the Host Group (after a delay)
Ansible is doing the registration and configuration

take a look at this issue

seems like there's also a fix for it

Did you also encounter the ‘out of nowhere’ reboot issue?
I am not sure if the Foreman developers have noticed this issue and have a solution for it.
If it is a common issue, is it possible to extend the time delay before it actually reboots…
It is so frustrating, as it is the last part of my config in the cloud-init template: every snippet can be reused and all configuration management can be done via Puppet.

yes it seems so

and also this might help

https://knowledge.broadcom.com/external/article/311864/how-does-vsphere-guest-os-customization.html

you need to run vmware-toolbox-cmd config set deployPkg wait-cloudinit-timeout 300, for example

seems you need to do it in the template and deploy it again via Packer
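
in a kickstart %post section that could look roughly like this (the 300-second value is just an example, and the command needs a reasonably recent open-vm-tools; the setting ends up in /etc/vmware-tools/tools.conf under [deployPkg]):

%post
# give cloud-init more time to finish before VMware Tools triggers the
# guest-customization reboot (timeout in seconds)
vmware-toolbox-cmd config set deployPkg wait-cloudinit-timeout 300
%end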

Hi dodo,

Yep. I checked out all the links you provided, and this workaround works well for my situation.
So basically I just added vmware-toolbox-cmd config set deployPkg wait-cloudinit-timeout 90 to my Packer ks.cfg, and now everything works perfectly.

Thanks!
