Cloud-init not running properly

lravelo · July 11, 2023, 3:55pm

Problem: Until a couple of weeks ago I was able to provision VMware virtual machines without issue. The only changes since then were patching the Foreman server (without updating Foreman itself) and installing the foreman_webhooks plugin, though I don’t see how that could affect this. I would use cloud-init on Rocky Linux 8 images and it would work just fine.

Lately, however, something stopped working. I would provision a host. The userdata open-vm-tools provisioning template works fine. But when cloud-init runs on the host, it fails. The weird part is that if I run rm -rf /var/lib/cloud/instances and reboot the machine, cloud-init then runs properly, and I’m not sure why, since the config is the same.
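For reference, cloud-init keeps its once-per-instance bookkeeping under /var/lib/cloud/instances/<instance-id>/, so removing that directory makes the next boot look like a first boot. A minimal sketch of the same workaround as a function follows; the function name and the optional directory argument are made up for illustration and are not part of cloud-init. cloud-init also ships a cloud-init clean subcommand that does a broader cleanup.

```shell
#!/bin/bash
# Sketch: drop cloud-init's per-instance state so "once-per-instance" modules
# run again on the next boot. The directory argument is a hypothetical
# convenience for testing; it defaults to the standard location.
reset_cloud_instance_state() {
  local cloud_dir="${1:-/var/lib/cloud}"
  # "instance" is a symlink into "instances/<instance-id>"; remove both so no
  # dangling link is left behind.
  rm -rf "${cloud_dir}/instances"
  rm -f "${cloud_dir}/instance"
}
```

Calling reset_cloud_instance_state with no argument targets /var/lib/cloud; a reboot is still needed afterwards for cloud-init to run again.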

I upgraded to 3.6.1 to see if this helps resolve the issue but no dice.

Expected outcome: Expecting cloud-init to run properly and then phone back home to Foreman

Foreman and Proxy versions: was tried on 3.5.3 as well as 3.6.1

Foreman and Proxy plugin versions:

Distribution and version: The Foreman server and the hosts I’m provisioning all run Rocky Linux 8.8.

Other relevant data:
Here’s the relevant portion of the cloud-init.log file

2023-07-11 15:28:41,627 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
2023-07-11 15:28:58,306 - util.py[DEBUG]: Cloud-init 22.1-8.el8.0.2 received SIGTERM, exiting...
  Filename: /usr/lib64/python3.6/subprocess.py
  Function: _try_wait
  Line number: 1424
    Filename: /usr/lib64/python3.6/subprocess.py
    Function: wait
    Line number: 1477
      Filename: /usr/lib64/python3.6/subprocess.py
      Function: communicate
      Line number: 855
2023-07-11 15:28:58,306 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2023-07-11 15:28:58,306 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
2023-07-11 15:28:58,306 - util.py[DEBUG]: Read 12 bytes from /proc/uptime
2023-07-11 15:28:58,306 - util.py[DEBUG]: cloud-init mode 'modules' took 16.734 seconds (16.74)
2023-07-11 15:28:58,306 - handlers.py[DEBUG]: finish: modules-final: FAIL: running modules for final
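In this excerpt cloud-init receives SIGTERM about 17 seconds after starting the runcmd script, which suggests the process was killed from outside rather than a runcmd command returning an error. A small helper to pull those lines out of a log, as a sketch only: the function name is hypothetical, and the default path assumes the usual /var/log/cloud-init.log location.

```shell
#!/bin/bash
# Sketch: print the lines where cloud-init reports a failed stage/module or a
# received signal. Pass a path as $1 to scan a copied log instead.
cloud_init_failures() {
  local log="${1:-/var/log/cloud-init.log}"
  # "finish: <stage>: FAIL: ..." lines come from handlers.py, and the
  # "received SIG..." line from util.py, as in the excerpt above.
  grep -E 'finish: .*: FAIL:|received SIG[A-Z]+' "$log"
}
```

Running cloud_init_failures on the affected host lists the SIGTERM line together with the modules marked FAIL, which makes it easier to compare a failing boot with a working one.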

lravelo · July 13, 2023, 4:28pm

So after further troubleshooting, the cause seems to be related to the fact that, for a reason I’ve yet to determine, the particular host group I’m trying to provision the new VM into is not using the UserData open-vm-tools template but rather a Kickstart finishing template.

lravelo · July 13, 2023, 6:18pm

So it seems to be something to do with my transition to using Composite Content Views. I tried to create a new host group and when selecting the synced content repo that I have configured in other groups I will get the following message:

The selected kickstart repository is not part of the assigned content view, lifecycle environment, content source, operating system, and architecture

I tried creating a new hierarchy of nested host groups nearly identical to my production ones and I’m noticing weird inheritance issues. Here’s what I’m trying:

1. Create a test parent group called “test”. The only things configured here are an x86_64 architecture and the assignment of all my locations to it.
2. Create a nested group called “loc01”. Here I set Library as the lifecycle environment as well as the content view (a composite content view), the content source (a smart proxy at the location in question), deployment on my VMware cluster in said location, and a baseline. For now I set a domain, an OS of Rocky 8.8, All Media for Media Selection, and select my Rocky Linux media. The first weird thing I notice is that in Locations, one of them is greyed out, saying that it is used by a host, despite there being no host assigned to this group. I save this and move on.
3. Inside “loc01” I nest another group called “dev”. The only changes I make here are to the IPv4 subnet and the activation key, and I try to select “all media” under OS. When I go to save, it errors out with the message above and reverts to “synced content” with the Rocky Linux Base OS repo selected (that repo sits in a content view inside the composite content view I’m using). The only way I can get it to stick is to reselect “all media” and then hard code the Media value to the same value as the one inherited.

Despite all of this, trying to create a host still chooses the Kickstart finishing template instead of the userdata template.

I even do the same test as before but using my previous content view and the same thing happens. Not sure where else I could look to try to resolve this.

lravelo · July 13, 2023, 8:19pm

OK this is definitely related to the bug mentioned in this post Foreman 3.6.1 / Katello 3.8.0 UI Hostgroup CV "Inherit Parent" Bug

If I don’t use any inheritance within the host group itself and then try to build a new host, the userdata template resolves correctly.

lravelo · July 13, 2023, 8:46pm

Also seems to exist in 3.7.0 which I just upgraded to. If I hard code all content and OS related values within the host group and go build a host against it, the UserData template is the one that resolves.
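A quick way to tell which kind of template was rendered for a host is to look at the first line of the output, since cloud-init only treats user data as cloud-config when it begins with #cloud-config. A minimal sketch, assuming the rendered template has been saved to a file; the function name and the three labels are invented for illustration:

```shell
#!/bin/bash
# Sketch: classify a saved copy of a rendered template by its first line.
# "#cloud-config" marks cloud-config user data; "#!" marks a plain script,
# which is what a shell-based finish template would typically look like.
rendered_template_kind() {
  local first_line
  # read fails on a file without a trailing newline but still fills the variable.
  IFS= read -r first_line < "$1" || true
  case "$first_line" in
    '#cloud-config'*) echo "cloud-config" ;;
    '#!'*)            echo "script" ;;
    *)                echo "unknown" ;;
  esac
}
```

If this prints anything other than cloud-config for a host that should be getting the UserData template, the host group is resolving a different template, as described above.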

Franz · September 14, 2023, 11:51am

Same problem with Red Hat Satellite 6.13!
I only run 4 commands via runcmd.
The problem occurs when the third command runs:

subscription-manager attach --auto

It’s a Python script.
To me this looks like a low-level Python bug: it doesn’t use a proper async I/O backend, just “giddy” tries.
It would be very easy to use Perl with a proper async I/O loop (eventfds) along with Promises/A+ and wait for all children to come back.
Solutions?

Franz · September 14, 2023, 3:19pm

My solution is quite “quirky”.
I really don’t get why there’s an ordered YAML file that is completely ignored, because as far as I can tell all scripts in the final stage are started concurrently.
There must be a major bug in the Python subprocess library, because you CAN control the workflow of async execve calls by using Promises/A+!
My solution for subscription-manager is split into two sections:

Provisioning (vmware-cloud-init)

#cloud-config
hostname: <%= @host.shortname %>
fqdn: <%= @host %>
manage_etc_hosts: true
users: {}
runcmd:
- [ sh, -c, echo "========= GENERATE RHSM UUID FACTS =========" ]
- [ sh, -c, "~/gen-rhsm-uuid.sh" ]
- [ sh, -c, echo "========= TOUCH CLOUD INIT (RUN ONCE) =========" ]
- [ sh, -c, "touch ~/cloud-init" ]
- [ sh, -c, echo "========= SUBSCRIPTION MANAGER REGISTER =========" ]
- [ sh, -cx, "nohup /usr/bin/subscription-manager register --name <%= @host.name %> --org ORGANIZATION --activationkey <%= host_param('kt_activation_keys') %> >/dev/null 2>&1" ]

phone_home:
  url: <%= foreman_url('built') %>
  post: []
  tries: 10

The auto attach script which will be called by the final module

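#!/bin/bash

# Run only until the marker file exists, i.e. until one full pass has completed.
if [ ! -f ~/.cloud-final ]; then
  # An exit status of 1 from "identity" is treated as "not registered yet":
  # do nothing now and try again on the next boot.
  /sbin/subscription-manager identity >/dev/null 2>&1
  if [[ $? == 1 ]]; then
    exit 0
  else
    /sbin/subscription-manager attach --auto
    /bin/insights-client --register --silent
    touch ~/.cloud-final
  fi
fi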

On the first run the subscription is configured.
After the reboot, the attachment succeeds, and so does the insights-client registration.
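The marker-file guard used above (~/.cloud-final) can be generalised so that each step gets its own marker and is retried on every boot until it succeeds. This is only a sketch of that pattern, not part of the tested setup; run_once and the marker paths in the usage note are hypothetical names.

```shell
#!/bin/bash
# Sketch: run a command only if its marker file does not exist yet, and
# create the marker only when the command succeeds.
run_once() {
  local marker="$1"
  shift
  # Already done on an earlier boot: nothing to do.
  [ -f "$marker" ] && return 0
  # Propagate the command's exit status and leave no marker on failure,
  # so the step is attempted again on the next boot.
  "$@" || return $?
  touch "$marker"
}
```

A per-boot script could then call, for example, run_once ~/.attach-done /sbin/subscription-manager attach --auto followed by run_once ~/.insights-done /bin/insights-client --register --silent, so a step that was interrupted is simply repeated after the next reboot.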

I’m not entirely sure why all modules must be written in Python when there are better-fitting programming languages for this kind of low-level complexity…

Best
Franz

nixfu · September 14, 2023, 5:50pm

VMware vRA + cloud-init has caused us no end of problems.

The more I have dug into cloud-init, the more I think cloud-init is just a poorly implemented technology.

Franz · September 15, 2023, 6:22am

The VMware Guest customization is the “old” Perl release.
Beginning with version 7.x, there’s a basic customization wizard (run script after boot) which would be sufficient for me.
I don’t know if the REST API is capable of supplying script snippets.
Will check that later.
For now, I’ll live with my workaround…

lravelo · September 15, 2023, 4:30pm

Hi @Franz, this looks interesting. Would you be able to share your cloud.cfg and other relevant cloud-init files that help make this work? I’m interested in giving this a shot as I still have this issue, which is constantly giving me headaches since I have to go back and manually create users, add keys, etc.

Franz · September 16, 2023, 7:47am

Hi,
to get started, I used this blog entry:
Vmware Provisioning Template Redhat Satellite
I changed the cloud.cfg to create SSH host keys.
There are various templating guides out there; I used the Red Hat Satellite guide (beginning at page 96):
Adding_VMware_Images_to_Server_provisioning
My cloud.cfg adaptation:

cloud_init_modules:
 - ssh
 - bootcmd

cloud_config_modules:
 - runcmd

cloud_final_modules:
 - scripts-per-once
 - scripts-per-boot
 - scripts-per-instance
 - scripts-user
 - phone-home

system_info:
  distro: rhel
  paths:
    cloud_dir: /var/lib/cloud
    templates_dir: /etc/cloud/templates
  ssh_svcname: sshd

ssh_genkeytypes: ['rsa', 'dsa', 'ecdsa', 'ed25519']

Per-boot script (final version, tested):

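#!/bin/bash

# Run only until the marker file exists, i.e. until one full pass has completed.
if [ ! -f ~/.cloud-final ]; then
  # An exit status of 1 from any step means "not ready yet": stop here and
  # let the next boot try again.
  /sbin/subscription-manager identity >/dev/null 2>&1
  if [[ $? == 1 ]]; then
    exit 0
  else
    /sbin/subscription-manager attach --auto >/dev/null 2>&1
    if [[ $? == 1 ]]; then
      exit 0
    fi
    /bin/insights-client --register --silent >/dev/null 2>&1
    if [[ $? == 1 ]]; then
      exit 0
    fi
    # Everything succeeded: switch cloud-init off for good and clear failed units.
    /bin/systemctl disable --now cloud-config.service cloud-final.service cloud-init-local.service cloud-init.service >/dev/null 2>&1
    /bin/systemctl reset-failed >/dev/null 2>&1
    touch ~/.cloud-final
    # Empty the audit logs written during provisioning, then reboot once more.
    rm -f /var/log/audit/audit.log.*
    echo > /var/log/audit/audit.log
    /sbin/reboot
  fi
fi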

Cloud-init ERB:

#cloud-config
hostname: <%= @host.shortname %>
fqdn: <%= @host %>
manage_etc_hosts: true
users: {}
runcmd:
- [ sh, -c, echo "========= GENERATE RHSM UUID FACTS =========" ]
- [ sh, -c, "~/gen-rhsm-uuid.sh" ]
- [ sh, -c, echo "========= TOUCH CLOUD INIT (RUN ONCE) =========" ]
- [ sh, -c, "touch ~/cloud-init" ]
- [ sh, -c, echo "========= SUBSCRIPTION MANAGER REGISTER =========" ]
- [ sh, -cx, "nohup /usr/bin/subscription-manager register --name <%= @host.name %> --org ORG --activationkey <%= host_param('kt_activation_keys') %> >/dev/null 2>&1" ]

phone_home:
  url: <%= foreman_url('built') %>
  post: []
  tries: 10

Of course it’s 100% Foreman.
The problem is that subscription-manager itself is a Python script which uses pipes for subprocesses.
When it reaches subscription-manager attach, the cloud-init subprocess exits with os.err and the VM reboots.
Once the VM is back online, the final stage is retried; the subscription-manager register call already succeeded in the first run, so the script is executed.
Users etc. can be defined in cloud.cfg (see the cloud-init reference for details), but I ended up using my shell-script wrapper.

Best Franz

pjbarbero · September 30, 2024, 1:16pm

Hi there guys,

I guess I’ve finally found what is, at least for me, the real cause of this behavior. It’s properly described at the link below. I hope it helps.

Regards.