Random Build Loop Issue

tlauwk · November 28, 2018, 12:23am

Hi,

I’m experiencing a random issue where the ESXi host (UEFI) does not trigger the before_provision hook script that sets the PXE Loader to None. This causes the host to reboot back into mboot.efi and install ESXi again.

The kickstart script has the following …

%post --interpreter=busybox
wget -O /dev/null <%= foreman_url('built') %>
echo "Done with Foreman call"

… which if I understand correctly, tells Foreman the host has been built, triggering the before_provision hook script to change the PXE Loader to None …

if [[ $system_operatingsystem_name == ESXi* ]] && [[ $system_pxe_loader == "mboot UEFI" ]]; then
    echo "ESXi in UEFI mode detected. Changing PXE Loader to None"
    hammer -u admin -p password host update --id $system_id --pxe-loader 'None'
fi

… and this process works - most of the time. But I get the odd occasion where it doesn’t and the host enters a “build loop”.

It’s random and occurs maybe 5-10% of the time. Any ideas on how to troubleshoot this would be much appreciated.

Thanks.

lzap · November 29, 2018, 8:36am

Using hooks to do DB updates is a terrible idea: you have a race condition. Don’t do that.

tlauwk · November 29, 2018, 2:29pm

Forgive the noobie question but I’m not sure what you mean by …

To get the host to boot into the esxi installation on local disk, the before_provision hook script is required as indicated by the doco (3.7):

https://theforeman.org/2018/08/deploying-esxi-through-foreman.html

Unless I’m misunderstanding how it all ties together.

Thanks.

lzap · December 3, 2018, 9:03am

Well, this one is still a hack.

You need to find out why the call failed: go into the logs and check which HTTP status it returned, and maybe enable debug-level logging to see more. This can also be a network issue (timeout), so try increasing the timeout for curl/wget and report an error if wget returns non-zero (maybe retry up to 10 times).
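A minimal sketch of that retry idea in plain POSIX shell. The `retry` helper, the 5-second delay, and the 30-second `-T` timeout are illustrative assumptions, not part of the stock kickstart template, and whether the wget in the ESXi busybox environment accepts `-T` should be checked first.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between failed attempts. Returns 0 on the first success, 1 if every
# attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Report each failure on stderr so it shows up in the install log.
        echo "attempt $n of $attempts failed: $*" >&2
        n=$((n + 1))
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Possible use in the %post section (assumed values: 10 tries, 5 s apart,
# 30 s network timeout per try):
# retry 10 5 wget -T 30 -O /dev/null "<%= foreman_url('built') %>" \
#     || echo "Foreman 'built' call failed after 10 attempts"
```

With this in place, a failed call leaves a message in the installer output instead of passing silently, which should make the 5-10% case easier to spot.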

One thing: we default to a single Passenger instance. If your server has only one Passenger worker processing the request (and thus this hook) and you call hammer, the app server will start launching another worker, which takes time (20-80 seconds). Your call may time out, so I recommend setting the minimum number of Passenger instances to a reasonable value so that a spare worker is always available to process your request. Configure this via the --foreman-passenger-min-instances option; I recommend at least two, but a busy server might need more. Keep in mind that each worker can use up to 1 GB of memory.