Virtual machines randomly changing status between Out of sync/Error and Active in Foreman

Problem:
We have onboarded more than 2000 VMs in Foreman and the run interval is set to the default of 30 minutes. We are seeing a strange issue where VMs go to Out of sync/Error status at random. A VM marked as Error will be back to Active status during the next Puppet run, and even a manual puppet agent -t run brings a VM that was previously in Error/Out of sync status back to Active. Consistency seems very low, with the Active VM percentage hovering between 60 and 65%.

For the VMs showing as Error in the Foreman console, I’m finding the error message below in the Foreman reports, even though the same VM will be back to an Active state during the next run:
Error: Connection to https://foreman.xxx:8140/puppet/v3 failed, trying next route: Request to https://foreman…xxx:8140/puppet/v3 failed after 54.465 seconds: SSL_connect SYSCALL returned=5 errno=0 state=SSLv3/TLS write client hello

Is there any cap on the total number of VMs that can be onboarded to Puppet? I’m pretty sure that compliance was above 90% when the VM count was in the 1500s.

Foreman version - 2.5
Puppet version - 7.2

Also, I noticed that Puppet is not pulling the code from the repository, resulting in the error below. This was working fine previously.
Importing Environments/Classes into Foreman. This may take a few minutes.
ERF12-2749 [ProxyAPI::ProxyException]: Unable to get environments from Puppet ([RestClient::NotAcceptable]: 406 Not Acceptable) for proxy https://foreman.xxx::8443/puppet

Any help will be highly appreciated.

Looks like either a network issue that sometimes prevents the clients from reporting to Foreman, or a load issue that prevents Foreman from responding in time.
Is the server under heavy load? Do you see requests in the Foreman production.log that are taking a long time to respond?

Definitely seems load related. What are your Foreman server specs and Puppet master specs? Any load issues? Does the Puppet API load when hosts are returning those errors?

The server is a 32 CPU, 128 GB machine. We have noticed delays when running puppetserver ca commands as well. The behaviour of the clients is random: in a single refresh two thirds of the machines will report fine, but one third will fail. On the subsequent run, it will be a different set of VMs that goes through.

There are only two environments for classes, and this issue surfaced when we crossed the 2000-machine mark.

By the way, I work with Brijesh.

Please note that the Puppet Server and Foreman are running on the same server.

One way to scale is to move puppetserver to one or more separate servers, each with a smart proxy. However, Foreman itself should be fine handling 2k machines; we know of setups with 10x that number or more. production.log should provide further information about which calls are taking too long and may give us a hint at what can be done.
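For example, a quick way to see whether any requests are unusually slow (assuming the default log location of /var/log/foreman/production.log - adjust the path if yours differs) is something like:

    # Extract response times (ms) from completed-request lines and show the 20 slowest
    grep -oE 'Completed [0-9]{3} .* in [0-9]+ms' /var/log/foreman/production.log \
      | grep -oE '[0-9]+ms$' | tr -d 'ms' | sort -n | tail -20

You can then grep around the slow entries in the log to see which endpoints they belong to.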
One quick thing to check is the size of the fact_values and fact_names tables - if you have some uncommon random facts (e.g. virtual interfaces or storage) they can grow quite large and slow down the whole server. There are settings to ignore certain patterns that cover the common cases, but it’s possible you have some such facts that we missed. With only Puppet facts and 2k hosts, fact_names should be on the order of several hundred rows and fact_values on the order of 1-2M rows at most; if the numbers are much higher than that, we could try investigating why.
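To get the row counts, something along these lines should work (assuming the default PostgreSQL database name of foreman):

    # Row counts for the fact tables; "foreman" is the default database name, adjust if needed
    sudo -u postgres psql foreman -c 'SELECT count(*) FROM fact_names;'
    sudo -u postgres psql foreman -c 'SELECT count(*) FROM fact_values;'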
Another thing to check is the report expiry policy and the number of saved reports. By default we have a cron job that cleans up reports older than a week; perhaps tuning that to a shorter retention could also speed things up.
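If you want to trim reports manually with a shorter retention, something like this should do it (I believe the days parameter defaults to 7, matching the weekly cleanup):

    # Expire stored reports older than 3 days; the cron job runs the same task with the default retention
    foreman-rake reports:expire days=3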

Also, are you running Puppet on all machines at the same time? That could also impact performance; the Puppet runs should be randomly scattered across the 30 minutes between subsequent runs. You can also consider increasing the time between runs from the default 30 minutes to 1 or 2 hours, which will give you an immediate 2x or 4x reduction in load.
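For example, on the agents something like this would spread the runs out and lengthen the interval (the values here are just an illustration):

    # Randomly delay each run by up to 30 minutes and raise the run interval to 1 hour
    puppet config set splay true --section agent
    puppet config set splaylimit 30m --section agent
    puppet config set runinterval 1h --section agent
    # Restart the agent service to pick up the new settings (service name may differ per OS)
    systemctl restart puppet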

Thanks @tbrisker for your valuable inputs. I had a look at fact_values and fact_names and found the figures to be 1.1 million rows and 194,321 rows respectively. Do you think the fact_names count is on the higher side? Does it look like the root cause of this problem?

There is a strange issue on this machine (which Brijesh is also working on) where I see the following problems:

  1. If I run ‘puppetserver ca list --certname …’ - it will take about a minute or more to look up the cert, or it will occasionally fail with the following output:
    Fatal error when running action ‘list’
    Error: Failed connecting to https://xxxxxx:8140/puppet-ca/v1/certificate_statuses/any_key
    Root cause: Net::OpenTimeout

  2. On a client machine, if I run ‘puppet agent -t --debug’, each time it connects back to the Puppet server it delays for well over a minute while establishing the connection, and it will intermittently fail with the SSL error Brijesh reported initially.
    (i.e. ‘Creating new connection for https://xxxxxxx:8140’ - this step hangs for a minute or more)

fact_values size is within reason, but fact_names is certainly way too large. I’m not sure about the root cause yet, but cleaning up that table could certainly help performance.
As a first step, you can try running foreman-rake facts:clean to trim any unused fact names. If that doesn’t lead to a significant reduction, the next step would be to investigate which semi-random fact names are filling up the table and try to add them to the excluded_facts setting.
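To see what is filling up the table, a query along these lines might help (grouping on the text before the first digit is only a rough heuristic, and foreman is the default database name):

    # Show the 20 most common fact name prefixes to spot semi-random facts
    sudo -u postgres psql foreman -c \
      "SELECT regexp_replace(name, '[0-9].*$', '') AS prefix, count(*) AS cnt
       FROM fact_names GROUP BY prefix ORDER BY cnt DESC LIMIT 20;"

Whatever pattern dominates there is a good candidate for the excluded_facts setting.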

Does the Puppet API load for other queries during this time?

Still seems like a “thundering herd” scenario. I can’t fathom one Puppet master serving 2000 hosts.

I have 18,000 but use 27 Puppet masters behind a load balancer to keep things sane and functioning.

Specifically, I’d track num_free_jrubies and other similar metrics. My guess is that queuing of some kind is causing the HTTP server to tip over, which is why you then get timeouts even when loading the Puppet CA endpoint. Thundering herd “might” explain it, but so could a lot of other things. Are the runs for your hosts distributed evenly over time?
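If you don’t have metrics collection set up, you can still peek at the JRuby pool through Puppet Server’s status API, something like this (the hostname is a placeholder, and depending on your auth setup you may need to pass a client certificate to curl):

    # num-free-jrubies and friends show up under jruby-metrics when level=debug is requested
    curl -sk "https://puppet.example.com:8140/status/v1/services?level=debug" \
      | python3 -m json.tool | grep -iA2 'free-jrubies'

If num-free-jrubies sits at 0 most of the time, agents are queueing behind the JRuby pool, which would line up with the timeouts you’re seeing.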
