Performance Tuning Foreman / Puppet Master Home Lab

ikonia · August 4, 2023, 11:27am

Problem:
Performance tuning options for Foreman in a home lab environment, where the approach differs from production performance tuning

Expected outcome:
optimised throughput and resource utilisation for small lab deployments

Foreman and Proxy versions:
Foreman 3.7
Foreman-Proxy 3.7

Foreman and Proxy plugin versions:
Ansible 3.5.5
DHCP 3.7.0
DNS 3.7.0
Dynflow 0.9.0
Openscap 0.9.2 (not in use)
Script 0.10.1
TFTP 3.7.0
libvirt 3.7.0
puppet-6.0.0

Distribution and version:
CentOS 8-Stream x86_64
Puppet 7 (7.12)

Other relevant data:
I’m running Foreman 3.7 with Puppet 7 in a home lab environment to test some automation tasks. I’m using the libvirt plugin to provision virtual machines on a KVM/libvirt target. I’m running between 15 and 40 guests at one time (never more) on the KVM host, with a puppet run on the guests every 5 minutes (for quick changes).

The Foreman server is an Intel NUC with an Intel i5 5200 with 4 cores, 16GB RAM and an NVMe disk, so low spec, but it has always been a solid home lab device for doing Foreman development work / learning.

The Foreman tuning guide (Tuning performance of Foreman) and the RHEL Satellite tuning guides are fantastic for production / enterprise deployments and provide some great improvements and optimisations. However, I think there are different options for tuning a home lab type setup, rather than just scaling down the numbers meant for large deployments.

I’ve also noticed that when running 30-40 guests at once with a 5 minute check-in time, the guests occasionally go out of sync with the puppet master. I’ve seen this in the logs:

Aug 4 09:46:10 ezri puppet-agent[48056]: Connection to https://lab.no-dns.co.uk:8140/puppet/v3 failed, trying next route: Request to https://lab.no-dns.co.uk:8140/puppet/v3 failed after 0.004 seconds: Failed to open TCP connection to lab.no-dns.co.uk:8140 (Connection refused - connect(2) for “jarvis.no-dns.co.uk” port 8140)
Aug 4 09:46:10 ezri puppet-agent[48056]: Wrapped exception:
Aug 4 09:46:10 ezri puppet-agent[48056]: Failed to open TCP connection to lab.no-dns.co.uk:8140 (Connection refused - connect(2) for “lab.no-dns.co.uk” port 8140)

The check-ins are fine, as 90% of the time all hosts are in sync, and if I manually run puppet on the guest it checks in and completes fine. I believe the Foreman server can’t process the volume of simultaneous requests when 30-40 hosts overlap every 5 minutes.
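As a rough back-of-envelope check on that theory, the average request rate can be worked out from the numbers above. This is only a sketch: the 5 second average run time and the 4 concurrent workers are assumptions for illustration, not measured values, and `offered_load` is a hypothetical helper name.

```python
def offered_load(hosts, interval_s, avg_run_s, workers):
    """Return (check-ins per second, fraction of worker capacity in use)."""
    arrivals_per_s = hosts / interval_s
    # Average number of runs in flight (arrival rate x run time),
    # divided by the number of runs the server can handle at once.
    utilisation = arrivals_per_s * avg_run_s / workers
    return arrivals_per_s, utilisation


# avg_run_s and workers are assumed values, not measurements from this lab.
for hosts in (15, 30, 40):
    rate, util = offered_load(hosts, interval_s=300, avg_run_s=5.0, workers=4)
    print(f"{hosts} hosts: one check-in every {1 / rate:.1f}s on average, "
          f"about {util:.0%} of worker capacity")
```

This only describes the average case. If the agents were all started at roughly the same moment, their check-ins can bunch together, so short bursts could still exceed capacity even when the average looks comfortable.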

I’m trying to look at the best ways to optimise my home lab (beyond putting it onto bigger tin which I don’t think is really needed to fix this)

I’d like to look at optimisation opinions for

a.) web gui snappy response, it’s pretty good as is, and there is a little lag caused by the libvirt plugin that has to check in with the libvirt host to pull back the data, but optimising it for human interaction would be good.
b.) web services capacity / performance for the proxies, especially around the smart proxy interface to the puppet master, I believe this is what’s causing some puppet runs to fail
c.) the actual puppet master - Foreman installed and configured a Puppet 7 puppet master as part of the install, using the default formulas in the installer to set it up. I think I can get more performance out of this to optimise the puppet runs

Because the lab is so small, my monitoring suggests the database is not causing any issues: it’s on the same host, uses close to no resources, the database is tiny, and query profiling shows great response times.

I could probably benefit from some tweaks to Puma, but because it’s such a small host with such a small environment, I’m not sure how best to size it.

While the machine is small and the environment is small, resource utilisation is interesting with approx 4.5GB of ram in use (excluding disk caching) and 8-15% cpu spiking for the puppet master java process.
disk IO is close to non-existent

the puppet master startup options are pretty default and small resource wise

/usr/bin/java -Xms2G -Xmx2G -Dcom.redhat.fips=false -Djruby.logger.class=com.puppetlabs.jruby_utils.jruby.Slf4jLogger -XX:ReservedCodeCacheSize=512m -XX:OnOutOfMemoryError=kill -9 %p -XX:ErrorFile=/var/log/puppetlabs/puppetserver/puppetserver_err_pid%p.log -cp /opt/puppetlabs/server/apps/puppetserver/puppet-server-release.jar:/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/facter.jar:/opt/puppetlabs/server/data/puppetserver/jars/* clojure.main -m puppetlabs.trapperkeeper.main --config /etc/puppetlabs/puppetserver/conf.d --bootstrap-config /etc/puppetlabs/puppetserver/services.d/,/opt/puppetlabs/server/apps/puppetserver/config/services.d/ --restart-file /opt/puppetlabs/server/data/puppetserver/restartcounter

I’m really keen to see if I can push this hardware further with a more optimised config. My big nodes are great, but I’m struggling to get a good balance on a small home lab.

Dirk · August 4, 2023, 6:01pm

Did you enable tuned and the throughput-performance profile? I think it should also provide better performance than balanced, even in a home lab.

It could also be more a case of tuning the Puppetserver, as Java tends to require a good amount of memory initially before it scales well: Tuning guide

ikonia · August 4, 2023, 7:21pm

I did set the tuned profile as suggested in the doc.

I’ve been doing a little more research since I posted this (home lab has just sat there default for years, first time I’ve taken a look at it seriously)

I think my two areas to optimise are

a.) RAM - I’ve got a load of RAM that’s just not being used, and I’m sure I can do more with it across all the processes Foreman uses
b.) CPU on the puppet run - the CPU appears to be holding processes open and causing waits on resources.

When I’ve been looking at what’s going on with the puppet server, it looks like the puppet run of a client holds open a chunk of CPU resource while the run is happening, which stops or queues other puppet runs and creates a backlog/bottleneck.

I’ve been monitoring the pid for the puppet master, and all is fine, but eventually it builds up to a point where the cpu is high usage, but low load, and the process is in a wait state.


strace: Process 18294 attached
futex(0x7fae1a0b89d0, FUTEX_WAIT, 18297, NULL

I believe this is in a wait state for a client executing a puppet run, as the clients are checking in every 5 minutes and I’m running at the moment 34 clients, it looks to me like the puppet master can only sustain 4 puppet runs at once (allocating 1 per core).
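For reference, the number of catalogs Puppet Server compiles at once is governed by the `max-active-instances` setting under `jruby-puppet`. As far as I understand the Puppet Server documentation, the default is the number of CPUs minus one, clamped between 1 and 4. Treat that formula as an assumption to verify against the docs for your version. A minimal sketch of the calculation (`default_max_active_instances` is a hypothetical helper, not part of any Puppet tooling):

```python
import os


def default_max_active_instances(num_cpus):
    """Assumed default: num_cpus - 1, never below 1 and never above 4."""
    return max(1, min(4, num_cpus - 1))


# The 4-core NUC described in this thread.
print(default_max_active_instances(4))
# Whatever machine this script happens to run on.
print(default_max_active_instances(os.cpu_count() or 1))
```

If that formula holds, a 4-core host would get 3 JRuby instances by default rather than 4, so it is worth comparing the observed batch size against the value actually configured on the server.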

If I restart the puppet server to get a clean environment to monitor, I see the puppet logs show batches of 4 puppet runs, with large waits between them:

2023-08-04T19:58:14.423+01:00 INFO  [qtp496451819-53] [puppetserver] Puppet Compiled catalog for anton.no-dns.co.uk in environment production in 7.70 seconds
2023-08-04T19:58:50.971+01:00 INFO  [qtp496451819-47] [puppetserver] Puppet Compiled catalog for wopr.no-dns.co.uk in environment production in 3.31 seconds
2023-08-04T19:58:57.923+01:00 INFO  [qtp496451819-47] [puppetserver] Puppet Compiled catalog for trip.no-dns.co.uk in environment production in 5.95 seconds
2023-08-04T19:59:05.112+01:00 INFO  [qtp496451819-53] [puppetserver] Puppet Compiled catalog for mother.no-dns.co.uk in environment production in 2.30 seconds

and

2023-08-04T20:00:17.343+01:00 INFO  [qtp496451819-51] [puppetserver] Puppet Compiled catalog for tpol.no-dns.co.uk in environment production in 11.34 seconds
2023-08-04T20:00:25.349+01:00 INFO  [qtp496451819-50] [puppetserver] Puppet Compiled catalog for router.no-dns.co.uk in environment production in 4.11 seconds
2023-08-04T20:00:32.527+01:00 INFO  [qtp496451819-47] [puppetserver] Puppet Compiled catalog for dukat.no-dns.co.uk in environment production in 5.57 seconds
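To put numbers on those waits, the log lines can be parsed and the idle time between one compile finishing and the next one starting worked out. This is a sketch under one assumption: that the timestamp on each line is the moment the compile finished, so the start time is the timestamp minus the reported duration. The names `LINE_RE`, `parse_compiles` and `gaps_between` are hypothetical, and the sample lines are copied from the first batch above.

```python
import re
from datetime import datetime, timedelta

# Matches the "Compiled catalog" lines in the format shown above.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+INFO\s+\[(?P<thread>[^\]]+)\]\s+\[puppetserver\]\s+"
    r"Puppet Compiled catalog for (?P<host>\S+) in environment (?P<env>\S+) "
    r"in (?P<secs>[\d.]+) seconds"
)


def parse_compiles(lines):
    """Return (finished_at, thread, host, seconds) for each matching line."""
    compiles = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            finished = datetime.fromisoformat(m["ts"])
            compiles.append((finished, m["thread"], m["host"], float(m["secs"])))
    return compiles


def gaps_between(compiles):
    """Seconds from one compile finishing to the next one starting.

    A negative value means the two compiles overlapped.
    """
    ordered = sorted(compiles, key=lambda c: c[0])
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        cur_start = cur[0] - timedelta(seconds=cur[3])
        gaps.append((cur[2], (cur_start - prev[0]).total_seconds()))
    return gaps


sample = """\
2023-08-04T19:58:14.423+01:00 INFO  [qtp496451819-53] [puppetserver] Puppet Compiled catalog for anton.no-dns.co.uk in environment production in 7.70 seconds
2023-08-04T19:58:50.971+01:00 INFO  [qtp496451819-47] [puppetserver] Puppet Compiled catalog for wopr.no-dns.co.uk in environment production in 3.31 seconds
2023-08-04T19:58:57.923+01:00 INFO  [qtp496451819-47] [puppetserver] Puppet Compiled catalog for trip.no-dns.co.uk in environment production in 5.95 seconds
2023-08-04T19:59:05.112+01:00 INFO  [qtp496451819-53] [puppetserver] Puppet Compiled catalog for mother.no-dns.co.uk in environment production in 2.30 seconds
""".splitlines()

compiles = parse_compiles(sample)
for host, gap in gaps_between(compiles):
    print(f"{host}: {gap:+.3f}s since the previous compile finished")
```

Running this over a full log rather than four lines should show whether compiles really run in parallel batches (negative gaps) or one after another with idle time in between (positive gaps).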

I think I’ve approached this in a pretty sloppy way: I didn’t benchmark my lab beforehand, I upgraded to 3.7 without checking what impact that made, and I just applied the tuned profile (because there didn’t seem to be a problem and it seemed a good thing to do).

I probably need to reset any parameters back to default, or do a rebuild of this Foreman server, to get a baseline. I could fix forward / optimise forward, but I’ve never noticed that CPU bottleneck before, and I don’t know if it was there before 3.7 or if the tuned profile has impacted it.