As you may have seen, there have been a few discussions around the memory usage of Foreman (and related parts of the stack).
While we think we have fixed the source of the particular increased memory usage in 2.5/3.0, we also found that our current tuning defaults for deployments aren’t as well defined as they could be. This is mostly because, while we have ways to do calculations of the form “if you have 4 CPUs and 32GB of memory, you shouldn’t run more than 6 Puma workers”, these calculations aren’t based on recent numbers and platform changes and might be suboptimal.
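To make that concrete, the kind of heuristic we’re talking about looks roughly like the sketch below. The thresholds in it (workers per CPU, RSS per worker, reserved memory) are made up for illustration and are not our actual installer defaults.

```python
# Illustrative sketch only -- not the real installer logic; the thresholds
# are assumptions picked to roughly reproduce the example above.

def suggested_puma_workers(cpus: int, ram_gb: int,
                           worker_rss_gb: float = 1.0,
                           reserved_gb: int = 8) -> int:
    """Return a conservative Puma worker count for a given machine size."""
    by_cpu = max(2, int(cpus * 1.5))                              # CPU-bound limit
    by_ram = max(2, int((ram_gb - reserved_gb) / worker_rss_gb))  # RAM-bound limit
    return min(by_cpu, by_ram)

print(suggested_puma_workers(4, 32))  # -> 6, matching the example above
```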
Let me quickly recap the current architecture:
Foreman
Apache HTTPd: Doing, well, HTTP. So SSL termination, forwarding requests to backends using mod_proxy, no mod_wsgi or mod_passenger anymore
Puma: The Ruby application server, running the main Ruby on Rails app.
PostgreSQL: The database.
Dynflow: The “Foreman jobs daemon”, doing various background tasks that the Rails app shouldn’t be blocked on.
Redis: Mostly used as a cache for Dynflow and Rails
Foreman Proxy
Foreman Proxy: Small Sinatra web app, running on WEBrick, responsible for integrating with external services like DNS, DHCP, Puppet etc.
Puppet Server
Puppet Server: If you’re using Puppet, you’ll also have a Puppet Server running on your machine
Katello
If you’re using Katello, you’ll also have the following additional services running on your system:
Pulpcore API: Django app (running in gunicorn) that allows Foreman/Katello to manage Pulp repositories
Pulpcore Content: Django app (running in gunicorn) that serves content (packages etc.) to clients
Pulpcore Workers: Django app (running standalone) that does the “work” of downloading packages, generating repository metadata etc
Tomcat: Java application server, running the Candlepin app, responsible for all things subscriptions for Katello and Clients
Now, as you can see, depending on the deployed feature set, the number of services (and thus the required resources) differs greatly. Additionally, y’all will have different use cases and usage patterns, resulting in different resource usage, even with the same features enabled.
What we’re looking for are numbers to help us further improve our defaults: mostly memory and CPU usage of the individual processes, plus a bit of detail about your deployment.
Examples:
Foreman without Katello and Puppet on 4 CPU, 16GB RAM VM. 6 Puma workers at 1GB each, PostgreSQL 800MB, Redis 200MB, running flawlessly
Foreman with Katello and Puppet on 8 CPU, 32GB RAM. 12 Puma workers at 2GB each, PostgreSQL at 3GB, Redis at 1GB, Tomcat at 2GB and Puppet Server at 4GB, constantly OOMing
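If it helps with collecting those numbers, below is a rough Python sketch that sums up RSS per service by matching process names. The name patterns are guesses that you may need to adjust for your deployment, and RSS double-counts memory shared between workers, so treat the output as an upper bound.

```python
#!/usr/bin/env python3
# Rough sketch for collecting per-service memory numbers.
# The process name patterns are assumptions; adjust them to your deployment.
# RSS counts shared pages once per process, so the sums are an upper bound.

import subprocess
from collections import defaultdict

SERVICES = {
    "puma": "Puma workers",
    "postgres": "PostgreSQL",
    "redis": "Redis",
    "sidekiq": "Dynflow (dynflow-sidekiq)",
    "gunicorn": "Pulpcore API/Content",
    "pulpcore-worker": "Pulpcore workers",
    "java": "Tomcat / Puppet Server (JVMs)",
}

totals = defaultdict(int)
# "rss=,args=" prints RSS (in KiB) and the full command line, without headers.
out = subprocess.run(["ps", "-eo", "rss=,args="],
                     capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    rss_kb, _, args = line.strip().partition(" ")
    for pattern, label in SERVICES.items():
        if pattern in args:
            totals[label] += int(rss_kb)
            break

for label, rss_kb in sorted(totals.items()):
    print(f"{label:30s} {rss_kb / 1024 / 1024:5.1f} GB RSS")
```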
I have no access to a customer to give you all the information at the moment, but I have asked at least two that I was in contact with in the last few weeks.
For me, a tuning preset that gives you the absolute minimum would be great, so I can create a local demo without blocking all resources for it. I just like to have my demos with me, to be independent of internet and datacenter access. Name this preset something like demo or “minimal - no productive use” to make it clear, but it would be nice to have it (even after I order a new system with 32GB RAM).
The added value of this might be a bit low; it’s more in the domain of ‘good to know’:
Foreman + Katello, no Puppet:
Lab setting: 2 CPUs, 12 GB RAM. It runs 3 Puma workers, and when I run free -m I see:
```
[ansible@foreman ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          11800        5373        4863         247        1563        5901
Swap:          6063           0        6063
```
But it doesn’t serve more than 5 clients (my laptop can’t handle that).
We’re still waiting for our client to provision VMs, networks and firewall access, but the ‘real’ deployment of this lab is scaled at 4 CPUs and 24GB RAM. This should eventually serve ~60 nodes, directly or through 4 smart proxies (2 CPUs / 8 GB RAM).
Foreman + Katello + Puppet, running on a VM with 16 CPU cores (currently at ~35% CPU load on average during regular operation) and 64 GB RAM. 8 Puma workers at ~800 MB each, PostgreSQL around 4GB of RAM, Redis at a negligible 150MB, Tomcat using ~4GB of RAM.
We have 3 smart proxies solely serving Puppet and another 3 for content/packages. This setup currently serves just short of 3000 hosts with a Puppet run interval of 1 hour.
We just updated this production environment from 2.0 to Foreman 2.5 this weekend, so our sizing is currently still mainly based on the mod_passenger setup. It is also intended to withstand request peaks, both for Puppet runs and package updates, but we have not had any of those peaks since the update.
In our current day-to-day usage, this setup by now runs flawlessly, but we have observed random occurrences of “slowness”, probably related to the way Puma request queues work differently from mod_passenger request queues. Before, when a user found Foreman to be unresponsive, this indicated requests piling up in general. Now requests seem to just be randomly assigned to a worker that is currently handling more work than the others, while the system in general is completely fine.
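I haven’t tried this on our setup yet, but if you enable Puma’s control app (it is not enabled in our config as far as I know), you can look at the per-worker backlog and check that theory. A sketch of what I have in mind, with a placeholder control URL and token:

```python
#!/usr/bin/env python3
# Sketch only: assumes Puma's control app has been enabled in the Puma config,
# e.g.  activate_control_app 'tcp://127.0.0.1:9293', { auth_token: 'secret' }
# The URL and token below are placeholders, not Foreman defaults.

import json
import urllib.request

CONTROL_URL = "http://127.0.0.1:9293/stats?token=secret"

with urllib.request.urlopen(CONTROL_URL) as resp:
    stats = json.load(resp)

# In clustered mode each worker reports its own backlog. One worker with a
# big backlog while the others are idle would explain the random "slowness".
for worker in stats.get("worker_status", []):
    last = worker.get("last_status", {})
    print(f"worker {worker['index']}: "
          f"backlog={last.get('backlog')} "
          f"running={last.get('running')} "
          f"pool_capacity={last.get('pool_capacity')}")
```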
One observation: I test Satellite on my ex-workstation with (I know, I know) 12GB of RAM. It was a quite decent workstation back then. Anyway:
It’s a Xeon W3550 with 4 cores and it worked fine. When we started shipping Puppet Server (recent versions) it sometimes had to swap; since I do not use Puppet, I always turned it off. And it worked just fine.
Until recently, when I realized things got much slower while I was syncing with Pulp 3. I noticed that the pulpcore-workers go up to 2.7G of virtual memory. The installer set up 4 of them for my system, and luckily only two of them were working when I was syncing my 4 repositories, but that is a significant spike from the previous state.
My suggestion would be to verify how the installer sets things up for multi-core systems, e.g. systems with many cores could suffer from increased memory usage when syncing.
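As a back-of-the-envelope illustration of the concern (assuming one worker per core and the ~2.7G virtual size I saw above; both are assumptions, not the installer’s actual logic):

```python
# Worst case: all pulpcore workers busy syncing at once.
# 2.7 GB is the per-worker virtual size I observed, so this is an upper bound.
# Check the configured worker count with e.g.:
#   systemctl list-units 'pulpcore-worker@*'

def worst_case_sync_memory_gb(workers: int, per_worker_gb: float = 2.7) -> float:
    return workers * per_worker_gb

for cores in (4, 8, 16):
    workers = cores  # assumption: one worker per core
    print(f"{cores:2d} cores -> {workers:2d} workers -> "
          f"up to ~{worst_case_sync_memory_gb(workers):.1f} GB during syncs")
```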
I know this thread is a little older now, but I really wanted to share a summary of my setup and its typical resource usage, in case it may still be useful to this community.
Below are my Foreman and Katello versions. I know that at this point these are already “old”. My current challenge is that I’m in the middle of a large deployment of CentOS to 5000+ hosts in a retail environment. It is a very slow deployment (such is the nature of retail), so I’m trying not to do any upgrades until it is completed.