Problem:
Hello everybody,
I’m trying to increase the throughput of remote execution and have tried different things / parameters (puma worker / threads, dynflow worker / concurrency, …), but currently I have no idea where the bottleneck might be.
Our current setup / configuration:
- 4x Foreman 3.6.2 on RHEL 8.8 (installed and configured with your Puppet module).
- every node has 4 CPUs and 14 GB RAM
- foreman_service_puma_workers: 4, foreman_service_puma_threads_min: 8, foreman_service_puma_threads_max: 8
- dynflow_worker_instances: 2, dynflow_worker_concurrency: 5
- on every Node there is also a foreman-proxy with dynflow and the remote execution plugin enabled
- ~ 2200 hosts are connected
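To put rough numbers on this setup, here is a back-of-the-envelope sketch of the configured thread counts. It assumes that all four nodes run identical settings and that the per-node values simply multiply. That is a simplification for orientation, not a statement about how Foreman actually schedules work.

```python
# Values copied from the configuration listed above.
NODES = 4
PUMA_WORKERS = 4
PUMA_THREADS_MAX = 8
DYNFLOW_WORKER_INSTANCES = 2
DYNFLOW_WORKER_CONCURRENCY = 5
HOSTS = 2200

# Assumption: every node uses the same settings, so totals are plain products.
puma_threads_per_node = PUMA_WORKERS * PUMA_THREADS_MAX
dynflow_threads_per_node = DYNFLOW_WORKER_INSTANCES * DYNFLOW_WORKER_CONCURRENCY

puma_threads_total = NODES * puma_threads_per_node
dynflow_threads_total = NODES * dynflow_threads_per_node

# Rough ratio of connected hosts to configured dynflow worker threads.
hosts_per_dynflow_thread = HOSTS / dynflow_threads_total

print(f"puma threads:    {puma_threads_per_node} per node, {puma_threads_total} total")
print(f"dynflow threads: {dynflow_threads_per_node} per node, {dynflow_threads_total} total")
print(f"hosts per dynflow thread: {hosts_per_dynflow_thread:.0f}")
```

With these settings there are far fewer dynflow worker threads (40) than puma threads (128), which may or may not matter depending on where the time is actually spent.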
CPU and memory seem to be fine; under normal load we have a CPU utilization of ~5%.
Current situation:
When we start a remote execution job with a simple command like “id”, it takes quite a long time until all nodes are finished. Due to the settings “Allow proxy batch tasks = yes” and “Proxy tasks batch size = 100”, Foreman starts the execution on the first 100 nodes (as a block), then on the second 100 nodes (as a block), and so on. Almost every node of a block switches to running in Foreman (web frontend) directly after the block is started, but it takes a long time until such a block is finished.
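As a rough illustration of why the block behaviour matters, the sketch below counts the blocks for a job over all hosts and estimates the total runtime if the blocks really run strictly one after another, as observed above. The function names and the 60-second block duration are hypothetical placeholders, not measured values.

```python
HOSTS = 2200        # connected hosts, from the setup above
BATCH_SIZE = 100    # "Proxy tasks batch size"


def batch_count(hosts: int, batch_size: int) -> int:
    """Number of blocks needed to cover all hosts (ceiling division)."""
    return (hosts + batch_size - 1) // batch_size


def sequential_estimate_seconds(hosts: int, batch_size: int,
                                seconds_per_batch: float) -> float:
    """Total runtime, ASSUMING blocks run strictly one after another."""
    return batch_count(hosts, batch_size) * seconds_per_batch


print(batch_count(HOSTS, BATCH_SIZE))  # 22 blocks for 2200 hosts

# Hypothetical block duration of 60 s, chosen only for illustration.
print(sequential_estimate_seconds(HOSTS, BATCH_SIZE, 60.0))  # 1320.0
```

Under that assumption the total time grows linearly with the time a single block takes, so the time per block would be the number to measure first.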
On the Foreman nodes themselves I can’t really see a high-stress situation (CPU utilization increases to ~25%). There are mostly 2-5 ssh processes in parallel at the start of a block, but most of the time they do nothing.
As I mentioned above, I tried different parameters (puma worker / threads, dynflow worker / concurrency, …), but none of them really helped. I must also say that I don’t really know which components are involved at which moment and in which order (starting a job → foreman tasks → dynflow?? → foreman smart proxy / remote execution proxy???).
So do you have any hints for me? What are the “classical” parameters to increase the throughput? Is there a way to monitor foreman-task, dynflow, …? (Or does that not really make sense?)
We’re sending the telemetry data through statsd to InfluxDB / Grafana, but I didn’t find a metric that might help.
If you need further data, let me know.
Thank you very much