Help to find the bottleneck in foreman-task / remote execution

aruzicka · November 10, 2023, 12:36pm

tl;dr: You’re right that preparation of the job on Foreman’s side is strictly sequential. There are batches of up to 100 hosts, the batches are processed sequentially and even the hosts inside the batches are processed sequentially. From what you wrote, this is probably the bottleneck you’re looking for, but probably also something you cannot remove, unless you can somehow boost single core/single process/single thread performance.

Why four separate Foreman instances?

Well, that is somewhat expected. There is an overhead associated with running jobs through Foreman. For every single host in the job, it needs to verify all the necessary permissions, render the template, roll a number of per-host executions into a batch and dispatch the batch to the proxy. On my machine, the id command itself finishes in 0.02 seconds, but if you consider all the other things that need to happen, you get more into the range of seconds per host.

Correct. To reduce the number of requests that need to be exchanged between Foreman and the proxy, the nodes, as you call them, get prepared in batches, and each batch then gets sent to the proxy in a single request. That, however, doesn’t mean that a batch needs to completely finish on the remote end before the next one can start.
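To make the batching a bit more concrete, here is a minimal Python sketch of the flow as I described it. It is not Foreman's actual code (that is Ruby on top of Dynflow); prepare_host and send_to_proxy are hypothetical stand-ins for the per-host preparation and the single request per batch, and the batch size of 100 is just the upper bound mentioned in the tl;dr.

```python
from typing import Callable, Iterator, List

BATCH_SIZE = 100  # upper bound per batch, as mentioned in the tl;dr


def batches(hosts: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` hosts."""
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def run_job(hosts: List[str],
            prepare_host: Callable[[str], object],
            send_to_proxy: Callable[[list], None]) -> int:
    """Prepare hosts one by one and send each batch in a single request.

    Returns the number of requests sent to the proxy.
    """
    requests = 0
    for batch in batches(hosts):
        # Hypothetical stand-in for permission checks and template rendering;
        # hosts inside a batch are handled strictly one after another.
        prepared = [prepare_host(h) for h in batch]
        # Hand the batch over and move on; nothing here waits for results.
        send_to_proxy(prepared)
        requests += 1
    return requests


sent = []
hosts = [f"host{i}" for i in range(250)]
count = run_job(hosts, prepare_host=lambda h: h, send_to_proxy=sent.append)
print(count, [len(b) for b in sent])  # 3 [100, 100, 50]
```

The point of the sketch is only that 250 hosts turn into 3 requests instead of 250, and that nothing in the loop waits for the remote end before starting on the next batch.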

Yes, they flip to running once they start being prepared and sit in that state until they are completely done. Most of the time spent in the running state will most likely be just waiting for the batch to be finalized so that it can be sent over to the proxy and actually executed.
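A toy model shows why the waiting dominates. The Python sketch below is not measured data: the 1 second of preparation per host and the 0.5 second of remote execution are made-up numbers, and the model assumes a host flips to running when its own preparation starts and that a batch leaves as soon as its last host is prepared.

```python
def running_times(n_hosts, prep_s, exec_s, batch_size=100):
    """Return seconds each host spends in the 'running' state (toy model).

    Assumptions: hosts are prepared strictly one after another, each taking
    prep_s; a batch is dispatched once its last host is prepared; remote
    execution then takes exec_s for every host in the batch.
    """
    times = []
    for i in range(n_hosts):
        started = i * prep_s  # host flips to running here
        # Index of the last host that shares a batch with host i.
        last_in_batch = min((i // batch_size + 1) * batch_size, n_hosts) - 1
        dispatched = (last_in_batch + 1) * prep_s  # whole batch is ready
        finished = dispatched + exec_s
        times.append(finished - started)
    return times


t = running_times(n_hosts=200, prep_s=1.0, exec_s=0.5)
print(t[0], t[99])  # 100.5 1.5
```

In this model the first host of a batch sits in running for about 100 seconds while the last one is done in 1.5, even though both spend the same half a second actually executing.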

Overall this feels like you’re seeing this because it finishes on the remote end so fast. If you run something that takes a long time to actually execute, you should start seeing these piling up.

If what I wrote in the tl;dr above holds, then no, not really. There are things that can be done if the other parts become the bottleneck, but not for this.

In theory you could watch https://$foreman_fqdn/foreman_tasks/sidekiq . If you see queues not being processed in time, then that is something that could be addressed.

In theory Dynflow should be able to send its own metrics to a statsd sink, but I don’t recall it being properly documented anywhere.
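If you want to experiment with that, a throwaway sink is enough to see whether anything arrives at all. The Python sketch below is generic statsd, nothing Dynflow-specific: it listens on UDP port 8125 (the usual statsd default) and parses the standard name:value|type line format. How to point Dynflow at it and which metric names it emits are exactly the parts I cannot vouch for, so treat both as unknowns.

```python
import socket


def parse_statsd_line(line):
    """Split a statsd line 'name:value|type' (optionally '|@rate') into parts.

    The value is kept as a string, since some metric types carry
    non-numeric values.
    """
    name, _, rest = line.partition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"not a statsd metric: {line!r}")
    return name, fields[0], fields[1]


def listen(host="127.0.0.1", port=8125):
    """Print every metric that arrives over UDP; stop with Ctrl+C."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            data, _addr = sock.recvfrom(65535)
            for line in data.decode("utf-8", "replace").splitlines():
                try:
                    print(parse_statsd_line(line))
                except ValueError:
                    print("unparsed:", line)  # show it raw rather than drop it
```

Calling listen() and then running a job should show whether any metrics are being emitted and what they are called.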

Just out of curiosity, could you go to the details of the job, click “Job Task” in the upper right corner, then “Dynflow console”, and in there click both of the boxes under the Run tab and post the data from them here?

And could you do a similar thing for the first and the 99th host in the job? For each of those, click the hostname in the table below; there should be a “Task” button (or something along those lines). Then open the Dynflow console and screenshot the details.

You can redact it as much as you want, the only thing that would be interesting to see would be information about counts and times.