Ansible default batch size for remote execution job is 100 hosts, need to increase the limit for remote execution in Foreman UI

Problem: I am trying to execute an Ansible role on more than 100 hosts from the Foreman UI, but it runs on only 100 hosts. The rest do not proceed, and their status is “host gone missing”.
I can run the role manually on all hosts, so it looks like there is an Ansible batch size limit of 100 hosts. The remaining hosts should start in the next batch, but that batch never starts; instead the status is “hosts gone missing”.

  • I tried increasing “proxy batch size” above 100 in the Foreman settings, but the problem persists. There are no errors; I just need to control the batch size for the execution.

Expected outcome: Ansible job should be able to execute on more than 100 hosts in one batch or in second batch.

Foreman and Proxy versions: foreman katello version 3.17.1

Hello,

I’m not sure what causes the “host gone missing” error; I haven’t seen it myself. It should work in batches, 100 hosts at a time. It would help to know whether you see any backtrace (e.g. in the task linked to the job, or in the Dynflow console).
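To make the expected behaviour concrete, here is a minimal sketch of splitting a host list into batches of at most 100. It is only an illustration: `plan_batches` and the host count of 168 are hypothetical and are not taken from the Foreman or Dynflow code.

```ruby
# Hypothetical illustration of batching; not Foreman's actual implementation.
BATCH_SIZE = 100 # the batch size discussed in this thread

# Split the given host IDs into consecutive batches of at most batch_size.
def plan_batches(host_ids, batch_size = BATCH_SIZE)
  host_ids.each_slice(batch_size).to_a
end

host_ids = (1..168).to_a # example host count, chosen for illustration
batches = plan_batches(host_ids)

batches.each_with_index do |batch, index|
  puts "batch #{index + 1}: #{batch.size} hosts"
end
```

With more than 100 hosts, a second (smaller) batch is expected to follow the first one rather than being dropped.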

I know @ezr-ondrej works on improved batching and a separate setting for REX/Ansible jobs. That should probably land in the next release.

You don’t specify what version of foreman_ansible and foreman you run, but based on the katello version I assume it would benefit from an upgrade.

@Marek_Hulan Thank you for the reply, below are the version details you have asked.
Foreman: 2.2.1
ansible: 2.9.15
Foreman_ansible: tfm-rubygem-foreman_ansible_core-3.0.4-1.fm2_2.el7.noarch

In Dynflow I couldn’t see any errors regarding “hosts gone missing”
Please find the screenshots of Dynflow, hosts gone missing error



Aha, now I understand which “Hosts gone missing” message you mean. This happens when you open the job invocation page and one or more of the hosts the job was executed against can no longer be found. This can happen if a host was deleted in the meantime, or if you ran the invocation in a different organization/location context. Typically, when people run the job in the “Any Organization” context, e.g. from the API, and then look at it in the UI in a specific org, some hosts may not belong to that org and are considered missing. Try switching to Any Organization/Any Location and see if you still see some missing hosts.

That shouldn’t be the cause of the failure though. In the tasks UI you posted as the last screenshot, click on the Sub-tasks button. This task represents the overall run, but it has a sub-task for each individual host. The Sub-tasks button takes you to the list of those tasks, so find some that seem to have errored out. Click on a task’s name and see if there’s something in the Error tab, and also check what’s in the Dynflow console for that sub-task.
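The organization scoping described earlier in this reply can be sketched as follows. This is a simplified model under stated assumptions, not Foreman code: `Host`, `missing_host_ids`, and the sample data are all hypothetical.

```ruby
require 'set'

# Hypothetical host record; Foreman's real Host model is far richer.
Host = Struct.new(:id, :name, :organization)

# Return the targeted host IDs that cannot be found among the hosts
# visible in the current organization context.
def missing_host_ids(targeted_ids, visible_hosts)
  visible_ids = visible_hosts.map(&:id).to_set
  targeted_ids.reject { |id| visible_ids.include?(id) }
end

all_hosts = [
  Host.new(1, 'web01', 'Org A'),
  Host.new(2, 'web02', 'Org A'),
  Host.new(3, 'db01', 'Org B')
]

# The job was started in an "Any Organization" context, so it targets all hosts.
targeted = all_hosts.map(&:id)

# Viewing the job while scoped to Org A hides the Org B host.
in_org_a = all_hosts.select { |host| host.organization == 'Org A' }

puts missing_host_ids(targeted, in_org_a).inspect  # [3]
puts missing_host_ids(targeted, all_hosts).inspect # []
```

In this model a host is reported as missing purely because of the viewing context, which is why switching to Any Organization/Any Location is a useful first check.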

Hello @Marek_Hulan
In the “hosts gone missing” screenshot I posted, there are 168 hosts in total, all from the same location and organization. The Ansible role is executed only on the first 100 hosts, which can be considered the first batch, and the second batch doesn’t start. If we manually execute the Ansible role on the hosts that have “gone missing”, it succeeds with no errors.

This is not limited to that screenshot. We have hosts in different locations and the problem is the same: the first 100 hosts are executed and the rest show as “hosts gone missing”. I believe the second batch is not starting.

However, I tried switching to Any Organization/Any Location as you suggested, and the hosts still exist.

The sub-tasks screenshot I shared is for the failed hosts. For the hosts in the “hosts gone missing” list, the status is “?N/A” and the hostnames are grayed out; since the job was never executed on them, there is no sub-task or error for those hosts.

Thanks for the additional details. I’ve tried on my nightly setup with the batch set to a lower number, but it seems to work fine. Just to confirm the assumption that it’s caused by the batch size, could you please set Administer -> Settings -> ForemanTasks -> Proxy tasks batch size to some smaller number and retrigger the job? Also, can you confirm that at least one host out of those 68 is in the location and organization you started the job in? I want to make sure I have the right reproducer.

Also your Foreman seems a bit older, I don’t recall which version we changed the ansible to use ansible-runner exactly, but I know there were some tweaks around this recently. Is there a chance you could upgrade to the most recent version?

Also do you have any customizations installed? The hosts table looks different than in default setup, could you upload the list of all plugins please?

==========

Hello Marek,
Thank you for the suggestions, below is the recap of the suggestions which you gave:

Just to confirm the assumption, it’s caused by the batch size, could you please set the Administer -> Settings -> ForemanTasks -> Proxy tasks batch size to some smaller number and retrigger the job?

  • I reduced the batch size to 98 in
    Administer -> Settings -> ForemanTasks -> Proxy tasks batch size
    but it does not seem to have any effect. No matter which number I set, the job picks up the default of 100 hosts in a batch and all the remaining hosts are skipped.

  • I have verified that
    Allow proxy batch tasks is set to Yes

Also, can you confirm that at least one host out of those 68 is in the location and organization you started the job in?

  • All the hosts are in the same location and organization. When I trigger the Ansible job on just those 68 hosts, it executes without any errors. The only issue is that the next batch does not start, and from the previous troubleshooting I think the batch size is not controllable in my case.

Also, your Foreman seems a bit older, I don’t recall which version we changed the ansible to use ansible-runner exactly, but I know there were some tweaks around this recently. Is there a chance you could upgrade to the most recent version?

  • To be honest, it would be great if the issue could be fixed in the current version. In the worst case, if there is no way to fix it, I can ask the customer for an upgrade.

Also, do you have any customizations installed? The host’s table looks different than in the default setup, could you upload the list of all plugins, please?

  • There are no customizations installed

plugins installed:
foreman-plugin-remote-execution
foreman-proxy-plugin-remote-execution-ssh
foreman-cli-remote-execution
foreman-plugin-ansible
foreman-proxy-plugin-ansible
foreman_cli_virt_who_configure
foreman_plugin_virt_who_configure

That’s weird, if the batch size did not do any change, it’s probably not the planning batch. But I’m not really aware of any other. Do you happen to have more foreman proxies with remote execution / ansible feature enabled? Foreman tries to do the load balancing in such case, not sure why the second would fail though.

Yes, we have more smart proxies, but even if I execute an Ansible role on a single SP, it still picks up only the first 100 hosts and the second batch does not start.

Do you see the same with pure SSH rex job or does it happen just for Ansible?

It is a problem with remote execution on RHEL hosts and Windows hosts as well.

(SSH and winrm connections)

We run bulk executions for Linux and Windows hosts, and for both the second batch does not start. The proxy batch size in the Foreman settings has no effect: the job does the default thing, picking up the first 100 hosts in the first batch, and all the rest end up in “hosts gone missing” with the status “?N/A”.

From the CLI, Ansible role execution works fine, so there is no issue with Ansible itself. I think the problem is in Foreman remote execution: running in multiple batches is not working as expected.

Indeed, the planning batch is not controllable; only the size of the batch sent to the proxy is controllable.
The planning batch can be adjusted in code only. @aruzicka would you know if we added some change that could cause the hosts in the second planning batch not to be found (lost context, user, etc.)?

This batch size can only be tweaked by adding a batch_size method to RemoteExecution::RunHostsJob

def batch_size
  50
end

There is no user interface for this.
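As a rough sketch of what such an override does, the following self-contained example uses stand-in classes. `BulkSubPlansSketch`, `RunHostsJobSketch`, and `SmallerPlanningBatch` are hypothetical names invented for illustration; they only mimic the idea of a default batch size that a class can override and are not the real plugin or Dynflow code. Where and how to apply such a change in an actual installation is not covered here.

```ruby
# Stand-in for the behaviour that provides a default planning batch size.
# This is NOT the real Dynflow module, only a simplified model.
module BulkSubPlansSketch
  DEFAULT_BATCH_SIZE = 100

  def batch_size
    DEFAULT_BATCH_SIZE
  end

  # Split items into planning batches using the current batch_size.
  def batches(items)
    items.each_slice(batch_size).to_a
  end
end

# Stand-in for RemoteExecution::RunHostsJob.
class RunHostsJobSketch
  include BulkSubPlansSketch
end

# The override suggested in the thread, packaged as a prependable module.
module SmallerPlanningBatch
  def batch_size
    50
  end
end

hosts = (1..168).to_a # example host count, chosen for illustration

default_sizes = RunHostsJobSketch.new.batches(hosts).map(&:size)

# Prepending places the override ahead of the class's own method lookup.
RunHostsJobSketch.prepend(SmallerPlanningBatch)
patched_sizes = RunHostsJobSketch.new.batches(hosts).map(&:size)

puts default_sizes.inspect # [100, 68]
puts patched_sizes.inspect # [50, 50, 50, 18]
```

The sketch only shows that overriding `batch_size` changes how many hosts land in each planning batch; it does not by itself explain why a second batch would fail to start.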

would you know if we added some change that could cause the host in second planning batch to be not found (lost context, user etc)?

Not recently, no. All the current user/org/loc middlewares have been around for ages and haven’t really changed at all.

==

Could you please provide the steps to add the ‘batch_size’ method from the CLI? I hope that will solve the issue.
Can I set the batch_size limit to “150” or higher, so that the batch is sent to the proxy and the jobs can run on all hosts?

===========
This batch size can be only tweaked by adding method batch_size in RemoteExecution::RunHostsJob

def batch_size
  50
end

Please advise whether my understanding is correct: will adding the batch_size method solve the issue of running remote execution Ansible jobs on more than 100 hosts from the Foreman UI?