Dealing with Ansible exit code 4

kkeane · March 17, 2023, 7:05am

Problem:

Ansible has a quirk in that a successful run sometimes returns exit code 4 instead of 0. Foreman reports this as an error and sets “Last Execution Failed”. How can I fix this?

Of course I’ll also try to address this on the Ansible side, but I’m not sure if that is possible.

Below is an example of an Ansible run that literally just calls “ping” but returns exit code 4. I don’t understand why, and that part is of course outside Foreman’s scope.

Expected outcome:

Exit code 4 from an Ansible playbook should be treated like exit code 0.

Foreman and Proxy versions:

3.5.1

Foreman and Proxy plugin versions:

Distribution and version:

AlmaLinux 8.7

Other relevant data:

    [DEPRECATION WARNING]: ANSIBLE_CALLBACK_WHITELIST option, normalizing names to
    new standard, use ANSIBLE_CALLBACKS_ENABLED instead. This feature will be
    removed from ansible-core in version 2.15. Deprecation warnings can be disabled
     by setting deprecation_warnings=False in ansible.cfg.
    [WARNING]: Callback disabled by environment. Disabling the Foreman callback
    plugin.
      ___________________________
    | -W 60 -f default PLAY [all] |
      ===========================
                               \
                                \
                                  ^__^
                                  (oo)\_______
                                  (__)\       )\/\
                                      ||----w |
                                      ||     ||
      ___________________________________
    | -W 60 -f default TASK [Apply roles] |
      ===================================
                                       \
                                        \
                                          ^__^
                                          (oo)\_______
                                          (__)\       )\/\
                                              ||----w |
                                              ||     ||
      __________________________________________
    | -W 60 -f default TASK [usd-ping : Do ping] |
      ==========================================
                                              \
                                               \
                                                 ^__^
                                                 (oo)\_______
                                                 (__)\       )\/\
                                                     ||----w |
                                                     ||     ||
    ok: [<hostname>]
    ___________________________
     <hostname>   : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
     Exit status: 4
    StandardError: Job execution failed

Dirk · March 17, 2023, 4:36pm

I found this page explaining the return codes: Ansible Exit Codes - jwkenney.github.io
But it still does not look like an explanation for this case.

Since another user in the linked issue has the same problem, it is definitely worth investigating.

kkeane · March 17, 2023, 4:48pm

Funny you mention that; I found the same site and also thought that it didn’t provide an explanation. Turns out that it actually does, and the mention in the linked issue provided the answer.

Ansible returns exit code 4 when any host in a batch is unreachable.

In my case, I had a concurrency level of 200 hosts and selected more than 100 hosts (by host collection). A few of these hosts are turned off, which means that every single one of the hosts receives exit code 4, even if there is no problem with it. That is not immediately apparent because Foreman seems to split the Ansible output per host, so at first glance it looks like exit code 4 belongs to this particular host.

A workaround is to specify concurrency level 1 - in that case, all my Ansible runs succeed (except of course those for the hosts that are unreachable).

This is only a workaround, not a fix, because using a concurrency level of 1 is painfully slow.
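To make the effect concrete, here is a toy model in Python. It is not how Ansible or Foreman are implemented; it only assumes what is described above: one run covers a whole batch of hosts, the run returns 4 if any host in the batch is unreachable, and every host in the batch is then reported with that one exit code. The host names and the `batch_size` parameter (standing in for the concurrency level) are made up for illustration.

```python
from typing import Dict, List, Set

UNREACHABLE_EXIT_CODE = 4  # exit code described above for unreachable hosts


def run_batch(hosts: List[str], unreachable: Set[str]) -> int:
    """Toy stand-in for one Ansible run: a single exit code for the whole batch."""
    if any(host in unreachable for host in hosts):
        return UNREACHABLE_EXIT_CODE
    return 0


def reported_status(hosts: List[str], unreachable: Set[str], batch_size: int) -> Dict[str, int]:
    """Give every host the exit code of the batch it ran in."""
    status = {}
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        code = run_batch(batch, unreachable)
        for host in batch:
            status[host] = code
    return status


hosts = ["web1", "web2", "db1", "db2"]  # hypothetical host names
down = {"db2"}                          # the one host that is turned off

# Large batch: every host inherits the exit code 4 caused by db2.
print(reported_status(hosts, down, batch_size=200))
# Batch size 1: only db2 is reported with exit code 4.
print(reported_status(hosts, down, batch_size=1))
```

With the large batch, all four hosts come back with 4; with a batch size of 1, only `db2` does, which matches what I see with concurrency level 1.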

So the underlying problem is that the Ansible exit code doesn’t indicate success or failure at the per-host level. It seems to me that the only way to get an accurate result is to parse the summary line for each host. If it is missing or has a “failed=” count > 0, Foreman should report failure.
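A rough Python sketch of that idea, assuming the summary lines look like the one in my output above (`<hostname> : ok=1 changed=0 unreachable=0 failed=0 ...`). This is not Foreman's code; the function names are made up for illustration. I also treat `unreachable` > 0 as a failure, which is my own addition to the rule above.

```python
import re
from typing import Dict

# Matches one summary line, e.g.
#   web1 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
RECAP_LINE = re.compile(r"^\s*(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output: str) -> Dict[str, Dict[str, int]]:
    """Return {host: {counter: value}} for every summary line found in the output."""
    results = {}
    for line in output.splitlines():
        match = RECAP_LINE.match(line)
        if match:
            pairs = (pair.split("=") for pair in match.group("counters").split())
            results[match.group("host")] = {name: int(value) for name, value in pairs}
    return results


def host_succeeded(output: str, host: str) -> bool:
    """Hypothetical per-host check: a missing summary line counts as failure."""
    recap = parse_recap(output).get(host)
    if recap is None:
        return False
    # Assumption: unreachable hosts are failures too, not only failed= > 0.
    return recap.get("failed", 0) == 0 and recap.get("unreachable", 0) == 0


sample = (
    "ok: [web1]\n"
    " web1   : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   \n"
    " Exit status: 4\n"
)
print(host_succeeded(sample, "web1"))  # True, despite the exit status of 4
print(host_succeeded(sample, "db2"))   # False, no summary line for this host
```

Hostnames containing a colon (IPv6 addresses, for example) would need a more careful pattern.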

Dirk · March 17, 2023, 8:47pm

Created an issue for this: Bug #36206: One or more unreachable hosts result in "failed" job for every host - Ansible - Foreman

nofaralfasi · March 21, 2023, 1:58pm

Hey,
Thanks for reporting that. This is an issue we are aware of, and it is fixed in the latest version. See Bug #36130: job invocation shows wrong info after remote execution job. I’m going to talk about it more in this week’s community demo if you want to get more details.