Output of the job execution is inconsistent

Problem:
The display of Ansible job output is inconsistent when a job is executed on several hosts: the output of some or several tasks is sometimes missing. More information in "Other relevant data".

Expected outcome:
One consistent log per host

Foreman and Proxy versions:
Both 3.10.0

Foreman and Proxy plugin versions:

Installed Packages

ansible-collection-theforeman-foreman-4.0.0-2.el8.noarch
candlepin-4.3.12-1.el8.noarch
candlepin-selinux-4.3.12-1.el8.noarch
foreman-3.10.0-1.el8.noarch
foreman-cli-3.10.0-1.el8.noarch
foreman-debug-3.10.0-1.el8.noarch
foreman-dynflow-sidekiq-3.10.0-1.el8.noarch
foreman-installer-3.10.0-1.el8.noarch
foreman-installer-katello-3.10.0-1.el8.noarch
foreman-obsolete-packages-1.6-1.el8.noarch
foreman-postgresql-3.10.0-1.el8.noarch
foreman-proxy-3.10.0-1.el8.noarch
foreman-redis-3.10.0-1.el8.noarch
foreman-release-3.10.0-1.el8.noarch
foreman-selinux-3.10.0-1.el8.noarch
foreman-service-3.10.0-1.el8.noarch
katello-4.12.0-1.el8.noarch
katello-certs-tools-2.9.0-2.el8.noarch
katello-client-bootstrap-1.7.9-2.el8.noarch
katello-common-4.12.0-1.el8.noarch
katello-debug-4.12.0-1.el8.noarch
katello-repos-4.12.0-1.el8.noarch
katello-selinux-5.0.2-1.el8.noarch
pulpcore-obsolete-packages-1.0-9.el8.noarch
pulpcore-selinux-2.0.1-1.el8.x86_64
python3.11-pulp-ansible-0.20.2-3.el8.noarch
python3.11-pulp-certguard-1.7.1-2.el8.noarch
python3.11-pulp-cli-0.21.2-5.el8.noarch
python3.11-pulp-container-2.16.4-1.el8.noarch
python3.11-pulp-deb-3.0.1-1.el8.noarch
python3.11-pulp-file-1.15.1-2.el8.noarch
python3.11-pulp-glue-0.21.2-3.el8.noarch
python3.11-pulp-python-3.10.0-3.el8.noarch
python3.11-pulp-rpm-3.23.3-1.el8.noarch
python3.11-pulpcore-3.39.11-1.el8.noarch
rubygem-foreman-tasks-9.1.1-1.fm3_11.el8.noarch
rubygem-foreman_ansible-14.0.0-1.fm3_11.el8.noarch
rubygem-foreman_maintain-1.5.1-1.el8.noarch
rubygem-foreman_remote_execution-13.0.0-1.fm3_11.el8.noarch
rubygem-hammer_cli-3.10.0-1.el8.noarch
rubygem-hammer_cli_foreman-3.10.0-1.el8.noarch
rubygem-hammer_cli_foreman_remote_execution-0.3.0-1.fm3_10.el8.noarch
rubygem-hammer_cli_foreman_tasks-0.0.20-1.fm3_10.el8.noarch
rubygem-hammer_cli_katello-1.12.0-0.1.pre.master.20240122184453git9a6ae05.el8.noarch
rubygem-katello-4.12.0-1.el8.noarch
rubygem-pulp_ansible_client-0.20.3-1.el8.noarch
rubygem-pulp_certguard_client-1.6.5-1.el8.noarch
rubygem-pulp_container_client-2.16.4-1.el8.noarch
rubygem-pulp_deb_client-3.0.1-1.el8.noarch
rubygem-pulp_file_client-1.15.1-1.el8.noarch
rubygem-pulp_ostree_client-2.1.3-1.el8.noarch
rubygem-pulp_python_client-3.10.0-1.el8.noarch
rubygem-pulp_rpm_client-3.23.2-1.el8.noarch
rubygem-pulpcore_client-3.39.9-1.el8.noarch
rubygem-smart_proxy_pulp-3.3.0-1.fm3_10.el8.noarch

Distribution and version:
Rocky Linux 8.9 (Green Obsidian)

Other relevant data:
A job calls a role with several tasks (operating system update):

main.yml:

```yaml
# tasks file for osupdate
- include_tasks: "update-{{ ansible_pkg_mgr }}.yml"

- name: Check on an update task (wait up to 20 minutes to complete)
  ansible.builtin.async_status:
    jid: "{{ update_sleeper.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: "{{ update_timeout }}"
  delay: 60

- name: "Cleanup async status"
  ansible.builtin.async_status:
    jid: "{{ update_sleeper.ansible_job_id }}"
    mode: cleanup

- name: Reboot after patching
  ansible.builtin.shell: "sleep 2 && reboot"
  async: 1
  poll: 0
  when: reboot|bool == true

- name: Wait for the reboot to complete
  ansible.builtin.wait_for_connection:
    connect_timeout: 5
    sleep: 5
    delay: 30
    timeout: 600
  when: reboot|bool == true
```

Task for zypper update:

```yaml
- ansible.builtin.debug:
    msg: "Checking for running update process"

- ansible.builtin.stat:
    path: /var/run/zypp.pid
  register: pidfile

# Wait (up to 2 minutes) until the zypp pid file no longer contains a pid
- ansible.builtin.lineinfile:
    path: /var/run/zypp.pid
    regexp: '\d+'
    state: absent
  check_mode: yes
  register: zypperpid
  until: zypperpid.found == 0
  retries: 12
  delay: 10
  failed_when: false
  changed_when: false
  when: pidfile.stat.exists

- name: "Get the pid of blocking application"
  ansible.builtin.shell: cat /var/run/zypp.pid
  register: blockpid
  changed_when: false
  when: pidfile.stat.exists and zypperpid.found == 1

- name: "Fail if a running zypper task is locking the system"
  ansible.builtin.fail:
    msg: "System management is locked by the application with pid {{ blockpid.stdout }} - Abort"
  when: pidfile.stat.exists and zypperpid.found == 1

- name: Update server (zypper)
  community.general.zypper:
    name: '*'
    state: latest
    extra_args: '--auto-agree-with-licenses'
  environment:
    ACCEPT_EULA: 'y'
  async: "{{ (update_timeout * 60)|int|abs }}"
  poll: 0
  register: update_sleeper
```

Task for dnf update:

```yaml
- ansible.builtin.debug:
    msg: "Checking for running update process"

# Wait (up to 2 minutes) until no dnf update/upgrade process is running
- ansible.builtin.shell: ps -ef|grep '/usr/bin/dnf'|grep -E 'update|upgrade'|grep -v grep
  retries: 12
  delay: 10
  register: dnfpid
  failed_when: false
  until: dnfpid.rc == 1

- name: "Fail if a running dnf task is blocking the system"
  ansible.builtin.fail:
    msg: "System management is locked by another dnf update/upgrade process - Abort"
  when: dnfpid.rc == 0

- name: Update server (dnf)
  ansible.builtin.dnf:
    name: '*'
    state: latest
  async: "{{ (update_timeout * 60)|int|abs }}"
  poll: 0
  register: update_sleeper
```

The output of the two hosts against which the job was executed is not consistent:

Output for the Rocky Linux Server:

```
[WARNING]: Callback disabled by environment. Disabling the Foreman callback
plugin.

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [testrocky02.playground.sandbox.ogit.cloud]

TASK [Update Server] ***********************************************************

TASK [ottogroup.runsap.osupdate : include_tasks] *******************************
included: /usr/share/ansible/collections/ansible_collections/ottogroup/runsap/roles/osupdate/tasks/update-dnf.yml for testrocky02.playground.sandbox.ogit.cloud
included: /usr/share/ansible/collections/ansible_collections/ottogroup/runsap/roles/osupdate/tasks/update-zypper.yml for testsap01.playground.sandbox.ogit.cloud

TASK [ottogroup.runsap.osupdate : ansible.builtin.debug] ***********************
ok: [testrocky02.playground.sandbox.ogit.cloud] => {
    "msg": "Checking for running update process"
}

TASK [ottogroup.runsap.osupdate : ansible.builtin.shell] ***********************
changed: [testrocky02.playground.sandbox.ogit.cloud]

TASK [ottogroup.runsap.osupdate : Update server (dnf)] *************************
changed: [testrocky02.playground.sandbox.ogit.cloud]
FAILED - RETRYING: [testrocky02.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (15 retries left).
FAILED - RETRYING: [testrocky02.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (14 retries left).
FAILED - RETRYING: [testrocky02.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (13 retries left).
FAILED - RETRYING: [testrocky02.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (12 retries left).
FAILED - RETRYING: [testrocky02.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (11 retries left).
FAILED - RETRYING: [testrocky02.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (10 retries left).
FAILED - RETRYING: [testrocky02.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (9 retries left).
changed: [testrocky02.playground.sandbox.ogit.cloud]

TASK [ottogroup.runsap.osupdate : Cleanup async status] ************************
ok: [testrocky02.playground.sandbox.ogit.cloud]
changed: [testrocky02.playground.sandbox.ogit.cloud]

TASK [ottogroup.runsap.osupdate : Wait for the reboot to complete] *************
ok: [testrocky02.playground.sandbox.ogit.cloud]
PLAY RECAP *********************************************************************
testrocky02.playground.sandbox.ogit.cloud : ok=9 changed=4 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
Exit status: 0
```

Output for the SUSE Server:

```
[WARNING]: Callback disabled by environment. Disabling the Foreman callback
plugin.

PLAY [all] *********************************************************************
ok: [testsap01.playground.sandbox.ogit.cloud]
included: /usr/share/ansible/collections/ansible_collections/ottogroup/runsap/roles/osupdate/tasks/update-dnf.yml for testrocky02.playground.sandbox.ogit.cloud
included: /usr/share/ansible/collections/ansible_collections/ottogroup/runsap/roles/osupdate/tasks/update-zypper.yml for testsap01.playground.sandbox.ogit.cloud

TASK [ottogroup.runsap.osupdate : ansible.builtin.debug] ***********************
ok: [testsap01.playground.sandbox.ogit.cloud] => {
    "msg": "Checking for running update process"
}

TASK [ottogroup.runsap.osupdate : ansible.builtin.stat] ************************
ok: [testsap01.playground.sandbox.ogit.cloud]

TASK [ottogroup.runsap.osupdate : ansible.builtin.lineinfile] ******************
ok: [testsap01.playground.sandbox.ogit.cloud]

TASK [ottogroup.runsap.osupdate : Update server (zypper)] **********************
changed: [testsap01.playground.sandbox.ogit.cloud]
FAILED - RETRYING: [testsap01.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (15 retries left).
FAILED - RETRYING: [testsap01.playground.sandbox.ogit.cloud]: Check on an update task (wait up to 20 minutes to complete) (14 retries left).

TASK [ottogroup.runsap.osupdate : Check on an update task (wait up to 20 minutes to complete)] ***
changed: [testsap01.playground.sandbox.ogit.cloud]
ok: [testsap01.playground.sandbox.ogit.cloud]

TASK [ottogroup.runsap.osupdate : Reboot after patching] ***********************
changed: [testsap01.playground.sandbox.ogit.cloud]
ok: [testsap01.playground.sandbox.ogit.cloud]
PLAY RECAP *********************************************************************
testsap01.playground.sandbox.ogit.cloud : ok=10 changed=3 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
Exit status: 0
```

As you can see, the output ends up scattered rather wildly between the two hosts. Incidentally, the problem affects all jobs that call roles.

Could you please put the outputs into preformatted blocks?

And maybe also show us what you would like to see?

Here is the formatted output. I ran the job once in parallel on two hosts, then ran it as a single job on each host. When the job is executed in parallel, some output is missing or only partially displayed; when executed individually, the full output is shown. I believe this was not the case a few releases ago.

Running Jobs in parallel on two hosts

Output Server A

```
PLAY [all]

TASK [Gathering Facts]
ok: [Server A]

TASK [Update Server]

TASK [ottogroup.runsap.osupdate : include_tasks]
included: /../roles/osupdate/tasks/update-dnf.yml for Server A
included: /../roles/osupdate/tasks/update-zypper.yml for Server B

TASK [ottogroup.runsap.osupdate : ansible.builtin.debug]
ok: [Server A] => {
    "msg": "Checking for running update process"
}

TASK [ottogroup.runsap.osupdate : ansible.builtin.shell]
changed: [Server A]

TASK [ottogroup.runsap.osupdate : Update server (dnf)]
changed: [Server A]
FAILED - RETRYING: [Server A]: Check on an update task (wait up to 20 minutes to complete) (15 retries left).
FAILED - RETRYING: [Server A]: Check on an update task (wait up to 20 minutes to complete) (14 retries left).
changed: [Server A]

TASK [ottogroup.runsap.osupdate : Cleanup async status]
ok: [Server A]
changed: [Server A]

TASK [ottogroup.runsap.osupdate : Wait for the reboot to complete]
ok: [Server A]
PLAY RECAP
Server A : ok=9 changed=4 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
Exit status: 0
```

Output Server B

```
PLAY [all]
ok: [Server B]
included: /../roles/osupdate/tasks/update-dnf.yml for Server A
included: /../roles/osupdate/tasks/update-zypper.yml for Server B

TASK [ottogroup.runsap.osupdate : ansible.builtin.debug]
ok: [Server B] => {
    "msg": "Checking for running update process"
}

TASK [ottogroup.runsap.osupdate : ansible.builtin.stat]
ok: [Server B]

TASK [ottogroup.runsap.osupdate : ansible.builtin.lineinfile]
ok: [Server B]

TASK [ottogroup.runsap.osupdate : Update server (zypper)]
changed: [Server B]
FAILED - RETRYING: [Server B]: Check on an update task (wait up to 20 minutes to complete) (15 retries left).
FAILED - RETRYING: [Server B]: Check on an update task (wait up to 20 minutes to complete) (14 retries left).

TASK [ottogroup.runsap.osupdate : Check on an update task (wait up to 20 minutes to complete)]
changed: [Server B]
ok: [Server B]

TASK [ottogroup.runsap.osupdate : Reboot after patching]
changed: [Server B]
ok: [Server B]
PLAY RECAP
Server B : ok=10 changed=3 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
Exit status: 0
```

Running Jobs one after another in Foreman (Expected output)

Output Server A

```
PLAY [all]
TASK [Gathering Facts]

TASK [Update Server]
TASK [ottogroup.runsap.osupdate : include_tasks]
included: /../roles/osupdate/tasks/update-dnf.yml for Server A

TASK [ottogroup.runsap.osupdate : ansible.builtin.debug]
ok: [Server A] => {
    "msg": "Checking for running update process"
}

TASK [ottogroup.runsap.osupdate : ansible.builtin.shell]
changed: [Server A]

TASK [ottogroup.runsap.osupdate : Update server (dnf)]
changed: [Server A]

FAILED - RETRYING: [Server A]: Check on an update task (wait up to 20 minutes to complete) (15 retries left).
FAILED - RETRYING: [Server A]: Check on an update task (wait up to 20 minutes to complete) (14 retries left).
TASK [ottogroup.runsap.osupdate : Check on an update task (wait up to 20 minutes to complete)]
changed: [Server A]

TASK [ottogroup.runsap.osupdate : Cleanup async status]
ok: [Server A]

TASK [ottogroup.runsap.osupdate : Reboot after patching]
changed: [Server A]

TASK [ottogroup.runsap.osupdate : Wait for the reboot to complete]
ok: [Server A]
PLAY RECAP *********************************************************************
Server A : ok=9 changed=4 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
Exit status: 0
```

Output Server B:

```
PLAY [all]

TASK [Gathering Facts]
ok: [Server B]

TASK [Update Server]

TASK [ottogroup.runsap.osupdate : include_tasks]
included: /../roles/osupdate/tasks/update-zypper.yml for Server B

TASK [ottogroup.runsap.osupdate : ansible.builtin.debug]
ok: [Server B] => {
    "msg": "Checking for running update process"
}

TASK [ottogroup.runsap.osupdate : ansible.builtin.stat]
ok: [Server B]

TASK [ottogroup.runsap.osupdate : ansible.builtin.lineinfile]
ok: [Server B]

TASK [ottogroup.runsap.osupdate : Update server (zypper)]
changed: [Server B]

FAILED - RETRYING: [Server B]: Check on an update task (wait up to 20 minutes to complete) (15 retries left).
TASK [ottogroup.runsap.osupdate : Check on an update task (wait up to 20 minutes to complete)]
ok: [Server B]

TASK [ottogroup.runsap.osupdate : Cleanup async status]
ok: [Server B]

TASK [ottogroup.runsap.osupdate : Reboot after patching]
changed: [Server B]

TASK [ottogroup.runsap.osupdate : Wait for the reboot to complete]
ok: [Server B]
PLAY RECAP
Server B : ok=10 changed=2 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
Exit status: 0
```

Would you be able to tell how many releases ago? Many releases ago, Foreman spawned a single ansible-playbook process for every single host in the job, which gave nice output but didn't scale at all. Later we changed to spawning a single ansible-playbook process for a batch of hosts (100 by default), which scales better but gives us output for the entire batch. We then take this output and try to take it apart to replicate the "single host output view" that we had before.

If you see things like `TASK [Update Server]` with nothing beneath it, it just means that this task was in the play but didn't apply to this particular host. Sadly, from the information the Ansible output contains, we cannot really filter that out, because we just don't know until it is too late.
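
To make the attribution problem concrete, here is a minimal Python sketch of the kind of splitting described above. It is purely illustrative and not the actual smart_proxy_ansible code; the regular expressions, the function name, and the "first host to report gets the header" rule are assumptions made for the example. The point it shows is that result lines name a host while task headers do not, so any per-host split has to guess.

```python
import re

# Illustrative only -- not the real smart_proxy_ansible implementation.
# Result lines such as "ok: [host]" can be attributed to a host, but
# "PLAY"/"TASK" headers carry no host name, so the splitter must guess.

TASK_HEADER = re.compile(r"^(?:PLAY|TASK)\b")
HOST_RESULT = re.compile(
    r"^(?:ok|changed|failed|fatal|skipping|FAILED - RETRYING):\s*\[([^\]]+)\]"
)

def split_per_host(lines, hosts):
    """Naively fan one batched ansible-playbook log out into per-host logs."""
    per_host = {h: [] for h in hosts}
    pending_header = None
    for line in lines:
        if TASK_HEADER.match(line):
            # We cannot tell yet which hosts this task will report results
            # for, so the header is held back until a result line arrives.
            pending_header = line
            continue
        match = HOST_RESULT.match(line)
        if match and match.group(1) in per_host:
            host = match.group(1)
            if pending_header is not None:
                # Here only the first host reporting a result gets the header;
                # a real splitter might instead copy it to every host, which
                # is why a host's view can show "TASK [...]" with nothing
                # underneath when the task never ran on that host.
                per_host[host].append(pending_header)
                pending_header = None
            per_host[host].append(line)
        # Lines that name no host (warnings, msg bodies, blanks) cannot be
        # attributed reliably and are simply dropped in this sketch.
    return per_host
```

Whichever rule such a splitter picks for the unattributable lines, some information is lost, which matches the behaviour described above.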


Yes, the problem has existed for many releases, and the change you mentioned could well be the trigger. The output `TASK [Update Server]` comes from the playbook:


```yaml
- hosts: all
  become: true
  gather_facts: yes

  tasks:
    - name: "Update Server"
      ansible.builtin.include_role:
        name: ottogroup.runsap.osupdate
      vars:
        reboot: <%= input('Reboot') %>
```

I did some research. We introduced Foreman in mid-2020, so it was a 2.x.x release.
If the assignment of output to hosts is the problem, would it be an idea to also provide the full output in addition to the individual logs (i.e. as if I were calling ansible-playbook on the command line)? At least as a temporary measure, such a solution would be an improvement, although I can't say whether it would be relatively easy to implement.

I poked around a bit and came up with two things. One is a fix[1], which may or may not resolve the issue you're having; sadly I can't really reproduce it reliably, so it is hard to tell. The other implements your idea of showing the full output[2].

Could you please take a look? Ideally, if you have some sort of test environment and you’d be willing to test these changes yourself then that would be super awesome, but I’d be happy for any feedback, even if only based on the attached screenshots.

Also, if you have any hints on how to reliably trigger the bug, that would be helpful as well.

[1] - Fixes #37660 - Reduce mangling of output by adamruzicka · Pull Request #91 · theforeman/smart_proxy_ansible · GitHub
[2] - Fixes #TBFL - Allow viewing of the entire output at once by adamruzicka · Pull Request #92 · theforeman/smart_proxy_ansible · GitHub

I can do that, but I need a little time. I would like to deploy a new, small instance of Foreman, take a snapshot of it, and then try out your fixes one after the other (I think that is what you want, or should I test both together?). I hope to be able to set up the server tomorrow and start testing on Monday. As I am only a simple sysadmin, I will get help from a developer at our company to install the diffs.
I will be on vacation from Thursday, but I will remain available for this; the topic is very important to me.

One after another would be helpful.

You don’t actually need to patch the system this way. The repository has packit (a package building service) enabled, so there are patched rpms available that you should be able to take and “just” install.

I am very sorry for the delay in my reply, but I have been seriously ill. At the moment I am on the road to recovery and hope to be fit again next week or the week after to be able to test your patches. Please accept my apologies.

No worries, hope you’re doing better now :slight_smile: