Foreman proxies - smart_proxy_dynflow_core.service keeps crashing - 400+ ansible tasks pending

Problem:

We have 7 proxies, including two Windows (DHCP) proxies and the Katello server itself. Out of these 7, 2 proxies keep failing all the time. Running foreman-maintain services status shows an error:

/ All services displayed [FAIL]
Some services are not running (smart_proxy_dynflow_core)

Scenario [Status Services] failed.

The following steps ended up in failing state:

[service-status]

Resolve the failed steps and rerun
the command. In case the failures are false positives,
use --whitelist="service-status"

I recently started replacing some Puppet modules with Ansible roles, and maybe this is related. I scheduled a job that runs every 20 minutes and executes the Ansible roles on 300+ hosts. Some of these hosts are not online, so the tasks keep piling up; it seems they do not time out. Maybe this is related? Most of these hosts are handled by the two failing proxies. This morning I have a total of 480 tasks, over 200 of them older than 24 hours.
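For reference, the recurring run is set up roughly like this (a sketch; the template name, search query and cron line stand in for our actual values):

hammer job-invocation create \
  --job-template "Ansible Roles - Ansible Default" \
  --search-query "hostgroup = app-servers" \
  --cron-line "*/20 * * * *"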

Expected outcome:

Proxy doesn’t crash.
Tasks time out more quickly?

Foreman and Proxy versions:

Katello 3.18.2
Foreman 2.3.3
Proxy 2.3.3

Foreman and Proxy plugin versions:

tfm-rubygem-kafo_wizards-0.0.1-4.el7.noarch
tfm-rubygem-foreman_ansible_core-4.0.0-1.fm2_3.el7.noarch
tfm-rubygem-rack-2.2.3-1.el7.noarch
tfm-rubygem-algebrick-0.7.3-7.el7.noarch
tfm-rubygem-rake-compiler-1.0.7-3.el7.noarch
tfm-rubygem-rkerberos-0.1.5-19.el7.x86_64
tfm-rubygem-netrc-0.11.0-5.el7.noarch
tfm-rubygem-jwt-2.2.1-2.el7.noarch
tfm-rubygem-sinatra-2.0.3-4.el7.noarch
tfm-rubygem-rest-client-2.0.2-3.el7.noarch
tfm-rubygem-multi_json-1.14.1-2.el7.noarch
tfm-rubygem-foreman-tasks-core-0.3.4-1.fm2_1.el7.noarch
tfm-runtime-6.1-4.el7.x86_64
tfm-rubygem-smart_proxy_dynflow-0.3.0-2.fm2_3.el7.noarch
tfm-rubygem-rack-protection-2.0.3-4.el7.noarch
tfm-rubygem-powerbar-2.0.1-2.el7.noarch
tfm-rubygem-tilt-2.0.8-4.el7.noarch
tfm-rubygem-mime-types-3.2.2-4.el7.noarch
tfm-rubygem-statsd-instrument-2.1.4-3.el7.noarch
tfm-rubygem-redfish_client-0.5.2-1.el7.noarch
tfm-rubygem-bundler_ext-0.4.1-5.el7.noarch
tfm-rubygem-hashie-3.6.0-2.el7.noarch
tfm-rubygem-concurrent-ruby-1.1.6-2.el7.noarch
tfm-rubygem-mustermann-1.0.2-4.el7.noarch
tfm-rubygem-unf-0.1.3-8.el7.noarch
tfm-rubygem-mime-types-data-3.2018.0812-4.el7.noarch
tfm-rubygem-rsec-0.4.3-4.el7.noarch
tfm-rubygem-smart_proxy_pulp-2.1.0-3.fm2_2.el7.noarch
tfm-rubygem-sequel-5.7.1-3.el7.noarch
tfm-rubygem-apipie-params-0.0.5-4.el7.noarch
tfm-rubygem-foreman_remote_execution_core-1.4.0-1.el7.noarch
tfm-rubygem-dynflow-1.4.7-1.fm2_3.el7.noarch
tfm-rubygem-smart_proxy_remote_execution_ssh-0.3.1-1.fm2_3.el7.noarch
tfm-rubygem-ansi-1.5.0-2.el7.noarch
tfm-rubygem-rubyipmi-0.10.0-6.el7.noarch
tfm-rubygem-unf_ext-0.0.7.2-3.el7.x86_64
tfm-rubygem-ruby-libvirt-0.7.1-1.el7.x86_64
tfm-rubygem-smart_proxy_discovery-1.0.5-6.fm2_2.el7.noarch
tfm-rubygem-smart_proxy_ansible-3.0.1-6.fm2_2.el7.noarch
tfm-rubygem-kafo-6.1.2-1.el7.noarch
tfm-rubygem-smart_proxy_dynflow_core-0.3.2-1.fm2_3.el7.noarch
tfm-rubygem-rb-inotify-0.9.7-5.el7.noarch
tfm-rubygem-little-plugger-1.1.4-2.el7.noarch
tfm-rubygem-kafo_parsers-1.1.0-3.el7.noarch
tfm-rubygem-http-cookie-1.0.2-4.el7.noarch
tfm-rubygem-concurrent-ruby-edge-0.6.0-2.fm2_1.el7.noarch
tfm-rubygem-net-ssh-4.2.0-2.el7.noarch
tfm-rubygem-sd_notify-0.1.0-1.el7.noarch
tfm-rubygem-server_sent_events-0.1.2-1.el7.noarch
tfm-rubygem-highline-1.7.8-5.el7.noarch
tfm-rubygem-gssapi-1.2.0-7.el7.noarch
tfm-rubygem-xmlrpc-0.3.0-2.el7.noarch
tfm-rubygem-clamp-1.1.2-6.el7.noarch
tfm-rubygem-domain_name-0.5.20160310-4.el7.noarch
tfm-rubygem-sqlite3-1.3.13-6.el7.x86_64
tfm-rubygem-logging-2.3.0-1.el7.noarch
tfm-rubygem-excon-0.76.0-1.el7.noarch
tfm-rubygem-ffi-1.12.2-1.el7.x86_64

Distribution and version:

CentOS Linux release 7.9.2009 (Core)

Other relevant data:

grep ERROR /var/log/messages on the proxy:

May 11 01:28:16 gedapvl05 smart_proxy_dynflow_core: E, [2021-05-11T01:28:16.894921 #876] ERROR -- /client-dispatcher: Could not find an executor for Dynflow::Dispatcher::Envelope[request_id: b25aa1a1-2351-4885-83bb-424f0a3a4b63-737504, sender_id: b25aa1a1-2351-4885-83bb-424f0a3a4b63, receiver_id: Dynflow::Dispatcher::UnknownWorld, message: Dynflow::Dispatcher::Event[execution_plan_id: 5b2f240e-6be6-45aa-a0ca-5d4ab98e09b7, step_id: 2, event: #<ForemanTasksCore::Runner::Update:0x00007f8536f834d8>, time: ]] (Dynflow::Error)
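To get some context around those errors rather than just the matching lines, grep’s context flags help:

# show 3 lines before and 10 lines after each match
grep -B 3 -A 10 'Could not find an executor' /var/log/messages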

Hi,
could you post more logs? It is rather hard to suggest a solution based on a single line.

On a side note, if you run the job to apply the roles, does it finish within 20 minutes?

Hi,

sorry for not posting more logs. The log files are enormous, and I can’t really tell at what time the proxies crashed. Which log files should I collect? proxy.log and /var/log/messages from the Katello server and the proxy?

> On a side note, if you run the job to apply the roles, does it finish within 20 minutes?

I don’t think so. There are over 200 jobs listed as older than 24 hours. I will disable the job for now, kill all pending jobs, and then select just a handful of machines, some of which are powered off, to see if the jobs for the powered-off machines keep pending.
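To clear the backlog in one go I’m going to try the foreman-tasks cleanup rake task on the Katello server, something like the following (an untested sketch; as far as I understand it deletes matching tasks rather than cancelling them gracefully, and the search string is my guess at matching the stuck runs):

# dry run first with NOOP=true to see what would be removed
foreman-rake foreman_tasks:cleanup NOOP=true \
  TASK_SEARCH='label = "Actions::RemoteExecution::RunHostJob"' \
  STATES='planned,paused' AFTER='24h' VERBOSE=true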

It sounds like the issue is in smart_proxy_dynflow_core, so /var/log/foreman-proxy/smart_proxy_dynflow_core.log could be useful. The others not so much.

It sounds like you may be pushing it too hard, at least until the issue with jobs getting stuck is resolved.

Good idea

Ok, I’m still fighting to cancel the old pending tasks. But here is the log file, or at least part of it.

https://pastebin.com/raw/qWPJ4JNg

It is almost 5000 lines. I scrolled through it and it shows more or less the same thing over and over again.

Good morning,

some good news: after disabling the Ansible-run cron job for the 300+ machines, the proxies are still running fine this morning. Yesterday I rebooted the Katello server after disabling the cron job, and it showed over 500 pending Ansible run jobs. This morning they are all gone and the task count is back at 0. So the problems I’m facing must be related to the scheduled Ansible jobs. I will now schedule a cron Ansible run for only 10 machines, some of which are powered off, to see if this breaks the proxies again.

BTW, is there a way to search for only powered-on machines/VMs? I don’t think it makes much sense to even try running Ansible on powered-off machines every 20 minutes. We have plenty that are powered off, and I have the feeling this harms the proxies, as the jobs for the powered-off machines get stuck and pile up into a huge backlog.
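In the meantime I was thinking of filtering the hosts myself before scheduling, along these lines (an untested sketch against the API; the hostname, credentials and search query are placeholders, and the accepted power_action values may differ per Foreman version and compute resource):

# list host IDs in the group, then ask each host for its power state
for id in $(hammer --output csv host list --search 'hostgroup = app-servers' --fields Id | tail -n +2); do
  curl -s -k -u admin:changeme -X PUT -H 'Content-Type: application/json' \
    -d '{"power_action": "status"}' \
    "https://katello.example.com/api/hosts/$id/power"
  echo
done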

Here is my new report.

I still have the 300+ host Ansible job disabled, and all proxies are still running fine. But I now have 3 “Run hosts job: Run Ansible roles” tasks that have been running for 2 days. I don’t think that is correct, is it?
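If they don’t clear out on their own, I guess I can try cancelling them from the Foreman Ruby console, something along these lines (an untested sketch; I would double-check the task label and state first):

# on the Katello server, inside foreman-rake console
foreman-rake console
> ForemanTasks::Task.where(label: 'Actions::RemoteExecution::RunHostsJob', state: 'running').each(&:cancel)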

Best Regards,
Oliver