GET /dynflow/tasks/<taskid>/status with 404 & Proxy task gone missing from the smart proxy

Problem:
Occasionally, jobs from Foreman return a 404 on their dynflow task during a remote execution run. I can queue up a few jobs against the same proxy, and some will work while others go missing. During this time, I can see the 404’d job ID in the dynflow console, usually in a waiting or running state.

Expected outcome:
The job executes and completes, returning whatever result it produces.

Foreman and Proxy versions:
Foreman

[root@10-222-206-158 jsparrow]# rpm -qa | grep foreman
rubygem-hammer_cli_foreman_remote_execution-0.3.0-1.el9.noarch
rubygem-hammer_cli_foreman_ssh-0.0.3-1.el9.noarch
rubygem-foreman_vault-2.0.0-1.fm3_11.el9.noarch
rubygem-foreman_kubevirt-0.2.0-1.fm3_11.el9.noarch
rubygem-foreman_statistics-2.1.0-3.fm3_11.el9.noarch
rubygem-hammer_cli_foreman_kubevirt-0.2.0-1.fm3_11.el9.noarch
rubygem-hammer_cli_foreman_puppet-0.1.0-1.fm3_11.el9.noarch
rubygem-hammer_cli_foreman_tasks-0.0.21-1.fm3_11.el9.noarch
foreman-obsolete-packages-1.10-1.el9.noarch
foreman-release-3.13.0-1.el9.noarch
rubygem-hammer_cli_foreman-3.13.0-1.el9.noarch
foreman-selinux-3.13.0-1.el9.noarch
foreman-3.13.0-1.el9.noarch
rubygem-foreman-tasks-10.0.1-1.fm3_13.el9.noarch
rubygem-foreman_remote_execution-14.0.2-1.fm3_13.el9.noarch
rubygem-foreman_salt-17.0.0-1.fm3_13.el9.noarch
foreman-dynflow-sidekiq-3.13.0-1.el9.noarch
foreman-libvirt-3.13.0-1.el9.noarch
foreman-ovirt-3.13.0-1.el9.noarch
foreman-postgresql-3.13.0-1.el9.noarch
foreman-redis-3.13.0-1.el9.noarch
foreman-service-3.13.0-1.el9.noarch
foreman-vmware-3.13.0-1.el9.noarch
rubygem-foreman_puppet-8.0.0-1.fm3_13.el9.noarch
rubygem-foreman_templates-10.0.1-1.fm3_13.el9.noarch
foreman-cli-3.13.0-1.el9.noarch
foreman-installer-3.13.0-1.el9.noarch
foreman-proxy-3.13.0-1.el9.noarch
rubygem-foreman_maintain-1.8.1-2.el9.noarch

Proxy:

foreman-installer-3.11.2-1.el8.noarch
foreman-release-3.11.5-1.el8.noarch
rubygem-foreman_remote_execution-13.1.0-1.fm3_11.el8.noarch
rubygem-foreman_salt-16.0.3-2.fm3_11.el8.noarch
foreman-proxy-3.11.2-1.el8.noarch
foreman-selinux-3.11.2-1.el8.noarch
rubygem-foreman-tasks-9.1.1-1.fm3_11.el8.noarch
foreman-3.11.2-1.el8.noarch

Foreman and Proxy plugin versions:
See above.

Distribution and version:
Foreman:

AlmaLinux release 9.5 (Teal Serval)
NAME="AlmaLinux"
VERSION="9.5 (Teal Serval)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.5"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.5 (Teal Serval)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.5"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.5"
SUPPORT_END=2032-06-01
AlmaLinux release 9.5 (Teal Serval)
AlmaLinux release 9.5 (Teal Serval)

Proxy:

$ cat /etc/*release
AlmaLinux release 8.10 (Cerulean Leopard)
AlmaLinux release 8.10 (Cerulean Leopard)
NAME="AlmaLinux"
VERSION="8.10 (Cerulean Leopard)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="AlmaLinux 8.10 (Cerulean Leopard)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:8::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8"
ALMALINUX_MANTISBT_PROJECT_VERSION="8.10"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
SUPPORT_END=2029-06-01
AlmaLinux release 8.10 (Cerulean Leopard)
AlmaLinux release 8.10 (Cerulean Leopard)

Other relevant data:
We only use jobs/tasks for salt calls to our masters.
Here are the logs of a job with multiple tasks, where some succeed and one in particular fails:

2025-01-24T13:41:04 776caca3 [I] Started GET /dynflow/tasks/9dfe1c7b-de67-4f9f-8498-c7b0b033bf5c/status
2025-01-24T13:41:04 776caca3 [I] Finished GET /dynflow/tasks/9dfe1c7b-de67-4f9f-8498-c7b0b033bf5c/status with 404 (2.52 ms)
2025-01-24T13:41:06 532c5c6c [I] Started GET /dynflow/tasks/9dfe1c7b-de67-4f9f-8498-c7b0b033bf5c/status
2025-01-24T13:41:06 532c5c6c [I] Finished GET /dynflow/tasks/9dfe1c7b-de67-4f9f-8498-c7b0b033bf5c/status with 404 (1.13 ms)
2025-01-24T13:41:07 a4b43d84 [I] Started GET /dynflow/tasks/9dfe1c7b-de67-4f9f-8498-c7b0b033bf5c/status

Dynflow:

And then after some time:

134: Error loading data from proxy: RestClient::NotFound - 404 Not Found
135: Error loading data from proxy: RestClient::NotFound - 404 Not Found
136: Actions::ProxyAction::ProxyActionMissing: Proxy task gone missing from the smart proxy

Here are some more relevant logs from foreman and the proxy:
Foreman:

[root@10-222-206-158 jsparrow]# grep b1b11dff /var/log/foreman/production.log
2025-01-24T14:03:12 [I|app|b1b11dff] Started POST "/api/job_invocations" for 10.222.218.3 at 2025-01-24 14:03:12 +0000
2025-01-24T14:03:12 [I|app|b1b11dff] Processing by Api::V2::JobInvocationsController#create as JSON
2025-01-24T14:03:12 [I|app|b1b11dff]   Parameters: {"job_invocation"=>{"job_template_id"=>172, "targeting_type"=>"static_query", "randomized_ordering"=>nil, "inputs"=>{"state"=>"ssnc-foreman-saltcontrol"}, "ssh_user"=>nil, "ssh"=>{"effective_user"=>"root"}, "recurrence"=>nil, "scheduling"=>nil, "concurrency_control"=>{"concurrency_level"=>"4"}, "bookmark_id"=>nil, "search_query"=>"name ^ (10-213-48-11.ssnc-corp.cloud, 10-222-203-207.ssnc-corp.cloud, 10-222-45-187.ssnc-corp.cloud, 10-234-32-23.ssnc-corp.cloud, 10-234-61-47.ssnc-corp.cloud)", "description_format"=>"Salt: state.apply", "execution_timeout_interval"=>500, "time_to_pickup"=>""}, "apiv"=>"v2"}
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on job_category Salt
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on description Salt: state.apply
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on concurrency_level 4
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on execution_timeout_interval 500
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on password [redacted]
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on key_passphrase
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on remote_execution_feature_id
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on effective_user_password
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on ssh_user
2025-01-24T14:03:12 [I|aud|b1b11dff] JobInvocation (515) create event on time_to_pickup
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: , execution_plan_id: a8122b1d-8609-434f-865a-61e3437ea9d4} state changed: pending
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostsJob, id: 4743b44c-b39c-4a14-8695-c18f3283a17c, execution_plan_id: a8122b1d-8609-434f-865a-61e3437ea9d4} state changed: planning
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostsJob, id: 4743b44c-b39c-4a14-8695-c18f3283a17c, execution_plan_id: a8122b1d-8609-434f-865a-61e3437ea9d4} state changed: planned
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostsJob, id: 4743b44c-b39c-4a14-8695-c18f3283a17c, execution_plan_id: a8122b1d-8609-434f-865a-61e3437ea9d4} state changed: running
2025-01-24T14:03:12 [I|app|b1b11dff]   Rendered /usr/share/gems/gems/foreman_remote_execution-14.0.2/app/views/api/v2/job_invocations/create.json.rabl (Duration: 25.7ms | Allocations: 12859)
2025-01-24T14:03:12 [I|app|b1b11dff] Completed 201 Created in 266ms (Views: 27.6ms | ActiveRecord: 61.9ms | Allocations: 81674)
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: , execution_plan_id: 1b2166d4-ea1c-4481-a77a-b62f8ad6e217} state changed: pending
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 93ea47de-d47c-4ee8-8d55-c508156214a0, execution_plan_id: 1b2166d4-ea1c-4481-a77a-b62f8ad6e217} state changed: planning
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 93ea47de-d47c-4ee8-8d55-c508156214a0, execution_plan_id: 1b2166d4-ea1c-4481-a77a-b62f8ad6e217} state changed: planned
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: , execution_plan_id: d4e4d679-29e6-445c-b115-cd077f624b79} state changed: pending
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 3a57096c-7f6a-492f-ad2d-9e1204c759de, execution_plan_id: d4e4d679-29e6-445c-b115-cd077f624b79} state changed: planning
2025-01-24T14:03:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 93ea47de-d47c-4ee8-8d55-c508156214a0, execution_plan_id: 1b2166d4-ea1c-4481-a77a-b62f8ad6e217} state changed: running
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 3a57096c-7f6a-492f-ad2d-9e1204c759de, execution_plan_id: d4e4d679-29e6-445c-b115-cd077f624b79} state changed: planned
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: , execution_plan_id: 1ba6a87a-872a-4f0d-b2a8-808a4e429e3a} state changed: pending
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 3a57096c-7f6a-492f-ad2d-9e1204c759de, execution_plan_id: d4e4d679-29e6-445c-b115-cd077f624b79} state changed: running
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 41523cc0-3a00-495f-bcb7-94a8c28d5dfa, execution_plan_id: 1ba6a87a-872a-4f0d-b2a8-808a4e429e3a} state changed: planning
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 41523cc0-3a00-495f-bcb7-94a8c28d5dfa, execution_plan_id: 1ba6a87a-872a-4f0d-b2a8-808a4e429e3a} state changed: planned
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: , execution_plan_id: ac996ea6-3236-4790-9d4a-8ca3f4ef9d01} state changed: pending
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 41523cc0-3a00-495f-bcb7-94a8c28d5dfa, execution_plan_id: 1ba6a87a-872a-4f0d-b2a8-808a4e429e3a} state changed: running
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: fe025b84-a951-4f90-beef-97c4e9fec79b, execution_plan_id: ac996ea6-3236-4790-9d4a-8ca3f4ef9d01} state changed: planning
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: fe025b84-a951-4f90-beef-97c4e9fec79b, execution_plan_id: ac996ea6-3236-4790-9d4a-8ca3f4ef9d01} state changed: planned
2025-01-24T14:03:13 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: fe025b84-a951-4f90-beef-97c4e9fec79b, execution_plan_id: ac996ea6-3236-4790-9d4a-8ca3f4ef9d01} state changed: running
2025-01-24T14:03:16 [I|bac|b1b11dff] Event delivered by request 5d607c96-0f2f-471d-bf09-da8b2887b0c9
2025-01-24T14:03:16 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: fe025b84-a951-4f90-beef-97c4e9fec79b, execution_plan_id: ac996ea6-3236-4790-9d4a-8ca3f4ef9d01} state changed: stopped  result: error
2025-01-24T14:03:28 [I|bac|b1b11dff] Task {label: , execution_plan_id: 6268c0e3-d477-45be-a2c8-e9e7b0f11dd1} state changed: pending
2025-01-24T14:03:28 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 8b05b98a-f672-4581-8711-12b045790498, execution_plan_id: 6268c0e3-d477-45be-a2c8-e9e7b0f11dd1} state changed: planning
2025-01-24T14:03:28 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 8b05b98a-f672-4581-8711-12b045790498, execution_plan_id: 6268c0e3-d477-45be-a2c8-e9e7b0f11dd1} state changed: planned
2025-01-24T14:03:28 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 8b05b98a-f672-4581-8711-12b045790498, execution_plan_id: 6268c0e3-d477-45be-a2c8-e9e7b0f11dd1} state changed: running
2025-01-24T14:03:30 [E|bac|b1b11dff] Job execution failed
2025-01-24T14:03:31 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 8b05b98a-f672-4581-8711-12b045790498, execution_plan_id: 6268c0e3-d477-45be-a2c8-e9e7b0f11dd1} state changed: stopped  result: error
2025-01-24T14:03:40 [I|bac|b1b11dff] Event delivered by request dc0cf7db-77c5-4f27-b30f-bc1640579a2c
2025-01-24T14:03:40 [E|bac|b1b11dff] Job execution failed
2025-01-24T14:03:40 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 93ea47de-d47c-4ee8-8d55-c508156214a0, execution_plan_id: 1b2166d4-ea1c-4481-a77a-b62f8ad6e217} state changed: stopped  result: error
2025-01-24T14:13:12 [E|bac|b1b11dff] Proxy task gone missing from the smart proxy (Actions::ProxyAction::ProxyActionMissing)
 b1b11dff |
2025-01-24T14:13:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 3a57096c-7f6a-492f-ad2d-9e1204c759de, execution_plan_id: d4e4d679-29e6-445c-b115-cd077f624b79} state changed: stopped  result: error
2025-01-24T14:13:12 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostJob, id: 41523cc0-3a00-495f-bcb7-94a8c28d5dfa, execution_plan_id: 1ba6a87a-872a-4f0d-b2a8-808a4e429e3a} state changed: stopped  result: error
2025-01-24T14:13:14 [E|bac|b1b11dff] A sub task failed (Dynflow::Action::V2::WithSubPlans::SubtaskFailedException)
 b1b11dff |
2025-01-24T14:13:14 [I|bac|b1b11dff] Task {label: Actions::RemoteExecution::RunHostsJob, id: 4743b44c-b39c-4a14-8695-c18f3283a17c, execution_plan_id: a8122b1d-8609-434f-865a-61e3437ea9d4} state changed: stopped  result: warning

Proxy:

root@10-222-45-187:kc-saltmom-01:[/etc/foreman-proxy/settings.d]:
$ grep d4e4d679-29e6-445c-b115-cd077f624b79 /var/log/foreman-proxy/proxy.log
2025-01-24T14:11:11 1ba135ab [I] Started GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status
2025-01-24T14:11:11 1ba135ab [I] Finished GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status with 404 (2.47 ms)
2025-01-24T14:11:13 cc59fd2a [I] Started GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status
2025-01-24T14:11:13 cc59fd2a [I] Finished GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status with 404 (1.17 ms)
2025-01-24T14:11:14 8189b618 [I] Started GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status
2025-01-24T14:11:14 8189b618 [I] Finished GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status with 404 (1.34 ms)
2025-01-24T14:11:16 41cfd991 [I] Started GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status
2025-01-24T14:11:16 41cfd991 [I] Finished GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status with 404 (1.09 ms)
2025-01-24T14:11:18 bc8cc65a [I] Started GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status
2025-01-24T14:11:18 bc8cc65a [I] Finished GET /dynflow/tasks/d4e4d679-29e6-445c-b115-cd077f624b79/status with 404 (1.17 ms)
repeated....

So your foreman is 3.13 on el9, while the proxy is 3.11 on el8? Why does the proxy machine have foreman-only packages installed? Could you also post output of rpm -qa | grep smart_proxy from the proxy?

In the proxy logs, are there any incoming requests to /tasks/launch?

Hello @aruzicka - thanks for continuing to follow up with me on this. Apologies for the long delay in replying; I was out of the country for the last two weeks.

You can skip to the bottom where it says “Likely Cause”; I think I just found the issue.

while the proxy is 3.11 on el8

Yes. The reason for this is that we need to upgrade our salt masters, but the latest version of foreman-proxy no longer supports EL8, and the version of Salt we are on doesn’t support EL9. So it is going to be a large undertaking to upgrade all of our salt master versions, as well as all of the OSes from 8 → 9. It’s in the works, though. However, I will note that I hit the same error even when both Foreman and the proxies were on 3.10 and both on EL8.

Why does the proxy machine have foreman-only packages installed?

I guess I’m not 100% sure why. This is what I use to install the proxies (it’s Salt code, but you’ll get the gist):

install_puppet_repo:
  cmd.run:
    - name: dnf -y install https://yum.puppet.com/puppet7-release-el-8.noarch.rpm

install_foreman_repo:
  cmd.run:
    - name: dnf -y install https://yum.theforeman.org/releases/3.11/el8/x86_64/foreman-release.rpm

# Enable the Foreman module for EL8
enable_foreman_module:
  cmd.run:
    - name: dnf -y module enable foreman:el8
    - unless: dnf module list enabled | grep -q '^foreman.*el8'

There is some other hand-waving here, but just for handling certs and users in the OS. Then:

# Run the Foreman installer to enable and configure Foreman Proxy
run_foreman_proxy_installer:
  cmd.run:
    - name:
        foreman-installer --enable-foreman-proxy
        --foreman-proxy-foreman-base-url=https://foreman.ssnc-corp.cloud:443
        --foreman-proxy-oauth-consumer-key={{ foreman_oauth_key }}
        --foreman-proxy-oauth-consumer-secret={{ foreman_oauth_pwd }}
        --foreman-proxy-puppetca=false --foreman-proxy-tftp=false
        --foreman-proxy-trusted-hosts=foreman.ssnc-corp.cloud
        --foreman-proxy-trusted-hosts=foreman-ui-01.ssnc-corp.cloud
        --foreman-proxy-trusted-hosts=foreman-ui-02.ssnc-corp.cloud
        --no-enable-foreman --no-enable-foreman-cli
        --puppet-server-ca=false --no-enable-foreman-plugin-puppet
        --no-enable-foreman-cli-puppet --foreman-proxy-registered-name={{ foreman_proxy_reg_name }}

Could you also post output of rpm -qa | grep smart_proxy from the proxy?

$ rpm -qa | grep smart_proxy
rubygem-smart_proxy_salt-5.1.0-1.fm3_11.el8.noarch
rubygem-smart_proxy_dynflow-0.9.2-1.fm3_11.el8.noarch

In the proxy logs, are there any incoming requests to /tasks/launch?

Likely Cause:
Annnnd I may have just figured out the problem as I’ve been collecting data for this thread all day. If this is the case, I feel like a complete idiot, lol. I was watching the proxy logs and noticed the foreman-proxy service restart, followed by 404s on the job. This makes sense, since I don’t have a database set up for jobs, just in-memory storage. It processed about 50 of the jobs, but I also had it set to run a job on the proxy itself. Part of that job is to restart the proxy service. So after it restarted, the rest of course failed. I’m 90% sure this is the problem. I don’t need any additional assistance, unless of course I see this occur again without the proxy being restarted.

That being said, I probably want to switch to using a db on the proxies instead of memory. However, I am having a hard time finding documentation that explains how to do so. Do you have any idea where to find that info?

Thanks again, and sorry to have likely wasted your time.

@aruzicka - I can confirm we are still hitting this issue. What I posted above does happen, but we resolved that by removing the proxies themselves from the jobs.

That being said, we are still hitting 404 errors on multiple proxies when kicking off large numbers of jobs/tasks (50,000+).

It’s quite bad: about a 97% failure rate due to 404s. Something has to be way off for it to be that high…

2025-02-26T15:56:16 dda7aaa3 [I] Finished GET /dynflow/tasks/b79b24fc-34ed-4fbb-8437-fa59ad539709/status with 404 (0.91 ms)
2025-02-26T15:56:16 dda7aaa3 [I] Started GET /dynflow/tasks/564281a2-8470-4647-87b9-13a6b0c7cd2c/status
2025-02-26T15:56:16 dda7aaa3 [I] Finished GET /dynflow/tasks/564281a2-8470-4647-87b9-13a6b0c7cd2c/status with 404 (0.85 ms)
2025-02-26T15:56:16 dda7aaa3 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T15:56:16 dda7aaa3 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (1.08 ms)
2025-02-26T15:56:16 dda7aaa3 [I] Started GET /dynflow/tasks/bb8133bb-bdef-45f4-8317-7c6481f8bf66/status
2025-02-26T15:56:16 dda7aaa3 [I] Finished GET /dynflow/tasks/bb8133bb-bdef-45f4-8317-7c6481f8bf66/status with 404 (0.83 ms)
2025-02-26T15:56:17 dda7aaa3 [I] Started GET /dynflow/tasks/44abd23c-6139-45bb-aaa9-a646a8ee713e/status
2025-02-26T15:56:17 dda7aaa3 [I] Finished GET /dynflow/tasks/44abd23c-6139-45bb-aaa9-a646a8ee713e/status with 404 (0.81 ms)
2025-02-26T15:56:18 dda7aaa3 [I] Started GET /dynflow/tasks/ad216e75-a17c-4173-9db2-116cefb14479/status
2025-02-26T15:56:18 dda7aaa3 [I] Finished GET /dynflow/tasks/ad216e75-a17c-4173-9db2-116cefb14479/status with 404 (0.84 ms)
2025-02-26T15:56:18 dda7aaa3 [I] Started GET /dynflow/tasks/5ca9d352-3da6-4da7-8b87-5b2cb6e570ee/status
2025-02-26T15:56:18 dda7aaa3 [I] Finished GET /dynflow/tasks/5ca9d352-3da6-4da7-8b87-5b2cb6e570ee/status with 404 (0.91 ms)
2025-02-26T15:56:18 dda7aaa3 [I] Started GET /dynflow/tasks/bc580f26-354a-4a97-87bc-31d565ec65f4/status
2025-02-26T15:56:18 dda7aaa3 [I] Finished GET /dynflow/tasks/bc580f26-354a-4a97-87bc-31d565ec65f4/status with 404 (0.82 ms)
2025-02-26T15:56:18 dda7aaa3 [I] Started GET /dynflow/tasks/947689a2-fd1e-4561-8038-d1aff57a001a/status
2025-02-26T15:56:18 dda7aaa3 [I] Finished GET /dynflow/tasks/947689a2-fd1e-4561-8038-d1aff57a001a/status with 404 (0.79 ms)
2025-02-26T15:56:18 dda7aaa3 [I] Started GET /dynflow/tasks/5000a72f-30bc-406d-9134-6c5498616df1/status
2025-02-26T15:56:18 dda7aaa3 [I] Finished GET /dynflow/tasks/5000a72f-30bc-406d-9134-6c5498616df1/status with 404 (0.88 ms)
2025-02-26T15:56:18 dda7aaa3 [I] Started GET /dynflow/tasks/be0a07bb-e545-4a34-a416-952aa22a8f7d/status
2025-02-26T15:56:18 dda7aaa3 [I] Finished GET /dynflow/tasks/be0a07bb-e545-4a34-a416-952aa22a8f7d/status with 404 (0.84 ms)
2025-02-26T15:56:19 dda7aaa3 [I] Started GET /dynflow/tasks/fd435f49-d230-4358-9cb7-9ce8d29d4b37/status
2025-02-26T15:56:19 dda7aaa3 [I] Finished GET /dynflow/tasks/fd435f49-d230-4358-9cb7-9ce8d29d4b37/status with 404 (0.76 ms)
2025-02-26T15:56:19 dda7aaa3 [I] Started GET /dynflow/tasks/a182c82c-8f4c-4163-bc20-af298cb101ca/status
2025-02-26T15:56:19 dda7aaa3 [I] Finished GET /dynflow/tasks/a182c82c-8f4c-4163-bc20-af298cb101ca/status with 404 (0.8 ms)
2025-02-26T15:56:22 dda7aaa3 [I] Started GET /dynflow/tasks/a0a39f81-9a48-49b4-9a58-1f5b3b9d8f8d/status
2025-02-26T15:56:22 dda7aaa3 [I] Finished GET /dynflow/tasks/a0a39f81-9a48-49b4-9a58-1f5b3b9d8f8d/status with 404 (0.86 ms)
2025-02-26T15:56:26 dda7aaa3 [I] Started GET /dynflow/tasks/9fd0a12d-1e5d-4510-93e4-55dfd3804e34/status
2025-02-26T15:56:26 dda7aaa3 [I] Finished GET /dynflow/tasks/9fd0a12d-1e5d-4510-93e4-55dfd3804e34/status with 404 (0.81 ms)
2025-02-26T15:57:18 dc05d2a2 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T15:57:18 dc05d2a2 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.97 ms)
2025-02-26T15:57:23 270a6b2b [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T15:57:23 270a6b2b [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.99 ms)
2025-02-26T15:57:28 b330aeb3 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T15:57:28 b330aeb3 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.8 ms)
2025-02-26T15:57:33 b1bcf741 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T15:57:33 b1bcf741 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.87 ms)
2025-02-26T15:57:38 1e232ec2 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T15:57:38 1e232ec2 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (1.02 ms)
2025-02-26T15:57:43 edcc26da [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T15:57:43 edcc26da [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.94 ms)
2025-02-26T15:57:48 0f2bbd52 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T15:57:48 0f2bbd52 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.91 ms)
root@10-222-15-36:kc-saltmaster-05:[/home/jsparrow]:
$ tail -f /var/log/foreman-proxy/proxy.log
2025-02-26T16:01:43 9b7fa6e7 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:01:43 9b7fa6e7 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (1.02 ms)
2025-02-26T16:01:48 1c6eff8d [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:01:48 1c6eff8d [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.95 ms)
2025-02-26T16:01:53 de81242a [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:01:53 de81242a [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.79 ms)
2025-02-26T16:01:56 0a513e57 [I] Started GET /dynflow/tasks/count state=running
2025-02-26T16:01:56 0a513e57 [I] Finished GET /dynflow/tasks/count with 200 (0.97 ms)
2025-02-26T16:01:58 b360f5d7 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:01:58 b360f5d7 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.79 ms)
2025-02-26T16:02:03 23a30193 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:03 23a30193 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.86 ms)
2025-02-26T16:02:05 0a513e57 [I] Started GET /dynflow/tasks/count state=running
2025-02-26T16:02:05 0a513e57 [I] Finished GET /dynflow/tasks/count with 200 (0.81 ms)
2025-02-26T16:02:05 bdded9f1 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:05 bdded9f1 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.88 ms)
2025-02-26T16:02:08 60fbd3ec [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:08 60fbd3ec [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.87 ms)
2025-02-26T16:02:08 73318701 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:08 73318701 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.98 ms)
2025-02-26T16:02:13 512b5e56 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:13 512b5e56 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.83 ms)
2025-02-26T16:02:13 cd9e04af [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:13 cd9e04af [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.74 ms)
2025-02-26T16:02:18 85b115f4 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:18 85b115f4 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.88 ms)
2025-02-26T16:02:23 2b2ad566 [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:23 2b2ad566 [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (1.1 ms)
2025-02-26T16:02:28 ab72804f [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:28 ab72804f [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.88 ms)
2025-02-26T16:02:33 9b815cfe [I] Started GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status
2025-02-26T16:02:33 9b815cfe [I] Finished GET /dynflow/tasks/cce5a87b-2f80-40e6-909b-9475c99e02da/status with 404 (0.83 ms)

It seems like maybe it only happens when we send out large jobs. If we run the same job against 5-10 hosts, no issues at all.

No idea if this means anything, but it’s the only log I could find with some sort of error related to dynflow:

Feb 26 11:58:30 10-222-206-152.xxx dynflow-sidekiq@orchestrator[1231736]: E, [2025-02-26T11:58:30.534331 #1231736] ERROR -- /connector-database-core: Receiving envelopes failed on Value (NilClass) '' is not any of: Algebrick::Types::Boolean(TrueClass | FalseClass).

That’s it.

Right now, I see three options that are sort of plausible:

  1. It’s something specific to Salt; after all, it does not get as much love as SSH or Ansible do
  2. It has something to do with the fact that you have 3.13 foreman and 3.11 proxy, but I don’t recall any changes between those versions that could cause this
  3. It has something to do with the sheer scale

or, of course, a combination of the above, or something else altogether. I’ll try thinking on it.

Right, I somehow missed this. You’re looking for the database key in /etc/foreman-proxy/settings.d/dynflow.yml, or even better the --foreman-proxy-plugin-dynflow-database-path installer option. It can either be left unset to use the in-memory db, set to a path, or in theory set to anything described in the README.rdoc.
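
For example, a minimal sketch of what that could look like (the path below is purely illustrative and the exact yaml key syntax is my assumption, so double-check against the README):

# Persist the proxy's dynflow state to disk instead of keeping it in memory,
# by re-running the installer with the option mentioned above:
foreman-installer --foreman-proxy-plugin-dynflow-database-path=/var/lib/foreman-proxy/dynflow/dynflow.sqlite3

# which should end up as the database key in /etc/foreman-proxy/settings.d/dynflow.yml, e.g.:
#   :database: /var/lib/foreman-proxy/dynflow/dynflow.sqlite3
# Leaving the key unset keeps the current in-memory behaviour.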

That should be safe to ignore.

Thanks, I really, really appreciate the reply.

We are doing some tests today, because this only seems to happen on large-scale jobs.

Here is what we are going to do:

  • kill tasks/jobs
  • select 1,000 hosts and run a simple command like uptime - keep batch triggering on
  • review outcome
  • kill tasks/jobs
  • select the same 1,000 hosts - disable batch triggering and use a concurrency of 20 instead
  • review outcome
  • continue those tests, increasing the host count by 1,000 each time

Will report back. This is my top priority right now, as we need to be able to run against 90,000 hosts, a number that is growing by 1,000 or more a month. Fun times ahead!!

Testing thus far:

With “Allow proxy batch tasks” ON and no concurrency set in the job:

  • A simple echo “hello world” against 1,257 hosts: 100% success rate
  • The same job against 2,175 hosts: 79.8% failure rate due to 404s (on multiple proxies)

With “Allow proxy batch tasks” OFF and no concurrency set in the job:

  • The same job against 2,175 hosts: 100% return rate. No 404 errors

So I’d say we have a very exact suspect here.
We just kicked off a more vigorous job against 60,000 hosts; so far it’s through 10,000 of them with no signs of stopping.
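
(For anyone else following along: the toggle in question is “Allow proxy batch tasks” under Administer > Settings. I believe the CLI equivalent would be something like the line below, though the internal setting name is my guess, so verify it in the settings UI first.)

# hypothetical hammer equivalent of turning “Allow proxy batch tasks” off
hammer settings set --name foreman_tasks_proxy_batch_trigger --value false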

Perhaps I do not understand what “Allow proxy batch tasks” means? I am slightly concerned about the number of calls back and forth between 60 salt masters (aka foreman-proxies) and the Foreman UI/jobs servers without batching on. However, I am more excited that it seems to be working. In addition, watching the Sidekiq dashboard, we’re at 4% worker utilization (6 workers, 2 orchestrators in active/passive, split out on 2 HA Foreman servers). The servers also seem like they are doing fine.

Based on the above information, do you think there are any pros or cons to not using batch processing (besides the fact that the major pro right now is that it’s working, lol)?

Do you have any interest in debugging this for your own knowledge or curiosity? I would be more than happy to act as a testbed, since I believe this is one of the largest Foreman/proxy infrastructures. In the next phase we’ll be adding 2 more UI/jobs servers and 2 more reports/grains servers. Then we’ll start the process of reworking all of this into k8s VMs/containers.

Cheers, and thank you so much again for helping me through this.

I’ll try explaining how the proxy batching works and maybe one of us will get some sort of idea. Unless you’d be open to me writing a custom patch to make Foreman log additional details, this is probably the only thing we can do right now.

In the early days, we had no proxy batching. When you kicked off a job, each host in the job would get its own task, and that task would make an HTTP call to the proxy to delegate the actual execution to it. That worked well up to a point. As people scaled up, their proxies were being hit by more and more HTTP requests, and eventually the proxy would be unable to withstand the load. The proxy seemed to be able to handle the executions, but not that many requests coming in at the same time.

To address the “too many requests” issue, we introduced proxy batching. On Foreman, the per-host tasks don’t make HTTP calls to the proxy themselves, but only queue up what they would like to send to the other side. On the Foreman side, there is a component of the job dedicated to picking up these queued payloads, bundling them into a single request and sending them to the proxy. When the proxy received such a bundled request, it would unroll it so that each per-host task in Foreman would have its counterpart on the proxy. Again, this worked at small scale, but because the unrolling process on the proxy was synchronous, the proxy sometimes wouldn’t respond in time and you’d get various timeouts on the Foreman side.

To address the “big request takes too long to process” issue, we made the proxy-side processing of this request asynchronous - the request ends before it is fully unrolled on the proxy side. Looks like we even have a sequence diagram for this. So at any given point in time, a per-host task in Foreman may or may not have its counterpart on the proxy side. Then when we check on the proxy, there are a few cases of what could happen:

  1. The unrolling is still not done, so if there’s no counterpart on the proxy it is fine as it should appear eventually
  2. The unrolling is done; at this point, each per-host task should have its counterpart on the proxy. If it does not, you get “proxy task gone missing”
  3. The proxy knows nothing about the unrolling that should have happened, which should only happen if someone restarted the proxy or if the unrolling finished more than 24 hours ago.

Two downsides come to mind - increased network traffic, and the need to rely on concurrency control to prevent Foreman from DoSing your proxies.

I am 100% up for whatever I can do to help improve Foreman.

Friendly bump. Let me know if you want to debug through this.

Just another friendly check-in, @aruzicka, to see if you wanted to work on this. Cheers!

Yes, eventually; it is still on my todo list. I need to figure out something that would get us just the right amount of data without becoming overwhelming.

Sounds good. I can help you test whenever you’d like.

You said that you have smart_proxy_dynflow-0.9.2, correct? The version after that (0.9.3) has one additional fix [1], which led me to this Jira issue [2], which seems to match what you’re experiencing almost to the letter.

Could you please try either applying the patch from [1] or pulling smart_proxy_dynflow-0.9.3 from rubygems and running a large-scale job with that?

[1] - Fixes #37284 - Do not overwrite results by adamruzicka · Pull Request #134 · theforeman/smart_proxy_dynflow · GitHub
[2] - https://issues.redhat.com/browse/SAT-24050
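
A rough sketch of one way to try it on the proxy (the gem path below is an assumption about your EL8 layout, so adjust as needed):

# Option A: apply the patch from [1] on top of the installed 0.9.2 gem
cd /usr/share/gems/gems/smart_proxy_dynflow-0.9.2   # path may differ on your system
curl -sL https://github.com/theforeman/smart_proxy_dynflow/pull/134.patch | patch -p1

# Option B: install 0.9.3 from rubygems.org instead (if mixing gem- and rpm-managed
# gems is acceptable in your environment)
gem install smart_proxy_dynflow -v 0.9.3

# Either way, restart the proxy afterwards so it picks up the new code
systemctl restart foreman-proxy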