Remote execution failing occasionally with 403 Forbidden

Problem:
Remote execution fails on a single host/task out of the batch of tasks in a job. Smart Proxy and production.log (included below) show “RestClient::Forbidden 403 Forbidden” and “Completed 403 Forbidden” respectively. Only occasionally - once or twice a month.

We have had this issue since we started using Foreman over a year ago and it has continued through at least 1 upgrade. It recently became an actual problem because we had to start using remote execution for an unfortunately tightly orchestrated process that fails if 1 element fails.

Expected outcome:
Remote execution does not fail randomly.

Foreman and Proxy versions:
Foreman 2.3.3-1
Katello 3.18.2-1

Distribution and version:
RHEL 7.9

What I have tried/found so far
I found a number of posts that all reference “RestClient::Forbidden 403 Forbidden” but they seemed to be persistent problems, not intermittent like mine:

https://access.redhat.com/solutions/4819051
https://foreman-users.narkive.com/OyUcEqV3/new-foreman-installation-new-proxy-won-t-connect-to-one-another-why
https://github.com/theforeman/puppet-foreman_proxy/issues/127

The solution in each of these cases was to add any CNAMEs in the ecosystem to the trusted_hosts lists on each proxy. We only have one CNAME in the ecosystem for our main Foreman server. I added the CNAME to :trusted_hosts: in /etc/foreman/settings.yml on our main Foreman server and 3 Smart Proxies.

I thought this fixed the issue, but unfortunately it did not. I also found this post, which again seems like it was persistent, not intermittent:
https://community.theforeman.org/t/remote-execution-failing-with-403-forbidden/19270/4

That post mentioned the Trusted Hosts lists in the UI, which I located at Settings->Authentication->Trusted Hosts. I have added the one CNAME for our main Foreman server there. However, I just did that today and I am not very confident it will correct the issue.

Other relevant data:
Smart Proxy Log: (Snipped, complete log can be supplied)

2021-07-05T10:30:33 a06beb0d [E] <RestClient::Forbidden> 403 Forbidden
        /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/abstract_response.rb:223:in `exception_with_response'
        /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/abstract_response.rb:103:in `return!'
        /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:809:in `process_result'
        /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:725:in `block in transmit'
        /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:910:in `start'
<SNIP>
        /opt/theforeman/tfm/root/usr/share/gems/gems/concurrent-ruby-1.1.6/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:324:in `catch'
        /opt/theforeman/tfm/root/usr/share/gems/gems/concurrent-ruby-1.1.6/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:324:in `block in create_worker'
        /opt/theforeman/tfm/root/usr/share/gems/gems/logging-2.3.0/lib/logging/diagnostic_context.rb:474:in `block in create_with_logging_context'

production.log:

2021-07-05T10:30:33 [I|app|d60ba5dc] Started POST "/foreman_tasks/api/tasks/callback" for 127.0.0.1 at 2021-07-05 10:30:33 -0500
2021-07-05T10:30:33 [I|app|741b0775] Started POST "/foreman_tasks/api/tasks/callback" for 127.0.0.1 at 2021-07-05 10:30:33 -0500
2021-07-05T10:30:33 [I|app|d60ba5dc] Processing by ForemanTasks::Api::TasksController#callback as */*
2021-07-05T10:30:33 [I|app|d60ba5dc]   Parameters: {"callback"=>{"task_id"=>"885ec5d8-b3a2-4185-8407-97aebabb39a6", "step_id"=>3}, "data"=>{"result"=>[{"output_type"=>"stdout", "output"=>"rex login: ", "timestamp"=>1625499032.5666928}, {"output_type"=>"stdout", "output"=>"\r\n", "timestamp"=>1625499033.5729036}, {"output_type"=>"stdout", "output"=>"127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4\r\n::1         localhost localhost.localdomain localhost6 localhost6.localdomain6\r\n\r\n172.18.5.20\tadms0006.tm.dc02.mgicint.net adms0006\r\n", "timestamp"=>1625499033.5731907}], "runner_id"=>"1ae5af2a-fb0b-4543-9136-da7d29031114", "exit_status"=>0}, "task"=>{}}
2021-07-05T10:30:33 [I|app|741b0775] Processing by ForemanTasks::Api::TasksController#callback as */*
2021-07-05T10:30:33 [I|app|741b0775]   Parameters: {"callback"=>{"task_id"=>"429e184d-8dc2-4d29-b125-c67c29dbb06f", "step_id"=>3}, "data"=>{"result"=>[{"output_type"=>"stdout", "output"=>"rex login: ", "timestamp"=>1625499032.5674927}, {"output_type"=>"stdout", "output"=>"\r\n", "timestamp"=>1625499033.5762162}, {"output_type"=>"stdout", "output"=>"127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4\r\n::1         localhost localhost.localdomain localhost6 localhost6.localdomain6\r\n\r\n172.18.4.20\tadmt0002.ts.dc02.mgicint.net admt0002\r\n", "timestamp"=>1625499033.5764387}], "runner_id"=>"d73ff876-e6ac-4581-a4e3-c7eb974f01fe", "exit_status"=>0}, "task"=>{}}
2021-07-05T10:30:33 [I|bac|] Task {label: Actions::RemoteExecution::RunHostJob, id: 31e80bfb-73cd-42a9-a8e3-26bfcf427923, execution_plan_id: 63cc422b-96c7-4f5e-ba2f-704637198a18} state changed: planned
2021-07-05T10:30:33 [W|app|741b0775] No SSL cert with CN supplied - request from 172.18.1.64
2021-07-05T10:30:33 [I|app|d60ba5dc] Completed 200 OK in 11ms (Views: 0.2ms | ActiveRecord: 2.2ms | Allocations: 3349)
2021-07-05T10:30:33 [I|app|741b0775]   Rendering api/v2/errors/access_denied.json.rabl within api/v2/layouts/error_layout
2021-07-05T10:30:33 [I|app|741b0775]   Rendered api/v2/errors/access_denied.json.rabl within api/v2/layouts/error_layout (Duration: 1.2ms | Allocations: 332)
2021-07-05T10:30:33 [I|app|741b0775] Filter chain halted as #<Proc:0x000000000a9de6f8@/usr/share/foreman/app/controllers/concerns/foreman/controller/smart_proxy_auth.rb:14> rendered or redirected
2021-07-05T10:30:33 [I|app|741b0775] Completed 403 Forbidden in 14ms (Views: 5.0ms | ActiveRecord: 2.2ms | Allocations: 3187)

production.log from 10 minutes later when the task actually fails, I believe due to timeout:

2021-07-05T10:40:11 [E|bac|] A sub task failed (RuntimeError)
 | /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action/with_sub_plans.rb:231:in `check_for_errors!'
 | /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action/with_sub_plans.rb:137:in `try_to_finish'
 | /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action/with_polling_sub_plans.rb:19:in `poll'
 | /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action/with_polling_sub_plans.rb:11:in `run'
<SNIP>
2021-07-05T10:40:11 [I|bac|] Task {label: Actions::RemoteExecution::RunHostsJob, id: f4b391d5-e579-4fd4-89d8-d41916a1b4cf, execution_plan_id: af8789f8-3fc5-488e-bad7-734979032539} state changed: stopped  result: warning
2021-07-05T10:40:11 [I|bac|] Task {label: Actions::RemoteExecution::RunHostsJob, id: f4b391d5-e579-4fd4-89d8-d41916a1b4cf, execution_plan_id: af8789f8-3fc5-488e-bad7-734979032539} state changed: stopped  result: warning

How big is the batch? Do you use multiple proxies or just one with the REX feature? Could you find the failing “RunHostJob” task under Monitor → Tasks and copy the Error tab here? This task represents the job on a single machine (note the singular Host in its name). Is that reproducible when you run the same job several times on this single host?

Hopefully that would show us something.

That is the odd thing, it doesn’t seem to matter how big the batch is. It fails randomly on one task only occasionally no matter batches of 10 or 100+.

We have the main Foreman server plus three smart proxies in separate network segments.

Unfortunately this is completely random. I have tried running simple remote execution jobs I know wont affect the hosts like “cat /etc/hosts” against every server (400+) at the same time, multiple times (5+), and have not been able to get it to fail on demand in any of those runs. Which task it fails on is also random. My favorite type of problem to debug :frowning:

It happened to fail this morning in a batch of 10 tasks, here is the output from the Error tab:

Action:
Actions::ProxyAction
Input:
{"proxy_operation_name"=>"ssh",
 "ssh_user"=>"for-mn-lnx-rw",
 "effective_user"=>"root",
 "effective_user_method"=>"sudo",
 "cleanup_working_dirs"=>true,
 "ssh_port"=>22,
 "hostname"=>"prxp1016.pz.dc01.mgicdmz.net",
 "script"=>
  "yum update -y; systemctl stop puppet; systemctl stop 8110-1.tomcat; systemctl stop b2b; /sbin/shutdown -r +1",
 "execution_timeout_interval"=>nil,
 "use_batch_triggering"=>true,
 "use_concurrency_control"=>false,
 "connection_options"=>
  {"retry_interval"=>15, "retry_count"=>4, "proxy_batch_triggering"=>true},
 "proxy_url"=>"https://admp1053.mz.dc01.mgicdmz.net:9090",
 "proxy_action_name"=>"ForemanRemoteExecutionCore::Actions::RunScript",
 "proxy_version"=>{"major"=>2, "minor"=>3, "patch"=>3},
 "current_request_id"=>nil,
 "current_timezone"=>"UTC",
 "current_user_id"=>4,
 "current_organization_id"=>nil,
 "current_location_id"=>nil}
Output:
{"proxy_task_id"=>"d365a704-0b1a-4a5f-ae43-bc101d141ec4"}
Exception:
Foreman::Exception: ERF42-5753 [Foreman::Exception]: The smart proxy task d365a704-0b1a-4a5f-ae43-bc101d141ec4 failed.
Backtrace:
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/proxy_action.rb:90:in `check_task_status'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/proxy_action.rb:137:in `on_proxy_action_stopped'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/proxy_action.rb:56:in `block in run'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/proxy_action.rb:237:in `with_connection_error_handling'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/proxy_action.rb:41:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:571:in `block (3 levels) in execute_run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:27:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:19:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:32:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:23:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:27:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:19:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/rails_executor_wrap.rb:14:in `block in run'
/opt/theforeman/tfm/root/usr/share/gems/gems/activesupport-6.0.3.4/lib/active_support/execution_wrapper.rb:88:in `wrap'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/rails_executor_wrap.rb:13:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:23:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:27:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:19:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action/progress.rb:31:in `with_progress_calculation'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action/progress.rb:17:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:23:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:27:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:19:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_request_id.rb:15:in `block in run'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_request_id.rb:49:in `restore_current_request_id'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_request_id.rb:15:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:23:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:27:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:19:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_timezone.rb:15:in `block in run'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_timezone.rb:44:in `restore_curent_timezone'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_timezone.rb:15:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:23:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:27:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:19:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_user.rb:15:in `block in run'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_user.rb:44:in `restore_curent_user'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_user.rb:15:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:23:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:27:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:19:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_taxonomies.rb:15:in `block in run'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_taxonomies.rb:45:in `restore_current_taxonomies'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-3.0.5/app/lib/actions/middleware/keep_current_taxonomies.rb:15:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:23:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:27:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:19:in `pass'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware.rb:32:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/stack.rb:23:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/middleware/world.rb:31:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:570:in `block (2 levels) in execute_run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:569:in `catch'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:569:in `block in execute_run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:472:in `block in with_error_handling'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:472:in `catch'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:472:in `with_error_handling'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:564:in `execute_run'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/action.rb:285:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:18:in `block (2 levels) in execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/execution_plan/steps/abstract.rb:167:in `with_meta_calculation'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:17:in `block in execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:32:in `open_action'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:16:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/director.rb:93:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/executors/sidekiq/worker_jobs.rb:11:in `block (2 levels) in perform'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/executors.rb:18:in `run_user_code'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/executors/sidekiq/worker_jobs.rb:9:in `block in perform'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/executors/sidekiq/worker_jobs.rb:25:in `with_telemetry'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/executors/sidekiq/worker_jobs.rb:8:in `perform'
/opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.7/lib/dynflow/executors/sidekiq/serialization.rb:27:in `perform'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:192:in `execute_job'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:165:in `block (2 levels) in process'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/middleware/chain.rb:128:in `block in invoke'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/middleware/chain.rb:133:in `invoke'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:164:in `block in process'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:137:in `block (6 levels) in dispatch'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/job_retry.rb:109:in `local'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:136:in `block (5 levels) in dispatch'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq.rb:37:in `block in <module:Sidekiq>'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:132:in `block (4 levels) in dispatch'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:250:in `stats'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:127:in `block (3 levels) in dispatch'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/job_logger.rb:8:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:126:in `block (2 levels) in dispatch'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/job_retry.rb:74:in `global'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:125:in `block in dispatch'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/logging.rb:48:in `with_context'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/logging.rb:42:in `with_job_hash_context'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:124:in `dispatch'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:163:in `process'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:83:in `process_one'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/processor.rb:71:in `run'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/util.rb:16:in `watchdog'
/opt/theforeman/tfm/root/usr/share/gems/gems/sidekiq-5.2.7/lib/sidekiq/util.rb:25:in `block in safe_thread'
/opt/theforeman/tfm/root/usr/share/gems/gems/logging-2.3.0/lib/logging/diagnostic_context.rb:474:in `block in create_with_logging_context'

I just noticed something that might be important. In my first post, in the production.log:

2021-07-05T10:30:33 [I|app|741b0775] Parameters: {"callback"=>{"task_id"=>"429e184d-8dc2-4d29-b125-c67c29dbb06f", "step_id"=>3}, "data"=>{"result"=>[{"output_type"=>"stdout", "output"=>"rex login: ", "timestamp"=>1625499032.5674927}, {"output_type"=>"stdout", "output"=>"\r\n", "timestamp"=>1625499033.5762162}, {"output_type"=>"stdout", "output"=>"127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4\r\n::1 localhost localhost.localdomain localhost6 localhost6.localdomain6\r\n\r\n172.18.4.20\tadmt0002.ts.dc02.mgicint.net admt0002\r\n", "timestamp"=>1625499033.5764387}], "runner_id"=>"d73ff876-e6ac-4581-a4e3-c7eb974f01fe", "exit_status"=>0}, "task"=>{}}

That shows after ““output”=>” the output of “cat /etc/hosts” on the host admt0002. That is the task that failed. So the command “cat /etc/hosts” did run and the output shows there.

The task that failed this morning, which I posted the output of in the post just previous to this one was to run “yum update -y; systemctl stop puppet; systemctl stop 8110-1.tomcat; systemctl stop b2b; /sbin/shutdown -r +1” on prxp1016. I just logged into that box and it did in fact update and reboot.

So it looks like the tasks actually are succeeding but being shown as failed. That happens after the task has been running for 10 minutes, I am guessing there is a 10 minute timeout set somewhere.

When you run a rex job, it gets delegated to a smart proxy. The smart proxy opens the ssh connection to the target host and actually performs the execution. Once it is done, it tries to report the results back to foreman. If foreman rejects this callback, which is what you’re occasionally seeing, it is marked as failed on the proxy side.

To deal with the possibility of the callback getting lost, foreman checks every 10 minutes on the proxy to see how it went. That’s where the failure after 10 minutes comes from, but that is merely a symptom of the callback not getting through.

I don’t recall seeing this before and nothing obvious comes to mind. Any tips on how to reproduce this would be greatly appreciated.

Thanks for your explanation, it triggered a line of reasoning that very well might be the answer.

We have a firewall that does SSL decryption for packet inspection. I am fairly confident I have never seen this error on tasks that run on hosts from the main Foreman server, only from the Smart Proxies in separate networks. Therefore the tasks from the main Foreman server never traverse the firewall.

Once I re-read the production.log for the 15th time, this error clicked No SSL cert with CN supplied - request from <Smart Proxy IP> Since you pointed out that this is a callback from the Smart Proxy to Foreman, I realized this was a 403 forbidden Smart Proxy->Foreman not the other direction as I previously thought.

My theory is this is the SSL decryption intermittently failing at… Something (Not unheard of at all). That is what I will work on getting corrected or worked around. It seems like this has been happening more often recently so hopefully a week without problems may at least hint I am on the right track.

Great, I was incorrect. SSL Decryption was not enabled internally. I misread the cert I checked via openssl s_client. Probably grasping at some straws. I am still going to validate with the firewall people but this might not in fact be the issue.

The intermittent and completely random nature of this issue has me really stumped on reproducing it. I have ran large numbers of jobs trying to force it and of course Murphy’s Law is in full effect and it never happens then.

I am considering setting Smart Proxy and Foreman server to debug logging levels and just running jobs until I can get the failure to happen.

1 Like

I finally was able to get some time to work on this. I ran 100 jobs of “cat /etc/hosts” against 1 Server Connected to each smart proxy (total of 4 servers). No failures. I then ran 100 jobs of “cat /etc/hosts” against 152 servers, our entire test environment. No failures. This job actually runs on a schedule to validate the environment on the day prior to patching and has failed in the past.

I have talked with a colleague and we are feeling at this point it may be a transient issue on our network. My next step is going to be adding the Smart Proxies to the main Foreman server’s hosts file and the main Foreman server to each Smart Proxy’s host file. We figure it can’t hurt to eliminate a possible DNS lookup failure.

Additionally, I am going to turn on debugging on the Smart Proxys and Foreman server and figure out how much space it will take to let it run. That way if the hosts file fix doesn’t work I may at least get some sort of data on what is going on. I’ll update here periodically.