katello-nightly-rpm-pipeline 421 failed

Katello nightly pipeline failed:

https://ci.theforeman.org/job/katello-nightly-rpm-pipeline/421/

foreman-katello-nightly-test (failed)
foreman-katello-upgrade-nightly-test (passed)

There are various exceptions in the log that I can’t make sense of, like:

2020-03-02T23:17:01 [E|bac|] Validation failed: Content view 'Test CV' is not in environment 'Test' (ActiveRecord::RecordInvalid)

@katello mind taking a look?

@ekohl is that the first point of failure? I see this on the first bats test:

not ok 1 Create an Empty Organization
# (in test file fb-content-katello.bats, line 21)
#   `hammer organization create --name="Empty Organization" | grep -q "Organization created"' failed
# Could not create the organisation:
#   Error: 502 Bad Gateway

Here is what I found so far, spinning up a nightly box using koji staging repos:

I was able to reproduce the bats hammer failure:

hammer -u admin -p changeme -d organization create --name="Empty Organization5"

[ INFO 2020-03-03T16:29:14 API] POST /katello/api/organizations
[DEBUG 2020-03-03T16:29:14 API] Params: {
    "organization" => {
        "name" => "Empty Organization5"
    }
}
[DEBUG 2020-03-03T16:29:14 API] Headers: {}
[DEBUG 2020-03-03T16:29:14 API] Using authenticator: HammerCLIForeman::Api::InteractiveBasicAuth
[ERROR 2020-03-03T16:30:14 API] 502 Bad Gateway
[DEBUG 2020-03-03T16:30:14 API] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>502 Proxy Error</title>\n</head><body>\n<h1>Proxy Error</h1>\n<p>The proxy server received an invalid\r\nresponse from an upstream server.<br />\r\nThe proxy server could not handle the request <em><a href=\"/katello/api/organizations\">POST&nbsp;/katello/api/organizations</a></em>.<p>\nReason: <strong>Error reading from remote server</strong></p></p>\n</body></html>\n"
[DEBUG 2020-03-03T16:30:14 Exception] Using exception handler HammerCLIKatello::ExceptionHandler#handle_general_exception
[ERROR 2020-03-03T16:30:14 Exception] Error: 502 Bad Gateway
Could not create the organization:
  Error: 502 Bad Gateway
[ERROR 2020-03-03T16:30:14 Exception] 

RestClient::BadGateway (502 Bad Gateway):
    /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/abstract_response.rb:223:in `exception_with_response'
    /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/abstract_response.rb:103:in `return!'
    /opt/theforeman/tfm/root/usr/share/gems/gems/apipie-bindings-0.3.0/lib/apipie_bindings/api.rb:353:in `block in rest_client_call_block'
    /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:807:in `process_result'
    /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:725:in `block in transmit'
    /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:910:in `start'
    /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
    /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
    /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
    /opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.2/lib/restclient/resource.rb:67:in `post'
    /opt/theforeman/tfm/root/usr/share/gems/gems/apipie-bindings-0.3.0/lib/apipie_bindings/api.rb:327:in `call_client'
    /opt/theforeman/tfm/root/usr/share/gems/gems/apipie-bindings-0.3.0/lib/apipie_bindings/api.rb:240:in `http_call'
    /opt/theforeman/tfm/root/usr/share/gems/gems/apipie-bindings-0.3.0/lib/apipie_bindings/api.rb:190:in `call_action'
    /opt/theforeman/tfm/root/usr/share/gems/gems/apipie-bindings-0.3.0/lib/apipie_bindings/api.rb:185:in `call'
    /opt/theforeman/tfm/root/usr/share/gems/gems/apipie-bindings-0.3.0/lib/apipie_bindings/resource.rb:21:in `call'
    /opt/theforeman/tfm/root/usr/share/gems/gems/hammer_cli-2.1.0.pre.develop/lib/hammer_cli/apipie/command.rb:53:in `send_request'
    /opt/theforeman/tfm/root/usr/share/gems/gems/hammer_cli_foreman-2.1.0.pre.develop/lib/hammer_cli_foreman/commands.rb:188:in `send_request'
    /opt/theforeman/tfm/root/usr/share/gems/gems/hammer_cli-2.1.0.pre.develop/lib/hammer_cli/apipie/command.rb:34:in `execute'
    /opt/theforeman/tfm/root/usr/share/gems/gems/hammer_cli_foreman-2.1.0.pre.develop/lib/hammer_cli_foreman/resource_supported_test.rb:9:in `execute'
    /opt/theforeman/tfm/root/usr/share/gems/gems/clamp-1.1.2/lib/clamp/command.rb:63:in `run'
    /opt/theforeman/tfm/root/usr/share/gems/gems/hammer_cli-2.1.0.pre.develop/lib/hammer_cli/abstract.rb:76:in `run'
    /opt/theforeman/tfm/root/usr/share/gems/gems/clamp-1.1.2/lib/clamp/subcommand/execution.rb:11:in `execute'
    /opt/theforeman/tfm/root/usr/share/gems/gems/clamp-1.1.2/lib/clamp/command.rb:63:in `run'
    /opt/theforeman/tfm/root/usr/share/gems/gems/hammer_cli-2.1.0.pre.develop/lib/hammer_cli/abstract.rb:76:in `run'
    /opt/theforeman/tfm/root/usr/share/gems/gems/clamp-1.1.2/lib/clamp/subcommand/execution.rb:11:in `execute'
    /opt/theforeman/tfm/root/usr/share/gems/gems/clamp-1.1.2/lib/clamp/command.rb:63:in `run'
    /opt/theforeman/tfm/root/usr/share/gems/gems/hammer_cli-2.1.0.pre.develop/lib/hammer_cli/abstract.rb:76:in `run'
    /opt/theforeman/tfm/root/usr/share/gems/gems/clamp-1.1.2/lib/clamp/command.rb:132:in `run'
    /opt/theforeman/tfm/root/usr/share/gems/gems/hammer_cli-2.1.0.pre.develop/bin/hammer:147:in `<top (required)>'
    /bin/hammer:23:in `load'
    /bin/hammer:23:in `<main>'

There is this from the httpd logs:

==> /var/log/httpd/foreman-ssl_error_ssl.log <==
[Tue Mar 03 16:26:57.796203 2020] [proxy_http:error] [pid 18565] (70007)The timeout specified has expired: [client 192.168.121.202:58754] AH01102: error reading status line from remote server 127.0.0.1:3000
[Tue Mar 03 16:26:57.796288 2020] [proxy:error] [pid 18565] [client 192.168.121.202:58754] AH00898: Error reading from remote server returned by /katello/api/organizations

But the actual organization does get created and the task succeeds.

I’ll keep investigating but feel free to comment with any ideas or further debugging steps :slight_smile:

That may be due to the switch from Passenger to Puma. @sureshkumar_thirugna might know more.

  • Are all the processes running in foreman-maintain service status?
  • Can you check if the VM is suffering from lack of memory?

Thanks for taking a look!

Services are fine:

[root@my-nightly-koji ~]# foreman-maintain service status -b
Running Status Services
================================================================================
Get status of applicable services: 

Displaying the following service(s):
rh-mongodb34-mongod, postgresql, pulpcore-api, pulpcore-content, pulpcore-resource-manager, qdrouterd, qpidd, rh-redis5-redis, squid, pulp_celerybeat, pulp_resource_manager, pulp_streamer, pulp_workers, pulpcore-worker@*, tomcat, dynflow-sidekiq@orchestrator, httpd, puppetserver, rh-redis5-redis, dynflow-sidekiq@worker, dynflow-sidekiq@worker-hosts-queue, foreman-proxy
| displaying rh-mongodb34-mongod                   [OK]                         
/ displaying postgresql                            [OK]                         
/ displaying pulpcore-api                          [OK]                         
/ displaying pulpcore-content                      [OK]                         
/ displaying pulpcore-resource-manager             [OK]                         
/ displaying qdrouterd                             [OK]                         
/ displaying qpidd                                 [OK]                         
/ displaying rh-redis5-redis                       [OK]                         
/ displaying squid                                 [OK]                         
/ displaying pulp_celerybeat                       [OK]                         
- displaying pulp_resource_manager                 [OK]                         
- displaying pulp_streamer                         [OK]                         
- displaying pulp_workers                          [OK]                         
- displaying pulpcore-worker@*                     [OK]                         
- displaying tomcat                                [OK]                         
- displaying dynflow-sidekiq@orchestrator          [OK]                         
- displaying httpd                                 [OK]                         
- displaying puppetserver                          [OK]                         
\ displaying rh-redis5-redis                       [OK]                         
\ displaying dynflow-sidekiq@worker                [OK]                         
\ displaying dynflow-sidekiq@worker-hosts-queue    [OK]                         
\ displaying foreman-proxy                         [OK]                         
\ All services are running                                            [OK]     

and it has 8.4 GB of memory:

              total        used        free      shared  buff/cache   available
Mem:           8.4G        6.7G        526M         83M        1.2G        1.2G

The organization create call is a blocking call; it's not async. It sounds like we're hitting a timeout in the Apache proxying.

It is an Apache proxy timeout. I was able to reproduce it, but I'm not sure what's causing it. FWIW, the organization create call completes, but the proxy times out first.

==> /var/log/foreman/production.log <==
2020-03-03T18:44:19 [I|app|cefd7ca3] Started GET "/katello/api/organizations" for 127.0.0.1 at 2020-03-03 18:44:19 +0000
2020-03-03T18:44:19 [I|app|cefd7ca3] Processing by Katello::Api::V2::OrganizationsController#index as JSON
2020-03-03T18:44:19 [I|app|cefd7ca3]   Parameters: {"api_version"=>"v2", "organization"=>{}}
2020-03-03T18:44:19 [I|app|cefd7ca3] Authorized user admin(Admin User)
2020-03-03T18:44:19 [I|app|cefd7ca3]   Rendering /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.16.0.pre.master/app/views/katello/api/v2/organizations/index.json.rabl within api/v2/layouts/index_layout
2020-03-03T18:44:19 [I|app|cefd7ca3]   Rendered /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.16.0.pre.master/app/views/katello/api/v2/organizations/index.json.rabl within api/v2/layouts/index_layout (4.0ms)
2020-03-03T18:44:19 [I|app|cefd7ca3] Completed 200 OK in 161ms (Views: 4.9ms | ActiveRecord: 24.4ms)
2020-03-03T18:44:19 [I|app|3dd2f4ca] Started POST "/katello/api/organizations" for 127.0.0.1 at 2020-03-03 18:44:19 +0000
2020-03-03T18:44:19 [I|app|3dd2f4ca] Processing by Katello::Api::V2::OrganizationsController#create as JSON
2020-03-03T18:44:19 [I|app|3dd2f4ca]   Parameters: {"organization"=>{"name"=>"newname1"}, "api_version"=>"v2"}
2020-03-03T18:44:19 [I|app|3dd2f4ca] Authorized user admin(Admin User)
2020-03-03T18:44:19 [I|bac|3dd2f4ca] Task {label: Actions::Katello::Organization::Create, id: 33ed7ef5-8b0d-49f8-bcfc-49c5b19f21ba, execution_plan_id: 48c41ae4-b14a-4bbf-9b08-6b8c23d8a22b} state changed: planning 
2020-03-03T18:44:20 [I|aud|] Organization (5) create event on name newname1
2020-03-03T18:44:20 [I|aud|] Organization (5) create event on ignore_types []
2020-03-03T18:44:20 [I|aud|] Organization (5) create event on description 
2020-03-03T18:44:20 [I|aud|] Organization (5) create event on label newname1
2020-03-03T18:44:20 [I|aud|] Organization (5) create event on ancestry 
2020-03-03T18:44:20 [I|aud|] Organization (5) create event on manifest_refreshed_at 
2020-03-03T18:44:20 [I|aud|] Katello::KTEnvironment (4) create event on name Library
2020-03-03T18:44:20 [I|aud|] Katello::KTEnvironment (4) create event on description 
2020-03-03T18:44:20 [I|aud|] Katello::KTEnvironment (4) create event on library true
2020-03-03T18:44:20 [I|aud|] Katello::KTEnvironment (4) create event on organization_id 5
2020-03-03T18:44:20 [I|aud|] Katello::KTEnvironment (4) create event on label Library
2020-03-03T18:44:20 [I|aud|] Katello::KTEnvironment (4) create event on registry_name_pattern 
2020-03-03T18:44:20 [I|aud|] Katello::KTEnvironment (4) create event on registry_unauthenticated_pull false
2020-03-03T18:44:20 [I|aud|] Katello::KTEnvironment (4) create event on content_facet_ids []
2020-03-03T18:44:20 [I|aud|] Katello::CapsuleLifecycleEnvironment (4) create event on capsule_id 1
2020-03-03T18:44:20 [I|aud|] Katello::CapsuleLifecycleEnvironment (4) create event on lifecycle_environment_id 4
2020-03-03T18:44:20 [W|app|] Scoped order is ignored, it's forced to be batch order.
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on name Default Organization View
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on label 4c211a83-3ec4-4a3a-b339-3172520b0b3d
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on description 
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on organization_id 5
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on default true
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on composite false
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on next_version 1
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on force_puppet_environment false
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on auto_publish false
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on solve_dependencies false
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on repository_ids []
2020-03-03T18:44:20 [I|aud|] Katello::ContentView (4) create event on environment_ids []
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewVersion (4) create event on content_view_id 4
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewVersion (4) create event on major 1
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewVersion (4) create event on definition_archive_id 
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewVersion (4) create event on minor 0
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewVersion (4) create event on content_counts 
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewVersion (4) create event on repository_ids []
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewVersion (4) create event on environment_ids []

==> /var/log/messages <==
Mar  3 18:44:21 localhost dynflow-sidekiq@worker: 2020-03-03T18:44:21.055Z 21187 TID-2wbbjn Dynflow::Executors::Sidekiq::WorkerJobs::PerformWork JID-34e2fa25a55a16f700c55446 INFO: start
Mar  3 18:44:21 localhost dynflow-sidekiq@worker: 2020-03-03T18:44:21.175Z 21187 TID-2wbbjn Dynflow::Executors::Sidekiq::WorkerJobs::PerformWork JID-34e2fa25a55a16f700c55446 INFO: done: 0.12 sec
Mar  3 18:44:21 localhost dynflow-sidekiq@orchestrator: 2020-03-03T18:44:21.175Z 20704 TID-2dr7bk Dynflow::Executors::Sidekiq::OrchestratorJobs::WorkerDone JID-fb9fcd7eb6437531909786db INFO: start
Mar  3 18:44:21 localhost dynflow-sidekiq@orchestrator: 2020-03-03T18:44:21.177Z 20704 TID-2dr7bk Dynflow::Executors::Sidekiq::OrchestratorJobs::WorkerDone JID-fb9fcd7eb6437531909786db INFO: done: 0.001 sec
Mar  3 18:44:21 localhost dynflow-sidekiq@worker: 2020-03-03T18:44:21.178Z 21187 TID-2wbb8v Dynflow::Executors::Sidekiq::WorkerJobs::PerformWork JID-8aae84e1fd7df8d75ca99e86 INFO: start
Mar  3 18:44:21 localhost dynflow-sidekiq@orchestrator: 2020-03-03T18:44:21.247Z 20704 TID-2dr7bk Dynflow::Executors::Sidekiq::OrchestratorJobs::WorkerDone JID-6af890838fe7378d30ce7cda INFO: start
Mar  3 18:44:21 localhost dynflow-sidekiq@orchestrator: 2020-03-03T18:44:21.248Z 20704 TID-2dr7bk Dynflow::Executors::Sidekiq::OrchestratorJobs::WorkerDone JID-6af890838fe7378d30ce7cda INFO: done: 0.001 sec
Mar  3 18:44:21 localhost dynflow-sidekiq@worker: 2020-03-03T18:44:21.248Z 21187 TID-2wbb8v Dynflow::Executors::Sidekiq::WorkerJobs::PerformWork JID-8aae84e1fd7df8d75ca99e86 INFO: done: 0.07 sec
Mar  3 18:44:21 localhost dynflow-sidekiq@worker: 2020-03-03T18:44:21.250Z 21187 TID-2wbav3 Dynflow::Executors::Sidekiq::WorkerJobs::PerformWork JID-a9d5127bfb9644e492ae1897 INFO: start

==> /var/log/foreman/production.log <==
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewEnvironment (4) create event on name Library
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewEnvironment (4) create event on label Library
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewEnvironment (4) create event on cp_id 42c0404a43f3d98243e5a25fafce8715
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewEnvironment (4) create event on content_view_id 4
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewEnvironment (4) create event on environment_id 4
2020-03-03T18:44:20 [I|aud|] Katello::ContentViewEnvironment (4) create event on content_view_version_id 4
2020-03-03T18:44:20 [I|bac|] Task {label: Actions::Katello::Organization::Create, id: 33ed7ef5-8b0d-49f8-bcfc-49c5b19f21ba, execution_plan_id: 48c41ae4-b14a-4bbf-9b08-6b8c23d8a22b} state changed: planned 
2020-03-03T18:44:21 [I|bac|] Task {label: Actions::Katello::Organization::Create, id: 33ed7ef5-8b0d-49f8-bcfc-49c5b19f21ba, execution_plan_id: 48c41ae4-b14a-4bbf-9b08-6b8c23d8a22b} state changed: running 

==> /var/log/messages <==
Mar  3 18:44:22 localhost dynflow-sidekiq@orchestrator: 2020-03-03T18:44:22.374Z 20704 TID-2dr7bk Dynflow::Executors::Sidekiq::OrchestratorJobs::WorkerDone JID-a4b0a49a34377352be42142a INFO: start
Mar  3 18:44:22 localhost dynflow-sidekiq@orchestrator: 2020-03-03T18:44:22.375Z 20704 TID-2dr7bk Dynflow::Executors::Sidekiq::OrchestratorJobs::WorkerDone JID-a4b0a49a34377352be42142a INFO: done: 0.001 sec
Mar  3 18:44:22 localhost dynflow-sidekiq@worker: 2020-03-03T18:44:22.376Z 21187 TID-2wbav3 Dynflow::Executors::Sidekiq::WorkerJobs::PerformWork JID-a9d5127bfb9644e492ae1897 INFO: done: 1.126 sec
Mar  3 18:44:22 localhost dynflow-sidekiq@worker: 2020-03-03T18:44:22.378Z 21187 TID-2wba63 Dynflow::Executors::Sidekiq::WorkerJobs::PerformWork JID-eb1e72a8dffa986922060c18 INFO: start
Mar  3 18:44:22 localhost dynflow-sidekiq@orchestrator: 2020-03-03T18:44:22.420Z 20704 TID-2dr7bk Dynflow::Executors::Sidekiq::OrchestratorJobs::WorkerDone JID-a352bacd65d9b2f144548887 INFO: start
Mar  3 18:44:22 localhost dynflow-sidekiq@worker: 2020-03-03T18:44:22.420Z 21187 TID-2wba63 Dynflow::Executors::Sidekiq::WorkerJobs::PerformWork JID-eb1e72a8dffa986922060c18 INFO: done: 0.041 sec
Mar  3 18:44:22 localhost dynflow-sidekiq@orchestrator: 2020-03-03T18:44:22.420Z 20704 TID-2dr7bk Dynflow::Executors::Sidekiq::OrchestratorJobs::WorkerDone JID-a352bacd65d9b2f144548887 INFO: done: 0.001 sec

==> /var/log/foreman/production.log <==
2020-03-03T18:44:22 [I|bac|] Task {label: Actions::Katello::Organization::Create, id: 33ed7ef5-8b0d-49f8-bcfc-49c5b19f21ba, execution_plan_id: 48c41ae4-b14a-4bbf-9b08-6b8c23d8a22b} state changed: stopped  result: success
2020-03-03T18:44:22 [I|bac|] Task {label: Actions::Katello::Organization::Create, id: 33ed7ef5-8b0d-49f8-bcfc-49c5b19f21ba, execution_plan_id: 48c41ae4-b14a-4bbf-9b08-6b8c23d8a22b} state changed: stopped  result: success

==> /var/log/httpd/foreman-ssl_access_ssl.log <==
192.168.121.75 - - [03/Mar/2020:18:44:19 +0000] "POST /katello/api/organizations HTTP/1.1" 502 430 "-" "rest-client/2.0.2 (linux-gnu x86_64) ruby/2.5.5p157"

==> /var/log/httpd/foreman-ssl_error_ssl.log <==
[Tue Mar 03 18:45:19.778901 2020] [proxy_http:error] [pid 20648] (70007)The timeout specified has expired: [client 192.168.121.75:50198] AH01102: error reading status line from remote server 127.0.0.1:3000
[Tue Mar 03 18:45:19.779333 2020] [proxy:error] [pid 20648] [client 192.168.121.75:50198] AH00898: Error reading from remote server returned by /katello/api/organizations
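
Worth noting: the 502 in the access log lands exactly 60 seconds after the POST (18:44:19 → 18:45:19), which matches mod_proxy's timeout behaviour: with no `timeout=` on the ProxyPass and no ProxyTimeout, it falls back to the global Timeout (likely the default 60 seconds here). An illustrative fragment (not the shipped Foreman vhost) showing where such a timeout would be raised:

```apache
# Illustrative only -- not the actual Foreman vhost config. mod_proxy
# returns 502 "Error reading from remote server" when the backend sends
# no status line within the timeout window.
<Location "/katello">
  # timeout= overrides ProxyTimeout, which overrides the global Timeout
  ProxyPass http://127.0.0.1:3000/katello timeout=600
</Location>
```

Raising the timeout would only mask the underlying hang, of course; the backend should be answering in well under 60 seconds.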

I found that the problem is associated with using Puma workers. I don't know what the fix is, though. As a temporary workaround:

  1. Edit /usr/share/foreman/config/puma/production.rb and comment out the line starting with workers.
  2. systemctl restart foreman
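
For illustration, the workaround amounts to something like this (hypothetical excerpt; the exact contents of production.rb on your box may differ):

```ruby
# /usr/share/foreman/config/puma/production.rb (hypothetical excerpt)
threads 5, 5
bind "tcp://127.0.0.1:3000"
# workers 2    # <- commented out: Puma stays in single-process mode
```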

A few of us reproduced this this afternoon. As far as we can tell, this isn't specific to Katello, but is likely some sort of interaction between dynflow/foreman-tasks and Puma. We're not entirely sure what's going on yet. What we do see is that any call to ForemanTasks.sync_task() executes the task properly within a few seconds, but doesn't actually return from the method call for a full 2 minutes.

I've seen other weirdness, such as clicking 'check status' on the executor status page in the dynflow console causing a similar 'hang'. Any thoughts/ideas @aruzicka @ezr-ondrej?

I’ll try to reproduce without katello just to see.

I was able to reproduce by:

installing foreman nightly (centos7)
installing remote-execution
going to the dynflow console, clicking status -> check status

I presume that if I added a sync_task() call in a controller method I'd see the same.

Also, one thing to note: if I run sync_task() from the Rails console, the problem does not occur. It only seems to occur under Puma with workers enabled.

Good finds!

For now, I think we should set the workers to zero for installs until someone can dig into this issue.

I'm looking into it. So far I've managed to reproduce it (albeit not always) when running with puma -w 8 --preload; no luck without preload.

Quick update: when I start Rails, it creates roughly the following process/thread structure:

puma
├── worker1
│   ├── thread1
│   ├── thread2
│   ├── thread3
│   ├── thread4
│   └── thread5
└── worker2
    ├── thread1
    ├── thread2
    ├── thread3
    ├── thread4
    └── thread5

If I start everything using rails server, Dynflow gets initialized in the first (parent) process, and then all the other processes and threads reuse the same world. This is terrible and should never happen: it means that messages sent between worlds can be delivered to the wrong process, which mostly explains the sync tasks hanging.

If I start the server by running just puma, Dynflow gets initialized in the worker processes (leading to several separate instances) and everything seems fine.
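
The first case can be sketched with a few lines of Ruby (all names here are invented): state built in the master before forking is inherited by every worker, so all workers end up with the same "world" identity.

```ruby
# Sketch: an ID generated in the master before fork is shared by all
# forked workers, just like a Dynflow world initialized pre-fork.
WORLD_ID = rand(1_000_000).to_s  # stands in for the Dynflow world ID

readers = 2.times.map do
  reader, writer = IO.pipe
  fork do
    reader.close
    writer.write(WORLD_ID)  # each worker reports the ID it inherited
    writer.close
    exit!
  end
  writer.close
  reader
end

ids = readers.map(&:read)
Process.waitall
# Both workers share one ID, so a message addressed to that ID can be
# consumed by either process:
puts ids.uniq.size  # => 1
```

Generating the ID after the fork instead, which is effectively what happens when Dynflow initializes inside each worker, gives every process its own identity.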

Do we have to start the web server using rails server in https://github.com/theforeman/foreman/blob/develop/extras/systemd/foreman.service#L11? I'd really like to get to the bottom of this, but this seems like an easy way out.

Thanks for digging into this so quickly!

I have a naive view of Dynflow, especially with the changes to using Sidekiq. What is the world and how does it affect how tasks get put into the database for processing by the Sidekiq workers?

Development environments have been using Puma with multiple workers for what feels like years at this point. Do you know what differs between production and development in this respect, such that developers weren't able to catch this locally?

You can think of a Dynflow world as a "runtime unit" of Dynflow. Usually a single process is a single Dynflow world (although that's not strictly necessary; in tests, for example, we run multiple worlds in a single process).

The world has an ID which is used for addressing other worlds when sending them messages. The problem here arises when multiple processes share the same world ID: it can lead to a situation where a process is waiting for a response, but that response is consumed by a different process with the same ID.

We differentiate two kinds of worlds: client and executor. Client worlds can send and receive messages, but they cannot execute tasks; clients are usually web server processes. Executor worlds, on the other hand, can do everything clients can, and can also execute tasks.

When an execution plan is created (by planning an action, any kind of world can do this), the world which created it sends out an execution request. Any executor world can then accept the request and perform the execution.
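
A rough model of that flow (all names invented, standing in for Dynflow's actual connector and messaging):

```ruby
# Sketch: a client world plans an execution and sends a request; an
# executor world accepts it, runs it, and replies.
request_queue = Queue.new  # stands in for Dynflow's connector

# Executor world: waits for execution requests and performs them.
executor = Thread.new do
  request = request_queue.pop
  request[:reply] << "executed plan #{request[:plan_id]}"
end

# Client world (e.g. a web server process): plans, sends the request,
# then blocks waiting for the result -- this is the wait that hangs
# when the reply is consumed by the wrong process.
reply = Queue.new
request_queue << { plan_id: 42, reply: reply }
result = reply.pop
executor.join
puts result  # => executed plan 42
```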

With Sidekiq in play, it gets a bit more confusing: only the orchestrator process is an executor world, and the Sidekiq workers are considered clients (they cannot accept execution requests on their own).

In development environments we initialize Dynflow on demand. Because of that, Dynflow would always get initialized in the workers separately.

TL;DR:

Together with @ekohl we put together a fix, https://github.com/theforeman/foreman/pull/7487, which makes Dynflow world initialization happen in the workers, not in the master.
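
For anyone curious, the general fork-safe pattern (illustrative only; see the PR for what actually landed) is to defer per-process initialization to an after-fork hook:

```ruby
# config/puma/production.rb (illustrative sketch, not the PR's contents)
preload_app!

on_worker_boot do
  # Initialize the Dynflow world here, after the fork, so each worker
  # builds its own world with a unique ID instead of inheriting the
  # master's. (The exact call depends on how Foreman wires up Dynflow.)
end
```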


Thank you @aruzicka and @ekohl for quickly getting us a fix.

The latest run had a new, related failure:

[ INFO 2020-03-04T17:46:29 verbose]  /Stage[main]/Katello::Application/File[/etc/foreman/plugins/katello.yaml]: Scheduling refresh of Class[Foreman::Service]
[ INFO 2020-03-04T17:46:29 verbose]  Foreman::Rake[db:seed]: Scheduling refresh of Exec[foreman-rake-db:seed]
[ WARN 2020-03-04T17:47:17 verbose]  /Stage[main]/Foreman::Database/Foreman::Rake[db:seed]/Exec[foreman-rake-db:seed]/returns: Apipie cache enabled but not present yet. Run apipie:cache rake task to speed up API calls.
[ WARN 2020-03-04T17:47:17 verbose]  /Stage[main]/Foreman::Database/Foreman::Rake[db:seed]/Exec[foreman-rake-db:seed]/returns: rake aborted!
[ WARN 2020-03-04T17:47:17 verbose]  /Stage[main]/Foreman::Database/Foreman::Rake[db:seed]/Exec[foreman-rake-db:seed]/returns: The Dynflow world was not initialized yet. If your plugin uses it, make sure to call Rails.application.dynflow.require! in some initializer
[ WARN 2020-03-04T17:47:17 verbose]  /Stage[main]/Foreman::Database/Foreman::Rake[db:seed]/Exec[foreman-rake-db:seed]/returns: /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-1.4.2/lib/dynflow/rails.rb:75:in `world'
[ WARN 2020-03-04T17:47:17 verbose]  /Stage[main]/Foreman::Database/Foreman::Rake[db:seed]/Exec[foreman-rake-db:seed]/returns: /opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-1.0.1/lib/foreman_tasks.rb:23:in `trigger'
[ WARN 2020-03-04T17:47:17 verbose]  /Stage[main]/Foreman::Database/Foreman::Rake[db:seed]/Exec[foreman-rake-db:seed]/returns: /opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-1.0.1/lib/foreman_tasks.rb:29:in `block in trigger_task'

That is because the Katello seeds run a sync task during seeding.

Long term I’d love to see Katello move to a model where the candlepin org is created only when it’s actually needed but I understand this isn’t easy to do.

Should we force lazy Dynflow init when we're in a rake task?