Different response format on identical requests when executed in parralel

Problem:
Whenever I run multiple POST requests to create discovery rules in parralel, I ocassionally get a 201 response with a different payload format.
I’m unable to reproduce this if I execute the requests sequentially. More parralel requests I execute, bigger change of hitting this issue:

{"discovery_rule":{"id":742,"name":"57531206","search":"foo","hostgroup_id":125,"hostname":"foo","max_count":0,"priority":29227,"enabled":true,"created_at":"2020-09-16T10:33:11.220Z","updated_at":"2020-09-16T10:33:11.220Z"}}
{"id":745,"name":"304796992","search":"foo","hostgroup_id":125,"hostname":"foo","max_count":0,"priority":17120,"enabled":true,"created_at":"2020-09-16T10:33:12.128Z","updated_at":"2020-09-16T10:33:12.128Z"}

Expected outcome:
Consistent payload format

Foreman and Proxy versions:
foreman-2.1.2.11-2.el7sat.noarch
foreman-proxy-2.1.2-2.el7sat.noarch

Foreman and Proxy plugin versions:
# rpm -qa foreman-*
foreman-postgresql-2.1.2.11-2.el7sat.noarch
foreman-selinux-2.1.2.1-1.el7sat.noarch
foreman-ec2-2.1.2.11-2.el7sat.noarch
foreman-service-2.1.2.11-2.el7sat.noarch
foreman-debug-2.1.2.11-2.el7sat.noarch
foreman-vmware-2.1.2.11-2.el7sat.noarch
foreman-dynflow-sidekiq-2.1.2.11-2.el7sat.noarch
foreman-bootloaders-redhat-tftpboot-202005201200-1.el7sat.noarch
foreman-proxy-2.1.2-2.el7sat.noarch
foreman-discovery-image-3.6.5-1.el7sat.noarch
foreman-installer-2.1.2.6-1.el7sat.noarch
foreman-installer-katello-2.1.2.6-1.el7sat.noarch
foreman-openstack-2.1.2.11-2.el7sat.noarch
foreman-bootloaders-redhat-202005201200-1.el7sat.noarch
foreman-gce-2.1.2.11-2.el7sat.noarch
foreman-libvirt-2.1.2.11-2.el7sat.noarch
foreman-ovirt-2.1.2.11-2.el7sat.noarch
foreman-cli-2.1.2.11-2.el7sat.noarch

`# rpm -qa katello-*
katello-default-ca-1.0-1.noarch
katello-client-bootstrap-1.7.5-1.el7sat.noarch
katello-debug-3.16.0-1.el7sat.noarch
katello-server-ca-1.0-1.noarch
katello-certs-tools-2.7.1-1.el7sat.noarch
katello-selinux-3.4.0-1.el7sat.noarch
katello-ca-consumer-dhcp-8-29-228.lab.eng.rdu2.redhat.com-1.0-5.noarch
katello-common-3.16.0-1.el7sat.noarch`

Distribution and version:
Red Hat Enterprise Linux Server release 7.8 (Maipo)

Other relevant data:

# grep 6b772cdc /var/log/foreman/production.log
2020-09-16T05:09:45 [I|app|6b772cdc] Started POST "/api/v2/discovery_rules" for 127.0.0.1 at 2020-09-16 05:09:45 -0400
2020-09-16T05:09:45 [I|app|6b772cdc] Processing by Api::V2::DiscoveryRulesController#create as JSON
2020-09-16T05:09:45 [I|app|6b772cdc]   Parameters: {"discovery_rule"=>{"name"=>"<address>mclMyFZCkHBOTmESILmFzbGKdHXRPTMepDJ</address>", "hostname"=>"myhost-<%= rand(99999) %>", "priority"=>546, "hostgroup_id"=>125, "location_ids"=>[259], "organization_ids"=>[255], "search"=>"memory > 500"}, "apiv"=>"v2"}
2020-09-16T05:09:45 [D|app|6b772cdc] Authenticated user admin against INTERNAL authentication source
2020-09-16T05:09:45 [D|app|6b772cdc] Post-login processing for admin
2020-09-16T05:09:45 [I|app|6b772cdc] Authorized user admin(Admin User)
2020-09-16T05:09:45 [D|app|6b772cdc] Post-login processing for admin
2020-09-16T05:09:45 [D|tax|6b772cdc] Current location set to none
2020-09-16T05:09:45 [D|tax|6b772cdc] Current organization set to none
2020-09-16T05:09:45 [D|tax|6b772cdc] Current location set to none
2020-09-16T05:09:45 [D|tax|6b772cdc] Current organization set to none
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on name <address>mclMyFZCkHBOTmESILmFzbGKdHXRPTMepDJ</address>
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on search memory > 500
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on hostgroup_id 125
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on hostname myhost-<%= rand(99999) %>
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on max_count 0
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on priority 546
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on enabled true
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on organization_ids [255]
2020-09-16T05:09:45 [I|aud|6b772cdc] DiscoveryRule (442) create event on location_ids [259]
2020-09-16T05:09:45 [D|app|6b772cdc] Body: {"discovery_rule":{"id":442,"name":"\u003caddress\u003emclMyFZCkHBOTmESILmFzbGKdHXRPTMepDJ\u003c/address\u003e","search":"memory \u003e 500","hostgroup_id":125,"hostname":"myhost-\u003c%= rand(99999) %\u003e","max_count":0,"priority":546,"enabled":true,"created_at":"2020-09-16T09:09:45.291Z","updated_at":"2020-09-16T09:09:45.291Z"}}
2020-09-16T05:09:45 [I|app|6b772cdc] Completed 201 Created in 244ms (Views: 0.5ms | ActiveRecord: 95.8ms | Allocations: 50630)
 
===========
 
# grep 2605d72c /var/log/foreman/production.log
2020-09-16T05:10:42 [I|app|2605d72c] Started POST "/api/v2/discovery_rules" for 127.0.0.1 at 2020-09-16 05:10:42 -0400
2020-09-16T05:10:42 [I|app|2605d72c] Processing by Api::V2::DiscoveryRulesController#create as JSON
2020-09-16T05:10:42 [I|app|2605d72c]   Parameters: {"discovery_rule"=>{"name"=>"<frameset>qyPzhDnC</frameset>", "hostname"=>"myhost-<%= rand(99999) %>", "priority"=>463, "hostgroup_id"=>122, "location_ids"=>[256], "organization_ids"=>[252], "search"=>"memory > 500"}, "apiv"=>"v2"}
2020-09-16T05:10:42 [D|app|2605d72c] Authenticated user admin against INTERNAL authentication source
2020-09-16T05:10:42 [D|app|2605d72c] Post-login processing for admin
2020-09-16T05:10:42 [I|app|2605d72c] Authorized user admin(Admin User)
2020-09-16T05:10:42 [D|app|2605d72c] Post-login processing for admin
2020-09-16T05:10:42 [D|tax|2605d72c] Current location set to none
2020-09-16T05:10:42 [D|tax|2605d72c] Current organization set to none
2020-09-16T05:10:42 [D|tax|2605d72c] Current location set to none
2020-09-16T05:10:42 [D|tax|2605d72c] Current organization set to none
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on name <frameset>qyPzhDnC</frameset>
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on search memory > 500
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on hostgroup_id 122
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on hostname myhost-<%= rand(99999) %>
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on max_count 0
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on priority 463
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on enabled true
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on organization_ids [252]
2020-09-16T05:10:42 [I|aud|2605d72c] DiscoveryRule (462) create event on location_ids [256]
2020-09-16T05:10:42 [D|app|2605d72c] Body: {"id":462,"name":"\u003cframeset\u003eqyPzhDnC\u003c/frameset\u003e","search":"memory \u003e 500","hostgroup_id":122,"hostname":"myhost-\u003c%= rand(99999) %\u003e","max_count":0,"priority":463,"enabled":true,"created_at":"2020-09-16T09:10:42.714Z","updated_at":"2020-09-16T09:10:42.714Z"}
2020-09-16T05:10:42 [I|app|2605d72c] Completed 201 Created in 146ms (Views: 0.4ms | ActiveRecord: 41.1ms | Allocations: 28787)

here’s my reproducer (mind that you might need to create and update other entity ids accordingly to your setup:

for i in {1..10}; do curl -sku admin:changeme -H "content-type: application/json" -d "{\"name\": \"$RANDOM$RANDOM\", \"hostname\": \"myhost-<%= rand(99999) %>\", \"priority\": $RANDOM, \"hostgroup_id\": 125, \"location_ids\": [259], \"organization_ids\": [255], \"search\": \"memory > 500\"}" https://<sat_server>/api/v2/discovery_rules; echo ""; done

I identified some other API resources with the same problem. Looks like this might be a general API issue.
I’m looking further into this and am updating a list of affected resources here:

Also, I filed a BZ: 1879587 – simultaneous POST requests to some endpoints result in Different response payload format
(will change the component and the title accordingly as soon as I’ll get more data)

Roman just confirmed this happens on puma based deployment only, with passenger it works. An immediate (but very poor) workaround is to configure puma to use single thread. To do so, use the following installer argument --foreman-foreman-service-puma-threads-max 1. The other option is to switch back to passenger, for that, use --foreman-passenger true

That’s quite odd, the response content should be generated by the foreman application, regardless of which webserver it is running under.
However, I did find one thing in common for all three endpoints identified: They don’t have matching view templates.
I wounder if copying e.g. <$discovery_plugin>/app/views/api/v2/discovery_rules/show.json.rabl to <$discovery_plugin>/app/views/api/v2/discovery_rules/create.json.rabl would lead to any difference?

This article does show that it’s very easy in controllers to make threading mistakes, as yet another example that thread safe code is hard.
https://bearmetal.eu/theden/how-do-i-know-whether-my-rails-app-is-thread-safe-or-not/

Memoization is used a lot in our codebase. Probably not always problematic, but there’s a good chance we’re doing something wrong since we’ve always ran with Passenger and didn’t use threads. That means it’s likely we’re only just uncovering how thread-unsafe Foreman really is.

spot on,
this seems to did the trick and i no longer spot the inconsistent behavior.

How memorization can be a problem in a controller? Each request has its own controller instance, therefore memorization within a controller cannot be a problem. I believe the linked article is wrong about this.

Anyway, some time ago we made a shoking switch in our codebase by enabling threadsafe! in the codebase. This was done without a good understanding of consequences, I was asking for this and IIRC the response was something “it works in dev just fine”. And here we are.

I suggest to configure Puma with only single thread and properly plan some review and testing.

After reading the memorization section I can say it’s very misleading, I should say even completely wrong in the controller context where every request has its own instance. The only valid point for Rails controllers is sharing data via class variables which is a bad in any multi-threaded application (it’s a global state). Also point about context switch in ||= is not applicable in a controller, however that makes sense in a shared instance.

Anyway, catching these kind of bugs can be a challenge. It does not need to be actually something in our code, we carry ton of libraries and I believe they are the biggest problem when it comes to thread safety - it’s almost impossible to tell which are safe.

In this particular case, I’d probably start digging in the RABL codebase because there is some caching going on. A cache = global state. Therefore this code must be thead safe. I think it’s more likely these are bugs in libraries than in our Ruby on Rails code.

Okay that was fast:

I am not saying this hasn’t been fixed but after 10 minutes of searching I could not find much improvements, there was a one patch fixing one particular problem which was not Rails related.

What is the expected behavior? I don’t understand Rails too much, is this some convention over configuration? What exactly is this supposed to pick? Shouldn’t we see some debug messages with more info about what template Rails are picking to render the response?

Ok I tried to create a pure Rails app, when create.html.erb is missing, an error is thrown:

Started GET "/welcome/create" for 127.0.0.1 at 2020-09-18 15:21:15 +0200
Processing by WelcomeController#create as */*
No template found for WelcomeController#create, rendering head :no_content
Completed 204 No Content in 1ms (ActiveRecord: 0.0ms | Allocations: 211)

It looks like there is some fallback mechanism in RABL when template is not found it picks index.json.rabl.

Anyway, I spotted a global variable (hash) in RABL and created a mutex for it. Unfortunately it does not fix the problem. I asked RABL devs for more insights: https://github.com/nesquena/rabl/pull/738

Thanks for digging into this with Rabl @lzap . How are you feeling about it being a successful vector? Do we need to investigate what a switch to Rails supported JBuilder or it’s supposedly faster related project jb would take for the project?

No hard feelings but the library is definitely not thread safe and there are no plans in doing so. I gave it a day trying to spot the problem but this needs someone more familiar with Rails stack, specifically how views and rendering work. The misbehavior really suggests this has something to do with caching, because a different output is returned.

Replacing such stack will be painful and can possibly introduce issues, so I’d perhaps ask first if we really need such thing? Isn’t simply as_json methods approach good enough? It will be probably the fastest option as well. This needs some bigger research.

I did some investigation of this issue. The TLDR of the story is: this is not a problem with Rabl, in fact, this particular API does not use Rabl rendering at all.

The Rabl gem (and Foreman) hook into the rendering process by supplying a template handler to ActionView. The template handling engine looks through available handlers performing a lookup based on the name of the handler and the available file extensions. Since there is not a create.json.rabl present, the rendering engine skips the Rabl handler all together. For example, when hitting api/v2/discovery_rules/393 in the logs you can see the rendering path:

2020-09-30T19:01:34 [I|app|1f6417f1]   Rendering /opt/theforeman/tfm/root/usr/share/gems/gems/foreman_discovery-16.2.0/app/views/api/v2/discovery_rules/show.json.rabl

OK, so what is happening here?

When a POST is made to api/v2/discovery_rules in order to create a discovery rule, the rendering engine finds no corresponding template handler (i.e. Rabl template) to process and falls back to standard Rails processing methods. That is, the object is converted to_json and then sent back in the response.

Why do we get two different response formats then?

The fallback in Foreman’s API layer is to follow the API V2 standard of not including the root object type in the response by disabling this configuration. This disable is done per response, not application wide.

Conclusion

The move to Puma in a multi-thread deployment exposed a place in the Foreman code (that is over 6 years old) where per response we are attempting to manipulate a class level attribute which is not thread safe. You can see from this output where the processing starts to get out of order setting and re-setting this:

Sep 30 18:48:42 centos7-foreman-nightly.war.example.com foreman[1524]: DISABLE JSON ROOT
Sep 30 18:48:42 centos7-foreman-nightly.war.example.com foreman[1524]: PROCESS RESPONSE
Sep 30 18:48:42 centos7-foreman-nightly.war.example.com foreman[1524]: RESPONSE: 223177500
Sep 30 18:48:42 centos7-foreman-nightly.war.example.com foreman[1524]: RE-ENABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: DISABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: DISABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: PROCESS RESPONSE
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: RESPONSE: 292601022
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: RE-ENABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: PROCESS RESPONSE
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: RESPONSE: 184514629
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: RE-ENABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: DISABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: PROCESS RESPONSE
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: RESPONSE: 1101731735
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: RE-ENABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: DISABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: PROCESS RESPONSE
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: RESPONSE: 2178710316
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: RE-ENABLE JSON ROOT
Sep 30 18:48:43 centos7-foreman-nightly.war.example.com foreman[1524]: DISABLE JSON ROOT

I have proposed the following fixes:

  • Add create.json.rabl to foreman_discovery for discovery rules PR
  • Elevate disabling json root to the application boot PR
4 Likes

Great job! Honestly, I hate convention over configuration when this is not documented. I still don’t understand how Rails should behave when there is no template. On a new RoR6 app an error is thrown. Why in our case Rails attempts to render via to_json?

Anyway, I agree with the fix for both core and discovery.

Just doing some house keeping.
This has been resolved long time ago, during the Covid Era (remember that?)