Crippling load after update to foreman 1.17

Problem:

Clustered install running Foreman 1.15: 4 Foreman nodes behind an F5 load balancer serving ~8000 clients, with an external PostgreSQL database and external Puppet masters, i.e. the Foreman nodes only do ENC and UI.

Upgraded to 1.17 via 1.16. Everything seemed OK with a small number of clients, but once we started loading the cluster with client requests again, CPU on all nodes went to 100%, crippling the cluster.

We added new Foreman nodes, up to 7 in total, but it made little difference, so eventually we rolled back to 1.16, which still runs at high load but is usable.

Foreman servers are VMware VMs with 6 cores and 8 GB of RAM.

Expected outcome:

Similar foreman server load after the upgrade

Foreman and Proxy versions:

foreman 1.17.0
foreman-proxy 1.17.0

Foreman and Proxy plugin versions:

foreman_default_hostgroup 4.0.1
foreman_memcache 0.1.0

Other relevant data:


Unsure if this is related, but we cannot create the apipie cache:

foreman-rake apipie:cache
2018-05-09 20:29:17 +1200 | Started
2018-05-09 20:29:18 +1200 | Documents loaded...
2018-05-09 20:29:18 +1200 | Processing docs for
rake aborted!
NoMethodError: undefined method `rebuild_methods' for #<Class:0x0000000b91b5a8>
/usr/share/foreman/vendor/ruby/2.3.0/gems/activerecord-5.1.4/lib/active_record/dynamic_matchers.rb:22:in `method_missing'
/usr/share/foreman/app/models/host/managed.rb:468:in `valid_rebuild_only_values'
/usr/share/foreman/config/initializers/apipie.rb:25:in `block (2 levels) in <top (required)>'
/usr/share/foreman/config/initializers/apipie.rb:33:in `block (3 levels) in <top (required)>'
/usr/share/foreman/config/initializers/apipie.rb:33:in `each'
/usr/share/foreman/config/initializers/apipie.rb:33:in `map'
/usr/share/foreman/config/initializers/apipie.rb:33:in `block (2 levels) in <top (required)>'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/application.rb:385:in `translate'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/method_description.rb:155:in `block in method_apis_to_json'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/method_description.rb:151:in `each'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/method_description.rb:151:in `each'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/method_description.rb:151:in `collect'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/method_description.rb:151:in `method_apis_to_json'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/method_description.rb:173:in `to_json'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/resource_description.rb:101:in `block in to_json'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/resource_description.rb:101:in `each'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/resource_description.rb:101:in `collect'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/resource_description.rb:101:in `to_json'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/application.rb:284:in `block in to_json'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/application.rb:283:in `each'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/application.rb:283:in `inject'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/application.rb:283:in `to_json'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/apipie/apipie_module.rb:13:in `to_json'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/tasks/apipie.rake:142:in `block (4 levels) in <top (required)>'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/tasks/apipie.rake:131:in `each'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/tasks/apipie.rake:131:in `block (3 levels) in <top (required)>'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/tasks/apipie.rake:286:in `with_loaded_documentation'
/usr/share/foreman/vendor/ruby/2.3.0/gems/apipie-rails-0.5.8/lib/tasks/apipie.rake:129:in `block (2 levels) in <top (required)>'
Tasks: TOP => apipie:cache
(See full trace by running task with --trace)

Additional info

Stracing the foreman-ruby Passenger workers only gave the following output:

strace -p 118136
strace: Process 118136 attached
select(19, [0 18], NULL, NULL, NULL

I could see normal activity with passenger-status, though.

And if I debug with gdb I just get:

[New Thread 0x7f1ac598f700 (LWP 53749)]
[Thread 0x7f1ac598f700 (LWP 53749) exited]
[New Thread 0x7f1ac598f700 (LWP 53750)]
[Thread 0x7f1ac598f700 (LWP 53750) exited]
[New Thread 0x7f1ac598f700 (LWP 53751)]
[Thread 0x7f1ac598f700 (LWP 53751) exited]

I looked at this last week. Here’s more info and an update.

I fixed the apipie cache error by setting unattended to true instead of false, and was then able to build the API caches.

It made no difference to the performance problem.

The problem occurs when this endpoint is hit:

/node/name.of.host?format=yml

What used to take seconds or less is now taking 20 seconds.

It occurs on 1.17 and on 1.16.1.

The problem is not in the database. All of the SQL queries take under 0.1 s; you can see the code making a lot of repeated queries that are all satisfied from the query cache.

These are the top repeated queries for one request to /node/nodename?format=yml:

   737 [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('CommonParameter') ORDER BY parameters.name
   738 [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('DomainParameter') AND "parameters"."reference_id" = $1 ORDER BY parameters.name [["reference_id", 2]]
   740 [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('GroupParameter') AND "parameters"."reference_id" = $1 ORDER BY parameters.name [["reference_id", 68]]
   737 [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('GroupParameter') AND "parameters"."reference_id" = $1 ORDER BY parameters.name [["reference_id", 7]]
   736 [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('GroupParameter') AND "parameters"."reference_id" = $1 ORDER BY parameters.name [["reference_id", 9]]
   736 [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('HostParameter') AND "parameters"."reference_id" = $1 ORDER BY parameters.name [["reference_id", 1047]]
   737 [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('OsParameter') AND "parameters"."reference_id" = $1 ORDER BY parameters.name [["reference_id", 1]]

The first number on each line is the number of times the query was repeated.

The full log, with debug and SQL debug enabled, is around 19,000 lines for one request.
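To find the worst offenders in a log that size, something like the following collapses identical query lines and counts them (a sketch; the here-doc stands in for the real log, and the regex assumes the timing format shown above):

```ruby
# Count repeated query lines in a Foreman debug log to spot pathological
# repetition. The here-doc is sample data; in practice read the production
# log file for one request instead.
log = <<~LOG
  [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('CommonParameter')
  [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('CommonParameter')
  [sql] [D] CACHE (0.0ms) SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('OsParameter')
LOG

counts = Hash.new(0)
log.each_line do |line|
  next unless line.include?('CACHE')
  # Drop the per-query timing so identical statements collapse to one key
  counts[line.sub(/\(\d+\.\d+ms\)\s*/, '').strip] += 1
end

# Print queries most-repeated first, count in the left column
counts.sort_by { |_, n| -n }.each { |query, n| puts format('%6d %s', n, query) }
```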

And from the app debug log, these are the most repeated lines for just one node request:

  148 f8b526d0 [app] [D] Cache read: host_group_matchers_inheritance
  284 f8b526d0 [app] [D] Cache read: safemode_render
  378 f8b526d0 [app] [D] Cache read: interpolate_erb_in_parameters
10523 f8b526d0 [app] [D] Cache read: Default_variables_Lookup_Path

So there’s something pathological going on in the code here.

We have a large setup: around 1,000 classes, 4,000 smart class parameters, 40 host groups, and only 2 smart variables.

This appears to be related to this forum post: Massive externalNode performance degradation after upgrade to 1.16

The query for parameters is done per host, where it tries to find the right match based on the hierarchy (common / domain / host group (and parent host groups) / OS / host parameters).

What I’m missing from your entry is whether this is within the same request (e.g. one ENC request for one Puppet-managed host) or across all of them.

As of 1.13 (AFAIR) you can configure the pattern section of `config/logging.yaml` to include a request ID, so it’s easier to grep the logs for just one request.
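For example (a sketch only; the exact keys in `config/logging.yaml` differ between Foreman versions, and `%X{request}` assumes the logging gem stores the request ID under the MDC key `request`):

```yaml
# config/logging.yaml -- illustrative fragment, check your version's file
:config:
  # %X{request} interpolates the per-request ID into every log line
  :pattern: "%d [%.1l|%c] [%X{request}] %m\n"
```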

I’ve tried to go over the SQL queries myself, but was unable to find anything that would impact performance dramatically based on my data set; we would need more logs or a database dump.

I have mixed feelings about this: at the end of the day it’s a successful cache read (i.e. it didn’t go to the DB to read the value, and if you are using memcache it read from memcache rather than the filesystem). However, can we avoid re-reading the same cache within the same request in some cases? Perhaps. I would probably start with the lookup_path variable and see if it makes any difference.
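Such a per-request memoization could look roughly like this (class and method names here are hypothetical, not Foreman's actual code; the real fix would live inside the ENC rendering path):

```ruby
# Sketch of request-local memoization: repeated reads of the same cache key
# hit an in-process Hash instead of going back to the cache backend each time.
class EncCacheMemo
  def initialize(cache)
    @cache = cache  # the real backend, e.g. Rails.cache
    @memo  = {}     # lives for one request, then discarded
  end

  def read(key)
    # Hash#fetch with a block only calls the backend on the first miss
    @memo.fetch(key) { |k| @memo[k] = @cache.read(k) }
  end
end

# Fake backend that counts how often it is actually hit
backend_reads = 0
backend = Object.new
backend.define_singleton_method(:read) do |key|
  backend_reads += 1
  { 'Default_variables_Lookup_Path' => %w[fqdn hostgroup os domain] }[key]
end

memo = EncCacheMemo.new(backend)
3.times { memo.read('Default_variables_Lookup_Path') }
puts backend_reads  # one backend hit instead of three
```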

Do note: if you are using memcache, this means the ENC lookup made ~11K requests to memcache.

The API cache issue is: Bug #23536: apipie:cache:index NoMethodError when unattended=false - Foreman
A workaround is to set unattended to true in the settings before running the cache creation task.