Massive externalNode performance degradation after upgrade to 1.16

Also worth mentioning the very large number of cached SQL queries; maybe we need to store the result in a local variable instead of going back to Active Record each time?
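The local-variable suggestion above is essentially memoization. A minimal sketch of the idea (class and method names are hypothetical, not Foreman's actual code; the lambda stands in for an Active Record query):

```ruby
# Hypothetical sketch of memoizing a repeated parameter lookup.
# HostInfoProvider is an illustrative name, not Foreman's real class;
# the lambda passed in stands in for an Active Record query.
class HostInfoProvider
  def initialize(parameter_lookup)
    @parameter_lookup = parameter_lookup
  end

  # `||=` stores the result on the first call, so later callers reuse it
  # instead of going back to Active Record - even Rails' query cache has
  # per-call overhead when it is hit hundreds of times per request.
  def common_parameters
    @common_parameters ||= @parameter_lookup.call
  end
end

query_count = 0
provider = HostInfoProvider.new(-> { query_count += 1; { 'ntp_server' => 'pool.ntp.org' } })
5.times { provider.common_parameters }
puts query_count  # => 1: the lookup ran once despite five calls
```

With this pattern the hundreds of identical CACHE lines per request would collapse into a single lookup per render.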

I don’t want to blame anybody, but in 1.16 we introduced a redesign of host info: providers were introduced to allow extending the ENC data. This was part of:

http://projects.theforeman.org/issues/15711

This could possibly be the root cause, but unfortunately it is a rather complex code change and I am unable to tell without further analysis. Creating a simple benchmark and running it before and after the change would perhaps tell us.

Thank you very much for your analysis. You can reproduce this with the ENC script (as provided here: Foreman :: Manual, e.g. $ time ./node.rb $(hostname -f) > /dev/null) or just by clicking on “YAML” in the Foreman web UI.

With the depth of this topic and with these Foreman and Ruby internals, I am at my wit’s end.

Do you need any further debug information or do you have any tips on how to trace this problem at Ruby level?

This is far from an analysis; I was just looking into git, trying to figure out which files changed in 1.16. I might be very wrong.

Can you describe in more detail the host you are experiencing slow performance with? I mean, can you pastebin the actual ENC output? Make sure all sensitive data is redacted, but it would be useful to see how much data we are talking about. A benchmark would also use the same or similar input.

You can find an example here: https://pastebin.com/Kc8shXgw
Due to the anonymization the YAML might have syntax errors - sorry about that.

Meanwhile we did the following test with a host and the smart class parameters:

  • Host with 17 included puppet classes with a few activated smart class parameters (configure -> Smart Class Parameters -> “Override” selected): ~ 800ms
  • Host with 17 included puppet classes with some more activated smart class parameters (configure -> Smart Class Parameters -> “Override”): ~ 3000ms
  • Host with 17 included puppet classes with some more activated smart class parameters (configure -> Smart Class Parameters -> “Override”) and overwritten on the host: ~ 6000ms

Thanks for the snippet, that is fine. Just please confirm that the amount of data is roughly correct. This does not look like a huge ENC output.

The class and smart class parameter code is a bit complicated and I haven’t looked into it yet, so I will let some more experienced devs comment on that - everything above one second is IMHO bad performance.

Yes, the amount of data is roughly correct. Of course the public ssh keys are a bit longer :P, but the number of parameters is correct.

We have the same problem.

We had a separate post before the community vm problem - but really our data belongs here.

@ohadlevy was replying to me on that other thread - but I think this data belongs on this thread now since it looks like a related problem.

We upgraded from 1.15 to 1.17 and suffered huge performance degradation which rendered foreman unusable.

We downgraded to 1.16 and had the same problem. It was fixed when we went back to 1.15.

In 1.15 a single node query – /node/node.name?format=yml took 0.8 seconds.

In 1.16 it takes around 19 seconds, which renders foreman unusable because it’s burning 100% of one CPU during this time. So we end up burning all CPUs on our 7 foreman servers (which only run foreman) and our puppet runs start failing because of timeouts.

I believe the same problem was true for 1.17 but I did not see the logs personally to confirm it. All I know is the foreman servers were burning all of the CPUs they had handling the /node/node.name?format=yml endpoint.

Overall we have around 1,000 classes, 4,000 smart class variables, 8,000 hosts, 18 puppet environments, 1 smart variable.

The ENC for the host from the log entries below has 67 classes, 148 class parameters and 13 parameters.

The following is an extract from a single host request to /node/single.node.name?format=yml.

I used the request ID to grep every line from the log for the single /node call, cut off the timestamps and did a sort | uniq -c.

The number on the front of each line is the number of times that line appeared in the log for a single /node/ request for one node.

    737 [sql] [D]   CACHE (0.0ms)  SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('CommonParameter')  ORDER BY parameters.name
    738 [sql] [D]   CACHE (0.0ms)  SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('DomainParameter') AND "parameters"."reference_id" = $1  ORDER BY parameters.name  [["reference_id", 2]]
    740 [sql] [D]   CACHE (0.0ms)  SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('GroupParameter') AND "parameters"."reference_id" = $1  ORDER BY parameters.name  [["reference_id", 68]]
    737 [sql] [D]   CACHE (0.0ms)  SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('GroupParameter') AND "parameters"."reference_id" = $1  ORDER BY parameters.name  [["reference_id", 7]]
    736 [sql] [D]   CACHE (0.0ms)  SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('GroupParameter') AND "parameters"."reference_id" = $1  ORDER BY parameters.name  [["reference_id", 9]]
    736 [sql] [D]   CACHE (0.0ms)  SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('HostParameter') AND "parameters"."reference_id" = $1  ORDER BY parameters.name  [["reference_id", 1047]]
    737 [sql] [D]   CACHE (0.0ms)  SELECT "parameters".* FROM "parameters" WHERE "parameters"."type" IN ('OsParameter') AND "parameters"."reference_id" = $1  ORDER BY parameters.name  [["reference_id", 1]]
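The grep/cut/sort tally described above can also be sketched in a few lines of Ruby, for anyone who wants to repeat the analysis; the log lines here are a hypothetical stand-in for real production log entries:

```ruby
# Ruby sketch of the "cut off timestamps, sort | uniq -c" tally.
# These lines are a hypothetical stand-in for real Foreman log entries.
log_lines = [
  '2018-05-01T10:00:00 [sql] [D] CACHE SELECT "parameters".*',
  '2018-05-01T10:00:01 [sql] [D] CACHE SELECT "parameters".*',
  '2018-05-01T10:00:02 [app] [I] Rendered node.yml',
]

counts = Hash.new(0)
log_lines.each do |line|
  counts[line.sub(/\A\S+\s+/, '')] += 1  # strip the leading timestamp
end

# Print the most frequent lines first, like `sort | uniq -c | sort -rn`
counts.sort_by { |_, n| -n }.each do |message, n|
  puts format('%5d %s', n, message)
end
```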

I can do the same logging on 1.15 by comparison if that would help and I can provide the full log from 1.16. We can run up a test 1.17 or 1.16 and put our data into it for testing as well.

We cannot upgrade to 1.16 or 1.17 at the moment because foreman is unusable due to this performance problem.

I believe I’m also suffering from this issue. ~13,000 hosts, ~800 classes per environment, ~20 environments. Probably 20 to 120 smart class params per host.

Unlike others, we did not fall back though. We have 18 puppet masters with 16 cores and 55 GB RAM each, and allow something like 24 max passenger processes per host. Previously this gave us overkill redundancy, with the ability to lose 66 percent of our puppet masters and still function fine… Now the headroom is a lot less, but it can “keep up” overall. Hope these numbers help anyone who wants to upgrade and has virtual resources to spare to scale out for the time being…

We definitely notice the YAML button in the UI is very slow compared to 1.15.x - it’s the easiest way to see something is amiss. Would love to see this fixed… once we figure out what it is :)

Can any of you share your db? It would make the developers’ lives a lot easier in reproducing the issue.

We have what looks like a similar problem.

Ignore the API build problems in that original post.

Ignore the above - it’s from the restore of the data. I can’t delete the post.

I’m trying to find an anonymiser for the database - can you suggest one that can be used?
I’ve tried https://github.com/DivanteLtd/anonymizer - didn’t work.

I’m currently trying https://github.com/realrolfje/anonimatron

If @bryder can find an anonymizer that works, I can also take a stab. My database is many, many GB though - so I wouldn’t even know where to put it should I be able to anonymize it.

I spent a lot of time trying to make anonimatron work which in the end required me to run it in a debugger to find out what was wrong with my configuration.

Once I got past that I discovered that it only supports a tiny subset of the postgres types and sadly I don’t have the time to implement the extra types we need. It doesn’t help that I’m not the fastest java programmer!

Unfortunately we cannot share our db, but you can easily reproduce this behavior:

  • Install a Foreman instance (reproducible with Foreman 1.16 and 1.17)
  • Be sure the setting “Enable smart variables in ENC” is true
  • Add some puppet classes that offer smart class parameters
  • Add these classes to one host and overwrite the puppet class parameters.

-> The more parameters are overwritten, the higher the response time will be.
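The claim that response time grows with the number of overwritten parameters can be quantified with Ruby's standard-library Benchmark. This sketch times a stand-in workload; in a real test you would time the /node/&lt;fqdn&gt;?format=yml request instead:

```ruby
require 'benchmark'

# Hypothetical stand-in workload whose cost grows with the number of
# overridden parameters; it is NOT Foreman's real ENC rendering code.
def render_enc(override_count)
  override_count.times { |i| 200.times { Math.sqrt(i + 1) } }
end

# Benchmark.realtime returns the wall-clock seconds the block took
timings = [10, 100, 1000].map do |n|
  [n, Benchmark.realtime { render_enc(n) }]
end

timings.each { |n, t| puts format('%4d overrides: %.5fs', n, t) }
```

Running the same measurement before and after a code change (or against 1.15 vs. 1.16) would show whether the slowdown really scales with override count.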

So, perhaps we have the solution:
https://github.com/theforeman/foreman/pull/5583

The “new” code fixes the timing problem on our test systems, but it still needs to be tested and merged. It also works in Foreman 1.16 :)


Good to hear there’s probably a fix for the problem now!

I figured out how to use anonimatron in the end, but there’s still a lot of work to configure it properly for the foreman database (basically working through which columns should and should not be anonymised).

I’ll stop doing that now since it looks like it’s not needed.

I finished off what I can of this without looking into anonimatron bugs.

The final status is:

It currently cannot anonymise settings, subnets, and notification_blueprints due to bugs in the engine. nics and operating_systems have field sizes too small for the default hash, and notification_blueprints has no primary key, so anonimatron will not try to anonymise it.

I put what I have done up here: https://github.com/bryder/anonimatron_anonymizers in case I ever need to come back to it or someone wants the unfinished work.


The PR has been merged and we are working on backporting. Nice catch!
