Performance Degradation after 1.15 -> 1.16 Upgrade


Slow performance using the GUI, clients timing out, etc. We have over 300 hosts communicating with this server; since the upgrade, approximately 75% of them time out when retrieving the manifest.

Expected outcome:

Performance would hopefully not change.

Foreman and Proxy versions:

OS: redhat
RELEASE: CentOS Linux release 7.3.1611 (Core)
FOREMAN: 1.16.1
RUBY: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
PUPPET: 3.8.7

Foreman and Proxy plugin versions:
foreman-tasks (0.10.9)
foreman-tasks-core (0.2.4)
foreman_remote_execution (1.3.7)
foreman_remote_execution_core (1.1.0)
foreman_setup (5.0.0)
foreman_xen (0.5.2)

Other relevant data:
I can provide the debug data archive on request - I’m not aware of what information is in there, so I haven’t included it by default.

It’s very hard to understand the issue from your post. It would be really useful if you could point to some specific UI / API actions that are slow, and add the relevant logs from when clients time out, etc.

@lzap, is there any guide on how to troubleshoot performance using the new telemetry framework (or any other suggestions)?

Hey, telemetry was introduced in 1.17, but the basic investigation is to find the slow endpoints. You can do this easily by grepping production.log. Sure, use foreman-debug -u to upload logs - that would help. Or you can simply tell us which pages in particular are slow? I don’t believe that everything is slow.
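For anyone wanting to automate that, here is a rough sketch (not an official Foreman tool) that scans a Rails-style production.log for "Completed ... in Nms" lines and reports the slowest endpoints; the log line format is assumed from typical Rails logs:

```ruby
# Scan Rails-style log text for slow requests.
# Assumed log format (standard Rails):
#   Started GET "/hosts" for 10.0.0.1 at 2018-05-17 10:25:58
#   Completed 200 OK in 4523ms (Views: 120.3ms | ActiveRecord: 4100.1ms)
def slow_requests(log_text, threshold_ms: 1000)
  slow = []
  current_endpoint = nil
  log_text.each_line do |line|
    if (m = line.match(/Started \S+ "([^"]+)"/))
      # Remember the endpoint of the request currently being processed
      current_endpoint = m[1]
    elsif (m = line.match(/Completed \d+ .* in (\d+)ms/))
      ms = m[1].to_i
      slow << [current_endpoint, ms] if ms >= threshold_ms
    end
  end
  # Slowest first
  slow.sort_by { |(_, ms)| -ms }
end
```

Running this over a day's production.log should make it obvious whether one endpoint (e.g. the host list or report upload) dominates, or whether everything really is slow.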

Also, do a rough analysis of the host - how much memory is consumed, is it swapping, what is the I/O or CPU load?

Isn’t 1.16 the version where we introduced the power icon in the host list? That can generate quite a lot of I/O wait; it’s configurable though. Just guessing.

We have just upgraded to 1.17 and have the same problem. This is an idle host:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND        
12447 foreman   20   0 2275068 1.574g   1260 S  99.3 21.6   2:08.61 foreman-ruby   
12664 foreman   20   0  964976 409660    276 S  99.3  5.3   0:31.42 foreman-ruby   
12587 foreman   20   0 1685776 1.035g    668 S  98.7 14.2   1:00.90 foreman-ruby   
12722 foreman   20   0 1357580 765016   1160 S  98.3 10.0   0:13.02 foreman-ruby 

One of my team reported that it’s doing a full table scan:

mysql> explain DELETE FROM `reports` WHERE `reports`.`type` IN ('ConfigReport') AND (reports.created_at < '2018-05-17 10:25:58');
| id | select_type | table   | partitions | type | possible_keys                                           | key  | key_len | ref  | rows   | filtered | Extra       |
|  1 | DELETE      | reports | NULL       | ALL  | index_reports_on_type,index_reports_on_type_and_host_id | NULL | NULL    | NULL | 573019 |   100.00 | Using where |

They also pointed to this bad query:

SELECT `fact_values`.* FROM `fact_values`

Deleting the whole database seemed to fix the problem.

We currently have months of reports stored, so if it is querying everything, then clearly deleting it all would fix that.

Full table scans are indeed bad - could you open a bug?

Actually that was only when running foreman-rake reports:expire days=1. But one of my team reported that this was running repeatedly:

 Id: 76923
   User: foreman
   Host: ip-10-138-87-133.ec2.internal:51140
     db: foreman
Command: Query
   Time: 45
  State: Sending to client
   Info: SELECT `reports`.* FROM `reports` WHERE `reports`.`type` IN ('ConfigReport')

I’ll report the expire bug though. An index on the right column might help some other things anyway.
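For reference, the full scan in the EXPLAIN output above could likely be avoided with a composite index covering both columns in the WHERE clause; a hypothetical DDL sketch (the index name is illustrative, not from an actual Foreman migration):

```sql
-- Hypothetical: lets MySQL filter on type and range-scan created_at,
-- instead of reading all ~573k rows (type=ALL in the EXPLAIN above).
CREATE INDEX index_reports_on_type_and_created_at
  ON reports (type, created_at);
```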


The way we are expiring reports is terrible and very transaction-unfriendly. I started an attempt to fix this in

but I don’t have time to finish this. The idea is to break the deletion into small transactions (batches).
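A minimal sketch of that batching idea in plain Ruby (this is not the actual patch; the helper name and block are illustrative, standing in for ActiveRecord's transaction + delete_all):

```ruby
# Delete expired records in fixed-size chunks, each chunk in its own
# short transaction, instead of one huge DELETE holding locks for the
# whole table. The block stands in for the per-batch DB work.
def expire_in_batches(ids, batch_size: 1000)
  deleted = 0
  ids.each_slice(batch_size) do |batch|
    # In Rails this would be roughly:
    #   Report.transaction { Report.where(id: batch).delete_all }
    yield batch if block_given?
    deleted += batch.size
  end
  deleted
end
```

Small transactions keep lock hold times and undo/redo volume bounded, so routine expiry no longer competes with report uploads for minutes at a time.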

BUT I don’t think this is the root cause of the Ruby processes burning CPU. The reports expiration is more likely to create transaction errors than to loop doing hard work.