Rake reports cleanup task slowness fix

Hello,

I’ll preface this by saying I’m not a Ruby developer :slight_smile:

I’ve seen quite a number of bug reports about how slow the rake task that cleans up the various report-related tables can get. We have a large deployment and hit this issue every time the task runs.

The first task takes approximately 15 minutes to run:

/usr/sbin/foreman-rake reports:expire days=1 status=0

And this task takes anywhere from 3-8 hours depending on DB load, plus around 12 hours for the database slaves to eventually catch up:

/usr/sbin/foreman-rake reports:expire days=7

I’ve started to dig into the queries that this task invokes, the first of which is:

DELETE FROM `logs` WHERE `logs`.`report_id` IN (SELECT `reports`.`id` FROM `reports`  WHERE (reports.created_at < '2018-01-10 10:35:40' and reports.status = 0));

So I set about optimizing this delete. This gist has my result set:

In this instance, I was running the query against a new slave we set up, sitting on VMware infrastructure backed by a blade chassis and a 3PAR SAN, so raw hardware performance isn’t an issue.

As you can see, the initial run of the query (after flushing all the caches) took 2 min 43.62 sec to return 506 rows. It required a full table scan and hit the disk for ~6 GB of read ops.

After recrafting the query to use an inner join (https://dev.mysql.com/doc/refman/5.7/en/delete.html) to build the delete set, the query drops to 1.85 seconds for the same 506 rows, with 165 MB of disk read ops.
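
For reference, the recrafted delete looks roughly like this (a sketch of the form I tested, mirroring the derived-table SELECT in the EXPLAIN further down; the full timing runs are in the gist):

DELETE `logs` FROM `logs`
  INNER JOIN (SELECT `reports`.`id` FROM `reports`
               WHERE reports.created_at < '2018-01-10 10:35:40'
                 AND reports.status = 0) rep
          ON `logs`.`report_id` = rep.id;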

I was planning on submitting a PR, but it appears that ActiveRecord doesn’t actually support DELETE with a JOIN, so as far as I can tell this fix would require executing raw SQL.

Would that be acceptable? Either way, I’d appreciate thoughts and opinions on this fix, as I intend to patch our own installation regardless :slight_smile:

Thanks!
Amy


Thank you for the input and profiling!
Since we support both PostgreSQL and MySQL, we need to figure out a solution that works for both.
Could you also please run EXPLAIN on the two queries and provide us with the output? How many records do you have in each of these tables?

Also, what happens if you try adding one of the following indexes:

CREATE INDEX temp ON reports(created_at, status)

or:

CREATE INDEX temp ON reports(status, created_at)

This is indeed a problematic part of Foreman: it’s slow, and it also locks those tables exclusively for quite some time, which leads to transaction rollbacks. We definitely need to improve this. I’d consider deleting in batches (10k records).
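
Something like this, as a rough untested sketch (the loop would repeat until no rows are affected; note that LIMIT on DELETE is MySQL-only, so PostgreSQL needs a keyed subselect):

-- MySQL: single-table DELETE accepts LIMIT; rerun until 0 rows affected
DELETE FROM logs
 WHERE report_id IN (SELECT id FROM reports
                      WHERE created_at < '2018-01-10 10:35:40'
                        AND status = 0)
 LIMIT 10000;

-- PostgreSQL: no LIMIT on DELETE, so batch through a subselect on logs.id
DELETE FROM logs
 WHERE id IN (SELECT l.id FROM logs l
                JOIN reports r ON r.id = l.report_id
               WHERE r.created_at < '2018-01-10 10:35:40'
                 AND r.status = 0
               LIMIT 10000);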

Tomer, I am not so sure that creating another index on the reports table (quite a big one) is the right way to go. Indexes already take up much more space than the actual data; we should be going the opposite way, I think. In this case, I vote for ditching ActiveRecord and writing plain SQL for PostgreSQL.

When using PostgreSQL 9.5+, you could consider setting the reports and related tables as UNLOGGED, which would dramatically speed up all fact uploads; deletes would also be faster. This is just an idea that came to mind, nothing that has been tested in production.
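
For illustration only (again untested; keep in mind UNLOGGED tables skip the WAL and are truncated after a crash, so this trades durability for speed):

-- PostgreSQL 9.5+ only
ALTER TABLE logs SET UNLOGGED;
ALTER TABLE reports SET UNLOGGED;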

Sure thing :slight_smile:

:foreman> select count(*) from logs;
+-----------+
| count(*)  |
+-----------+
| 114756673 |
+-----------+
1 row in set (49.21 sec) 

:foreman> select count(*) from reports;
+----------+
| count(*) |
+----------+
|   189340 |
+----------+
1 row in set (0.12 sec)

:foreman> select count(*) from messages;
+----------+
| count(*) |
+----------+
|   143858 |
+----------+
1 row in set (0.23 sec)

:foreman> select count(*) from sources;
+----------+
| count(*) |
+----------+
|    15986 |
+----------+
1 row in set (0.03 sec)

EXPLAIN output for both queries:

:foreman> explain extended select * FROM `logs` WHERE `logs`.`report_id` IN (SELECT `reports`.`id` FROM `reports`  WHERE (reports.created_at < '2018-01-10 10:35:40' and reports.status = 0));
+----+-------------+---------+--------+-------------------------------------------------------------+---------+---------+------------------------+-----------+----------+-------------+
| id | select_type | table   | type   | possible_keys                                               | key     | key_len | ref                    | rows      | filtered | Extra       |
+----+-------------+---------+--------+-------------------------------------------------------------+---------+---------+------------------------+-----------+----------+-------------+
|  1 | SIMPLE      | logs    | ALL    | index_logs_on_report_id                                     | NULL    | NULL    | NULL                   | 123694898 |   100.00 | Using where |
|  1 | SIMPLE      | reports | eq_ref | PRIMARY,index_reports_on_status,index_reports_on_created_at | PRIMARY | 4       | foreman.logs.report_id |         1 |   100.00 | Using where |
+----+-------------+---------+--------+-------------------------------------------------------------+---------+---------+------------------------+-----------+----------+-------------+

:foreman> explain select * FROM `logs` inner join (SELECT `reports`.`id` FROM `reports`  WHERE (reports.created_at < '2018-01-10 10:35:40' and reports.status = 0)) rep on `logs`.`report_id` = rep.id;
+----+-------------+------------+------+-----------------------------------------------------+-------------------------+---------+--------+-------+-------------+
| id | select_type | table      | type | possible_keys                                       | key                     | key_len | ref    | rows  | Extra       |
+----+-------------+------------+------+-----------------------------------------------------+-------------------------+---------+--------+-------+-------------+
|  1 | PRIMARY     | <derived2> | ALL  | NULL                                                | NULL                    | NULL    | NULL   | 74382 | NULL        |
|  1 | PRIMARY     | logs       | ref  | index_logs_on_report_id                             | index_logs_on_report_id | 5       | rep.id |  3606 | NULL        |
|  2 | DERIVED     | reports    | ref  | index_reports_on_status,index_reports_on_created_at | index_reports_on_status | 9       | const  | 74382 | Using where |
+----+-------------+------------+------+-----------------------------------------------------+-------------------------+---------+--------+-------+-------------+

I don’t think an index will make a huge difference, as the delay isn’t on that side, but I’m happy to try.

I considered that filtering more on the logs table might help, but the volume of data in logs is quite high, and the queries are very slow even when run back to back. So I don’t think adding a filter on logs.created_at gains much; if anything it slows things down. This seems to be borne out in testing:

:foreman> select * FROM `logs` WHERE `logs`.`report_id` IN (SELECT `reports`.`id` FROM `reports`  WHERE (reports.created_at < '2018-01-10 10:35:40' and reports.status = 0)) AND `logs`.`created_at` < '2018-01-10 10:35:40'
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
| id         | source_id | message_id | report_id | level_id | created_at          | updated_at          |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
<snip>
| 4882753922 |   1753442 |   41878002 |  63231532 |        4 | 2018-01-09 16:35:54 | 2018-01-09 16:35:54 |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
506 rows in set (2 min 56.06 sec)

Here’s the same query with the index added on the reports table:

:foreman> select * FROM `logs` WHERE `logs`.`report_id` IN (SELECT `reports`.`id` FROM `reports`  WHERE (reports.created_at < '2018-01-10 10:35:40' and reports.status = 0)) AND `logs`.`created_at` < '2018-01-10 10:35:40'
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
| id         | source_id | message_id | report_id | level_id | created_at          | updated_at          |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
<snip>
| 4882753922 |   1753442 |   41878002 |  63231532 |        4 | 2018-01-09 16:35:54 | 2018-01-09 16:35:54 |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
506 rows in set (2 min 42.83 sec)

One other thing I thought to try was re-analyzing the logs table. Obviously this isn’t a workable long-term solution, but perhaps there is some issue with how InnoDB maintains key distribution statistics over time when huge volumes of data are purged:

:foreman> analyze table logs;
+--------------+---------+----------+----------+
| Table        | Op      | Msg_type | Msg_text |
+--------------+---------+----------+----------+
| foreman.logs | analyze | status   | OK       |
+--------------+---------+----------+----------+
1 row in set (0.04 sec)

I then stopped MySQL and purged all the caches.

:foreman> select * FROM `logs` WHERE `logs`.`report_id` IN (SELECT `reports`.`id` FROM `reports`  WHERE (reports.created_at < '2018-01-10 10:35:40' and reports.status = 0)) AND `logs`.`created_at` < '2018-01-10 10:35:40'
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
| id         | source_id | message_id | report_id | level_id | created_at          | updated_at          |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
<snip>
| 4882753922 |   1753442 |   41878002 |  63231532 |        4 | 2018-01-09 16:35:54 | 2018-01-09 16:35:54 |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
506 rows in set (1.19 sec)

There appears to be less of an impact on our slower Foreman DB nodes, but it’s still a huge improvement:

This node previously took 10 minutes to execute this query:

:foreman> select * FROM `logs` WHERE `logs`.`report_id` IN (SELECT `reports`.`id` FROM `reports`  WHERE (reports.created_at < '2018-01-10 10:35:40' and reports.status = 0)) AND `logs`.`created_at` < '2018-01-10 10:35:40'
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
| id         | source_id | message_id | report_id | level_id | created_at          | updated_at          |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
<snip>
| 4882753922 |   1753442 |   41878002 |  63231532 |        4 | 2018-01-09 16:35:54 | 2018-01-09 16:35:54 |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
506 rows in set (1 min 41.31 sec)

And this node previously took around 4 minutes:

:foreman> select * FROM `logs` WHERE `logs`.`report_id` IN (SELECT `reports`.`id` FROM `reports`  WHERE (reports.created_at < '2018-01-10 10:35:40' and reports.status = 0)) AND `logs`.`created_at` < '2018-01-10 10:35:40'
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
| id         | source_id | message_id | report_id | level_id | created_at          | updated_at          |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
<snip>
| 4882753922 |   1753442 |   41878002 |  63231532 |        4 | 2018-01-09 16:35:54 | 2018-01-09 16:35:54 |
+------------+-----------+------------+-----------+----------+---------------------+---------------------+
506 rows in set (1 min 47.94 sec)

So if we can’t come up with a better fix, I think part of the solution could be forcing an analyze on the tables after the rake task finishes. I’m fairly certain this isn’t automated on PostgreSQL (at least on the pgsql nodes we have here), and judging by these results, MySQL doesn’t seem to do the right thing either.
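
Something along these lines after the cleanup, with syntax differing per database:

-- MySQL
ANALYZE TABLE logs, reports, messages, sources;

-- PostgreSQL (one table per statement, to stay compatible with older versions)
ANALYZE logs;
ANALYZE reports;
ANALYZE messages;
ANALYZE sources;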

With that said, comparing this to the inner join delete just shows the expense of the IN (SELECT ...) approach:

Fast node: 506 rows in set (0.11 sec)
Slow node: 506 rows in set (0.27 sec)
Really slow node: 506 rows in set (12.89 sec)

-Amy

Has anyone had any further thoughts on how to fix this in the long term?

You have 120 million records there but only 190k reports. Handling up to a million records is a piece of cake for Postgres, but once you get into the hundreds of millions, things are more complicated and require tuning and expertise. By default we do not tune any parameters; that is on you.

One rather radical idea I have is to store log data outside of the SQL database. We rarely join against this table, and it could easily live in separate files or in the system journal, from which we could pull this information out pretty easily (the journald API supports filtering and accessing data).

A second, less radical idea is to give up some granularity and de-normalize the data a bit. In our design, each log line is a separate record in an extra join table (the huge one called logs), which is very costly. If we go back one level and store the whole log in a (compressed) text column, performance will go up dramatically. Sure, we won’t be able to associate individual records with a particular log, or quickly filter out all log lines at error level, but do we need to do this across all hosts? I doubt anybody ever does that. We simply want to know the number of errors for particular hosts (we can store that in extra columns) and to be able to filter out irrelevant lines on the report detail page, which we can do in the user interface rather than at the database level.
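
Just to illustrate the shape of the idea (table and column names here are hypothetical, not a concrete schema proposal):

-- one row per report instead of one row per log line
CREATE TABLE report_bodies (
  report_id   BIGINT PRIMARY KEY,        -- one-to-one with reports
  error_count INTEGER NOT NULL DEFAULT 0, -- aggregates kept as plain columns
  body        TEXT NOT NULL               -- whole (compressed) log as a single value
);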

I think a combination of both could work. We are currently working on improving logging and are considering sending whole rendered templates into a logging sink; we can do the same with reports. Users would then be able to delete records from the database earlier, putting less stress on it, while still being able to search for older reports in the system journal.

This patch should help; give me your feedback.

This could be a workaround until we refactor reports storage completely. This is planned for 1.25.