Hello,
I have a preliminary version of a patch which speeds up report import by 24x (not 24%, but 24 times) while cutting the data stored in SQL in half. Let me first describe how I measured it. I used a VM (6 GB RAM, 2 vCPUs) with Foreman 1.23 running with default installation parameters (PostgreSQL) on CentOS 7.7. In the Rails console I ran a small script that creates reports with 100 lines each; half of the lines were shared across reports and half were unique. I used the benchmark-ips gem to measure throughput for 30 minutes (a simplified sketch of the script is at the end of this post). The results are:
Without the patch:
- 191 reports inserted in 30 minutes
- rate of import was 0.14 reports per second
- 27 MB of postgres data (data + indices)
With the patch:
- 5.477k reports inserted in 30 minutes
- rate of import was 3.468 reports per second
- 13 MB of postgres data (data + indices)
That gives approximately 24x better performance and 50% less stored data. How is that possible, you ask?
It's easy: we currently store report lines in separate tables: messages, sources, and a join table called logs which ties them all to the reports table. The data is deduplicated using a SHA1 digest with an index on it; the problem is that on busy instances the logs table easily grows and its index no longer fits into memory. Then everything becomes slow, not only report import but everything else, to the point where the host starts swapping (the worst case).
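For context, here is a rough sketch of the current layout (simplified, from memory; column names may not match the actual Foreman models exactly):

```ruby
# Rough sketch of the current schema: each report line is split across
# three tables and deduplicated by a SHA1 digest.
class Message < ApplicationRecord   # deduplicated message text
  has_many :logs
  # columns: value, digest (SHA1 of value, unique index)
end

class Source < ApplicationRecord    # deduplicated source/resource text
  has_many :logs
  # columns: value, digest (SHA1 of value, unique index)
end

class Log < ApplicationRecord       # join table, one row per report line
  belongs_to :report
  belongs_to :message
  belongs_to :source
  # columns: report_id, message_id, source_id, level_id (all indexed)
end

class Report < ApplicationRecord
  has_many :logs, dependent: :destroy
end
```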
My scenario was the easy one, since my VM had plenty of memory and all indices were loaded. However, we have seen many deployments where the logs table index is much bigger than memory, and in that case all operations are much slower because the SQL server needs to seek on disk. In those scenarios this can easily be a 500x performance improvement.
The proposal I would like to discuss is obvious: I drop all those tables and the report model gets a new field called "body" of the Rails JSON type. For databases which support a JSON datatype (PostgreSQL, MySQL) Rails uses it; for others it is just TEXT. Rails automatically serializes and deserializes JSON when reading and writing the field. In other words: instead of storing report lines across three tables, I store them in one long text value. No rocket science; the text gets compressed by PostgreSQL (TOAST), so deduplication is not necessary.
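A minimal sketch of what the change could look like (column name, defaults and the body layout are illustrative, not final):

```ruby
# Hypothetical migration: Rails maps :json to the native json type on
# PostgreSQL and MySQL; on adapters without JSON support it is stored as text.
class AddBodyToReports < ActiveRecord::Migration[5.2]
  def change
    add_column :reports, :body, :json, default: {}
    # dropping logs/messages/sources would happen in a follow-up migration
  end
end

# With a :json column Rails serializes and deserializes automatically,
# so a report body is just a plain Ruby hash, e.g.:
# report = Report.create!(body: { "format" => "puppet",
#                                 "logs"   => [["notice", "File[/etc/motd]", "content changed"]] })
# report.body["logs"].each { |level, resource, message| puts "#{level}: #{message}" }
```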
Postgres also supports the JSONB format, which allows creating indices on JSON fields; however, the big disadvantage is that the format is not compressed by default, so it is probably not worth it. Also, I don't see the ability to search across report lines as a particularly useful feature, and a table scan can always be implemented.
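For the record, a last-resort search would still be possible even without any index, along these lines (PostgreSQL-specific cast, purely illustrative):

```ruby
# Hypothetical fallback: cast the JSON body to text and do a sequential scan.
# Fine for an occasional admin query, not for hot code paths.
term = "Could not find resource"
Report.where("reports.body::text LIKE ?", "%#{term}%")
```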
Everything looks great, except for a few cons I want to discuss here before I finish the patch and make changes in some plugins:
- The change is disruptive for plugins: many of them use the report models directly (there is no defined API). It will require changes, at least in OpenSCAP and Ansible (there are probably more). It can't be done in a backward-compatible way; however, the changes should be pretty small (except for OpenSCAP, which is a reporting monster).
- Foreman provided a search capability over report contents (messages, resources). However, I did some research on Discourse this year and there were no voices against dropping it; many users actually confirmed that they have never used this search. Note that searching by host, report date/time and other basic fields will still be possible, and we can add more fields when requested (possibly for OpenSCAP).
It's also a chance to define an API for plugins: since reports are JSON now, plugins can either manipulate the JSON directly, adding their own fields, or use a provided API for that. We could also define columns per plugin, so a base report would only have one column (line), while Puppet would have two (line, resource) and OpenSCAP probably four, as it has today. This would lead to a better user experience; currently all reports show two columns as if they were Puppet reports.
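Nothing is designed yet, but as an illustration, the per-plugin columns could be as simple as a registry that the UI consults when rendering a report (all names below are made up):

```ruby
# Purely illustrative sketch of a per-plugin column registry, not a real API.
module ReportBodyFormat
  COLUMNS = {
    "base"     => %w[line],
    "puppet"   => %w[line resource],
    "openscap" => %w[line resource severity result], # guessing the four columns
  }.freeze

  def self.columns_for(origin)
    COLUMNS.fetch(origin, COLUMNS["base"])
  end
end

# The UI would then render ReportBodyFormat.columns_for(report.origin)
# instead of hardcoding the two Puppet columns.
```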
Raw results from my test scripts (they will come with the patch). Without the patch:
Calculating -------------------------------------
import report 50/50 0.140 (± 0.0%) i/s - 191.000 in 1812.264915s
table_name | total | index | toast | table
-----------------------------------------+------------+------------+------------+------------
logs | 18 MB | 16 MB | 8192 bytes | 1984 kB
messages | 4688 kB | 2152 kB | 8192 bytes | 2528 kB
sources | 4688 kB | 2152 kB | 8192 bytes | 2528 kB
reports | 400 kB | 304 kB | 8192 bytes | 88 kB
With the patch:
Calculating -------------------------------------
import report 50/50 3.468 (±57.7%) i/s - 5.477k in 1800.238053s
table_name | total | index | toast | table
-----------------------------------------+------------+------------+------------+---------
reports | 13 MB | 1432 kB | 8192 bytes | 12 MB
Edit: All tests were done in a single process (single thread) that generated all the JSON and also sent the requests. Obviously, a real-world scenario would saturate the database much more (we default to 6 Ruby worker processes, and big deployments have dozens of them), giving an even more dramatic increase in performance.
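For reference, here is a simplified sketch of the kind of script I used (the real one will ship with the patch; the payload builder and the create call below are stand-ins for the actual import path):

```ruby
require 'benchmark/ips'

# Build a report payload with 100 lines: 50 shared between all reports,
# 50 unique per report (stand-in for the real payload generator).
def build_report_payload(token)
  shared = 50.times.map { |i| ["notice", "Shared[#{i}]", "shared message #{i}"] }
  unique = 50.times.map { |i| ["notice", "Unique[#{i}]", "unique message #{token}-#{i}"] }
  { "logs" => shared + unique }
end

counter = 0
Benchmark.ips do |x|
  x.config(time: 1800, warmup: 60)   # measure for 30 minutes
  x.report("import report 50/50") do
    counter += 1
    # Stand-in for the actual report import call exercised by the script.
    Report.create!(body: build_report_payload(counter))
  end
end
```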