RFC: Optimized reports storage

lzap · September 26, 2019, 12:23pm

Hello,

I have a preliminary version of patch which speeds up report import by 24x (not 24 % but 24 times) while the data consumed in SQL is cut at half. Let me first described how I measured it. I’ve used a VM (6GB RAM, 2 vCPUs) with Foreman 1.23 running with default installation parameters (PostgreSQL) on CentOS 7.7. In the Rails console I created a small script that tries to create reports with 100 lines. Half of it were same lines and half of it were unique lines. I’ve used benchmark-ips gem to measure throughtput for 30 minutes. The results are:

Without the patch:

191 reports inserted in 30 minutes
rate of import was 0.14 report per second
27 MB of postgres data (data + indices)

With the patch:

5.477k reports inserted in 30 minutes
rate of import was 3.468 report per second
13 MB of postgres data (data + indices)

That gives approx. 24x better performance and 50% less stored data. How is that possible you ask?

It’s easy, we currently store report lines in separate tables: messages, sources and a join table called logs which joins all of the with table called reports. The data is deduplicated using sha1 sum and index on it, however the problem is that on busy instances the logs table easily grow and index don’t fit into memory. Everything is slow then, not only report import but everything else to the point then host starts swapping (the worst case).

My scenario was easy as my VM had plenty of memory and all indices were loaded, however we’ve seen many deployments where logs table index is much bigger than memory and in that case all operations are much slower as SQL server needs to seek on drive. In these scenarios this can easily be 500x better performance.

The proposal I would like to discuss is obvious - I drop all those tables and report model gets new field called “body” which is a JSON (Rails) type. For databases which do support JSON datatype Rails uses that (PostgreSQL, MySQL), for others it’s just TEXT. Rails automatically serializes JSON when reading and writing to the field. In other words: instead storing report lines into three join tables, I store it into a long VARCHAR instead. No rocket science, text gets compressed on PostgreSQL so deduplication is not necessary.

Postgres also supports JSONB format which allows to create indices on JSON fields, however the big disadvantage is that the format is not compressed by default so it is probably not worth it. Also I don’t see ability to search across report lines as useful thing and tablescan can always be implemented.

Everything looks great, except few cons I want to discuss here before I start finishing off the patch and doing some changes in some plugins:

The change is disturbing for plugins - many of them uses report models directly (there is no API defined). It will require changes, at least in OpenSCAP and Ansible (there are probably more). It can’t be done in a compatible way, however changes should be pretty small (except OpenSCAP that one is reporting monster).
Foreman provided search capability in report contents (messages, resources). However I did a research this year on Discourse and there were no voices against dropping this. Many actually confirmed that they’ve never used this search. Note searching by hosts, report date/time and other basic fields will still be possible and we can add other fields when requested (possibly OpenSCAP).

It’s also a chance to define an API for plugins, since reports are JSON now, plugins can either manipulate the JSON directly adding their own fields or use provided API for that. Also we could define columns per plugin, so base report would only have one column (line) while puppet would have two (line, resource) and openscap probably four as it has today. This would lead to better user experience, currently all reports show two columns as it was a puppet report.

Raw results from my test scripts (will come with the patch). Without the patch:

Calculating -------------------------------------
 import report 50/50      0.140  (± 0.0%) i/s -    191.000  in 1812.264915s

               table_name                |   total    |   index    |   toast    |   table    
-----------------------------------------+------------+------------+------------+------------
 logs                                    | 18 MB      | 16 MB      | 8192 bytes | 1984 kB
 messages                                | 4688 kB    | 2152 kB    | 8192 bytes | 2528 kB
 sources                                 | 4688 kB    | 2152 kB    | 8192 bytes | 2528 kB
 reports                                 | 400 kB     | 304 kB     | 8192 bytes | 88 kB

With the patch:

Calculating -------------------------------------
 import report 50/50      3.468  (±57.7%) i/s -      5.477k in 1800.238053s

               table_name                |   total    |   index    |   toast    |  table  
-----------------------------------------+------------+------------+------------+---------
 reports                                 | 13 MB      | 1432 kB    | 8192 bytes | 12 MB

Edit: All tests were done in a single process (single thread), it generated all the JSON and also sent the requests. Obviously, real-world scenario would saturate the database much more (we default to 6 worker Ruby processes by default and big deployments have dozens of them) giving even more dramatic increase of performance.

ekohl · September 26, 2019, 12:33pm

Could you share the model? I assume it’s now just the reports table with some columns.

Have you thought about the API for plugins? I’d think you have something like a pipeline:

Report comes in (probably API)
Foreman does pre-processing
Plugins do pre-processing
Foreman does post-processing
Plugins do post-processing
Foreman stores it in the database

Then I wonder if we should have generic helpers or just treat it as an opaque blob that you need a specific implementation (Puppet, Ansible, OpenSCAP) for the query. That would also have UI implications. Because of that I’d recommend some base class that at least has some basic abstraction to read a report. Even if it’s just to remain compatible.

lzap · September 26, 2019, 1:57pm

Sure, yes indeed it’s simply new column and three tables dropped. The commit does not include table drops yet, I just wanted to give it a quick try before I proceed with bigger refactoring:

Very good design indeed I like that not only for reports but also for facts (sometime in the future). Ideally the pipeline container would simply be Ruby hash, then plugins could manipulate the data at will and there would be some Foreman API metadata defining what are available columns to render on the puppet report page. Or better metadata inside each report, since it’s compressed and it’s not big it could work nicely.

I kinda fail to understand what’s in the middle of pre/post. I’d probably simplify this to:

JSON comes in
new “type” field detects which parser to use (or improved scanner registry finds it)
the parser transforms the JSON into some standard format
Foreman validates and stores the JSON into DB

Since OpenSCAP hacks additional fields quite a bit, I am thinking about making the format very generic per-report. Something like:

{
  "report": {
    "format": "puppet_v1",
    "columns": [
      {
        "name": "Message",
        "index": "0"
      },
      {
        "name": "Source",
        "index": "1"
      },
      {
        "name": "Level",
        "index": "2"
      }
    ],
    "data": [
      [
        "A message",
        "//some/source",
        "info"
      ],
      [
        "A message",
        "//some/source",
        "info"
      ],
      [
        "A message",
        "//some/source",
        "info"
      ]
    ]
  },
  "reported_at": "2018-01-15 17:31:36 521275",
  "metrics": {
    "time": {
      "total": 37
    }
  },
  "host": "io.local",
  "status": {
    "applied": 0,
    "failed": 0,
    "skipped": 0
  }
}

TimoGoebel · September 26, 2019, 1:59pm

This sounds nice. If we break plugins with this patch, I believe it’s worth it.

lzap · September 26, 2019, 2:24pm

Question for @Ondrej_Prazak - would that “metadata JSON” be sufficient for OpenSCAP? From what I see, it adds several columns and bunch of them are searchable.

We could introduce one or two VARCHAR columns and allow plugins to define what content should be indexed there if search is absolutely necessary. Or simply design the abstract report table in a way that it contains some generic fields which could be easily reused.

lzap · September 27, 2019, 8:13am

Another observation, we have a bitmasks for status operations and it all looks great on paper (bit masks are efficient, are they) except one “little” detail. They do not work in SQL environment because btree index does not work with them. It’s always a table scan no matter how you store them. Does not matter if you use integer type or native psql bitmask type - there is literally no index available for that.

I suggest to simplify this as well, the easiest way is only to store highest status as an integer. This gives ability to search for reports that has at least an error, et least a warning etc. It’s not very useful to be able to search reports that have debug messages, it’s probably all of them. This vastly simplifies the complex bitarray dance and it can actually leverage btree index of a SQL databases.

Alternatively, I suggest a clean and simple (pragmatic) approach. Let’s create three columns i1, i2 and i3 as indexed integers so implementations can decide what to store them and how to name them in scoped_search. Possibly amount of failed reports for puppet, amount of passed or failed checks for OpenSCAP and similarly for others.

lzap · September 27, 2019, 9:28am

The metadata (columns) can be either stored with report itself (which is nice and compact, compressed anyway) or separately in the Ruby codebase. Then upgrade to newer version (new/changed column) would probably need an extra transformation but those don’t change very often so it’s also a good option to have that in Ruby codebase.

Ondrej_Prazak · September 27, 2019, 9:54am

I like the idea of plugins adding their fields via API and that we count on plugins adding their own stuff. Being able to search in reports efficiently is vital for OpenSCAP, and there is always a possiblity that we might introduce additional search definitions in the future, but I think we can add another generic field as a wors case scenario.

Marek_Hulan · September 27, 2019, 11:24am

From OpenSCAP perspective, it’s important to be able to search by status on specific rule. E.g. “give me all hosts failing on rule xyz”. Rule today is a log, equal to one resource in puppet report.

lzap · September 27, 2019, 11:36am

That’s the price we need to pay. Here is my suggestion as a generic solution: to introduce one more field text t1 which would include all resource names which failed as simple text, one per line. Then the query would be as simple as SELECT * FROM reports WHERE type = "OpenSCAP" AND i1 > 0 AND t1 LIKE "%xccdf_org.some_rule%";. This will use indices from type and status (i1) columns and then it’s a table scan on the rest.

Other plugins could take advantage of that indexed text column named t1. Alternatively this column would be only defined by OpenSCAP plugin but I think it’s better to allow other plugins to leverage that rather than having multiple columns added with same information later on.

Secondary solution is to implement a join table and list of resources similarly to what I am getting rid of in the OpenSCAP plugin, but given amount of problems we had with this I suggest to rely on a text field because PostgreSQL can compress it and still provide (slower) search.

It’s a trade-off indeed, but when deciding do realize how many times Foreman need to import a report (hudreds, thousands per hour) with how many times a user searches for such report (few times per hour). For me there is a clear winner - SQL LIKE statement.

lzap · September 27, 2019, 12:08pm

Anyone against dropping origin column from reports? I don’t understand why we keep origin and report type (which is stored in type column). That sounds like the information is stored twice, also there is no index on origin anyway, why would anyone use this column. @bastilian speak up now or never

Edit: One idea I have is that Report is the base class that does very little, ConfigReport contains configuration stuff but it really should be named PuppetReport. Ansible reuses ConfigReport so for this reason another field origin was introduced. My take:

Rename ConfigReport to PuppetReport and only keep puppet-related stuff.
Make Report to be really the abstract class for all other report implementations.
Refactor Ansible to use Report instead ConfigReport.
Other plugins as well

Dirk · September 27, 2019, 7:44pm

Not really against it, but it is quite nice to have a visual differentiator if you have multiple solutions in place like puppet for baselining and ansible for application deployment. Perhaps not the most common scenario and reports from other plugins are already separated in a different way, so we do not lose much. Also the icon can perhaps be tied to something else to keep this small feature.

lzap · September 30, 2019, 6:09am

At any point, I was not suggesting to remove that icon. The reason why the column was introduced was that CofigReport model was purely a Puppet report. Then we extracted this to Report class but plugins kept using ConfigReport and I think that origin was introduced as a workaround to differentiate config reports.

What I am aiming for is complete refactoring of reporting, since I am braking plugins by this anyway I suggest that all plugins now inherit from Report model where type determines the report kind. Thus we could get rid of origin.

lzap · December 6, 2019, 1:07pm

I am staring my work with these main ideas:

This is now out of date, see below the updated plan.

The model that will be kept is Report.
All STI and subclasses including ConfigReport is removed.
Report contents stored in new field body as plain text (JSON).
Report type is kept in origin field (Puppet, Ansible, OpenSCAP).
Generic non-indexed* boolean columns b1, b2 and b3 available for use by plugins.
Generic non-indexed** text columns t1 and t2 available for use by plugins.
New processing pipeline as described by @ekohl extensible by plugins.
Plugins define metadata via new sub-class of ReportMetadata which captures mapping of fields, processing hooks and transformation, searching fields and other details.
Puppet metadata: b1 = failed resources, b2 = failed restart resources, b3 = changed resources, t1 = the first error message, t2 = resource of the first error message
Puppet metadata: b1 = unreachable host, b2 = failed tasks, b3 = changed tasks, t1 = the first error message, t2 = task name of the first error message
OpenSCAP metadata: b1 = failed rules, b2 = othered rules, b3 = passed rules, t1 = rules which failed, t2 = rules which othered
No data upgrade in migration (when needed reports must be exported and imported back via special rake task before and after upgrade process).
All searching should be by default scoped to a reasonable time frame (last week) so SQL server can quickly use index on “reported_at” column and do a quick tablescan for the rest
PostgreSQL partitioning will be investigated in order to make report expiration even faster.

Legend:

* index on boolean column is very likely not to be used and its presence will slow down updates
** index on text field is useless for LIKE queries (these will always be like queries for OpenSCAP)

Ondrej_Prazak · December 6, 2019, 1:47pm

2 things I can see at the moment:

OpenSCAP will need t3 field - rules which passed
reimporting the reports will cause that a native xml report stored on proxy cannot be accessed because ids change, so we need to warn users about it

ekohl · December 6, 2019, 1:48pm

If you’re storing a report as a blob, then you need some way to describe the format. https://puppet.com/docs/puppet/latest/format_report.html describes the Puppet report and as you can see it has a report_format parameter. This will allow application code to know if it can even render it. For example, it might refuse to render an old report format. You can also merge the source and the format by defining the origin as Puppet-10 but I don’t know if that makes it easier.

lzap · December 6, 2019, 1:50pm

Before I add it, is it really a real-world use-case to search for report with “particular rule which passed”?

Ondrej_Prazak · December 6, 2019, 1:53pm

We have that search but I do not have data on how much it is used.

lzap · December 6, 2019, 2:33pm

Ok I’ve spent whole Friday thinking, drawing, talking, walking in the house trying to find the best solution. With help from Ewoud, Marek, Ondra and others I think I have hopefully the masterplan:

The model that will be kept is Report.
All STI and subclasses including ConfigReport is removed.
Report contents stored in new field body as plain text (JSON).
Plugins can store arbitrary contents in body however backward compatibility must be taken into account (new version of reports must work with old versions).
Report type is kept in origin field (Puppet-9, Ansible, OpenSCAP). Puppet-9 stands for report format V9 which is compatible with V10. Older versions would not be supported. There would be Unknown report type which would be simply store report as-is without further processing.
Status column converted from integer to 64bit big int.
StatusCalculator is kept and extended to use all 64bits.
Generic non-indexed* boolean columns b1, b2 and b3 available for use by plugins.
Plugin decides themselves how to store data in the body field in such a way that it’s usable and searchable.
New processing pipeline as described by @ekohl extensible by plugins.
Plugins define metadata via new sub-class of ReportMetadata which captures mapping of fields, processing hooks and transformation, searching fields and other details.
Plugins define its own ERB template and path to it in metadata to use when displaying detail page as well as index page.
Puppet metadata: b1 = failed resources, b2 = failed restart resources, b3 = changed resources
Puppet metadata: b1 = unreachable host, b2 = failed tasks, b3 = changed tasks
OpenSCAP metadata: b1 = failed rules, b2 = othered rules, b3 = passed rules, rules in searchable form in the body (e.g. F:rule.name.abc for “failed” rules)
No data upgrade in migration (when needed reports must be exported and imported back via special rake task before and after upgrade process).
Searching is supported for reports which has failures, changes, passed rules etc.
Index page (search result) shows also number of failures, changes etc (using StatusCalculator).
All searching should be by default scoped to a reasonable time frame (last week) so SQL server can quickly use index on “reported_at” column and do a quick table scan for the rest
PostgreSQL partitioning will be investigated in order to make report expiration even faster.

I believe this represents the best user experience with the most effective storage and performance unmatched which what we have today.

Legend:

* index on boolean column is very likely not to be used and its presence will slow down updates

lzap · September 11, 2020, 4:30pm

For the record, I was working today on couple of optimizations of the current design with promising results (10x improvement without much changing the schema).

github.com/theforeman/foreman

Fixes #30820 - drop digest and use hash index instead

theforeman:develop ← lzap:drop-report-indices-30820

opened 09:06AM - 11 Sep 20 UTC

lzap

+69 -19

Foreman reports storage is designed so that reports are broken into individual l…ines and store them as 1:N references. The biggest offender are messages and sources tables which contain those lines together with SHA1 hash. This was designed around Puppet because it returns "source" for each log line (on terminal it is printed as "[source] log_content"). So what we essentially do is break reports into lines, break lines into two pieces, calculate SHA1 sum from both parts and store that into the database. Extremely inefficient. The reason for all of that is that users can search for a source or for a (whole) report line using scoped search. This is only useful for Puppet users (source), I am unable to find a use case for searching for whole line. Full text search would make sense, or doing SQL LIKE query as well, but non of that Foreman supports. The way this is implemented is to overcome SQL server limit of index because "messages"."value" and "sources"."value" fields are TEXT type. Most SQL servers, including PostgreSQL, can create B-Tree index only up to some limit (8k in this case). This is extremely slow, it's wasting of resources, there is no resolution of SHA1 conflicts. This patch gets rid of our own SHA1 "digest" column and let SQL server do the index directly on the text column. To break the index limit, Postgres has a hash index option which is not only faster and way more storage-efficient but it's ideal for comparison only, which is exactly what we need (we do not sort logs or sources tables - it makes no sense - we only do equal operation to find duplicities). The patch * drops digest completely from the code base and let SQL server to do hashing via index * drops digest field and index from both tables (2 indices, 2 fields) * creates a hash index on value column (migration can be slow - testing needed) The result needs to be tested on real data, but the patch essentially removes one field, removes SHA1 slow calculation in Ruby and replaces two b-tree indices with two hash indices which takes approximately 20-30% of the original space and are faster. The only issue could be sqlite3 which we still require for packaging, from my rough testing I am able to create index on text column just fine. Initially I thought that creating hash index in RoR migration would require SQL, but it looks like RoR creates the index with the correct type just fine with "using" attribute. For the record, hash index is supported in postgres from version 8.2 and from version 10 it is transaction safe. Meaning that a crash on pre-10 version would require index rebuild which is a trivial operation. CentOS installations already use Postgres v12 and Debian (Buster) has v11 and Ubuntu has v10. We are all good here. Also, complete refactoring of reports is planned where searching in reports will be only possible via tablescan anyway: https://community.theforeman.org/t/rfc-optimized-reports-storage/15573 Finally, the same change must be done in OpenSCAP where digest is being used.