RFC: Optimized reports storage

Hello,

I have a preliminary version of a patch which speeds up report import by 24x (not 24 %, but 24 times) while the data stored in SQL is cut in half. Let me first describe how I measured it. I used a VM (6 GB RAM, 2 vCPUs) with Foreman 1.23 running with default installation parameters (PostgreSQL) on CentOS 7.7. In the Rails console I created a small script that creates reports with 100 lines each; half of them were identical lines and half were unique. I used the benchmark-ips gem to measure throughput for 30 minutes. The results are:

Without the patch:

  • 191 reports inserted in 30 minutes
  • rate of import was 0.14 reports per second
  • 27 MB of postgres data (data + indices)

With the patch:

  • 5.477k reports inserted in 30 minutes
  • rate of import was 3.468 reports per second
  • 13 MB of postgres data (data + indices)

That gives approx. 24x better performance and 50% less stored data. How is that possible, you ask? :wink:

It's easy: we currently store report lines in separate tables: messages, sources and a join table called logs, which ties them all to the reports table. The data is deduplicated using a SHA1 sum and an index on it; the problem, however, is that on busy instances the logs table grows easily and its index no longer fits into memory. Everything becomes slow then, not only report import but everything else, to the point where the host starts swapping (the worst case).

My scenario was easy because my VM had plenty of memory and all indices were loaded, however we've seen many deployments where the logs table index is much bigger than memory, and in that case all operations are much slower as the SQL server needs to seek on disk. In these scenarios this can easily be 500x better performance.

The proposal I would like to discuss is obvious: I drop all those tables and the report model gets a new field called "body" which is a JSON (Rails) type. For databases which support a JSON datatype Rails uses it (PostgreSQL, MySQL); for others it's just TEXT. Rails automatically serializes JSON when reading and writing the field. In other words: instead of storing report lines in three join tables, I store them in one long text value. No rocket science; the text gets compressed on PostgreSQL, so deduplication is not necessary.
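
For illustration, a minimal sketch of what the schema change could look like; the column and migration names are just placeholders, not the actual patch:

# Hypothetical migration sketch for a reports.body column. On PostgreSQL/MySQL
# Rails maps :json to the native JSON type; elsewhere a plain :text column plus
# `serialize :body, JSON` in the model gives the same read/write behaviour.
class AddBodyToReports < ActiveRecord::Migration[5.2]
  def change
    add_column :reports, :body, :json
  end
end

# Usage is then plain Ruby hashes/arrays:
#   report.body = { "lines" => [["A message", "//some/source", "info"]] }
#   report.save!
#   report.body["lines"].size #=> 1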

Postgres also supports the JSONB format, which allows creating indices on JSON fields, however the big disadvantage is that the format is not compressed by default, so it is probably not worth it. Also, I don't see the ability to search across report lines as a useful thing, and a table scan can always be implemented.

Everything looks great, except for a few cons I want to discuss here before I start finishing off the patch and making changes in some plugins:

  • The change is disruptive for plugins - many of them use report models directly (there is no API defined). It will require changes, at least in OpenSCAP and Ansible (there are probably more). It can't be done in a compatible way, however the changes should be pretty small (except OpenSCAP, that one is a reporting monster).
  • Foreman provided search capability in report contents (messages, resources). However, I did research this year on Discourse and there were no voices against dropping this; many actually confirmed that they've never used this search. Note that searching by host, report date/time and other basic fields will still be possible, and we can add other fields when requested (possibly for OpenSCAP).

It's also a chance to define an API for plugins: since reports are JSON now, plugins can either manipulate the JSON directly, adding their own fields, or use a provided API for that. We could also define columns per plugin, so a base report would only have one column (line), while Puppet would have two (line, resource) and OpenSCAP probably four as it has today. This would lead to a better user experience; currently all reports show two columns as if they were Puppet reports.

Raw results from my test scripts (will come with the patch). Without the patch:

Calculating -------------------------------------
 import report 50/50      0.140  (± 0.0%) i/s -    191.000  in 1812.264915s

               table_name                |   total    |   index    |   toast    |   table    
-----------------------------------------+------------+------------+------------+------------
 logs                                    | 18 MB      | 16 MB      | 8192 bytes | 1984 kB
 messages                                | 4688 kB    | 2152 kB    | 8192 bytes | 2528 kB
 sources                                 | 4688 kB    | 2152 kB    | 8192 bytes | 2528 kB
 reports                                 | 400 kB     | 304 kB     | 8192 bytes | 88 kB

With the patch:

Calculating -------------------------------------
 import report 50/50      3.468  (±57.7%) i/s -      5.477k in 1800.238053s

               table_name                |   total    |   index    |   toast    |  table  
-----------------------------------------+------------+------------+------------+---------
 reports                                 | 13 MB      | 1432 kB    | 8192 bytes | 12 MB

Edit: All tests were done in a single process (single thread), which generated all the JSON and also sent the requests. Obviously, a real-world scenario would saturate the database much more (we default to 6 Ruby worker processes and big deployments have dozens of them), giving an even more dramatic increase in performance.


Could you share the model? I assume it’s now just the reports table with some columns.

Have you thought about the API for plugins? I’d think you have something like a pipeline:

  • Report comes in (probably API)
  • Foreman does pre-processing
  • Plugins do pre-processing
  • Foreman does post-processing
  • Plugins do post-processing
  • Foreman stores it in the database

Then I wonder if we should have generic helpers or just treat it as an opaque blob for which you need a specific implementation (Puppet, Ansible, OpenSCAP) to query. That would also have UI implications. Because of that I'd recommend some base class that at least has some basic abstraction to read a report, even if it's just to remain compatible.
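
Something roughly along these lines is what I have in mind (all names here are hypothetical):

# Hypothetical base abstraction so UI code and plugins read reports through
# one interface instead of poking at the raw blob directly.
class BaseReportReader
  def initialize(report)
    @report = report
  end

  # Column headers to render for this report type.
  def columns
    ["Message"]
  end

  # Rows of the report body, one array per line.
  def rows
    Array(@report.body["data"])
  end
end

# A Puppet/Ansible/OpenSCAP implementation would subclass this and override
# columns/rows according to its own body layout.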

Sure, yes indeed, it's simply a new column and three tables dropped. The commit does not include the table drops yet, I just wanted to give it a quick try before I proceed with the bigger refactoring:

Very good design indeed, I like that not only for reports but also for facts (sometime in the future). Ideally the pipeline container would simply be a Ruby hash, then plugins could manipulate the data at will and there would be some Foreman API metadata defining which columns are available to render on the Puppet report page. Or better, metadata inside each report; since it's compressed and it's not big, it could work nicely.

I kinda fail to understand what's in the middle of pre/post. I'd probably simplify this to (rough sketch after the list):

  • JSON comes in
  • new “type” field detects which parser to use (or improved scanner registry finds it)
  • the parser transforms the JSON into some standard format
  • Foreman validates and stores the JSON into DB
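
A rough sketch of that flow; the registry and parser classes are purely hypothetical at this point:

# Purely hypothetical names: a registry that maps the incoming "type" field
# to a parser class, which turns the payload into the common format.
PARSERS = {
  "puppet"  => PuppetReportParser,
  "ansible" => AnsibleReportParser
}.freeze

def import_report(json)
  payload = JSON.parse(json)
  parser  = PARSERS.fetch(payload["type"]) { raise "unknown report type" }
  body    = parser.new(payload).to_common_format
  # Foreman then validates the result and stores it in the body column.
  Report.create!(body: body)
end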

Since OpenSCAP hacks additional fields quite a bit, I am thinking about making the format very generic per-report. Something like:

{
  "report": {
    "format": "puppet_v1",
    "columns": [
      {
        "name": "Message",
        "index": "0"
      },
      {
        "name": "Source",
        "index": "1"
      },
      {
        "name": "Level",
        "index": "2"
      }
    ],
    "data": [
      [
        "A message",
        "//some/source",
        "info"
      ],
      [
        "A message",
        "//some/source",
        "info"
      ],
      [
        "A message",
        "//some/source",
        "info"
      ]
    ]
  },
  "reported_at": "2018-01-15 17:31:36 521275",
  "metrics": {
    "time": {
      "total": 37
    }
  },
  "host": "io.local",
  "status": {
    "applied": 0,
    "failed": 0,
    "skipped": 0
  }
}

This sounds nice. If we break plugins with this patch, I believe it’s worth it.


Question for @Ondrej_Prazak - would that "metadata JSON" be sufficient for OpenSCAP? From what I see, it adds several columns and a bunch of them are searchable.

We could introduce one or two VARCHAR columns and allow plugins to define what content should be indexed there if search is absolutely necessary. Or simply design the abstract report table in a way that it contains some generic fields which could be easily reused.

Another observation: we have bitmasks for status operations and it all looks great on paper (bit masks are efficient, aren't they), except for one "little" detail. They do not work in a SQL environment because a btree index does not work with them. It's always a table scan no matter how you store them; it does not matter if you use an integer type or the native psql bitmask type - there is literally no index available for that.

I suggest simplifying this as well; the easiest way is to store only the highest status as an integer. This gives the ability to search for reports that have at least an error, at least a warning, etc. It's not very useful to be able to search for reports that have debug messages, that's probably all of them. This vastly simplifies the complex bit-array dance and can actually leverage the btree index of a SQL database.
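
To make that concrete, a small sketch (level names and their ordering are just my assumption):

# Higher number = more severe; only the maximum seen in a report is stored.
LEVELS = { "debug" => 0, "info" => 1, "notice" => 2, "warning" => 3, "error" => 4 }.freeze

def highest_status(lines)
  lines.map { |line| LEVELS.fetch(line["level"], 0) }.max || 0
end

# "Reports with at least a warning" is then a plain range query that the
# btree index can serve:
#   Report.where("status >= ?", LEVELS["warning"])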

Alternatively, I suggest a clean and simple (pragmatic) approach. Let's create three columns i1, i2 and i3 as indexed integers so implementations can decide what to store in them and how to name them in scoped_search. Possibly the number of failed resources for Puppet, the number of passed or failed checks for OpenSCAP, and similarly for others.
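
A sketch of how that could look, again just an assumption of the shape rather than final naming:

# Hypothetical migration: three generic indexed integer columns whose meaning
# each plugin defines for itself (failed resources, failed checks, ...).
class AddGenericCountersToReports < ActiveRecord::Migration[5.2]
  def change
    add_column :reports, :i1, :integer
    add_column :reports, :i2, :integer
    add_column :reports, :i3, :integer
    add_index  :reports, :i1
    add_index  :reports, :i2
    add_index  :reports, :i3
  end
end

A plugin would then expose its own search name for the column, something like scoped_search on: :i1, rename: :failed for Puppet.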

The metadata (columns) can either be stored with the report itself (which is nice and compact, compressed anyway) or separately in the Ruby codebase. An upgrade to a newer version (new/changed column) would then probably need an extra transformation, but those don't change very often, so keeping that in the Ruby codebase is also a good option.

I like the idea of plugins adding their fields via an API and that we count on plugins adding their own stuff. Being able to search in reports efficiently is vital for OpenSCAP, and there is always a possibility that we might introduce additional search definitions in the future, but I think we can add another generic field as a worst-case scenario.

From the OpenSCAP perspective, it's important to be able to search by status on a specific rule, e.g. "give me all hosts failing on rule xyz". A rule today is a log entry, equivalent to one resource in a Puppet report.

That's the price we need to pay. Here is my suggestion as a generic solution: introduce one more text field t1 which would contain all resource names which failed as simple text, one per line. Then the query would be as simple as SELECT * FROM reports WHERE type = 'OpenSCAP' AND i1 > 0 AND t1 LIKE '%xccdf_org.some_rule%';. This will use the indices on the type and status (i1) columns and then it's a table scan on the rest.

Other plugins could take advantage of that indexed text column named t1. Alternatively, this column would only be defined by the OpenSCAP plugin, but I think it's better to allow other plugins to leverage it rather than having multiple columns with the same information added later on.

A secondary solution is to implement a join table and a list of resources, similar to what I am getting rid of in the OpenSCAP plugin, but given the amount of problems we had with this I suggest relying on a text field, because PostgreSQL can compress it and still provide (slower) search.

It's a trade-off indeed, but when deciding, do realize how many times Foreman needs to import a report (hundreds or thousands of times per hour) compared with how many times a user searches for such a report (a few times per hour). For me there is a clear winner - the SQL LIKE statement.

Anyone against dropping the origin column from reports? I don't understand why we keep both origin and the report type (which is stored in the type column). That sounds like the information is stored twice; also, there is no index on origin anyway, so why would anyone use this column? @bastilian speak up now or never :wink:

Edit: One idea I have is that Report is the base class that does very little, and ConfigReport contains configuration stuff but really should be named PuppetReport. Ansible reuses ConfigReport, and for this reason another field, origin, was introduced. My take:

  • Rename ConfigReport to PuppetReport and only keep puppet-related stuff.
  • Make Report to be really the abstract class for all other report implementations.
  • Refactor Ansible to use Report instead of ConfigReport.
  • Other plugins as well.

Not really against it, but it is quite nice to have a visual differentiator if you have multiple solutions in place, like Puppet for baselining and Ansible for application deployment. Perhaps not the most common scenario, and reports from other plugins are already separated in a different way, so we do not lose much. Also, the icon can perhaps be tied to something else to keep this small feature.


I was not suggesting removing that icon at any point. The reason the column was introduced was that the ConfigReport model was purely a Puppet report. Then we extracted this into the Report class, but plugins kept using ConfigReport, and I think that origin was introduced as a workaround to differentiate config reports.

What I am aiming for is a complete refactoring of reporting; since I am breaking plugins with this anyway, I suggest that all plugins now inherit from the Report model where type determines the report kind. Thus we could get rid of origin.


I am starting my work with these main ideas:

This is now out of date, see the updated plan below.

  • The model that will be kept is Report.
  • All STI and subclasses including ConfigReport are removed.
  • Report contents stored in new field body as plain text (JSON).
  • Report type is kept in origin field (Puppet, Ansible, OpenSCAP).
  • Generic non-indexed* boolean columns b1, b2 and b3 available for use by plugins.
  • Generic non-indexed** text columns t1 and t2 available for use by plugins.
  • New processing pipeline as described by @ekohl extensible by plugins.
  • Plugins define metadata via new sub-class of ReportMetadata which captures mapping of fields, processing hooks and transformation, searching fields and other details.
  • Puppet metadata: b1 = failed resources, b2 = failed restart resources, b3 = changed resources, t1 = the first error message, t2 = resource of the first error message
  • Ansible metadata: b1 = unreachable host, b2 = failed tasks, b3 = changed tasks, t1 = the first error message, t2 = task name of the first error message
  • OpenSCAP metadata: b1 = failed rules, b2 = othered rules, b3 = passed rules, t1 = rules which failed, t2 = rules which othered
  • No data upgrade in migration (when needed, reports must be exported and imported back via a special rake task before and after the upgrade process).
  • All searching should be scoped by default to a reasonable time frame (last week) so the SQL server can quickly use the index on the reported_at column and do a quick table scan for the rest
  • PostgreSQL partitioning will be investigated in order to make report expiration even faster.

Legend:

  • * index on boolean column is very likely not to be used and its presence will slow down updates
  • ** index on text field is useless for LIKE queries (these will always be LIKE queries for OpenSCAP)

2 things I can see at the moment:

  • OpenSCAP will need a t3 field - rules which passed
  • reimporting the reports will mean that a native XML report stored on the proxy cannot be accessed because the IDs change, so we need to warn users about it

If you're storing a report as a blob, then you need some way to describe the format. https://puppet.com/docs/puppet/latest/format_report.html describes the Puppet report and, as you can see, it has a report_format parameter. This allows application code to know whether it can even render it. For example, it might refuse to render an old report format. You could also merge the source and the format by defining the origin as Puppet-10, but I don't know if that makes it easier.
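
For example, rendering code could check that field before deciding whether it knows how to display the body; a hypothetical guard, reusing the format value from the example above:

# Hypothetical guard: only render body formats this code can interpret.
SUPPORTED_FORMATS = %w[puppet_v1 ansible_v1 openscap_v1].freeze

def renderable?(report)
  SUPPORTED_FORMATS.include?(report.body.dig("report", "format"))
end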


Before I add it, is it really a real-world use case to search for reports with a "particular rule which passed"?

We have that search but I do not have data on how much it is used.

Ok, I've spent the whole Friday thinking, drawing, talking and walking around the house trying to find the best solution. With help from Ewoud, Marek, Ondra and others, I think I hopefully have the master plan:

  • The model that will be kept is Report.
  • All STI and subclasses including ConfigReport are removed.
  • Report contents stored in new field body as plain text (JSON).
  • Plugins can store arbitrary contents in body, however backward compatibility must be taken into account (new versions of reports must work with old versions).
  • Report type is kept in the origin field (Puppet-9, Ansible, OpenSCAP). Puppet-9 stands for report format V9, which is compatible with V10; older versions would not be supported. There would also be an Unknown report type which would simply store the report as-is without further processing.
  • Status column converted from integer to a 64-bit bigint.
  • StatusCalculator is kept and extended to use all 64 bits.
  • Generic non-indexed* boolean columns b1, b2 and b3 available for use by plugins.
  • Plugins decide themselves how to store data in the body field in such a way that it's usable and searchable.
  • New processing pipeline as described by @ekohl extensible by plugins.
  • Plugins define metadata via a new sub-class of ReportMetadata which captures mapping of fields, processing hooks and transformations, searching fields and other details (a rough sketch follows below).
  • Plugins define their own ERB template and the path to it in metadata, to be used when displaying the detail page as well as the index page.
  • Puppet metadata: b1 = failed resources, b2 = failed restart resources, b3 = changed resources
  • Ansible metadata: b1 = unreachable host, b2 = failed tasks, b3 = changed tasks
  • OpenSCAP metadata: b1 = failed rules, b2 = othered rules, b3 = passed rules, rules in searchable form in the body (e.g. F:rule.name.abc for “failed” rules)
  • No data upgrade in migration (when needed, reports must be exported and imported back via a special rake task before and after the upgrade process).
  • Searching is supported for reports which have failures, changes, passed rules, etc.
  • The index page (search results) also shows the number of failures, changes, etc. (using StatusCalculator).
  • All searching should be scoped by default to a reasonable time frame (last week) so the SQL server can quickly use the index on the reported_at column and do a quick table scan for the rest
  • PostgreSQL partitioning will be investigated in order to make report expiration even faster.

I believe this represents the best user experience with the most effective storage and performance unmatched by what we have today.

Legend:

  • * index on boolean column is very likely not to be used and its presence will slow down updates
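
To make the ReportMetadata idea above a bit more tangible, here is a rough sketch of what the sub-class could look like for OpenSCAP; the base class API is invented here and nothing is final:

# Hypothetical metadata sub-class: maps the generic b1-b3 columns and declares
# how rule results are encoded into the searchable body (e.g. "F:rule.name.abc").
class OpenscapReportMetadata < ReportMetadata
  def origin
    "OpenSCAP"
  end

  # What the generic columns mean for this report kind.
  def column_mapping
    { b1: :failed_rules, b2: :othered_rules, b3: :passed_rules }
  end

  # Prefix used when serializing a rule result into the body for LIKE searches.
  def rule_prefix(result)
    { failed: "F", othered: "O", passed: "P" }.fetch(result)
  end

  # ERB template used for the detail page (index page analogous).
  def detail_template
    "foreman_openscap/reports/detail"
  end
end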

For the record, I was working today on a couple of optimizations of the current design, with promising results (10x improvement without changing the schema much).