RFC: Optimized reports storage

Hello,

I have a preliminary version of a patch which speeds up report import by 24x (not 24 %, but 24 times) while the data stored in SQL is cut in half. Let me first describe how I measured it. I used a VM (6 GB RAM, 2 vCPUs) with Foreman 1.23 running with default installation parameters (PostgreSQL) on CentOS 7.7. In the Rails console I created a small script that creates reports with 100 lines each; half of the lines were identical and half were unique. I used the benchmark-ips gem to measure throughput for 30 minutes (a sketch of the script follows the results below). The results are:

Without the patch:

  • 191 reports inserted in 30 minutes
  • rate of import was 0.14 reports per second
  • 27 MB of postgres data (data + indices)

With the patch:

  • 5.477k reports inserted in 30 minutes
  • rate of import was 3.468 reports per second
  • 13 MB of postgres data (data + indices)
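
For reference, a minimal sketch of the measuring script, assuming ConfigReport.import as the entry point and a hypothetical build_report helper that generates the 50/50 payload (the real scripts will come with the patch):

# Run in the Rails console. build_report is a hypothetical helper that
# returns a raw report hash with 100 lines, half identical, half unique.
require 'benchmark/ips'

Benchmark.ips do |x|
  x.config(time: 30 * 60, warmup: 10) # measure for 30 minutes
  x.report('import report 50/50') do
    ConfigReport.import(build_report(lines: 100, unique_ratio: 0.5))
  end
end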

That gives approx. 24x better performance and 50% less stored data. How is that possible, you ask? :wink:

It’s easy: we currently store report lines in separate tables - messages, sources and a join table called logs which joins all of them with the reports table. The data is deduplicated using a SHA1 sum and an index on it; the problem, however, is that on busy instances the logs table easily grows and its index no longer fits into memory. Everything is slow then, not only report import but everything else, to the point where the host starts swapping (the worst case).

My scenario was easy since my VM had plenty of memory and all indices were loaded, but we’ve seen many deployments where the logs table index is much bigger than available memory, and in that case all operations are much slower because the SQL server needs to seek on disk. In those scenarios this can easily be a 500x performance improvement.

The proposal I would like to discuss is obvious - I drop all those tables and the report model gets a new field called “body” of the (Rails) JSON type. For databases which support a JSON datatype Rails uses it (PostgreSQL, MySQL); for others it’s just TEXT. Rails automatically serializes and deserializes the JSON when writing and reading the field. In other words: instead of storing report lines across three joined tables, I store them in one long text value. No rocket science; the text gets compressed by PostgreSQL (TOAST), so deduplication is not necessary.
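
A minimal sketch of the migration, assuming the current table names and that the final patch drops them (file name and class name are illustrative):

# db/migrate/xxxx_replace_report_lines_with_body.rb
class ReplaceReportLinesWithBody < ActiveRecord::Migration[5.2]
  def change
    # Rails maps :json to the native json type on PostgreSQL and MySQL
    # and falls back to TEXT on other databases.
    add_column :reports, :body, :json

    # The three tables that currently hold the deduplicated lines.
    drop_table :logs
    drop_table :messages
    drop_table :sources
  end
end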

PostgreSQL also supports the JSONB format which allows creating indices on JSON fields, but its big disadvantage is that the format is not compressed by default, so it is probably not worth it. Also, I don’t see the ability to search across report lines as a useful feature, and a table scan can always be implemented.

Everything looks great, except for a few cons I want to discuss here before I finish off the patch and make changes in some plugins:

  • The change is disruptive for plugins - many of them use the report models directly (there is no API defined). It will require changes, at least in OpenSCAP and Ansible (there are probably more). It can’t be done in a compatible way, however the changes should be pretty small (except for OpenSCAP; that one is a reporting monster).
  • Foreman provided search capability in report contents (messages, resources). However, I did research on Discourse this year and there were no voices against dropping this; many actually confirmed that they’ve never used this search. Note that searching by host, report date/time and other basic fields will still be possible, and we can add other fields when requested (possibly for OpenSCAP).

It’s also a chance to define an API for plugins. Since reports are JSON now, plugins can either manipulate the JSON directly, adding their own fields, or use a provided API for that. We could also define columns per plugin, so a base report would only have one column (line), while Puppet would have two (line, resource) and OpenSCAP probably four, as it has today. This would lead to a better user experience; currently all reports show two columns as if they were Puppet reports.

Raw results from my test script (it will come with the patch). Without the patch:

Calculating -------------------------------------
 import report 50/50      0.140  (± 0.0%) i/s -    191.000  in 1812.264915s

               table_name                |   total    |   index    |   toast    |   table    
-----------------------------------------+------------+------------+------------+------------
 logs                                    | 18 MB      | 16 MB      | 8192 bytes | 1984 kB
 messages                                | 4688 kB    | 2152 kB    | 8192 bytes | 2528 kB
 sources                                 | 4688 kB    | 2152 kB    | 8192 bytes | 2528 kB
 reports                                 | 400 kB     | 304 kB     | 8192 bytes | 88 kB

With the patch:

Calculating -------------------------------------
 import report 50/50      3.468  (±57.7%) i/s -      5.477k in 1800.238053s

               table_name                |   total    |   index    |   toast    |  table  
-----------------------------------------+------------+------------+------------+---------
 reports                                 | 13 MB      | 1432 kB    | 8192 bytes | 12 MB

Edit: All tests were done in a single process (single thread) which generated all the JSON and also sent the requests. Obviously, a real-world scenario would saturate the database much more (we default to 6 worker Ruby processes, and big deployments have dozens of them), giving an even more dramatic performance increase.


Could you share the model? I assume it’s now just the reports table with some columns.

Have you thought about the API for plugins? I’d think you have something like a pipeline:

  • Report comes in (probably API)
  • Foreman does pre-processing
  • Plugins do pre-processing
  • Foreman does post-processing
  • Plugins do post-processing
  • Foreman stores it in the database

Then I wonder if we should have generic helpers or just treat it as an opaque blob for which you need a specific implementation (Puppet, Ansible, OpenSCAP) to query. That would also have UI implications. Because of that I’d recommend some base class that at least provides a basic abstraction to read a report, even if it’s just to remain compatible.
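
Something like this hypothetical sketch, just to fix the reading interface (class names and body keys are illustrative):

# Hypothetical base class; plugins subclass it and override the readers
# for their own JSON body layout, so the UI can stay generic.
class Report < ApplicationRecord
  def lines
    body.fetch('lines', [])
  end

  def metrics
    body.fetch('metrics', {})
  end
end

class OpenscapReport < Report
  # OpenSCAP could store rules instead of plain lines.
  def lines
    body.fetch('rules', [])
  end
end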

Sure, yes, indeed it’s simply a new column with the three tables dropped. The commit does not include the table drops yet; I just wanted to give it a quick try before I proceed with the bigger refactoring:

Very good design indeed, I like that not only for reports but also for facts (sometime in the future). Ideally the pipeline container would simply be a Ruby hash; plugins could then manipulate the data at will, and there would be some Foreman API metadata defining which columns are available to render on the Puppet report page. Or better, metadata inside each report - since it’s compressed and not big, it could work nicely.

I kinda fail to understand what’s in the middle of pre/post. I’d probably simplify this to (a sketch follows the list):

  • JSON comes in
  • a new “type” field determines which parser to use (or an improved scanner registry finds it)
  • the parser transforms the JSON into some standard format
  • Foreman validates and stores the JSON into DB
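
For example, with a hypothetical registry keyed by the incoming “type” field (all class names are illustrative):

# Detect the parser, transform to the standard format, then validate
# and store the JSON in the database.
class ReportImporter
  PARSERS = {
    'puppet'   => PuppetParser,
    'ansible'  => AnsibleParser,
    'openscap' => OpenscapParser,
  }.freeze

  def self.import(raw)
    parser = PARSERS.fetch(raw['type'])
    Report.create!(body: parser.new(raw).to_common_format)
  end
end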

Since OpenSCAP hacks in additional fields quite a bit, I am thinking about making the format very generic, per-report. Something like:

{
  "report": {
    "format": "puppet_v1",
    "columns": [
      {
        "name": "Message",
        "index": "0"
      },
      {
        "name": "Source",
        "index": "1"
      },
      {
        "name": "Level",
        "index": "2"
      }
    ],
    "data": [
      [
        "A message",
        "//some/source",
        "info"
      ],
      [
        "A message",
        "//some/source",
        "info"
      ],
      [
        "A message",
        "//some/source",
        "info"
      ]
    ]
  },
  "reported_at": "2018-01-15 17:31:36 521275",
  "metrics": {
    "time": {
      "total": 37
    }
  },
  "host": "io.local",
  "status": {
    "applied": 0,
    "failed": 0,
    "skipped": 0
  }
}
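
With the columns metadata embedded like this, rendering stays generic; a minimal sketch (the helper name is illustrative, note the indices are stored as strings):

# Build table headers and rows purely from the report's own metadata.
def report_table(report)
  columns = report.body.dig('report', 'columns')
  headers = columns.map { |c| c['name'] }
  rows = report.body.dig('report', 'data').map do |line|
    columns.map { |c| line[c['index'].to_i] }
  end
  [headers, rows]
end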

This sounds nice. If we break plugins with this patch, I believe it’s worth it.


Question for @Ondrej_Prazak - would that “metadata JSON” be sufficient for OpenSCAP? From what I see, it adds several columns and a bunch of them are searchable.

We could introduce one or two VARCHAR columns and allow plugins to define what content should be indexed there if search is absolutely necessary. Or simply design the abstract report table so that it contains some generic fields which can be easily reused.

Another observation: we have bitmasks for status operations, and it all looks great on paper (bit masks are efficient, aren’t they?) except for one “little” detail. They do not work in a SQL environment because a btree index cannot be used with them. It’s always a table scan no matter how you store them; it does not matter whether you use an integer type or the native PostgreSQL bit type - there is literally no index available for that.

I suggest simplifying this as well; the easiest way is to store only the highest status as an integer. This gives the ability to search for reports that have at least an error, at least a warning, etc. It’s not very useful to be able to search for reports that have debug messages - that’s probably all of them. This vastly simplifies the complex bit-array dance and it can actually leverage the btree index of a SQL database.
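
A sketch with hypothetical ordered levels:

# Only the highest level per report is stored, so a plain btree index
# on the integer status column makes range queries cheap.
STATUS = { 'debug' => 0, 'info' => 1, 'warning' => 2, 'error' => 3 }.freeze

# "reports with at least a warning" becomes a simple indexed query:
Report.where('status >= ?', STATUS['warning'])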

Alternatively, I suggest a clean and simple (pragmatic) approach: let’s create three columns i1, i2 and i3 as indexed integers so implementations can decide what to store in them and how to name them in scoped_search. Possibly the number of failures for Puppet, the number of passed or failed checks for OpenSCAP, and similarly for others.
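
Each plugin could then map a meaningful search name onto the generic columns, roughly like this (the subclass and the rename are illustrative):

# Hypothetical Puppet-flavored subclass exposing the generic i1 column
# as a "failed" field in scoped_search queries.
class PuppetReport < Report
  scoped_search on: :i1, rename: :failed, complete_value: true
end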

The metadata (columns) can be stored either with the report itself (which is nice and compact, it’s compressed anyway) or separately in the Ruby codebase. An upgrade to a newer version (new/changed columns) would then probably need an extra transformation, but those don’t change very often, so keeping it in the Ruby codebase is also a good option.

I like the idea of plugins adding their fields via API and that we count on plugins adding their own stuff. Being able to search in reports efficiently is vital for OpenSCAP, and there is always a possibility that we might introduce additional search definitions in the future, but I think we can add another generic field as a worst-case scenario.

From the OpenSCAP perspective, it’s important to be able to search by status of a specific rule, e.g. “give me all hosts failing on rule xyz”. A rule today is a log, equal to one resource in a Puppet report.

That’s the price we need to pay. Here is my suggestion for a generic solution: introduce one more text field t1 which would contain the names of all failed resources as plain text, one per line. Then the query would be as simple as SELECT * FROM reports WHERE type = 'OpenSCAP' AND i1 > 0 AND t1 LIKE '%xccdf_org.some_rule%';. This will use the indices on the type and status (i1) columns, and then it’s a table scan on the rest.

Other plugins could take advantage of that text column named t1. Alternatively, the column could be defined only by the OpenSCAP plugin, but I think it’s better to let other plugins leverage it rather than have multiple columns with the same information added later on.
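
Populating it would then be trivial at import time; a sketch assuming the parser hands over a failed_rules list:

# Store failed rule names one per line so the LIKE query above can
# scan them after the indexed type and i1 predicates narrow the set.
report.t1 = failed_rules.join("\n")
report.i1 = failed_rules.size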

A secondary solution is to implement a join table with a list of resources, similar to what I am getting rid of in the OpenSCAP plugin, but given the amount of problems we had with that, I suggest relying on a text field, because PostgreSQL can compress it and still provide (slower) search.

It’s a trade-off indeed, but when deciding, compare how many times Foreman needs to import a report (hundreds or thousands per hour) with how many times a user searches for such a report (a few times per hour). For me there is a clear winner - the SQL LIKE statement.

Is anyone against dropping the origin column from reports? I don’t understand why we keep both origin and the report type (which is stored in the type column). It sounds like the same information is stored twice; also, there is no index on origin anyway, so why would anyone use this column? @bastilian speak up now or never :wink:

Edit: One idea I have is that Report is the base class that does very little, while ConfigReport contains configuration stuff but really should be named PuppetReport. Ansible reuses ConfigReport, and for this reason the extra origin field was introduced. My take (a sketch follows the list):

  • Rename ConfigReport to PuppetReport and only keep puppet-related stuff.
  • Make Report to be really the abstract class for all other report implementations.
  • Refactor Ansible to use Report instead of ConfigReport.
  • Other plugins as well
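
In code, the target hierarchy would roughly be plain Rails STI on the existing type column (subclass names beyond ConfigReport are illustrative):

# STI stores the subclass name in the type column, which is what
# makes the separate origin column redundant.
class Report < ApplicationRecord    # abstract base, does very little
end

class PuppetReport < Report         # renamed from ConfigReport
end

class AnsibleReport < Report        # no longer reuses ConfigReport
end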

Not really against it, but it is quite nice to have a visual differentiator if you have multiple solutions in place, like Puppet for baselining and Ansible for application deployment. Perhaps it’s not the most common scenario, and reports from other plugins are already separated in a different way, so we do not lose much. Also, the icon can perhaps be tied to something else to keep this small feature.


I was not suggesting removing that icon at any point. The reason the column was introduced is that the ConfigReport model was purely a Puppet report. Then we extracted this into the Report class, but plugins kept using ConfigReport, and I think origin was introduced as a workaround to differentiate config reports.

What I am aiming for is a complete refactoring of reporting. Since I am breaking plugins with this anyway, I suggest that all plugins now inherit from the Report model, where type determines the report kind. Thus we could get rid of origin.
