RFC: Optimized reports storage

If you’re storing a report as a blob, then you need some way to describe the format. https://puppet.com/docs/puppet/latest/format_report.html describes the Puppet report and as you can see it has a report_format parameter. This will allow application code to know if it can even render it. For example, it might refuse to render an old report format. You can also merge the source and the format by defining the origin as Puppet-10 but I don’t know if that makes it easier.


Before I add it, is it really a real-world use case to search for reports with a particular rule that passed?

We have that search but I do not have data on how much it is used.

Ok, I’ve spent the whole Friday thinking, drawing, talking and walking around the house trying to find the best solution. With help from Ewoud, Marek, Ondra and others I hopefully have the master plan:

  • The model that will be kept is Report.
  • All STI and subclasses, including ConfigReport, are removed.
  • Report contents stored in new field body as plain text (JSON).
  • Plugins can store arbitrary contents in body, however backward compatibility must be taken into account (a new version must keep working with reports stored by old versions).
  • Report type is kept in the origin field (Puppet-9, Ansible, OpenSCAP). Puppet-9 stands for report format V9, which is compatible with V10. Older versions would not be supported. There would be an Unknown report type which would simply store the report as-is without further processing.
  • Status column converted from integer to 64-bit bigint.
  • StatusCalculator is kept and extended to use all 64bits.
  • Generic non-indexed* boolean columns b1, b2 and b3 available for use by plugins.
  • Plugins decide themselves how to store data in the body field in such a way that it’s usable and searchable.
  • New processing pipeline as described by @ekohl extensible by plugins.
  • Plugins define metadata via new sub-class of ReportMetadata which captures mapping of fields, processing hooks and transformation, searching fields and other details.
  • Plugins define its own ERB template and path to it in metadata to use when displaying detail page as well as index page.
  • Puppet metadata: b1 = failed resources, b2 = failed restart resources, b3 = changed resources
  • Ansible metadata: b1 = unreachable host, b2 = failed tasks, b3 = changed tasks
  • OpenSCAP metadata: b1 = failed rules, b2 = othered rules, b3 = passed rules; rules in searchable form in the body (e.g. F:rule.name.abc for “failed” rules)
  • No data upgrade in migration (when needed, reports must be exported and imported back via a special rake task before and after the upgrade process).
  • Searching is supported for reports which have failures, changes, passed rules etc.
  • Index page (search result) also shows the number of failures, changes etc. (using StatusCalculator).
  • All searching should by default be scoped to a reasonable time frame (last week) so the SQL server can quickly use the index on the “reported_at” column and do a quick table scan for the rest.
  • PostgreSQL partitioning will be investigated in order to make report expiration even faster.
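
To make the metadata bullet above more concrete, here is a rough Ruby sketch of what a plugin’s metadata subclass could look like. The class name, method names and template path are all hypothetical illustrations, not an existing Foreman API; only the idea of mapping b1–b3 and pointing at an ERB template comes from the plan above.

```ruby
# Hypothetical sketch; in the real design this would subclass ReportMetadata.
class PuppetReportMetadata
  # Origin values this metadata handles (format V9 is compatible with V10).
  def origins
    ["Puppet-9", "Puppet-10"]
  end

  # Plugin-specific meaning of the generic boolean columns b1..b3.
  def boolean_columns
    {
      b1: :failed_resources,
      b2: :failed_restart_resources,
      b3: :changed_resources,
    }
  end

  # Path to the ERB template used for the detail page (made-up path).
  def detail_template
    "puppet_reports/show.html.erb"
  end
end
```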

I believe this represents the best user experience with the most effective storage, and performance unmatched by what we have today.

Legend:

  • * an index on a boolean column is very likely not to be used, and its presence would slow down updates

For the record, I was working today on a couple of optimizations of the current design, with promising results (10x improvement without much change to the schema).

I am updating the master plan just a little:

  • A new model class is created: HostReport. The reason is that this gives a better upgrade experience: new tables can be migrated, data can be transformed, and the legacy Report and ConfigReport tables can be dropped afterwards. The API will be completely different anyway; plugins will no longer deal with the model directly.
  • Alternatively, current report import endpoint can be modified to create two reports simultaneously during some development period until migration is implemented.
  • All STI and subclasses, including ConfigReport, are removed. No STI is used in the new design.
  • Report content is stored in a new field body as plain text (JSON), which is compressed by default by the Postgres server.
  • Report origin is kept in a new type field (Puppet-9, Ansible, OpenSCAP). Puppet-9 stands for report format V9, which is compatible with V10. Older versions would not be supported. There would be an Unknown report type which would simply store the report as-is without further processing.
  • A field named body_version (int) is available for plugins to detect the body format for backward compatibility. For each body version there should be a dedicated view transformation class, so even old reports can be viewed flawlessly.
  • Status column converted from integer to 64-bit bigint.
  • StatusCalculator is kept and extended to use all 64bits.
  • Plugins decide themselves how to store data in the body field in such a way that it’s presentable and searchable. Plugin authors should be discouraged from complex (and slow) transformations at import time, though; transformation during view should be encouraged instead.
  • New processing pipeline API will discourage plugins from accessing the model directly:
    • New report comes in
    • Foreman detects the origin
    • Foreman creates an instance of a plugin input transformation class
    • Report body (as Ruby hash) is passed into the class
    • Plugin performs transformation (hash-in - hash-out + status + keywords)
  • The same transformation is done during data migration.
  • For report displaying, similar pipeline is available:
    • Report is loaded for display
    • Foreman creates an instance of a plugin view transformation class
    • Report body (as Ruby hash) is passed into the class
    • Plugin performs transformation (hash-in - hash-out)
    • Data is passed into views (ERB, RABL) for final display
  • A new model ReportKeyword(id: int, report_id: int, name: varchar) is created so plugins can create an arbitrary number of keywords which are associated with the Report model (M:N).
  • Example keywords:
    • PuppetHasFailedResource
    • PuppetHasFailedRestartResource
    • PuppetHasChangedResource
    • AnsibleHasUnreachableHost
    • AnsibleHasFailedTask
    • AnsibleHasChangedTask
    • ScapHasFailedRule
    • ScapHasOtheredRule
    • ScapHasHighSeverityFailure
    • ScapHasMediumSeverityFailure
    • ScapHasLowSeverityFailure
    • ScapFailure:xccdf_org.ssgproject.content_rule_ensure_redhat_gpgkey_installed
    • ScapFailure:xccdf_org.ssgproject.content_rule_security_patches_up_to_date
  • It is completely up to plugin authors which set of keywords they will generate.
  • Keyword generation can be configurable, for example OpenSCAP plugin can have a list of allowed rules to report (the most important ones).
  • The key is to keep the number of keywords at a reasonable level; for example, OpenSCAP should not create ScapPassed:xyz keywords because there would be too many of them.
  • Searching is supported via:
    • Indexed keywords (e.g. origin = scap and keyword = ScapHasHighSeverityFailure or simply just the keyword which will be the default scoped_search field)
    • Full text in body (slow, but this will work for searching for a particular line)
  • Index page (search result) also shows the number of failures, changes etc. (using StatusCalculator).
  • All searching should by default be scoped to a reasonable time frame (last week) so the SQL server can quickly use the index on the “reported_at” column and do a quick table scan for the rest.
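
The import pipeline described above could be sketched roughly like this. The class name, the Result struct and the incoming hash keys are assumptions based on the bullets (hash-in, hash-out plus status and keywords), not real Foreman code:

```ruby
# Sketch of a plugin input transformation: hash in, hash out + status + keywords.
class AnsibleInputTransformation
  Result = Struct.new(:body, :status, :keywords)

  def transform(hash)
    failed      = hash.fetch("failed", 0)
    changed     = hash.fetch("changed", 0)
    unreachable = hash.fetch("unreachable", 0)

    # The plugin decides which keywords to emit; these names come from the
    # example list above.
    keywords = []
    keywords << "AnsibleHasFailedTask"      if failed > 0
    keywords << "AnsibleHasChangedTask"     if changed > 0
    keywords << "AnsibleHasUnreachableHost" if unreachable > 0

    # Status packing into the bigint is left to StatusCalculator; here we
    # just pass the raw counters along with the (possibly rewritten) body.
    Result.new(hash, { failed: failed, changed: changed }, keywords)
  end
end
```

The same class could be reused during data migration, since the contract is a plain hash on both sides.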

For the record, the keyword proposal assumes there is a finite number of SCAP rules, which is indeed the case:

$ grep -hoP 'Rule id="\K\w+"' /usr/share/xml/scap/ssg/content/*xml | sort -u | wc -l
1649

Thanks for the writeup @lzap, a few comments/questions inline:

Changing all existing integration points for report importing will be difficult. In fact, we are still maintaining the /reports endpoint because that’s where older versions of Ansible were configured to send their reports; I expect that is true for additional sources.

Do all sources provide JSON output?
Also, do we want to enable searching inside the JSON? If so, it may make sense to store a jsonb with a proper index.

Why is this needed? Do we actually get a significant benefit from using a bit field for the statuses over using strings (or some sort of tags if we want to have multiple ones)?

Which sources should be handled in plugins and which should be handled in core? Right now we have puppet in core and plan to keep it that way. Perhaps this is something that should be all in core? I.E. Foreman knows out of the box how to handle reports from multiple sources, without requiring additional plugins (which usually adds more functionality and complexity besides report processing) to set up.

Some of these seem very similar to statuses, could you elaborate on how keywords are different and why we need both?

With 10k hosts reporting in every 30 minutes (the default for Puppet), this means over 3 million reports per week. Even if we limit search to a week, we should never have a search use a full table scan, but rather make sure all searchable fields are properly indexed.

The input format must not change, therefore this is only a matter of an “alias”; a single controller can handle multiple URLs.

Actually, I said JSON but I did not mean it. The body column is arbitrary text; plugins can use whatever fits their needs, as long as it’s text.

I did my research: JSON/JSONB types are actually uncompressed and slower than arbitrary text. JSON searching is slow overall and something I specifically want to avoid in my design.

Numbers have some benefits over strings, yeah. You can compare them numerically :wink: Granted, this status number cannot be indexed, so it’s always a table scan. Which is not a huge problem; keywords (described below) and host ID can be used to make the scanned set small enough.

In my previous proposals, I was trying to find a solution using “unnamed” columns like t1 or t2 where every plugin would store its own data. StatusCalculator is built around Puppet: applied restarted failed failed_restarts skipped pending. It currently uses about 5 of the 8 bytes available in a bigint; the maximum value per status is currently 63.

If you think it is time to get rid of this weird thing, I am more than happy to kill the status calculator. One way would be to create INT columns for selected (or all?) statuses, naming them in a more generic way, perhaps something like: success, warning, error, skipped. Each plugin would map its own meaning.

Alternatively, these could become keywords too: PuppetHasFailures2, PuppetHasFailures3-10 (sort of intervals; we don’t want to create a keyword for every single number).

The reason why I propose to keep the status calculator is that the code is there and it’s supported in scoped_search, so we would not break existing queries. I just want to bump the number of bits per status from 6 to 10 so the whole 64 bits of the bigint are utilized (1024 possible values per status).
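
The 6 → 10 bit bump can be illustrated with a small standalone sketch. This mirrors the idea behind Foreman’s StatusCalculator (several counters packed into one bigint) but is a toy, not the actual implementation:

```ruby
BITS = 10
MAX  = (1 << BITS) - 1  # 1023, the cap per status with 10 bits

# Pack an ordered list of counters into one integer, 10 bits each.
# Six statuses x 10 bits = 60 bits, which fits into a 64-bit bigint.
def pack(counters)
  counters.each_with_index.reduce(0) do |acc, (count, i)|
    acc | ([count, MAX].min << (i * BITS))
  end
end

# Extract the i-th counter back out of the packed integer.
def unpack(status, i)
  (status >> (i * BITS)) & MAX
end

status = pack([2, 0, 1])  # e.g. applied=2, restarted=0, failed=1
unpack(status, 0)         # => 2
unpack(status, 2)         # => 1
```

Note the cap: any counter above 1023 is clamped, which is the same kind of trade-off the current 6-bit scheme makes at 63.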

I’d say just Puppet until it’s extracted? There will be “the default” implementation that will store the input as-is and present it as plain text on the view page. Type: unknown.

Keywords are for searching only; they have an index so the database can quickly pull reports with given keywords. As I said, statuses are numbers, which allows you to search for an exact number of failures. Secondly, the status column allows rendering of the index page/table; each report has its status counts shown:

[root@sat68 ~]# hammer report list
---|---------------|---------------------|---------|-----------|--------|------------------|---------|--------
ID | HOST          | LAST REPORT         | APPLIED | RESTARTED | FAILED | RESTART FAILURES | SKIPPED | PENDING
---|---------------|---------------------|---------|-----------|--------|------------------|---------|--------
1  | sat68.nat.lan | 2020/09/29 09:44:01 | 0       | 0         | 0      | 0                | 0       | 0      
---|---------------|---------------------|---------|-----------|--------|------------------|---------|--------

Granted, this could only be stored in body, which will probably make the index page a little bit slower, but since we only render a finite number of reports on a page or in a hammer list, it’s not a big deal.

Every update to a table with 10 indices ultimately leads to 10 index updates, which is a relatively expensive operation. With 10k hosts reporting every 30 minutes, the database does a lot of work. Now, compare that to how often you search for a report. Once a day? Ten times a day? My point is, this whole design is optimized for report storing, not searching. Thus my ultimate goal is to have as few indices as possible.

That said, there is the concept of keywords, which are indeed indexed and will give instant results. Users will be able to quickly find reports that have failures, errors, whatever; in the case of OpenSCAP even failed rules. It depends on what keywords plugin authors create.

Then there’s scoped search for status, which is a table scan over a bitmap today; I propose to keep that behavior (or remove it completely). This does not change.

Finally, today reports allow searching for an exact line in messages or sources. You have to search for the whole line, e.g. “Applied catalog in 0.39 seconds”; only then it’s a hash lookup + join. If you try to do a SQL LIKE search, it’s again a full table scan today. This proposal does not change that; my goal is definitely not to introduce some fulltext search for reports. This is the only “downgrade” in my proposal, which is not a problem: you never search for that whole line, and if you really, really need this, you can do a table scan over body with a reported_at limit.
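
Put together, a keyword search with the default one-week scope could translate to roughly the following query. The table and column names (reports, report_keywords, reported_at) follow the proposal and are not a final schema; this is a plain-Ruby sketch, and real code would of course use bind parameters instead of string interpolation:

```ruby
require "date"

# Build the SQL a keyword search might translate to: an indexed join on
# report_keywords, narrowed by the reported_at time scope so the planner
# only scans recent rows.
def keyword_search_sql(keyword, since: Date.today - 7)
  <<~SQL
    SELECT reports.*
    FROM reports
    JOIN report_keywords ON report_keywords.report_id = reports.id
    WHERE report_keywords.name = '#{keyword}'
      AND reports.reported_at >= '#{since}'
  SQL
end

puts keyword_search_sql("ScapHasHighSeverityFailure")
```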

I am putting together some notes from Tomer’s response and my discussion with Marek. Here are some open questions:

  • Do we want to kill StatusCalculator (the bitfield status column)? It is clunky indeed; do users really need to make queries like “all reports that have 4 or more errors”? Isn’t just “reports with errors” enough?
  • The “version” column is probably not needed; the version can be stored in the body itself if needed.
  • Reports should contain a proxy_id reference identifying the proxy the report came from. This is probably the only reason the OpenSCAP plugin would need to extend the model; it can be useful for other plugins as well.
  • There could be a generic set of keywords without “Puppet” or “Ansible” prefixes: HasFailure or HasWarning, which would search across types. There is no way to force plugins to create those keywords, so I think this could actually confuse users who would not find all records.
  • To what degree can we break search syntax? The purpose of this refactoring is to vastly improve performance by storing reports in a different way; searching will change for sure. Here is a list of hopefully all possible search conditions:
| Old condition | New condition | Indices used | Comment |
|---|---|---|---|
| host = xyz | host = xyz | yes | |
| taxonomy | taxonomy | yes | Both organization and location untouched |
| message = "Applied catalog in 0.39 seconds" | - | yes | This will not work anymore (*) |
| source = "//blah/blah/blah.pp" | - | yes | This will not work anymore (*) |
| environment = xyz | ? | yes | Although I don’t see this reference in the schema anymore, is this something that should work? |
| hostgroup = xyz | ? | yes | Although I don’t see this reference in the schema anymore, is this something that should work? |
| applied > 1 | keyword = PuppetHasApplied | no | Only if we choose to keep StatusCalculator or replace it with regular columns. The same for restarted, failed, skipped, pending etc. |

(*) we can either make it an error, which I prefer, or modify the code in a way that it performs a table scan instead

For the record, here are statuses (metrics) which are available for the three major report types:

| Type | Metrics |
|---|---|
| Puppet | applied restarted failed failed_restarts skipped pending |
| Ansible | ok change unreachable failed skipped rescued ignored |
| OpenSCAP | passed othered* failed |

*othered - one of: pass fail error unknown notapplicable notchecked notselected informational fixed

Finding common terms without confusing the user is a challenge; keeping the bitmask might be a good idea.
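
One way to reconcile these metric sets would be a per-origin mapping onto the generic names floated earlier in the thread (success, warning, error, skipped). The concrete choices below are purely illustrative, not a decided scheme:

```ruby
# Illustrative mapping from generic metric names to each origin's native
# metric; nil means the origin has no counterpart for that slot.
METRIC_MAP = {
  "Puppet"   => { error: :failed, warning: :failed_restarts, success: :applied, skipped: :skipped },
  "Ansible"  => { error: :failed, warning: :unreachable,     success: :ok,      skipped: :skipped },
  "OpenSCAP" => { error: :failed, warning: :othered,         success: :passed,  skipped: nil },
}.freeze

def native_metric(origin, generic)
  METRIC_MAP.fetch(origin)[generic]
end
```

The awkward cells (is unreachable a warning? is othered a skip?) are exactly the confusion the sentence above worries about.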

As much as I would like to do small patch(es) and retain plugin compatibility, this is not possible, and I need to create a brand new model that can co-exist with the current Report and ConfigReport. STI in the new design does not make sense; a report will simply be a body (text) with a type and classes that handle import/view actions.

I will start with the ReportTranscript model name for now; if you have a better idea, let me know. I want to be able to work on the PR(s) alongside the current model. There will be a modernized API and a completely new UI and CLI around it, and the mentioned /report and /config_report endpoints will eventually be refactored to drop the old models and make use of the new ones.

I want to keep StatusCalculator, which has some drawbacks but on the other hand allows plugins to work with their own sets of statuses. For searching there will be keywords anyway. The version column is not needed; I will store proxy_id in the new model.

@tbrisker can you show me where we have this? I can’t find anything relevant in routes.rb.

For the record, I am creating a tracking redmine ticket for this task. I am gonna start working on this pretty soon:

https://projects.theforeman.org/issues/31142

Do you think it is a good time to deprecate this endpoint for the next release and remove it in the release after that, when I deliver the refactoring?

Perhaps we can even include a deprecation in 2.2, even if we only remove it in 2.4.

Unfortunately, that endpoint has been deprecated for several years already. However, older versions of Ansible (until 2.10, if I’m not mistaken) ship with a callback pointing to it. The same was also true for the Chef handler, and potentially people who installed our Puppet ENC script many years ago. That is the reason I didn’t remove it with the rest of the /reports API endpoints.

I will keep the endpoint then and fix Ansible.

Ansible is fixed already since https://github.com/ansible/ansible/pull/64955, but that only affects Ansible 2.10 and newer. Unfortunately many distros ship older versions by default; for example, EPEL 7 has 2.9 and Debian 10 has 2.7, so we won’t be able to drop this endpoint until they go EOL, which is several years away.


Note that the Ansible callback using this API now lives in the theforeman.foreman Ansible collection, and we’ll start deploying it by default soon (if we don’t already). We don’t have to worry about that much.