RFC: Optimized reports storage

I am updating the master-plan just a little:

  • A new model class, HostReport, is created. This gives a better upgrade experience: new tables can be migrated, data can be transformed, and the legacy Report and ConfigReport tables can be dropped afterwards. The API will be completely different anyway, and plugins will no longer deal with the model directly.
  • Alternatively, the current report import endpoint can be modified to create both reports simultaneously during a development period, until the migration is implemented.
  • All STI and subclasses, including ConfigReport, are removed. No STI is used in the new design.
  • Report content is stored in a new body field as plain text (JSON), which the PostgreSQL server compresses by default.
  • Report origin is kept in a new type field (Puppet-9, Ansible, OpenSCAP). Puppet-9 stands for report format V9, which is compatible with V10. Older versions would not be supported. There would also be an Unknown report type, which would simply store the report as-is without further processing.
  • A body_version field (int) lets plugins detect the body format for backward compatibility. Each body version should have a dedicated view transformation class so that even old reports display correctly.
  • The status column is converted from integer to a 64-bit bigint.
  • StatusCalculator is kept and extended to use all 64 bits.
  • Plugins decide for themselves how to store data in the body field so that it is presentable and searchable. Plugin authors should be discouraged from complex (and slow) transformations, though; transformation during view should be encouraged instead.
  • A new processing pipeline API will discourage plugins from accessing the model directly:
    • A new report comes in
    • Foreman detects the origin
    • Foreman creates an instance of a plugin input transformation class
    • Report body (as Ruby hash) is passed into the class
    • Plugin performs transformation (hash-in - hash-out + status + keywords)
  • The same transformation is done during data migration.
  • For displaying reports, a similar pipeline is available:
    • Report is loaded for display
    • Foreman creates an instance of a plugin view transformation class
    • Report body (as Ruby hash) is passed into the class
    • Plugin performs transformation (hash-in - hash-out)
    • Data is passed into views (ERB, RABL) for final display
  • A new ReportKeyword model (id: int, report_id: int, name: varchar) is created so plugins can create an arbitrary number of keywords associated with the Report model (M:N).
  • Example keywords:
    • PuppetHasFailedResource
    • PuppetHasFailedRestartResource
    • PuppetHasChangedResource
    • AnsibleHasUnreachableHost
    • AnsibleHasFailedTask
    • AnsibleHasChangedTask
    • ScapHasFailedRule
    • ScapHasOtheredRule
    • ScapHasHighSeverityFailure
    • ScapHasMediumSeverityFailure
    • ScapHasLowSeverityFailure
    • ScapFailure:xccdf_org.ssgproject.content_rule_ensure_redhat_gpgkey_installed
    • ScapFailure:xccdf_org.ssgproject.content_rule_security_patches_up_to_date
  • It is completely up to plugin authors which set of keywords they will generate.
  • Keyword generation can be configurable; for example, the OpenSCAP plugin can have a list of allowed rules to report (the most important ones).
  • The key is to keep the number of keywords at a reasonable level; for example, OpenSCAP should not create ScapPassed:xyz keywords because there would be too many of them.
  • Searching is supported via:
    • Indexed keywords (e.g. origin = scap and keyword = ScapHasHighSeverityFailure, or simply the keyword alone, since keyword will be the default scoped_search field)
    • Full text in body (slow, but it works for finding a particular line)
  • The index page (search results) also shows the number of failures, changes, etc. (using StatusCalculator).
  • All searches should by default be scoped to a reasonable time frame (the last week) so the SQL server can use the index on the “reported_at” column to narrow the rows and then quickly scan the remainder.
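
To make the import and display pipelines from the list above more concrete, here is a rough plain-Ruby sketch (no Rails, no database). Everything named here is an assumption for illustration only: the class names, the `call` interface, the `detect_origin` heuristic, the shape of the Ansible hash, and the status bit layout (which is not the existing StatusCalculator layout). A plain hash stands in for a HostReport row.

```ruby
require "json"

# Hypothetical return value of an input transformation:
# hash-out + status + keywords.
TransformResult = Struct.new(:body, :status, :keywords, keyword_init: true)

# Hypothetical plugin input transformation for Ansible reports.
class AnsibleInputTransformation
  BODY_VERSION = 1

  def call(raw)
    tasks   = raw.fetch("tasks", [])
    failed  = tasks.count { |t| t["status"] == "failed" }
    changed = tasks.count { |t| t["status"] == "changed" }

    keywords = []
    keywords << "AnsibleHasFailedTask"  if failed > 0
    keywords << "AnsibleHasChangedTask" if changed > 0

    # Assumed layout: failed count in bits 0..15, changed count in bits 16..31.
    # Counts are clamped so one field cannot spill into the next.
    status = [failed, 0xFFFF].min | ([changed, 0xFFFF].min << 16)

    TransformResult.new(body: raw, status: status, keywords: keywords)
  end
end

# Unknown type: store the report as-is without further processing.
class UnknownInputTransformation
  BODY_VERSION = 1

  def call(raw)
    TransformResult.new(body: raw, status: 0, keywords: [])
  end
end

# Hypothetical view transformation for Ansible body version 1.
class AnsibleViewTransformationV1
  def call(body)
    failed = body.fetch("tasks", []).select { |t| t["status"] == "failed" }
    { "failed_tasks" => failed.map { |t| t["name"] } }
  end
end

# Fallback view: hand the stored body to the views unchanged.
class PassThroughViewTransformation
  def call(body)
    body
  end
end

INPUT_TRANSFORMATIONS = { "Ansible" => AnsibleInputTransformation }.freeze
VIEW_TRANSFORMATIONS  = { ["Ansible", 1] => AnsibleViewTransformationV1 }.freeze

# Naive origin detection, purely for the sketch.
def detect_origin(raw)
  raw.key?("tasks") ? "Ansible" : "Unknown"
end

# Import: detect origin, instantiate the plugin class, pass the hash in.
def import_report(json_text)
  raw    = JSON.parse(json_text)
  origin = detect_origin(raw)
  klass  = INPUT_TRANSFORMATIONS.fetch(origin, UnknownInputTransformation)
  result = klass.new.call(raw)
  { type: origin, body_version: klass::BODY_VERSION,
    body: JSON.generate(result.body),
    status: result.status, keywords: result.keywords }
end

# Display: pick the view class by (type, body_version), pass the hash in.
def display_report(row)
  key   = [row[:type], row[:body_version]]
  klass = VIEW_TRANSFORMATIONS.fetch(key, PassThroughViewTransformation)
  klass.new.call(JSON.parse(row[:body]))
end
```

Since the view class is looked up by both type and body_version, a plugin that later changes its body format would only need to register another class for the new version, and old rows keep rendering with the old one.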

For the record, the keyword proposal assumes there is a finite number of SCAP rules, which is indeed the case:

$ grep -hoP 'Rule id="\K\w+"' /usr/share/xml/scap/ssg/content/*xml | sort -u | wc -l
1649
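
With a bounded rule set like this, keyword generation with a configurable allow-list could look roughly like the sketch below. The input shape (hashes with "rule", "result" and "severity" keys), the method name and the constant names are assumptions made up for illustration. A real plugin would read the allow-list from its settings.

```ruby
require "set"

# Hypothetical allow-list of rules that get their own ScapFailure:<rule> keyword.
ALLOWED_SCAP_RULES = Set[
  "xccdf_org.ssgproject.content_rule_ensure_redhat_gpgkey_installed",
  "xccdf_org.ssgproject.content_rule_security_patches_up_to_date"
].freeze

SEVERITY_KEYWORDS = {
  "high"   => "ScapHasHighSeverityFailure",
  "medium" => "ScapHasMediumSeverityFailure",
  "low"    => "ScapHasLowSeverityFailure"
}.freeze

# Builds the keyword list for one report. Only failures produce keywords,
# so passed rules never inflate the keyword table.
def scap_keywords(results, allowed: ALLOWED_SCAP_RULES)
  keywords = Set.new
  results.each do |r|
    next unless r["result"] == "fail"

    keywords << "ScapHasFailedRule"
    severity_keyword = SEVERITY_KEYWORDS[r["severity"]]
    keywords << severity_keyword if severity_keyword
    # Per-rule keywords are emitted only for allow-listed rules.
    keywords << "ScapFailure:#{r["rule"]}" if allowed.include?(r["rule"])
  end
  keywords.to_a
end
```

In this sketch the worst case per report is a handful of summary keywords plus one keyword per allow-listed rule, regardless of how many rules the scanned profile contains.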