RFC: Whitelisting facts for better overall performance

Hello,

Foreman as of today stores all facts in the database in quite expensive way. Fact import is very slow, facts and values takes lot of resources in the database and fetching all facts involves several joins. I think we can do better.

The idea is to store facts in the JSON format in TEXT/BLOB column in the database and then store copies of some (whiltelisted) facts in searchable form as we do today. We would keep via some kind of whitelist with some sane default like cpu, memory, interfaces and couple of others. This will end up with two data sources:

  • Blob of the whole JSON (for all types like puppet, ansible, rhsm). This is available everytime user or ENC script requests all facts for a particular host.
  • Selected facts for use in searching hosts in UI/API/CLI.

It can be confusing why a fact cannot be found via search functionality, therefore we would present a message like “Can’t find facts? Make sure they are whitelisted.” in the UI/CLI to explain this.

The whitelist would not be another setting, I am thinking leveraging template renderer for this - a template which renders to list of regular expressions for each fact type. A plugin could easily add new template with its default whitelist, or we could keep all of the in community-templates repo.

We discussed this with @Shimon_Shtein today and this looks like a good approach that leverages what we already have (fact parsers, scoped search) but in a way that this is usable both for small and large scale deployments.

1 Like

To add to the idea, for speeding up the process of fact insertion, we can offload the parsing to an asynchronous task. This way we both spend less resources on application server (it will only be responsible to store the facts in the blob) and move resource consuming tasks to a proper place.

I like the idea.

But from some customer environments I know they care much more about some custom facts then the ones I would normally expect, so whitelisting should be easy (I think this is achievable with the template renderer), easy to find and well documented.

Is this a good use case for the future documentation style / the technical writers? If we have this documentation the error message should point to it.

And a technical question: You plan to use not only one template which then would perhaps needs customization and tends to diverge from upstream, but multiple templates which all add to the rendering, correct?

I think this is an awesome idea!! It could be a “setting” in the admin interface that takes a list of facts to whitelist (in addition to the sane defaults).

I think storing the blob could be interesting, but currently the ENC script we use in Puppet uploads facts and then retrieves the ENC itself which can be based on facts. If it’s done with background processing then we need to consider polling for job completion. In the API sense it would mean returning a HTTP 202 Accepted with a URL for the job it can poll.

It does make me wonder how performant the JSON backend in PostgreSQL would be. It could be good enough instead of the normalized column structure we have today. If we’re considering dropping MySQL and deprecating sqlite to just building, that could be a very interesting option and avoid the entire need for this whitelist.

I would like to try hard making the import process fast so we would not need this. Because if the process is not fast enough, we will start stacking import requests somewhere in a queue and we actually gain nothing but user confusion (why facts are not updated, oh a process died). Direct import has one advantage - it’s either imported or not, you immediately see it and if it failed, a managed node will come back in 30 minutes usually.

Good point. Custom facts need to be added and we must make this as smooth as possible.

I think we already do this at various places, in this case we can do this and more: direct message in the UI like “Are you missing some facts? Add them to this list.”

This is just an idea, not a particular plan so far. But if we agree it’s a good plan, I would like to work on that because we have huge customers with insanely big databases just because loads of facts they mostly don’t use at all.

To your question - what I would see as a good approach could be a single (locked) template per fact type (Puppet, Ansible…). An example “Puppet fact filter”:

<%#
 This template renders to the list of regular expressions, one each line.
 During fact import, all fact names are matched against the list and
 allowed to be imported into fact table. The rest is only stored as blob,
 still available to ENC but not searchable.
 The template is cached, takes up to 30 minutes for the changes to take effect.
%>
^(is_)?virtual$
^ipaddress$
^operatingsystem$
^kernel$
^blockdevice_
# etc etc
<%= snippet_if_exists("Puppet fact filter custom") %>
%>

Users can define their own custom template which is concatenated with the existing one. The same for Ansible, RHSM and other plugins which can easily seed their own templates (see discovery plugin how to seed own template).

Then the importing process will render this template, build one huge regular expression and put it into Rails cache so we don’t render on EVERY request - 30 mins default is my idea. And the regexp is then use to filter facts which will be put into fact names and values tables. The rest of the process is unchanged.

In addition to that, unchaned JSON is also stored in a BLOB/TEXT column and this one is used for the Host.facts_as_hash call which is used by the ENC or by the Facts button in the UI/CLI. It will always provide all facts regardless of filters. So Puppet is not affected.

Now, this has one flaw - Foreman provide Smart Parameters which relies on the facts being imported if I understand this feature correctly. We need to either warn users that facts must be allowed for use with Smart Parameters or refactor the code to use facts_hash.

Generally, we want less settings and the list can easily get out of control for some customers, templates are I think better fit.

We considered this option, however the biggest problem is Scoped Search - it is heavily built around traditional SQL. In the end, storing blob + some extra fields in the traditional way is not very different from native JSON which also creates indices but at the cost of complex querying - Rails is not yet ready for native JSON (there is some progress but it’s not what you’d expect).

But it is important to know that implementing it in this simple (and easily achievable) approach does not mean we can’t switch to native JSON in some future.

I am interested in reading more opinions about that.

I wonder if we could work internally on the JSON column and have a limited number of indexed facts which can be searched for.

It’s also good to note that on the Puppet side (or perhaps facter side) the unstructured facts are deprecated (ipaddress_$interface etc). Puppet 7 will drop them in favor of the structured ones (networking which has nested arrays). That will also reduce the number of (top level) facts.

I am not sure what exactly mean, but what I am thinking is something we can easily build today. The approach I am presenting is basically adding a whitelist and blob leveraging everything we have today. I see no added value in using PostgreSQL JSON if that’s what you mean other than it’s stored in a different way.

So for Foreman this is like 3 years of support for old deployments. We can’t ditch them today.

What I mean is that if you need a fact, you retrieve it from the JSON column. That would be the source of truth. Since it’s a since column value replacement, it wouldn’t need backend processing and always be up to date. Then you would have a background process that updates the “index” (non-normalized values) so scoped search could find them. This could still be a manually maintained list.

We can detect Facter 3 and if so, drop all known deprecated facts. They are redundant in Facter 3 and can all be found in other places. This should also better prepare us for Facter 4. Note that Foreman internally also uses deprecated facts (osfamily instead of os.family, hardwareisa instead of os.hardware etc). To give an idea of the difference:

$ facter --json | jq 'keys|length'
23
$ facter --json --show-legacy | jq 'keys|length'
123

We also require Puppet 4 starting 1.23 so other than some edge cases (non-AIO Puppet with Facter 2), all Puppet agents should be reporting in with Facter 3. We can drop support for Facter 2 facts with a proper deprecation period. That’s not 3 years, that’s 1 or 2 releases.

1 Like

Indeed a good idea, we can probably provide whitelists for Puppet legacy and modern (structured) facts.

This is not correct comparison, the query only takes first level into account. One idea is probably to count “:” characters which represents “key-value” pair:

[root@rc ~]# facter --json --show-legacy | fgrep -o : | wc -l
492
[root@rc ~]# facter --json | fgrep -o : | wc -l
369

Still it’s something we should definitely take into consideration, however there is one important thing that blocks ditching legacy facts - that is the Foreman Discovery Image. It is based on puppet-based facter (2.x) and AFAIK there were some packaging issues in this regard.

Edit: FDI is based on CentOS7 + EPEL and that only provides Puppet 2.x.

This has been a show stopper many times. Puppet is the only cfgmgmt that has this AFAIK. I know there were some users of this feature. But it feels like less known feature and potentially something, we could remove to simplify. If we feel strongly about keeping it, we should document the use case or at least blog about it.

Users can disable this (--puppet-server-foreman-facts false). Foreman :: Manual also documents the use to upload facts outside of the ENC.

The tricky thing with removal is that it always behaved this way so users might not even be aware they’re relying on it. For example, I think that this fact upload is actually the way we register new hosts and various users have used this to migrate.

Also note that if we move it to backend processing we can still poll for the result if users prefer this.

Summary:

  • The feedback is positive.
  • Alternative approach of background processing was suggested, but there are some concerns around ENC and stacked jobs.
  • We must define good set of whitelisted facts.

Please keep the comments coming.