RFC: Dropping nested (structured) facts

lzap · January 3, 2022, 10:38am

Hello,

Foreman facts feature currently stores facts into an actual DB tree structure. The class responsible for this is called StructuredFactImporter and utilized by Puppet and RHSM (perhaps more). The problem with this approach is that the fact importing is very costly - Foreman needs to analyze fact names one by one, ensure all nodes and create/update them. When fetching facts, it also needs to compose all fact keys too.

The only advantage it seems is that users can drill down by nodes in the Fact index page across multiple hosts. I think this is not very useful, if not actually a bad user experience as it is quite slow to reach leaf nodes in the UI. I would like to propose to drop the tree structure and store facts in a plain way: key - value. So instead this:

dmi::
  bios::
    release_date: 07/08/2009
    vendor: Lenovo

we would simply store:

dmi::bios::release_date: 07/08/2009
dmi::bios::vendor:: Lenovo

From the user perspective, nothing would change. Searching only supports full leaf nodes, so search queries would remain the same. E.g. dmi::bios::vendor = "Lenovo". Non-structured sources (e.g. Ansible) would remain unchanged. The only missing feature would be a redesigned Facts index page where instead of drill-down table we would simply present a list.

Why am I proposing this should be obvious by now - by simplifying the design, we could actually dramatically improve fact import performance. Once we ensured the host is a valid host stored in the database (currently we allow unsaved host active records instances), we could insert and update facts more efficiently. Breaking up nodes would no longer be necessary and we could also explore some more advanced postgresql features like UPSERT to optimize fact imports even further.

I propose to keep the :: (four dots) separator, I know that a simple dot might look easier to type, however, we do not want to break searching and also dot is quite bad candidate because it already caused issues previoiusly (e.g. interface names with vlan tags or aliases: eth0.1).

This will also open doors to another level of optimization - I believe that breaking up facts into FactName and FactValue tables is not the best solution. A single table could be used to store all facts: ID, HOST_ID, FACT_NAME, FACT_VALUE. This SQL de-normalization could possibly be more efficient for updates (and less efficient for reads across many hosts), however, this is definitely not a material for the first version and I would need to research more about this. But we can keep en eye on this for the future.

One final idea is also quite interesting - offloading fact parsing to a smart proxy. Since import is the only way how data in Foreman DB can change, smart proxy could effectively perform fact filtering (we do drop some unwanted facts) and also it could also maintain a cache of “last seen” facts for a particular host. Then the calculation of what needs to be updated, created and deleted could be completely done on smart proxy, therefore it could automatically drop those facts which are unchanged and Foreman would have less work to process.

If you have any concerns with the removal of the three structure, speak up now. Cheers!

Marek_Hulan · January 25, 2022, 4:40pm

How would the array facts be represented? And array of hashes? E.g.

networking::interfaces::virbr0::bindings => [{"address"=>"192.168.122.1", "netmask"=>"255.255.255.0", "network"=>"192.168.122.0"}]

I assume this would also mean we’d have to implement the whole new UI to build the tree on client side. Since navigating in a tree is much easier comparing to just listing hundreds of facts with the same prefix. The UI wouldn’t allow to use scoped search but probably some kind of a fulltext, again frontend only?

lzap · January 26, 2022, 7:06am

We would do whatever we do today? That is inserting the value as a JSON representation as a string.

Of course if we want to in the future, we could parse that into something like networking::interfaces::virbr0::bindings[0]::address or similar. But that would be out of scope.

New UI - well yes and no. We would not be able to drill down the tree across multiple hosts anymore, so here we would be removing/refactoring the UI a bit. Now, for a single particular host if you feel like a tree structure is better then yes, we would need to fetch all records for a particular host (which will be much faster than drilling down via N queries) and then building the tree in-memory.

I still stand by my opinion that facts should be also stored as JSON in SQL table as blob/text so everytime all facts are required (ENC, host facts detail page) it would be just a single fetch. This also allows us to do more strict filtering and only “indexing” selected facts. But that is something I call Common Facts Model and it is a different story. I think this is worth doing before CFM as this appears to be a smaller change.

Marek_Hulan · January 26, 2022, 11:39am

Yeah, sounds good to me. I think we need a tree structure navigation for single host, comparing cross hosts is probably not that important. Distribution charts would I think still work. Perhaps that would be a bigger challenge with storing the whole JSON, but that’s another story as you say. I’m for this change.

lzap · January 26, 2022, 12:26pm

Any volunteers?