RFC: Denormalize fact tables

Hey,

For several upstream releases now, we have experienced poor fact import/update performance. Problems are reported both upstream and downstream, and while decent work has been done in this area, the results are mixed. I believe we can't achieve good results because the way we store the data is not ideal. In the current design, we have fact_names and fact_values tables and we join them. During the fact import phase, we need to identify which fact names were added, updated, and deleted, and update both the fact_names and fact_values tables accordingly.

Two separate tables make things worse - all the work of identifying what needs to be added/updated/deleted must be performed twice. Users often have facts with random names, and although we have fact filters now, they are not a bulletproof solution. On top of that, we store fact names as a hierarchy (os::name is stored as two records, "os" and "name", with a parent reference).
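To make the import-phase work concrete, here is a minimal Ruby sketch (plain hashes, no ActiveRecord - `fact_diff` is a made-up helper, not existing Foreman code) of the added/updated/deleted computation; with the current two-table design, a similar pass has to happen for fact_names as well:

```ruby
# Compute which facts were added, updated, or deleted for one host.
# stored: facts currently in the DB; incoming: freshly uploaded facts.
def fact_diff(stored, incoming)
  added   = incoming.keys - stored.keys
  deleted = stored.keys - incoming.keys
  updated = (stored.keys & incoming.keys).select { |k| stored[k] != incoming[k] }
  { added: added, updated: updated, deleted: deleted }
end

stored   = { 'os::name' => 'RedHat', 'ipaddress' => '192.168.11.2' }
incoming = { 'os::name' => 'RedHat', 'ipaddress' => '192.168.11.9',
             'biosvendor' => 'SeaBIOS' }
fact_diff(stored, incoming)
# => { added: ['biosvendor'], updated: ['ipaddress'], deleted: [] }
```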

From a database point of view, this design is correct - there is theory behind it, called normal forms. It gives us great query options - you can build queries like "give me fact values across all hosts" or "list the sub-facts of this particular tree". The design is search friendly. There is one problem though: it is very update unfriendly.

Before I propose a solution to this problem, I want to ask devs and users: how often do you use the fact UI page drilldown? There are two levels of drilldown - clicking on a fact name shows the list of hosts having that fact. The same is possible via the hierarchy - clicking on the "os" node leads to the same page but with all sub-facts listed.

The new design still allows the same drilldown, but querying will not be as straightforward. The thing is, these pages do not appear to be used on a per-minute or per-hour basis, but fact import is. I am not aware of any other fact queries that are called on a regular basis. If this is confirmed, it opens the door to a proposal.

The proposal

Denormalize fact_names and fact_values into one single table, host_facts (id, name, value, host_id), and drop hierarchy support. An example of how things are stored in this new model:

ID   NAME           VALUE            HOST_ID
1    'os::name'     'RedHat'         77
2    'ipaddress'    '192.168.11.2'   77
...
199  'biosvendor'   'SeaBIOS'        77
200  'os::name'     'Debian'         78
201  'ipaddress'    '192.168.11.9'   78
...
377  'biosvendor'   'SeaBIOS'        78

This cuts the fact import/update complexity down to one update query on one table with no joins. The drawbacks are, obviously, some additional space needed to store fact names, and more complex queries for some cases - for example, structured fact queries will require LIKE. Common queries like "all facts for a host" or "fact values for a fact" should perform the same - an index must be placed on both name and (name, host_id).
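On the import side, hierarchical input would be flattened into '::'-joined names before writing host_facts rows. A rough sketch of what I have in mind (`flatten_facts` is a hypothetical helper, not existing code):

```ruby
# Flatten a nested fact structure into '::'-joined names,
# matching how the proposed host_facts table would store them.
def flatten_facts(node, prefix = nil, out = {})
  if node.is_a?(Hash)
    node.each { |k, v| flatten_facts(v, prefix ? "#{prefix}::#{k}" : k.to_s, out) }
  else
    out[prefix] = node
  end
  out
end

facts = { 'os' => { 'name' => 'RedHat', 'release' => { 'major' => '9' } },
          'ipaddress' => '192.168.11.2' }
flatten_facts(facts)
# => { 'os::name' => 'RedHat', 'os::release::major' => '9',
#      'ipaddress' => '192.168.11.2' }
```

A "sub-facts of os" query then becomes a prefix match on the single name column, something like WHERE name LIKE 'os::%', which the index on name can still serve.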

This is a core part of Foreman inventory, and many plugins use facts or have their own facts, so we must be sure there are no issues. Also, this is only worth doing if we see a dramatic performance boost during development. I haven't tested anything yet, just brainstorming here.

While I was digging through the codebase, other optimizations came to my mind:

  • Drop explicit transactions - I still think these can cause lockups, and there is no big point in keeping facts consistent - in case of an error, the data will be reuploaded after 30 minutes anyway.
  • Drop the created_at and updated_at columns - I'm not sure we present these to the user, and since we don't touch updated_at for unchanged facts anyway, they are unlikely to be correct.
  • Drop hierarchy support - the code behind storing fact names in a tree is complex and costly; maybe it is itself one of the reasons for the slow performance.

+1 for simplifying the DB structure, that separation has always weirded me out :stuck_out_tongue:. However…

Drop explicit transactions - I still think these can cause lockups, and there is no big point in keeping facts consistent - in case of an error, the data will be reuploaded after 30 minutes anyway.

That’s not a given. I know quite a few people who run $tool on an infrequent (say, once a day) or ad-hoc basis.

Drop hierarchy support - the code behind storing fact names in a tree is complex and costly; maybe it is itself one of the reasons for the slow performance.

I’d like to be sure we’re not impacting the various config management “fact” systems (Ohai attributes, Salt grains, etc.). Most of these are hierarchical, so we’d still need to be able to import that, and potentially export it again (we do have a facts API, and I think users would expect to get back something similar to what went in). I do use Salt, so I’d have to go think about that - can anyone weigh in on the others?

So long as we address that, and also clearly instruct users on how to refer to a hierarchical value within the UI (e.g. what does $::os::name become when accessed via @host.facts('...')?), then I’m all for it.
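On the export concern: the flat names could be expanded back into a nested structure for the facts API, assuming '::' is a reserved separator that never appears inside a single name segment. A sketch of the inverse operation (`unflatten_facts` is a made-up name):

```ruby
# Rebuild a nested hash from '::'-joined fact names,
# e.g. to return hierarchical data from the facts API.
# Assumes '::' never occurs inside an individual name segment.
def unflatten_facts(flat)
  flat.each_with_object({}) do |(name, value), tree|
    *parents, leaf = name.split('::')
    parents.reduce(tree) { |node, key| node[key] ||= {} }[leaf] = value
  end
end

unflatten_facts('os::name' => 'Debian', 'ipaddress' => '192.168.11.9')
# => { 'os' => { 'name' => 'Debian' }, 'ipaddress' => '192.168.11.9' }
```

That assumption would need checking against Ohai attributes and Salt grains, where key names are arbitrary strings.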