Refactoring of fact parsers

dmatoulek · August 5, 2021, 8:11am

Thanks @lzap for commenting and for mentioning testing against the Discovery plugin; it’s a good point!

Since a proper refactoring will probably take a while, it’s also a good idea to think it through and look for places where we can not only improve the logic behind facts but also speed it up.

When I searched the forum for facts, I found two topics: RFC: Denormalize fact tables and RFC: Common Fact Model.

In short, the first one is about adapting the table model so that facts can be updated faster. The second one is about making the fact model more robust and rigid by extracting the important facts from a complete “fact report”. We have the second one partially implemented as Reported Data, which you can also use to search for hosts. Generally, I think these two proposals are really good and complement each other.

The first one is a really good example of a cache table that flattens the structure for easier searching and updating. However, the database architect in me doesn’t like the idea of storing the fact name as text when the same name is used many times. I understand @lzap’s concern about the complexity of the fact_names table, which also stores the hierarchical structure of facts. On the other hand, relational database engines generally compare and index integers faster than text, so it makes sense to keep fact names in a separate table. There is also the space-saving argument. The downside is that to update facts you first have to query fact_names to get the ID of each fact name. You can get all the names in one query, but I understand that this could still be a problem. I think most of the speed problem in the fact logic comes not from the selects for names but from the stored hierarchy of facts. So I think it would be better to use fact_names only to store the names of facts and get rid of the hierarchy. That takes the best from both ideas: it reduces complexity and saves space.
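To make the “flat fact_names plus one lookup query” idea concrete, here is a minimal sketch. It is in Python with SQLite purely for illustration (Foreman itself is Ruby on PostgreSQL), and the table layout, column names and the helper functions `name_ids` and `update_facts` are my assumptions, not the existing schema or code.

```python
import sqlite3

# Assumed, simplified schema: fact_names holds only the name (no hierarchy),
# fact_values references it by integer ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_names (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE fact_values (
    host_id      INTEGER NOT NULL,
    fact_name_id INTEGER NOT NULL REFERENCES fact_names(id),
    value        TEXT,
    PRIMARY KEY (host_id, fact_name_id)
);
""")


def name_ids(conn, names):
    """Return {name: id} for all given names, creating missing ones."""
    names = list(names)
    if not names:
        return {}
    # Create any names we have not seen yet; existing ones are left alone.
    conn.executemany(
        "INSERT OR IGNORE INTO fact_names (name) VALUES (?)",
        [(n,) for n in names],
    )
    # One SELECT fetches the IDs for the whole report.
    placeholders = ",".join("?" * len(names))
    rows = conn.execute(
        f"SELECT name, id FROM fact_names WHERE name IN ({placeholders})",
        names,
    )
    return dict(rows)


def update_facts(conn, host_id, facts):
    """Insert or overwrite the given facts for one host."""
    ids = name_ids(conn, facts)
    conn.executemany(
        "INSERT OR REPLACE INTO fact_values (host_id, fact_name_id, value) "
        "VALUES (?, ?, ?)",
        [(host_id, ids[name], str(value)) for name, value in facts.items()],
    )
    conn.commit()


update_facts(conn, 1, {"memory::total": "2048", "os::name": "Fedora"})
update_facts(conn, 1, {"memory::total": "4096"})  # second report overwrites
```

The point of the sketch is only that, without the hierarchy, resolving names to IDs is a single flat lookup per report. A real implementation would also need to batch very large reports to stay under the database’s bound-parameter limit.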

The second one is really interesting. Basically, it takes several attributes from the facts, transforms them into the most understandable format for that data (e.g. memory to MB, time to UTC…), and saves them with a link to the host. It doesn’t matter where the original fact came from. This RFC can even work with a host whose facts are gathered by several configuration/provisioning tools (e.g. Puppet and Ansible at the same time). I know there is little reason for such a setup in the real world, but it shows one thing: this RFC is designed to transform facts gathered from several sources into one simple structure, and that structure can be used in many ways. Actually, right now we have Reported Data, which (from my point of view) does almost the same thing. It only covers a small number of facts, but it’s not a problem to add more.
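A rough sketch of that transformation, again in Python only for illustration. The raw fact keys (`memory_bytes`, `memtotal_mb`, `boot_time`), the output keys and the function names are all hypothetical; the real fact names differ per tool, and treating a timestamp without a timezone as UTC is an assumption made just for this example.

```python
from datetime import datetime, timezone

BYTES_PER_UNIT = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def memory_to_mb(value, unit):
    """Convert a memory size in the given unit to whole megabytes."""
    return round(float(value) * BYTES_PER_UNIT[unit.upper()] / 1024**2)


def to_utc(iso_timestamp):
    """Convert an ISO 8601 timestamp to UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive timestamps are already UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


# One extractor per fact source; every key used here is hypothetical.
EXTRACTORS = {
    "puppet": lambda facts: {
        "ram_mb": memory_to_mb(facts["memory_bytes"], "B"),
        "boot_time": to_utc(facts["boot_time"]),
    },
    "ansible": lambda facts: {
        "ram_mb": memory_to_mb(facts["memtotal_mb"], "MB"),
        "boot_time": to_utc(facts["boot_time"]),
    },
}


def normalize(source, raw_facts):
    """Map a raw fact report from any known source to the common structure."""
    return EXTRACTORS[source](raw_facts)


from_puppet = normalize(
    "puppet",
    {"memory_bytes": 4294967296, "boot_time": "2021-08-05T10:11:00+02:00"},
)
from_ansible = normalize(
    "ansible",
    {"memtotal_mb": 4096, "boot_time": "2021-08-05T08:11:00"},
)
```

Both reports end up in the same shape with the same units, so whatever consumes the common structure does not need to know which tool reported the facts.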

I have several ideas about what to do:

1. Unify the parsers. We have parsers merged into the core, and they are all different. That makes sense because every parser is designed for a different set of facts. However, the facts are still the same, even if you’re using a colleague to report the memory usage of a server lying somewhere. From a developer’s perspective, the way to refactor the parsers is to unify them. That guarantees the same function calls and “similar” behaviour for other components, and it also saves time maintaining them. It also means that I would like to have all the parsing logic in the parsers. Right now some of it lives in the importers, and they should work like importers, not like parsers. Maybe there is a reason for that which I can’t see.
2. Simplify the fact_names table, as mentioned above. I think it can reduce the time needed to load and search the stored facts. With the Common Fact Model in mind, there could also be a different way: just get rid of the fact_names and fact_values tables and store the original facts as JSON in the DB. I would really like to use CouchDB for that because it’s a great fit for exactly this kind of use case (I have used it as a JSON cache). On the other hand, adding more services means more things that can break, so I think storing this data as JSONB in PostgreSQL would work well.
3. This one is more an optimization than an improvement, and it’s entirely optional. I was wondering why the parser runs in the same thread as the application. Usually you don’t need the parsing result on demand, so it makes sense to me to parse facts in another thread.
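To illustrate the first and last ideas together, here is a minimal sketch of parsers that share one interface and run on a separate worker thread. It is in Python only for illustration; the class names, the raw fact keys and the `parse_in_background` helper are hypothetical and do not describe how Foreman’s parsers or importers currently work.

```python
import queue
import threading
from abc import ABC, abstractmethod


class FactParser(ABC):
    """Common interface: every parser exposes the same parse() call."""

    @abstractmethod
    def parse(self, raw):
        """Return a dict with the same keys regardless of fact source."""


class PuppetLikeParser(FactParser):
    def parse(self, raw):
        # Hypothetical raw keys, for illustration only.
        return {"hostname": raw["fqdn"], "ram_mb": int(raw["memory_mb"])}


class AnsibleLikeParser(FactParser):
    def parse(self, raw):
        # Hypothetical raw keys, for illustration only.
        return {"hostname": raw["host"], "ram_mb": int(raw["memtotal_mb"])}


PARSERS = {"puppet": PuppetLikeParser(), "ansible": AnsibleLikeParser()}


def worker(jobs, results):
    """Parse queued reports until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            source, raw = job
            results.append(PARSERS[source].parse(raw))
        finally:
            jobs.task_done()


def parse_in_background(reports):
    """Enqueue reports, let a worker thread parse them, return the results."""
    jobs = queue.Queue()
    results = []
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    for report in reports:
        jobs.put(report)  # the caller is not blocked by parsing
    jobs.put(None)        # sentinel: no more work
    thread.join()
    return results


parsed = parse_in_background([
    ("puppet", {"fqdn": "a.example.com", "memory_mb": "2048"}),
    ("ansible", {"host": "b.example.com", "memtotal_mb": 4096}),
])
```

Because every parser returns the same structure, the worker (or an importer) does not need source-specific branches. In a real application the upload request would return right after enqueueing instead of joining the thread, and a job framework would probably replace the hand-rolled queue.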

So, what do others think? I will start with unifying the parsers, but all of these ideas are open to discussion, so please share your opinion.