Yesterday @lzap and I finally had some time to discuss the fact parser refactoring a little more, and I think we figured out an elegant way to achieve most (ideally all, but I'm trying to be realistic) of the requirements.
Meanwhile, my PR with the Fact Parser registry got merged (https://github.com/theforeman/foreman/pull/8762), and I hope the "one importer" PR (https://github.com/theforeman/foreman/pull/8781) will be merged soon too. The next step is to move the parsers into one module, because right now they are spread across different modules according to their origin. It's also not good practice to have a `foreman_ansible` module in core just because a parser lives there.
But back to the proposed architecture. As mentioned in earlier posts, the main goals of this refactoring are to speed up fact parsing (the uploaded facts can be a really complex structure) and to unify parsing in some way so it is easier to maintain. There is also the foreman_discovery plugin to consider, because it is a very specific plugin in this area. Anyway, I'll try to write this post as a set of decisions with supporting arguments, because it's good for others to know what we have in mind and what we may have missed, and this post also serves as my memo on the topic.
The current implementation of parsing is far from ideal. There are a lot of caveats, but in short: facts are uploaded to the `/facts` endpoint, picked up by the `FactImporter`, and then the same hash goes through two independent flows:

- The hash of facts is normalized (`os: { kernel: 5 }` is transformed to `os::kernel: 5`) and then stored in the DB
- The hash is passed to a `FactParser`, parsed, and some of the facts are stored as `HostFacets`
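To make the first flow concrete, here is a minimal sketch of that normalization step; the helper name `flatten_facts` is mine for illustration, not Foreman's actual code:

```ruby
# Flatten a nested fact hash into "::"-separated keys, mirroring the
# normalization described above (os: { kernel: 5 } -> "os::kernel" => 5).
def flatten_facts(facts, prefix = nil)
  facts.each_with_object({}) do |(key, value), flat|
    full_key = prefix ? "#{prefix}::#{key}" : key.to_s
    if value.is_a?(Hash)
      # Recurse into nested hashes, carrying the accumulated prefix.
      flat.merge!(flatten_facts(value, full_key))
    else
      flat[full_key] = value
    end
  end
end

flatten_facts({ 'os' => { 'kernel' => 5 } })
# => { "os::kernel" => 5 }
```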
@ekohl gets a point for a really great idea: the `FactImporter` should be the only component allowed to write data to the DB. I like that idea.
@lzap made a really good point about parsing the correct name of the operating system (Correctly identify AlmaLinux - #27 by lzap), and as a software engineer it leads me to think that OS parsing will be too complex to keep inside `FactParser`, so it would be good to have it as a separate "sub-parser".
Just a little black-box-like component: you pass in facts and it returns a proper name for the operating system. This sub-parser gave me (and, independently, @lzap) the idea that we can rewrite everything into "sub-parsers". That means there will be not only tool-specific parsers (`AnsibleParser`, `PuppetParser`, …) but also problem-domain-specific parsers (`Puppet::StorageParser`, `Ansible::NicParser`, …). Just little components that do one thing but do it well (hello, Unix philosophy!). However, components like these should share a common way of communicating (good point by @Marek_Hulan). Something like an interface in Java, but not exactly that. An interface can only define method "headers" and constants, but thanks to Ruby we can go even further: we can also define what the output may be. Let's take architecture as an example. It can have many values, like `i386`, `POWER9`, `x86_64`, `amd64`. Some values mean the same thing, like the 64-bit ones, and it would be great to define that regardless of whether the facts contain `x86_64` or `amd64`, the parser returns only `x86_64`. Yes, there are values where this pattern is hard to apply, but at least we can document in code which values are returned. This brings us to an interface defined by methods. I already hinted at this in the examples: there are parsers that specialize in one thing, one in OS detection and another in storage parsing. Even though Ansible and Puppet report totally different facts, I should still be able to get the list of NICs through the same methods, right? I can achieve that with inheritance and by overriding some methods. So what I suggest is to have this interface-like contract in abstract domain-specific parsers. Thanks to that, I can be sure I can call any parser of a given kind in exactly the same way, regardless of the tool it originates from.
When we have black-box-like parsers, we need to orchestrate the parsing on top of them. Luckily, we have the `FactImporter`, and it is a great fit for that. The importer can call the parsers one by one to parse specific facts, but how do we store the results? @lzap had an idea with two hashes: one read-only hash containing the uploaded facts, and a second one used to store a new kind of Foreman facts (`Foreman::OS: RHEL5`). These Foreman-specific facts are the result of fact parsing. The idea behind this is fairly simple: through parsing, we can produce facts that are more streamlined and specific than the original ones. It also opens the door to really good extensibility.
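The two-hash idea could look roughly like this; the class and fact names below are my own sketch of the proposal, not existing Foreman code:

```ruby
# The importer hands every sub-parser a frozen, read-only copy of the
# uploaded facts and collects derived "Foreman facts" into a second hash.
class FactImporter
  def initialize(uploaded_facts, parsers)
    @uploaded = uploaded_facts.freeze # read-only input hash
    @parsers  = parsers
  end

  def import!
    foreman_facts = {}
    @parsers.each do |parser|
      # Each sub-parser reads the raw facts and returns derived facts,
      # e.g. { 'foreman::os' => 'RHEL5' }.
      foreman_facts.merge!(parser.call(@uploaded))
    end
    foreman_facts # the only thing that would ever be written to the DB
  end
end

os_parser = ->(facts) { { 'foreman::os' => facts['operatingsystem'] } }
FactImporter.new({ 'operatingsystem' => 'RHEL5' }, [os_parser]).import!
# => { "foreman::os" => "RHEL5" }
```

This also matches @ekohl's point above: only the importer touches the DB, while the parsers stay pure functions from raw facts to Foreman facts.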
Speaking of extensibility, when we call the parsers one by one, it's a good idea to add hooks for `before-parse`, `after-parse` and maybe also `skip-parse`. These ideas produce a really nice parsing flow, but if we want to allow developers to completely rewrite the flow of parsing, it would also be good to have a priority mechanism for the parser calls, and I know exactly why this is important: `foreman_discovery` will use it. It also speeds up parsing, because we can defer expensive parsing (NICs) until it is really needed (according to @lzap's measurements).
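To illustrate how priorities and a skip hook could fit together, here is a sketch; none of these names exist in Foreman today, they are just my assumptions about the shape of the mechanism:

```ruby
# A registry that runs parsers in priority order and lets each entry
# opt out via a skip-parse hook.
class ParserRegistry
  Entry = Struct.new(:parser, :priority, :skip_if)

  def initialize
    @entries = []
  end

  def register(parser, priority: 100, skip_if: ->(_facts) { false })
    @entries << Entry.new(parser, priority, skip_if)
  end

  def run(facts)
    results = {}
    # Lower priority number runs first, so a plugin like foreman_discovery
    # could jump ahead of (or replace) the core parsers.
    @entries.sort_by(&:priority).each do |entry|
      next if entry.skip_if.call(facts) # skip-parse hook
      results.merge!(entry.parser.call(facts))
    end
    results
  end
end

registry = ParserRegistry.new
registry.register(->(f) { { 'foreman::nics' => f['interfaces'] } },
                  priority: 200,
                  skip_if: ->(f) { !f.key?('interfaces') }) # defer expensive NIC parsing
registry.register(->(f) { { 'foreman::os' => f['os'] } }, priority: 10)
registry.run('os' => 'RHEL5')
# => { "foreman::os" => "RHEL5" }
```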
So if I summarize this long post, it comes down to these points:

- Store raw facts as text
- Break parsers into domain-specific ones
- Improve the `FactParserRegistry` to accept different kinds of parsers and to be able to prioritize them
- Rewrite the `FactImporter` so that a flow of fact parsing can be built on top of it
I know it will be a long way to achieve all of this and many questions remain unanswered, but that's why we have this forum: to discuss new ideas on how to make Foreman better!