Refactoring of fact parsers

dmatoulek · May 26, 2021, 6:43pm

Facts, one of some important things in The Foreman project. That piece of information that informs about facts of host. They can come from many programs, ie. Puppet, Ansible, Salt and so on.
Most integration with these programs are in plugins, however Puppet has historically exclusive place in The Foreman. Right now, there’s effort to make Puppet as option like others (Making Puppet optional). What is means for parsers? Well, since the parsers deal with the same thing (aka facts of the host), we decided to unify these parsers.

That means I opened PRs where I move parsers from Ansible plugin (fixes #32518 - Adding Ansible fact parser by domitea · Pull Request #8491 · theforeman/foreman · GitHub), Katello (Fix #32517 - Moving rhsm parser and importer to core by domitea · Pull Request #8508 · theforeman/foreman · GitHub), Salt plugin (Fixes 32548 - Moving salt parser and importer to core by domitea · Pull Request #8511 · theforeman/foreman · GitHub) and Chef plugin (Fix 32558 - Moving Chef parser and importer to Core by domitea · Pull Request #8512 · theforeman/foreman · GitHub). However, this is a small step for something bigger and more interesting - The Refactoring! If I move all parsers to one big pile then I can refactor them into much cleaner code. Some preparations started by @ekohl in https://github.com/theforeman/foreman/pull/8449 but it’s a beginning.

I wanted to start this thread because I need some sort of memo to avoid forget something important during refactoring. I know that @lzap has some great comments for that:

Fictures, Fictures and Fictures. I mean the file that represents facts from some hosts for testing. This is very important point and I want to keep it in mind because it’s keeps integrity of parsers. @ekohl has similar thought and informed me about facterdb (GitHub - voxpupuli/facterdb: A Database of OS facts provided by Facter) that’s ideal for that. I can test the parser for all facts in automatize way.
Hooks. Discovery is a great plugin and it uses ImportHookService (Searching for the foreman_default_hostgroup plugin maintainer - #2 by lzap) to actually update the facts before they gets saved. I think It’s would be great addition because it could be used across plugins.

So, there It is, my thoughts about refactoring fact parsers. if It’s there anything I forgot, drop me a comment!

lzap · May 27, 2021, 7:41am

That facter db is awesome. If we are able to integrate that that would be dream. We will not need all of them, but some good picks would be great - I would be probably interested in different families, OS versions and network configurations (regular, VLAN, bridge, bond, team, bond with VLAN).

I am ready to do the review just let me know as long as PRs are ready, ideally off github

One wish: If you could test everything with discovery plugin turned on. After each change, please do import a fake discovery host via extra/discover-host because discovery plugin is the first to break as it heavily relies on puppet parser and extends it quite a bit. I really hope this refactoring will help and we can even move discovery importer into core too (not the plugin yet).

Thanks for the writeup! Great stuff.

dmatoulek · August 5, 2021, 8:11am

Thanks @lzap for commenting and mentioning the testing upon Discovery plugin because it’s a good point!

Since refactoring probably takes longer time to do properly it’s also a good idea to make some proper thinking about that and find where there could be some space to not only improve but also speed up the logic behind facts.

When I searching the forum about facts, I found two topics - RFC: Denormalize fact tables and RFC: Common Fact Model

In short the first one is about adapting table model for faster updating of facts. The second one is about making the fact model more robust and rigid by filtering important facts from a complete “fact report”. We have the second one partially implemented by Reported Data, that you can use also to search for hosts. Generally, I think that these two requests are really good and complement each other.

The first one is a really good example of a cache table and flatten the structure for easier searching and updating. However, my “Database Architecture person” doesn’t like the idea of saving the name of fact when this name will be used many times. I understand the @lzap concern about the complexity of the fact_names table, where is also stored a hierarchical structure of facts. On the other hand, relational database engines are built for numbers, not for text. So that makes sense to have fact names in different table. There is also the space saving argument for that. On the other hand, for an update of facts, you have to do a select to this fact_names table to get the ID of the fact name. You can get all names in one query but I understand that could be a problem. I think that most of the problem with the speed of fact logic is not tied with selects for names, but with the stored architecture of facts. So I think it would be better to use fact_names only to store the names of facts and get rid of architecture. It takes the best from both ideas - reduce the complexity and save space.

Second one is really interesting. Basically It takes several attributes from facts and transforms them into the most understandable (ie. memory to MB, time to UTC…) format for that data and save them with the link to host. It doesn’t matter what origin has the original fact. This RFC can even work with host where is fact gathering archived by several configuration/provision tools (ie. Puppet and Ansible at same time). I know that this use case doesn’t have any reason to have it in the real world but it shows one thing. This RFC is designed to transform gathered facts from several sources to simple structure and this structure can be used in many ways. Actually, right now we have a Reported Data that makes (from my point) almost the same thing. It’s only for a small amount of facts but it’s not a problem to add more of them.

I have a several ideas what to do:

We have parsers merged into the core, and they are different. It makes sense because every parser is designed for a different set of facts. However, facts are still the same even if you’re using a colleague to report memory usage of the server laying somewhere. From a developer perspective, the way to refactor them should be to unify them. It guarantees the same function calls and “similar” working for other components and also saves time to support them. That means that I would like to have all the logic around parsers in parsers. Right now there is some logic in the importers and they should work like an importer, not like a parser. Maybe there is a reason for that and it’s hidden for me.
Simplify the fact_names table. It was mentioned several paragraphs earlier. I think that it can help with the time of loading and searching for the stored facts. With the Common fact model in mind, there could also be different way to just get rid of fact_names and fact_values tables. The original facts will be stored as json in the DB. I really would like to use a CouchDB for that because it’s a great companion exactly for use cases (I used it as json cache). On the other hand, adding more services means more things can be broken so I think that storing this data as JSONB in PostgreSQL will be great.
This is much more optimization rather than improvement and it’s totally an option. I was wondering why the parser was working in the same thread as the application. Usually you don’t have a result from parsing on demand. So it makes sense to me to have another thread to parse facts.

So, what do others think? I will start with the unifying of the parsers first but all ideas are open to discussion, so write down your opinion, what do you think.

Marek_Hulan · August 6, 2021, 7:30am

If I get that right, you propose to drop everything besides the actual name from the FactName. How we’d keep the hierarchy then? Or do you propose to flatten this entirely and stop storing the parent links? I guess that’s what you mean by the “stored architecture of facts”?

I think this is in fact pretty common use case. People use Puppet and Ansible for two different purposes. Also there are subscription-manager facts that are sent from registered hosts every 4 hours and report similar data as Puppet or Ansible. I think we should be considering multiple fact sources as a valid or even common use case.

Ad idea 1 - so you say to have a single general importer, while we keep parser for each origin separate, correct? I’d to that.

Ad idea 2 - I’d be against adding another DB just to store the JSON. But the actual proposal here is to implement Arbitraty facts as defined in the common facts RFC?

Ad idea 3 - Puppet specifically has a feature (or quirk) that the ENC can be generated based on the facts. Therefore facts processing must happen before the ENC is rendered and therefore the enc.rb script waits for the request to finish first. It could still be implemented even if the facts processing is asynchronous but we never added polling to the enc.rb. I think having facts parsing as a background task is generally a good thing to have and other than Puppet facts can already use foreman-tasks for that. I think it’s time to make this the only way and see if there’s someone who actually relies on this. If so, we’d need to adjust enc.rb and add the polling in there. In general from me.

ekohl · August 6, 2021, 10:19am

This isn’t really a Puppet feature, but rather that Foreman’s parameters can be determined based on facts. Isn’t the same is really true for anything that queries parameters which use facts?

As for async processing, I think that’s fine. Just like you I do think you need to build a mechanism to determine when it’s finished. Then the client can POST and get a HTTP 202 Accepted back with the job ID. Then you can poll that job ID till it’s completed if you rely on it being processed. If you don’t, just carry on.

Marek_Hulan · August 6, 2021, 10:59am

I meant Puppet integration feature, meaning the ENC output could be changed by the facts. That includes both Parameters and Smart Class Parameters (or better to say their values). Ansible can’t do that, because facts are sent later, during the actual run. Chef integration works differently, typically users maintain their parameters in Chef server and we only merge additional values coming from Foreman, but there it depends which handler is registered first. And from the beginning the asynchornous importing was enabled in this integration, I don’t think anyone would use this. I’m not sure about Salt but I think there it works similarly to Puppet. I’d agree it’s true for everything that asks for hosts parametrization, but usually it does not upload facts in the same “action”.

That was just a clarification, I think we’re in agreement that with enabling polling it’s up to the consumer whether they care or not.

dmatoulek · August 11, 2021, 8:28am

The main idea was to flatten the structure to make it compatible with the fact names in sent facts and avoid “complex” parsing the right fact name from hierarchy model. In other words instead of storing “networking”, “eth0”, “ipv4” and “ipv6” with the respective hierarchy just store “networking::eth0::ipv4” “networking::eth0::ipv6”. It saves also some space in the table (because we don’t too many rows) but I think that improvement may vary. I’m not a person with a deep facts background.

That’s right. I would to gave refactors and one importer. In fact, I would to have some sort of abstract parser that contains parsing of ordinary values like IPv4 or IPv6 and then have separate parsers for every software like ansible, puppet…etc. This approach should unify parser logic and make it also more extendable because if you want to improve something, you’ll do it only from one place - ideally only from abstract parser. This parser should also contain a option to extend “fact report” (sent json with facts). Discovery plugin would benefits from that.

The idea with Couchdb was only an idea but generally I think that the path I would like to achieve. There is a lot questions to answer but generally the @lzap 's RFC makes really sense for me.

Marek_Hulan · August 11, 2021, 10:32am

Ok, I can imagine this could work on the backend side but I think from UI perspective, it’s desired to be able to navigate in the tree structure. But perhaps it could be client that builds the structure.

for other replies

lzap · August 20, 2021, 8:17am

You saw two RFCs because one was older than another. When I started looking into the problem, I thought that denormalizing would be a good idea. Indeed, it is against what we were tought during database lectures at schools, but in good books denormalization is a term, a chapter, something you need to consider. After all, denormalization is often used in OLAP.

However then I filed Common Fact Model for discussion and that RFC from my standpoint shadows the first one. The idea is to stop storing facts as reported from systems (Facter, Ansible…) in our normalized tables (fact value, name) and only store them as texts in JSON. Then we would introduce “Foreman Facts” that would be highly curated and transformed in a way that they would work across all implementations. This would be what we’d store in our tables, these would be used for searching. Users should be still able to retrieve the original (all) facts from text fields per host, but these would not be searchable.

I think long term we should move Reported Data from where we store it to the fact name/value tables and refactor the UI/CLI so the most important shows on the host detail page as well.

Let’s avoid please adding anoter database for facts, also I would highly disencourage from storing facts in JSON structured fields (JSON/JSONB) in Postgres. Also there is no reason to make the architecture more complex by utilizing threads or processes or async prosessing - this is something Foreman can do just fine within a single request. I did my homework and did many measurements and I can confidently say that storing many small records in fact names and values with a huge join table is the culprit. Let’s focus on that and only after this is solved by not storing many records there, let’s do another measuring to see what’s another bottleneck.

You are right that we have two stages:

Fact parsing - this is where every implementation detect things like operating system, environment, domain or interfaces (there is a regexp in setting to ignore interfaces). I believe we should start discussion about whether we need this - this causes so many problems in the past. It works well for users who uses Foreman just as an inventory for unmanaged hosts, but once you start managing those this is causing problems (operatign system is overwritten, Ansible fight Puppet etc). We have several settings to remove this feature per field (e.g. domain, subnet, operating system), I think we should have a different approach: either a flag that would turn on/off all of the fields or a different workflow: this would only work for new hosts or some kind of curating (change would need to be manually confirmed by user). If we had a common fact model and highly curated facts, that could be actually helpful - operating system, domain or environment could be easily “adopted” with a click of a button on user request.
Fact importing - the process of breaking up facts and storing them into fact name/value tables. There is also another regular expression in setting (exclude filter) to ignore some facts which we know are causing issues (too much data, too many of them).

This isn’t an easy task, both parsing and importing needs to be refactored. Here is how I currently see it:

Create a code that would store facts as-is in a SQL TEXT field. We will need a new table associated with hosts: original_facts(id, type, text) where type is one of puppet or ansible or others. Provide mechanisms to view the data from UI, API, CLI and templates (e.g. @host.original_facts(:puppet)) as well as ENC.
Start building the Common Fact Model first by creating code that would cherry pick facts from given list and set of “tranforming” methods/classes per each fact. This needs to be easily extendable by other implementations (Salt, Chef). For the fist stage, new fact type called “Foreman” can be used and once this is complete, databases can be migrated deleting other types and only keeping Foreman type and dropping type column alltogether.
With help from the community, build a list of facts we want to store. Start small (cpus, total memory, total disks, basially all Reported Data). Implement them all.
Stop storing other facts into fact names/values table. This is the first milestone - from now on, users must use original_facts in their templates to access special facts that are not stored separately. This is also the point where Foreman heavy deployments will get faster as not many facts will be stored anymore. This is the key bottleneck of the problem according to my measurements - all the parsing and importing is not that big of a deal, what’s really slow is storing many facts and values in the database.
Remove fact filter from the settings - it is not needed anymore.
Refactor Reported Data to use Common Facts. I think ability to put much more information to the host detail page can go well hand in hand with our UX redesign - as Common Facts list will grow, we will be able to provide more and more meaningful data for this page.
From the feedback from our users, add more Common Facts that the community is missing.
Fact parsing can be solved as the last step, this needs to be discussed separately but I think that once we have a reasonable Common Facts, we can simply drop it completely and make it an explicit action “Set Operatingsystem from Common Fact” etc. For new hosts, could could be done automatically, for existing hosts manually.

One big question I am fuzzy about is Lookup variables / Smart parameters and ENC. For ENC I believe this works on a per-host basis, so it should be possible to offer Puppet facts stored as text. But Lookup variables / Smart parameters are searchable and Common Fact Model could cause regressions. But I remember discussion about whether we should maybe drop this feature because it is not utilized too much? @ekohl

Edit: I think we removed smart params already, I am not sure how Lookup variables are relevant to this.

Take this as just my opinion, I might have missed some important stuff, but this is how I would approach it.

dmatoulek · October 1, 2021, 8:33am

Yesterday we finally have some time with @lzap to discuss refactoring of fact parsers little more and I think we figured out elegant way how to archive most (ideally all but I’m trying to be realistic) of the requirements.

Meanwhile I’ve got merged a PR with Fact Parser registry (https://github.com/theforeman/foreman/pull/8762) and I hope that “one importer” PR (https://github.com/theforeman/foreman/pull/8781) will be merged soon too. Next step is to get parsers into one module because they are spread in different modules according to their origin. Also, it’s not a good practice to have foreman_ansible module in core because parser

But back to the proposed architecture. As it were said in posts earlier, the main goals of this refactoring is speedup the process of fact parsing because it can be really complex text structure and also unify parsing is some way to have easier maintaining. We have there also a foreman_discovery plugin that is also necessary to consider because it’s very specific plugin for this case. Anyway I’ll try to write this post as a set of decisions with some arguments because it’s good for others to know what we have in mind and what we can missed and also this post is my memo for this topic.

Current implementation of parsing is far from ideal. There is a lot caveats but in short facts are uploaded to the /facts endpoint then it’s taken by FactImporter and there is two independent flows of the same hash:

Hash of facts is normalized (os: { kernel: 5 } is transformed to os::kernel: 5) and then stored to DB
Hash is passed to FactParser, parsed and then some facts are stored as HostFacets

@ekohl get point for really great idea that only FactImporter should be only component that can write data to DB and I like that idea.
@lzap has really good point about parsing right name of operating system (Correctly identify AlmaLinux - #27 by lzap) and as a software engineer it leads me to think that parsing of OS will be too complex to include it in FactParser so it be good to have as separate “sub-parser”.

Just a little black box like component where you pass facts and it returns some excellent name for operating system. This sub-parser gives me (independently also @lzap) that we can rewrite into “sub-parsers”. That idea means that there will be not only program specific parsers (AnsibleParser, PuppetParser…) but also problem-domain specific parsers (Puppet::StorageParser, Ansible::NicParser…). Just little components that makes one thing but do it well (Hello, Unix Philosophy!). However component like these should have some sort of common way how to communicate with (Good point by @Marek_Hulan). Something like interface in Java but not exactly like that. Interface can handle only “headers” of methods and constant variables but thanks to Ruby, we can go even further. We can define what can be an output. Let’s take example an Architecture. It can have many values like i386, POWER9, x84_64, amd64. Some values are same like 64-bit ones but it will be great to define that regardless you got x84_64 or amd64 in facts, parser return only x86_64. Yes, there are values that this pattern is hard to apply but at least we can document what values returns in code. This brings us to interface defined by methods. I already pointed that in examples… There some kind parsers that specializes to something. One to OS detection and another to storage parsing. Well, even the parser for Ansible or Puppet has totally different facts, I still can get list of NICs by the same methods… Right? I can archive that by inheritance and override some method. So what I suggest is have this interface like stuff in abstract domain specific parsers. Thanks to that I can be sure that I can call some kind of parser is exact way regardless to tool origin.

When we have a black-box like parsers we have to orchestrate a parsing upon them. Luckily, we have a FactImporter and it has a great fit for that. Importer can call parser one by one to parse specific facts but how we store that? @lzap got an idea with two hashes. One that is read only and contains uploaded facts and second one that will be used to store new kind of Foreman facts (Foreman::OS: RHEL5). These Foreman specific facts are used as the result of fact parsing. So the idea behind this is fairly simple. By parsing, we can produce facts that are more streamlined and specific that the original one. Also it opens door to really good extensibility.

Speaking of extensibility, when we call parser by parser, it’s a good idea to add hooks for before-parse, after-parse and maybe also skip-parse. This ideas produce really nice flow of parsing but when we want to allow developers to totally rewrite the flow if parsing, it should be also good to have some priority mechanism for parser call and I exactly know why this is important. foreman_discovery will use it. It also speed up parsing because we can skip extensive parsing (NICs) to time when is really needed (according to @lzap measurements) .

So If I summarize this long post, it produces this points:

Store Raw facts as text
Break parsers into domain specific ones
Improve FactParserRegistry to accept kinds of parsers and be able to prioritize them
Rewrite FactImporter to allow make a flow of fact parsing

I know that it’s will be a long way to archive this and many questions are not answered but that’s we have this forum. To discuss new things how to make Foreman better!

lzap · October 1, 2021, 9:22am

Can you describe the final design you want to achieve? For me these are the flaws we would like to solve:

Unreadable code (several objects calling each other, unclear responsibilities)
Cleaner flow of data, ideally a Ruby hash should be passed along for the whole processing until it reaches the final class that stores it in the DB
Ability to normalize facts for host facets (OS names, architecture names)
Flexible and clean way for plugins to override or add new parsers without method overloading or monkey patching
Ability to call some parsers later (from a rake task/background job) - useful for NIC processing which is quite slow but do not implement this in the first phase just make the design allows this.
Decreasing amount of facts stored in FactName and FactValue tables (aka Common Fact Model)
Storing all facts as a raw JSON in Postgres for ENC or templates.

Here are some support use cases:

As a dev, I want a clean codebase with an easy-to-understand code flow.
As a puppet/ansible/rhsm user, I want to avoid each technology to overwrite OS or host facets with different value (normalized values).
As a puppet/ansible/rhsm user, I want fact import to be fast.
As a user with hypervisors with many NICs, I still want to be able to filter out network names.
As a user with hypervisors with many NICs, I maybe want to defer NIC processing to later.

I suggest you continue in what you are doing so far - refactoring the code without any semantic changes into the final design that is clear to understand. Only then start working on improving things like normalizing or storing facts in JSON in the DB.

From reading your post, I am not entirely sure how things would look like. Instead of the spaghetti code we currently have, I would expect a chain of calls with a clean interface. I think we agree that the pipeline should be simple so I suggest something like:

FactParser.process(raw_facts, foreman_facts)

where raw_facts is the input (immutable) Ruby hash and foreman_facts is a hash that is empty in the beginning and parsers can store values into it. It can be used later on in the chain to be stored.

Storing of facts should probably happen in a different class, something like:

FactSerializer.process(raw_facts, foreman_facts, host)

We would need several implementations of serializers:

NormalFactSerializer - stores all raw_facts in FactName/Value tables as we do today (we have a filtering and some limiting mechanisms - we should keep this for now)
ForemanFactSerlializer - stores foreman_facts as extra facts under “foreman” namespace, so things like “foreman::os_name” or “foreman::os_version” are available. This gives plugin authors the flexibility to add their own normalized facts across technologies (RHSM, Ansible, Puppet, Chef, Salt) without implementing host facet.
NICFactSerlializer - performs detection of NICs and stores them for a host.
HostAttributesFactSerializer - performs storing of host facets (are these called attributes? not sure)
RawFactSerializer - later on this would store all raw_facts unchanged in a SQL TEXT table. We would return that in @host.facts_hash call so even after we decide to cherry pick facts to bring number of facts in FactName/Value tables down, users in templates or ENC would see them all. We need to store all fact types separately so this would be a joined table per host with a type column.

Note that normal FactParser classes (or whatever you name them) do not have access to anything else than input hash and output hash. They should not be doing any ActiveRecord, this is intended. Also, discovery needs to be able to override NICFactSerializer via as you mention a hook or something to change suggested primary interface.

As for the background processing, we discussed we could process some facts (NICs) just before user attempts to open a page with a host with some “dirty” flag, however, Marek had a good point that you want to be able to search for hosts by IP or MAC, which can be out of date. So we would either need some background processing or simply telling our users to use facts search.

I still believe that updating host data from facts is a bad thing and we should not be doing that at all. This is irrelevant for ENC, facts are used for ENC and not Foreman host information. So if this was on me, I would simply implement a helper text on a host result page saying “Did not find what you want? Try searching via facts as Foreman data can be out of date.” Pulling info from facts should be an action explicitly triggered by user (sure it can be a background process because it will take long for all hosts perhaps).

We would still need to do immediate fact parsing for new hosts which become unmanaged hosts. We need to parse everything immediately for these hosts, background processing or and kind of deferred parsing can be only applied to existing hosts (as more likely there are no changes for them).