Orchestration framework problems

Hello,

some years ago, Ivan identified a major issue with our orchestration based on ActiveRecord. Specifically, we cannot use explicit SQL transactions, which makes most of our actions fragile when it comes to concurrency.

For example, during automated host discovery, Foreman must find a matching rule, count the hosts already provisioned by that rule, compare the count to the rule's limit and initiate provisioning. When multiple hosts are being provisioned concurrently, this runs into issues and, as a result, rules with a limit set are overprovisioned. This was articulated in:

I can find many other examples where a simple SQL transaction would solve the problem, however our orchestration framework does not allow it. So here are a couple of topics for this thread.

First. Shall we start discussing a different approach than abusing ActiveRecord to do orchestration? This would probably be a huge effort, since most of our plugins use the orchestration framework. For the record, we have a simple framework defined in app/models/concerns/orchestration.rb which provides two queues: queue and post_queue. These are just lists of methods to be called from the around_save and after_commit ActiveRecord hooks. This is indeed easy to understand for Rails developers, however it is not ideal.
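To make that concrete, here is roughly what registering a step looks like; the method names below are made up and the exact create options may differ from what core actually uses:

```ruby
# Illustrative only: a model including the Orchestration concern registers
# steps from an after_validation callback. The step methods are made up.
class Host::Managed < Host::Base
  include Orchestration

  after_validation :queue_example_orchestration

  def queue_example_orchestration
    # "queue" steps run while the host is being saved (around_save)
    queue.create(name: "Set example resource", priority: 10,
                 action: [self, :set_example])
    # "post_queue" steps run once the record has been committed (after_commit)
    post_queue.create(name: "Notify example service", priority: 100,
                      action: [self, :notify_example])
  end

  def set_example
    true # call an external service here; a falsey result fails the step
  end

  def notify_example
    true
  end
end
```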

Second. For the particular issue I am trying to solve (discovery and post commit), it looks like removing post_queue would help. It does not appear to be widely used: basically only SSH provisioning and PuppetCA use it in core, and then I was able to find the Discovery plugin. I wonder what the reason was to implement this queue as a separate one - Amos moved it 8 years ago without much commentary, that's all I can read from the git log. All those actions could be high-priority normal queue tasks.

Any kind of change in this regard will likely affect many plugins, but I would love to at least solve the post_queue problem, which would enable us to use more explicit SQL transactions. We learned to ignore them over time because of this specific problem (I hope it's just this and not more) and many of our calls are not concurrency-friendly.

Looking into the after_commit problem, it looks like some problems were identified in Rails and fixed in Rails 6:

I wrote a unit test for ActiveRecord covering this particular issue and it looks like this has been fixed now:

If I understand Ivan's analysis correctly, after we upgrade to Rails 6 we can start using explicit transactions for saving hosts too. Can someone confirm, or did I get this wrong?
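For illustration, this is roughly the pattern I have in mind for the discovery case; the model and attribute names are assumptions rather than the actual plugin code:

```ruby
# Sketch only: serialize the "count hosts vs. rule limit" check with a row
# lock inside an explicit transaction so concurrent discoveries cannot
# overprovision a rule. DiscoveryRule and max_count are assumed names.
ActiveRecord::Base.transaction do
  rule = DiscoveryRule.lock.find(rule_id)   # SELECT ... FOR UPDATE
  raise ActiveRecord::Rollback if rule.hosts.count >= rule.max_count

  host.save!  # on Rails 6, after_commit callbacks fire only after the real COMMIT
end
```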

To 1: I believe the framework should be moved to something more robust, but that has already been discussed many times. If we wanted to do it, we would need to find a way of communicating about the solution (maybe some diagrams that can be versioned would do :man_shrugging:)

To 2: I think you are correct, Ivan's analysis makes sense; that is one of the reasons why Rails changed the behaviour of exceptions in the save process. So I believe it should be safe now.

Isn't the post queue also used for https://github.com/theforeman/foreman_salt/blob/08c5b56b33f263964a96fbc18cb62e9584fe8ce2/app/models/foreman_salt/concerns/host_managed_extensions.rb#L45 ? Something similar also exists in foreman_bootdisk.

If there are major changes in this area, I would vote for something in which you can run a method in Foreman once the host is really completely provisioned. So there should be more "hooks" which can be used to do "something" in core and in plugins.

FWIW, I'd deeply miss this hook if it were gone. We also use it in some internal plugins.


Would a new foreman_webhooks solve this, or do you need to control the provisioning process from the hook?

I think instead of tweaking or dropping some hooks in the existing orchestration, we should move it to a proper tasking system (foreman_tasks/dynflow) that supports a subscribing/extending mechanism.

My comment is totally unrelated to foreman_webhooks. The internal plugins we maintain are native Foreman plugins that use the public APIs. I consider the host orchestration part of Foreman core's plugin API.
I'm a huge fan of the host orchestration we currently use. It's very simple and gets the job done well. I don't think that changing to some other task system is really helpful. What would we gain from it, except probably breaking all existing plugins?
We've talked about redoing the orchestration system quite often in the past, but it never happened. I'd say that's because it's just not necessary.

The current design is super easy: a developer queues all the tasks that should be executed before the host is saved (the normal queue) and all tasks that require the host record to be saved beforehand (the post queue). Tasks have an order and a mechanism to roll them back if one of them fails. What more do we need?
Yes, a known limitation is that all the actions in the queue happen inside a database transaction. Correct me if I'm wrong, but I believe that is how it needs to be.
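If I remember the convention correctly, the rollback works by swapping "set" and "del" in the method name of already completed steps. A made-up pair, just to illustrate the shape:

```ruby
# Made-up up/down pair; assuming the set/del naming convention, del_example
# is what gets called to undo set_example when a later queued step fails.
module ExampleOrchestration
  def set_example
    Rails.logger.info("creating the external resource")
    true  # a truthy return marks the step as successful
  end

  def del_example
    Rails.logger.info("removing the external resource")
    true
  end
end
```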


My comment is totally unrelated to foreman_webhooks.

Please don't get me wrong. I'm not saying we should simply drop something that plugins rely on. I can imagine we could even keep these hooks but replace their implementation. What I wanted to figure out was whether known extensions need to control the orchestration flow or not. And in the future, if we ever change our provisioning orchestration, we'll be able to address these use cases by other means.

The internal plugins we maintain are native Foreman plugins that use the public APIs. I consider the host orchestration part of Foreman core's plugin API.

That's surprising to me. I'd like to better understand how developers figure out what the public API is. I don't mean it negatively. My motivation for this is to set the right expectations, or maybe even to define them.

Perhaps this deserves a separate post, and I'm happy to start that discussion if you'd be interested, but here is at least a quick summary of my view. I think plugins should use what we described in how to write a plugin, and we keep the REST API backwards compatible. I'd agree that template macros can be seen as something in between, but I don't think we ever called them a stable API. We tried to define the plugin interface in plugin.rb. I know the reality is that a lot of plugins use monkey patches, method wrapping, deface etc. Sometimes just defining a method in some included concern simply works. But I hope everyone understands the risk of relying on such code.

I'm a huge fan of the host orchestration we currently use

You're the first developer I know of :slight_smile: I'll tell you what I dislike about the current orchestration. I'm backing some of these arguments with existing Redmine issues, but I have hit all of these myself several times.

  1. transactions - IMHO there is no good reason for this limitation, it is just an outcome of the current implementation
  2. it's inconsistent with other orchestration tasks we have in the ecosystem (katello, rex, ansible, wreckingball)
  3. it does not support background operations and blocks the whole thread; this is problematic in some flows, especially image-based provisioning where booting can take a long time. I wonder how we have survived with this for more than 4 years; probably people don't use this kind of provisioning at scale
  4. it's hard to debug failures; given the orchestration is rolled back automatically, it's impossible to pause and investigate what exactly failed, retry some step etc. With the migration to Puma this may even get worse, since we'll time out on proxied requests after some time without knowing what the end state is
  5. you can't parallelize tasks (perhaps not a big deal in this case)
  6. given we hook into ActiveRecord, we don't have good control over the previous and current state when the orchestration is triggered, leading to code like https://github.com/theforeman/foreman/blob/develop/app/models/concerns/interface_cloning.rb
  7. from a code perspective, we're putting orchestration logic into the Host class (which is a god object already); orchestration should not be a concern we include into the class but an external service object, though that may be quite subjective

Of course there are some pros too. It's easy to add an up/down pair of methods. Luckily the code is not touched these days, so it does not break too often. But when I touched it years ago, it was a nightmare. So whenever someone wants to modify it, I'd prefer a proper fix to make it less fragile, with an investigation of how to keep existing extension mechanisms while adding new ones. I think that should be possible, so plugins which can't upgrade in the near future would continue to work.

Wow. Well. I did not expect that! :slight_smile:

If the transactions issue turns out to be a bug fixed in Rails 6, that could solve a big part of my own problem with the orchestration.

Foreman and Katello are a mix of two approaches. Failing early with a rollback is actually what Foreman core orchestration does, while pausing a task and letting an operator investigate is the Katello approach (I think Dynflow also supports synchronous jobs).

Timo highlights one important aspect: Foreman core orchestration is easy to understand, while Dynflow is exactly the opposite. Quite honestly, I don't like the idea of using Dynflow for everything now that both original authors (Ivan/Petr) are no longer involved in the project. I think a lot of the complexity comes from the major feature: the ability to suspend a plan (a job, a process, whatever this is called).

What I do see as a feasible approach is relying on the ActiveJob API to perform background tasks. Of course, this is a super-simple API which only allows spawning one (independent) action, which is exactly the opposite of what Dynflow provides. If we are able to keep the complexity away from core and plugin developers, maybe we can find a reasonable way to use background processing hand in hand with the current orchestration.
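To show how little ceremony that would involve, a minimal ActiveJob sketch; the job name and what it does are hypothetical:

```ruby
# Hypothetical background step implemented as a plain ActiveJob; it is fired
# and forgotten, independently of the orchestration queues.
class DownloadBootFilesJob < ApplicationJob
  queue_as :default

  def perform(host_id)
    host = Host::Managed.find(host_id)
    # ask the TFTP smart proxy to fetch boot files for this host here;
    # the configured ActiveJob backend can retry the job on failure
  end
end

# enqueued from orchestration code once the host is saved:
DownloadBootFilesJob.perform_later(host.id)
```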

I don't think we will be able to solve the god object called Host; everything that happens in Foreman happens around hosts.

I know that was a bold statement. Let me soften it a bit: I like that I, as a developer, can quickly add orchestration steps or quickly develop a new plugin.
Of course, there are problems with the current approach. But they won't go away just by switching the orchestration.

Just some random thoughts on what you mentioned; I'll give this more thought and add to this list:

I think the idea that a host is simply not saved when an error occurs, rather than marked as failed to deploy or similar, makes the whole process quite easy to understand and implement. A user simply retrying to create the host if there were errors is easy and good UX (make changes in the form, hit save again). If we introduce some kind of new host state (it's there, but it's broken), this will be a lot more difficult to explain and maintain.

This is indeed problematic. The only workflow I know of is the one where the SSH finish script runs. Maybe this is the exception where we should move that step out of the normal orchestration and introduce another lifecycle hook (similar to the build state of a host).

Could it be that this is only a UI issue? I believe all the orchestration steps are only shown for a brief timespan in the UI, so it might be hard to read and comprehend the steps. Maybe we need to show more error details in the UI. We could introduce some kind of "rollback reason" that explains what went wrong and possible steps to mitigate the issue.

Is this a feature rather than a bug? This approach keeps things simple.

Yeah, that's - frankly speaking - bad design and technical debt.

I feel that idea is worth more investigation.

What would happen if we simply processed the orchestration queues in an ActiveJob job?

Yes, because our current orchestration framework does not provide anything for this, so we ended up with the clunky SSH finish step and haven't implemented anything else for steps which are good candidates for async processing, for example TFTP/boot file download and possibly others.

We'd lose one of the aspects you like - the resource not being saved when something fails at the end of the queue. But I think if we implemented sync and async orchestration steps (using ActiveJob) as two separate entities, that could do it:

  • sync orchestration steps - unchanged (as they work today)
  • async orchestration steps - they all trigger in parallel (no ordering), no dependencies, though they are guaranteed to trigger after all sync steps; there is no return value, they obviously cannot stop the object from saving, and if they fail the user needs to take action to fix them (see the sketch below the list)
  • both sync and async jobs could have some kind of "output" or "log" available to the user so the whole progress could easily be tracked in the UI/CLI (what was done, what failed, what is still running), similar to what we do today but more permanent (stored in the db for record keeping)
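A very rough sketch of how the async half could be wired, with all names made up; the only point is that async steps fire after commit, in parallel, and cannot influence the save:

```ruby
# Hypothetical wiring: sync steps stay in the existing "queue", while async
# steps only fire after the record (and therefore all sync steps) have been
# committed. Every name below is made up.
module AsyncOrchestration
  extend ActiveSupport::Concern

  included do
    after_commit :enqueue_async_steps, on: [:create, :update]
  end

  def async_steps
    [:download_boot_files, :warm_caches]  # no ordering, no dependencies
  end

  def enqueue_async_steps
    # each step runs in a background worker; it cannot block or undo the save
    async_steps.each { |step| AsyncStepJob.perform_later(id, step.to_s) }
  end
end

class AsyncStepJob < ApplicationJob
  def perform(host_id, step_name)
    host = Host::Managed.find(host_id)
    host.public_send(step_name)  # the outcome should be recorded somewhere user-visible
  end
end
```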

A good candidate would be TFTP/boot file download - the step could actually render a preliminary PXE template that would keep rebooting the host with some delay until the proxy reports a successful download, then it could turn the PXE files into their final form and finish the job.

Although SSH finish could look like a good candidate, I think it's much better to rewrite it to leverage Remote Execution, which provides more capabilities.


It does, although the semantics are a bit different from ActiveJob's perform_now.

It depends on the level at which you want to understand it. I'd say at the Dynflow consumer level (a person writing actions for Dynflow, not developing Dynflow itself) it is not that bad. Of course, if you're after the "I want to be able to reimplement the whole thing off the top of my head" kind of understanding, then I have to agree.

I had to deal with the orchestration once and hated every single moment of it. For me it lacks observability.

As you mentioned, it is a super-simple API. It could become limiting if you wanted to do something more complex than fire-and-forget kinds of jobs. Also, nothing stops you from writing simple things in Dynflow.