Orchestration framework problems

Hello,

some years ago, Ivan identified a major issue with our orchestration based on ActiveRecord. Specifically, we cannot use explicit SQL transactions, which makes most of our actions fragile when it comes to concurrency.

For example, during automated host discovery, Foreman must find a matching rule, count the hosts already provisioned by that rule, compare the count to the rule's limit and initiate provisioning. When multiple hosts are being provisioned concurrently, this runs into issues and, as a result, rules with a limit set are overprovisioned. This was articulated in:

I can find many other examples where a simple SQL transaction would solve the problem, however our orchestration framework does not allow it. So here are a couple of topics for this thread.

First. Shall we start discussing a different approach than abusing ActiveRecord to do orchestration? This would probably be a huge effort, since most of our plugins use the orchestration framework. For the record, we have a simple framework defined in app/models/concerns/orchestration.rb which provides two queues: queue and post_queue. These are just lists of methods to be called from the around_save and after_commit ActiveRecord hooks. This is indeed easy to understand for Rails developers, however it is not ideal.
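To make that concrete, here is roughly what registering a step looks like; the method names below are made up and the exact create options may differ from what core actually uses:

```ruby
# Illustrative only: a model including the Orchestration concern registers
# steps from an after_validation callback. The step methods are made up.
class Host::Managed < Host::Base
  include Orchestration

  after_validation :queue_example_orchestration

  def queue_example_orchestration
    # "queue" steps run while the host is being saved (around_save)
    queue.create(name: "Set example resource", priority: 10,
                 action: [self, :set_example])
    # "post_queue" steps run once the record has been committed (after_commit)
    post_queue.create(name: "Notify example service", priority: 100,
                      action: [self, :notify_example])
  end

  def set_example
    true # call an external service here; a falsey result fails the step
  end

  def notify_example
    true
  end
end
```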

Second. For the particular issue I am trying to solve (discovery and post commit), it looks like removing post_queue would help. It does not appear to be widely used: basically only SSH provisioning and PuppetCA use it in core, and then I was able to find the Discovery plugin. I wonder what the reason was to implement this queue as a separate one - Amos moved it 8 years ago without much commentary, that's all I can read from the git log. All those actions could be high-priority normal queue tasks.

Any kind of change in this regard will likely affect many plugins, but I would love to at least solve the post_queue problem, which would enable us to use more explicit SQL transactions. We learned to ignore them over time because of this specific problem (I hope it's just this and not more) and many of our calls are not concurrency-friendly.

Looking into the after_commit problem, it looks like some problems were identified in Rails and fixed in Rails 6:

I wrote a unit test for ActiveRecord covering this particular issue and it looks like this has been fixed now:

If I understand Ivan's analysis correctly, after we upgrade to Rails 6 we can start using explicit transactions for saving hosts too. Can someone confirm, or did I get this wrong?
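For illustration, this is roughly the pattern I have in mind for the discovery case; the model and attribute names are assumptions rather than the actual plugin code:

```ruby
# Sketch only: serialize the "count hosts vs. rule limit" check with a row
# lock inside an explicit transaction so concurrent discoveries cannot
# overprovision a rule. DiscoveryRule and max_count are assumed names.
ActiveRecord::Base.transaction do
  rule = DiscoveryRule.lock.find(rule_id)   # SELECT ... FOR UPDATE
  raise ActiveRecord::Rollback if rule.hosts.count >= rule.max_count

  host.save!  # on Rails 6, after_commit callbacks fire only after the real COMMIT
end
```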

To 1: I believe the framework should be moved to something more robust, but that has already been discussed many times. If we wanted to do it, we would need to find a way of communicating about the solution (maybe some diagrams that can be versioned would do :man_shrugging:)

To 2: I think you are correct, Ivan's analysis makes sense; that is one of the reasons why Rails changed the behaviour of exceptions in the save process. So I believe it should be safe now.

Isn't the post queue also used for https://github.com/theforeman/foreman_salt/blob/08c5b56b33f263964a96fbc18cb62e9584fe8ce2/app/models/foreman_salt/concerns/host_managed_extensions.rb#L45 ? Something similar also exists in foreman_bootdisk.

If there are major changes in this area, I would vote for something in which you can run a method in Foreman once the host is really completely provisioned. So there should be more "hooks" which can be used to do "something" in core and in plugins.

FWIW, I'd deeply miss this hook if it were gone. We also use it in some internal plugins.


Would a new foreman_webhooks solve this, or do you need to control the provisioning process from the hook?

I think instead of tweaking or dropping some hooks in the existing orchestration, we should move it to a proper tasking system (foreman_tasks/dynflow) that supports a subscribing/extending mechanism.

My comment is totally unrelated to foreman_webhooks. The internal plugins we maintain are native Foreman plugins that use the public APIs. I consider the host orchestration part of Foreman core's plugin API.
I'm a huge fan of the host orchestration we currently use. It's very simple and gets the job done well. I don't think that changing to some other task system is really helpful. What would we gain from it, except probably breaking all existing plugins?
We've talked about redoing the orchestration system quite often in the past, but it never happened. I'd say that's because it's just not necessary.

The current design is super easy: a developer queues all the tasks that should be executed before the host is saved (the normal queue) and all tasks that require the host record to be saved beforehand (the post queue). Tasks have an order and a mechanism to roll them back if one of them fails. What more do we need?
Yes, a known limitation is that all the actions in the queue happen inside a database transaction. Correct me if I'm wrong, but I believe that is how it needs to be.
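If I remember the convention correctly, the rollback works by swapping "set" and "del" in the method name of already completed steps. A made-up pair, just to illustrate the shape:

```ruby
# Made-up up/down pair; assuming the set/del naming convention, del_example
# is what gets called to undo set_example when a later queued step fails.
module ExampleOrchestration
  def set_example
    Rails.logger.info("creating the external resource")
    true  # a truthy return marks the step as successful
  end

  def del_example
    Rails.logger.info("removing the external resource")
    true
  end
end
```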


My comment is totally unrelated to foreman_webhooks.

Please don't get me wrong. I'm not saying we should simply drop something that plugins rely on. I can imagine we could even keep these hooks but replace their implementation. What I wanted to figure out was whether known extensions need to control the orchestration flow or not. And in the future, if we ever change our provisioning orchestration, we'll be able to address these use cases by other means.

The internal plugins we maintain are native Foreman plugins that use the public APIs. I consider the host orchestration part of Foreman core's plugin API.

That's surprising to me. I'd like to better understand how developers figure out what the public API is. I don't mean it negatively. My motivation for this is to set the right expectations, or maybe even to define them.

Perhaps this deserves a separate post, and I'm happy to start that discussion if you'd be interested, but here is at least a quick summary of my view. I think plugins should use what we described in how to write a plugin, and we keep the REST API backwards compatible. I'd agree that template macros can be seen as something in between, but I don't think we ever called them a stable API. We tried to define the plugin interface in plugin.rb. I know the reality is that a lot of plugins use monkey patches, method wrapping, deface etc. Sometimes just defining a method in some included concern simply works. But I hope everyone understands the risk of relying on such code.

I'm a huge fan of the host orchestration we currently use

You're the first developer I know of :slight_smile: I'll tell you what I dislike about the current orchestration. I'm backing some of these arguments with existing Redmine issues, but I have hit all of these myself several times.

  1. transactions - IMHO there is no good reason for this limitation, it is just an outcome of the current implementation
  2. it's inconsistent with other orchestration tasks we have in the ecosystem (katello, rex, ansible, wreckingball)
  3. it does not support background operations and blocks the whole thread; this is problematic in some flows, especially image-based provisioning where booting can take a long time. I wonder how we have survived with this for more than 4 years; probably people don't use this kind of provisioning at scale
  4. it's hard to debug failures; given the orchestration is rolled back automatically, it's impossible to pause and investigate what exactly failed, retry some step etc. With the migration to Puma this may even get worse, since we'll time out on proxied requests after some time without knowing what the end state is
  5. you can't parallelize tasks (perhaps not a big deal in this case)
  6. given we hook into ActiveRecord, we don't have good control over the previous and current state when the orchestration is triggered, leading to code like https://github.com/theforeman/foreman/blob/develop/app/models/concerns/interface_cloning.rb
  7. from a code perspective, we're putting orchestration logic into the Host class (which is a god object already); orchestration should not be a concern we include into the class but an external service object, though that may be quite subjective

Of course there are some pros too. It's easy to add an up/down pair of methods. Luckily the code is not touched these days, so it does not break too often. But when I touched it years ago, it was a nightmare. So whenever someone wants to modify it, I'd prefer a proper fix to make it less fragile, with an investigation of how to keep existing extension mechanisms while adding new ones. I think that should be possible, so plugins which can't upgrade in the near future would continue to work.

Wow. Well. I did not expect that! :slight_smile:

If the transactions issue turns out to be a bug fixed in Rails 6, that could solve a big part of my own problem with the orchestration.

Foreman and Katello are a mix of two approaches. Failing early with a rollback is actually what Foreman core orchestration does, while pausing a task and letting an operator investigate is the Katello approach (I think Dynflow also supports synchronous jobs).

Timo highlights one important aspect: Foreman core orchestration is easy to understand, while Dynflow is exactly the opposite. Quite honestly, I don't like the idea of using Dynflow for everything now that both original authors (Ivan/Petr) are no longer involved in the project. I think a lot of the complexity comes from the major feature: the ability to suspend a plan (a job, a process, whatever this is called).

What I do see as a feasible approach is relying on the ActiveJob API to perform background tasks. Of course, this is a super-simple API which only allows spawning one (independent) action, which is exactly the opposite of what Dynflow provides. If we are able to keep the complexity away from core and plugin developers, maybe we can find a reasonable way to use background processing hand in hand with the current orchestration.
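To show how little ceremony that would involve, a minimal ActiveJob sketch; the job name and what it does are hypothetical:

```ruby
# Hypothetical background step implemented as a plain ActiveJob; it is fired
# and forgotten, independently of the orchestration queues.
class DownloadBootFilesJob < ApplicationJob
  queue_as :default

  def perform(host_id)
    host = Host::Managed.find(host_id)
    # ask the TFTP smart proxy to fetch boot files for this host here;
    # the configured ActiveJob backend can retry the job on failure
  end
end

# enqueued from orchestration code once the host is saved:
DownloadBootFilesJob.perform_later(host.id)
```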

I don't think we will be able to solve the god object called Host; everything that happens in Foreman happens around hosts.

I know that was a bold statement. Let me soften it a bit: I like that I, as a developer, can quickly add orchestration steps or quickly develop a new plugin.
Of course, there are problems with the current approach. But they won't go away just by switching the orchestration.

Just some random thoughts on what you mentioned; I'll give this more thought and add to this list:

I think the idea that a host is simply not saved when an error occurs, rather than marked as failed to deploy or similar, makes the whole process quite easy to understand and implement. A user simply retrying to create the host if there were errors is easy and good UX (make changes in the form, hit save again). If we introduce some kind of new host state (it's there, but it's broken), this will be a lot more difficult to explain and maintain.

This is indeed problematic. The only workflow I know of is the one where the SSH finish script runs. Maybe this is the exception where we should move that step out of the normal orchestration and introduce another lifecycle hook (similar to the build state of a host).

Could it be that this is only a UI issue? I believe all the orchestration steps are only shown for a brief timespan in the UI, so it might be hard to read and comprehend the steps. Maybe we need to show more error details in the UI. We could introduce some kind of "rollback reason" that explains what went wrong and possible steps to mitigate the issue.

Is this a feature rather than a bug? This approach keeps things simple.

Yeah, that's - frankly speaking - bad design and technical debt.

I feel that idea is worth more investigation.

What would happen if we simply processed the orchestration queues in an ActiveJob job?

Yes, because our current orchestration framework does not provide anything for this, so we ended up with the clunky SSH finish step and haven't implemented anything else for steps which are good candidates for async processing, for example TFTP/boot file download and possibly others.

We'd lose one of the aspects you like - the resource not being saved when something fails at the end of the queue. But I think if we implemented sync and async orchestration steps (using ActiveJob) as two separate entities, that could do it:

  • sync orchestration steps - unchanged (as they work today)
  • async orchestration steps - they all trigger in parallel (no ordering), no dependencies, though they are guaranteed to trigger after all sync steps; there is no return value, they obviously cannot stop the object from saving, and if they fail the user needs to take action to fix them (see the sketch below the list)
  • both sync and async jobs could have some kind of "output" or "log" available to the user so the whole progress could easily be tracked in the UI/CLI (what was done, what failed, what is still running), similar to what we do today but more permanent (stored in the db for record keeping)
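A very rough sketch of how the async half could be wired, with all names made up; the only point is that async steps fire after commit, in parallel, and cannot influence the save:

```ruby
# Hypothetical wiring: sync steps stay in the existing "queue", while async
# steps only fire after the record (and therefore all sync steps) have been
# committed. Every name below is made up.
module AsyncOrchestration
  extend ActiveSupport::Concern

  included do
    after_commit :enqueue_async_steps, on: [:create, :update]
  end

  def async_steps
    [:download_boot_files, :warm_caches]  # no ordering, no dependencies
  end

  def enqueue_async_steps
    # each step runs in a background worker; it cannot block or undo the save
    async_steps.each { |step| AsyncStepJob.perform_later(id, step.to_s) }
  end
end

class AsyncStepJob < ApplicationJob
  def perform(host_id, step_name)
    host = Host::Managed.find(host_id)
    host.public_send(step_name)  # the outcome should be recorded somewhere user-visible
  end
end
```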

A good candidate would be TFTP/boot file download - the step could actually render a preliminary PXE template that would keep rebooting the host with some delay until the proxy reports a successful download, then it could turn the PXE files into their final form and finish the job.

Although SSH finish could look like a good candidate, I think it's much better to rewrite it to leverage Remote Execution, which provides more capabilities.


It does, although the semantics are a bit different from ActiveJob's perform_now.

It depends on the level at which you want to understand it. I'd say at the Dynflow consumer level (a person writing actions for Dynflow, not developing Dynflow itself) it is not that bad. Of course, if you're after the "I want to be able to reimplement the whole thing off the top of my head" kind of understanding, then I have to agree.

I had to deal with the orchestration once and hated every single moment of it. For me it lacks observability.

As you mentioned, it is a super-simple API. It could become limiting if you wanted to do something more complex than fire-and-forget kinds of jobs. Also, nothing stops you from writing simple things in Dynflow.