Upcoming changes to Dynflow

As you may have heard, there are some changes coming soon to Dynflow.

Current state

In production deployments, there is currently the foreman-tasks/dynflowd service. This service represents an executor. The executor consists of two major logical parts: the orchestrator and the worker. The worker is in fact a collection of worker threads, but let’s treat them as a group to keep things simple.

Current issues

There are three major issues with this approach: memory hogging, scaling, and executor utilization.

Memory hogging
The executor slowly consumes more and more memory, and the only way to reclaim it is to restart the entire process. However, the orchestrator part keeps quite a lot of state, and restarting the process means losing that state, which causes issues.

Scaling
The issue with scaling is that we cannot scale the workers independently of the rest. We can (and do) spawn more worker threads in an executor, or spawn more executors in their entirety, but there is a catch: spawning more threads only helps up to a certain point, after which the GIL kicks in and prevents the threads from actually being useful.

Executor utilization
A job usually consists of several smaller execution items, called steps. When an executor accepts a job, it has to process it completely; there is no way to share the steps with another executor’s workers. This can lead to situations where one executor is heavily loaded with tasks while another sits mostly unused.

Coming changes

These issues were hard to address without making some major changes. We decided to let go of some of Dynflow’s responsibilities and make it more focused on the orchestration part, while leveraging another project for the “raw” background processing. The original idea was to use ActiveJob as an interface and let users decide which backend they want to use. After initial testing we decided to be more opinionated and use Sidekiq directly, which promises better performance when its Sidekiq-specific API is used.

To address the issues described above, we decided to break the executor apart. What used to be an executor gets split into three components. The first one is Redis, which is used for communication between the other two parts: the orchestrator and the workers.

The orchestrator’s role is to decide what should be run and when. Because of its stateful nature, there has to be exactly one instance running at a time. As mentioned earlier, it is backed by Sidekiq and listens for events on the dynflow_orchestrator queue. In the end the orchestrator shouldn’t need to load the entire Rails environment, which should in turn make it thin and allow fast startup.
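To make the queue mechanics concrete, here is a minimal, hypothetical sketch of a Sidekiq job pinned to the dynflow_orchestrator queue; the class and method names are made up for illustration, and the real Dynflow internals are more involved.

```ruby
require 'sidekiq'

# Hypothetical sketch: only Sidekiq processes that listen on the
# dynflow_orchestrator queue will ever pick these jobs up.
class OrchestratorEvent
  include Sidekiq::Worker
  sidekiq_options queue: :dynflow_orchestrator, retry: false

  def perform(execution_plan_id, event)
    # Decide what should run next for this execution plan and enqueue
    # the resulting steps onto the worker queues (illustrative only).
  end
end
```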

The other functional piece is the worker. A worker in this case means a single Sidekiq process with many threads. The worker’s role is to actually perform the jobs, and it should keep only the bare minimum of state. The workers can be scaled horizontally by increasing the number of threads per process or by spawning more processes. We will also be able to dedicate a worker process to a certain queue, meaning we will be able to ensure priority tasks won’t get blocked by others.

To limit the impact of these changes on existing deployments, the new model lives side by side with the old one and can be switched. This allows us to keep using Dynflow “the old way” where needed, for example when using Dynflow on the Smart Proxy for remote execution and Ansible.

This is still a work in progress, but we’re hoping to get this out soon.

TL;DR:

The current problems

  • memory hogging
  • scaling
  • executor utilization

The solution

  • split the executor into an orchestrator and a worker
  • instead of the dynflowd service there will be redis, dynflow-orchestrator and (possibly several) dynflow-worker services

The good parts

  • we’ll be able to scale workers independently
  • we’ll be able to restart workers without having to worry
  • we’ll be able to dedicate workers to certain queues (say host updates, rex…)

To be done

  • deployment related changes (packaging, installer)
  • some tasks don’t fit into the new model well (looking at you LOCE/EQM)

Side note
Currently we use Dynflow as an ActiveJob backend. We don’t have plans to change that in the near future.

Thanks for the summary and all the improvements.

At the moment I think dynflow/tasks is the thing which is most difficult to debug, and this sounds like it can get even harder. Are there any plans for something like a troubleshooting guide as part of the public documentation? I think I am not the only one who would appreciate it.

this sounds like it can get even harder

It might be a bit different, but I wouldn’t say it will get inherently harder.

Are there any plans for something like a troubleshooting guide as part of the public documentation

So far I’m only planning to update the regular documentation on dynflow.github.io.

As for a troubleshooting guide, I’m afraid I don’t have insight into what the most common issues users face are, or at least my idea could be pretty biased. To handle this properly I would be glad if we managed to gather some feedback from the users; then we could pick the most frequent issues and write troubleshooting guides for those.

Hey, thanks for the heads-up.

That’s a bit vague. How many? Is this the main configuration point for scaling dynflow processing?

What on Earth is that?

There should be at least one. This is a cornerstone of Dynflow scaling: users should be able to scale up the number of workers freely (within reasonable limits). My guess would be that people won’t be deploying more than 5 workers unless they have a >50k host environment. We will have to do some tests to determine the optimal number of workers for default installations.

Listen On Candlepin Events (LOCE) and the Event Queue Monitor (EQM). These actions spawn their own threads, which we could tolerate in single-process Dynflow, but which could break in the multi-process environment we will have.

How do you plan on guaranteeing there’s only one orchestrator running? In a load balanced setup it’s common to have two (ideally) identical machines. Can you have the orchestrator running on both machines where one process only monitors the other or should the admin choose one primary machine?

Currently the admin should choose one machine to be the primary one.

I have a couple of ideas but I don’t like any of them.

So far I hope we’ll be able to lift this requirement before we roll this out into the world. For now I’d say being able to deploy this in an active-passive mode would be good enough.

Currently users who do this already need to be aware of this because of how we deploy cronjobs. We also have parameters in the Puppet module to enable/disable the dynflowd service so we can implement a similar pattern.

Are you planning to build this active-passive into the orchestrator? For example, grab a lock in Redis and whenever the lock is taken, assume some other orchestrator is running. Whenever the lock becomes available, the other orchestrator has died. I’d understand if this is more of a phase 2 thing 🙂

Yes, this is exactly the thing I’d like to eventually do.
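For illustration, a minimal sketch of that kind of Redis-lock based election, using the redis gem directly; the key name, TTL and the helper are made up, not the actual implementation:

```ruby
require 'redis'

# Hypothetical active-passive election: whichever process holds the lock
# acts as the orchestrator, the others stay on standby and keep retrying.
LOCK_KEY = 'dynflow:orchestrator_lock' # made-up key name
LOCK_TTL = 30                          # seconds

redis = Redis.new

loop do
  # SET with NX + EX grabs the lock only if nobody holds it; the TTL makes
  # a crashed orchestrator release the lock automatically.
  if redis.set(LOCK_KEY, Process.pid, nx: true, ex: LOCK_TTL)
    # We won the election: keep working and refreshing the lock until we die.
    loop do
      run_orchestrator_iteration        # hypothetical unit of orchestrator work
      redis.expire(LOCK_KEY, LOCK_TTL)  # heartbeat so the lock does not expire
    end
  else
    sleep 5                             # standby: wait and try again
  end
end
```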

I’ve opened up Bug #27674: migrate LOCE and Event Queue off of dynflow - Katello - Foreman to track this work. We should be able to do it at any point in time.

Sorry for being a few months late to the party! I blame my newest kid for distracting me with smiles. You may need a cup of coffee for these questions.

Sidekiq is a framework, is that correct? Are the new dynflow processes Sidekiq instances? Are they in the end just Ruby processes?

Given the many ways Redis is being introduced into our stack, it’s good to see centralization. This means, however, that we need to be quite cognizant of its performance and of all the consumers (Pulp 3, caching, Dynflow).

You may want to reach out to the Pulp project and ask about their experience here. If I understand correctly, they had the same limitation and were able to modify their architecture to allow scaling the orchestrator, which in turn gave greater reliability if one died.

This implies for some period of time a user would have dynflowd, redis, dynflow_orchestrator and dynflow_worker all running at the same time?

In what cases would we want to run it both ways if the end result is a “seamless” transition from the perspective of a developer building Dynflow tasks?

Can you elaborate on what connections each service will need? I assume each one needs a connection to Redis. What about database connections?

How will this be configured?

If I restart a worker in the middle of processing, what happens?
Does this require any restart orchestration across any of the new services, database migrations and the main web server process?

This is Brian from the Pulp team. We’ve had similar issues with the Pulp 2 tasking system. I read through this thread and everything here sounds reasonable. The Pulp resource manager also has a singleton requirement, and we have used database locks in the past to allow multiple processes to start; when one dies, another one takes its lock. We’re also going to port this to Redis for Pulp 3.

We ended up going with a post-fork model which always frees memory at the end of each task. It’s slower, but for the throughput needs of Pulp having that safety and simplicity is nice. It’s a feature of the RQ project: http://python-rq.org/docs/

One area we had to pay attention to was reconnect support for the code in our workers. Make sure to test your reconnect support to Redis/DB/etc.

Pulp also needs coordination around which types of tasks can be run concurrently. We’ve used a multi-locking method to ensure each task has locked the resources it needs.

Handling cleanup after a worker that was killed by the OOM killer is also an important case.

Yes, Sidekiq could be considered a background processing framework. All the new dynflow processes will be Sidekiq instances. In the end they are “just” Ruby processes, but they will need to load the Rails environment anyway. In the future we’d like to get to a point where the orchestrator wouldn’t have to carry the Rails baggage, but that’s still far away.

We are aware of Redis being introduced in other places in our stack, but we were thinking about deploying a dedicated instance for Dynflow. Our primary concern was that you can tune Redis towards a specific workload (caching vs. throughput). I haven’t measured this, but I assume that if we reused a Redis instance configured for caching, throughput for Dynflow would suffer, and the other way around.

Yes, will do that. We’d like to lift the singleton requirement, but it doesn’t have the highest priority right now. We can run orchestrators in a failover mode, which I’d say is good enough for now.

In theory it should be possible to run all of those at once, but I wouldn’t recommend it. My idea was that the upgrade will disable the old service and enable the new ones, and fresh installs will always get the new ones.

We’re keeping the old way around for using Dynflow on the Smart Proxy, where we probably don’t want to deploy two dynflow processes and redis just to run REX and friends.

Each will need a connection to Redis and Postgres and I think this probably won’t change.

Good question. The most obvious idea would be to have a systemd template unit (dynflow@), where the instance specifier would dictate which sidekiq configuration should be used.

So for example for dynflow@orchestrator, there would be a config in /etc/foreman/dynflow/orchestrator.yml which would say it should have N threads and consume jobs from the dynflow_orchestrator queue. Workers would be configured the same way. “Dumb” scaling could be done by just symlinking the configuration and then starting another service. Queue dedication could be done by copying the default configuration, changing the queues there and starting a new service. How does this sound?
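As a concrete illustration, the hypothetical /etc/foreman/dynflow/orchestrator.yml could simply be a regular Sidekiq configuration file along these lines (the values are illustrative, not a final default):

```yaml
# Hypothetical /etc/foreman/dynflow/orchestrator.yml -- a plain Sidekiq config;
# a dedicated worker instance would differ only in :queues and :concurrency.
:concurrency: 1
:queues:
  - dynflow_orchestrator
```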

There is a timeout within which we hope the worker will finish. If it doesn’t finish in time, it will be killed and the job will be marked as failed.

There shouldn’t be a need for any {,re}start orchestration, the services should be able to start in any order, they just won’t be fully operational until the entire service set is up.

Thanks for chiming in. It is good to hear that our proposed solution sounds solid to someone who dealt with a similar thing.

That’s what we decided to do as well.

Usually the jobs executed by Sidekiq should be tiny, and I’m afraid we’d incur a major performance hit because of the forking overhead. But thanks for the suggestion, I’ll keep it in mind.

Good point, we will need to check this.

So far we don’t really need this, noted for future reference.

We should have this case covered in a best-effort way. When this happens, the job will most likely be marked as failed, as we can’t tell where exactly it died, but at least it should not block others from being processed.

From your testing, do you have an idea of the memory footprint at rest for each worker given it’s loading the entire Rails environment?

Does the dynflow_orchestrator also load the entire Rails environment or is it more lightweight?

This makes a lot of sense. The interesting part is how we would deploy and track multiple instances of Redis.

I should know what this is. Can you expand on failover mode for me?

This implies that the dynflow services we are deploying are intimately tied to Foreman’s deployment and environment. If that is true, I wonder whether the service names should be prefixed with foreman_ to indicate this. For example, Pulp does this with pulp_worker and pulp_resource_manager. As in, it sounds more like we are using Sidekiq and Dynflow as frameworks and libraries but deploying a specific task processing configuration for the Foreman environment.

I am wondering about LOCE & EQM since I’ve started the work to move them away from dynflow[1]. There are some challenges involved, so I want to make sure that we’ve thought this all through.

As was mentioned, these actions create their own threads and wake up every few seconds to look for and process work. From what I can tell this threading approach was the driver for moving them out of Dynflow. But what if we changed the model? For example, could we rework them to be more cron-like: run every 5 seconds, drain and process any pending work, and complete. Could that, or something similar, fit into the new Dynflow? (A rough sketch of this shape follows below.)

[1] https://github.com/Katello/katello/pull/8366
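For illustration only, the cron-like shape could look roughly like the following self-rescheduling Sidekiq job; the class name and the drain helper are made up, and this is a sketch of the suggestion, not an actual implementation:

```ruby
require 'sidekiq'

# Hypothetical sketch of a "wake up, drain, complete" job that replaces a
# long-lived listener thread.
class DrainCandlepinEvents
  include Sidekiq::Worker
  sidekiq_options retry: false

  def perform
    drain_pending_events # made-up helper: connect, pull waiting messages, process them
  ensure
    # Re-schedule ourselves in 5 seconds: cron-like behaviour without a dedicated thread.
    self.class.perform_in(5)
  end
end
```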

Not really, but I’ll configure a production Satellite to use the new way and report back with exact numbers.

Currently both orchestrator and the workers need to load the rails environment.

Containers? Just kidding. We could install redis from packages, create a new service as a symlink to the redis unit and then create override files for the new service to make it use different configuration.

Multiple orchestrator processes can be run at once, but only one is truly active and the others are on standby. When the active one dies, another takes its place.

That was just an example which would make sense in our environment, but we could decouple it, store the configurations elsewhere and keep the service names.

Yes, the original cause was that spawning threads is a no-no. If there were only one worker, it would work, but things would start to break once you start scaling workers up: you could end up with many poller threads (in different workers).

It would be great if we could change the model the way you suggest. My concern is that LOCE/EQM are reading the messages off a bus and I’m not sure how much the bus would like having a client constantly reconnecting to it, as no connection could be persisted across drains.

As promised, some numbers.

I spawned Redis, a single orchestrator, a worker for the default queue with 20 threads, and a worker for remote execution with 20 threads. The RSS values below (in kilobytes) are taken from the output of ps -aux.

Process             RSS fresh   RSS after 1k hosts   RSS after 10k hosts
Redis                   10344                10360                 10348
Orchestrator           445008               510428                546368
Worker (default)       449732               466376                514368
Worker (REX)           438200               664128                980056

This confirms my theory that if everything works smoothly then redis should consume almost no memory. Of course it may grow if jobs start piling up.

I also tried Katello things: I synced some RHEL 8 repos ({BaseOS,AppStream}-{rpms,kickstart}) and it worked smoothly.

Do you have numbers for a comparison with the current implementation? I’m also slightly surprised our Rails stack takes ~440MB of memory by default.

That’s a common size with multiple plugins; when running just Foreman (no plugins) using containers / OpenShift I see similar sizes.

Would you provide container images for all of the new services (or at least one “worker” image)? Thanks.