Dynflow Orchestrator Will Not Become Active

Problem:
No orchestrators go into active mode.

Expected outcome:
Orchestrators go into active mode.

Foreman and Proxy versions:

[root@10-222-206-158 ~]# yum list installed | grep rubygem-foreman
rubygem-foreman-tasks.noarch                             11.0.0-1.fm3_15.el9            @foreman-plugins
rubygem-foreman_kubevirt.noarch                          0.4.1-1.fm3_15.el9             @foreman-plugins
rubygem-foreman_maintain.noarch                          1:1.10.3-1.el9                 @foreman
rubygem-foreman_puppet.noarch                            9.0.0-1.fm3_15.el9             @foreman-plugins
rubygem-foreman_remote_execution.noarch                  16.0.3-1.fm3_15.el9            @foreman-plugins
rubygem-foreman_salt.noarch                              17.0.2-1.fm3_15.el9            @foreman-plugins
rubygem-foreman_statistics.noarch                        2.1.0-3.fm3_11.el9             @foreman-plugins
rubygem-foreman_templates.noarch                         10.0.8-1.fm3_15.el9            @foreman-plugins
rubygem-foreman_vault.noarch                             3.0.0-1.fm3_15.el9             @foreman-plugins
rubygem-foreman_webhooks.noarch                          4.0.1-1.fm3_15.el9             @foreman-plugins
[root@10-222-206-158 ~]#
[root@10-222-206-158 ~]#
[root@10-222-206-158 ~]# yum list installed | grep dyn
dynflow-utils.x86_64                                     1.6.3-1.el9                    @foreman
foreman-dynflow-sidekiq.noarch                           3.15.0-1.el9                   @foreman
rubygem-dynflow.noarch                                   1.9.1-1.el9                    @foreman

Distribution and version:

Alma 9

Other relevant data:

I cannot get our orchestrators to become active. We have 2 Foreman servers, using a redis broker and a psql backend. Even if I completely purge postgres and redis of their knowledge of any ExecutorLocks, it still fails. I have been working on this for over 9 hours now and cannot get either orchestrator to obtain a reliable lock. Purging, flushing the redis DB, server restarts: I've literally tried everything.
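For clarity, by "purge" I mean deleting the lock rows from the coordinator table, roughly along these lines (treat the exact filter as an approximation of what I actually ran):

# Roughly what my "purge" looked like; the WHERE clause may have differed slightly
PGPASSWORD="$PGPASS" psql -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -c \
  "DELETE FROM dynflow_coordinator_records WHERE class LIKE 'Dynflow::Coordinator::%Lock';"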

Current state:

[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -c "SELECT class, owner_id, COUNT(*)
FROM dynflow_coordinator_records
WHERE class IN ('Dynflow::Coordinator::ExecutorWorld','Dynflow::Coordinator::ClientWorld','Dynflow::Coordinator::DelayedExecutorLock','Dynflow::Coordinator::ExecutionInhibitionLock')
GROUP BY class, owner_id
ORDER BY class, owner_id;"
                   class                    |                  owner_id                  | count
--------------------------------------------+--------------------------------------------+-------
 Dynflow::Coordinator::ClientWorld          |                                            |     3
 Dynflow::Coordinator::DelayedExecutorLock  | world:f318bd94-f5d5-425f-bbc7-9bef17d9f3cd |     1
 Dynflow::Coordinator::ExecutorWorld        |                                            |     1
(3 rows)

So in this case psql thinks it's: f318bd94-f5d5-425f-bbc7-9bef17d9f3cd
And redis also agrees:

[root@10-222-206-158 ~]# redis-cli -h $REDIS_HOST -p $REDIS_PORT -n $REDIS_DB GET dynflow_orchestrator_uuid
"f318bd94-f5d5-425f-bbc7-9bef17d9f3cd"

However, despite this, neither orchestrator thinks it's active:

[root@10-222-206-152 ~]# systemctl status dynflow-sidekiq@orchestrator.service
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq
     Loaded: loaded (/usr/lib/systemd/system/dynflow-sidekiq@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/dynflow-sidekiq@.service.d
             └─override.conf
     Active: active (running) since Mon 2025-11-10 20:17:40 UTC; 8min ago
       Docs: https://theforeman.org
   Main PID: 12779 (sidekiq)
     Status: "orchestrator in passive mode"
      Tasks: 10 (limit: 407761)
     Memory: 508.6M
        CPU: 5min 9.617s
     CGroup: /system.slice/system-dynflow\x2dsidekiq.slice/dynflow-sidekiq@orchestrator.service
             └─12779 /usr/bin/ruby /usr/bin/sidekiq -e production -r /usr/share/foreman/extras/dynflow-sidekiq.rb -C /etc/foreman/dynflow/orchestrator.yml

Nov 10 20:17:31 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Nov 10 20:17:31 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[12779]: 2025-11-10T20:17:31.246Z pid=12779 tid=ajf INFO: Enabling systemd notification integration
Nov 10 20:17:33 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[12779]: 2025-11-10T20:17:33.198Z pid=12779 tid=ajf INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Nov 10 20:17:33 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[12779]: 2025-11-10T20:17:33.201Z pid=12779 tid=ajf INFO: GitLab reliable fetch activated!
Nov 10 20:17:40 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.

[root@10-222-206-158 ~]# systemctl status dynflow-sidekiq@orchestrator.service
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq
     Loaded: loaded (/usr/lib/systemd/system/dynflow-sidekiq@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/dynflow-sidekiq@.service.d
             └─override.conf
     Active: active (running) since Mon 2025-11-10 20:22:59 UTC; 4min 7s ago
       Docs: https://theforeman.org
   Main PID: 3348261 (sidekiq)
     Status: "orchestrator in passive mode"
      Tasks: 9 (limit: 407762)
     Memory: 298.1M
        CPU: 9.287s
     CGroup: /system.slice/system-dynflow\x2dsidekiq.slice/dynflow-sidekiq@orchestrator.service
             └─3348261 /usr/bin/ruby /usr/bin/sidekiq -e production -r /usr/share/foreman/extras/dynflow-sidekiq.rb -C /etc/foreman/dynflow/orchestrator.yml

Nov 10 20:22:49 10-222-206-158.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Nov 10 20:22:49 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[3348261]: 2025-11-10T20:22:49.564Z pid=3348261 tid=1zqtx INFO: Enabling systemd notification integration
Nov 10 20:22:51 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[3348261]: 2025-11-10T20:22:51.600Z pid=3348261 tid=1zqtx INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc>
Nov 10 20:22:51 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[3348261]: 2025-11-10T20:22:51.602Z pid=3348261 tid=1zqtx INFO: GitLab reliable fetch activated!
Nov 10 20:22:59 10-222-206-158.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.

I'm really not sure what else to do here; I've tried and tried everything I know.

You can see the workers are handling their tasks, but the dynflow coordinator records (and the running task count) just keep growing, and no jobs complete:

[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -c "SELECT
(SELECT COUNT(*) FROM dynflow_coordinator_records) AS coordinator_records,
(SELECT COUNT(*) FROM dynflow_delayed_plans)       AS delayed_plans,
(SELECT COUNT(*) FROM foreman_tasks_tasks WHERE state='running') AS running_tasks;"
 coordinator_records | delayed_plans | running_tasks
---------------------+---------------+---------------
                3715 |             0 |         15142
(1 row)

[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -c "SELECT
(SELECT COUNT(*) FROM dynflow_coordinator_records) AS coordinator_records,
(SELECT COUNT(*) FROM dynflow_delayed_plans)       AS delayed_plans,
(SELECT COUNT(*) FROM foreman_tasks_tasks WHERE state='running') AS running_tasks;"
 coordinator_records | delayed_plans | running_tasks
---------------------+---------------+---------------
                3718 |             0 |         15145
(1 row)

Any ideas?

The passive/active state is based purely on the entry in redis. The orchestrators try to acquire the lock with a TTL of 60 seconds and refresh it using a Lua-in-redis script every 15 seconds.
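To make that concrete, the acquire/refresh pattern can be sketched with plain redis-cli. The actual Dynflow code differs, so treat the commands and the Lua body below purely as an illustration of the mechanism; $MY_WORLD_UUID is just a placeholder for the orchestrator's own world UUID.

# Acquire: only succeeds if no other orchestrator holds the key; the key expires after 60s
redis-cli -h $REDIS_HOST -p $REDIS_PORT -n $REDIS_DB \
  SET dynflow_orchestrator_uuid "$MY_WORLD_UUID" NX EX 60

# Refresh (roughly every 15s): extend the TTL only if this world still owns the key
redis-cli -h $REDIS_HOST -p $REDIS_PORT -n $REDIS_DB EVAL \
  "if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('EXPIRE', KEYS[1], 60) else return 0 end" \
  1 dynflow_orchestrator_uuid "$MY_WORLD_UUID"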

Chances are you already tried this, but I’d try to:

  1. Stop both orchestrators
  2. Wait two minutes
  3. Check that the key in redis is unset (see the commands sketched after this list)
    1. If it is still set, wipe it with a DEL command in redis
  4. Start one orchestrator
  5. Wait for it to catch up
  6. Start the second
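
For steps 3 and 3.1, the check and cleanup would look something like this, reusing the connection parameters from your earlier output:

# Is the lock key still present with both orchestrators stopped?
redis-cli -h $REDIS_HOST -p $REDIS_PORT -n $REDIS_DB GET dynflow_orchestrator_uuid

# If it is, remove it
redis-cli -h $REDIS_HOST -p $REDIS_PORT -n $REDIS_DB DEL dynflow_orchestrator_uuid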

Even if this works, it probably won't address the problem in the long run, but getting to the root cause will probably take some time. Do you have any notion how you got into this situation?

Thanks for the reply @aruzicka - I was hoping to hear from you :slight_smile:

So after 12 painful hours yesterday trying to get this working, I gave up and went to bed. I came back this morning and everything is working. We have seen this with dynflow many, many times: we just wait 5 mins, 20 mins, 2 hours, or some other random amount of time, and then things all of a sudden just work. So this raises the question: is there some sort of strange wait or delay? In this case, based on my logs, it looks like everything started working 16 mins after my last flush/restart of all services.

Do you have any notion how you got into this situation?

Yep, I know exactly how we got there:

Nov 09 06:08:27 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1568914]: E, [2025-11-09T06:08:27.640064 #1568914] ERROR -- /connector-database-core: Receiving envelopes failed on PG::UnableToSend: no connection to the server

Nov 09 06:03:43 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1568914]: 2025-11-09T06:03:43.139Z pid=1568914 tid=xnaa WARN: ActiveRecord::StatementInvalid: PG::ConnectionBad: PQconsumeInput() SSL connection has been closed unexpectedly
Nov 09 06:03:43 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1568914]: 2025-11-09T06:03:43.139Z pid=1568914 tid=xnaa WARN: /usr/share/gems/gems/activerecord-7.0.8.7/lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `exec'

We had a network outage that lasted about 4 minutes, which caused the connection from the Foreman servers to the postgres k8s cluster to go down.

Thanks again for the reply! Time for me to write up some simple service that reads the logs, looks for a certain state, and restarts the orchestrator/workers.
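For anyone curious, the watchdog I have in mind is roughly the sketch below; the trigger condition and thresholds are my own guesses, not anything Foreman ships:

#!/bin/bash
# Hypothetical watchdog sketch: if the orchestrator reports passive mode while nobody
# holds the lock key in redis, restart it. Run from a cron job or systemd timer.
STATUS=$(systemctl show -p StatusText --value dynflow-sidekiq@orchestrator.service)
LOCK=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" GET dynflow_orchestrator_uuid)

if [[ "$STATUS" == *"passive mode"* && -z "$LOCK" ]]; then
  echo "orchestrator passive and no lock holder in redis, restarting"
  systemctl restart dynflow-sidekiq@orchestrator.service
fi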

I can’t think of any explicit wait or delay that we intentionally have in the code that could lead to things hanging for random amounts of time. It feels like some timeout somewhere that we’re not aware of that gets triggered on loss of connection.

I was worried you were going to link that :slight_smile:

Did the connection to redis go down too or was it unaffected? I’m not sure when I’ll have the time, but I’d like to attempt to reproduce this in laboratory conditions.

Thread derailment warning: Only tangentially related
You have workers running on both servers, but only one of the two orchestrators should be active at a time. That is a limitation of dynflow that applies within the scope of a single redis database. In theory, you should be able to point the dynflow-sidekiq@* services of each of the machines to its own redis database (it could be within a single redis instance) and get both orchestrators active at the same time, with each handling that machine’s set of workers. There would be slight differences in how the things get distributed among the workers, but it might be beneficial in situations where a single orchestrator is a bottleneck. Would that be something you’d be interested in?

I am not completely certain whether the connection to redis went down, but I would say there is a >95% chance that it did, because our Foreman, psql, and redis all live in the same environment. So I'm pretty sure it did.

You have workers running on both servers, but only one of the two orchestrators should be active at a time. That is a limitation of dynflow that applies within the scope of a single redis database. In theory, you should be able to point the dynflow-sidekiq@* services of each of the machines to its own redis database (it could be within a single redis instance) and get both orchestrators active at the same time, with each handling that machine’s set of workers. There would be slight differences in how the things get distributed among the workers, but it might be beneficial in situations where a single orchestrator is a bottleneck. Would that be something you’d be interested in?

Correct. We have 3 workers running on each of the 2 respective foreman servers (under a load balancer).

I have read many of your past tickets/comments regarding wanting to allow multiple orchestrators!

It would be quite interesting and also quite easy for me to implement. I didn’t even think of the limitation being redis. We constantly (or more frequently than I’d like) have issues with the orchestrators. So this would be really helpful. I think I’ll give this a try very soon.

Well, we tried adding a new redis server and pointing jobs-01 to redis-01 and jobs-02 to redis-02, but after 15 mins we just could not get both to run in active mode. One always came up as active and the other passive, no matter what we attempted. We also discovered that it breaks the sidekiq dashboard, since it's behind a load balancer and doesn't know where to get the data from.
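
If I revisit the split-redis experiment, the first thing I'll check is what each sidekiq process actually connected to and which redis ended up holding the lock key, along these lines (redis-01/redis-02 being the two hosts mentioned above):

# Which redis URL did each orchestrator boot against? (same "Booting Sidekiq" line as in the journal output above)
journalctl -u dynflow-sidekiq@orchestrator.service | grep "Booting Sidekiq"

# Does each redis end up with its own dynflow_orchestrator_uuid key?
redis-cli -h redis-01 -p $REDIS_PORT -n $REDIS_DB GET dynflow_orchestrator_uuid
redis-cli -h redis-02 -p $REDIS_PORT -n $REDIS_DB GET dynflow_orchestrator_uuid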