Dynflow Orchestrator Will Not Become Active

Problem:
No orchestrators go into active mode.

Expected outcome:
Orchestrators go into active mode.

Foreman and Proxy versions:

[root@10-222-206-158 ~]# yum list installed | grep rubygem-foreman
rubygem-foreman-tasks.noarch                             11.0.0-1.fm3_15.el9            @foreman-plugins
rubygem-foreman_kubevirt.noarch                          0.4.1-1.fm3_15.el9             @foreman-plugins
rubygem-foreman_maintain.noarch                          1:1.10.3-1.el9                 @foreman
rubygem-foreman_puppet.noarch                            9.0.0-1.fm3_15.el9             @foreman-plugins
rubygem-foreman_remote_execution.noarch                  16.0.3-1.fm3_15.el9            @foreman-plugins
rubygem-foreman_salt.noarch                              17.0.2-1.fm3_15.el9            @foreman-plugins
rubygem-foreman_statistics.noarch                        2.1.0-3.fm3_11.el9             @foreman-plugins
rubygem-foreman_templates.noarch                         10.0.8-1.fm3_15.el9            @foreman-plugins
rubygem-foreman_vault.noarch                             3.0.0-1.fm3_15.el9             @foreman-plugins
rubygem-foreman_webhooks.noarch                          4.0.1-1.fm3_15.el9             @foreman-plugins
[root@10-222-206-158 ~]#
[root@10-222-206-158 ~]#
[root@10-222-206-158 ~]# yum list installed | grep dyn
dynflow-utils.x86_64                                     1.6.3-1.el9                    @foreman
foreman-dynflow-sidekiq.noarch                           3.15.0-1.el9                   @foreman
rubygem-dynflow.noarch                                   1.9.1-1.el9                    @foreman

Distribution and version:

Alma 9

Other relevant data:

I cannot get our orchestrators to become active. We have 2 Foreman servers, using a Redis broker and a PostgreSQL backend. Even if I completely purge Postgres and Redis of their knowledge of any ExecutorLocks, it still fails. I have been working on this for over 9 hours now, and cannot get either orchestrator to get a reliable lock. Purging, flushing the Redis DB, server restarts: I've literally tried everything.

Current state:

[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -c "SELECT class, owner_id, COUNT(*) FROM dynflow_coordinator_records WHERE class IN ('Dynflow::Coordinator::ExecutorWorld','Dynflow::Coordinator::ClientWorld','Dynflow::Coordinator::DelayedExecutorLock','Dynflow::Coordinator::ExecutionInhibitionLock') GROUP BY class, owner_id ORDER BY class, owner_id;"
                   class                   |                  owner_id                  | count
-------------------------------------------+--------------------------------------------+-------
 Dynflow::Coordinator::ClientWorld         |                                            |     3
 Dynflow::Coordinator::DelayedExecutorLock | world:f318bd94-f5d5-425f-bbc7-9bef17d9f3cd |     1
 Dynflow::Coordinator::ExecutorWorld       |                                            |     1
(3 rows)

So in this case Postgres thinks it's: f318bd94-f5d5-425f-bbc7-9bef17d9f3cd
And redis also agrees:

[root@10-222-206-158 ~]# redis-cli -h $REDIS_HOST -p $REDIS_PORT -n $REDIS_DB GET dynflow_orchestrator_uuid
"f318bd94-f5d5-425f-bbc7-9bef17d9f3cd"

However, regardless of this, neither orchestrator thinks it's active:

[root@10-222-206-152 ~]# systemctl status dynflow-sidekiq@orchestrator.service
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq
     Loaded: loaded (/usr/lib/systemd/system/dynflow-sidekiq@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/dynflow-sidekiq@.service.d
             └─override.conf
     Active: active (running) since Mon 2025-11-10 20:17:40 UTC; 8min ago
       Docs: https://theforeman.org
   Main PID: 12779 (sidekiq)
     Status: "orchestrator in passive mode"
      Tasks: 10 (limit: 407761)
     Memory: 508.6M
        CPU: 5min 9.617s
     CGroup: /system.slice/system-dynflow\x2dsidekiq.slice/dynflow-sidekiq@orchestrator.service
             └─12779 /usr/bin/ruby /usr/bin/sidekiq -e production -r /usr/share/foreman/extras/dynflow-sidekiq.rb -C /etc/foreman/dynflow/orchestrator.yml

Nov 10 20:17:31 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Nov 10 20:17:31 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[12779]: 2025-11-10T20:17:31.246Z pid=12779 tid=ajf INFO: Enabling systemd notification integration
Nov 10 20:17:33 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[12779]: 2025-11-10T20:17:33.198Z pid=12779 tid=ajf INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Nov 10 20:17:33 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[12779]: 2025-11-10T20:17:33.201Z pid=12779 tid=ajf INFO: GitLab reliable fetch activated!
Nov 10 20:17:40 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.

[root@10-222-206-158 ~]# systemctl status dynflow-sidekiq@orchestrator.service
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq
     Loaded: loaded (/usr/lib/systemd/system/dynflow-sidekiq@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/dynflow-sidekiq@.service.d
             └─override.conf
     Active: active (running) since Mon 2025-11-10 20:22:59 UTC; 4min 7s ago
       Docs: https://theforeman.org
   Main PID: 3348261 (sidekiq)
     Status: "orchestrator in passive mode"
      Tasks: 9 (limit: 407762)
     Memory: 298.1M
        CPU: 9.287s
     CGroup: /system.slice/system-dynflow\x2dsidekiq.slice/dynflow-sidekiq@orchestrator.service
             └─3348261 /usr/bin/ruby /usr/bin/sidekiq -e production -r /usr/share/foreman/extras/dynflow-sidekiq.rb -C /etc/foreman/dynflow/orchestrator.yml

Nov 10 20:22:49 10-222-206-158.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Nov 10 20:22:49 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[3348261]: 2025-11-10T20:22:49.564Z pid=3348261 tid=1zqtx INFO: Enabling systemd notification integration
Nov 10 20:22:51 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[3348261]: 2025-11-10T20:22:51.600Z pid=3348261 tid=1zqtx INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc>
Nov 10 20:22:51 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[3348261]: 2025-11-10T20:22:51.602Z pid=3348261 tid=1zqtx INFO: GitLab reliable fetch activated!
Nov 10 20:22:59 10-222-206-158.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.

I'm really not sure what else to do here. I've tried and tried and tried everything I possibly know.

You can see the workers are handling their tasks, but the dynflow_orchestrator/coordinator records just keep growing (and thus no jobs complete):

[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -c "SELECT
(SELECT COUNT(*) FROM dynflow_coordinator_records) AS coordinator_records,
(SELECT COUNT(*) FROM dynflow_delayed_plans)       AS delayed_plans,
(SELECT COUNT(*) FROM foreman_tasks_tasks WHERE state='running') AS running_tasks;"
 coordinator_records | delayed_plans | running_tasks
---------------------+---------------+---------------
                3715 |             0 |         15142
(1 row)

[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -c "SELECT
(SELECT COUNT(*) FROM dynflow_coordinator_records) AS coordinator_records,
(SELECT COUNT(*) FROM dynflow_delayed_plans)       AS delayed_plans,
(SELECT COUNT(*) FROM foreman_tasks_tasks WHERE state='running') AS running_tasks;"
 coordinator_records | delayed_plans | running_tasks
---------------------+---------------+---------------
                3718 |             0 |         15145
(1 row)

Any ideas?

The passive/active state is based purely on the entry in Redis. The orchestrators try to acquire the lock with a TTL of 60 seconds and refresh it every 15 seconds using a Lua script run inside Redis.
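In toy form, the scheme looks roughly like this (an in-memory stand-in, not Dynflow's actual code; the set-if-absent-with-TTL plus owner-checked refresh semantics are as described above):

```python
# Toy in-memory model of the Redis-based orchestrator lock described
# above (not Dynflow's real implementation). store maps key -> (owner, expires_at).
store = {}

def set_nx_ex(key, value, ttl, now):
    """SET key value EX ttl NX: acquire only if unset or expired."""
    entry = store.get(key)
    if entry and entry[1] > now:
        return False            # another orchestrator holds the lock
    store[key] = (value, now + ttl)
    return True

def refresh(key, value, ttl, now):
    """Lua-script analogue: extend the TTL only if we still own the key."""
    entry = store.get(key)
    if entry and entry[1] > now and entry[0] == value:
        store[key] = (value, now + ttl)
        return True
    return False

# A acquires and stays active by refreshing; B stays passive meanwhile.
assert set_nx_ex("dynflow_orchestrator_uuid", "uuid-A", 60, now=0)
assert not set_nx_ex("dynflow_orchestrator_uuid", "uuid-B", 60, now=15)
assert refresh("dynflow_orchestrator_uuid", "uuid-A", 60, now=15)
# If A stops refreshing, the key expires and B can take over.
assert set_nx_ex("dynflow_orchestrator_uuid", "uuid-B", 60, now=80)
```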

Chances are you already tried this, but I’d try to:

  1. Stop both orchestrators
  2. Wait two minutes
  3. Check that the key in redis is unset
    1. If it is still set, wipe it with a DEL command in redis
  4. Start one orchestrator
  5. Wait for it to catch up
  6. Start the second
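Those steps can be sketched as a small script (illustrative only; unit and key names are the ones from this thread, and the command runner is injected rather than shelling out directly):

```python
import time

# Sketch of the recovery steps above. `run` executes a command and
# returns its output; injecting it keeps this testable without touching
# real systemctl/redis-cli.
def recover(run, wait=time.sleep):
    run(["systemctl", "stop", "dynflow-sidekiq@orchestrator"])   # 1. on both servers
    wait(120)                                                    # 2. let the 60 s TTL expire
    if run(["redis-cli", "GET", "dynflow_orchestrator_uuid"]).strip():
        run(["redis-cli", "DEL", "dynflow_orchestrator_uuid"])   # 3. wipe a stale lock
    run(["systemctl", "start", "dynflow-sidekiq@orchestrator"])  # 4. first orchestrator
    wait(60)                                                     # 5. let it catch up
    run(["systemctl", "start", "dynflow-sidekiq@orchestrator"])  # 6. then the second
```

In practice `run` could be something like `lambda cmd: subprocess.run(cmd, capture_output=True, text=True).stdout`, executed on the appropriate host for each step.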

Even if this works, it probably won't address the problem in the long run, but getting to the root cause will probably take some time. Do you have any notion of how you got into this situation?

Thanks for the reply @aruzicka - Was hoping to hear from you :slight_smile:

So after 12 painful hours yesterday trying to get this working, I gave up and went to bed. I came back this morning and everything was working. We have seen this with Dynflow many, many times: we just wait 5 minutes, 20 minutes, 2 hours, or some other random amount of time, and then things all of a sudden just work. So this raises the question: is there some sort of strange wait or delay? In this case, based on my logs, it looks like everything started working 16 minutes after my last flush/restart of all services.

Do you have any notion how you got into this situation?

Yep, I know exactly how we got there:

Nov 09 06:08:27 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1568914]: E, [2025-11-09T06:08:27.640064 #1568914] ERROR -- /connector-database-core: Receiving envelopes failed on PG::UnableToSend: no connection to the server

Nov 09 06:03:43 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1568914]: 2025-11-09T06:03:43.139Z pid=1568914 tid=xnaa WARN: ActiveRecord::StatementInvalid: PG::ConnectionBad: PQconsumeInput() SSL connection has been closed unexpectedly
Nov 09 06:03:43 10-222-206-158.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1568914]: 2025-11-09T06:03:43.139Z pid=1568914 tid=xnaa WARN: /usr/share/gems/gems/activerecord-7.0.8.7/lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `exec'

We had a network outage that lasted for about 4 mins - caused the connection from the foreman servers, to the postgres k8s cluster to go down.

Thanks again for the reply! Time for me to write up a simple service that reads the logs, looks for a certain state, and restarts the orchestrator/workers.
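A minimal sketch of such a watchdog (hypothetical; the error patterns come from the log lines above and the unit name from the systemctl output earlier, so adjust both for your environment):

```python
import re
import subprocess

# Hypothetical watchdog: scan recent orchestrator logs for the
# connection-loss errors seen above and restart the unit if they appear.
BAD_STATES = re.compile(
    r"PG::UnableToSend: no connection to the server"
    r"|PG::ConnectionBad: PQconsumeInput\(\)"
)

def needs_restart(log_lines):
    """True if any recent log line matches a known-bad state."""
    return any(BAD_STATES.search(line) for line in log_lines)

def check_and_restart(unit="dynflow-sidekiq@orchestrator"):
    out = subprocess.run(
        ["journalctl", "-u", unit, "--since", "5 minutes ago", "--no-pager"],
        capture_output=True, text=True,
    ).stdout
    if needs_restart(out.splitlines()):
        subprocess.run(["systemctl", "restart", unit], check=True)
```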

I can’t think of any explicit wait or delay that we intentionally have in the code that could lead to things hanging for random amounts of time. It feels like some timeout somewhere that we’re not aware of that gets triggered on loss of connection.

I was worried you were going to link that :slight_smile:

Did the connection to redis go down too or was it unaffected? I’m not sure when I’ll have the time, but I’d like to attempt to reproduce this in laboratory conditions.

Thread derailment warning: Only tangentially related
You have workers running on both servers, but only one of the two orchestrators should be active at a time. That is a limitation of dynflow that applies within the scope of a single redis database. In theory, you should be able to point the dynflow-sidekiq@* services of each of the machines to its own redis database (it could be within a single redis instance) and get both orchestrators active at the same time, with each handling that machine’s set of workers. There would be slight differences in how the things get distributed among the workers, but it might be beneficial in situations where a single orchestrator is a bottleneck. Would that be something you’d be interested in?

I am not completely certain whether the connection to Redis went down, but I would say there is a >95% chance that it did, because our Foreman, PostgreSQL, and Redis all live in the same environment. So I'm sure it did.

You have workers running on both servers, but only one of the two orchestrators should be active at a time. That is a limitation of dynflow that applies within the scope of a single redis database. In theory, you should be able to point the dynflow-sidekiq@* services of each of the machines to its own redis database (it could be within a single redis instance) and get both orchestrators active at the same time, with each handling that machine’s set of workers. There would be slight differences in how the things get distributed among the workers, but it might be beneficial in situations where a single orchestrator is a bottleneck. Would that be something you’d be interested in?

Correct. We have 3 workers running on each of the 2 respective foreman servers (under a load balancer).

I have read many of your past tickets/comments regarding wanting to allow multiple orchestrators!

It would be quite interesting and also quite easy for me to implement. I didn’t even think of the limitation being redis. We constantly (or more frequently than I’d like) have issues with the orchestrators. So this would be really helpful. I think I’ll give this a try very soon.

Well, we tried to add a new Redis server and point jobs-01 to redis-01 and jobs-02 to redis-02, but after 15 minutes we just could not get both to run in active mode. One always came up active and the other passive, no matter what we attempted. We also discovered that it breaks the Sidekiq dashboard, since it sits behind a load balancer and doesn't know where to get the data from.

@aruzicka - I am 4.5 hours into trying to get our orchestrators working correctly. No matter what I have tried, they both come up passive/passive. Even after waiting 45 minutes, they both still just sit in passive/passive.

Is there anything you could have me try, or anything you want to see - in order to determine what is happening?

We have this problem at least monthly, so I would really like to have a happy path to fixing it.
Thanks.

I ran this:

[root@10-222-206-152 ~]# journalctl -u dynflow-sidekiq@orchestrator --since "2026-01-02 17:50:00" --no-pager | egrep -i "error|fatal|exception|warn|redis|cannot|timeout|killed|oom|stack"
Jan 02 17:53:16 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[53313]: E, [2026-01-02T17:53:16.114516 #53313] ERROR -- /connector-database-core: Sending envelope failed on timeout: 5.0, elapsed: 5.012567403000503
Jan 02 17:53:55 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[62793]: 2026-01-02T17:53:55.842Z pid=62793 tid=1bsp INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
[root@10-222-206-152 ~]# date
Fri  2 Jan 18:05:12 UTC 2026
[root@10-222-206-152 ~]# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" GET dynflow_orchestrator_uuid
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" TTL dynflow_orchestrator_uuid
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" TYPE dynflow_orchestrator_uuid
"33cc157f-8424-4418-8dfd-74aa1a8df4a2"
(integer) 55
string
[root@10-222-206-152 ~]# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" GET dynflow_orchestrator_uuid
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" TTL dynflow_orchestrator_uuid
"33cc157f-8424-4418-8dfd-74aa1a8df4a2"
(integer) 52
[root@10-222-206-152 ~]# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" CLIENT LIST | head -n 60
id=686 addr=10.222.206.158:36090 laddr=10.222.172.152:6379 fd=122 name= age=15177 idle=13641 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=787 addr=10.222.206.152:52034 laddr=10.222.172.152:6379 fd=13 name= age=438 idle=435 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=759 addr=10.222.206.152:52350 laddr=10.222.172.152:6379 fd=11 name= age=9642 idle=4113 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=694 addr=10.222.206.158:42686 laddr=10.222.172.152:6379 fd=9 name= age=13770 idle=13770 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=691 addr=10.222.206.158:53980 laddr=10.222.172.152:6379 fd=127 name= age=14018 idle=5346 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=699 addr=10.222.206.158:58930 laddr=10.222.172.152:6379 fd=115 name= age=13074 idle=2235 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=771 addr=10.222.206.152:37868 laddr=10.222.172.152:6379 fd=63 name= age=6973 idle=5532 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=772 addr=10.222.206.152:56244 laddr=10.222.172.152:6379 fd=64 name= age=6669 idle=6370 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=758 addr=10.222.206.152:56822 laddr=10.222.172.152:6379 fd=10 name= age=9857 idle=9135 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=770 addr=10.222.206.152:39080 laddr=10.222.172.152:6379 fd=62 name= age=7632 idle=1653 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=690 addr=10.222.206.158:46502 laddr=10.222.172.152:6379 fd=126 name= age=14528 idle=132 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=680 addr=10.222.206.158:59590 laddr=10.222.172.152:6379 fd=116 name= age=15831 idle=5835 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=695 addr=10.222.206.158:50994 laddr=10.222.172.152:6379 fd=66 name= age=13708 idle=5211 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=778 addr=10.222.206.152:39432 laddr=10.222.172.152:6379 fd=12 name= age=4506 idle=3 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=766 addr=10.222.206.152:55570 laddr=10.222.172.152:6379 fd=61 name= age=9293 idle=9293 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=684 addr=10.222.206.158:58346 laddr=10.222.172.152:6379 fd=120 name= age=15236 idle=12954 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=685 addr=10.222.206.158:38842 laddr=10.222.172.152:6379 fd=121 name= age=15210 idle=4922 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=687 addr=10.222.206.158:42176 laddr=10.222.172.152:6379 fd=123 name= age=15021 idle=3039 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=769 addr=10.222.206.152:59572 laddr=10.222.172.152:6379 fd=14 name= age=8235 idle=7615 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=786 addr=10.222.206.152:58686 laddr=10.222.172.152:6379 fd=15 name= age=715 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61464 events=r cmd=lpush user=default redir=-1
id=710 addr=10.222.206.158:56044 laddr=10.222.172.152:6379 fd=8 name= age=10928 idle=3732 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=688 addr=10.222.206.158:46612 laddr=10.222.172.152:6379 fd=124 name= age=14933 idle=7272 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=llen user=default redir=-1
id=793 addr=10.222.206.152:48472 laddr=10.222.172.152:6379 fd=16 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=26 qbuf-free=40928 argv-mem=10 obl=0 oll=0 omem=0 tot-mem=61466 events=r cmd=client user=default redir=-1
[root@10-222-206-152 ~]# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO replication | egrep -i "role|master_host|master_link_status|slave_read_only"
role:master
[root@10-222-206-152 ~]# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" SET dynflow_write_test "ok" EX 15
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" GET dynflow_write_test
OK
"ok"

1767377202.468995 [0 10.222.206.152:58686] "eval" "if redis.call(\"get\", KEYS[1]) == ARGV[1] then\n  redis.call(\"del\", KEYS[1])\nend\nreturn 0\n" "1" "dynflow_orchestrator_uuid" "33cc157f-8424-4418-8dfd-74aa1a8df4a2"
1767377202.469043 [0 lua] "get" "dynflow_orchestrator_uuid"
1767377202.469053 [0 lua] "del" "dynflow_orchestrator_uuid"
1767377203.878029 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377203.879127 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377203.879420 [0 10.222.206.152:39432] "llen" "queue:default"
1767377203.879448 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
[... the same lrange/sscan/llen polling cycle repeats every ~3 s, interleaved with "set" "reliable-fetcher-heartbeat-..." "1" "EX" "60" refreshes ...]
1767377248.876518 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
^C
[root@10-222-206-152 ~]# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" MONITOR | egrep -i "dynflow|orchestrator|heartbeat|reliable|sidekiq|queue"
1767377254.879659 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
[... polling and heartbeat refreshes continue ...]
1767377273.227006 [0 10.222.206.152:34138] "set" "dynflow_orchestrator_uuid" "350fbd40-302d-4ca1-8172-d2518fdec7c1" "EX" "60" "NX"
1767377273.230504 [0 10.222.206.152:34138] "sadd" "queues" "default"
1767377273.230516 [0 10.222.206.152:34138] "lpush" "queue:default" "{\"retry\":false,\"queue\":\"default\",\"backtrace\":true,\"args\":[\"350fbd40-302d-4ca1-8172-d2518fdec7c1\"],\"class\":\"Dynflow::Executors::Sidekiq::WorkerJobs::DrainMarker\",\"jid\":\"afcf441770e399c887ff6006\",\"created_at\":1767377273.2301507,\"enqueued_at\":1767377273.2301984}"

But the orchestrator service is still in passive mode:

1767377202.468995 [0 10.222.206.152:58686] "eval" "if redis.call(\"get\", KEYS[1]) == ARGV[1] then\n  redis.call(\"del\", KEYS[1])\nend\nreturn 0\n" "1" "dynflow_orchestrator_uuid" "33cc157f-8424-4418-8dfd-74aa1a8df4a2"
1767377202.469043 [0 lua] "get" "dynflow_orchestrator_uuid"
1767377202.469053 [0 lua] "del" "dynflow_orchestrator_uuid"
1767377203.878029 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377203.879127 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377203.879420 [0 10.222.206.152:39432] "llen" "queue:default"
1767377203.879448 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377206.882698 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377206.883464 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377206.883685 [0 10.222.206.152:39432] "llen" "queue:default"
1767377206.883714 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377209.877187 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377209.878034 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377209.878240 [0 10.222.206.152:39432] "llen" "queue:default"
1767377209.878243 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377212.889024 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377212.889790 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377212.890046 [0 10.222.206.152:39432] "llen" "queue:default"
1767377212.890054 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377215.883026 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377215.884204 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377215.884472 [0 10.222.206.152:39432] "llen" "queue:default"
1767377215.884502 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377216.212747 [0 10.222.206.152:58686] "set" "reliable-fetcher-heartbeat-10-222-206-152.ssnc-corp.cloud-62793-67b4131099c2" "1" "EX" "60"
1767377218.881517 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377218.882599 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377218.882921 [0 10.222.206.152:39432] "llen" "queue:default"
1767377218.882925 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377221.880981 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377221.881923 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377221.882249 [0 10.222.206.152:39432] "llen" "queue:default"
1767377221.882252 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377224.890344 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377224.891187 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377224.891467 [0 10.222.206.152:39432] "llen" "queue:default"
1767377224.891471 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377227.917320 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377227.918206 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377227.918463 [0 10.222.206.152:39432] "llen" "queue:default"
1767377227.918466 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377230.880414 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377230.881265 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377230.881539 [0 10.222.206.152:39432] "llen" "queue:default"
1767377230.881543 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377233.887626 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377233.888532 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377233.888811 [0 10.222.206.152:39432] "llen" "queue:default"
1767377233.888815 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377236.266467 [0 10.222.206.152:58686] "set" "reliable-fetcher-heartbeat-10-222-206-152.ssnc-corp.cloud-62793-67b4131099c2" "1" "EX" "60"
1767377236.887216 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377236.888086 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377236.888363 [0 10.222.206.152:39432] "llen" "queue:default"
1767377236.888366 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377239.881115 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377239.882191 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377239.882468 [0 10.222.206.152:39432] "llen" "queue:default"
1767377239.882473 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377242.883614 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377242.884418 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377242.884644 [0 10.222.206.152:39432] "llen" "queue:default"
1767377242.884647 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377245.882817 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377245.883933 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377245.884222 [0 10.222.206.152:39432] "llen" "queue:default"
1767377245.884226 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377248.875533 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377248.876262 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377248.876515 [0 10.222.206.152:39432] "llen" "queue:default"
1767377248.876518 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
^C
[root@10-222-206-152 ~]# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" MONITOR | egrep -i "dynflow|orchestrator|heartbeat|reliable|sidekiq|queue"
1767377254.879659 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377254.880668 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377254.880945 [0 10.222.206.152:39432] "llen" "queue:default"
1767377254.880952 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377256.437387 [0 10.222.206.152:58686] "set" "reliable-fetcher-heartbeat-10-222-206-152.ssnc-corp.cloud-62793-67b4131099c2" "1" "EX" "60"
1767377257.887696 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377257.888755 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377257.889023 [0 10.222.206.152:39432] "llen" "queue:default"
1767377257.889027 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377260.888536 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377260.889400 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377260.889673 [0 10.222.206.152:39432] "llen" "queue:default"
1767377260.889676 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377263.883317 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377263.884503 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377263.884784 [0 10.222.206.152:39432] "llen" "queue:default"
1767377263.884787 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377265.491454 [0 10.222.206.152:36768] "set" "reliable-fetcher-heartbeat-10-222-206-152.ssnc-corp.cloud-71270-60cd814ad434" "1" "EX" "60"
1767377266.882864 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377266.883919 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377266.884189 [0 10.222.206.152:39432] "llen" "queue:default"
1767377266.884192 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377269.876240 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377269.877273 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377269.877578 [0 10.222.206.152:39432] "llen" "queue:default"
1767377269.877582 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377272.892004 [0 10.222.206.152:39432] "lrange" "queue:default" "-1" "-1"
1767377272.892829 [0 10.222.206.152:39432] "sscan" "queues" "0"
1767377272.893135 [0 10.222.206.152:39432] "llen" "queue:default"
1767377272.893140 [0 10.222.206.152:39432] "llen" "queue:dynflow_orchestrator"
1767377273.227006 [0 10.222.206.152:34138] "set" "dynflow_orchestrator_uuid" "350fbd40-302d-4ca1-8172-d2518fdec7c1" "EX" "60" "NX"
1767377273.230504 [0 10.222.206.152:34138] "sadd" "queues" "default"
1767377273.230516 [0 10.222.206.152:34138] "lpush" "queue:default" "{\"retry\":false,\"queue\":\"default\",\"backtrace\":true,\"args\":[\"350fbd40-302d-4ca1-8172-d2518fdec7c1\"],\"class\":\"Dynflow::Executors::Sidekiq::WorkerJobs::DrainMarker\",\"jid\":\"afcf441770e399c887ff6006\",\"created_at\":1767377273.2301507,\"enqueued_at\":1767377273.2301984}"

After two more hours of attempts, still no luck, but I'm trying to put as much info here as I can.

[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -X -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -v ON_ERROR_STOP=1 -c "
WITH blocked AS (
  SELECT pid, unnest(pg_blocking_pids(pid)) AS blocking_pid
  FROM pg_stat_activity
  WHERE datname = current_database()
)
SELECT
  b.pid AS blocked_pid,
  ba.usename AS blocked_user,
  now() - ba.query_start AS blocked_age,
  left(ba.query, 120) AS blocked_query,
  b.blocking_pid,
  bb.usename AS blocking_user,
  now() - bb.query_start AS blocking_age,
  left(bb.query, 120) AS blocking_query
FROM blocked b
JOIN pg_stat_activity ba ON ba.pid = b.pid
JOIN pg_stat_activity bb ON bb.pid = b.blocking_pid
ORDER BY blocked_age DESC;
"
 blocked_pid | blocked_user | blocked_age | blocked_query | blocking_pid | blocking_user | blocking_age | blocking_query
-------------+--------------+-------------+---------------+--------------+---------------+--------------+----------------
(0 rows)

[root@10-222-206-152 ~]# ^C
[root@10-222-206-152 ~]# systemctl start dynflow-sidekiq@worker-*
Warning: systemctl start called with a glob pattern.
Hint: unit globs expand to loaded units, so start will usually have no effect.
      Passing --all will also load units which are pulled in by other units.
      See systemctl(1) for more details.
[root@10-222-206-152 ~]# systemctl start dynflow-sidekiq@worker-{1,2,3}
[root@10-222-206-152 ~]# systemctl stop dynflow-sidekiq@orchestrator.service
systemctl stop 'dynflow-sidekiq@worker*' 2>/dev/null || true
pgrep -a sidekiq || echo "No sidekiq processes found"
No sidekiq processes found
[root@10-222-206-152 ~]# export PG_LEADER_UUID="$(
PGPASSWORD="$PGPASS" psql -X -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -v ON_ERROR_STOP=1 -t -A -c "
SELECT regexp_replace(owner_id, '^world:', '')
FROM dynflow_coordinator_records
WHERE class = 'Dynflow::Coordinator::DelayedExecutorLock'
LIMIT 1;
" | tr -d '\r'
)"
echo "PG_LEADER_UUID=$PG_LEADER_UUID"
PG_LEADER_UUID=33cc157f-8424-4418-8dfd-74aa1a8df4a2
[root@10-222-206-152 ~]# export REDIS_LEADER_UUID="$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" GET dynflow_orchestrator_uuid | tr -d '\"\r')"
echo "REDIS_LEADER_UUID=$REDIS_LEADER_UUID"
REDIS_LEADER_UUID=
[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -X -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -v ON_ERROR_STOP=1 -x -c "
SELECT class, id, owner_id
FROM dynflow_coordinator_records
WHERE class = 'Dynflow::Coordinator::DelayedExecutorLock';
"
-[ RECORD 1 ]----------------------------------------
class    | Dynflow::Coordinator::DelayedExecutorLock
id       | delayed-executor
owner_id | world:33cc157f-8424-4418-8dfd-74aa1a8df4a2

[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -X -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -v ON_ERROR_STOP=1 <<SQL
BEGIN;

DELETE FROM dynflow_coordinator_records
WHERE class = 'Dynflow::Coordinator::DelayedExecutorLock'
  AND owner_id <> 'world:${REDIS_LEADER_UUID}';

COMMIT;
SQL
BEGIN
DELETE 1
COMMIT
[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -X -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -v ON_ERROR_STOP=1 -x -c "
SELECT class, id, owner_id
FROM dynflow_coordinator_records
WHERE class IN (
  'Dynflow::Coordinator::ExecutorWorld',
  'Dynflow::Coordinator::ClientWorld'
)
ORDER BY class, id;
"
-[ RECORD 1 ]----------------------------------
class    | Dynflow::Coordinator::ExecutorWorld
id       | 33cc157f-8424-4418-8dfd-74aa1a8df4a2
owner_id |
-[ RECORD 2 ]----------------------------------
class    | Dynflow::Coordinator::ExecutorWorld
id       | 350fbd40-302d-4ca1-8172-d2518fdec7c1
owner_id |

[root@10-222-206-152 ~]# STALE_WORLD_UUID="$PG_LEADER_UUID"
echo "STALE_WORLD_UUID=$STALE_WORLD_UUID"

PGPASSWORD="$PGPASS" psql -X -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -v ON_ERROR_STOP=1 -c "
DELETE FROM dynflow_coordinator_records
WHERE class IN (
  'Dynflow::Coordinator::ExecutorWorld',
  'Dynflow::Coordinator::ClientWorld'
)
AND id::text = '${STALE_WORLD_UUID}';
"
STALE_WORLD_UUID=33cc157f-8424-4418-8dfd-74aa1a8df4a2
DELETE 1
[root@10-222-206-152 ~]# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" DEL dynflow_orchestrator_uuid
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -n "$REDIS_DB" TTL dynflow_orchestrator_uuid
(integer) 0
(integer) -2
[root@10-222-206-152 ~]# systemctl start dynflow-sidekiq@orchestrator.service
journalctl -u dynflow-sidekiq@orchestrator -n 80 --no-pager
Jan 02 15:18:41 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 15:18:46 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1002]: 2026-01-02T15:18:46.737Z pid=1002 tid=1ga INFO: Enabling systemd notification integration
Jan 02 15:18:57 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1002]: 2026-01-02T15:18:57.804Z pid=1002 tid=1ga INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Jan 02 15:18:57 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1002]: 2026-01-02T15:18:57.807Z pid=1002 tid=1ga INFO: GitLab reliable fetch activated!
Jan 02 15:19:22 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 15:32:23 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopping Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 15:32:58 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[1002]: E, [2026-01-02T15:32:58.403069 #1002] ERROR -- /connector-database-core: Receiving envelopes failed on timeout: 5.0, elapsed: 5.008903511999961
Jan 02 15:33:24 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Deactivated successfully.
Jan 02 15:33:24 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopped Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 15:33:24 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Consumed 8min 28.857s CPU time, 5.9G memory peak.
Jan 02 16:18:25 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 16:18:25 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[32344]: 2026-01-02T16:18:25.691Z pid=32344 tid=oa0 INFO: Enabling systemd notification integration
Jan 02 16:18:27 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[32344]: 2026-01-02T16:18:27.615Z pid=32344 tid=oa0 INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Jan 02 16:18:27 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[32344]: 2026-01-02T16:18:27.617Z pid=32344 tid=oa0 INFO: GitLab reliable fetch activated!
Jan 02 16:18:35 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 16:49:48 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopping Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 16:49:55 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[32344]: I, [2026-01-02T16:49:55.442973 #32344]  INFO -- /default_dead_letter_handler: got dead letter #<Concurrent::Actor::Envelope:2501600> @message=:check_delayed_plans, @sender=#<Dynflow::ClockReference:0x00007b3bee913818 /clock (Dynflow::Clock)>, @address=#<Concurrent::Actor::Reference:0x00007b3bee7dcad0 /delayed-executor (Dynflow::DelayedExecutors::PollingCore)>>
Jan 02 16:50:50 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Deactivated successfully.
Jan 02 16:50:50 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopped Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 16:50:50 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Consumed 23min 12.752s CPU time, 4.9G memory peak.
Jan 02 17:38:36 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 17:38:37 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[53313]: 2026-01-02T17:38:37.023Z pid=53313 tid=1629 INFO: Enabling systemd notification integration
Jan 02 17:38:38 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[53313]: 2026-01-02T17:38:38.966Z pid=53313 tid=1629 INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Jan 02 17:38:38 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[53313]: 2026-01-02T17:38:38.968Z pid=53313 tid=1629 INFO: GitLab reliable fetch activated!
Jan 02 17:38:46 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 17:52:52 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopping Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 17:53:16 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[53313]: E, [2026-01-02T17:53:16.114516 #53313] ERROR -- /connector-database-core: Sending envelope failed on timeout: 5.0, elapsed: 5.012567403000503
Jan 02 17:53:53 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Deactivated successfully.
Jan 02 17:53:53 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopped Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 17:53:53 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Consumed 9min 26.981s CPU time, 6.3G memory peak.
Jan 02 17:53:53 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 17:53:53 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[62793]: 2026-01-02T17:53:53.946Z pid=62793 tid=1bsp INFO: Enabling systemd notification integration
Jan 02 17:53:55 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[62793]: 2026-01-02T17:53:55.842Z pid=62793 tid=1bsp INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Jan 02 17:53:55 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[62793]: 2026-01-02T17:53:55.844Z pid=62793 tid=1bsp INFO: GitLab reliable fetch activated!
Jan 02 17:54:03 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 18:06:42 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopping Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 18:07:11 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[62793]: E, [2026-01-02T18:07:11.776948 #62793] ERROR -- /connector-database-core: Receiving envelopes failed on timeout: 5.0, elapsed: 6.4335142399995675
Jan 02 18:07:19 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[62793]: E, [2026-01-02T18:07:19.099914 #62793] ERROR -- /connector-database-core: Sending envelope failed on timeout: 5.0, elapsed: 6.561982693998289
Jan 02 18:07:30 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[62793]: E, [2026-01-02T18:07:30.670137 #62793] ERROR -- /connector-database-core: Sending envelope failed on timeout: 5.0, elapsed: 6.3181936139990285
Jan 02 18:07:38 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[62793]: E, [2026-01-02T18:07:38.377016 #62793] ERROR -- /connector-database-core: Sending envelope failed on timeout: 5.0, elapsed: 5.577250770000319
Jan 02 18:07:43 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Deactivated successfully.
Jan 02 18:07:43 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopped Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 18:07:43 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Consumed 8min 34.410s CPU time, 5.7G memory peak.
Jan 02 18:07:43 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 18:07:43 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[71270]: 2026-01-02T18:07:43.598Z pid=71270 tid=1iae INFO: Enabling systemd notification integration
Jan 02 18:07:45 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[71270]: 2026-01-02T18:07:45.487Z pid=71270 tid=1iae INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Jan 02 18:07:45 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[71270]: 2026-01-02T18:07:45.489Z pid=71270 tid=1iae INFO: GitLab reliable fetch activated!
Jan 02 18:07:53 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 18:29:09 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopping Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 18:30:13 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Deactivated successfully.
Jan 02 18:30:13 10-222-206-152.ssnc-corp.cloud systemd[1]: Stopped Foreman jobs daemon - orchestrator on sidekiq.
Jan 02 18:30:13 10-222-206-152.ssnc-corp.cloud systemd[1]: dynflow-sidekiq@orchestrator.service: Consumed 13min 50.754s CPU time, 12.8G memory peak.
Jan 02 18:52:18 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 18:52:19 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[94520]: 2026-01-02T18:52:19.070Z pid=94520 tid=21s8 INFO: Enabling systemd notification integration
Jan 02 18:52:21 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[94520]: 2026-01-02T18:52:21.005Z pid=94520 tid=21s8 INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Jan 02 18:52:21 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[94520]: 2026-01-02T18:52:21.007Z pid=94520 tid=21s8 INFO: GitLab reliable fetch activated!
Jan 02 18:52:28 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.
[root@10-222-206-152 ~]# systemctl status dynflow-sidekiq@orchestrator.service -l --no-pager
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq
     Loaded: loaded (/usr/lib/systemd/system/dynflow-sidekiq@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/dynflow-sidekiq@.service.d
             └─override.conf
     Active: active (running) since Fri 2026-01-02 18:52:28 UTC; 8min ago
       Docs: https://theforeman.org
   Main PID: 94520 (sidekiq)
     Status: "orchestrator in passive mode"
      Tasks: 11 (limit: 407756)
     Memory: 483.3M (peak: 484.1M)
        CPU: 4min 51.560s
     CGroup: /system.slice/system-dynflow\x2dsidekiq.slice/dynflow-sidekiq@orchestrator.service
             └─94520 /usr/bin/ruby /usr/bin/sidekiq -e production -r /usr/share/foreman/extras/dynflow-sidekiq.rb -C /etc/foreman/dynflow/orchestrator.yml

Jan 02 18:52:18 10-222-206-152.ssnc-corp.cloud systemd[1]: Starting Foreman jobs daemon - orchestrator on sidekiq...
Jan 02 18:52:19 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[94520]: 2026-01-02T18:52:19.070Z pid=94520 tid=21s8 INFO: Enabling systemd notification integration
Jan 02 18:52:21 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[94520]: 2026-01-02T18:52:21.005Z pid=94520 tid=21s8 INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://10-222-172-152.ssnc-corp.cloud:6379/0"}
Jan 02 18:52:21 10-222-206-152.ssnc-corp.cloud dynflow-sidekiq@orchestrator[94520]: 2026-01-02T18:52:21.007Z pid=94520 tid=21s8 INFO: GitLab reliable fetch activated!
Jan 02 18:52:28 10-222-206-152.ssnc-corp.cloud systemd[1]: Started Foreman jobs daemon - orchestrator on sidekiq.
[root@10-222-206-152 ~]# ^C
[root@10-222-206-152 ~]# PGPASSWORD="$PGPASS" psql -X -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -v ON_ERROR_STOP=1 -x -c "
SELECT class, owner_id
FROM dynflow_coordinator_records
WHERE class = 'Dynflow::Coordinator::DelayedExecutorLock';
"
-[ RECORD 1 ]----------------------------------------
class    | Dynflow::Coordinator::DelayedExecutorLock
owner_id | world:20c07487-e7d0-4329-9591-080f2345ae86


I get a new ExecutorLock each time, but the orchestrator service itself never comes out of passive mode. Right now I am trying to simplify things by completely disabling one of the two servers, so that just the remaining one comes back as active, but I cannot get even that to happen.

I wanted to follow up here.
So after 9-10 hours of trying to get one of the two servers into an active state, I gave up. I restarted them both and went to bed. When I woke up the next morning and checked both services, one was Active. Looking at the logs, roughly 7 hours after I restarted them and went to bed, one of the two orchestrators finally went back into active mode.

Do you have any idea why this is taking so long? I can see the get/set/del from what I think is a Lua script (within Redis), but the issue seems to be that the systemd service never actually changes to Active. I have not looked into this process yet, but I'm guessing there's something I'm missing or not understanding.

Thanks.

That is expected. Dynflow needs to place a lock to be able to communicate with other processes (worlds in dynflow lingo), so this lock has to be placed fairly early in the process.

If the get/set/del gets called from a Lua script, that means one of the processes has actually acquired the lock in Redis and is trying to refresh it (using exists, get and set) or release it (using get and del); the point where it "gets stuck" is actually further down the road.
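The acquire/refresh/release cycle can be sketched in a few lines. Below is a hypothetical Python simulation of the pattern visible in the MONITOR output above (SET key uuid EX 60 NX to acquire, and a compare-and-delete release mirroring the Lua script), using an in-memory dict in place of Redis; the class and method names are made up for illustration:

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for the single key Dynflow uses for leader election."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl):
        # SET key value EX ttl NX: only succeeds if the key is absent or expired
        now = time.monotonic()
        current = self.store.get(key)
        if current and current[1] > now:
            return False
        self.store[key] = (value, now + ttl)
        return True

    def refresh(self, key, value, ttl):
        # Heartbeat: extend the TTL only if we still hold the lock
        now = time.monotonic()
        current = self.store.get(key)
        if current and current[0] == value and current[1] > now:
            self.store[key] = (value, now + ttl)
            return True
        return False

    def release(self, key, value):
        # Compare-and-delete, like the Lua script:
        # if GET key == our uuid then DEL key
        current = self.store.get(key)
        if current and current[0] == value:
            del self.store[key]
            return True
        return False

r = FakeRedis()
me, rival = str(uuid.uuid4()), str(uuid.uuid4())

assert r.set_nx_ex("dynflow_orchestrator_uuid", me, 60)        # I become leader
assert not r.set_nx_ex("dynflow_orchestrator_uuid", rival, 60) # rival stays passive
assert r.refresh("dynflow_orchestrator_uuid", me, 60)          # heartbeat renews TTL
assert not r.release("dynflow_orchestrator_uuid", rival)       # rival cannot release
assert r.release("dynflow_orchestrator_uuid", me)              # clean shutdown frees it
assert r.set_nx_ex("dynflow_orchestrator_uuid", rival, 60)     # now rival can take over
```

Holding the key is only step one; the process then still has to pass further checks before it reports active.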

Once the lock in Redis is acquired, there are two things that need to happen before the orchestrator flips fully into active mode:

  • It sends a message through the default queue and expects any worker to reply to it. If the queue is already filled with other things, this may take some time.
  • It performs world validity checks (it goes over the records in the dynflow_coordinator_records table) and tries to verify that all the processes listed there are still alive; again, this might take some time.

These two take some time, but I’ve never seen it take hours before.

Let’s go back to the basics: could you please paste the configurations of all the orchestrators and workers from all the machines?

I think this is the answer. There were over 350,000 items in the worker queue. I think this is why, in the past when I've had this issue, one of the orchestrators became active fairly quickly if I completely blew away all the queues. But this time I wanted to save everything in the queue. So I bet it took almost 7 hours to get through all 350,000 items plus any new ones that kept coming in (probably another 200,000).
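That timeline is at least plausible arithmetically. A back-of-envelope check, assuming the backlog figures above:

```python
# ~350,000 queued items plus ~200,000 new arrivals while draining, over ~7 hours
backlog = 350_000 + 200_000
hours = 7
rate = backlog / (hours * 3600)   # items/second needed to finish in ~7 h
print(f"{rate:.1f} items/s")      # prints: 21.8 items/s
```

About 22 items per second across all workers is a believable sustained drain rate, so a 7-hour wait before the reply message ever reaches the front of the queue is consistent with the numbers.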

That being said, is there a way to give that message higher priority? Could the message be sent to a different queue, with a worker that only listens to that queue? Of course, the root cause of all of this is brief, short-lived disconnects or issues talking to Postgres; that is what I always see in the dynflow log causing the orchestrator to flip out of active mode. But I'm wondering if there is something more I can do to keep this from happening every other week.

Thanks for the reply, as always!

In retrospect, adding this whole mechanism wasn't the wisest decision, and I'll think about reworking it or getting rid of it altogether.

What you’re suggesting could be done in code, but sadly not by just a configuration change.

Just before the holidays I released dynflow-2.0.0, which tries to make it at least a little more resilient to brief Postgres and Redis connection drops, so maybe that could help, but no promises.


@aruzicka - We are in another situation where both orchestrators are stuck in passive mode. We've been working on it for around 4+ hours now and still, no matter what we attempt, cannot get them to move to active. We have even gone as far as clearing out every single item in the queue. We've cleared out Postgres, we've flushed Redis, and the queues are all at 0. We start the orchestrator and it just sits in passive mode.

Is there anything we can do to force it into active? Even something low level, an API call of some sort, foreman-rake, anything? It's getting very aggravating that this occurs every couple of weeks and takes us hours or days to resolve.

Appreciate anything you can point us towards.

On a side note, I think I misunderstood you when you said we could run in an active/active state if we had 2 Redis servers. Since we are in HA with a single external PostgreSQL DB, it would appear that is not possible, since the DelayedExecutorLock in the dynflow_coordinator_records table is the master lock that controls which orchestrator processes delayed/scheduled tasks.

Below is what ended up making it work. AI summary obviously (thanks Claude!!)


Nuclear Option - What Fixed It

The Problem:

  • Both orchestrators stuck in passive mode

  • 81,926 accumulated tasks (mostly Salt report imports)

  • 8,443 stale coordinator records blocking world validity checks

  • Phantom world locks preventing leadership acquisition

The Solution: Complete State Reset

Commands That Fixed It


# 1. STOP ALL SERVICES ON UI-01 and UI-02
systemctl stop dynflow-sidekiq@orchestrator.service
systemctl stop dynflow-sidekiq@worker-{1,2,3}.service

# 2. NUCLEAR DATABASE CLEANUP - Delete everything
PGPASSWORD="$PGPASS" psql -X -h "$PGHOST" -U "$PGUSER" -d "$PGDB" -v ON_ERROR_STOP=1 << 'EOF'
TRUNCATE TABLE foreman_tasks_tasks CASCADE;
DELETE FROM dynflow_coordinator_records;
DELETE FROM dynflow_delayed_plans;
TRUNCATE TABLE dynflow_steps CASCADE;
TRUNCATE TABLE dynflow_actions CASCADE;
TRUNCATE TABLE dynflow_execution_plans CASCADE;
EOF

# 3. NUCLEAR REDIS CLEANUP - Flush everything
redis-cli -h foreman-redisjobs-01.ssnc-corp.cloud -p 6379 FLUSHALL

# 4. START ORCHESTRATOR - Clean slate allows proper initialization
foreman-maintain service restart

# 5. VERIFY SUCCESS
systemctl status dynflow-sidekiq@orchestrator.service | grep "Status:"
# Result: "orchestrator in active mode" ✓

Why It Worked:

  • Removed all stale coordinator records that were blocking world validity checks

  • Cleared phantom execution locks (there were 8,000+)

  • Eliminated conflicting state from crashed/old orchestrators

  • Allowed orchestrator to start with clean state and properly claim leadership

Result:

  • Orchestrator 1: ACTIVE ✓

  • Orchestrator 2: PASSIVE ✓ (correct for HA)

  • Tasks processing normally ✓

No, the point is that it can't go to active until all the checks are done and the whole system is in a reasonable and known state. Providing a way to bypass that would open the door to unpredictable behaviour.

That’s one way of going about it.

Some entities in Foreman rely on this table, so you might begin to see some weird stuff here and there.

If you had remote execution jobs scheduled for the future or for periodic runs, or if you had any sync plans (if you have Katello), those will now be borked.

They’re not “blocking” validity checks per se. Validity checks take a long time because all of these records need to be processed and reconciled to get the whole system into a sane state.

On a side note, if you have a backlog of over 8k execution plans, it seems the system isn’t able to keep up with the load and problems snowball from there.

Yes, thank you for that explanation. We always have to set up our recurring tasks and future jobs again when this happens.

On a side note, if you have a backlog of over 8k execution plans, it seems the system isn’t able to keep up with the load and problems snowball from there.

I've thought about this as well. However, when the system is up and running, everything is fine. We currently process around 1,200,000 jobs a day, and we never have a queue. The only time we get into this state is when the orchestrators get all out of whack. I still don't really know how to fix it; I basically throw a ton of commands at it until it works, or just give up, and hours later it fixes itself.

These are our current settings; is there anything you can see that we could tweak?

[root@10-222-206-152 dynflow]# ll
total 12
-rw-r--r--. 1 root foreman 237 Jan 23 17:15 orchestrator.yml
-rw-r--r--. 1 root root     54 Jan 23 17:15 orchestrator.yml.orig
lrwxrwxrwx. 1 root root     10 Feb 27  2025 worker-1.yml -> worker.yml
lrwxrwxrwx. 1 root root     10 Feb 27  2025 worker-2.yml -> worker.yml
lrwxrwxrwx. 1 root root     10 Feb 27  2025 worker-3.yml -> worker.yml
-rw-r--r--. 1 root root     59 Feb 27  2025 worker.yml
[root@10-222-206-152 dynflow]# cat orchestrator.yml
---
:concurrency: 1
:queues:
  - dynflow_orchestrator
# Redis connection
:redis_url: 'redis://10-222-172-152.ssnc-corp.cloud:6379/0'
# Pool sizes
:pool_size: 20
# Delayed executor flag
:delayed_executor: true

[root@10-222-206-152 dynflow]# cat worker.yml
:concurrency: 15
:queues:
  - default
  - remote_execution

[root@10-222-206-152 dynflow]# lscpu | grep CPU
CPU op-mode(s):                          32-bit, 64-bit
CPU(s):                                  8
On-line CPU(s) list:                     0-7
Model name:                              Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
BIOS Model name:                         Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU family:                              6
NUMA node0 CPU(s):                       0-7
Vulnerability Mds:                       Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Mmio stale data:           Mitigation; Clear CPU buffers; SMT Host state unknown

[root@10-222-206-152 dynflow]# free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        10Gi        47Gi        90Mi       4.1Gi        51Gi
Swap:          2.0Gi          0B       2.0Gi

We can certainly bump up CPU/RAM if needed, but honestly all our numbers look good. CPU/RAM rarely spikes.
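For what it's worth, the settings above suggest steady-state capacity is not the bottleneck. A rough calculation, assuming both servers run the same three worker processes at the configured concurrency of 15 and the ~1,200,000 jobs/day figure mentioned earlier:

```python
# Rough capacity check from the configs above (assumptions: 2 servers,
# 3 worker processes each, worker concurrency 15, ~1,200,000 jobs/day)
servers, workers_per_server, concurrency = 2, 3, 15
jobs_per_day = 1_200_000

threads = servers * workers_per_server * concurrency  # total worker threads
jobs_per_sec = jobs_per_day / 86_400                  # average arrival rate
budget = threads / jobs_per_sec                       # seconds per job, per thread
print(f"{threads} threads, {jobs_per_sec:.1f} jobs/s, ~{budget:.1f}s per job")
# prints: 90 threads, 13.9 jobs/s, ~6.5s per job
```

Roughly 6.5 seconds of thread time per job is comfortable headroom at steady state, which is consistent with the observation that everything is fine while the system is up and the queues only snowball after an orchestrator flap.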

Thanks again for all your replies; it means a lot to those of us who aren't as knowledgeable.