How did the deployment look like? Which services were running on the nodes? What is the desired state? Do you want to be able to do failover from one node to another or do you want an active-active kind of setup?
Before we moved to the sidekiq based implementation, we had our own homegrown executor. An executor contained a part which decided what and when should be executed, workers which executed the jobs and a mechanism using which the deciding part could talk to the workers.
The homegrown executor was a single process containing everything. After the move, we ended up with a bunch of smaller services, but the executor still exists as a concept. The part that was controlling what should happen when became the orchestrator, instead of in-process communication we have redis and the rest became the workers.
To rule out out-of-order execution there has to be a single orchestrator per executor. You can have multiple orchestrator processes running, but they’ll operate in passive-active mode. First orchestrator to start will acquire a lock on redis and all other orchestrators will wait on the lock. Sadly this works only at process level, so bumping concurrency for an orchestrator would break things as we have no way to enforce order of execution.
In large deployments I could see two kinds of setups.
Option 1 - multiple executors
Each node runs a full executor (orchestrator+redis+workers). In this setup, all the executors run in active-active mode so the workload is spread across them. Sadly the algorithm used is fairly basic and in the end does not really spread the load evenly.
Option 2 - One big executor
Each node runs an orchestrator and workers, but all of them use the same redis. In this setup, the orchestrators work in failover mode and all workers are always active. I don’t have any data to back this claim up, but I’d say this should lead to better utilization of available resources.
The downside of this model is that the currently active orchestrator becomes a bottleneck. Once you fully saturate the orchestrator and it starts to negatively impact job throughput, you need to add another executor.
Option 3 - Mix and match
This is just a combination of the previous two. You don’t have to limit yourself and can have multiple big executors.
Q&A
Did you have it configured to use multiple threads?
It sounds like you went with option 2. In this case, the orchestrator from the other node will not show up until it becomes active.
You can have as many of them as you want, but each of them has to have only one thread. All the processes will work in failover mode. There’s no need to control the failover by hand.
Just the usual ones when you talk to a service over the network. If the connection is solid, it should be fine.
If you can reproduce this, could you check a couple of things for me? Do the jobs get picked up by the orchestrator, put on redis and then get stuck there (ie the workers are not picking the jobs up) or do they “get stuck” even before reaching redis? The sidekiq console you discovered should help shed some light into this.