During the past few weeks, while doing some testing, @evgeni and I ran into content syncs hanging when performing upgrades to Katello 4.0. After some digging, the high-level issue I see is:
1. A yum update and installer run happen as part of the upgrade process, swapping out the Python libraries and Pulp packages.
2. Pulp workers either pick up the new code, fail to connect to Redis, or hit a partially migrated (or unmigrated) database (there are a lot of failures and logs to parse, so it's hard to pin it on any one thing) and start to fail.
3. Pulp workers fail, exit, and get restarted by systemd.
4. This happens many, many times until the full yum transaction and installer run are complete.
5. The Pulp database ends up with many missing workers (~20 in my case).
6. When the tests kick off and the sync occurs, Pulp has many missing workers in its database, and the task hangs waiting to be assigned to a worker. Eventually Pulp clears out all the missing workers and rectifies its state; this appears to take some time (5-20 minutes?), but even then the hung tasks never finish.

If you perform the upgrade and then wait 5-20 minutes before running tests, they pass, because all of the missing workers have been cleaned up by then.
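For reference, below is a rough, untested sketch of how a test pipeline could wait for that cleanup instead of sleeping for a fixed time. It assumes the stock Pulp 3 workers API (including its `missing` filter) with admin basic auth; the host and credentials are placeholders:

```python
# Sketch: poll the Pulp 3 workers API until no workers are flagged as missing,
# then let the test run proceed. Host, credentials, and timeouts are placeholders.
import time

import requests

WORKERS_URL = "https://localhost/pulp/api/v3/workers/"  # placeholder host
AUTH = ("admin", "password")                             # placeholder credentials


def wait_for_missing_workers_cleanup(timeout=1800, interval=30):
    """Block until the workers endpoint reports zero missing workers."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(WORKERS_URL, params={"missing": "true"},
                            auth=AUTH, verify=False)
        resp.raise_for_status()
        missing = resp.json()["count"]
        if missing == 0:
            return
        print(f"{missing} missing workers still recorded, waiting {interval}s...")
        time.sleep(interval)
    raise TimeoutError("Pulp did not clean up its missing workers in time")


if __name__ == "__main__":
    wait_for_missing_workers_cleanup()
```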
You might ask: why do I need that Forklift patch? Could that be causing the problem? I think that PR clearly exposes the problem. In a normal upgrade pipeline, we upgrade the server and then we upgrade a content proxy. The time it takes to run the steps to upgrade the content proxy is often enough for Pulp to have cleaned up any missing workers, so by the time we ran tests we were not seeing this. I do not think this diminishes the fact that there is an issue here with upgrade orchestration and worker handling. We should not end up with 20+ missing workers that Pulp has to clean up during upgrades.
What I know is that Pulp does not implement a zero-downtime policy, so halting services during an upgrade is a sane idea.
Maybe the changes made in this specific upgrade led to the failing workers.
As for stopping services: I'm always hesitant about that. In complex multi-machine deployments it's harder to orchestrate.
Another thing I'd like to ask is whether it's needed for patch releases as well, or whether we can limit it to minor releases.
Does this mean we need to stop all services (all workers, orchestrator, content, and API), or only the workers or the orchestrator?
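For context, on my box the Pulp 3 pieces are separate systemd units (pulpcore-api, pulpcore-content, pulpcore-resource-manager, and the templated pulpcore-worker@ instances), so stopping only the tasking side is at least technically possible. A sketch of that idea, with the caveat that the unit names come from my Katello 4.0 install and may differ between pulpcore versions:

```python
# Sketch: stop only the task-processing Pulp services and leave content/API serving up.
# Unit names are taken from a Katello 4.0 box and may differ between pulpcore versions.
import subprocess

TASKING_UNITS = [
    "pulpcore-resource-manager.service",  # task orchestrator on this pulpcore version
    "pulpcore-worker@*.service",          # systemctl glob matches all loaded worker instances
]

SERVING_UNITS = [
    "pulpcore-api.service",
    "pulpcore-content.service",
]


def stop_units(units):
    for unit in units:
        subprocess.run(["systemctl", "stop", unit], check=True)


if __name__ == "__main__":
    # Quiesce the workers and orchestrator before migrations; the API and
    # content apps could stay up if only the tasking side needs to be idle.
    stop_units(TASKING_UNITS)
```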
Thinking out loud: can we set a cache key somewhere that makes Pulp workers stop accepting jobs? Perhaps the Django DB migrations could set (and unset) this. The question then still remains what to do with workers that are already processing jobs.
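Purely as a thought experiment, a migration could flip such a flag itself. The sketch below is hypothetical: the key name is invented, it assumes Django's cache framework is backed by something the workers can also reach, and workers would still need code that checks the flag before picking up new tasks:

```python
# Hypothetical sketch: a migration that sets a "maintenance" cache key before the schema
# changes and clears it afterwards. The key name and the worker-side check that would
# honor it are both invented for illustration.
from django.core.cache import cache
from django.db import migrations

MAINTENANCE_KEY = "pulp_maintenance_mode"  # invented name, not a real Pulp setting


def enter_maintenance(apps, schema_editor):
    # Expire after an hour in case the migration dies half way through.
    cache.set(MAINTENANCE_KEY, True, timeout=3600)


def leave_maintenance(apps, schema_editor):
    cache.delete(MAINTENANCE_KEY)


class Migration(migrations.Migration):

    dependencies = [
        ("core", "0001_initial"),  # placeholder dependency
    ]

    operations = [
        migrations.RunPython(enter_maintenance, reverse_code=leave_maintenance),
        # ... the actual schema operations would go here ...
        migrations.RunPython(leave_maintenance, reverse_code=enter_maintenance),
    ]
```

Even then, this only keeps workers from picking up new tasks; it does nothing for tasks that are already in flight.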
Patch releases will never contain migrations, because otherwise there would be no proper way to keep the database sane. (For some reasonable definition of “never”…)
So I’d say patch upgrades are fine. And as long as there are no changes to the task tables, it’s probably also not a problem.
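If we wanted to make the stop-services step conditional on that, the upgrade tooling could check whether the new packages actually ship unapplied migrations. A sketch, assuming pulpcore-manager is on the PATH and relying on Django's showmigrations output marking unapplied migrations with `[ ]`:

```python
# Sketch: decide whether services need to be stopped by checking for unapplied migrations.
# Relies on Django's `showmigrations --plan` output, which marks applied migrations with
# [X] and unapplied ones with [ ].
import subprocess


def has_pending_migrations():
    result = subprocess.run(
        ["pulpcore-manager", "showmigrations", "--plan"],
        capture_output=True, text=True, check=True,
    )
    return "[ ]" in result.stdout


if __name__ == "__main__":
    if has_pending_migrations():
        print("Migrations pending: stop the Pulp services before applying them.")
    else:
        print("No migrations pending: looks like a patch-level upgrade, skip the stop/start.")
```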
Note: I also saw this happening simply as packages were being updated by yum. I’ll have to respin the environment to get the output if anyone is interested.
I do want to re-emphasize that eventually Pulp does appear to rectify itself and return to a working state. For me, the issues for users are:
- the uncertainty of how long this takes to happen
- tasks started during that uncertainty period seem to end up hung and are never properly re-scheduled
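As a stopgap for tasks that get stuck during that window, they can at least be cancelled through the tasks API instead of hanging forever. A rough sketch with the same placeholder host and credentials as above, using the usual Pulp 3 mechanism of PATCHing a task's state to "canceled":

```python
# Sketch: list tasks stuck in the "waiting" state and cancel them via the Pulp 3 tasks API.
# Host and credentials are placeholders; a task is cancelled by PATCHing its state.
import requests

BASE = "https://localhost"    # placeholder host
AUTH = ("admin", "password")  # placeholder credentials


def cancel_waiting_tasks():
    # Only the first page of results, for brevity; a real script would paginate.
    resp = requests.get(f"{BASE}/pulp/api/v3/tasks/", params={"state": "waiting"},
                        auth=AUTH, verify=False)
    resp.raise_for_status()
    for task in resp.json()["results"]:
        print(f"Cancelling {task['pulp_href']}")
        cancel = requests.patch(f"{BASE}{task['pulp_href']}",
                                json={"state": "canceled"}, auth=AUTH, verify=False)
        cancel.raise_for_status()


if __name__ == "__main__":
    cancel_waiting_tasks()
```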
Do the new workers handle this better, @x9c4?
Can we do an “easy” stop-services fix for 4.0? This might depend on whether these are systematic problems with the new workers too.