During the past few weeks, while doing some testing, @evgeni and I ran into content syncs hanging when performing upgrades to Katello 4.0. After some digging, the high-level issue I see is:
1. A yum update and installer run happen as part of the upgrade process, swapping out the Python libraries and Pulp packages.
2. Pulp workers either pick up the new code, fail to connect to Redis, or hit a partially migrated (or unmigrated) database (there are a lot of failures and logs to parse, so it's hard to pin it on any one thing) and start to fail.
3. Pulp workers fail, exit, and get restarted by systemd.
4. This happens many, many times until the full yum transaction and installer run are complete.
5. The Pulp database ends up with many missing workers (~20 in my case).
6. When the tests kick off and the sync occurs, Pulp has many missing workers in its database, and the task hangs waiting to be assigned to a worker. Eventually Pulp clears out all the missing workers and rectifies its state; this appears to take some time (5-20 minutes?), but even then the hung tasks never finish.

If you perform the upgrade and then wait 5-20 minutes before running tests, they pass, because all of the missing workers have been cleaned up by then.
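For reference, below is a rough, untested sketch of how a test pipeline could wait for that cleanup instead of sleeping for a fixed time. It assumes the stock Pulp 3 workers API (including its `missing` filter) with admin basic auth; the host and credentials are placeholders:

```python
# Sketch: poll the Pulp 3 workers API until no workers are flagged as missing,
# then let the test run proceed. Host, credentials, and timeouts are placeholders.
import time

import requests

WORKERS_URL = "https://localhost/pulp/api/v3/workers/"  # placeholder host
AUTH = ("admin", "password")                             # placeholder credentials


def wait_for_missing_workers_cleanup(timeout=1800, interval=30):
    """Block until the workers endpoint reports zero missing workers."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(WORKERS_URL, params={"missing": "true"},
                            auth=AUTH, verify=False)
        resp.raise_for_status()
        missing = resp.json()["count"]
        if missing == 0:
            return
        print(f"{missing} missing workers still recorded, waiting {interval}s...")
        time.sleep(interval)
    raise TimeoutError("Pulp did not clean up its missing workers in time")


if __name__ == "__main__":
    wait_for_missing_workers_cleanup()
```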
You might ask: why do I need that Forklift patch? Could that be causing the problem? I think that PR clearly exposes the problem. In a normal upgrade pipeline, we upgrade the server and then we upgrade a content proxy. The time it takes to run the steps to upgrade the content proxy is often enough for Pulp to have cleaned up any missing workers, so by the time we ran tests we were not seeing this. I do not think this diminishes the fact that there is an issue here with upgrade orchestration and worker handling. We should not end up with 20+ missing workers that Pulp has to clean up during upgrades.
What I know is that Pulp does not implement a zero-downtime policy, so halting services during an upgrade is a sane idea.
Maybe the changes made in this specific upgrade led to the failing workers.
As for stopping services: I'm always hesitant about that. In complex multi-machine deployments it's harder to orchestrate.
Another thing I'd like to ask is whether it's needed for patch releases as well, or whether we can limit it to minor releases.
Does this mean we need to stop all services (all workers, orchestrator, content, and API), or only the workers or the orchestrator?
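For context, on my box the Pulp 3 pieces are separate systemd units (pulpcore-api, pulpcore-content, pulpcore-resource-manager, and the templated pulpcore-worker@ instances), so stopping only the tasking side is at least technically possible. A sketch of that idea, with the caveat that the unit names come from my Katello 4.0 install and may differ between pulpcore versions:

```python
# Sketch: stop only the task-processing Pulp services and leave content/API serving up.
# Unit names are taken from a Katello 4.0 box and may differ between pulpcore versions.
import subprocess

TASKING_UNITS = [
    "pulpcore-resource-manager.service",  # task orchestrator on this pulpcore version
    "pulpcore-worker@*.service",          # systemctl glob matches all loaded worker instances
]

SERVING_UNITS = [
    "pulpcore-api.service",
    "pulpcore-content.service",
]


def stop_units(units):
    for unit in units:
        subprocess.run(["systemctl", "stop", unit], check=True)


if __name__ == "__main__":
    # Quiesce the workers and orchestrator before migrations; the API and
    # content apps could stay up if only the tasking side needs to be idle.
    stop_units(TASKING_UNITS)
```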
Thinking out loud: can we set a cache key somewhere that makes Pulp workers stop accepting jobs? Perhaps the Django DB migrations could set (and unset) this. The question then still remains what to do with workers that are already processing jobs.
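Purely as a thought experiment, a migration could flip such a flag itself. The sketch below is hypothetical: the key name is invented, it assumes Django's cache framework is backed by something the workers can also reach, and workers would still need code that checks the flag before picking up new tasks:

```python
# Hypothetical sketch: a migration that sets a "maintenance" cache key before the schema
# changes and clears it afterwards. The key name and the worker-side check that would
# honor it are both invented for illustration.
from django.core.cache import cache
from django.db import migrations

MAINTENANCE_KEY = "pulp_maintenance_mode"  # invented name, not a real Pulp setting


def enter_maintenance(apps, schema_editor):
    # Expire after an hour in case the migration dies half way through.
    cache.set(MAINTENANCE_KEY, True, timeout=3600)


def leave_maintenance(apps, schema_editor):
    cache.delete(MAINTENANCE_KEY)


class Migration(migrations.Migration):

    dependencies = [
        ("core", "0001_initial"),  # placeholder dependency
    ]

    operations = [
        migrations.RunPython(enter_maintenance, reverse_code=leave_maintenance),
        # ... the actual schema operations would go here ...
        migrations.RunPython(leave_maintenance, reverse_code=enter_maintenance),
    ]
```

Even then, this only keeps workers from picking up new tasks; it does nothing for tasks that are already in flight.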
Patch releases will never contain migrations, because otherwise there would be no proper way to keep the database sane. (For some reasonable definition of “never”…)
So I’d say patch upgrades are fine. And as long as there are no changes to the task tables, it’s probably also not a problem.
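If we wanted to make the stop-services step conditional on that, the upgrade tooling could check whether the new packages actually ship unapplied migrations. A sketch, assuming pulpcore-manager is on the PATH and relying on Django's showmigrations output marking unapplied migrations with `[ ]`:

```python
# Sketch: decide whether services need to be stopped by checking for unapplied migrations.
# Relies on Django's `showmigrations --plan` output, which marks applied migrations with
# [X] and unapplied ones with [ ].
import subprocess


def has_pending_migrations():
    result = subprocess.run(
        ["pulpcore-manager", "showmigrations", "--plan"],
        capture_output=True, text=True, check=True,
    )
    return "[ ]" in result.stdout


if __name__ == "__main__":
    if has_pending_migrations():
        print("Migrations pending: stop the Pulp services before applying them.")
    else:
        print("No migrations pending: looks like a patch-level upgrade, skip the stop/start.")
```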
Note: I also saw this happening simply as packages were being updated by yum. I’ll have to respin the environment to get the output if anyone is interested.
I do want to re-emphasize that eventually Pulp does appear to rectify itself and return to a working state. For me, the issues for users are:
- the uncertainty of how long this takes to happen
- tasks started during that uncertainty period seem to end up hung and are never properly re-scheduled
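As a stopgap for tasks that get stuck during that window, they can at least be cancelled through the tasks API instead of hanging forever. A rough sketch with the same placeholder host and credentials as above, using the usual Pulp 3 mechanism of PATCHing a task's state to "canceled":

```python
# Sketch: list tasks stuck in the "waiting" state and cancel them via the Pulp 3 tasks API.
# Host and credentials are placeholders; a task is cancelled by PATCHing its state.
import requests

BASE = "https://localhost"    # placeholder host
AUTH = ("admin", "password")  # placeholder credentials


def cancel_waiting_tasks():
    # Only the first page of results, for brevity; a real script would paginate.
    resp = requests.get(f"{BASE}/pulp/api/v3/tasks/", params={"state": "waiting"},
                        auth=AUTH, verify=False)
    resp.raise_for_status()
    for task in resp.json()["results"]:
        print(f"Cancelling {task['pulp_href']}")
        cancel = requests.patch(f"{BASE}{task['pulp_href']}",
                                json={"state": "canceled"}, auth=AUTH, verify=False)
        cancel.raise_for_status()


if __name__ == "__main__":
    cancel_waiting_tasks()
```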
Do the new workers handle this better, @x9c4?
Can we do an “easy” stop-services fix for 4.0? This might depend on whether these are systematic problems with the new workers too.