Repeated failing start of pulpcore-resource-manager.service during boot/start

gvde · May 28, 2021, 4:21pm

Problem:

I have noticed, that during system startup after a reboot or after a foreman-maintain service stop & start, the pulpcore-resource-manager.service repeatedly tries to start, fails, and retries until it succeeds after a couple of minutes. Sometimes it takes two minutes, sometimes even 5. Each attempt takes about 5 seconds, thus you have 12 or even 60 failed starts…

Foreman and Proxy versions:

Foreman 2.3.4, Katello 3.18.2 on CentOS 7.9

Other relevant data:

Each cycle looks like this in the systemd journal:

May 28 17:37:23 foreman.dkrz.de systemd[1]: Started Pulp Resource Manager.
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: pulp: pulpcore.tasking.services.worker_watcher:INFO: Cleaning up shutdown worker 'resource-manager'.
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: Traceback (most recent call last):
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/bin/rq", line 11, in <module>
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: load_entry_point('rq==1.5.2', 'console_scripts', 'rq')()
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: return self.main(*args, **kwargs)
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: rv = self.invoke(ctx)
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: return _process_result(sub_ctx.command.invoke(sub_ctx))
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: return ctx.invoke(self.callback, **ctx.params)
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: return callback(*args, **kwargs)
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/rq/cli/cli.py", line 78, in wrapper
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: return ctx.invoke(func, cli_config, *args[1:], **kwargs)
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: return callback(*args, **kwargs)
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/rq/cli/cli.py", line 268, in worker
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: max_jobs=max_jobs, with_scheduler=with_scheduler)
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/rq/worker.py", line 508, in work
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: self.register_birth()
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/pulpcore/tasking/worker.py", line 171, in register_birth
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: return super().register_birth(*args, **kwargs)
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: File "/usr/lib/python3.6/site-packages/rq/worker.py", line 273, in register_birth
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: raise ValueError(msg.format(self.name))
May 28 17:37:25 foreman.dkrz.de pulpcore-resource-manager[38626]: ValueError: There exists an active worker named 'resource-manager' already
May 28 17:37:25 foreman.dkrz.de systemd[1]: pulpcore-resource-manager.service: main process exited, code=exited, status=1/FAILURE
May 28 17:37:25 foreman.dkrz.de systemd[1]: Unit pulpcore-resource-manager.service entered failed state.
May 28 17:37:25 foreman.dkrz.de systemd[1]: pulpcore-resource-manager.service failed.
May 28 17:37:28 foreman.dkrz.de systemd[1]: pulpcore-resource-manager.service holdoff time over, scheduling restart.
May 28 17:37:28 foreman.dkrz.de systemd[1]: Stopped Pulp Resource Manager.
May 28 17:37:28 foreman.dkrz.de systemd[1]: Started Pulp Resource Manager.
...
May 28 17:39:28 foreman.dkrz.de systemd[1]: pulpcore-resource-manager.service: main process exited, code=exited, status=1/FAILURE
May 28 17:39:28 foreman.dkrz.de systemd[1]: Unit pulpcore-resource-manager.service entered failed state.
May 28 17:39:28 foreman.dkrz.de systemd[1]: pulpcore-resource-manager.service failed.
May 28 17:39:31 foreman.dkrz.de systemd[1]: pulpcore-resource-manager.service holdoff time over, scheduling restart.
May 28 17:39:31 foreman.dkrz.de systemd[1]: Stopped Pulp Resource Manager.
May 28 17:39:31 foreman.dkrz.de systemd[1]: Started Pulp Resource Manager.
May 28 17:39:33 foreman.dkrz.de pulpcore-resource-manager[39403]: pulp: pulpcore.tasking.services.worker_watcher:INFO: Cleaning up shutdown worker 'resource-manager'.
May 28 17:39:33 foreman.dkrz.de pulpcore-resource-manager[39403]: pulp: rq.worker:INFO: Worker rq:worker:resource-manager: started, version 1.5.2
May 28 17:39:33 foreman.dkrz.de pulpcore-resource-manager[39403]: pulp: rq.worker:INFO: *** Listening on resource-manager...
May 28 17:39:33 foreman.dkrz.de pulpcore-resource-manager[39403]: pulp: pulpcore.tasking.services.worker_watcher:INFO: Worker 'resource-manager' is back online.
May 28 17:39:33 foreman.dkrz.de pulpcore-resource-manager[39403]: pulp: rq.worker:INFO: Cleaning registries for queue: resource-manager

kobena · June 1, 2021, 9:54am

Hi @gvde

Just to confirm I can reproduce this with same katello/foreman version.

I just notice that foreman-maintain service stop seems to have no effect to stop pulpcore-worker@*. When command is finished, you wan still see pulp worker processes.

Can you try this

foreman-maintain service stop
Check pulp worker processes if they’re still alive (ps, top, htop, …)
If yes, stop them with systemctl stop pulpcore-worker@*
Check again that pulp worker process are stopped
Start Katello and check if an other terminal with foreman-tail.

Can you confirm it’s better ?

ekohl · June 1, 2021, 10:07am

Looking at the code:

github.com

rq/rq/blob/73d0210d65fc3fa70463e3f6b6ecd8f3bb679d63/rq/worker.py#L289-L292


if self.connection.exists(self.key) and \
        not self.connection.hexists(self.key, 'death'):
    msg = 'There exists an active worker named {0!r} already'
    raise ValueError(msg.format(self.name))

I wonder if a worked died and didn’t register its death. This should be stored in Redis but I’m not sure how to really solve it.

kobena · June 1, 2021, 10:07am

Sorry, small fix/update

Start Katello and check in an other terminal with foreman-tail.

kobena · June 1, 2021, 10:17am

What I understand from the code is : “Pulp workers should be died”. If not raise an error.

So maybe during the foreman-maintain stop step, we should assume that these processes are stopped ?

kobena · June 1, 2021, 10:28am

A quick fix could be to remove raise ValueError(msg.format(self.name)).

Otherwise, add a loop of x iterations over n seconds, use try/except and maybe raise errors if all iterations are failed.

For me it’s not a bug, it’s a dirty message during the start process in foreman logs. If you wait (~5min), foreman is able to “create the new world”.

ekohl · June 1, 2021, 10:30am

I think foreman-maintain is the wrong place to solve it. foreman-maintain is tied to a single server but in theory you could run it on multiple servers with some fallback.

What’s really needed is some way to clean up a zombie worker. One that’s died, but isn’t registered as dead. I think that means either removing the key in Redis or adding a death entry.

Perhaps @pulp can help here.

kobena · June 1, 2021, 11:00am

Please @gvde don’t try to follow my 1…5 steps. It will break your pulp workers services.
If you already did it, just re run foreman-installer.

@ekohl I’m agree with you.

gvde · June 1, 2021, 2:16pm

No and yes. If I stop foreman with foreman-maintain service stop it will first stop everything. If I check immediately after that, it’s gone. But then it quickly restarts. Logs show

systemd: pulpcore-worker@8.service holdoff time over, scheduling restart.

So systemd restarts the workers because they are have crashed and not stopped.

I haven’t looked any further in to this problem so far, but my guess was that the systemd configuration doesn’t reflect the correct startup or shutdown order of the services involved, causing systemd to start services which are supposed to be stopped…

kobena · June 1, 2021, 2:47pm

Hi @gvde,

I’m not sure it concerns systemd nor foreman-maintain. It’s a pulp mechanism. As @ekohl explains:

What’s really needed is some way to clean up a zombie worker. One that’s died, but isn’t registered as dead. I think that means either removing the key in Redis or adding a death entry.

So it looks like more a pulp issue, and seems be linked with redis entries.

Once again, it’s not a bug since foreman/katello is still working. In my case, I just wait a few minutes, and check with foreman-maintain service status -b / hammer ping until it’s up.

Sorry but I don’t have enough skill to tell you more.

gvde · June 1, 2021, 2:53pm

It was just a observation from me that systemd restarts the worker processes because they have not been flagged to be stopped. The processes die and start again by systemd. So this “register(ed) as dead” doesn’t seem to help because the next worker restart is just around the corner…

kobena · June 1, 2021, 3:32pm

I found this issue:

Issue #6979: cluster: Ensure resource-manager worker name uniqueness - Pulp
[WIP] Ensure resource-manager has a unique name when horizontally scaled by Spredzy · Pull Request #333 · pulp/pulp_installer · GitHub

gvde · August 10, 2021, 6:44am

By now I am on 4.1.2.1. Just stop all services (foreman-maintain service stop) before a foreman-installer run and still noticed, that systemd keeps restarting pulpcore-worker@*.service.

I don’t think this is a zombie worker. systemd doesn’t know that it’s supposed to be stopped. In my opinion that is something foreman-maintain should do.

A zombie worker would be a pulpcore-worker which is running after the rest has crashed. But with “service stop” there should be a clean shutdown of all related services including all pulpcore-workers…