Tasks stuck with "waiting for Pulp to start the task" - foreman-2.4 / Katello 4.0.0 / Pulp3

Hello @iballou

Yes, curl https://foremanserver/pulp/api/v3/status/ shows the dead workers too.
No, I don’t see any messages similar to the following in /var/log/messages.
May 13 19:39:10 server.example.com rq[46406]: pulp [None]: pulpcore.tasking.worker_wa

I was able to delete the zombie tasks using the Pulp API and curl, as below:

[root@foremanserver~]# curl -X DELETE https://`hostname`/pulp/api/v3/tasks/4472cad9-8390-46c7-a9ba-3dbe06689049/ --cert /etc/pki/katello/certs/pulp-client.crt   --key /etc/pki/katello/private/pulp-client.key
[root@foremanserver~]#

[root@foremanserver~]# curl -X GET https://`hostname`/pulp/api/v3/tasks/4472cad9-8390-46c7-a9ba-3dbe06689049/ --cert /etc/pki/katello/certs/pulp-client.crt  --key /etc/pki/katello/private/pulp-client.key | jq
{
  "detail": "Not found."
}
[root@foremanserver~]#

After deleting all these zombie (waiting) tasks and running foreman-maintain service restart, I'm now able to run new tasks successfully.

At this point we don't have the root cause of this issue, and it can happen again.
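For reference, the manual curl deletions above can be scripted. This is only a sketch, assuming the same client certificate and key paths shown above and the third-party requests library; the base URL and function names are mine, not part of the Pulp API:

```python
def waiting_task_hrefs(tasks):
    """Pick out the hrefs of tasks still stuck in the 'waiting' state."""
    return [t["pulp_href"] for t in tasks if t["state"] == "waiting"]

def delete_waiting_tasks(base_url, cert):
    """Page through /pulp/api/v3/tasks/ and DELETE every waiting task,
    mirroring the manual curl commands shown above."""
    import requests  # third-party; only needed for the live API calls
    url = base_url + "/pulp/api/v3/tasks/?state=waiting"
    while url:
        page = requests.get(url, cert=cert).json()
        for href in waiting_task_hrefs(page["results"]):
            requests.delete(base_url + href, cert=cert)
        url = page["next"]  # Pulp paginates; "next" is null on the last page
```

Usage would look something like delete_waiting_tasks("https://foremanserver", ("/etc/pki/katello/certs/pulp-client.crt", "/etc/pki/katello/private/pulp-client.key")).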

Regards,


Okay, I’m glad you can run new tasks now. The dead workers are still a concern; I think they’re not being cleaned up, especially if they’re showing up under “online_workers”. I am slightly confused that the workers from your workers = [w.name for w in Worker.objects.online_workers()] query were all actual workers — there’s a mismatch between that pulpcore-manager shell output and the online workers in the status API output. I would assume they come from the same database records, but maybe not.

To move forward, I’ll first see if the Pulp team has any more suggestions. If not, perhaps you can just monitor the number of dead vs alive workers. If the dead worker number keeps rising, and they aren’t being cleaned, I think it would be good to make a bug report here that shows the dead workers and some information about your system: https://pulp.plan.io/issues/new

Hello @iballou

OK, noted. I’ll continue monitoring the number of dead vs. alive workers on my system.
I really appreciate your help! Thanks a lot :slight_smile:

Regards,


Hello, I have the same issue, but I can’t delete the waiting tasks. On my "curl -X DELETE … " I get an HTTP 409 Conflict error. How did you delete these waiting tasks?
< HTTP/1.1 409 Conflict
< Date: Tue, 18 May 2021 09:13:40 GMT
< Server: gunicorn/20.0.4
< Vary: Accept,Cookie
< Allow: GET, PATCH, DELETE, HEAD, OPTIONS
< X-Frame-Options: SAMEORIGIN
< Content-Length: 0

Hello @fedextm ,

Here is what I used:

[root@foremanserver~]# curl -X DELETE https://`hostname`/pulp/api/v3/tasks/4472cad9-8390-46c7-a9ba-3dbe06689049/ --cert /etc/pki/katello/certs/pulp-client.crt --key /etc/pki/katello/private/pulp-client.key
[root@foremanserver~]#

Regards,
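A note on the 409 Conflict reported above: Pulp refuses to DELETE a task that is not yet in a final state. As far as I can tell from the pulpcore task API, the fix is to cancel the task first with a PATCH of {"state": "canceled"} and only then delete it. A hedged Python sketch (the state names should be verified against your pulpcore version, and next_action/cancel_task are my helper names, not Pulp API calls):

```python
# Task states in which Pulp allows DELETE; anything else answers 409 Conflict.
FINAL_STATES = {"completed", "failed", "canceled"}

def next_action(task_state):
    """Decide whether a task can be DELETEd or must be canceled first."""
    return "DELETE" if task_state in FINAL_STATES else "CANCEL"

def cancel_task(base_url, task_href, cert):
    """Ask Pulp to cancel the task; once it reaches a final state,
    the DELETE from the curl example should stop returning 409."""
    import requests  # third-party; only needed for the live API call
    return requests.patch(base_url + task_href,
                          json={"state": "canceled"},
                          cert=cert)
```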

I’m also having this issue on a fresh install of Foreman 2.4 and Katello 4.0. I’m running Katello on Oracle Linux 8 and I’ve added the Oracle Linux repos for OL6/7/8 as well as some other repos and I triggered them all to sync at the same time.

I tried syncing a large number of repositories and now I’ve got a task that’s waiting for Pulp to start. Looking at the history of this post, I can see that Pulp had 4 workers + resource-manager running. But according to the API (https://hostname/pulp/api/v3/workers/) there are 25 workers: 5 are active and 20 have last_heartbeats from 4+ days ago (when things died).

The task list has one task, the hung one, which is assigned to one of the dead workers. Once I cancelled that task, things started moving again.
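For monitoring dead vs. alive workers, the split can be computed directly from the /pulp/api/v3/workers/ payload. A minimal sketch — the 30-second staleness cutoff is my assumption, not necessarily Pulp's exact missing-worker threshold:

```python
from datetime import datetime, timedelta, timezone

def split_workers(workers, now, max_age=timedelta(seconds=30)):
    """Split a /pulp/api/v3/workers/ result list into (live, dead)
    worker names based on how stale each last_heartbeat is."""
    live, dead = [], []
    for w in workers:
        # Pulp timestamps end in "Z"; fromisoformat() needs "+00:00" instead.
        beat = datetime.fromisoformat(w["last_heartbeat"].replace("Z", "+00:00"))
        (live if now - beat <= max_age else dead).append(w["name"])
    return live, dead
```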

I did see entries in the logs about missing workers that match up with the IDs in my API /workers/ list.

I’ll add in all my outputs too, in case it helps someone else:

/pulp/api/v3/status/ output

# curl "https://`hostname`/pulp/api/v3/status/" --cert /etc/pki/katello/certs/pulp-client.crt  --key /etc/pki/katello/private/pulp-client.key | jq
{
  "versions": [
    {
      "component": "pulpcore",
      "version": "3.9.1"
    },
    {
      "component": "pulp_rpm",
      "version": "3.10.0"
    },
    {
      "component": "pulp_file",
      "version": "1.5.0"
    },
    {
      "component": "pulp_deb",
      "version": "2.9.2"
    },
    {
      "component": "pulp_container",
      "version": "2.2.2"
    },
    {
      "component": "pulp_certguard",
      "version": "1.1.0"
    }
  ],
  "online_workers": [
    {
      "pulp_created": "2021-06-17T23:25:41.745736Z",
      "pulp_href": "/pulp/api/v3/workers/f7bd8f4c-3f14-4443-9924-2b1cabe48f19/",
      "name": "1402@<servername>",
      "last_heartbeat": "2021-06-21T03:43:53.646216Z"
    },
    {
      "pulp_created": "2021-06-17T23:25:42.326468Z",
      "pulp_href": "/pulp/api/v3/workers/83f67f6d-d547-4627-836f-99e69ac7b437/",
      "name": "1408@<servername>",
      "last_heartbeat": "2021-06-21T03:44:07.504398Z"
    },
    {
      "pulp_created": "2021-06-17T23:25:42.933978Z",
      "pulp_href": "/pulp/api/v3/workers/8e881a23-cf2b-4556-909c-a75f72f34c45/",
      "name": "1401@<servername>",
      "last_heartbeat": "2021-06-21T03:42:40.659272Z"
    },
    {
      "pulp_created": "2021-06-15T10:51:04.127965Z",
      "pulp_href": "/pulp/api/v3/workers/9590f209-11ae-42b0-b9a9-67d48022596b/",
      "name": "resource-manager",
      "last_heartbeat": "2021-06-21T03:42:40.058114Z"
    },
    {
      "pulp_created": "2021-06-17T23:25:42.929995Z",
      "pulp_href": "/pulp/api/v3/workers/9b2825df-16e3-493e-a0c7-6b0e30c431fa/",
      "name": "1407@<servername>",
      "last_heartbeat": "2021-06-21T03:43:54.148199Z"
    }
  ],
  "online_content_apps": [
    {
      "name": "1955@<servername>",
      "last_heartbeat": "2021-06-21T03:44:06.263164Z"
    },
    {
      "name": "1936@<servername>",
      "last_heartbeat": "2021-06-21T03:44:08.705915Z"
    },
    {
      "name": "1952@<servername>",
      "last_heartbeat": "2021-06-21T03:44:09.400131Z"
    },
    {
      "name": "1950@<servername>",
      "last_heartbeat": "2021-06-21T03:44:09.357857Z"
    },
    {
      "name": "1943@<servername>",
      "last_heartbeat": "2021-06-21T03:44:09.414391Z"
    },
    {
      "name": "1962@<servername>",
      "last_heartbeat": "2021-06-21T03:44:04.110343Z"
    },
    {
      "name": "1956@<servername>",
      "last_heartbeat": "2021-06-21T03:44:09.356084Z"
    },
    {
      "name": "1958@<servername>",
      "last_heartbeat": "2021-06-21T03:44:04.110480Z"
    },
    {
      "name": "1959@<servername>",
      "last_heartbeat": "2021-06-21T03:44:09.411785Z"
    }
  ],
  "database_connection": {
    "connected": true
  },
  "redis_connection": {
    "connected": true
  },
  "storage": {
    "total": 321961070592,
    "used": 19971796992,
    "free": 301989273600
  }
}

pulp-core output

# sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' DJANGO_SETTINGS_MODULE='pulpcore.app.settings' pulpcore-manager shell <<EOF
from pulpcore.app.models import Worker
for w in Worker.objects.online_workers():
    print(f'Worker {w.name}')
EOF

Worker 1402@<servername>
Worker 1408@<servername>
Worker 1401@<servername>
Worker 1407@<servername>
Worker resource-manager

/pulp/api/v3/workers/ output

# curl https://`hostname`/pulp/api/v3/workers/  --cert /etc/pki/katello/certs/pulp-client.crt  --key /etc/pki/katello/private/pulp-client.key  | jq 
{
  "count": 25,
  "next": null,
  "previous": null,
  "results": [
    {
      "pulp_created": "2021-06-17T23:25:42.933978Z",
      "pulp_href": "/pulp/api/v3/workers/8e881a23-cf2b-4556-909c-a75f72f34c45/",
      "name": "1401@<servername>",
      "last_heartbeat": "2021-06-21T03:30:32.876435Z"
    },
    {
      "pulp_created": "2021-06-17T23:25:42.929995Z",
      "pulp_href": "/pulp/api/v3/workers/9b2825df-16e3-493e-a0c7-6b0e30c431fa/",
      "name": "1407@<servername>",
      "last_heartbeat": "2021-06-21T03:30:32.274900Z"
    },
    {
      "pulp_created": "2021-06-17T23:25:42.326468Z",
      "pulp_href": "/pulp/api/v3/workers/83f67f6d-d547-4627-836f-99e69ac7b437/",
      "name": "1408@<servername>",
      "last_heartbeat": "2021-06-21T03:30:45.523285Z"
    },
    {
      "pulp_created": "2021-06-17T23:25:41.745736Z",
      "pulp_href": "/pulp/api/v3/workers/f7bd8f4c-3f14-4443-9924-2b1cabe48f19/",
      "name": "1402@<servername>",
      "last_heartbeat": "2021-06-21T03:30:31.773271Z"
    },
    {
      "pulp_created": "2021-06-17T04:45:53.460706Z",
      "pulp_href": "/pulp/api/v3/workers/63bc74cc-ac55-42d8-8485-7e9312aef588/",
      "name": "87641@<servername>",
      "last_heartbeat": "2021-06-17T23:23:53.921023Z"
    },
    {
      "pulp_created": "2021-06-17T04:45:51.516344Z",
      "pulp_href": "/pulp/api/v3/workers/361e0bc2-79fd-481c-86ec-72881b090a74/",
      "name": "87558@<servername>",
      "last_heartbeat": "2021-06-17T23:29:02.317408Z"
    },
    {
      "pulp_created": "2021-06-17T04:45:51.470894Z",
      "pulp_href": "/pulp/api/v3/workers/f820d466-eeab-4bca-b424-f65eada5919b/",
      "name": "87591@<servername>",
      "last_heartbeat": "2021-06-17T23:23:53.946154Z"
    },
    {
      "pulp_created": "2021-06-17T04:45:51.385979Z",
      "pulp_href": "/pulp/api/v3/workers/6c7cb505-46f6-45cb-b364-ccdf0e8dbe9c/",
      "name": "87567@<servername>",
      "last_heartbeat": "2021-06-17T23:23:53.938188Z"
    },
    {
      "pulp_created": "2021-06-16T23:29:51.450305Z",
      "pulp_href": "/pulp/api/v3/workers/341351fe-2801-4bc3-b7fb-b805ef5fc0c1/",
      "name": "69639@<servername>",
      "last_heartbeat": "2021-06-17T04:45:50.113995Z"
    },
    {
      "pulp_created": "2021-06-16T23:28:55.317166Z",
      "pulp_href": "/pulp/api/v3/workers/46f1252d-3674-46a1-9895-cb431c28d4db/",
      "name": "69501@<servername>",
      "last_heartbeat": "2021-06-17T04:45:50.071885Z"
    },
    {
      "pulp_created": "2021-06-16T23:28:54.390401Z",
      "pulp_href": "/pulp/api/v3/workers/317098bb-a71f-4e5f-b400-6cbfb454d4d3/",
      "name": "69495@<servername>",
      "last_heartbeat": "2021-06-17T04:45:50.028776Z"
    },
    {
      "pulp_created": "2021-06-16T23:28:45.296031Z",
      "pulp_href": "/pulp/api/v3/workers/26b68f40-de97-49a8-82a5-a54d896abbe2/",
      "name": "69469@<servername>",
      "last_heartbeat": "2021-06-17T04:45:49.998275Z"
    },
    {
      "pulp_created": "2021-06-15T11:51:43.238418Z",
      "pulp_href": "/pulp/api/v3/workers/19fc28bd-fd42-479e-ac6c-9baa8fe43074/",
      "name": "1444@<servername>",
      "last_heartbeat": "2021-06-16T23:32:06.182504Z"
    },
    {
      "pulp_created": "2021-06-15T11:51:43.116338Z",
      "pulp_href": "/pulp/api/v3/workers/b1a39420-bc73-4fbb-9e3b-0d501b59748c/",
      "name": "1459@<servername>",
      "last_heartbeat": "2021-06-16T23:32:15.442657Z"
    },
    {
      "pulp_created": "2021-06-15T11:51:43.073757Z",
      "pulp_href": "/pulp/api/v3/workers/078fbcce-8bc4-4430-b679-a19d5bf0bb34/",
      "name": "1451@<servername>",
      "last_heartbeat": "2021-06-16T23:32:15.170233Z"
    },
    {
      "pulp_created": "2021-06-15T11:51:42.988822Z",
      "pulp_href": "/pulp/api/v3/workers/743bef47-7d2b-4987-af60-2e83d0807cb7/",
      "name": "1448@<servername>",
      "last_heartbeat": "2021-06-16T23:33:12.355166Z"
    },
    {
      "pulp_created": "2021-06-15T10:55:59.257146Z",
      "pulp_href": "/pulp/api/v3/workers/b21e0ec0-9f59-447f-a75f-dbba5faa1199/",
      "name": "25247@<servername>",
      "last_heartbeat": "2021-06-15T11:51:42.895031Z"
    },
    {
      "pulp_created": "2021-06-15T10:55:57.836954Z",
      "pulp_href": "/pulp/api/v3/workers/fb0965e3-f1e6-4e62-8366-d571f711c49a/",
      "name": "25240@<servername>",
      "last_heartbeat": "2021-06-15T11:22:08.912543Z"
    },
    {
      "pulp_created": "2021-06-15T10:55:56.111694Z",
      "pulp_href": "/pulp/api/v3/workers/11576ae6-8ed0-4fc7-88fa-b39dcc064694/",
      "name": "25230@<servername>",
      "last_heartbeat": "2021-06-15T11:51:42.875015Z"
    },
    {
      "pulp_created": "2021-06-15T10:55:54.378224Z",
      "pulp_href": "/pulp/api/v3/workers/bacb1c40-620e-4b17-8097-b19b1534d1ca/",
      "name": "25225@<servername>",
      "last_heartbeat": "2021-06-15T11:22:08.854539Z"
    },
    {
      "pulp_created": "2021-06-15T10:51:04.127965Z",
      "pulp_href": "/pulp/api/v3/workers/9590f209-11ae-42b0-b9a9-67d48022596b/",
      "name": "resource-manager",
      "last_heartbeat": "2021-06-21T03:30:39.100613Z"
    },
    {
      "pulp_created": "2021-06-15T10:50:51.726137Z",
      "pulp_href": "/pulp/api/v3/workers/1fe78503-8037-47c5-adfc-cf2343d129fb/",
      "name": "22859@<servername>",
      "last_heartbeat": "2021-06-15T10:59:15.207717Z"
    },
    {
      "pulp_created": "2021-06-15T10:50:50.228626Z",
      "pulp_href": "/pulp/api/v3/workers/22e44eb6-c2c5-4436-b7c0-ad155cacf37f/",
      "name": "22770@<servername>",
      "last_heartbeat": "2021-06-15T10:59:15.193882Z"
    },
    {
      "pulp_created": "2021-06-15T10:50:48.372229Z",
      "pulp_href": "/pulp/api/v3/workers/5530928d-8df7-424c-a50a-89c2c2bdd3ac/",
      "name": "22684@<servername>",
      "last_heartbeat": "2021-06-15T10:59:15.177090Z"
    },
    {
      "pulp_created": "2021-06-15T10:50:46.895867Z",
      "pulp_href": "/pulp/api/v3/workers/5d7ff8ba-1d17-49e9-9ff7-efc4653eed6e/",
      "name": "22599@<servername>",
      "last_heartbeat": "2021-06-15T10:59:15.166665Z"
    }
  ]
}

pulp task list --state waiting output

# pulp task list --state waiting
[
  {
    "pulp_href": "/pulp/api/v3/tasks/c4352eca-cf08-47ff-aa84-4c03b6c7cd48/",
    "pulp_created": "2021-06-17T23:32:31.291011Z",
    "state": "waiting",
    "name": "pulp_rpm.app.tasks.publishing.publish",
    "logging_cid": "45bdee2ce16448b394addf08d079b3d7",
    "started_at": null,
    "finished_at": null,
    "error": null,
    "worker": "/pulp/api/v3/workers/361e0bc2-79fd-481c-86ec-72881b090a74/",
    "parent_task": null,
    "child_tasks": [],
    "task_group": null,
    "progress_reports": [],
    "created_resources": [],
    "reserved_resources_record": [
      "/pulp/api/v3/repositories/rpm/rpm/2749521d-75bf-4e85-adfc-e6d1fd497606/"
    ]
  }
]

“worker named X is missing” output

# grep "is missing" messages-20210620
Jun 15 20:59:15 server pulpcore-worker-1[25225]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 22599@<servername> is missing. Canceling the tasks in its queue.
Jun 15 20:59:15 server pulpcore-worker-1[25225]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 22684@<servername> is missing. Canceling the tasks in its queue.
Jun 15 20:59:15 server pulpcore-worker-1[25225]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 22770@<servername> is missing. Canceling the tasks in its queue.
Jun 15 20:59:15 server pulpcore-worker-1[25225]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 22859@<servername> is missing. Canceling the tasks in its queue.
Jun 15 21:51:42 server pulpcore-resource-manager[1447]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 25230@<servername> is missing. Canceling the tasks in its queue.
Jun 15 21:51:42 server pulpcore-resource-manager[1447]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 25247@<servername> is missing. Canceling the tasks in its queue.
Jun 17 09:32:05 server pulpcore-worker-4[69469]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 1444@<servername> is missing. Canceling the tasks in its queue.
Jun 17 09:32:14 server pulpcore-worker-2[69495]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 1451@<servername> is missing. Canceling the tasks in its queue.
Jun 17 09:32:15 server pulpcore-worker-2[69495]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 1459@<servername> is missing. Canceling the tasks in its queue.
Jun 17 09:33:11 server pulpcore-worker-1[69639]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 1448@<servername> is missing. Canceling the tasks in its queue.
Jun 18 09:29:02 server pulpcore-worker-2[1402]: pulp [None]: pulpcore.tasking.services.worker_watcher:ERROR: The worker named 87558@<servername> is missing. Canceling the tasks in its queue.

This seems similar to issues being discussed during upgrades (Missing Pulp 3 workers on upgrade to Katello 4.0 - #7 by x9c4)

@x9c4 do you have any thoughts on how this can be navigated, or whether we need to investigate ways to prevent or handle this situation better?

The relevant info I see is:
Worker:

Task:

So ~3 seconds after the last heartbeat the task was scheduled, and another ~10 hours later the worker was considered missing. The two known issues with similar symptoms are:
https://pulp.plan.io/issues/8779
https://pulp.plan.io/issues/8708

Thanks I’ll follow those issues now.

Also a note for anyone else reading this having the same issues: do not install pulp-cli globally. Make sure you install it into a virtualenv. pulp-cli requires a newer version of PyYAML (5.4.1) than pulp can deal with, so pulp breaks if you install pulp-cli outside a virtualenv.

Also, I believe the Oracle Linux 8 AppStream repo triggers this issue pretty reliably. I upgraded to 4.0 because I was hitting this on 3.18 too (now that I know it’s a Pulp issue, that makes sense). This is what I’m using for the repo that breaks, in case anyone wants to reproduce it:

curl -o /tmp/RPM-GPG-KEY-oracle-ol8 https://yum.oracle.com/RPM-GPG-KEY-oracle-ol8
hammer content-credentials create --content-type gpg_key --name "RPM-GPG-KEY-oracle-ol8" --organization Default --path "/tmp/RPM-GPG-KEY-oracle-ol8"
hammer product create --description "Oracle Linux" --label "Oracle_Linux" --name "Oracle_Linux" --organization Default --gpg-key-id 1
hammer repository create --name "Oracle Linux 8 - AppStream Latest" --label "oraclelinux8-x86_64-appstream" --organization-label "Default" --content-type "yum" --gpg-key "RPM-GPG-KEY-oracle-ol8" --product "Oracle_Linux"  --download-policy "on_demand" --publish-via-http "yes" --url "http://yum.oracle.com/repo/OracleLinux/OL8/appstream/x86_64" --ignorable-content "distribution,srpm"

Syncing that repo ends in the task timing out.

It looks like those fixes are coming out in the new pulpcore soon. What’s the normal lag between a pulpcore release and it making it into a Katello release?

Hi, I have the same issues.
I did a clean install on RHEL 8 + Katello 4.0 and it broke during the first day while syncing and configuring everything.
After that I did a full data reset, but it started losing workers again the next day :confused:

Is there any way to clean up these zombie workers until this issue is properly fixed? Resetting all data is not an option.

I have three satellite servers (CentOS 7.9, Foreman 2.4.1, Katello 4.0.3), all three with sync status stuck with the note “waiting for Pulp to start the task”. I can’t do a database reset since all three are serving production systems, and I can’t have them running without being able to run tasks.

For example, I have a sync task that is in a cancelled state because it could not reach the uplink URL. There was a network problem that has since been resolved. However, I can’t start a new sync process because the lock is held by this one, which has result: warning and task: cancelled. That does not make sense.

@bradawk can you show us the output of the following from above?

[root@formanserver ~]# sudo -u pulp PULP_SETTINGS='/etc/pulp/settings.py' DJANGO_SETTINGS_MODULE='pulpcore.app.settings' pulpcore-manager shell <<EOF
from pulpcore.app.models import ReservedResource, Worker
worker_to_res = {}
for rr in ReservedResource.objects.all():
    worker_to_res[rr.worker_id] = rr.pulp_id
workers = [w.pulp_id for w in Worker.objects.online_workers()]
for rwork in worker_to_res:
    if rwork not in workers:
        print(f'Worker {rwork} owns ReservedResource {worker_to_res[rwork]} and is not in online_workers!!')
EOF

Also, can you share any related Dynflow task outputs from your cancelled task?

I did run that, and the cleanup portion as well. That got rid of a lot of my stuck tasks. I think our basic problem is that the uplink repository is having issues: the metadata says rpm “x” is available, but when our satellite server attempts to download it, it gets a 404 Not Found message, which then hangs our sync process. I think I need to wait for them to fix their issues before pursuing this further.

If you’re able to upgrade to Katello 4.1, that would probably help with the hanging tasks at least. Katello 4.1 and later ship a newer version of pulpcore with a revamped tasking system.