Katello - virt-who triggered hypervisors task broken for certain Org

areyus · September 6, 2021, 11:39am

Problem:
For the past few days, virt-who has been failing to create new hypervisor hosts in Katello or to update the guest-to-host mapping. This happens for all compute resources reporting to our main organization, yet the virt-who reports for a secondary organization work just fine. Only our production environment is affected; no test environment shows the problem. We are still running Foreman 2.0.3 with Katello 3.15.3 (I know, quite old) and virt-who 0.28.10 in all environments.
We noticed that the Dynflow tasks look broken for the affected organization. The task finishes in under 1 second (according to Dynflow it is 0.00s) and is missing the step “Actions::Candlepin::AsyncHypervisors” (compared to the tasks from the working org), so we only have an “Actions::Katello::Host::HypervisorsUpdate” step. Foreman Tasks shows the raw input as:

{
  "hypervisors": null,
  "current_request_id": null,
  "current_timezone": "UTC",
  "current_user_id": null,
  "current_organization_id": null,
  "current_location_id": null
}

After a lot of debugging and digging through the logs, the only error we were able to find is this from production.log with debug enabled:

2021-09-06T12:36:45 [D|kat|d824aced] Candlepin request d222080b-5e6f-4096-87f5-b84c46e4d81d returned with code 200
2021-09-06T12:36:45 [D|kat|d824aced] Processing response: 200
2021-09-06T12:36:45 [D|kat|d824aced] {"created":null,"updated":null,"id":null,"name":"hypervisor_update","group":null,"origin":"deployp001.srv.muenchen.de","executor":null,"principal":"foreman_admin","state":"ABORTED","previousState":"CREATED","startTime":null,"endTime":null,"attempts":0,"maxAttempts":1,"statusPath":null,"resultData":"Job queuing blocked by the following jobs: 8a206c957b95f64c017b95f79f980000","key":"HypervisorUpdateJob"}

From the third line, it looks like another job is currently blocking the queue in Artemis that the hypervisor update task should write to. The only explanation we can think of is that last week our /var/lib/candlepin partition hit 90% disk space usage and Candlepin suddenly stopped working. We were taken by surprise that the maximum allowed disk usage is 90% and so could not react in time. After increasing the disk space everything appeared to work again, so we assume this might be the root cause of our current problem and we just did not notice it at the time, but this is pure speculation.
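For anyone who wants to check their own production.log for the same symptom, here is a minimal Python sketch that pulls the blocking job ID out of such a debug line. The helper name `blocking_job_ids` is made up for illustration, the payload is shortened to the relevant fields, and the message format and the 32-character hex ID pattern are assumptions based only on the single log line above, not on any documented Candlepin format.

```python
import json
import re

# Debug line taken from the production.log excerpt above,
# with the JSON payload shortened to the fields that matter here.
LOG_LINE = (
    '2021-09-06T12:36:45 [D|kat|d824aced] '
    '{"name":"hypervisor_update","state":"ABORTED","previousState":"CREATED",'
    '"resultData":"Job queuing blocked by the following jobs: '
    '8a206c957b95f64c017b95f79f980000",'
    '"key":"HypervisorUpdateJob"}'
)


def blocking_job_ids(log_line):
    """Return IDs of jobs reported as blocking an ABORTED job, or [] if none.

    Hypothetical helper: the wording of resultData and the ID format are
    assumptions derived from one observed log line.
    """
    start = log_line.find("{")
    if start == -1:
        return []  # no JSON payload on this line
    try:
        payload = json.loads(log_line[start:])
    except json.JSONDecodeError:
        return []  # line contains a brace but is not a JSON payload
    if payload.get("state") != "ABORTED":
        return []
    result = payload.get("resultData") or ""
    marker = "blocked by the following jobs:"
    if marker not in result:
        return []
    # Assumption: job IDs are 32 lowercase hex characters, as in the log above.
    return re.findall(r"[0-9a-f]{32}", result.split(marker, 1)[1])


if __name__ == "__main__":
    print(blocking_job_ids(LOG_LINE))
```

Running this over every line of the log would show whether it is always the same job ID that blocks the queue, which is what we saw in our case.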
Anyway, is there a way to “unstick” or remove the blocking task from the queue, or are there other diagnostic steps we should take first?
Expected outcome:
virt-who reports should be processed again for all organizations.
Foreman and Proxy versions:
Foreman 2.0.3
Katello 3.15.3

Distribution and version:
RHEL 7
Other relevant data:
If there are any further logs that could help analyze the issue, I will be happy to provide them. Just let me know what might be helpful.

areyus · September 6, 2021, 11:52am

Just after I finished the writeup, a coworker found the solution that brought guest-to-host mapping and host importing back to life.
The solution in this Redmine comment fixed the issue for us. I will leave the thread here so that anyone else who runs into this issue can find it more easily via the forum search.