Reproducible deadlock on smart-proxy full-sync

areyus · December 6, 2021, 4:43pm

Problem:
When syncing content to one of our smart-proxies via the “Complete Sync” option, we reproducibly get a deadlock error. This does not happen when syncing to our other four smart-proxies, one of which hosts exactly the same content. Optimized Sync also goes through without errors.

Expected outcome:
Complete Sync completing without error.

Foreman and Proxy versions:
2.5.4

Foreman and Proxy plugin versions:

Plugins:
 - foreman-tasks 4.1.5
 - foreman_expire_hosts 7.0.4
 - foreman_hooks 0.3.17
 - foreman_remote_execution 4.5.6
 - foreman_scc_manager 1.8.10
 - foreman_snapshot_management 2.0.1
 - foreman_templates 9.1.0
 - katello 4.1.4
 - puppetdb_foreman 5.0.0

Distribution and version:
RHEL7

Other relevant data:
Error message in the task:

deadlock detected
DETAIL:  Process 105229 waits for ShareLock on transaction 1999179; blocked by process 104764.
Process 104764 waits for ShareLock on transaction 1999180; blocked by process 105229.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (3800,22) in relation "core_artifact"

I could not get more insight from Postgres logs, but maybe I just did not know what to look for.
We recently had a “No space left on device” on the /var partition on this host, so maybe that caused some DB inconsistency? If so, I would greatly appreciate help with sorting this out.
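
For anyone else unsure what to look for in the Postgres logs: below is a rough sketch of how the matching log entries could be pulled out. The default log directory is only an assumption (it depends on how PostgreSQL was installed on the smart-proxy), and the function names are made up for illustration. The PIDs and transaction IDs in the logs will differ from the ones quoted above.

```shell
# Print every "deadlock detected" report plus the lines that follow it.
# Argument 1: PostgreSQL log directory. The default below is an ASSUMPTION
# (SCL-style install on RHEL 7); adjust it to your system.
show_deadlocks() {
  grep -h -A 8 'deadlock detected' "${1:-/var/opt/rh/rh-postgresql12/lib/pgsql/data/log}"/*.log
}

# Read a deadlock report on stdin and print the backend PIDs involved,
# one per line, so they can be matched against other log lines.
deadlock_pids() {
  grep -oE 'Process [0-9]+ waits' | awk '{print $2}' | sort -u
}
```

Something like `show_deadlocks /path/to/pg/log | deadlock_pids` would then list the processes to search for in the rest of the log. The statements each process was running are usually logged right after the DETAIL lines, as the HINT in the error suggests.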

areyus · December 10, 2021, 11:28am

We managed to “solve” this ourselves. I am not sure what the underlying root cause was, but in case someone else stumbles upon this, here is what we did:

We got the IDs of the affected repositories from the task’s error summary. It turned out to be two repositories, both lifecycle-environment-specific “copies” of repositories in the current version of a CCV, which is why only one smart-proxy was affected.
We ran foreman-rake katello:reimport on our Foreman server, which solved the issue for one of the two repositories, but the second one still caused the sync to fail.
We then “demoted” the affected CCV in the affected lifecycle environment (by promoting an older version we still had), did a sync to the smart-proxy, ran both foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null and another foreman-rake katello:reimport, and then promoted the CCV back to the original version we had been encountering problems with.
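
The order of the commands above can be sketched as a small script. This is only an outline of what was described, not a tested fix: the `run` wrapper and the `recover_steps` name are made up for illustration, the promote and sync steps are left as comments because they were done by hand, and everything is printed rather than executed unless DRY_RUN=0 is set.

```shell
# Dry-run by default: commands are only printed. Set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_steps() {
  # 1. First reimport on the Foreman server (fixed one of the two repositories).
  run foreman-rake katello:reimport

  # 2. Manual step: promote an older version of the affected CCV to the
  #    affected lifecycle environment, then sync the smart-proxy.

  # 3. Clean up orphaned content and reimport again.
  #    (The original command also redirected stdout to /dev/null.)
  run foreman-rake katello:delete_orphaned_content RAILS_ENV=production
  run foreman-rake katello:reimport

  # 4. Manual step: promote the CCV back to the original version and
  #    run the Complete Sync to the smart-proxy again.
}

recover_steps
```

Run as is, it just lists the three foreman-rake calls in order, which makes it easy to check the sequence before running anything on the Foreman server.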

I am not entirely sure which of these steps actually solved the problem, but it’s gone now and the smart-proxy is syncing without errors again.

nixfu · June 27, 2023, 2:56pm

Getting the exact same error on full sync on Katello 4.8.1.

sajha · June 27, 2023, 6:26pm

Is this also reproducible on every sync, and is the task error similar? There were some improvements around this recently, but I can’t remember the details off the top of my head. I can check some more based on the error log for the task.