Katello-nightly-rpm-pipeline 120 failed

Katello nightly pipeline failed:

https://ci.theforeman.org/job/katello-nightly-rpm-pipeline/120/

foreman-katello-nightly-test (failed)
foreman-katello-upgrade-nightly-test (passed)

Yet again failing on the deadlock during the “Delete an Organization” bats test.
Can someone from the @katello team please take a look? This fails on practically every other run, delaying nightly releases and causing lots of false alarms in this category.

I opened Bug #26821 (PG::TRDeadlockDetected: ERROR: deadlock detected when deleting an Organization) for this some time ago.

I can take a look at this. I’ll try to get to the root of the problem, as it’s not clear what is causing the lock, but we can always try wrapping the destroy action in ActiveRecord::Base.connection_pool.with_connection and see if the deadlock disappears.

I ran the bats tests locally 12 times and wasn’t able to reproduce the deadlock. I’m not sure what is going on, but I think our best bet is to try things out that either work around the problem or give us more information when it happens.

I created this PR to sleep 20 seconds before deleting the organization, which may help to avoid the deadlock. I would rather try this solution first before modifying Katello code (such as wrapping the destroy in with_connection, as I suggested earlier), since it only affects the bats tests and not user-facing code.
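A minimal sketch of the sleep-before-delete idea, written as a plain shell helper since bats tests are bash. The helper name `delay_then_run` is hypothetical, and the `hammer` invocation in the usage comment is only an assumption about how the test deletes the organization; the actual PR may look different.

```shell
# Hypothetical helper (not taken from the PR): wait, then run the given command.
# The helper's exit status is the exit status of the command it runs.
delay_then_run() {
  local delay="$1"
  shift
  sleep "$delay"
  "$@"
}

# Assumed usage inside the "Delete an Organization" test:
#   delay_then_run 20 hammer organization delete --name "$ORGANIZATION"
```

Keeping the delay in a helper makes it easy to drop again once the root cause is found.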

We can also add the following to the file to give us more info on the deadlocks when they do occur:

sudo su - postgres -c ' psql foreman -c "SELECT relation::regclass, * FROM pg_locks WHERE NOT GRANTED;"'
sudo su - postgres -c ' psql foreman -c "
  SELECT blocked_locks.pid AS blocked_pid,
         blocked_activity.usename AS blocked_user,
         blocking_locks.pid AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query AS blocked_statement,
         blocking_activity.query AS current_statement_in_blocking_process
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
  JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
   AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
   AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
   AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
   AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
   AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
   AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
   AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
   AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
   AND blocking_locks.pid != blocked_locks.pid
  JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
  WHERE NOT blocked_locks.GRANTED;"'

I tried this locally, but since I couldn’t reproduce the deadlock, it wasn’t of much use.
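One way to wire the lock queries into the test run is to dump them only when a step fails. This is a rough sketch, not what the PR does: `run_with_lock_dump` and `dump_pg_locks` are hypothetical names, and the `hammer` call in the usage comment is an assumption. Note that PostgreSQL resolves a detected deadlock by aborting one of the transactions, so by the time the failing command returns, the dump may well be empty.

```shell
# Hypothetical diagnostic: the first pg_locks query from above, wrapped in a function.
dump_pg_locks() {
  sudo su - postgres -c ' psql foreman -c "SELECT relation::regclass, * FROM pg_locks WHERE NOT GRANTED;"'
}

# Hypothetical helper: run a command and, only if it fails, run the named
# diagnostic function. Diagnostics go to stderr; the original exit status
# of the command is preserved.
run_with_lock_dump() {
  local dump_fn="$1"
  shift
  local status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "command failed with status $status, dumping ungranted locks" >&2
    "$dump_fn" >&2 || true
  fi
  return "$status"
}

# Assumed usage inside the "Delete an Organization" test:
#   run_with_lock_dump dump_pg_locks hammer organization delete --name "$ORGANIZATION"
```

Because the helper returns the original status, the test still fails as before; the only difference is the extra output in the log.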

Hope this helps! :man_shrugging: I’m out Friday and next week so feel free to merge the PR if it could be helpful :slight_smile:

I opened a PR that I suspect will solve the issue here:

Looks like we’re still seeing failures:
https://ci.centos.org/job/foreman-luna-nightly-test/54/tapResults/
Or is Luna using an outdated Katello build that doesn’t contain this fix yet?

The Luna build is kicked off by a passing Katello run, so it shouldn’t use outdated packages.