Lots of "Mysql2::Error: Deadlock found when trying to get lock" under increased load

lzap · September 22, 2017, 8:08am

Hey, you are absolutely right that this is huge design gap in
discovery, we are tracking a refactor ticket to redesign how
discovered hosts are stored, but this is complete change of how
discovery hosts are being provisioned (you would not be able to use
New Hosts screen for example). I think this change will happen as soon
as we redesign new host form to be a session-full wizard.

A workaround could be a setting that would attempt to delete existing
host when new one is discovered, but this would be very dangerous
thing (security related), not sure if that is feasible even via
opt-in.

In the past, we have seen these deadlocks (on fact_name or fact_value)
because this is very busy table - discovery, facter/ENC and other
plugins (katello rhsm, openscap, ansible…) are all writing there or
changing data. I am unable to tell from info you provided what is
going on - you need to dig deeper.

One more idea - we have seen similar (but different tables) deadlocks
when a background (cron) job we ship by default attempts to delete old
reports. Can you check if there is any cronjob or any other process
doing some management of facts? Even deleting lot of data can block
all updates for a long time (minutes to hours). Perhaps try to disable
all foreman jobs and re-test.

LZ

···

On Thu, Sep 21, 2017 at 2:27 AM, 'Konstantin Orekhov' via Foreman users wrote: > > On Wednesday, September 20, 2017 at 3:55:43 AM UTC-7, Lukas Zapletal wrote: >> >> A MAC address can only exist once, if you already have a >> (managed/unmanaged) host and you try to discover a host with same MAC, >> you will get error. Depending on Foreman discovery it is either 422 or >> "Host already exists": >> >> https://github.com/theforeman/foreman_discovery/commit/210f143bc85c58caeb67e8bf9a5cc2edbe764683 > > > Hmm, one generic question on this - according to above logic, if my managed > host had crashed, say because it lost its HW RAID controller, for example, > so it can't boot off the disk anymore thus resulting in PXE boot (given that > BIOS boot order is set that way), correct? > Now, by default, Foreman default pxeconfig file makes a system to boot off > its disk, which in this particular situation will result in endless loop > until some external (to Foreman) monitoring detects a system failure, then a > human gets on a console and real troubleshooting starts only then. > That does not scale beyond a 100 systems or so. For this reason in our > current setup where we *don't* use Foreman for OS provisioning but only for > system discovery, I've updated the default pxeconfig to always load a > discovery OS. This covers both a new systems and a crashed system scenario I > described above. Each of discovered hosts is reported to a higher layer of > orchestration on a after_commit event and that orchestration handles OS > provisioning on its own so the discovered system never ends up in managed > hosts in Foreman. Once OS provisioning is done, higher layer comes and > deletes a host it just provisioned from discovered hosts. If orchestration > detects that a hook call from Foreman reports a system that was previously > provisioned, such system is automatically marked "maintenance" and HW > diagnostics auto-started. Based on the result of that, orchestration will > start either a HW replacement flow or a new problem troubleshooting starts. > As you can see, humans are only involved very late in a process and only if > auto-remediation is not possible (HW component failed, unknown signature > detected). Otherwise, at large scale environments it is just impossible to > attend to each of failed system individually. Such automation flow is > allowing us to save hundreds of man-hours, as you can imagine. > Now, with that in mind, I was thinking of moving actual OS provisioning > tasks to Foreman as well. However, if crashed system would never be allowed > to re-register (get discovered) because it is already managed by Foreman, > the above flow is just not going to work anymore and I'd have re-think all > flows. Are there specific reasons why this in place? I understand that this > is how it is implemented now, but is there a bigger idea behind that? If so, > what is it? Also, if you take my example of flows stitching for a complete > system lifecycle management, what would you suggest we could do differently > to allow Foreman to be a system that we use for both discovery and OS > provisioning? > > Another thing (not as generic as above, but actually very applicable to my > current issue) - if a client system is not allowed to register and given 422 > error, for example, it keeps trying to register resulting in huge amount of > work. This is also a gap, IMHO - discovery plug-in needs to do this > differently somehow so rejected systems do not take away Foreman resources > (see below for actual numbers of such attempts in one of my cluster). > >> >> Anyway you wrote you have deadlocks, but in the log snippet I do see >> that you have host discovery at rate 1-2 imports per minute. This >> cannot block anything, this is quite slow rate. I don't understand, >> can you pastebin log snippet from the peak time when you have these >> deadlocks? > > > After more digging I've done after this issue was reported to me, it does > not look to me as load-related. Even with low number of registrations, I see > a high rate of deadlocks. I took another Foreman cluster (3 active nodes as > well) and see the following activity as it pertains to system discovery > (since 3:30am this morning): > > [root@spc01 ~]# grep "/api/v2/discovered_hosts/facts" > /var/log/foreman/production.log | wc -l > 282 > > [root@spc02 ~]# grep "/api/v2/discovered_hosts/facts" > /var/log/foreman/production.log | wc -l > 2278 > > [root@spc03 ~]# grep "/api/v2/discovered_hosts/facts" > /var/log/foreman/production.log | wc -l > 143 > > These are the numbers of attempts rejected (all of them are 422s): > > [root@spc01 ~]# grep Entity /var/log/foreman/production.log | wc -l > 110 > > [root@spc02 ~]# grep Entity /var/log/foreman/production.log | wc -l > 2182 > > [root@spc03 ~]# grep Entity /var/log/foreman/production.log | wc -l > 57 > > A number of deadlocks: > > [root@spc01 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l > 59 > > [root@spc02 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l > 31 > > [root@spc03 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l > 30 > > Actual deadlock messages are here - > https://gist.github.com/anonymous/a20f4097396037cd30903d232a3e6d0f > As you can see, most of them are locked on attempts to update facts. > > That large number of attempts to register and rejects on spc02 node is > mostly contributed by a single host: > > [root@spc02 ~]# grep "Multiple discovered hosts found with MAC address" > /var/log/foreman/production.log | wc -l > 1263 > > [root@spc02 ~]# grep "Multiple discovered hosts found with MAC address" > /var/log/foreman/production.log | head -1 > 2017-09-20 04:39:15 de3ee3bf [app] [W] Multiple discovered hosts found with > MAC address 00:8c:fa:f1:ab:e4, choosing one > > After I removed both incomplete "mac008cfaf1abe4" records, that system has > finally was able to properly register. > > Here's also a full debug I took yesterday - it is a single host trying to > register. Unfortunately, this one does not have any deadlocks - > https://gist.github.com/anonymous/47fe4baa60fc5285b70faf37e6f797af > Do you want me to try to get one of those deadlock? > > All this business makes me think that root cause of this behavior may be > outside of Foreman - 2 obvious things spring to mind: > (a) local-balanced active/active configuration of my Foreman nodes - even > though I do have source_address binding enabled for connections to 443 on > Foreman vServer on a LB, maybe there's more to it. This is rather easy to > verify - I'm going to shut off other 2 instances and see if I get any > deadlock again. > (b) Second possibility is Galera-based MySQL. This one is harder to check, > but if the first option does not help me, I'll have to convert a DB back to > single node and see. If this turns out to do an issue, it is very bad as > that would mean no proper HA for a Foreman DB, so I'm hoping this is not the > case. > > While I'm working on, please let me know if I can provide any more info or > if you have any other suggestions, etc. > Thanks! > > -- > You received this message because you are subscribed to the Google Groups > "Foreman users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to foreman-users+unsubscribe@googlegroups.com. > To post to this group, send email to foreman-users@googlegroups.com. > Visit this group at https://groups.google.com/group/foreman-users. > For more options, visit https://groups.google.com/d/optout.

–
Later,
Lukas @lzap Zapletal