Lots of "Mysql2::Error: Deadlock found when trying to get lock" under increased load

After I got the debug output, I deleted this host from Foreman, and on
its next attempt it registered perfectly fine - no issues with
interfaces or anything anymore:

(on a client side)

Discovered by URL: https://spc.vip
Registering host with Foreman (https://spc.vip)
Response from Foreman 201: {"id":447371,"name":"mac3cfdfe52252c" <snip>

(on Foreman side):
2017-09-19 12:39:34 7ca37aca [app] [I] Started DELETE "/discovered_hosts/mac3cfdfe52252c" for 10.102.141.20 at 2017-09-19 12:39:34 -0700
2017-09-19 12:39:34 7ca37aca [app] [I] Parameters: {"authenticity_token"=>"<removed>", "id"=>"mac3cfdfe52252c"}

2017-09-19 12:40:04 1a346c39 [app] [I] Started POST "/api/v2/discovered_hosts/facts" for 10.102.141.20 at 2017-09-19 12:40:04 -0700
2017-09-19 12:40:04 1a346c39 [app] [I] Processing by Api::V2::DiscoveredHostsController#facts as JSON
2017-09-19 12:40:04 1a346c39 [app] [I] Parameters: {"facts"=>"[FILTERED]", "apiv"=>"v2", "discovered_host"=>{"facts"=>"[FILTERED]"}}
2017-09-19 12:40:06 1a346c39 [audit] [I] [mac3cfdfe52252c] deleted 0 (1694.6ms)
2017-09-19 12:40:06 1a346c39 [audit] [I] [mac3cfdfe52252c] updated 0 (2.6ms)
2017-09-19 12:40:07 1a346c39 [audit] [I] [mac3cfdfe52252c] added 385 (1637.5ms)
2017-09-19 12:40:07 1a346c39 [app] [I] Import facts for 'mac3cfdfe52252c' completed. Added: 385, Updated: 0, Deleted 0 facts

It would be nice to figure out what's causing this in the first place - I do see a lot of those "Unprocessable Entity" messages logged.
Thanks!

BTW, Lukas, you mentioned that some improvements were made in 1.14. I am
running 1.14.1 and 1.14.3.
Did you mean 1.15 maybe? Should I even consider an upgrade to help resolve
this situation?

A MAC address can only exist once; if you already have a
(managed/unmanaged) host and you try to discover a host with the same MAC,
you will get an error. Depending on the Foreman discovery version it is
either a 422 or "Host already exists":
https://github.com/theforeman/foreman_discovery/commit/210f143bc85c58caeb67e8bf9a5cc2edbe764683

Anyway, you wrote that you have deadlocks, but in the log snippet I see
host discovery at a rate of 1-2 imports per minute. This cannot block
anything; it is quite a slow rate. I don't understand - can you pastebin
a log snippet from the peak time when you have these deadlocks?

··· On Tue, Sep 19, 2017 at 10:03 PM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

Well no, the biggest update was for 1.14, here:

http://projects.theforeman.org/issues/9016

That focused on memory consumption though; there was a little speedup but
nothing big.

··· On Tue, Sep 19, 2017 at 10:05 PM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

>
> A MAC address can only exist once, if you already have a
> (managed/unmanaged) host and you try to discover a host with same MAC,
> you will get error. Depending on Foreman discovery it is either 422 or
> "Host already exists":
>
> https://github.com/theforeman/foreman_discovery/commit/210f143bc85c58caeb67e8bf9a5cc2edbe764683
>

Hmm, one generic question on this. According to the above logic, if my managed
host crashed, say because it lost its HW RAID controller, it can't boot off
the disk anymore and thus falls back to PXE boot (given that the BIOS boot
order is set that way), correct?

Now, by default, the Foreman default pxeconfig file makes a system boot off
its disk, which in this particular situation results in an endless loop
until some monitoring external to Foreman detects a system failure, a human
gets on a console, and only then does real troubleshooting start.
That does not scale beyond a hundred systems or so. For this reason, in our
current setup where we don't use Foreman for OS provisioning but only for
system discovery, I've updated the default pxeconfig to always load a
discovery OS. This covers both the new-system and the crashed-system scenarios
I described above. Each discovered host is reported to a higher layer of
orchestration on an after_commit event (roughly the hook wiring sketched
below), and that orchestration handles OS provisioning on its own, so the
discovered system never ends up among the managed hosts in Foreman. Once OS
provisioning is done, the higher layer comes back and deletes the host it just
provisioned from the discovered hosts. If the orchestration detects that a
hook call from Foreman reports a system that was previously provisioned, such
a system is automatically marked "maintenance" and HW diagnostics are
auto-started. Based on the result of that, the orchestration starts either a
HW replacement flow or troubleshooting of a new problem.
As you can see, humans are only involved very late in the process, and only if
auto-remediation is not possible (a HW component failed, an unknown signature
was detected). Otherwise, in large-scale environments it is just impossible to
attend to each failed system individually. Such an automation flow saves us
hundreds of man-hours, as you can imagine.

Now, with that in mind, I was thinking of moving the actual OS provisioning
tasks to Foreman as well. However, if a crashed system is never allowed
to re-register (get discovered) because it is already managed by Foreman,
the above flow is just not going to work anymore and I'd have to re-think all
the flows. Are there specific reasons why this is in place? I understand that
this is how it is implemented now, but is there a bigger idea behind it? If
so, what is it? Also, if you take my example of stitching flows together for
complete system lifecycle management, what would you suggest we could do
differently to allow Foreman to be the system we use for both discovery
and OS provisioning?
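
For reference, the after_commit hook I mentioned above is essentially a
foreman_hooks script along these lines (a rough sketch only; the install path
is from memory and the orchestration URL is a placeholder, not our real
endpoint):

#!/bin/bash
# dropped into (verify the exact path for your foreman_hooks version):
#   /usr/share/foreman/config/hooks/host/discovered/after_commit/70_notify_orchestration
# foreman_hooks passes the event name and the object label as arguments and
# the object JSON on stdin
event="$1"
object="$2"
payload="$(cat)"
# placeholder URL - the real orchestration endpoint obviously differs
curl -fsS -X POST -H 'Content-Type: application/json' \
     -d "$payload" "https://orchestrator.example.com/discovered/${object}?event=${event}"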

Another thing (not as generic as above, but actually very applicable to my
current issue): if a client system is not allowed to register and is given a
422 error, for example, it keeps trying to register, resulting in a huge
amount of work. This is also a gap, IMHO - the discovery plug-in needs to do
this differently somehow, so rejected systems do not take away Foreman
resources (see below for actual numbers of such attempts in one of my
clusters).

> Anyway you wrote you have deadlocks, but in the log snippet I do see
> that you have host discovery at rate 1-2 imports per minute. This
> cannot block anything, this is quite slow rate. I don't understand,
> can you pastebin log snippet from the peak time when you have these
> deadlocks?
>

After more digging since this issue was reported to me, it does not look
load-related to me. Even with a low number of registrations, I see a high
rate of deadlocks. I took another Foreman cluster (3 active nodes as well)
and see the following activity as it pertains to system discovery (since
3:30am this morning):

[root@spc01 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
282

[root@spc02 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
2278

[root@spc03 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
143

These are the numbers of attempts rejected (all of them are 422s):

[root@spc01 ~]# grep Entity /var/log/foreman/production.log | wc -l
110

[root@spc02 ~]# grep Entity /var/log/foreman/production.log | wc -l
2182

[root@spc03 ~]# grep Entity /var/log/foreman/production.log | wc -l
57

The number of deadlocks:

[root@spc01 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
59

[root@spc02 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
31

[root@spc03 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
30

Actual deadlock messages are here -
https://gist.github.com/anonymous/a20f4097396037cd30903d232a3e6d0f

As you can see, most of them are locked on attempts to update facts.

That large number of registration attempts and rejects on the spc02 node is
mostly contributed by a single host:

[root@spc02 ~]# grep "Multiple discovered hosts found with MAC address" /var/log/foreman/production.log | wc -l
1263

[root@spc02 ~]# grep "Multiple discovered hosts found with MAC address" /var/log/foreman/production.log | head -1
2017-09-20 04:39:15 de3ee3bf [app] [W] Multiple discovered hosts found with MAC address 00:8c:fa:f1:ab:e4, choosing one

After I removed both incomplete "mac008cfaf1abe4"
<https://spc.vip.phx.ebay.com/discovered_hosts/mac008cfaf1abe4> records,
that system finally was able to register properly.

Here's also a full debug I took yesterday - it is a single host trying to
register. Unfortunately, this one does not have any deadlocks -
https://gist.github.com/anonymous/47fe4baa60fc5285b70faf37e6f797af

Do you want me to try to catch one of those deadlocks?

All this makes me think that the root cause of this behavior may be
outside of Foreman - 2 obvious things spring to mind:
(a) the load-balanced active/active configuration of my Foreman nodes - even
though I do have source_address binding enabled for connections to 443 on the
Foreman vServer on the LB, maybe there's more to it. This is rather easy to
verify - I'm going to shut off the other 2 instances and see if I get any
deadlocks again.
(b) the second possibility is the Galera-based MySQL. This one is harder to
check, but if the first option does not help me, I'll have to convert the DB
back to a single node and see. If this turns out to be the issue, it is very
bad, as that would mean no proper HA for the Foreman DB, so I'm hoping this is
not the case.
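
If it does turn out to be (b), Galera's certification conflicts surface to the
application as deadlock errors, so I plan to watch counters like these on the
DB nodes (a quick sketch; run with whatever client credentials apply):

# certification failures and brute-force aborts - both are reported to the
# application as "deadlock found" errors on a Galera cluster
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts'"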

While I'm working on this, please let me know if I can provide any more info
or if you have any other suggestions, etc.
Thanks!

··· On Wednesday, September 20, 2017 at 3:55:43 AM UTC-7, Lukas Zapletal wrote:

Hey, you are absolutely right that this is a huge design gap in
discovery. We are tracking a refactor ticket to redesign how
discovered hosts are stored, but this is a complete change of how
discovered hosts are provisioned (you would not be able to use the
New Hosts screen, for example). I think this change will happen as soon
as we redesign the new host form to be a session-full wizard.

A workaround could be a setting that would attempt to delete the existing
host when a new one is discovered, but this would be a very dangerous
thing (security-related); not sure that is feasible even as opt-in.

In the past, we have seen these deadlocks (on fact_name or fact_value)
because this is a very busy table - discovery, facter/ENC and other
plugins (katello rhsm, openscap, ansible…) are all writing there or
changing data. I am unable to tell from the info you provided what is
going on - you need to dig deeper.
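
If you want to dig, MySQL keeps the last detected deadlock (including the two
queries and the locks involved) in the InnoDB status output; something like
this on the DB node should show it (adjust credentials for your setup):

# print only the "LATEST DETECTED DEADLOCK" section of the InnoDB status
mysql -u root -p -e 'SHOW ENGINE INNODB STATUS\G' \
  | awk '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS/'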

One more idea - we have seen similar deadlocks (on different tables, though)
when the background (cron) job we ship by default attempts to delete old
reports. Can you check if there is any cronjob or any other process
doing some management of facts? Even deleting a lot of data can block
all updates for a long time (minutes to hours). Perhaps try to disable
all Foreman jobs and re-test.
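
For example (a sketch; these are the usual RPM locations, so adjust for your
setup):

# the report-expiration job we ship normally lives here
cat /etc/cron.d/foreman
# anything else that might be purging facts or reports
grep -ril 'expire\|fact' /etc/cron.d/ /etc/cron.daily/ 2>/dev/null
crontab -l -u foreman 2>/dev/null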

LZ

··· On Thu, Sep 21, 2017 at 2:27 AM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

A few extra thoughts on this, since a lot of it is still based on my
design from nearly 5 years ago :wink:

>
> Hmm, one generic question on this - according to above logic, if my
> managed host had crashed, say because it lost its HW RAID controller,
> for example, so it can't boot off the disk anymore thus resulting in
> PXE boot (given that BIOS boot order is set that way), correct?

> Now, by default, Foreman default pxeconfig file makes a system to
> boot off its disk, which in this particular situation will result in
> endless loop until some external (to Foreman) monitoring detects a
> system failure, then a human gets on a console and real
> troubleshooting starts only then.

This is absolutely true. We had, at one time, considered adding a state
machine (or similar) to Foreman, so that such things (as well as boot
loops in Kickstart, and so forth) could be detected, but it was never
completed.

> Now, with that in mind, I was thinking of moving actual OS
> provisioning tasks to Foreman as well. However, if crashed system
> would never be allowed to re-register (get discovered) because it is
> already managed by Foreman, the above flow is just not going to work
> anymore and I'd have re-think all flows. Are there specific reasons
> why this in place? I understand that this is how it is implemented
> now, but is there a bigger idea behind that? If so, what is it?

There were two goals - to prevent duplicates (if unprovisioned hosts
are rebooted, for example), and to allow recycling (delete a host from
Foreman, reboot it, and it'll be back in the discovered hosts list to
be re-used). Neither of these is insurmountable by other means, but
this was the easiest approach.

> Also, if you take my example of flows stitching for a complete system
> lifecycle management, what would you suggest we could do differently
> to allow Foreman to be a system that we use for both discovery and OS
> provisioning?

As Lukas says, a full refactor may well happen, and we'd love input on
that as we go forward. For a workaround today, I'd probably lean
towards a secondary plugin that sits on top of Discovery and interacts
with the registration process - given your example, you could add a
check whether the registration matches a host that's already provisioned,
and take further action if so. That might also be a good way to
proof-of-concept some ideas, before merging the code back into Discovery.

> Another thing (not as generic as above, but actually very applicable
> to my current issue) - if a client system is not allowed to register
> and given 422 error, for example, it keeps trying to register
> resulting in huge amount of work. This is also a gap, IMHO -
> discovery plug-in needs to do this differently somehow so rejected
> systems do not take away Foreman resources (see below for actual
> numbers of such attempts in one of my cluster).

I think I agree - the hosts should keep retrying until they get a
response from Foreman, but then actions can be taken. I'd probably be
in favour of keeping the retry (so that, say, if the offending MAC is
removed in Foreman, the host can register on the next retry), but
perhaps split the process into two calls. The first is a light "am I
registered?" call that returns true/false, and only if false would the
heavier registration call be made. Does that work?
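
In rough client-side terms, something like this (purely illustrative - the
"am I registered?" URL below is a hypothetical stand-in, not an existing
endpoint; only the facts upload URL exists today):

# hypothetical lightweight check before the heavy facts upload
if curl -fsk "https://spc.vip/api/v2/discovered_hosts/am_i_registered/mac3cfdfe52252c" >/dev/null; then
  echo "already registered - skip the heavy upload and retry later"
else
  # the existing registration call (facts.json stands in for the gathered facts)
  curl -sk -H 'Content-Type: application/json' \
       -d @/tmp/facts.json "https://spc.vip/api/v2/discovered_hosts/facts"
fi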

Thanks!
Greg

··· On Wed, 2017-09-20 at 17:27 -0700, 'Konstantin Orekhov' via Foreman users wrote:

> This is absolutely true. We had, at one time, considered adding a state
> machine (or similar) to Foreman, so that such things (as well as boot
> loops in Kickstart, and so forth) could be detected, but it was never
> completed.
>

A state machine would be nice, as it allows for more actions to be taken for
a machine in different states. For example, in some other threads I was
asking about the ability to use RemoteExec for discovered hosts, not just
managed hosts as it is now.
Proper hooks for systems entering/leaving any of those states would also open
up a lot of opportunities.

> As Lukas says, a full refactor may well happen, and we'd love input on
> that as we go forward.

Are any of you guys going to PuppetConf this year? If so, can we meet and
have a discussion on this maybe?

> I think I agree - the hosts should keep retrying until they get a
> response from Foreman, but then actions can be taken. I'd probably be
> in favour of keeping the retry (so that, say, if the offending MAC is
> removed in Foreman, the host can register on the next retry), but
> perhaps split the process into two calls. The first is a light "am I
> registered?" call that returns true/false, and only if false would the
> heavier registration call be made. Does that work?
>

Yes, this would definitely work. This is also one of the states of a
system in the state machine we talked about above.

> One more idea - we have seen similar (but different tables) deadlocks
> when a background (cron) job we ship by default attempts to delete old
> reports. Can you check if there is any cronjob or any other process
> doing some management of facts? Even deleting lot of data can block
> all updates for a long time (minutes to hours). Perhaps try to disable
> all foreman jobs and re-test.
>

I have tried this to no avail. However, I think the culprit of the problem is
a very slow MySQL DELETE query, which apparently happens even for
absolutely new, freshly-discovered systems as well as already-discovered
ones.

2017-09-28 13:09:49 c75f5c40 [sql] [D] SQL (50843.2ms) DELETE FROM `fact_values` WHERE `fact_values`.`id` IN

Please see these gists I've recorded with SQL debug enabled. I have a ton
of hosts doing exactly the same thing - they try to register, the MySQL
delete times out (it takes up to 50 sec as you can see), some rollback
happens, and it times out again. And so on and so forth until the systems
register one by one. This results in many empty or duplicate entries even
for a small batch of systems coming online at the same time.

https://gist.github.com/anonymous/a721e220d82f5160450e483b8776489d

The above examples are taken from a single Foreman instance running against
a regular (non-Galera) MySQL DB, so at least I can say that the fact that I
had several Foreman instances behind a load-balancer talking to
Galera-replicated MySQL has nothing to do with this behavior. The only
difference is that in the Galera-enabled DB, the timeout errors are replaced
with deadlock errors, which makes total sense - if the delete operation takes
almost a minute, no wonder it results in some rows being locked. As load
increases (more systems registering at the same time), more and more such
errors happen, so I believe the proper way to deal with this is to optimize
the MySQL query first and then go from there. Would you agree?

Ok, I can see there is a subselect; these are sometimes painful,
particularly for MySQL. We fixed that already, see fact_importer.rb
(this is the develop branch):

def delete_removed_facts
  ActiveSupport::Notifications.instrument "fact_importer_deleted.foreman", :host_id => host.id, :host_name => host.name, :facts => facts, :deleted => [] do |payload|
    delete_query = FactValue.joins(:fact_name).where(:host => host, 'fact_names.type' => fact_name_class.name).where.not('fact_names.name' => facts.keys)
    if ActiveRecord::Base.connection.adapter_name.downcase.starts_with? 'mysql'
      # MySQL does not handle delete with inner query correctly (slow) so we will do two queries on purpose
      payload[:count] = @counters[:deleted] = FactValue.where(:id => delete_query.pluck(:id)).delete_all
    else
      # deletes all facts using a single SQL query with inner query otherwise
      payload[:count] = @counters[:deleted] = delete_query.delete_all
    end
  end
end

See the comment there - do you have this in your instance? If not, git
blame the commit and apply it; you have some older version, I assume.
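
Roughly like this (a sketch; the commit sha is a placeholder and
/usr/share/foreman is the usual RPM install path, so adjust as needed):

# in a foreman git checkout, find the commit that introduced the MySQL two-query path
git log --oneline -S 'MySQL does not handle delete with inner query' -- app/services/fact_importer.rb
# export it and apply it to the installed application
git format-patch -1 <commit-sha> --stdout > fact_importer_mysql.patch
patch -p1 -d /usr/share/foreman < fact_importer_mysql.patch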

··· On Wed, Oct 4, 2017 at 1:38 AM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

> > As Lukas says, a full refactor may well happen, and we'd love input
> > on that as we go forward.
>
> Any of you, guys, going to PuppetConf this year? If so, can we meet
> and have a discussion on this maybe?

I certainly won't be, sadly. I'll ask around and see if anyone is
heading down.

> > I think I agree - the hosts should keep retrying until they get a
> > response from Foreman, but then actions can be taken. I'd probably
> > be in favour of keeping the retry (so that, say, if the offending
> > MAC is removed in Foreman, the host can register on the next
> > retry), but perhaps split the process into two calls. The first is
> > a light "am I registered?" call that returns true/false, and only
> > if false would the heavier registration call be made. Does that
> > work?
>
> Yes, this would definitely work. This is also is one of the states of
> a system in the state machine we talked about above.

Agreed, a proper place to hook it would be ideal, I'm just throwing
ideas out that might help in the short term. Sounds like Lukas has you
covered on the DB locking issues anyway though :stuck_out_tongue:

Greg

··· On Tue, 2017-10-03 at 16:12 -0700, 'Konstantin Orekhov' via Foreman users wrote:

> See the comment there, do you have this in your instance? If not git
> blame the commit and apply it. You have some older version I assume.
>

Yes, I'm running several 1.14.1 and 1.14.3 instances/clusters. Both had the
same issue with deadlocks. I've updated 2 of them with the above patch and was
lucky enough to immediately observe a registration of at least 62 systems go
through without a single error.
I'll monitor things more, but so far this is a huge step forward.

Thanks!

>
> Agreed, a proper place to hook it would be ideal, I'm just throwing
> ideas out that might help in the short term.
>

Sure, it would be a nice thing to have as a starting point.

Let us know next week if this helped. I highly suggest upgrade to
1.15, it is a very solid release.

LZ

··· On Wed, Oct 4, 2017 at 11:04 PM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

> Let us know next week if this helped. I highly suggest upgrade to
> 1.15, it is a very solid release.
>
>
Is this patch for MySQL a part of 1.15? As you suggested, I've taken it
from the develop branch, so I assumed it is not released yet.

Plus, there are 2 other things that worry me:

Do you think all of the above will make it into 1.15.5?

> Let us know next week if this helped. I highly suggest upgrade to
> 1.15, it is a very solid release.
>

Are there any performance improvements for the Smart Proxy in 1.15, BTW?

Lately, in one of my busiest locations, we've started seeing a strange
issue where SmP stops responding on 8443 for API calls. The process itself
is running, log messages are logged, just no response from it:

[root@spc01 ~]# systemctl start foreman-proxy

[root@spc01 ~]# date; curl --connect-timeout 30 -kSs https://localhost:8443/bmc; date
Thu Oct 5 17:53:36 MST 2017
curl: (7) Failed connect to localhost:8443; Connection refused
Thu Oct 5 17:53:36 MST 2017

It does take ~30 seconds to start up in our env because of the large DHCP
dataset, during which time connections are refused.

[root@spc01 ~]# date; curl --connect-timeout 30 -kSs https://localhost:8443/bmc; date
Thu Oct 5 17:53:49 MST 2017
curl: (28) NSS: client certificate not found (nickname not specified)
Thu Oct 5 17:54:19 MST 2017

Then it starts working for a very short period of time (above) and then
stops (below).

[root@spc01 ~]# date; curl --connect-timeout 30 -kSs https://localhost:8443/bmc; date
Thu Oct 5 17:54:24 MST 2017
curl: (28) Operation timed out after 30001 milliseconds with 0 out of 0 bytes received
Thu Oct 5 17:54:54 MST 2017

So far there's nothing in proxy.log that helps me identify the issue. I
can't replicate it at will no matter what I do - I had a bunch of clients
hitting different APIs for a couple of days, nothing.
Then today the above happened, and the only thing that helped was to move
the SmP from one node to another (I really wish the DHCP SmP would allow for
active/active horizontal scaling instead of being limited to a single
node).
Strace is useless, as it only gives this when tracing the "ruby foreman-proxy"
process:

[root@spc03 ~]# strace -p 12526
strace: Process 12526 attached
futex(0x184e634, FUTEX_WAIT_PRIVATE, 1, NULL^Cstrace: Process 12526 detached
<detached …>

I tried https://github.com/tmm1/rbtrace, but it is so heavy that it
actually pretty much kills SmP by itself.

Do you have any suggestions on ways to troubleshoot this? I have DEBUG
enabled with these values:

:log_buffer: 4000
:log_buffer_errors: 2000

Also, is there a way to move SmP from WEBrick to Apache/Passenger, if that
makes sense at all? If so, any docs? Any other ways to increase the
performance? It does feel like a performance issue to me.

Thanks!

I've not seen that, no - I've CC'd someone who might know :wink:

Greg

··· On Thu, 2017-10-05 at 18:35 -0700, 'Konstantin Orekhov' via Foreman users wrote:


IRC / Twitter: @gwmngilfen
Diaspora: gwmngilfen@joindiaspora.com

Hello,

please use git to find out which branches it landed in, I believe the
MySQL facter patch is 1.15+ only.

For 1.15.5 you need to talk with the release engineer of this version,
which is Daniel; if the changes are small enough I see no reason not
to include them. I think it's too late for 1.15.5 though, maybe .6.

For the smart proxy, there was a patch done by Dmitri, who redesigned the
DHCP parser; it's much more capable and faster now. I think this landed in
1.16 RC1, yeah: Refactor #19441: Rewrite isc dhcpd config parser to support various nested blocks - Smart Proxy - Foreman
(http://projects.theforeman.org/issues/19441,
https://github.com/theforeman/smart-proxy/commit/21813c6cde0d2be10747682f1a001a7c0bd3ffb9)

I have not heard about unresponsive smart-proxy processes - can you check
system limits (open handles etc.)? SELinux? Firewall? Any proxy plugins
enabled? Then file a Redmine bug; I haven't seen that before.
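
For example, checks along these lines (a sketch; adjust the process match if
your init script names it differently):

PID=$(pgrep -f foreman-proxy | head -1)   # the smart-proxy ruby process
ls /proc/$PID/fd | wc -l                  # open file handles in use
grep 'open files' /proc/$PID/limits       # the per-process limit
ps -o nlwp= -p $PID                       # thread count
ss -tlnp | grep 8443                      # is the listener still there?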

··· On Fri, Oct 6, 2017 at 3:35 AM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

> please use git to find out which branches it landed in, I believe the
> MySQL facter patch is 1.15+ only.
>

Yes, I already found that and am planning an upgrade in our lab instance.
BTW, even after applying the patch (on 1.14), which helped tremendously, from
time to time I still get some duplicate entries caused by an already
discovered system trying to send its discovery payload. For whatever reason,
though, the Foreman discovery plugin does not recognize it as the same host
and creates a new entry in the DB with a different ID:

<https://lh3.googleusercontent.com/-ouz9YJ25K5M/WdvXTxpj4lI/AAAAAAAAAEE/1LcC1bcM2PUzzuH3gQtTQuT0Bx7mFHKTQCLcBGAs/s1600/Screen%2BShot%2B2017-10-09%2Bat%2B12.59.22%2BPM.png>

The host in question keeps retrying, of course, and gets 422 "Name
already taken" over and over again. My question, though, is why a duplicate
was created instead of updating the existing host? It seems to me that this
host was not recognized properly. A complete debug log of the operation that
I believe resulted in the above duplicate entry is here -


I did not see any 422 errors before this transaction, so I think this is it.
Although I did not see any long MySQL queries, the whole transaction still
took ~11 seconds to complete for some reason:

2017-10-07 01:26:16 e3da2f90 [app] [I] Completed 422 Unprocessable Entity in 11291ms (Views: 0.4ms | ActiveRecord: 172.9ms)

> For smart proxy, there was a patch done by Dmitri who redesigned DHCP
> parser, it's much more capable and faster now. I think this landed in
> 1.16 RC1, yeah: Refactor #19441: Rewrite isc dhcpd config parser to support various nested blocks - Smart Proxy - Foreman
> <http://projects.theforeman.org/issues/19441>
> (https://github.com/theforeman/smart-proxy/commit/21813c6cde0d2be10747682f1a001a7c0bd3ffb9)
>
>

From my side, any performance improvement for the DHCP SmP is always a
welcome change :slight_smile:

> I did not hear about unresponsive smart-proxy processes, can you check
> system limits (open handles etc)? SELinux? Firewall. Any proxy plugins
> enable? Then file a redmine bug, haven't seen that.
>

That's the problem - no smoking gun that I could find. No system resource
shortages logged, the system itself is a rather beefy VM that does not even
break a sweat, no firewalls, SELinux set to permissive mode. I only run 3 SmP
modules - bmc, dhcp and tftp.
On top of that, since I can't replicate this at will, I have to wait until
the issue manifests itself naturally.

And just to make it clear - it is not that the SmP process becomes completely
unresponsive, only the API-facing part. That's why I'm wondering whether
moving away from WEBrick to Apache or Nginx with Passenger is a possibility.
Another question along the same lines - is it possible to run each of the
smart proxies as a separate process (listening on different ports)
instead of one process with several proxy modules and a single port? For
example, in this particular situation even if one SmP were having an issue,
it would not affect the other 2; it would also pinpoint the troubled proxy,
simplifying troubleshooting efforts.

Thanks!
Konstantin.


Hey

> I did not see any 422 error before this transaction so I think this is it.
> Although I did not see any long MySQL queries, the whole transaction still
> took ~11 seconds to complete for some reason:
>

"Name has already been taken" - this usually means that a host (either
discovered, managed, or unmanaged) with that name "macXXXXXXXXX" already
exists. Same MAC address? You can easily change how discovered hosts are
named; by default it is "mac" + MAC address, but you can change that to a
random number or any other fact you want. See the settings and our
documentation. Try adding a random number at the end to see if that helps.
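
For example, via hammer (a sketch; the setting names discovery_prefix and
discovery_hostname are from memory, so verify them under Administer >
Settings > Discovered before relying on this):

# hypothetical values - adjust the prefix/fact to whatever naming you want
hammer settings set --name discovery_prefix --value 'srv'
hammer settings set --name discovery_hostname --value 'discovery_bootif'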

>
>
>> For smart proxy, there was a patch done by Dmitri who redesigned DHCP
>> parser, it's much more capable and faster now. I think this landed in
>> 1.16 RC1, yeah: Refactor #19441: Rewrite isc dhcpd config parser to support various nested blocks - Smart Proxy - Foreman
>> <http://projects.theforeman.org/issues/19441>
>> (https://github.com/theforeman/smart-proxy/commit/21813c6cde0d2be10747682f1a001a7c0bd3ffb9)
>>
>
> From my side, any performance improvements for DHCP SmP is always a
> welcomed change :slight_smile:
>
>
>> I did not hear about unresponsive smart-proxy processes, can you check
>> system limits (open handles etc)? SELinux? Firewall. Any proxy plugins
>> enable? Then file a redmine bug, haven't seen that.
>>
>
> That's the problem - no smoking gun that I could find. No system resource
> shortages logged, system itself is a rather beefy VM that does not even
> sweat, no firewalls, selinux set to permissive mode. I only run 3 SmP -
> bmc, dhcp and tftp.
> On top of that, since I can't replicate this at will, I have to wait until
> this issue manifests itself naturally.
>
> And just to make it clear - it is not that SmP process becomes completely
> unresponsive, but only an API-facing part. That's why I'm wondering if
> moving away from Webrick to Apache or Nginx with Passenger is a possibility.
>

The proxy is a regular Sinatra app, so any Rack server should do the trick
(Puma, perhaps). I'd try that to see if it helps. It might be a bug in
WEBrick; try to downgrade or upgrade it.

Another question along the same lines - is it possible to run each of the
> smart-proxies as a separate processes (listening on different ports)
> instead of one with several proxies and single port. For example, in this
> particular situation even if one SmP was having an issue, it would not
> affect the other 2, also it would also pinpoint the troubled proxy
> simplifying troubleshooting efforts.
>

We don't support that, unfortunately.