Lots of "Mysql2::Error: Deadlock found when trying to get lock" under increased load

Hi, all!

Under increased load (which comes in spikes), I noticed lots of MySQL
deadlock errors resulting in failed transactions and incorrectly discovered
systems (the duplicate and/or empty entries in discovered_hosts I reported in
this group some time ago; I just can't find those posts for some reason).

Anyway, these are the type of messages I receive:

2017-09-14 15:01:13 173c1d40 [app] [E] Fact processor37 could not be
imported because of Mysql2::Error: Deadlock found when trying to get lock;
try restarting transaction: SELECT 1 AS one FROM fact_values WHERE
(fact_values.fact_name_id = BINARY 248 AND fact_values.host_id =
446074) LIMIT 1
2017-09-14 15:01:14 173c1d40 [audit] [I] [mac90e2bae93da0] added 353
(2693.0ms)
2017-09-14 15:01:14 173c1d40 [app] [W] Error during fact import for
mac90e2bae93da0
> ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when
trying to get lock; try restarting transaction: SELECT 1 AS one FROM
fact_values WHERE (fact_values.fact_name_id = BINARY 248 AND
fact_values.host_id = 446074) LIMIT 1
> /opt/theforeman/tfm/root/usr/share/gems/gems/mysql2-0.4.5/lib/mysql2/client.rb:120:in `_query'
> /opt/theforeman/tfm/root/usr/share/gems/gems/mysql2-0.4.5/lib/mysql2/client.rb:120:in `block in query'
> /opt/theforeman/tfm/root/usr/share/gems/gems/mysql2-0.4.5/lib/mysql2/client.rb:119:in `handle_interrupt'
> /opt/theforeman/tfm/root/usr/share/gems/gems/mysql2-0.4.5/lib/mysql2/client.rb:119:in `query'
<snip>

I run an active/active cluster of three 1.14.x Foreman VMs with a replicated
MariaDB (MySQL) backend.

I saw a couple of people asking the same questions in the IRC chat, but I
could not find any responses to them over there.

Does anyone have any suggestions/recommendations? Is anything like
https://github.com/qertoip/transaction_retry planned for Foreman, instead
of simply failing the transactions?
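
Just to illustrate what I mean - a minimal sketch of the kind of retry-on-deadlock
wrapper that gem provides (hypothetical code, not anything that exists in Foreman
today; the import_facts call at the end is made up):

# Hypothetical sketch only: retry a block when MySQL reports a deadlock,
# similar in spirit to transaction_retry. Not Foreman code.
def with_deadlock_retry(max_attempts = 3)
  attempts = 0
  begin
    ActiveRecord::Base.transaction { yield }
  rescue ActiveRecord::StatementInvalid => e
    raise unless e.message.include?('Deadlock found when trying to get lock')
    attempts += 1
    raise if attempts >= max_attempts
    sleep(0.1 * attempts)  # small backoff before restarting the transaction
    retry
  end
end

# hypothetical usage:
# with_deadlock_retry { import_facts(host, facts) }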

Thanks!

Hey,

What kind of load do you have? Puppet? Facter? Is it ENC? Something else?

Can you tell which requests are slow from logs or monitoring?

LZ

··· On Fri, Sep 15, 2017 at 3:35 AM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

>
> what kind of load do you have? Puppet? Facter? Is that ENC? Something
> else?
>
> Can you tell which requests are slow from logs or monitoring?
>
>
Yes, I should have mentioned that - there's very little Puppet and ENC work
done by this cluster at this point (more is coming soon, though). Host
discovery is by far the largest workload - 7600 discovered systems at this
point. The last spike where I saw an impact to the overall flows was when
300-400 systems were trying to register at the same time. Because of the
deadlocks, about 200-300 systems repeatedly failed to register and had to
keep retrying for a rather long time.
Rather often these registration attempts would end up creating either
duplicate entries with the same "mac<mac>" name but different IDs in the DB, or an
"empty" discovered host entry. Both of these prevent a system from
registering successfully until they are removed (I had to write a little
script that runs from cron to do so - a rough sketch follows the JSON examples
below). Here are examples of an "empty" record and duplicate ones (as they get deleted):

{
"id": 437923,
"name": "mac90e2bae6cc70",
"last_compile": null,
"last_report": null,
"updated_at": "2017-08-22T07:08:54.000Z",
"created_at": "2017-08-22T07:08:54.000Z",
"root_pass": "<removed>",
"architecture_id": null,
"operatingsystem_id": null,
"environment_id": null,
"ptable_id": null,
"medium_id": null,
"build": false,
"comment": null,
"disk": null,
"installed_at": null,
"model_id": null,
"hostgroup_id": null,
"owner_id": null,
"owner_type": null,
"enabled": true,
"puppet_ca_proxy_id": null,
"managed": false,
"use_image": null,
"image_file": null,
"uuid": null,
"compute_resource_id": null,
"puppet_proxy_id": null,
"certname": null,
"image_id": null,
"organization_id": null,
"location_id": null,
"otp": null,
"realm_id": null,
"compute_profile_id": null,
"provision_method": null,
"grub_pass": "",
"global_status": 0,
"lookup_value_matcher": null,
"discovery_rule_id": null,
"salt_proxy_id": null,
"salt_environment_id": null,
"pxe_loader": null
}

Duplicates (usually the later duplicate is an empty one as well, but not
always):

{
"id": 430090,
"name": "mac3417ebe3f8f1",
"last_compile": null,
"last_report": "2017-09-14T19:47:55.000Z",
"updated_at": "2017-09-14T19:47:57.000Z",
"created_at": "2017-03-08T20:24:05.000Z",
"root_pass": "<removed>",
"architecture_id": null,
"operatingsystem_id": null,
"environment_id": null,
"ptable_id": null,
"medium_id": null,
"build": false,
"comment": null,
"disk": null,
"installed_at": null,
"model_id": 3,
"hostgroup_id": null,
"owner_id": 10,
"owner_type": "User",
"enabled": true,
"puppet_ca_proxy_id": null,
"managed": false,
"use_image": null,
"image_file": null,
"uuid": null,
"compute_resource_id": null,
"puppet_proxy_id": null,
"certname": null,
"image_id": null,
"organization_id": null,
"location_id": null,
"otp": null,
"realm_id": null,
"compute_profile_id": null,
"provision_method": null,
"grub_pass": "",
"global_status": 0,
"lookup_value_matcher": null,
"discovery_rule_id": null,
"salt_proxy_id": null,
"salt_environment_id": null,
"pxe_loader": null
}
{
"id": 438146,
"name": "mac3417ebe3f8f1",
"last_compile": null,
"last_report": "2017-09-11T08:58:05.000Z",
"updated_at": "2017-09-11T08:58:07.000Z",
"created_at": "2017-08-24T19:44:23.000Z",
"root_pass": "<removed>",
"architecture_id": null,
"operatingsystem_id": null,
"environment_id": null,
"ptable_id": null,
"medium_id": null,
"build": false,
"comment": null,
"disk": null,
"installed_at": null,
"model_id": null,
"hostgroup_id": null,
"owner_id": null,
"owner_type": null,
"enabled": true,
"puppet_ca_proxy_id": null,
"managed": false,
"use_image": null,
"image_file": null,
"uuid": null,
"compute_resource_id": null,
"puppet_proxy_id": null,
"certname": null,
"image_id": null,
"organization_id": null,
"location_id": null,
"otp": null,
"realm_id": null,
"compute_profile_id": null,
"provision_method": null,
"grub_pass": "",
"global_status": 0,
"lookup_value_matcher": null,
"discovery_rule_id": null,
"salt_proxy_id": null,
"salt_environment_id": null,
"pxe_loader": null
}
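
This is just a rough sketch of the kind of cleanup script I mean - placeholder
URL/credentials and a simplistic empty/duplicate heuristic based on the records
above, not my actual script:

#!/usr/bin/env ruby
# Rough, illustrative sketch of a cron cleanup against the Discovery API.
# It removes discovered hosts that look "empty" and the younger record of any
# duplicate pair sharing the same mac<mac> name.
require 'net/http'
require 'json'
require 'uri'

FOREMAN = URI('https://foreman.example.com')   # placeholder
AUTH    = %w[admin changeme]                   # placeholder credentials

def api(http, request_class, path)
  req = request_class.new("/api/v2#{path}", 'Accept' => 'application/json')
  req.basic_auth(*AUTH)
  JSON.parse(http.request(req).body)
rescue JSON::ParserError
  {}
end

Net::HTTP.start(FOREMAN.host, FOREMAN.port, use_ssl: true) do |http|
  hosts = api(http, Net::HTTP::Get, '/discovered_hosts?per_page=10000').fetch('results', [])

  # "empty" records look like the first JSON example: no model, no last_report
  empties = hosts.select { |h| h['model_id'].nil? && h['last_report'].nil? }

  # for duplicated names keep the oldest record, delete the rest
  duplicates = hosts.group_by { |h| h['name'] }
                    .values
                    .select { |same_name| same_name.size > 1 }
                    .flat_map { |same_name| same_name.sort_by { |h| h['created_at'] }[1..-1] }

  (empties | duplicates).each do |h|
    puts "deleting #{h['name']} (id #{h['id']})"
    api(http, Net::HTTP::Delete, "/discovered_hosts/#{h['id']}")
  end
end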

I can't tell if any queries are slow - can you remind me how to do that?
Thanks!

This is how it looks in the WebUI: one screenshot shows the "empty" entries,
and another shows that most of those empties also have duplicates in the same
form (inline screenshots attached, not reproduced here).

>
>
>> what kind of load do you have? Puppet? Facter? Is that ENC? Something
>> else?
>>
>> Can you tell which requests are slow from logs or monitoring?
>>
>>
> Yes, I should have mentioned that - there's very little puppet and ENC
> work done by this cluster at this point (more is coming soon though). Host
> discovery is by far the largest workload - 7600 discovered systems at this
> point. The last spike that I saw the impact to overall flows was when
> 300-400 systems were trying to register at the same time. Because of the
> deadlocks, about 200-300 systems could not register repeatedly and had to
> keep retrying for a rather long time.
> Rather often these registration attempts would end up creating either
> duplicate entries with the same "mac<mac>" but different IDs in a DB or an
> "empty" discovery host entry. Both of these would prevent a system
> successfully register unless they are removed (I had to write a little
> script that runs from the cron to do so). Here are the examples of an
> "empty" record and duplicate ones (as they get deleted):
>
>
Lukas - how about we change discovery to be async? E.g. import all newly
discovered systems into an Active Job queue and then process them one or
several at a time? I assume this would require an image change too (so it
knows when the discovery "job" is done).

··· On Fri, Sep 15, 2017 at 8:56 PM, 'Konstantin Orekhov' via Foreman users < foreman-users@googlegroups.com> wrote:


I would rather fix the importing code to be faster than go async; that
is the last resort.

Konstantin, thanks for the analysis. Our import code is indeed slow; we
improved it a bit in 1.14. Note we mostly test this on PostgreSQL. For
each import, there is a log entry at INFO level showing how much time was
spent in each phase of the import (delete, add, update). Can you share the
numbers there?

What I think happens is that by default every node tries to update
facts every 5 minutes. At that scale, you need to increase this to a
more reasonable value.

When Foreman is busy, these requests can stack up. We are not using
transactions, so imports can fail, leaving incorrect records.

LZ

··· On Tue, Sep 19, 2017 at 8:12 AM, Ohad Levy wrote:


Later,
Lukas @lzap Zapletal

Here you go, Lukas (just one host that can't register and keeps on
retrying):

2017-09-19 11:45:55 5de80a14 [audit] [I] [mac3cfdfe52252c] deleted 0
(1898.9ms)
2017-09-19 11:45:56 5de80a14 [audit] [I] [mac3cfdfe52252c] updated 0
(575.8ms)
2017-09-19 11:45:56 5de80a14 [audit] [I] [mac3cfdfe52252c] added 0 (3.2ms)
2017-09-19 11:45:56 5de80a14 [app] [I] Import facts for 'mac3cfdfe52252c'
completed. Added: 0, Updated: 0, Deleted 0 facts
2017-09-19 11:46:28 1c0d90c8 [audit] [I] [mac3cfdfe52252c] deleted 0
(1787.2ms)
2017-09-19 11:46:29 1c0d90c8 [audit] [I] [mac3cfdfe52252c] updated 0
(495.7ms)
2017-09-19 11:46:29 1c0d90c8 [audit] [I] [mac3cfdfe52252c] added 0 (2.9ms)
2017-09-19 11:46:29 1c0d90c8 [app] [I] Import facts for 'mac3cfdfe52252c'
completed. Added: 0, Updated: 0, Deleted 0 facts
2017-09-19 11:47:02 7de14afb [audit] [I] [mac3cfdfe52252c] deleted 0
(1705.5ms)
2017-09-19 11:47:02 7de14afb [audit] [I] [mac3cfdfe52252c] updated 0
(612.4ms)
2017-09-19 11:47:02 7de14afb [audit] [I] [mac3cfdfe52252c] added 0 (4.2ms)
2017-09-19 11:47:02 7de14afb [app] [I] Import facts for 'mac3cfdfe52252c'
completed. Added: 0, Updated: 0, Deleted 0 facts
2017-09-19 11:47:35 51585eea [audit] [I] [mac3cfdfe52252c] deleted 0
(1755.6ms)
2017-09-19 11:47:36 51585eea [audit] [I] [mac3cfdfe52252c] updated 0
(1187.4ms)
2017-09-19 11:47:36 51585eea [audit] [I] [mac3cfdfe52252c] added 0 (5.3ms)
2017-09-19 11:47:36 51585eea [app] [I] Import facts for 'mac3cfdfe52252c'
completed. Added: 0, Updated: 0, Deleted 0 facts
2017-09-19 11:48:09 4b643a56 [audit] [I] [mac3cfdfe52252c] deleted 0
(1895.9ms)
2017-09-19 11:48:10 4b643a56 [audit] [I] [mac3cfdfe52252c] updated 0
(536.7ms)
2017-09-19 11:48:10 4b643a56 [audit] [I] [mac3cfdfe52252c] added 0 (4.4ms)
2017-09-19 11:48:10 4b643a56 [app] [I] Import facts for 'mac3cfdfe52252c'
completed. Added: 0, Updated: 0, Deleted 0 facts
2017-09-19 11:48:48 8bc4666b [audit] [I] [mac3cfdfe52252c] deleted 0
(1653.8ms)
2017-09-19 11:48:48 8bc4666b [audit] [I] [mac3cfdfe52252c] updated 0
(708.9ms)
2017-09-19 11:48:48 8bc4666b [audit] [I] [mac3cfdfe52252c] added 0 (4.1ms)
2017-09-19 11:48:48 8bc4666b [app] [I] Import facts for 'mac3cfdfe52252c'
completed. Added: 0, Updated: 0, Deleted 0 facts
2017-09-19 11:49:21 c58afee7 [audit] [I] [mac3cfdfe52252c] deleted 0
(1739.9ms)
2017-09-19 11:49:22 c58afee7 [audit] [I] [mac3cfdfe52252c] updated 0
(862.5ms)
2017-09-19 11:49:22 c58afee7 [audit] [I] [mac3cfdfe52252c] added 0 (3.1ms)
2017-09-19 11:49:22 c58afee7 [app] [I] Import facts for 'mac3cfdfe52252c'
completed. Added: 0, Updated: 0, Deleted 0 facts

On the client side (mac3cfdfe52252c), in the foreman-discovery log, I see these
messages:

Discovered by URL: https://<foreman_url>
Registering host with Foreman (https://<foreman_url>)
Response from Foreman 422: {"message":"Validation failed: Interfaces some interfaces are invalid"}

Over and over again.

On Foreman side:

grep 323e78f4 /var/log/foreman/production.log

2017-09-19 11:51:00 323e78f4 [app] [I] Started POST "/api/v2/discovered_hosts/facts" for 10.102.141.20 at 2017-09-19 11:51:00 -0700
2017-09-19 11:51:00 323e78f4 [app] [I] Processing by Api::V2::DiscoveredHostsController#facts as JSON
2017-09-19 11:51:00 323e78f4 [app] [I] Parameters: {"facts"=>"[FILTERED]", "apiv"=>"v2", "discovered_host"=>{"facts"=>"[FILTERED]"}}
2017-09-19 11:51:02 323e78f4 [audit] [I] [mac3cfdfe52252c] deleted 0 (1718.3ms)
2017-09-19 11:51:03 323e78f4 [audit] [I] [mac3cfdfe52252c] updated 0 (613.2ms)
2017-09-19 11:51:03 323e78f4 [audit] [I] [mac3cfdfe52252c] added 0 (3.6ms)
2017-09-19 11:51:03 323e78f4 [app] [I] Import facts for 'mac3cfdfe52252c' completed. Added: 0, Updated: 0, Deleted 0 facts
2017-09-19 11:51:03 323e78f4 [app] [W] Subnet could not be detected for 10.212.36.110
2017-09-19 11:51:03 323e78f4 [app] [W] Host discovery failed, facts: <cut out facts>
<snip>
2017-09-19 11:51:03 323e78f4 [app] [I] Completed 422 Unprocessable Entity in 2823ms (Views: 0.7ms | ActiveRecord: 1923.9ms)

I'm going to enable debug logging to see if I can get more data on why some interfaces are considered invalid here. If you know - please let me know.
Also, if you need to see all the facts sent from a client system - please let me know too.

Thanks!

After I got the debug output, I deleted this host from Foreman, and on
its next attempt it registered perfectly fine - no issues with
interfaces or anything anymore:

(on a client side)

Discovered by URL: https://spc.vip
Registering host with Foreman (https://spc.vip)
Response from Foreman 201: {"id":447371,"name":"mac3cfdfe52252c" <snip>

(on Foreman side):
2017-09-19 12:39:34 7ca37aca [app] [I] Started DELETE "/discovered_hosts/mac3cfdfe52252c" for 10.102.141.20 at 2017-09-19 12:39:34 -0700
2017-09-19 12:39:34 7ca37aca [app] [I] Parameters: {"authenticity_token"=>"<removed>", "id"=>"mac3cfdfe52252c"}

2017-09-19 12:40:04 1a346c39 [app] [I] Started POST "/api/v2/discovered_hosts/facts" for 10.102.141.20 at 2017-09-19 12:40:04 -0700
2017-09-19 12:40:04 1a346c39 [app] [I] Processing by Api::V2::DiscoveredHostsController#facts as JSON
2017-09-19 12:40:04 1a346c39 [app] [I] Parameters: {"facts"=>"[FILTERED]", "apiv"=>"v2", "discovered_host"=>{"facts"=>"[FILTERED]"}}
2017-09-19 12:40:06 1a346c39 [audit] [I] [mac3cfdfe52252c] deleted 0 (1694.6ms)
2017-09-19 12:40:06 1a346c39 [audit] [I] [mac3cfdfe52252c] updated 0 (2.6ms)
2017-09-19 12:40:07 1a346c39 [audit] [I] [mac3cfdfe52252c] added 385 (1637.5ms)
2017-09-19 12:40:07 1a346c39 [app] [I] Import facts for 'mac3cfdfe52252c' completed. Added: 385, Updated: 0, Deleted 0 facts

It would be nice to figure out what's causing this in the first place - I do see a lot of those "Unprocessable Entity" messages logged.
Thanks!

BTW, Lukas, you mentioned that some improvements were made in 1.14. I am
running 1.14.1 and 1.14.3.
Did you mean 1.15 maybe? Should I even consider an upgrade to help resolve
this situation?

A MAC address can only exist once: if you already have a
(managed/unmanaged) host and you try to discover a host with the same MAC,
you will get an error. Depending on the Foreman Discovery version it is
either a 422 or "Host already exists":
https://github.com/theforeman/foreman_discovery/commit/210f143bc85c58caeb67e8bf9a5cc2edbe764683

Anyway, you wrote you have deadlocks, but in the log snippet I see
host discovery at a rate of 1-2 imports per minute. That
cannot block anything; it is quite a slow rate. I don't understand -
can you pastebin a log snippet from the peak time when you see these
deadlocks?

··· On Tue, Sep 19, 2017 at 10:03 PM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

Well, no, the biggest update was in 1.14:

http://projects.theforeman.org/issues/9016

That focused on memory consumption though; there was a little speedup but
nothing big.

··· On Tue, Sep 19, 2017 at 10:05 PM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

>
> A MAC address can only exist once, if you already have a
> (managed/unmanaged) host and you try to discover a host with same MAC,
> you will get error. Depending on Foreman discovery it is either 422 or
> "Host already exists":
>
> https://github.com/theforeman/foreman_discovery/commit/210f143bc85c58caeb67e8bf9a5cc2edbe764683
>

Hmm, one generic question on this - according to the above logic, if my managed
host crashed, say because it lost its HW RAID controller, it can't boot off
the disk anymore and ends up PXE booting (given that the BIOS boot order is
set that way), correct?
Now, by default, Foreman's default pxeconfig file makes a system boot off
its disk, which in this particular situation results in an endless loop
until some monitoring external to Foreman detects the system failure; only
then does a human get on a console and real troubleshooting start.
That does not scale beyond 100 systems or so. For this reason, in our
current setup where we don't use Foreman for OS provisioning but only for
system discovery, I've updated the default pxeconfig to always load a
discovery OS. This covers both the new-system and the crashed-system scenarios
I described above. Each discovered host is reported to a higher layer
of orchestration on an after_commit event, and that orchestration handles OS
provisioning on its own, so the discovered system never ends up among managed
hosts in Foreman. Once OS provisioning is done, the higher layer comes back and
deletes the host it just provisioned from discovered hosts. If the orchestration
detects that a hook call from Foreman reports a system that was previously
provisioned, such a system is automatically marked "maintenance" and HW
diagnostics are auto-started. Based on the result of that, the orchestration
starts either a HW replacement flow or troubleshooting of a new problem.
As you can see, humans are only involved very late in the process and only if
auto-remediation is not possible (a HW component failed, an unknown signature
was detected). Otherwise, in large-scale environments it is simply impossible to
attend to each failed system individually. Such an automation flow
saves us hundreds of man-hours, as you can imagine.
Now, with that in mind, I was thinking of moving the actual OS provisioning
tasks to Foreman as well. However, if a crashed system is never allowed
to re-register (get discovered) because it is already managed by Foreman,
the above flow is just not going to work anymore and I'd have to re-think all
the flows. Are there specific reasons why this is in place? I understand that
this is how it is implemented now, but is there a bigger idea behind it? If
so, what is it? Also, taking my example of flow stitching for complete
system lifecycle management, what would you suggest we do differently to
allow Foreman to be the system we use for both discovery and OS provisioning?

Another thing (not as generic as the above, but very applicable to my
current issue) - if a client system is not allowed to register and is given a
422 error, for example, it keeps trying to register, resulting in a huge
amount of work. This is also a gap, IMHO - the discovery plug-in needs to handle
this differently somehow so that rejected systems do not take away Foreman
resources (see below for actual numbers of such attempts in one of my
clusters).

> Anyway you wrote you have deadlocks, but in the log snippet I do see
> that you have host discovery at rate 1-2 imports per minute. This
> cannot block anything, this is quite slow rate. I don't understand,
> can you pastebin log snippet from the peak time when you have these
> deadlocks?
>

After more digging since this issue was reported to me, it does
not look load-related to me. Even with a low number of registrations, I
see a high rate of deadlocks. I looked at another Foreman cluster (3 active
nodes as well) and see the following activity as it pertains to system
discovery (since 3:30am this morning):

[root@spc01 ~]# grep "/api/v2/discovered_hosts/facts"
/var/log/foreman/production.log | wc -l
282

[root@spc02 ~]# grep "/api/v2/discovered_hosts/facts"
/var/log/foreman/production.log | wc -l
2278

[root@spc03 ~]# grep "/api/v2/discovered_hosts/facts"
/var/log/foreman/production.log | wc -l
143

These are the numbers of attempts rejected (all of them are 422s):

[root@spc01 ~]# grep Entity /var/log/foreman/production.log | wc -l
110

[root@spc02 ~]# grep Entity /var/log/foreman/production.log | wc -l
2182

[root@spc03 ~]# grep Entity /var/log/foreman/production.log | wc -l
57

A number of deadlocks:

[root@spc01 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
59

[root@spc02 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
31

[root@spc03 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
30

Actual deadlock messages are here -
https://gist.github.com/anonymous/a20f4097396037cd30903d232a3e6d0f

As you can see, most of them are locked on attempts to update facts.

That large number of registration attempts and rejects on the spc02 node is
mostly contributed by a single host:

[root@spc02 ~]# grep "Multiple discovered hosts found with MAC address"
/var/log/foreman/production.log | wc -l
1263

[root@spc02 ~]# grep "Multiple discovered hosts found with MAC address"
/var/log/foreman/production.log | head -1
2017-09-20 04:39:15 de3ee3bf [app] [W] Multiple discovered hosts found with
MAC address 00:8c:fa:f1:ab:e4, choosing one

After I removed both incomplete "mac008cfaf1abe4"
<https://spc.vip.phx.ebay.com/discovered_hosts/mac008cfaf1abe4> records,
that system was finally able to register properly.

Here's also a full debug log I took yesterday - it is a single host trying to
register. Unfortunately, this one does not have any deadlocks -
https://gist.github.com/anonymous/47fe4baa60fc5285b70faf37e6f797af

Do you want me to try to capture one of those deadlocks?

All this makes me think that the root cause of this behavior may be
outside of Foreman - two obvious things spring to mind:
(a) The load-balanced active/active configuration of my Foreman nodes - even
though I do have source_address binding enabled for connections to 443 on
the Foreman vServer on the LB, maybe there's more to it. This is rather easy to
verify - I'm going to shut off the other two instances and see if I get any
deadlocks again.
(b) The second possibility is Galera-based MySQL. This one is harder to check,
but if the first option does not help me, I'll have to convert the DB back to a
single node and see. If this turns out to be the issue, it is very bad, as
that would mean no proper HA for the Foreman DB, so I'm hoping this is not
the case.

While I'm working on, please let me know if I can provide any more info or
if you have any other suggestions, etc.
Thanks!

··· On Wednesday, September 20, 2017 at 3:55:43 AM UTC-7, Lukas Zapletal wrote:

Hey, you are absolutely right that this is a huge design gap in
discovery. We are tracking a refactor ticket to redesign how
discovered hosts are stored, but this is a complete change of how
discovered hosts are provisioned (you would not be able to use the
New Hosts screen, for example). I think this change will happen as soon
as we redesign the new host form to be a session-full wizard.

A workaround could be a setting that would attempt to delete the existing
host when a new one is discovered, but this would be a very dangerous
thing (security related); not sure that is feasible even as
opt-in.

In the past, we have seen these deadlocks (on fact_name or fact_value)
because these are very busy tables - discovery, facter/ENC and other
plugins (Katello RHSM, OpenSCAP, Ansible…) all write there or
change data. I am unable to tell from the info you provided what is
going on - you need to dig deeper.

One more idea - we have seen similar deadlocks (on different tables)
when the background (cron) job we ship by default attempts to delete old
reports. Can you check if there is any cron job or other process
doing some management of facts? Even deleting a lot of data can block
all updates for a long time (minutes to hours). Perhaps try to disable
all Foreman cron jobs and re-test.

LZ

··· On Thu, Sep 21, 2017 at 2:27 AM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

A few extra thoughts on this, since a lot of it is still based on my
design from nearly 5 years ago :wink:

>
> Hmm, one generic question on this - according to above logic, if my
> managed host had crashed, say because it lost its HW RAID controller,
> for example, so it can't boot off the disk anymore thus resulting in
> PXE boot (given that BIOS boot order is set that way), correct?

> Now, by default, Foreman default pxeconfig file makes a system to
> boot off its disk, which in this particular situation will result in
> endless loop until some external (to Foreman) monitoring detects a
> system failure, then a human gets on a console and real
> troubleshooting starts only then.

This is absolutely true. We had, at one time, considered adding a state
machine (or similar) to Foreman, so that such things (as well as boot
loops in Kickstart, and so forth) could be detected, but it was never
completed.

> Now, with that in mind, I was thinking of moving actual OS
> provisioning tasks to Foreman as well. However, if crashed system
> would never be allowed to re-register (get discovered) because it is
> already managed by Foreman, the above flow is just not going to work
> anymore and I'd have re-think all flows. Are there specific reasons
> why this in place? I understand that this is how it is implemented
> now, but is there a bigger idea behind that? If so, what is it?

There were two goals - to prevent duplicates (if unprovisioned hosts
are rebooted, for example), and to allow recycling (delete a host from
Foreman, reboot it, and it'll be back in the discovered hosts list to
be re-used). Neither of these is insurmountable some other way, but
this was the easiest.

> Also, if you take my example of flows stitching for a complete system
> lifecycle management, what would you suggest we could do differently
> to allow Foreman to be a system that we use for both discovery and OS
> provisioning?

As Lukas says, a full refactor may well happen, and we'd love input on
that as we go forward. For a workaround today, I'd probably lean
towards a secondary plugin that sits on top of Discovery and interacts
with the registration process - given your example, you could add a
check whether the registration matches a host that's already provisioned, and
take further action if so. That might also be a good way to proof-of-
concept some ideas, before merging the code back into Discovery.

> Another thing (not as generic as above, but actually very applicable
> to my current issue) - if a client system is not allowed to register
> and given 422 error, for example, it keeps trying to register
> resulting in huge amount of work. This is also a gap, IMHO -
> discovery plug-in needs to do this differently somehow so rejected
> systems do not take away Foreman resources (see below for actual
> numbers of such attempts in one of my cluster).

I think I agree - the hosts should keep retrying until they get a
response from Foreman, but then actions can be taken. I'd probably be
in favour of keeping the retry (so that, say, if the offending MAC is
removed in Foreman, the host can register on the next retry), but
perhaps split the process into two calls. The first is a light "am I
registered?" call that returns true/false, and only if false would the
heavier registration call be made. Does that work?
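
Purely as a hypothetical sketch of that two-call client flow (the lightweight
"am I registered?" check doesn't exist in Discovery today, and the URL here is
a placeholder):

# Hypothetical client-side sketch of the two-step flow described above.
# The existence check is an assumption; the facts upload endpoint does exist.
require 'net/http'
require 'json'
require 'uri'

FOREMAN = URI('https://foreman.example.com')   # placeholder

def registered?(name)
  # cheap existence check, e.g. a GET on the discovered host by name
  res = Net::HTTP.start(FOREMAN.host, FOREMAN.port, use_ssl: true) do |http|
    http.get("/api/v2/discovered_hosts/#{name}")
  end
  res.code.to_i == 200
end

def register(facts)
  # heavier call: upload the full fact set only when needed
  Net::HTTP.start(FOREMAN.host, FOREMAN.port, use_ssl: true) do |http|
    req = Net::HTTP::Post.new('/api/v2/discovered_hosts/facts',
                              'Content-Type' => 'application/json')
    req.body = { facts: facts }.to_json
    http.request(req)
  end
end

# facts = ... gathered by the discovery image ...
# register(facts) unless registered?('mac3cfdfe52252c')  # example host name from above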

Thanks!
Greg

··· On Wed, 2017-09-20 at 17:27 -0700, 'Konstantin Orekhov' via Foreman users wrote:

> This is absolutely true. We had, at one time, considered adding a state
> machine (or similar) to Foreman, so that such things (as well as boot
> loops in Kickstart, and so forth) could be detected, but it was never
> completed.
>

A state machine would be nice, as it allows more actions to be taken on a
machine in different states. For example, in some other threads, I was
asking about the ability to use RemoteExec for discovered hosts, not just
managed hosts as it is now.
Proper hooks for systems entering/leaving any of those states would also open up
a lot of opportunities.

> As Lukas says, a full refactor may well happen, and we'd love input on
> that as we go forward.

Any of you, guys, going to PuppetConf this year? If so, can we meet and
have a discussion on this maybe?

> I think I agree - the hosts should keep retrying until they get a
> response from Foreman, but then actions can be taken. I'd probably be
> in favour of keeping the retry (so that, say, if the offending MAC is
> removed in Foreman, the host can register on the next retry), but
> perhaps split the process into two calls. The first is a light "am I
> registered?" call that returns true/false, and only if false would the
> heavier registration call be made. Does that work?
>

Yes, this would definitely work. This is also one of the states of a
system in the state machine we talked about above.

> One more idea - we have seen similar (but different tables) deadlocks
> when a background (cron) job we ship by default attempts to delete old
> reports. Can you check if there is any cronjob or any other process
> doing some management of facts? Even deleting lot of data can block
> all updates for a long time (minutes to hours). Perhaps try to disable
> all foreman jobs and re-test.
>

I have tried this to no avail. However, I think the culprit of the problem is
a very slow DELETE MySQL query, which apparently happens even for
absolutely new, freshly-discovered systems as well as already-discovered
ones.

2017-09-28 13:09:49 c75f5c40 [sql] [D] SQL (50843.2ms) DELETE FROM
fact_values WHERE fact_values.id IN

Please see this gist I've recorded with SQL debug enabled:
https://gist.github.com/anonymous/a721e220d82f5160450e483b8776489d
I have a ton of hosts doing exactly the same thing - they try to register, the
MySQL DELETE expires (it takes up to 50 sec, as you can see), some rollback
happens, and it expires again. And so on and so forth until the systems register
one by one. This results in many empty or duplicate entries even for a small
batch of systems coming online at the same time.

The above examples are taken from a single Foreman instance running against
a regular (non-Galera) MySQL DB, so at least I can say that the fact that I
had several Foreman instances behind a load-balancer talking to
Galera-replicated MySQL has nothing to do with this behavior. The only
difference is that with the Galera-enabled DB, the expiration errors are replaced
with deadlock errors, which makes total sense - if the delete operation takes
almost a minute, no wonder it results in some rows being locked. As load
increases (more systems registering at the same time), more and more such
errors happen, so I believe the proper way to deal with this is to
optimize the MySQL query first and then go from there. Would you agree?

OK, I can see there is a subselect; these are sometimes painful,
particularly for MySQL. We fixed that already - see fact_importer.rb
(this is the develop branch):

def delete_removed_facts
  ActiveSupport::Notifications.instrument "fact_importer_deleted.foreman",
      :host_id => host.id, :host_name => host.name, :facts => facts, :deleted => [] do |payload|
    delete_query = FactValue.joins(:fact_name)
                            .where(:host => host, 'fact_names.type' => fact_name_class.name)
                            .where.not('fact_names.name' => facts.keys)
    if ActiveRecord::Base.connection.adapter_name.downcase.starts_with? 'mysql'
      # MySQL does not handle delete with inner query correctly (slow) so we will do two queries on purpose
      payload[:count] = @counters[:deleted] = FactValue.where(:id => delete_query.pluck(:id)).delete_all
    else
      # deletes all facts using a single SQL query with inner query otherwise
      payload[:count] = @counters[:deleted] = delete_query.delete_all
    end
  end
end

See the comment there - do you have this in your instance? If not, git-blame
the commit and apply it. You have some older version, I assume.

··· On Wed, Oct 4, 2017 at 1:38 AM, 'Konstantin Orekhov' via Foreman users wrote:


Later,
Lukas @lzap Zapletal

> > As Lukas says, a full refactor may well happen, and we'd love input
> > on that as we go forward.
>
> Any of you, guys, going to PuppetConf this year? If so, can we meet
> and have a discussion on this maybe?

I certainly won't be, sadly. I'll ask around and see if anyone is
heading down.

> > I think I agree - the hosts should keep retrying until they get a
> > response from Foreman, but then actions can be taken. I'd probably
> > be in favour of keeping the retry (so that, say, if the offending
> > MAC is removed in Foreman, the host can register on the next
> > retry), but perhaps split the process into two calls. The first is
> > a light "am I registered?" call that returns true/false, and only
> > if false would the heavier registration call be made. Does that
> > work?
>
> Yes, this would definitely work. This is also is one of the states of
> a system in the state machine we talked about above.

Agreed, a proper place to hook it would be ideal, I'm just throwing
ideas out that might help in the short term. Sounds like Lukas has you
covered on the DB locking issues anyway though :stuck_out_tongue:

Greg

··· On Tue, 2017-10-03 at 16:12 -0700, 'Konstantin Orekhov' via Foreman users wrote:

> See the comment there, do you have this in your instance? If not git
> blame the commit and apply it. You have some older version I assume.
>

Yes, I'm running several 1.14.1 and 1.14.3 instances/clusters. Both had the
same issue with deadlocks. I've updated two of them with the above patch and was
lucky enough to immediately observe a registration of at least 62 systems
go through without a single error.
I'll monitor things more, but so far this is a huge step forward.

Thanks!

>
> Agreed, a proper place to hook it would be ideal, I'm just throwing
> ideas out that might help in the short term.
>

Sure, that would be a nice thing to have as a starting point.