Deadlock problem updating hosts_count

We're having trouble removing a host record from Foreman (1.7.1). It looks
like a process is holding a lock open. When we try to delete the record
from the UI, it sits there for a long time (several minutes) and then fails
with this error:

Mysql2::Error: Lock wait timeout exceeded; try restarting transaction:
UPDATE `models` SET `hosts_count` = COALESCE(`hosts_count`, 0) - 1 WHERE
`models`.`id` = 46 ORDER BY models.name

ActiveRecord::StatementInvalid
Mysql2::Error: Lock wait timeout exceeded; try restarting transaction:
UPDATE `models` SET `hosts_count` = COALESCE(`hosts_count`, 0) - 1 WHERE
`models`.`id` = 46 ORDER BY models.name
app/controllers/hosts_controller.rb:118:in `destroy'
app/models/concerns/foreman/thread_session.rb:33:in `clear_thread'
lib/middleware/catch_json_parse_errors.rb:9:in `call'

This record has been "stuck" (un-deletable) since October 1st. I checked
the process list in MySQL (show processlist), and I see a 15-day-old binlog
dump and a 60-minute-long thread, but nothing that has been open anywhere
near three weeks, which makes me wonder what could be holding this lock open.

In any case, I've been looking at the following two issues on GitHub, and
they appear to be relevant:

https://github.com/theforeman/foreman/pull/2362
https://github.com/theforeman/foreman/pull/2839

I'm trying to confirm whether the counter_cache fix is applicable here. If
so, does that mean an upgrade to 1.9.x is required?

Hello Kyle,
It is indeed possible that this is related to the counter_cache fix, but to
confirm that I would need a bit more context from your logs about the error
causing this.
The MySQL log should contain the details of which two transactions are
conflicting and causing the deadlock.
If this is indeed the case, it should be possible to cherry-pick just these
two patches (once the second one gets merged) onto 1.7.1 if needed.
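
If the MySQL error log does not have it, you can also pull the most recent
deadlock directly from the server; this is standard MySQL, nothing
Foreman-specific:

> show engine innodb status\G

The LATEST DETECTED DEADLOCK section of the output shows both conflicting
statements, together with the locks each transaction holds and the locks it
is waiting for.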

Have a nice day,
Tomer Brisker
Red Hat Engineering

Hi Tomer,
Unfortunately it doesn't look like we have logging turned on for this
particular server. Error logs are turned on, but nothing has appeared there
since September.

I was able to log in and run "show engine innodb status\G" and got the
following (sanitized output). It seems to show two transactions attempting
to update hosts_count on the architectures table. These transactions have
been there since Friday. I expected to find something even older, as we've
had a few "stuck" host records that we cannot delete going back to Oct 1.

------------------------
LATEST DETECTED DEADLOCK
------------------------

2015-10-23 00:06:10 7fc5696a0700
*** (1) TRANSACTION:
TRANSACTION 9523120346, ACTIVE 3 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1184, 2 row lock(s), undo log entries
2
MySQL thread id 88248, OS thread handle 0x7fc5692d1700, query id 3037051607
foreman.example.com 1.1.1.1 foreman updating
UPDATE `architectures` SET `hosts_count` = COALESCE(`hosts_count`, 0) - 1
WHERE `architectures`.`id` = 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index `PRIMARY` of table
`foreman`.`architectures` trx id 9523120346 lock_mode X locks rec but not
gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 8; compact format; info
bits 0
0: len 4; hex 80000001; asc ;;
1: len 6; hex 0002379f2b5f; asc 7 +_;;
2: len 7; hex 340009803725da; asc 4 7% ;;
3: len 6; hex 7838365f3634; asc x86_64;;
4: len 5; hex 9986ee3442; asc 4B;;
5: len 5; hex 9986ee3442; asc 4B;;
6: len 4; hex 80001c80; asc ;;
7: len 4; hex 80000003; asc ;;

*** (2) TRANSACTION:
TRANSACTION 9523116898, ACTIVE 15 sec starting index read
mysql tables in use 1, locked 1
23 lock struct(s), heap size 2936, 11 row lock(s), undo log entries 3
MySQL thread id 87505, OS thread handle 0x7fc5696a0700, query id 3037052724
foreman.example.com 1.1.1.1 foreman updating
UPDATE `architectures` SET `hosts_count` = COALESCE(`hosts_count`, 0) + 1
WHERE `architectures`.`id` = 1
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 163 page no 3 n bits 72 index `PRIMARY` of table
`foreman`.`architectures` trx id 9523116898 lock mode S locks rec but not
gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 8; compact format; info
bits 0
0: len 4; hex 80000001; asc ;;
1: len 6; hex 0002379f2b5f; asc 7 +_;;
2: len 7; hex 340009803725da; asc 4 7% ;;
3: len 6; hex 7838365f3634; asc x86_64;;
4: len 5; hex 9986ee3442; asc 4B;;
5: len 5; hex 9986ee3442; asc 4B;;
6: len 4; hex 80001c80; asc ;;
7: len 4; hex 80000003; asc ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index `PRIMARY` of table
`foreman`.`architectures` trx id 9523116898 lock_mode X locks rec but not
gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 8; compact format; info
bits 0
0: len 4; hex 80000001; asc ;;
1: len 6; hex 0002379f2b5f; asc 7 +_;;
2: len 7; hex 340009803725da; asc 4 7% ;;
3: len 6; hex 7838365f3634; asc x86_64;;
4: len 5; hex 9986ee3442; asc 4B;;
5: len 5; hex 9986ee3442; asc 4B;;
6: len 4; hex 80001c80; asc ;;
7: len 4; hex 80000003; asc ;;

*** WE ROLL BACK TRANSACTION (1)


Hello Kyle,

You can check the value of your lock timeout using:
> show variables like 'innodb_lock_wait_timeout';
If it is set to a very high value, you can change it using, for example:
> set innodb_lock_wait_timeout=100;
to make it time out after 100 seconds.

You can also force kill the stuck process(es) so you can get the hosts
unstuck:
> show processlist;
> kill <process_id>;

After getting your hosts unstuck, you can check whether the two
aforementioned patches do in fact fix the deadlock.
Or it might be completely unrelated, and a one-time problem locked them in
the first place.
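
If you want to identify the blocking session before the timeout hits, MySQL
5.5 and later also expose lock waits through information_schema (again,
standard MySQL, nothing Foreman-specific). A query along these lines maps
each waiting transaction to the thread blocking it, and the blocking thread
id is what you would pass to kill:

SELECT r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;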

Have a nice day,
Tomer Brisker
Red Hat Engineering

Hi Tomer,
The value is set to 400:

> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 400   |
+--------------------------+-------+
1 row in set (0.00 sec)

Over the past day I’ve backported the two changes from:
https://github.com/theforeman/foreman/pull/2362
https://github.com/theforeman/foreman/pull/2839

into our 1.7.1 Foreman servers (we’re running two), and I’ve restarted the
MySQL database, just to start fresh. However, I’m still unable to remove a
number of Foreman records. When I go to delete the record through the Web
UI, I still see this error:

Mysql2::Error: Lock wait timeout exceeded; try restarting transaction:
UPDATE `models` SET `hosts_count` = COALESCE(`hosts_count`, 0) - 1 WHERE
`models`.`id` = 46 ORDER BY models.name

When I do a “show processlist” on the database, I can see the query timing
out:

| 60 | foreman | myhost.example.com:48892 | foreman | Query | 390 | Searching rows for update |
UPDATE `models` SET `hosts_count` = COALESCE(`hosts_count`, 0) - 1 WHERE
`models`.`id` = 46 ORDER BY

Here 390 is the time elapsed. Once it reaches 400, the process clears. If I
try the delete again, a new process appears and also times out after 400
seconds. I could drop the timeout to 100, but looking through the process
list, there don't appear to be any other long-running processes that might
be holding a lock (other than a process doing a binlog dump to our standby
database, but I don't think that's an issue). When I try to delete the bad
record, I see the one long-lived process start up and die.

The odd part is, there are other records I can remove, and the records I
can't remove appear to be the same every time. It's as if a particular group
of records has a problem. What would cause one set of records to be unable
to update hosts_count after repeated tries, while others can update it
immediately?

We run 3 puppet masters with ~8000 hosts, and two Foreman servers that talk
to a single MySQL database. None of the machines look particularly
overloaded, but I’m wondering if the database isn’t happy with two app
servers trying to talk to it simultaneously. Are there any additional
settings you could recommend I set to give me more visibility?


More information to add onto my last post…

It ONLY appears to affect updates on the models table for id=46. If the
record has a different model, the delete works.


Ugh, sorry - I may have jumped the gun on that last one. I changed the
model on the problematic host, and I still can't remove it.


Hi Kyle,
I'm afraid it seems that the source of your deadlocks is the Rails
implementation of counter_cache, which is sub-optimal (it is not
transaction-safe, which sometimes leads to such deadlocks).
As a workaround, you can try to disable the counter_cache for these columns
by removing the :counter_cache => counter_cache part of the lines in
host_common.rb [1].
You can then manually update the counters afterwards by running
foreman-rake fix_cached_counters, which will recount all the counters in
the database.
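
Roughly, the change follows this pattern. This is an illustrative sketch
rather than the exact file contents (see [1] for the real lines), and
:architecture here stands in for whichever associations in that file carry
the option:

# illustrative only; check [1] for the actual lines in host_common.rb
# before: the option keeps architectures.hosts_count updated automatically
belongs_to :architecture, :counter_cache => counter_cache
# after: plain association, no automatic counter bookkeeping
belongs_to :architecture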

[1]
https://github.com/theforeman/foreman/blob/1.7.1/app/models/concerns/host_common.rb#L13
Have a nice day,
Tomer Brisker
Red Hat Engineering

Thanks Tomer. It appears we also have issues with the size of our database,
which has grown unbounded. We had tables exceeding 500 million records and
85GB in size. I think this has been causing issues for us as well, and may
be the reason some of our records are "stuck".
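
For anyone wanting to check their own installation, a standard
information_schema query (not Foreman-specific; adjust the schema name if
your database is not called 'foreman') lists the largest tables:

SELECT table_name, table_rows,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'foreman'
ORDER BY data_length + index_length DESC
LIMIT 10;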


For the benefit of future readers of this thread:
The suggested workaround does not work as-is, because disabling
counter_cache on the columns breaks the rake task.
The rake task can be fixed to work without counter_cache by changing the
code slightly. For example, the Model counter can be rebuilt with:

Model.all.each do |el|
  el.update_attribute('hosts_count', Host::Managed.where(:model_id => el.id).count)
end

instead of using the reset_counters method, which depends on the
counter_cache being set.
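
The same pattern generalizes to the other cached counters. A minimal
sketch, runnable from a Rails console on the Foreman server; the class
names and foreign keys below are assumptions based on the 1.7 schema, so
verify them against your installation first:

# Rebuild hosts_count for each class Foreman keeps a cached counter on.
# The class => foreign-key map is an assumption; adjust it to your schema.
{ Model           => :model_id,
  Architecture    => :architecture_id,
  Operatingsystem => :operatingsystem_id,
  Environment     => :environment_id }.each do |klass, fk|
  klass.all.each do |el|
    el.update_attribute('hosts_count', Host::Managed.where(fk => el.id).count)
  end
end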

··· On Sun, Nov 1, 2015 at 9:19 AM, Tomer Brisker wrote:

Hi Kyle,
I'm afraid it seems that the source of your deadlocks is the Rails
implementation of counter_cache, which is sub-optimal (it is not
transaction-safe, and can sometimes lead to deadlocks like these).
As a workaround, you can try to disable the counter_cache for these
columns by removing the :counter_cache => counter_cache part of the lines
in host_common.rb [1].
You can then manually update the counters afterwards by running
foreman-rake fix_cached_counters, which will recount all the counters in
the database.

[1]
https://github.com/theforeman/foreman/blob/1.7.1/app/models/concerns/host_common.rb#L13
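
For anyone attempting this, the change is mechanical. A paraphrased
sketch of what it looks like (the exact associations in your copy of
host_common.rb may differ, so treat this as illustrative, not verbatim):

# Before (app/models/concerns/host_common.rb, paraphrased):
belongs_to :architecture,    :counter_cache => counter_cache
belongs_to :operatingsystem, :counter_cache => counter_cache

# After: the :counter_cache option is dropped, the association stays:
belongs_to :architecture
belongs_to :operatingsystem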

On Fri, Oct 30, 2015 at 3:39 AM, Kyle Flavin <kyle.flavin@gmail.com> wrote:

Ugh, sorry - I may have jumped the gun on that last one. I changed the
model on the problematic host, and I still can’t remove it.

On Thursday, October 29, 2015 at 6:32:23 PM UTC-7, Kyle Flavin wrote:

More information to add onto my last post…

It ONLY appears to affect updates on the models table that occur on
id=46. If the record has a different model, the delete works.

On Thursday, October 29, 2015 at 4:41:36 PM UTC-7, Kyle Flavin wrote:

Hi Tomer,
The value is set to 400:

> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 400   |
+--------------------------+-------+
1 row in set (0.00 sec)

Over the past day I’ve backported the two changes from:
https://github.com/theforeman/foreman/pull/2362
https://github.com/theforeman/foreman/pull/2839

into our 1.7.1 Foreman servers (we’re running two), and I’ve restarted
the MySQL database, just to start fresh. However, I’m still unable to
remove a number of Foreman records. When I go to delete the record through
the Web UI, I still see this error:

Mysql2::Error: Lock wait timeout exceeded; try restarting transaction:
UPDATE `models` SET `hosts_count` = COALESCE(`hosts_count`, 0) - 1
WHERE `models`.`id` = 46 ORDER BY models.name

When I do a “show processlist” on the database, I can see the query
timing out:

| 60 | foreman | myhost.example.com:48892 | foreman | Query | 390 | Searching rows for update |
UPDATE `models` SET `hosts_count` = COALESCE(`hosts_count`, 0) - 1
WHERE `models`.`id` = 46 ORDER BY

Where 390 is the time the query has been running. Once it reaches 400,
the process clears. If I try the delete again, a new process appears and
also times out after 400 seconds. I could drop the timeout to 100, but
looking through the list of processes, there don't appear to be any other
long-running processes that might also be holding a lock (other than one
doing a binlog dump to our standby database, which I don't think is an
issue). When I try to delete the bad record, I see the one long-lived
process start up and die.

The odd part is, there are other records I can remove. The records I
can't remove appear to be the same every time. It's as if a particular
group of records has a problem. What would cause one set of records to be
unable to update hosts_count after repeated tries, while others update
immediately?

We run 3 puppet masters with ~8000 hosts, and two Foreman servers that
talk to a single MySQL database. None of the machines look particularly
overloaded, but I’m wondering if the database isn’t happy with two app
servers trying to talk to it simultaneously. Are there any additional
settings you could recommend I set to give me more visibility?

On Wednesday, October 28, 2015 at 2:16:30 AM UTC-7, Tomer Brisker wrote:

Hello Kyle,

You can check the value of your lock timeout using:
> show variables like 'innodb_lock_wait_timeout';
If it is set to a very high value, you can change it using
> set innodb_lock_wait_timeout=100;
for example, to set it to time out after 100 seconds.

You can also force-kill the stuck process(es) so you can get the hosts
unstuck:
> show processlist;
> kill <process_id>;

After getting your hosts unstuck, you can try to check whether the two
aforementioned patches do in fact fix the deadlock.
Or it might be completely unrelated, and just a one-time problem that
locked them in the first place.

On Mon, Oct 26, 2015 at 8:12 PM, Kyle Flavin <kyle....@gmail.com> wrote:

Hi Tomer,
Unfortunately it doesn’t look like we have logging turned on for this
particular server. Error logs are turned on, but nothing appears there
since September.

I was able to login and do a “show engine innodb status\G” and got the
following (sanitized output) - it seems to show two transactions attempting
to update host_count on the architecture table. These transactions have
been there since Friday. I expected to find something even older, as we’ve
had a few “stuck” host records that we cannot delete going back to Oct 1.


------------------------
LATEST DETECTED DEADLOCK
------------------------
2015-10-23 00:06:10 7fc5696a0700
*** (1) TRANSACTION:
TRANSACTION 9523120346, ACTIVE 3 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1184, 2 row lock(s), undo log
entries 2
MySQL thread id 88248, OS thread handle 0x7fc5692d1700, query id
3037051607 foreman.example.com 1.1.1.1 foreman updating
UPDATE `architectures` SET `hosts_count` = COALESCE(`hosts_count`, 0) - 1
WHERE `architectures`.`id` = 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index `PRIMARY` of table
`foreman`.`architectures` trx id 9523120346 lock_mode X locks rec but
not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 8; compact format;
info bits 0
0: len 4; hex 80000001; asc ;;
1: len 6; hex 0002379f2b5f; asc 7 +_;;
2: len 7; hex 340009803725da; asc 4 7% ;;
3: len 6; hex 7838365f3634; asc x86_64;;
4: len 5; hex 9986ee3442; asc 4B;;
5: len 5; hex 9986ee3442; asc 4B;;
6: len 4; hex 80001c80; asc ;;
7: len 4; hex 80000003; asc ;;

*** (2) TRANSACTION:
TRANSACTION 9523116898, ACTIVE 15 sec starting index read
mysql tables in use 1, locked 1
23 lock struct(s), heap size 2936, 11 row lock(s), undo log entries 3
MySQL thread id 87505, OS thread handle 0x7fc5696a0700, query id
3037052724 foreman.example.com 1.1.1.1 foreman updating
UPDATE `architectures` SET `hosts_count` = COALESCE(`hosts_count`, 0) + 1
WHERE `architectures`.`id` = 1
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 163 page no 3 n bits 72 index `PRIMARY` of table
`foreman`.`architectures` trx id 9523116898 lock mode S locks rec but
not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 8; compact format;
info bits 0
0: len 4; hex 80000001; asc ;;
1: len 6; hex 0002379f2b5f; asc 7 +_;;
2: len 7; hex 340009803725da; asc 4 7% ;;
3: len 6; hex 7838365f3634; asc x86_64;;
4: len 5; hex 9986ee3442; asc 4B;;
5: len 5; hex 9986ee3442; asc 4B;;
6: len 4; hex 80001c80; asc ;;
7: len 4; hex 80000003; asc ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index `PRIMARY` of table
`foreman`.`architectures` trx id 9523116898 lock_mode X locks rec but
not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 8; compact format;
info bits 0
0: len 4; hex 80000001; asc ;;
1: len 6; hex 0002379f2b5f; asc 7 +_;;
2: len 7; hex 340009803725da; asc 4 7% ;;
3: len 6; hex 7838365f3634; asc x86_64;;
4: len 5; hex 9986ee3442; asc 4B;;
5: len 5; hex 9986ee3442; asc 4B;;
6: len 4; hex 80001c80; asc ;;
7: len 4; hex 80000003; asc ;;

*** WE ROLL BACK TRANSACTION (1)

On Sunday, October 25, 2015 at 12:43:08 AM UTC-7, Tomer Brisker wrote:

Hello Kyle,
It is possible that this is related indeed to the counter_cache_fix,
but in order to confirm that I would need a bit more context from your logs
as to the error causing this.
The mysql log should contain the details of which two transactions are
conflicting and causing the deadlock.
If this is indeed the case, it should be possible to cherry pick only
these two patches (once the second one gets merged) to 1.7.1 if needed.

On Sat, Oct 24, 2015 at 1:06 AM, Kyle Flavin <kyle....@gmail.com> wrote:

We’re having trouble removing a host record from Foreman (1.7.1). It
looks like a process is holding a lock open. When we try to delete the
record from the UI, it sits there for a long time (several minutes) and
then fails with this error:

Mysql2::Error: Lock wait timeout exceeded; try




Have a nice day,
Tomer Brisker
Red Hat Engineering

Hello Kyle,

About the abnormal size of the database: Can you tell which tables are
growing? I think I have a suspect, but I want to confirm it with you first.

Shimon.

··· On Tuesday, November 3, 2015 at 6:54:32 PM UTC+2, Kyle Flavin wrote:

Thanks Tomer. It appears we also have issues with the size of our
database, which has grown unbounded. We had tables exceeding 500 million
records and 85GB in size. I think this has been causing issues for us as
well, and may be the reason some of our records are "stuck".

Hi Shimon,
The logs table was the largest. Here were the sizes on disk:

# du -s * | sort -r -k1n
...
1321876   sources.MYD
1441796   param_values.ibd
1919892   sources.MYI
38199352  sessions.ibd
45965380  reports.ibd
88887416  logs.ibd

I remember that the logs table was the largest one - over 500 million
records in the table. We ended up truncating logs, reports, messages, and
sources (possibly a couple others), a little over a month ago.
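
For future readers: the same information can be pulled from MySQL itself
rather than from the datadir. A small sketch using the mysql2 gem (the
connection details are placeholders, and note that table_rows is only an
estimate for InnoDB tables):

require 'mysql2'

# List the ten largest tables in the foreman schema, by data + index size.
client = Mysql2::Client.new(:host => 'localhost', :username => 'foreman',
                            :password => 'changeme', :database => 'foreman')
result = client.query(<<-SQL)
  SELECT table_name,
         table_rows,
         ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
  FROM   information_schema.tables
  WHERE  table_schema = 'foreman'
  ORDER  BY data_length + index_length DESC
  LIMIT  10
SQL
result.each do |row|
  puts format('%-20s %12s rows %8s MB',
              row['table_name'], row['table_rows'], row['size_mb'])
end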

··· On Tuesday, December 29, 2015 at 12:06:50 AM UTC-8, ssh...@redhat.com wrote:

Hello Kyle,

About the abnormal size of the database: Can you tell which tables are
growing? I think I have a suspect, but I want to confirm it with you first.

Shimon.

My suspicions were incorrect, but there is still something you can do:
Do you have a cron job for deleting reports? Foreman has a 'rake
reports:expire [days=xxx][status=yyy][report_type=zzz]' command that can
keep the reports table size under control.
The logs table is connected to reports, so I think once you expire
reports, the logs table will shrink too.
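
For example, a nightly cron entry along these lines would keep the tables
trimmed (illustrative only: the schedule, the 7-day retention, and the
foreman-rake path are assumptions, not Foreman defaults):

# /etc/cron.d/foreman-expire-reports (hypothetical)
# Purge reports older than 7 days each night; logs shrink along with them.
30 0 * * * root /usr/sbin/foreman-rake reports:expire days=7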

··· On Tuesday, December 29, 2015 at 10:38:33 PM UTC+2, Kyle Flavin wrote:

Hi Shimon,
The logs table was the largest. ...

Yeah, after truncating the tables we enabled the rake jobs for cleaning the
reports, which we run daily. That seems to be keeping the logs at a
manageable size (~1-2 million records).

··· On Tuesday, December 29, 2015 at 11:50:57 PM UTC-8, ssh...@redhat.com wrote:

My suspicions were incorrect, but there is still something you can do:
Do you have a cron job for deleting reports? ...