Foreman reports have a long-standing problem related to concurrent report creation and deletion. Managed hosts upload lots of reports, and due to how reports are stored in our RDBMS this operation is slow, but scalable. The problem starts with a rake task that is called regularly (weekly or daily, I think) and tries to delete those records. If the rate of creation is fast enough, the rake task cannot delete the records, because deletion needs to lock the whole table, which slows down or even kills new report creation. We’ve seen this misbehavior on both MySQL and PostgreSQL.
I created a patch which aims to solve this by breaking the (rake) deletion into batches (500 reports by default - configurable). There is a short sleep between batches (100 ms by default) so that waiting transactions have a chance to start. We need to find the right balance so that deletion actually keeps up with incoming records.
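To illustrate the idea, here is a minimal sketch of what the batched expiry loop looks like conceptually - this is not the actual patch; the Report scope, the reported_at column and the one-week cutoff are assumptions, and the batch size and sleep interval are just the defaults mentioned above:

# Conceptual sketch only, not the real patch.
BATCH_SIZE = 500
SLEEP_SECONDS = 0.1
cutoff = 1.week.ago

loop do
  ids = Report.where("reported_at < ?", cutoff).limit(BATCH_SIZE).pluck(:id)
  break if ids.empty?
  Report.where(id: ids).delete_all  # each batch is its own short transaction
  sleep SLEEP_SECONDS               # give incoming report uploads a chance to grab locks
end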
I hereby ask users with a decent amount of reports (ideally a live instance, so new reports keep coming in) to test this small patch. It should be compatible with any Foreman version. Measure how reports are being deleted before and after the patch: I am interested in the time spent deleting those records and in the performance of report upload calls. If you want, change the batch size or delay and share your results.
Without enough testing, we can’t be confident this won’t make things even worse. Please test on MySQL and/or PostgreSQL - the patch should work on both.
Worked for me wonderfully on a particularly “bad” and very busy Foreman cluster (3 Foreman nodes sharing a 3-way Galera cluster running on the same VMs). Previously I had to stop Foreman on all nodes and manually clear out the reports, as this job would never complete on its own.
I patched report.rb with your patch on Friday night; it ran over the weekend with no issues and dropped the report count from 73K+ (covering the last 2 months) to 12K+ (a week’s worth).
Thanks so much - that gives a rough deletion rate of about 1k reports per hour. Was the batch size 500 records? That’s still quite a long transaction lock; we should consider a smaller batch size of, let’s say, 200 records. Is this PostgreSQL or MySQL?
I think I am going to add some metrics to the deletion and print them to stdout.
It wasn’t that long - I want to say ~1000 reports per 15-20 minutes before it runs into a deadlock in that environment. Metrics/messages would be very helpful, yes.
Oh yeah, sorry. Simply remove the with_fields wrapper around the log line, that’s all. We now have structured logging support in 1.17+ which allows us to send more structured fields into logs. Then it will work. Change this:
Foreman::Logging.with_fields(deleted_messages: message_count, expired_sources: source_count, expired_logs: log_count, expired_total: count, expire_rate: rate) do
  logger.info "Expired #{count} #{to_s.underscore.humanize.pluralize} at rate #{rate} rec/sec"
end
just into this:
logger.info "Expired #{count} #{to_s.underscore.humanize.pluralize} at rate #{rate} rec/sec"
Yes? If so, I don’t see anything new reported, but it removed at least 4 reports:
foreman-rake reports:expire
Successfully encrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully encrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Just replace logger.info with puts. We have a bug in older versions of Foreman where all logs from models go into the SQL logger instead of the app logger, and that one is turned off by default. I figured out today that this does not work for class methods and filed a patch:
# foreman-rake reports:expire
Successfully encrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully encrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Expired 300 Config reports at rate 566 rec/sec
Expired 300 Config reports at rate 547 rec/sec
Expired 300 Config reports at rate 64 rec/sec
Expired 300 Config reports at rate 108 rec/sec
Expired 300 Config reports at rate 186 rec/sec
Expired 300 Config reports at rate 73 rec/sec
rake aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: COMMIT
<snip>
There is still a deadlock, but a deadlock in the rake task is better than a deadlock in a request being processed. Can you try lowering the batch size to 50? Better or worse? Currently it is not configurable; you need to edit the file. I will make it configurable.
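In the meantime, one possible stopgap sketch for tweaking it without re-editing the file each time would be reading the value from an environment variable inside the expiry code - the variable name below is made up, the current patch does not read it:

# Hypothetical, not part of the patch: fall back to 500 when the variable is unset.
batch_size = Integer(ENV.fetch('REPORT_EXPIRE_BATCH_SIZE', '500'))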
Would it be possible to only store distinct reports, and then for each host report just increment a count of how many identical reports there are? That way you just add a new table mapping a PK (report uuid or something) to an FK (MMDDYYYYHHMMSS timestamp?) and keep a count of how many reports after that are identical.
Then you can kill off all the dupes of a report with a single row deletion. This also allows for additional functionality like reporting on changes (e.g. when that new table gets a new row).
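Roughly the kind of schema I mean - purely a sketch with made-up table and column names, nothing that exists in Foreman today:

# Illustrative only; none of these tables exist in Foreman.
class SketchReportDeduplication < ActiveRecord::Migration[5.1]
  def change
    # One row per distinct report body, identified by a digest of its content.
    create_table :distinct_reports do |t|
      t.string :body_digest, null: false
      t.timestamps
    end
    add_index :distinct_reports, :body_digest, unique: true

    # One row per host/report occurrence, pointing at the distinct body
    # and carrying a counter for consecutive identical uploads.
    create_table :report_occurrences do |t|
      t.references :distinct_report, foreign_key: true, null: false
      t.references :host, foreign_key: true, null: false
      t.integer  :identical_count, null: false, default: 1
      t.datetime :reported_at, null: false
    end
  end
end

Deleting a single :report_occurrences row would then drop a whole run of identical uploads at once, which is the “one row deletion” idea.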
IMHO, in a large environment like ours, I don’t believe it is possible to find a magic number - reports from hosts may come in at any moment, resulting in a deadlock, so just adding this feature is already a big improvement (at least for us).
I tried 50, but it is still the same:
# foreman-rake reports:expire
Successfully encrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully encrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Expired 50 Config reports at rate 200 rec/sec
Expired 50 Config reports at rate 290 rec/sec
Expired 50 Config reports at rate 130 rec/sec
Expired 50 Config reports at rate 18 rec/sec
Expired 50 Config reports at rate 261 rec/sec
Expired 50 Config reports at rate 328 rec/sec
Expired 50 Config reports at rate 153 rec/sec
Expired 50 Config reports at rate 59 rec/sec
Expired 50 Config reports at rate 23 rec/sec
Expired 50 Config reports at rate 284 rec/sec
rake aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: COMMIT
<snip>
BTW, the calculation above does not seem to match reality - before I ran the rake task above, I checked in the UI and had 12717 reports. After the rake task aborted, I had 12504 reports, so only 213 reports truly got expired vs. the 450 reported by rake (9 completed batches × 50; I’m not counting the last batch as it was the one interrupted). Am I looking at this the wrong way?
This is a possibility, but I wonder how slow comparing reports with each other would actually be. My concern is that it would end up slower than simply storing them as-is (and deleting old reports early on).
Thanks for the numbers. But according to your earlier comments, the patch is still an improvement, isn’t it?
You are counting it right: the moment the “Expired…” log line is printed, the transaction ends, and as long as there was no rollback the transaction should be committed.
Were you comparing the count of reports older than the cutoff, or the total count? If the latter, you might have had incoming entries.
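E.g. something like this from a Rails console, assuming the standard Report model and its reported_at column, and whatever cutoff your expire run used:

# Compare the count the rake task should touch (older than the cutoff) with
# the total count the UI shows; new uploads only affect the latter.
cutoff    = 1.week.ago
total     = Report.count
expirable = Report.where("reported_at < ?", cutoff).count
puts "total=#{total} older_than_cutoff=#{expirable}"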