Testing needed: Improved reports expiration rake task

Hey Foreman users and devs!

Foreman reports have a long-standing problem related to the concurrency of report creation and deletion. Managed hosts upload lots of reports, and due to how reports are stored in our RDBMS, this operation is slow but scalable. The problem starts with a rake task that is called regularly (weekly or daily, I think) to delete those records. If the rate of creation is fast enough, the rake task is unable to delete the records, because deletion locks the whole table, which slows down or even kills new report creation. We’ve seen this misbehavior on both MySQL and PostgreSQL.

I created a patch which aims to solve this by breaking the (rake) deletion into batches (by default 500 reports; configurable). There is a small sleep between batches (100 ms by default) so that blocked transactions have a chance to start. We need to find the right balance so that deletion actually keeps up with incoming records.
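For a rough idea of the approach, here is a minimal sketch of the batching idea - not the actual patch; the expire_in_batches name and the reported_at column are my assumptions:

    # Minimal sketch: delete old reports in small chunks and sleep between
    # chunks so concurrent report uploads get a chance to acquire locks.
    BATCH_SIZE  = 500   # default batch size from the description above
    BATCH_SLEEP = 0.1   # 100 ms pause between batches

    def expire_in_batches(older_than)
      loop do
        # Pick one batch of candidate ids, then delete only those rows,
        # keeping each transaction (and its table locks) short.
        ids = Report.where("reported_at < ?", older_than).limit(BATCH_SIZE).pluck(:id)
        break if ids.empty?
        Report.transaction { Report.where(id: ids).delete_all }
        sleep BATCH_SLEEP
      end
    end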

I hereby ask users with a decent amount of reports (ideally a live instance, so new reports are incoming) to test this small patch. It should be compatible with any Foreman version. Measure how reports are deleted before and after the patch: I am interested in the time spent deleting those records and in the performance of report upload calls. If you want, change the batch size or delay time and share the results.

Without enough testing, we can’t be confident this won’t make things even worse. Please test on MySQL and/or PostgreSQL - the patch should work on both.

https://github.com/theforeman/foreman/pull/5587

Thanks!

Also, I have tested this only with Puppet reports; if you can test this with other kinds of reports, please share the results.

Worked for me wonderfully on a particularly “bad” and very busy Foreman cluster (3 Foreman nodes sharing a 3-way Galera cluster running on the same VMs). Previously I had to manually stop Foreman on all nodes and clear out the reports myself, as this job would never complete on its own.

I patched report.rb with your patch on Friday night; over the last weekend it ran with no issues and dropped the report count from 73K+ (covering the last 2 months) to 12K+ (covering a week).

Thanks, Lukas!


That’s Puppet reports only; I don’t have others to test against.


Thanks so much, this gives a rough deletion rate of about 1k reports per hour. Was the batch size 500 records? That’s still quite a long transaction lock; we should consider a smaller batch size of, say, 200 records. Is this PgSQL or MySQL?

I think I am gonna add some metrics to the deletion and print them to stdout.
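Roughly something along these lines - just a sketch, the helper name is made up:

    # Illustrative helper: time one batch deletion and print a rec/sec rate.
    def delete_batch_with_stats(ids)
      started = Time.now.utc
      deleted = Report.where(id: ids).delete_all
      elapsed = Time.now.utc - started
      rate = elapsed.zero? ? deleted : (deleted / elapsed).round
      puts "Expired #{deleted} reports at rate #{rate} rec/sec"
      deleted
    end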

It wasn’t that long - I wanna say ~1000 reports per 15-20 minutes before it runs into a deadlock in that env. Metrics/messages would be very helpful, yes.

It is Galera-replicated MySQL.

Changes are done in the PR if you want to test the improved version: batch size of 300 and some stats in the logs.

Hmm, no, it does not work on 1.14.x or 1.15.x anymore:

foreman-rake --trace reports:expire

** Invoke reports:expire (first_time)
** Invoke environment (first_time)
** Execute environment
** Execute reports:expire
rake aborted!
NoMethodError: undefined method `with_fields' for #<Foreman::LoggingImpl:0x00000005f66eb8>
/usr/share/foreman/app/models/report.rb:89:in `block (2 levels) in expire'
/opt/rh/sclo-ror42/root/usr/share/gems/gems/activerecord-4.2.5.1/lib/active_record/connection_adapters/abstract/database_statements.rb:213:in `block in transaction'
/opt/rh/sclo-ror42/root/usr/share/gems/gems/activerecord-4.2.5.1/lib/active_record/connection_adapters/abstract/transaction.rb:184:in `within_new_transaction'
/opt/rh/sclo-ror42/root/usr/share/gems/gems/activerecord-4.2.5.1/lib/active_record/connection_adapters/abstract/database_statements.rb:213:in `transaction'
/opt/rh/sclo-ror42/root/usr/share/gems/gems/activerecord-4.2.5.1/lib/active_record/transactions.rb:220:in `transaction'
/usr/share/foreman/app/models/report.rb:79:in `block in expire'
/usr/share/foreman/app/models/report.rb:78:in `loop'
/usr/share/foreman/app/models/report.rb:78:in `expire'
/usr/share/foreman/lib/tasks/reports.rake:34:in `block (2 levels) in <top (required)>'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/task.rb:240:in `call'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/task.rb:240:in `block in execute'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/task.rb:235:in `each'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/task.rb:235:in `execute'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/task.rb:179:in `block in invoke_with_call_chain'
/opt/rh/rh-ruby22/root/usr/share/ruby/monitor.rb:211:in `mon_synchronize'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/task.rb:172:in `invoke_with_call_chain'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/task.rb:165:in `invoke'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:150:in `invoke_task'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:106:in `block (2 levels) in top_level'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:106:in `each'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:106:in `block in top_level'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:115:in `run_with_threads'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:100:in `top_level'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:78:in `block in run'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:176:in `standard_exception_handling'
/opt/rh/rh-ruby22/root/usr/share/gems/gems/rake-10.4.2/lib/rake/application.rb:75:in `run'
/opt/rh/rh-ruby22/root/usr/bin/rake:33:in `<main>'
Tasks: TOP => reports:expire

Oh yeah, sorry. Simply remove the with_fields wrapper around the log line, that’s all. We now have structured logging support in 1.17+, which allows us to send more structured fields into the logs. Then it will work.

If I understand you correctly, make this:

      Foreman::Logging.with_fields(deleted_messages: message_count, expired_sources: source_count, expired_logs: log_count, expired_total: count, expire_rate: rate) do
        logger.info "Expired #{count} #{to_s.underscore.humanize.pluralize} at rate #{rate} rec/sec"
      end

just into this:

logger.info "Expired #{count} #{to_s.underscore.humanize.pluralize} at rate #{rate} rec/sec"

Yes? If so, I don’t see anything new reported, but it removed at least 4 reports:

foreman-rake reports:expire

Successfully encrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully encrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret

Yes, I can confirm this is the correct change. There should be at least two log lines in production.log, though.

Hmm, sorry, just re-ran reports:expire cronjob again, but still nothing:

# grep -i expire /var/log/foreman/production.log| grep -v memcach
2018-06-05 13:17:54 3ef8cf65 [app] [I] Expire fragment views/tabs_and_title_records-6 (0.3ms)
2018-06-05 13:17:54 3ef8cf65 [app] [I] Expire fragment views/tabs_and_title_records-6 (0.3ms)

Just replace logger.info with puts. We have a bug in older versions of Foreman where all logs from models go into the SQL logger instead of the app logger, and that one is turned off by default. I figured out today that this does not work for class methods and filed a patch:

http://projects.theforeman.org/issues/23826

Anyway, use the puts workaround to see some logs.
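That is, the log line from above becomes:

    # Workaround: bypass the (silenced) model logger and print to stdout.
    puts "Expired #{count} #{to_s.underscore.humanize.pluralize} at rate #{rate} rec/sec"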

Yep, now I’m getting reports:

# foreman-rake reports:expire
Successfully encrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully encrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Expired 300 Config reports at rate 566 rec/sec
Expired 300 Config reports at rate 547 rec/sec
Expired 300 Config reports at rate 64 rec/sec
Expired 300 Config reports at rate 108 rec/sec
Expired 300 Config reports at rate 186 rec/sec
Expired 300 Config reports at rate 73 rec/sec
rake aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: COMMIT
<snip>

Thanks!

There is still a deadlock, but a deadlock of the rake task is better than a deadlock of a request being processed. Can you try lowering the batch size to 50? Better or worse? Currently it is not configurable; you need to edit the file. I will make it configurable.
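For anyone experimenting before that lands, a hypothetical way to make it tunable - the REPORT_EXPIRE_BATCH_SIZE name is made up, not an actual Foreman setting:

    # Hypothetical: read the batch size from an environment variable
    # instead of hard-coding it in report.rb.
    BATCH_SIZE = Integer(ENV.fetch("REPORT_EXPIRE_BATCH_SIZE", "500"))

which would allow something like REPORT_EXPIRE_BATCH_SIZE=50 foreman-rake reports:expire.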

Is it possible to only store distinct reports, and then for each host report just increment a count of how many identical reports there are? That way you just add a new table with a PK (report UUID or something) and an FK (MMDDYYYYHHMMSS timestamp?) and keep a count of how many reports after that are identical.

Then you can kill off all the dupes of a report with one row deletion. This also allows for additional functionality, like reporting on changes (e.g. whenever that new table gets a new row).
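A rough sketch of how that schema could look as a Rails migration - purely illustrative, all names invented:

    # Purely illustrative: store each distinct report body once and count
    # repeats, so expiring duplicates is a single row update/delete.
    class CreateDistinctReports < ActiveRecord::Migration
      def change
        create_table :distinct_reports do |t|
          t.string :body_digest, null: false   # hash of the report content
          t.text   :body
        end
        add_index :distinct_reports, :body_digest, unique: true

        create_table :report_occurrences do |t|
          t.references :distinct_report, null: false
          t.references :host, null: false
          t.integer  :occurrence_count, null: false, default: 1
          t.datetime :last_seen_at
        end
      end
    end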


IMHO, in a large environment like ours, I don’t believe it would be possible to find a magic number - reports from hosts may come in at any moment, resulting in a deadlock - so just adding this feature is already a big improvement (at least for us).

I tried 50, but it is still the same:

# foreman-rake reports:expire
Successfully encrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully decrypted field for Setting::Auth oauth_consumer_key
Successfully encrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Successfully decrypted field for Setting::Auth oauth_consumer_secret
Expired 50 Config reports at rate 200 rec/sec
Expired 50 Config reports at rate 290 rec/sec
Expired 50 Config reports at rate 130 rec/sec
Expired 50 Config reports at rate 18 rec/sec
Expired 50 Config reports at rate 261 rec/sec
Expired 50 Config reports at rate 328 rec/sec
Expired 50 Config reports at rate 153 rec/sec
Expired 50 Config reports at rate 59 rec/sec
Expired 50 Config reports at rate 23 rec/sec
Expired 50 Config reports at rate 284 rec/sec
rake aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: COMMIT
<snip>

BTW, the calculation above does not seem to match reality - before I ran the above rake task, I checked in the UI and saw that I had 12717 reports. After the rake task got aborted, I had 12504 reports, so only 213 reports truly got expired vs. the 450 reported by rake (I’m not counting the last batch, as it was the one interrupted). Am I looking at this the wrong way?

This is a possibility, but I wonder how slow comparing reports to each other would actually be. My concern is that it would actually be slower than simply storing them as-is (and deleting old reports early on).

Thanks for the numbers. But according to your earlier comments, the patch is still an improvement, isn’t it?

You are counting it right: the moment the “Expired…” line is reported, the transaction ends, and as long as there was no rollback, the transaction should be committed.

Were you comparing the count of reports older than XYZ, or the total count? If the latter, you might have had incoming entries.
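E.g. something like this in foreman-rake console (assuming expiry is keyed on the reported_at column):

    # Count only reports in the expirable window, so incoming reports
    # don't skew the before/after comparison.
    Report.where("reported_at < ?", 1.week.ago).count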

I just looked at total counts before and after.

Yes, this patch was a very big help in my situation, and I’m not too concerned with the accuracy of the reporting, so thank you again!