Foreman reports:expire rake timeout?

Hey all, does the Foreman rake reports:expire task support an argument for a longer timeout?

We have 13,000 hosts checking in every 30 minutes, with around 800 classes and 1,000+ catalog items per host. Keeping 30 days of reports means around 9 million rows in the reports table and 120 million rows in the logs table.

I've never had the reports:expire task complete successfully; it always errors out with a timeout, but the MySQL query keeps running on the DB, so we've let it be. Recently it began taking longer than an hour to run, which meant our scheduled DB backup job (MySQL with Percona XtraBackup) was killing it by design in order to run the backup. This eventually led to the two tables blowing up even more and adversely affected DB performance.

I took the database offline yesterday and manually ran the following query, which is what the rake task runs:
DELETE FROM logs WHERE logs.report_id IN (SELECT reports.id FROM reports WHERE reports.type IN ('ConfigReport') AND (reports.created_at < '2018-05-03 12:31:48'))
This deleted something like 68 million rows and got us to a point where our database is online and working again, and I'm re-ordering the cron jobs to ideally prevent this in the future.

However, this took 25,000 seconds to complete, and I think it can be tweaked to be better:

  • If the rake task doesn't have a configurable timeout, make it configurable so I can set it to something longer. At minimum, this will let me monitor the rake task for failure properly; right now it always "fails" after 300 seconds but actually completes hundreds of seconds later.

  • Right now this single delete probably goes super quick on small databases, but on larger ones it takes forever (one single transaction) and is blocking. I can watch hundreds to thousands of "Insert Report" statements queue up on the database while the delete runs, eventually backing up my puppetmasters entirely. I don't necessarily care when this happens at 3am, but restructuring the job to delete 10,000 or 50,000 rows at a time in a loop, with a 2-5 second pause between iterations, would go a long way towards making this nicer (see the sketch after this list). I believe this would add a bit of overhead and runtime in a smaller environment, but that should be negligible overall.

  • Can it use a join rather than a nested select? I chatted with one of our DBAs, and he said that at face value something like the query below might be faster (at least in the lookup/query/setup stage), but without knowing the DB schema in detail he cannot be sure. (We looked for an ERD on the Foreman website but only came across a broken link in a forum post from several years ago.)
    DELETE logs FROM logs INNER JOIN reports ON reports.id = logs.report_id WHERE reports.type IN ('ConfigReport') AND (reports.created_at < '2018-05-03 12:31:48')
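
For illustration, here is a minimal sketch of the chunked idea from the second bullet, assuming MySQL and the tables above (the 10,000-row chunk size is a placeholder; something outside the statement, e.g. a script, has to rerun it with a short sleep until no rows remain):

    -- Delete one chunk of expired logs; rerun until ROW_COUNT()
    -- reports that 0 rows were deleted.
    DELETE FROM logs
    WHERE report_id IN (SELECT id FROM reports
                        WHERE type = 'ConfigReport'
                          AND created_at < '2018-05-03 12:31:48')
    LIMIT 10000;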

Hey, the problem here is concurrency. If you want to delete many records in one transaction, the RDBMS must lock the whole table, which blocks further inserts and updates. The same problem exists on PostgreSQL.

You can indeed use any SQL query that does the job of getting rid of the records. An inner join is exceptionally problematic on MySQL, and it's super slow on PostgreSQL as well.

The recommended solution to this long-standing issue is to break the deletion into chunks. Each chunk should be no more than hundreds of thousands of records, followed by a sleep (dozens of seconds, or minutes) so the blocked transactions have a chance to complete. You need to find the right balance so that the deletion actually keeps up with incoming records.

To do that, we really need to use joins, because the whole chain of joined records must be deleted in a single transaction (we don't have ON DELETE CASCADE set for those). Or, if MySQL supports cascade deletion, it is worth setting that; then it is as easy as deleting a batch of records from the reports table, but this was problematic with Rails AFAIK.
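
For what it's worth, MySQL/InnoDB does support cascading deletes. A sketch of what that could look like (the constraint name is assumed; it also only covers logs, since messages and sources are parents of logs and would still need orphan cleanup):

    -- Hypothetical: with this constraint in place, deleting a row from
    -- reports also deletes its rows in logs in the same statement.
    ALTER TABLE logs
      ADD CONSTRAINT fk_logs_report
      FOREIGN KEY (report_id) REFERENCES reports (id)
      ON DELETE CASCADE;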

Currently the SQL statements are:

DELETE FROM "logs" WHERE "logs"."report_id" IN (SELECT "reports"."id" FROM "reports" WHERE "reports"."type" IN ('ConfigReport') AND (reports.created_at < '2018-05-10 13:36:21'))

DELETE FROM "messages" WHERE (id NOT IN (SELECT DISTINCT message_id FROM "logs"))

DELETE FROM "sources" WHERE (id NOT IN (SELECT DISTINCT source_id FROM "logs"))

DELETE FROM "reports" WHERE "reports"."type" IN ('ConfigReport') AND (reports.created_at < '2018-05-10 13:36:21')

(Note the ordering: logs are deleted first, then messages and sources that are no longer referenced by any log, and finally the reports themselves.)

Thanks!

So it looks like my assumption that "the rake task fails but the delete eventually completes" was wrong; it was only ever completing the first of the 4 delete statements. So I've been building up entries in messages, sources, and reports for quite a while (in addition to logs over the past few weeks).

I'm working with my DBA right now to adapt these queries into loops that do exactly what you mentioned (purge X records, sleep, purge more, until complete), as we've done this with other large databases before.
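
Something along these lines, for example (a hypothetical MySQL stored procedure; the procedure name, chunk size, and sleep are placeholders, and it only covers the logs table):

    DELIMITER $$
    CREATE PROCEDURE purge_expired_logs(IN cutoff DATETIME)
    BEGIN
      DECLARE affected INT DEFAULT 1;
      WHILE affected > 0 DO
        -- purge one chunk of logs belonging to expired reports
        DELETE FROM logs
        WHERE report_id IN (SELECT id FROM reports
                            WHERE type = 'ConfigReport'
                              AND created_at < cutoff)
        LIMIT 10000;
        SET affected = ROW_COUNT();  -- rows removed by the DELETE above
        DO SLEEP(2);                 -- give queued inserts a chance to run
      END WHILE;
    END$$
    DELIMITER ;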

Any information on what makes joins problematic? I'm curious, since with my non-DBA background I'm surprised; I thought joins were DB-101 type stuff.

Finally, is there any option to increase or override the rake task timeout? I fear that even after I've cleaned things up, deleting records across these 4 tables every night might take more than 300 seconds total, causing not all 4 statements to execute… I can certainly do it manually for now, but at some point I'd imagine wanting to return to the rake task.

I filed Bug #23623: Break report expiration into batches - Foreman. I am working on something like the following (it does not work yet; it breaks tests, and I need to figure that out tomorrow):

  # Expire reports based on time and status
  # Defaults to expire reports older than a week regardless of the status
  # This method is very slow, use it only from the rake task.
  def self.expire(conditions = {}, chunk_size = 1000, sleep_time = 1)
    timerange = conditions[:timerange] || 1.week
    status = conditions[:status]
    cond = "created_at < \'#{(Time.now.utc - timerange).to_formatted_s(:db)}\'"
    cond += " and status = #{status}" unless status.nil?
    report_ids = Report.unscoped.where(cond).reorder('').limit(chunk_size).pluck(:id)
    log_count = Log.unscoped.where(:report_id => report_ids).reorder('').delete_all
    message_count = Message.unscoped.joins(:logs).where("logs.report_id" => report_ids).delete_all
    source_count = Source.unscoped.joins(:logs).where("logs.report_id" => report_ids).delete_all
    count = Report.unscoped.where(:id => report_ids).delete_all
    logger.info Time.now.utc.to_s + ": Expired #{count} #{to_s.underscore.humanize.pluralize} (M:#{message_count}, S:#{source_count}, L:#{log_count})"
    count
  end

There is a loop missing; the plan is to delete records in a loop with a slight delay (1 second, maybe less). Beware: this code currently does not work properly.

Can someone take it from here? Tests are failing and there is some issue. We have a week-long meeting next week, and I am not sure I will find time to work on this.

  def self.expire(conditions = {}, chunk_size = 1000, sleep_time = 0.2)
    timerange = conditions[:timerange] || 1.week
    status = conditions[:status]
    cond = "created_at < \'#{(Time.now.utc - timerange).to_formatted_s(:db)}\'"
    cond += " and status = #{status}" unless status.nil?
    total_count = 0
    report_ids = []
    while true
      Report.transaction do
        report_ids = Report.unscoped.where(cond).reorder('').limit(chunk_size).pluck(:id)
        if report_ids.count > 0
          log_count = Log.unscoped.where(:report_id => report_ids).reorder('').delete_all
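          # NOTE: the logs rows for these reports were just deleted above, so
          # the two joins below cannot match anything; messages and sources
          # are shared between logs and need orphan cleanup (NOT IN) instead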
          message_count = Message.unscoped.joins(:logs).where("logs.report_id" => report_ids).delete_all
          source_count = Source.unscoped.joins(:logs).where("logs.report_id" => report_ids).delete_all
          count = Report.unscoped.where(:id => report_ids).delete_all
          logger.info "expired #{count} #{to_s.underscore.humanize.pluralize} (M:#{message_count}, S:#{source_count}, L:#{log_count})"
          total_count += count
        end
      end
      break if (report_ids.nil? || report_ids.empty?)
      sleep sleep_time
    end
    logger.info "expired #{total_count} total records of #{to_s.underscore.humanize.pluralize}"
    total_count
  end

Edit: repasted a better version; it still does not work.

I finished the patch and moved it into a separate thread: Testing needed: Improved reports expiration rake task.