Error when purging puppet reports

> Ohad,
>
> I've enabled debug and snipped out the section in the log that I think
> pertains to this. Let me know if I snipped too much.
>
> So not using 'find in batches' I did: rake18 reports:expire days=300
> RAILS_ENV="production" and it deleted some older reports! It took about 45
> seconds to run and I didn't see any crazy memory utilization (maybe because
> not a ton was deleted?)
>
> Here is output from 'rake18 reports:expire days=280 RAILS_ENV="production"'
> after having run 'rake18 reports:expire days=290 RAILS_ENV="production"', so
> deleting 10 days of data from a time I didn't have many systems reporting
> yet.
>
> http://pastebin.com/NmkZYSGL

Looking at your logs, I'm really troubled.

Are you sure you are running a recent version of Foreman (e.g. 0.4.x)?!
The logs you show are consistent with the behavior from before
Bug #687 (rake reports:expire abuses memory and network bandwidth) was fixed.

The app/models/report.rb expire method should look something like this:
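A sketch of the batched version, pieced together from the fragments quoted later in this thread; illustrative, not the exact shipped 0.4 code (the option names in particular are assumptions):

  def self.expire(conditions = {})
    # option names are illustrative; the rake task passes days= and status=
    timerange = conditions[:timerange] || 1.week
    status    = conditions[:status]
    cond  = "created_at < '#{(Time.now.utc - timerange).to_formatted_s(:db)}'"
    cond += " AND status = #{status.to_i}" unless status.nil?

    count = 0
    # walk the matching reports in batches instead of loading them all at once
    Report.find_in_batches(:select => :id, :conditions => cond) do |batch|
      ids = batch.map(&:id)
      Log.delete_all(:report_id => ids)
      count += Report.delete_all(:id => ids)
    end
    # ...followed by the orphaned log/message/source cleanup quoted later on
    logger.info Time.now.to_s + ": Expired #{count} Reports"
    count
  end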

cheers,
Ohad

On Tue, Mar 20, 2012 at 9:38 PM, jmccann wrote:

Let me know if there is anything else that could help.

Thanks,
Jake

On Tuesday, March 20, 2012 1:57:58 PM UTC-5, jmccann wrote:

Ohad,

So far this has only worked better for me when doing status=0, probably
because I have no reports with a status of 0. When I try it with the other
status I'm actually looking for, I may run into some kind of issue again.

I will enable debugging before I do anything more and provide results.

Thanks,
Jake




Ohad,

Sorry, as you said, I was using the pre-0.4 method. That is what is
currently working for me (I found it in the git repo and tried it out as
part of troubleshooting).

I've put back the more recent code that came with 0.4 (which is what I'm
running) and ran it again with debugging. I have a 23M logfile (<1M bz2)
that is a mix of output from the rake task and 'regular activity' (mostly
'regular activity'). Is there an easy way to trim this down to just what you
would want, so I can put it in pastebin? I don't really want to make the
complete contents available … otherwise, is there any other way I could get
it to you?

As an FYI, with this newer method my web server hit 2590m VIRT and 2.3g RES
according to top. Using the pre-0.4 method as I did in the previous post, I
did not see this memory usage. Also, the command was 'time rake18
reports:expire days=279 RAILS_ENV="production"', which took 21 minutes to
run before the error. Yesterday I ran 'time rake18 reports:expire days=280
RAILS_ENV="production"' using the pre-0.4 method, so this run should only be
deleting about 2 days of data. The pre-0.4 method took <45 sec for 10 days
of reports.

The error I received was 'Mysql::Error: Got a packet bigger than
'max_allowed_packet' bytes: SELECT id FROM reports WHERE (reports.id
IN
(15462,15463,15464,15465,15466,15467,15468,15469,15470,15471,15475,16252,16256,16257,16258,16259,16260,16263,16267,16268,16269,16270,16272,16279,16282,16283,16284,16285,16286,16287,16302,18042,18043,18044,18045,18046,18062,18063,18064,18065,18066,18067,18068,18084,18085,'.
This sequence of numbers goes on for another 114M (size of file with
complete error). My max_allowed_packet on the DB server is currently set
to 100M as part of troubleshooting this, which seems high for normal
everyday use.
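For anyone following along, that variable can be checked and raised from a
standard MySQL console (SET GLOBAL takes effect for new connections):

  SHOW VARIABLES LIKE 'max_allowed_packet';
  SET GLOBAL max_allowed_packet = 104857600;  -- 100M, the value used above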

Another question I have is whether the pre-0.4 method is dangerous to use.
It seems to work great for me, but I want to make sure the 0.4 change was
only meant to address memory issues and didn't add any other cleanup or
functionality on top. Basically, since pre-0.4 seems to work for me, could I
just keep using it safely?

Thanks!
Jake


Jake,

Can you please paste (or send me directly) the first few lines, where
the report-related queries start?

I cannot figure out why it would perform worse, and would appreciate the
logs, as I can't reproduce it locally.

thanks,
Ohad

On Wed, Mar 21, 2012 at 5:01 PM, jmccann wrote:

Ohad,

I emailed the log to you according to the email registered from this group.
Let me know if there are any issues.

Thanks,
Jake

> Ohad,
>
> I emailed the log to you according to the email registered from this group.
> Let me know if there are any issues.

OK, one simple thing I found out is that we were using the following query:

Report Load (16588.0ms)  SELECT id FROM reports WHERE (reports.id >= 0) AND (created_at < '2011-06-16 14:13:02') ORDER BY reports.id ASC LIMIT 1000

If you run the following query in the MySQL console using EXPLAIN:

explain SELECT id FROM reports WHERE (reports.id >= 0) AND (created_at < '2012-06-16 14:13:02') ORDER BY reports.id ASC LIMIT 1000;

You would see that the query was forced to scan the entire reports
table and was not using the indexes.

It should be a "bit" faster to change from created_at to reported_at
(created_at has no index at all). I assume that is why the older code worked
better for you (it filtered on reported_at, which has an index), and the
find-in-batches change was just a side effect.
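Concretely, the suggested change makes the batched query look like this
(the same query as the log line above, just keyed on the indexed column):

  SELECT id FROM reports WHERE (reports.id >= 0) AND (reported_at < '2012-06-16 14:13:02') ORDER BY reports.id ASC LIMIT 1000;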

We need to figure out the correct index to make this process a bit faster;
in my limited tests, creating an index on multiple columns (id and
reported_at) was not very helpful.

In your case, creating an index is not trivial (you have a large set of
data, and building an index usually locks the table).
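For reference, the kind of single-column index under discussion would be
created with standard MySQL DDL along these lines (the index name and column
choice are illustrative; as noted, on a table this size the build can lock
the table):

  CREATE INDEX index_reports_on_created_at ON reports (created_at);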

Please open a ticket about this one; hopefully someone with better
knowledge of MySQL indexes can help out a bit here.

cheers,
Ohad

On Wed, Mar 21, 2012 at 5:33 PM, jmccann wrote:

Thanks,
Jake



Ohad,

I created Bug #1546 (report::expire slow with many reports) for the slowness issue.

This doesn't seem to address the memory issues, though … rake still climbs
to 2G of memory usage, and even max_allowed_packet set to 100M on my DB
isn't enough.

Since the pre-0.4 method works better for me (faster and no memory issues),
can I safely continue to use that? I just want to make sure the pre-0.4
method isn't skipping some other cleanup or hiding some other issue …

Thanks so much for checking this out!
Jake

Jake,

Before giving up on this, would you mind changing the created_at to
reported_at at the following line?

Thanks,
Ohad

On Thu, Mar 22, 2012 at 8:42 PM, jmccann wrote:

Ohad,

I changed created_at to reported_at and got the same result (20 min to run,
the error, the memory use). Looking at the log I sent you, it seems the
Report Load that took ~16600ms (16 sec) happens only once. So while
improving that is good, I don't think it makes a very big difference to the
issue I'm running into.

If you'd like me to generate a new log to send you with this change let me
know.

Thanks!
Jake


Ohad,

Is there anything else you'd like me to try? I don't want to give up on
this, but I also would like to bring my DB size down.

Thanks!
Jake

> Ohad,
>
> Is there anything else you'd like me to try? I don't want to give up on
> this, but I also would like to bring my DB size down.

Sorry for the late reply Jake; it's been a busy few days. Just wondering,
what are the chances you could pass me a dump of the db? It's fairly hard
to debug this way :)

cheers,
Ohad

On Mon, Mar 26, 2012 at 5:28 PM, jmccann wrote:

Ohad,

I don't think I'll be able to provide a DB dump to you. Will that put an
end to any troubleshooting you could help me with? I'm willing to try other
things and provide logs.

One thing I noticed (and maybe you did too) is that with find_in_batches
most of the time is 'lost' between batches. What I mean is that running the
queries themselves is quick (about 5 sec per batch, at ~5ms per query *
1000 queries per batch), but there is about 1 minute between each batch …
from the log I provided, here are the timestamps of the query immediately
preceding each batch of 'Log Load' queries:

09:14:04
09:14:52
09:15:39
09:16:28
09:17:18
09:18:10
09:19:02
09:19:55
09:20:47
09:21:41
09:22:30
09:23:20
09:24:11
09:25:02
09:25:57
09:26:48

So for me there are 16 batches, with about 1 minute between each. This
doesn't seem to change with the conditions I provide, so it seems to query
all records? Anyway, that comes to ~16 min, which is close to the ~20
minutes it's taking to run this for me.

I'm not super familiar with this code, but I'm ASSUMING (could totally be
wrong) that since this is all from Log Load, it has to do with this block:

# first extract all information from our logs
all_reports, used_messages, used_sources = [],[],[]
Log.find_in_batches do |logs|
  logs.each do |log|
    all_reports   << log.report_id  unless log.report_id.blank?
    used_messages << log.message_id unless log.message_id.blank?
    used_sources  << log.source_id  unless log.source_id.blank?
  end
end

And maybe it's the code there that is causing this to take so long? So
maybe this is where issue #1 (the ~20 min run) is occurring?
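If that block is the culprit, a lower-memory variant (an untested sketch,
not Foreman code) would push the deduplication into the database instead of
building giant Ruby arrays; select_values returns plain values without
instantiating an ActiveRecord object per log row:

  # untested sketch: let MySQL deduplicate instead of uniq! in Ruby
  all_reports   = Log.connection.select_values(
    "SELECT DISTINCT report_id FROM logs WHERE report_id IS NOT NULL").map(&:to_i)
  used_messages = Log.connection.select_values(
    "SELECT DISTINCT message_id FROM logs WHERE message_id IS NOT NULL").map(&:to_i)
  used_sources  = Log.connection.select_values(
    "SELECT DISTINCT source_id FROM logs WHERE source_id IS NOT NULL").map(&:to_i)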

Then the next issue, where it bombs out with out of memory, is occurring
from somewhere in:

all_reports.uniq! ; used_messages.uniq! ; used_sources.uniq!

# reports which have logs entries
used_reports = Report.all(:select => :id, :conditions => {:id => all_reports}).map(&:id)

orphaned_logs = all_reports - used_reports
Log.delete_all({:report_id => orphaned_logs}) unless orphaned_logs.empty?

all_messages = Message.all(:select => :id).map(&:id)
orphaned_messages = all_messages - used_messages
Message.delete_all({:id => orphaned_messages}) unless orphaned_messages.empty?

all_sources = Source.all(:select => :id).map(&:id)
orphaned_sources = all_sources - used_sources
Source.delete_all({:id => orphaned_sources}) unless orphaned_sources.empty?

logger.info Time.now.to_s + ": Expired #{count} Reports"
return count

And since the 'bad' query starts with SELECT id FROM reports WHERE
(reports.id IN (15462,15463,15464,15465 …, my guess is it's specifically
this line:

# reports which have logs entries
used_reports = Report.all(:select => :id, :conditions => {:id => all_reports}).map(&:id)
I could be totally off though … ;)

Thanks,
Jake


I'll try running it under a debugger; please keep reminding me if I'm
not responsive enough :)

Cheers,
Ohad

On Tue, Apr 3, 2012 at 7:34 PM, jmccann wrote:

Just curious if you have any updates on this.

Thanks!
Jake

Hi Jake,

Sadly it's still on my todo list. This week I'm traveling, so I'm not sure
I can get much done on it. Please keep reminding me; we should get it done
before 0.5 comes out (which is hopefully not far away anyway).

Thanks!
Ohad

On Mon, Apr 16, 2012 at 10:12 PM, jmccann wrote:

I think I made some progress. A co-worker and I worked on just getting
this functional in our environment. Since the error was occurring when
generating the used_reports array, we set that up to be done in 'batches'
as well:

# used_reports = Report.all(:select => :id, :conditions => {:id => all_reports}).map(&:id)
used_reports = []
tmp_arr = all_reports.dup
slice_size = 10000
until tmp_arr.empty? do
  first = (tmp_arr.size < slice_size) ? 0 : (tmp_arr.size - slice_size)
  last  = tmp_arr.size
  tmp   = tmp_arr[first...last]
  used_reports << Report.all(:select => :id, :conditions => {:id => tmp}).map(&:id)
  tmp_arr = tmp_arr - tmp
end
used_reports = used_reports.flatten.sort

So instead of trying to run against all 21 million report rows at once in a
single query, it does it in batches of 10,000. However, this loop takes even
longer than the Log loop used to generate the all_reports, used_messages
and used_sources variables. I think what is causing these loops to take so
long is the handling of the arrays: as the arrays get larger and larger,
operating on them takes longer. Part of what makes me believe this is that,
for the loop above, the longer it runs the longer each iteration takes. What
started as 8 seconds to process a 'slice' of 10,000 records becomes 15
seconds after running for a couple of hours. Since the used_reports array
grows each iteration, I think each append to the array takes longer and
longer.
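If the array bookkeeping is the bottleneck, one variant worth trying (an
untested sketch, same queries as above) is each_slice, which walks the id
list without the repeated tmp_arr - tmp subtraction and appends with concat
so no flatten pass is needed:

  used_reports = []
  all_reports.each_slice(10_000) do |slice|
    # one IN query per 10k ids; concat keeps the result array flat
    used_reports.concat(Report.all(:select => :id, :conditions => {:id => slice}).map(&:id))
  end
  used_reports.sort!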

So after making my changes it ended up taking over 8 hours for expire to
run, but I got no errors! Not that that necessarily means it worked, but I'm
running it again now with some before/after numbers to compare (x reports
and y messages exist before, and how many exist after).

As far as making this go faster when run against a DB with millions of
reports, I'm not really sure what to do, but it seems that using all these
arrays is a bottleneck.

I'll post again when I've completed my before/after checks and/or if I have
any more information to add.

Thanks,
Jake


So it does seem to be working and cleaning up (although slow)!

Ran 'rake18 reports:expire days=7 status=100663296 RAILS_ENV="production"',
which cleans up reports with 6 skipped resources and 0 for everything else.
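For what it's worth, that magic number is consistent with the report status
being a packed bit field: 100663296 is exactly 6 shifted left 24 bits, i.e.
a count of 6 in one metric's slot with every other slot zero (the exact
field layout is an assumption here, not taken from Foreman's source):

  # assumption: each metric occupies its own bit range of the status integer
  status = 6 << 24   # => 100663296, matching '6 skipped, 0 everything else'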

Went from:

table          | rows   | data  | index | total_size
puppet.reports | 18.16M | 7.66G | 2.22G | 9.89G
puppet.logs    | 22.47M | 1.48G | 1.29G | 2.77G

to:

table          | rows   | data  | index | total_size
puppet.reports | 16.61M | 7.42G | 2.04G | 9.46G
puppet.logs    | 20.57M | 1.47G | 1.26G | 2.73G

And the run took:
hostname:# time rake18 reports:expire days=7 status=100663296
RAILS_ENV="production"
(in )

real 405m48.577s
user 270m58.900s
sys 9m36.672s

Regards,
Jake

> So it does seem to be working and cleaning up (although slow)!

awesome!


Looking at your code changes, it makes a whole lot of sense. I'll try to
review the whole thing before we release 0.5.

Thanks!!
Ohad

On Fri, Apr 27, 2012 at 11:32 PM, jmccann wrote:

Looking forward to seeing this addressed, of course. :)

As an FYI, the tickets relating to these issues are:
Slowness - Bug #1546: report::expire slow with many reports
Error - Bug #1592: report::expire errors with millions of records

I put my hack code into #1592. Hopefully with those tickets it's easier to
make references than digging into this huge thread.

Thanks!
Jake