Strange errors intermittently appear during puppet runs

Hello,

Last night, out of nowhere, a problem appeared on our Foreman server. This
problem is currently making it impossible for us to manage our systems
with Foreman.

What happens is that errors like these are thrown on "File" resources:

/Stage[main]/Concat::Setup/File[/var/lib/puppet/concat/bin/concatfragments.sh]
Could not evaluate: Could not retrieve file metadata for
puppet:///modules/concat/concatfragments.sh: Error 400 on SERVER: undefined
method `encoding' for #<String:0x7f5949534928>
Wrapped exception: Error 400 on SERVER: undefined method `encoding' for #<String:0x7f5949534928>
notice Puppet Using cached catalog
err Puppet Could not retrieve catalog from remote server: Error 400
on SERVER: Failed to submit 'replace facts' command for
gbtcs-bast-r270q12014-g4s1202.eu-ch-1.swisswebservices.ch to PuppetDB at
lupin-pup-vm-0736d5df.eu-ch-1.swisswebservices.ch:8081: undefined method
`encoding' for "{\n \"uuid\" : \"d3c5733f-ae52-45d7-8f57-b9840950c591\"\n}":String
err /File[/var/lib/puppet/lib] Could not evaluate: Could not retrieve
file metadata for
puppet://lupin-pup-vm-0736d5df.eu-ch-1.swisswebservices.ch/plugins:
Error 400 on SERVER: undefined method `encoding' for #<String:0x7f59497521d8>
Wrapped exception: Error 400 on SERVER: undefined method `encoding' for #<String:0x7f59497521d8>
err /File[/var/lib/puppet/lib] Failed to generate additional resources
using 'eval_generate': Error 400 on SERVER: undefined method `encoding'
for #<String:0x7f59497768f8>
err /File[/var/lib/puppet/facts.d] Could not evaluate: Could not
retrieve file metadata for
puppet://lupin-pup-vm-0736d5df.eu-ch-1.swisswebservices.ch/pluginfacts:
Error 400 on SERVER: undefined method `encoding' for #<String:0x7f5949a95d68>
Wrapped exception: Error 400 on SERVER: undefined method `encoding' for #<String:0x7f5949a95d68>
err /File[/var/lib/puppet/facts.d] Failed to generate additional
resources using 'eval_generate': Error 400 on SERVER: undefined method
`encoding' for #<String:0x7f5949b4c450>

If I run puppet by hand I get similar errors.

What has us completely scratching our heads at the moment is that the
errors are not consistent. Sometimes when I run it by hand I don't get
errors; sometimes I do. I don't always get the same number of errors.
With the automatic runs, too, there are instances where the run goes without
problems, and instances where it runs into problems.

As a result, about 2/3 of my servers are in "error" at the moment. The
odd thing is that the group of servers in error shrinks if I reboot the
puppet/foreman server, so more hosts successfully manage to complete a puppet
run. After a while it increases again, until it stabilises.

The systems are RHEL 5.6, and they have:
Foreman 1.4.2
Ruby 1.8.7
Puppet 3.6

I'm completely at a loss here. Any suggestions where I can start digging
for a cause?

Krist

> The systems are RHEL 5.6, and they have:
> Foreman 1.4.2
> Ruby 1.8.7
> Puppet 3.6

Puppet 3.6 is currently unsupported, so try downgrading that to 3.5

OK, I will try that. However, this does not explain why the problems
only appeared several days after the upgrade to Puppet 3.6… Nor does it
explain why the errors sometimes appear and sometimes don't. I've seen
instances where I ran puppet on the command line twice in a row and got
different errors each time. I see servers having runs both with and without
errors. And I see the number of servers in error drop whenever I reboot
the puppetmaster/foreman host.

All this is very puzzling…

On Tuesday, June 3, 2014 5:11:50 PM UTC+2, Greg Sutcliffe wrote:

I rebooted the puppetmaster/foreman server, downgraded to puppet 3.5.1 and
for a while everything looked good. All servers performed their runs
without problems. But after about three hours the problems started to
return. At the moment about 1/3 of my servers are throwing errors again.
Not always, though. And the errors seem to be the same all the time:

"Could not evaluate: Could not retrieve file metadata for
puppet:///modules/sws/RPM-GPG-KEY-SWS: Error 400 on SERVER: undefined
method `encoding' for #<String:0x7fa6d91e5e60>
Wrapped exception: Error 400 on SERVER: undefined method `encoding' for #<String:0x7fa6d91e5e60>"

Does anybody have a clue what is going on here?

Krist

On Wednesday, June 4, 2014 7:29:23 AM UTC+2, Krist van Besien wrote:

The trouble is, this error could be originating from anywhere, but most
likely some sort of plugin (e.g. a custom type/provider, terminus,
report processor). As it's server-side too, you ideally want to run the
Puppet master with "--trace" to get a full stack trace, then you can
narrow down the source.

With Passenger, you could try adding an 'ARGV << "--trace"' line to
/etc/puppet/rack/config.ru and restarting it, then check syslog and see
if it logs the full stack trace there. If not, you could disable the
Puppet vhost and run the master by hand with "puppet master
--no-daemonize -v --trace".
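For illustration, the relevant part of /etc/puppet/rack/config.ru would then look something like this (a sketch based on the stock Puppet 3.x config.ru; exact paths and options vary by packaging, so keep the rest of your file as shipped):

```ruby
# /etc/puppet/rack/config.ru (excerpt, sketch only)
$0 = "master"

ARGV << "--rack"
ARGV << "--trace"      # added: log full server-side stack traces on errors
# ARGV << "--debug"    # optionally: verbose logging as well

require 'puppet/util/command_line'
run Puppet::Util::CommandLine.new.execute
```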

On 04/06/14 08:26, Krist van Besien wrote:


Dominic Cleal
Red Hat Engineering

Thank you for this tip. This led me to the root cause, and it appears that
I've managed to solve our problem.
The trace showed that the errors apparently originated in rubygems-json.

We are using Foreman to build an OpenStack installation, and for this
reason I had at one point subscribed all our machines to the RHEL OST4
channel on our satellite. This led to rubygems-json being updated, as the
version in RHEL OST4 is higher than the one in the Puppet Labs repository.
I looked carefully at the history of the latest updates and managed to roll
back the ones that caused this problem.
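That would also fit the error text: `String#encoding` was only added in Ruby 1.9, so a json gem built for 1.9+ blows up on our Ruby 1.8.7. A minimal illustration of the version difference (nothing Puppet-specific):

```ruby
# Ruby 1.9 added String#encoding; on Ruby 1.8.7 the method simply does
# not exist, so any library written for 1.9+ that calls it raises
# "undefined method `encoding' for #<String:...>".
s = "hello"
if s.respond_to?(:encoding)
  puts "Ruby 1.9+: #{s.encoding}"      # e.g. UTF-8
else
  puts "Ruby 1.8: no String#encoding"  # the NoMethodError case
end
```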

Everything appears to be OK now.

On Wednesday, June 4, 2014 9:57:28 AM UTC+2, Dominic Cleal wrote:

Running puppet goes well for most hosts, most of the time. But sometimes, on
some hosts, an error is thrown. This error repeats itself for a couple of
runs, then disappears, then often returns. Out of 60+ hosts, about 4 are
usually showing as "in error" in Foreman.

The error looks like this:

$ sudo puppet agent --test
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/concat_basedir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ip6tables_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/rabbitmq_erlang_cookie.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/os_maj_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/sws_mariadb_active_node.rb
Info: Loading facts in
/var/lib/puppet/lib/facter/iptables_persistent_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lvm_support.rb
Info: Loading facts in /var/lib/puppet/lib/facter/iptables_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pacemaker.rb
Info: Loading facts in /var/lib/puppet/lib/facter/facter_dot_d.rb
Error: Could not autoload puppet/provider/logical_volume/aix: can't convert
Hash into Integer
Error: Could not autoload puppet/type/logical_volume: Could not autoload
puppet/provider/logical_volume/aix: can't convert Hash into Integer
Error: Could not retrieve catalog from remote server: Could not intern from
text/pson: Could not autoload puppet/type/logical_volume: Could not
autoload puppet/provider/logical_volume/aix: can't convert Hash into Integer
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

The odd things about this error are:

  • the autoload doesn't always complain about the same file. Run puppet a
    couple of times, and you'll see complaints about different files.
  • the error doesn't always appear.

When I run puppet it will sometimes produce this error, and sometimes it
doesn't. This is thus again an error that seems to appear and disappear out
of nowhere.
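For what it's worth, one well-known way a "can't convert Hash into Integer" surfaces on old Rubies is `IO.read`'s second argument: it is an options hash on 1.9+, but a length (an Integer) on 1.8. I can't say this is Puppet's exact code path here, but as a hedged illustration:

```ruby
require 'tempfile'

# IO.read's second argument changed meaning between Ruby versions:
#   Ruby 1.8:  File.read(path, length)       -- length must be an Integer
#   Ruby 1.9+: File.read(path, options_hash) -- e.g. :mode, :encoding
# Code written for 1.9+ therefore raises
# "TypeError: can't convert Hash into Integer" when run on 1.8.
f = Tempfile.new('demo')
f.write('hello')
f.close

begin
  data = File.read(f.path, :mode => 'r')  # fine on 1.9+, TypeError on 1.8
  puts data
rescue TypeError => e
  puts "Ruby 1.8 behaviour: #{e.message}"
end
```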

I tried to increase the trace and debug level by editing
/etc/puppet/rack/config.ru, but this time I'm not getting any useful
information about these errors in the log file.
So is there some other trace/debug setting I can switch on? Some other log
file I should look at?
Is there a command I can type in on the puppetmaster that does the same
thing as whatever it is that the puppet client is trying to do?

I really need to find out why this happens, and I'm at a loss here.

Krist

On Wednesday, June 4, 2014 7:27:38 PM UTC+2, Krist van Besien wrote:

> Everything appears to be OK now.

Well, everything was OK. Random errors have returned.

puppet agent --test --trace -> almost always gives me errors.
puppet agent --test --trace --debug -> gives me a lot of output, but no
errors. Puppet run finishes without problems.

This looks like something on the client, not the server, and it looks like
a race condition to me. Or a Heisenbug…

Krist

On Tuesday, June 17, 2014 1:54:30 PM UTC+2, Krist van Besien wrote:

It gets weirder. I tried running puppet with --trace and --debug, e.g.:

I've seen weird, odd errors myself. Not the same errors as yours, but odd
errors nonetheless. I've boiled it down to one thing:

Netapp load.

I have all my puppet modules and hiera config on a shared NFS mount on a
NetApp. During the midnight puppet run, some hosts get random errors,
some don't. Almost all of them have the exact same manifest. The midnight
hours are when all the snapshots, backups, etc. happen.

If I pull all my puppet modules to local disk, the errors go away.

I'm not saying this is your issue, I'm just tossing out a potential place
to look.

On Tuesday, June 17, 2014 5:10:13 AM UTC-7, Krist van Besien wrote:

Our puppetmaster runs in a VM. I suspected a load/resource-contention/race
issue as well. I'll increase the resources available to the VM and see what
happens…

On Wednesday, June 25, 2014 10:56:09 PM UTC+2, David Patterson wrote:

Well, I think I found the source of our problem…

Searching Google again for error messages similar to mine, I landed on this
page:

http://github.com/thias/puppet-libvirt/issues/31

We run RHEL 6.5, and it is similarly plagued with an outdated Ruby version.
I applied the fix suggested in the above post, and the errors have
completely disappeared…

I think that moving to a more recent version of Ruby would be a good idea
as well. Are there any things to watch out for when moving to the SCL
version of Ruby on RHEL 6.5?
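In the meantime, a trivial guard like this, run under whichever Ruby the master uses, at least makes the old-Ruby risk visible (a sketch; the 1.9 threshold is simply where `String#encoding` and the `IO.read` options hash appeared):

```ruby
require 'rubygems'  # provides Gem::Version for robust version comparison

puts "Running under Ruby #{RUBY_VERSION}"
if Gem::Version.new(RUBY_VERSION) < Gem::Version.new('1.9')
  warn 'Pre-1.9 Ruby: String#encoding and IO.read with an options hash are missing'
end
```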

Krist

On Tuesday, July 8, 2014 9:10:57 AM UTC+2, Krist van Besien wrote: