Puppet timeouts after Katello 3.8 -> 3.10 upgrade

Problem:
A small percentage of puppet runs are intermittently resulting in timeouts after a 3.8 -> 3.10 upgrade.

Expected outcome:
Puppet runs succeed without timeout

Foreman and Proxy versions:
foreman-1.20.1-1.el7.noarch
foreman-proxy-1.20.1-1.el7.noarch

Foreman and Proxy plugin versions:

Other relevant data:
puppetserver-2.8.1-1.el7.noarch
Puppet 4 clients

From production.log

2019-01-16T14:35:43 [W|app|3ef47] Failed to generate external nodes for rich-iweb-83-iv.snaponglobal.com
Timeout::Error: execution expired

I halfway suspect a thundering herd issue may be involved, but I had never seen a timeout on this instance before this latest upgrade. I wanted to see whether there was a performance regression, and whether I can simply adjust the timeout to mitigate it (I’m not sure where the timeout value is set).

It looks like most manifest generations run sub-second, but a small percentage of runs are failing after 10 seconds with the output below.

[root@foreman ~]# time /etc/puppetlabs/puppet/node.rb puppetclient.domain.com
class or module required for rescue clause

real    0m10.518s
user    0m0.127s
sys     0m0.023s

Does anybody know where the 10 second timeout is defined?

It’s defined in /etc/puppetlabs/puppet/foreman.yaml and should be controlled by --puppet-server-request-timeout in the installer.
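One way to check and reapply that value (a sketch; the installer flag is the one named above, and the grep assumes the timeout key appears literally in foreman.yaml):

```shell
# Reapply the ENC request timeout via the installer; the value should
# end up in /etc/puppetlabs/puppet/foreman.yaml.
foreman-installer --puppet-server-request-timeout 60

# Then verify what the ENC script will actually read:
grep -i timeout /etc/puppetlabs/puppet/foreman.yaml
```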

Hmm, that value is set to 60. 10 seconds is the fallback value in /etc/puppetlabs/puppet/node.rb

Based on testing, it appears to be timing out at 10 seconds rather than 60.

I think there may be a bug where it isn’t correctly reading the value from /etc/puppetlabs/puppet/foreman.yaml

I’ll keep poking at it but any suggestions are welcome

Ok, I spent a little more time trying to figure out how to get it to respect the timeout, but no matter what I did, it consistently timed out at 10 seconds.

Since changing the timeout value seems to be a dead end for me, I instead spent time trying to figure out why it was intermittently running so long. It appears to be related to Postgres query errors that look like this (from puppetserver.log):

"error": {"message":"PG::ProtocolViolation: ERROR: invalid message format\n: SELECT "fact_names".* FROM "fact_names" WHERE "fact_names"."type" IN ('PuppetFactName') AND "fact_names"."name" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, … , $180, $181, $182, …

These first few lines are followed by several dozen more pages of this, up to $66437.

This PG error seems to be newly introduced in 3.10.

Any thoughts about this?

I wonder if there’s a problem with too many values for the PG protocol, but I’m not familiar enough with PostgreSQL to say.
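That hunch looks right: in the PostgreSQL extended-query protocol, the Bind message encodes the number of parameters as a 16-bit integer, so a single statement can bind at most 65535 parameters. A query reaching $66437 overflows that count, and the server rejects it with exactly this kind of protocol error. The usual workaround is to batch the IN list. A hedged Ruby sketch (table and column names are taken from the logged query; `batched_in_queries` is an illustrative helper, not Foreman code):

```ruby
# Postgres caps bind parameters per statement at 65535 (Int16 in the
# wire protocol), so split a large IN list into batches below the cap.
PG_MAX_PARAMS = 65_535

def batched_in_queries(names, batch_size: 10_000)
  raise ArgumentError if batch_size > PG_MAX_PARAMS
  names.each_slice(batch_size).map do |batch|
    placeholders = (1..batch.size).map { |i| "$#{i}" }.join(', ')
    sql = "SELECT \"fact_names\".* FROM \"fact_names\" " \
          "WHERE \"fact_names\".\"name\" IN (#{placeholders})"
    [sql, batch]  # each pair can be run via exec_params with the pg gem
  end
end
```

With 66437 fact names this yields 7 statements, each safely under the protocol limit, instead of one statement that the server refuses to parse.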

Thanks. I’m still struggling with this a week later.

I’ll file an issue for the PG error.

The Postgres protocol violation turned out to be related to hosts with large numbers of LUNs attached. I opened issue #25923 for that.

I still can’t figure out whether this is what’s causing my other hosts to randomly fail to pull their manifests.

Looks like this may be related to Refactor #25053: Run expensive db_facts query just once and unordered - Foreman, which was included in Foreman 1.20. As a workaround, you may be able to use the “Exclude pattern for facts stored in foreman” setting (under Provisioning) to skip the LUN facts and reduce the size of the imported fact list. That would likely both resolve this issue and speed up fact importing in general, unless you need the LUN-related facts in Foreman.
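For example, that setting takes a comma-separated list of patterns; something along these lines might cover LUN/block-device facts (hypothetical patterns; check them against the actual fact names your hosts report before excluding anything):

```
blockdevice_*, mpath*
```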