Hosts without changes in Puppet reports show as "out-of-sync" on host status

areyus · January 4, 2023, 10:34am

Problem:
Hosts that have performed puppet runs regularly (more recently than out-of-sync interval) via the puppet service show as “out-of-sync” in the host status, both in UI and via API (api/hosts/:id/status/configuration):

Yet, there are Puppet reports processed for the hosts regularly:

We have Puppet configured to run hourly and out-of-sync interval is set to 70 minutes.
The hosts do not show as out of sync in the dashboard widget with “last_report < “70 minutes ago” and status.enabled = true” search query and do not have alerts disabled.
The status turns to “No changes” when I trigger a manual puppet run via puppet agent -t on that host, but changes back to “out-of-sync” after the out-of-sync interval.
The only difference that I noticed is that puppet runs triggered from the agent service only contain the line “Applied catalog in X.Y seconds”, while runs from puppet agent -t contain some more info lines:

I first suspected Fixes #35684 - Drop Applied catalog lines by ekohl · Pull Request #25 · theforeman/puppet-puppetserver_foreman · GitHub to be the culprit, but since we are still on 3.2 something else seems to be broken here. This was working before we upgraded to 3.2 (I we were on 2.5 before iirc).

Expected outcome:
Hosts reporting their configuration status as “No changes” again on uneventful reports.

Foreman and Proxy versions:
3.2.1

Foreman and Proxy plugin versions:
Plugins:

foreman-tasks 6.0.1
foreman_expire_hosts 7.0.4
foreman_hooks 0.3.17
foreman_puppet 3.0.7
foreman_remote_execution 6.0.0
foreman_scc_manager 1.8.20
foreman_snapshot_management 2.0.1
foreman_templates 9.1.0
katello 4.4.1
puppetdb_foreman 5.0.0

Distribution and version:
RHEL7 (production machine) and RHEL8 (testing machine) both show the same behavior

Other relevant data:

areyus · January 9, 2023, 9:37am

Noone else experiencing this?
I tried taking a look at the code, but I cannot find where the Puppet Report status is determined.

gvde · January 9, 2023, 10:28am

--test is a combination of multiple other options:

* --test:
  Enable the most common options used for testing. These are 'onetime',
  'verbose', 'no-daemonize', 'no-usecacheonfailure', 'detailed-exitcodes',
  'no-splay', and 'show_diff'.

Thus, you’ll see verbose output if you are using --test.

Did you set the puppet_interval setting accordingly?

Default is 35 according to https://github.com/theforeman/foreman_puppet/blob/master/lib/foreman_puppet/register.rb#L31

Settings - Content Management - Puppet Interval

areyus · January 9, 2023, 10:33am

Yes, I know. I am not surprised, that the --test runs have more verbose output, just wanted to point out that the this resulted in more messages in the report as well.

Yes, both puppet_interval and out_of_sync_interval are set to 70.

We have not touched any of these settings in a long time, and it has worked in the past.

kalessin · January 9, 2023, 3:12pm

Hi, If I remember correctly, it was something about katello becoming incompatible, deprecated or something. It’s anoying.
having that randomly also. Setting the out of sync to a bigger # than puppet interval may help.

areyus · January 10, 2023, 9:18am

In fact, it looks like I may have misinterpreted the behavior (or just picked bad samples).
I have now observed that the runs triggered by the agent service in fact do cause a “no changes” status, but that only lasts for about 10 minutes and then changes to “out-of-sync”, despite both intervals are set to 70 minutes.
Runs triggered manually via “puppet agent -t” just keep the “no changes” status longer, for what I assume is in fact the out-of-sync-interval setting.

nadjaheitmann · January 10, 2023, 12:30pm

Hi @areyus, there is a change for the out of sync status that affects Foreman 3.2

Maybe you could check if this relates to your problem.

areyus · January 11, 2023, 8:42am

I have tried reverting that patch on our test Foreman, and it in fact looks like the issue I see is related to that change. Once I reverted to the previous state, hosts show as “No Changes” again correctly. After restoring the state as “shipped” in 3.2, the hosts returned to “out of sync” again.
So it looks like something causes Foreman to think the outofsync_interval should be 10 Minutes instead of the configured 70 for reports that only contain the “Applied catalog in X.Y seconds” line.

areyus · January 11, 2023, 10:32am

I took a closer look at the code and added some log statements for debugging.
It appears to me that patch only revealed another underlying problem. From what I see, it appears that last_report actually picks up the second-to-last report.
The host I am currently looking at has it’s last reports at 11:16, 11:13 and 10:32 AM all in UTC+1.
I added a log statement that gives me the time of the last report compared to the time for outofsync calculation. For that, I patched the out_of_sync? function in configuration_status.rb to look like this:

    def out_of_sync?
      if (host && !host.enabled?) || no_reports? || out_of_sync_disabled?
        false
      else
        !reported_at.nil? && logger.error {"reported_at: #{reported_at} oos at: #{(Time.now.utc - expected_report_interval)}"} && reported_at < (Time.now.utc - expected_report_interval)
      end
    end

If I call the hosts details page with that, I get the following output in the log:

reported_at: 2023-01-11 11:13:03 +0100 oos at: 2023-01-11 09:10:57 UTC

Before I triggered the 11:16 puppet run, I got reported_at as 10:32 UTC+1 and before that it was 09:32 UTC+1. So it never actually checks the most current report, but the report before that.
Since out Puppet agents are set up to run once an hour, and the puppet_outofsync interval is set to 70 minutes, that explaines the behaviour I observed. Manually triggered Puppet runs are not processed differently, the just add another report so a report within the configured timeframe is processed as last_report.
Before the patch you linked, the interval that was checked for outofsync was just higher (due to adding puppet_interval and outofsync interval), which did hide this problem by (in our case) doubling the out of sync time to 2:20 hours.

nadjaheitmann · January 12, 2023, 10:08am

Any idea if this is a known bug? I remember there was a reason for this particular implementation of the out-of-sync-interval. However, there was also the agreement that it needs cleanup

Maybe @Marek_Hulan can comment on this issue?

areyus · January 12, 2023, 10:57am

No idea if this might be a known bug, I found Bug #24658: Foreman out of sync with puppet - Foreman, which I am not entirely sure if it’s describing the same problem and is rejected as a question to the community. All other issues I could find mentioning last_report_object (https://projects.theforeman.org/search?utf8=✓&q=last_report_object&all_words=&all_words=1&titles_only=&issues=1&news=1&documents=1&changesets=1&wiki_pages=1&messages=1&projects=1&attachments=0&options=0&commit=Submit) are around the time of 1.11, and seem to be resolved.
Looking at the code (with my very limited ruby knowledge) screening through the old issues, it appears to my that last_report_object in Host::Managed is picking up the right report, but something is going wrong when processed through configuration_status:

irb(main):002:0> Host.find_by_name('host.example.com').last_report_object.attributes
=> {"id"=>1890944, "host_id"=>3026, "reported_at"=>Thu, 12 Jan 2023 10:32:53 UTC +00:00, "created_at"=>Thu, 12 Jan 2023 10:33:22 UTC +00:00, "updated_at"=>Thu, 12 Jan 2023 10:33:22 UTC +00:00, "status"=>0, "metrics"=>"---\nresources: !ruby/hash:ActiveSupport::HashWithIndifferentAccess\n  changed: 0\n  corrective_change: 0\n  failed: 0\n  failed_to_restart: 0\n  out_of_sync: 0\n  restarted: 0\n  scheduled: 0\n  skipped: 0\n  total: 389\ntime: !ruby/hash:ActiveSupport::HashWithIndifferentAccess\n  alternatives: 0.0723974\n  anchor: 0.00046828399999999996\n  augeas: 0.977220589\n  catalog_application: 12.243165450985543\n  concat_file: 0.0006799409999999999\n  concat_fragment: 0.0022346659999999993\n  config_retrieval: 6.711204092949629\n  convert_catalog: 0.4055725709768012\n  cron: 0.000595406\n  dnf_automatic_config: 0.0060485009999999995\n  exec: 5.430379371000001\n  fact_generation: 4.254753187997267\n  file: 1.8560995030000003\n  file_line: 0.008712191000000001\n  filebucket: 5.7626e-05\n  filesystem: 0.059137277999999995\n  group: 0.002087073\n  logical_volume: 0.5985904820000001\n  mount: 0.006485834000000001\n  nm_network_config: 0.052451849\n  package: 1.5809123450000013\n  plugin_sync: 2.2959843969438225\n  posix_acl: 0.010011\n  selboolean: 0.020820456\n  selinux_fcontext: 0.000404316\n  service: 0.449978667\n  ssh_authorized_key: 0.003472221\n  total: 26.745568624\n  transaction_evaluation: 12.100615225965157\n  user: 0.012725420000000001\nchanges: !ruby/hash:ActiveSupport::HashWithIndifferentAccess\n  total: 0\nevents: !ruby/hash:ActiveSupport::HashWithIndifferentAccess\n  failure: 0\n  success: 0\n  total: 0", "type"=>"ConfigReport", "origin"=>"Puppet"}

The time of the last_report_object is correct, although it was converted to UTC from UTC+1, obviously during report processing since the times in the DB are the UTC ones:

foreman=# select id, host_id, reported_at, type, origin from reports where host_id = 3026 order by id desc;
   id    | host_id |     reported_at     |     type     | origin
---------+---------+---------------------+--------------+--------
 1890961 |    3026 | 2023-01-12 10:43:27 | ConfigReport | Puppet
 1890944 |    3026 | 2023-01-12 10:32:53 | ConfigReport | Puppet
 1890861 |    3026 | 2023-01-12 09:32:53 | ConfigReport | Puppet
...

Marek_Hulan · February 9, 2023, 2:56pm

So is this a TZ issue? Like we drop the TZ somewhere on the way and therefore comparing wrong report date? That would explain why it wasn’t seen by more users.

areyus · February 10, 2023, 8:00am

I don’t know what happened, but I just took another look and the problem just disappeared.
We had some maintenance this weekend (RHEL8 upgrade, migration to shellhooks from foreman_hooks), but I doubt that is related since the testing environment was also affected and that had those changes already applied when we first encountered this problem.
Unsatisfying to not know what caused this and how it went away, but at least it’s gone for now-