New config report summary columns

lzap · December 13, 2021, 2:09pm

Hey,

we are working on a new config report format. There was a great discussion on the mapping of status/metric from Puppet and Ansible, I think this deserves its own thread. This page is a wiki and I will be updating it to fine-tune according to the proposals.

Summary

These fields are stored in reports table as integer column and represent the following:

changed - number or changes (events) occured in the report
unchanged - number of resources/actions that are aligned with configuration and no action was taken
failed - number of failures in the report

To retrieve eventful reports, the query is: changed > 0 OR failed > 0, the other field should not be taken into consideration. We can optionally set HasChange keyword but probably not worth it since we can easily query the data right away from the summary columns. The other field is rather informative as multiple statuses are mapped to this field. There is a proposal perhaps to remove this column completely.

Unfortunately Ruby on Rails have changed as a reserved word therefore the names in the DB and API must be:

change
nochange
failure

The plan is to map these to one or more statuses (or metrics) of various implementations. Here is the mapping proposal, of course the mapping is additive (when multiple states map to a single summary the sum is added).

Note: In the original proposal, there was a summary field “other” but it does not proven to be useful as keywords can serve pretty well to recognize the special cases.

Puppet

Summary

I am marking root of the report as yaml so it is clear where to find the data: smart_proxy_reports/puppet6-foreman-deb.yaml at develop · theforeman/smart_proxy_reports · GitHub

yaml.metrics.events.failure → failed
yaml.metrics.events.success → changed
yaml.metrics.resources.total - events.failure - events.success → unchanged

Pupet report also have yaml.metrics.resources but these are only stored in the JSON body field and can be optionally rendered to UI/CLI when report is opened. Comment in the UX thread if you want to discuss where/when/how to present these metrics on a page. Note for performance reasons, this cannot be presented on reports index page as the app would need to open and parse all records field body which is a JSON (blob/text).

Keywords

yaml.status=changed → PuppetStatusChanged
yaml.status=unchanged → PuppetStatusUnchanged
yaml.status=failed → PuppetStatusFailed
yaml.noop=true → PuppetNoop
yaml.metrics.resources.corrective_change > 0 → PuppetCorrectiveChange
yaml.metrics.resources.skipped > 0 → PuppetSkipped
yaml.metrics.resources.restarted > 0 → PuppetRestarted
yaml.metrics.resources.failed_to_restart > 0 → PuppetFailedToRestart
yaml.metrics.resources.scheduled > 0 → PuppetScheduled
yaml.metrics.resources.out_of_sync > 0 → PuppetOutOfSync

Keywords allow searching for particular reports by individual resources, but it is not practical to break down all resources into keywords this would create too many records. The proposal is:

yaml.resource_statuses.RESOURCE[yyz].failed = true → PuppetFailed:RESOURCE[xyz]

Let us know if you need to search for other resources states, but let’s keep this as minimum as possible to keep reports fast.

Ansible

Summary

To differentiate the fields, json prefix is the root of a report. Example report: Send report summary unchanged by lzap · Pull Request #1325 · theforeman/foreman-ansible-modules · GitHub

Because the final summary is not very well defined by Ansible (nothing in the documentation, code is unclear, our testing did show various results) and because we want consistent mapping, let’s ignore summary and aggregate data from the list of results:

`json.results[].result.changed`	`json.results[].failed`	Action
true	true	failed++
false	true	failed++
true	false	changed++
false	false	unchanged++

Previously there was a json.status field but it did not report all possible Ansible states so new field named json.summary will be sent: Send report summary unchanged by lzap · Pull Request #1325 · theforeman/foreman-ansible-modules · GitHub

Keywords

Keywords, on the other hand, are mapped to summary fields. This allows searching capability.

json.summary.changed > 0 → AnsibleChanged
json.summary.failures > 0 → AnsibleFailures
json.summary.unreachable > 0 → AnsibleUnreachable
json.summary.rescued > 0 → AnsibleRescued
json.summary.ignored > 0 → AnsibleIgnored
json.summary.skipped > 0 → AnsibleSkipped

Status

Overall easy to understand status calculated from summary fields:

Changes - YELLOW: when changed > 0 and failed = 0
Failures - RED: when failed > 0
No changes - GREEN: unchanged > 0 and failed = 0
Empty - GREY: when everything is 0

The colors do not necessary represent colors that needs to be used in the UI, these are proposals about how things are important with RED obviously mearning “attention”, YELLOW meaning “something interesting happened” and GREEN with “everying in order”.

CHANGELOG

renamed applied to changed (not sure why I started with applied but both in Puppet and Ansible the statuses are both named “changed”)
renamed “pending” to “unchanged”
keeping other summary field for now but we might consider dropping it alltogether
defined what is eventful report
corrected ansible metrics: Send report summary unchanged by lzap · Pull Request #1325 · theforeman/foreman-ansible-modules · GitHub
updated overall status to be more simple (just four states)
dropped other summary field
puppet mapping is now based on events
puppet kewords redefined
ansible ignores rescued/ignored/skipped
added ansible keyword mapping
added AnsibleUnreachable
added Ansible mapping from change/failure instead of summary fields
added note about Rails name problem (changed)
total events is only available in metrics.resources so changing to: yaml.metrics.resources.total - events.failure - events.success → unchanged

mhjacks · December 13, 2021, 4:29pm

I can agree to all column mappings.

Maybe the synthetic “eventful” should also include “other > 0”, since we’re mapping “skipped” to “other”? (Another solution might be putting skips in the “failed” column, which would preserve the definition for eventful that you proposed.) I think the most normal case of skips (resource failures and dependency chains) WOULD be caught by the existing mapping. I don’t recall offhand how things that are skipped due to schedule would be reported, but if those are all in other, the proposed mapping works.

ekohl · December 14, 2021, 11:58am

In this post I already linked to the relevant reference documentation.

What you are mentioning are all the individual metrics, but I think you’re missing some of the higher level this that make it much easier to parse.

First of all, at the top level there’s a field status which is an enum (failed, changed, or unchanged).

I really think that’s really useful since you don’t need to parse anything. The query becomes very quick. You can essentially use if (status != 'unchanged') { set_keyword('HasChange') }.

Then for the table, at first I thought you were referring to Puppet::Resource::Status in the report. However, later I looked at some more details and now I wonder if you are actually looking at this:

github.com

puppetlabs/puppet/blob/c4e72a38fad7d768a38a6f80d9a4c435a5ee9b8b/api/schemas/report.json#L304-L317


      
          {
              "description": "The name of the value",
              "enum": [
                  "failed",
                  "out_of_sync",
                  "changed",
                  "total",
                  "skipped",
                  "failed_to_restart",
                  "restarted",
                  "scheduled",
                  "corrective_change"
              ]
          },

For that it’s odd that the reference documentation states:

Includes the metrics failed, out_of_sync, changed, and total. Each value in the resources category is an integer.

So please clarify what you are parsing.

I also think corrective_change is already captured in changed. Some background: a corrective_change says is that the change happened while the catalog didn’t change while a non-corrective change is when the catalog changes. The catalog changes (for example) when a classes is added/removed or some variable changes. In other words: the admin changed the expectations on the system.

So a corrective_change means that is something on the system wasn’t in line with the intended profile. However, a change is already a change. This would be good to test. I’ll see about writing up some tests.

If you do want to introduce a new feature, this could be highlighted. However, you also have it at the report level (so no need to get it out of the metrics). So you can set a keyword CorrectiveChange on the report itself so you can easily search for it. I think the log lines themselves already contain (corrective change) so there’s no need to do anything on the individual events.

ekohl · December 14, 2021, 12:26pm

unchanged is not eventful so it should be changed > 0 OR failed > 0.

I also did some further reading on Puppet and opened:

https://tickets.puppetlabs.com/browse/DOCUMENT-1286
https://tickets.puppetlabs.com/browse/DOCUMENT-1287

I have a hunch your current suggestion may be counting things twice.

ekohl · December 14, 2021, 5:06pm

Because I was uncertain about the various cases, I started this:
https://github.com/theforeman/smart_proxy_reports/pull/17
The goal is to produce relevant reports. Since it’s source based, you can easily swap to a different Puppet version to compare. You can also easily modify it to send reports to your real Foreman and analyze.

If people have more examples, I’d welcome them.

This makes me think that the easiest way to summarize a Puppet report is to look at metrics.events. There you have:

total: total number of events
failure: number of failed resources
success: number of changes
(optionally) noop: number of noop resources

So if you want a summary, the mapping becomes:

success → changed
failed → failed
noop → other

What would really help this discussion if we actually had a reference to the data model where we’re trying to put data into.

Then we can apply keywords:

HasChange if status != ‘unchanged’
NoOp if noop

It should also be marked as a failure if transaction_completed is false.

lzap · December 15, 2021, 8:57am

I have looked into this, the problem is that with tags you can easily create many skipped resources and these must not count as failures as you intentionally skipped them. That was the reason to go for other.

Ah I was so hard-focused on metrics that I missed these completely.

Ok, makes sense then to do the following:

status=changed → keyword:PuppetStatusChanged
status=unchanged → keyword:PuppetStatusUnchanged
status=failed → keyword:PuppetStatu Failed

Added, this is quite flexible yeah.

I am looking at a specific example as I prefer real-world examples rather than spec docs:

github.com

theforeman/smart_proxy_reports/blob/develop/test/fixtures/puppet6-foreman-deb.yaml#L22-L53


      
          metrics:
            resources:
              name: resources
              label: Resources
              values:
              - - total
                - Total
                - 405
              - - skipped
                - Skipped
                - 0
              - - failed
                - Failed
                - 1
              - - failed_to_restart
                - Failed to restart
                - 0
              - - restarted
                - Restarted
                - 0

This file has been truncated. show original

Very good point, I think I am not utilizing keywords enough in the new design. It allows much flexible searching that is possible today. Added.

Good catch, this was indeed an overlook.

Oh this looks promising and it nicely maps. I did not see that, again, I was so narrow-focused.

I dont quite get this one, noop is a boolean in Puppet YAML, other summary field is an int.

To me it looks like we should probably drop the other summary field completely as we can use keywords for other interesting state presence.

What could be interesting is a setting that would be a list of keywords you want to show on the index page. There can be eventually many keywords (all failed resources) so it is not practical to try to put them all on the page, however, if user would specify which are interesting for them, let’s say up to 3 keywords it could be pretty easy to show them. I would imagine that interesting keywords would include something like noop or corrective change.

Ok I am going to update the proposal now, I think we are getting there.

lzap · December 15, 2021, 9:11am

I’ve updated the OP. See the CHANGELOG at the botton or use diff feature to see the changes.

I would appreciate a final review and also I would appreciate someone who is using Ansible, Chef or Salt to chip in.

ekohl · December 15, 2021, 10:11am

noop should be a count. It is the number of resources that it would try to apply.

So from this example:

If we quickly retrieve the metrics we see (using yq .metrics.events 8/*.yaml):

{
  "name": "events",
  "label": "Events",
  "values": [
    [
      "total",
      "Total",
      4
    ],
    [
      "failure",
      "Failure",
      0
    ],
    [
      "success",
      "Success",
      0
    ],
    [
      "noop",
      "Noop",
      4
    ]
  ]
}

There is also another case which I haven’t covered in that PR, but that’s where the agent is in noop mode while in sync (fixup! Add a script to collect various Puppet reports · theforeman/smart_proxy_reports@a464425 · GitHub).

In that case it reports 0 changes and doesn’t report the noop event:

{
  "name": "events",
  "label": "Events",
  "values": [
    [
      "total",
      "Total",
      0
    ],
    [
      "failure",
      "Failure",
      0
    ],
    [
      "success",
      "Success",
      0
    ]
  ]
}

So the noop column shows the number of resources it will apply if the agent moves out of noop mode.

evgeni · December 15, 2021, 10:22am

FWIW, I am not sure you should ignore skipped, rescued and ignored for Ansible, but that doesn’t make the raised callback PR invalid, as it actually enables the data to be reported and maybe processed by Foreman.

evgeni · December 15, 2021, 10:58am

And to add some more context:

ok means a task was executed, it doesn’t always mean it was uneventful:

- hosts: localhost
  gather_facts: false
  tasks:
    - name: fail
      command: "true"
      ignore_errors: true

$ ansible-playbook /tmp/lol.yml
PLAY [localhost] ***************************************************************

TASK [fail] ********************************************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

rescued means a task failed, but was in a block/rescue context, and then the rescue part triggered. I am not sure this should be counted as “changed” or not – a change was attempted, but wasn’t successful, but it surely should count towards the total number of (eventful) tasks

---
- hosts: localhost
  gather_facts: false
  tasks:
    - block:
      - name: fail
        command: "false"
      rescue:
        - name: lol
          debug:
            msg: lol

$ ansible-playbook /tmp/lol.yml
PLAY [localhost] ***************************************************************

TASK [fail] ********************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["false"], "delta": "0:00:00.001514", "end": "2021-12-15 11:56:28.332099", "msg": "non-zero return code", "rc": 1, "start": "2021-12-15 11:56:28.330585", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

TASK [lol] *********************************************************************
ok: [localhost] => {
    "msg": "lol"
}

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=1    ignored=0

ignored means a task failed but had ignore_errors: true set, thus not failing the play. if this task did a change, the changed counter is also incremented:

---
- hosts: localhost
  gather_facts: false
  tasks:
    - name: fail
      command: "false"
      ignore_errors: true

$ ansible-playbook /tmp/lol.yml
PLAY [localhost] ***************************************************************

TASK [fail] ********************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["false"], "delta": "0:00:00.001561", "end": "2021-12-15 11:54:28.704991", "msg": "non-zero return code", "rc": 1, "start": "2021-12-15 11:54:28.703430", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1

lzap · December 15, 2021, 1:48pm

OMG this is so bad: https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/strategy/init.py

So let me try to read this:

on failure the following can be increased: ok, ignored, changed, rescued, failures
on unreachable the following can be increased: skipped, dark
on succes the following can be incremented: ok, changed

I am really not sure how to map this. I am asking on Ansible IRC for now.

lzap · December 20, 2021, 9:09am

I really tried hard to find a better mapping, but I think ignoring these is the best thing to do. We will use keywords which can be also shown on the listing page: AnsibleRescued, AnsibleIgnored, AnsibleSkipped… so users will not miss on that information.

Please keep in mind this mapping is only for the three summary fields which are shown on listing pages (index page, list of last reports etc). On a detail page, we will also show the original status flags for each report type in its glory.

So this is it, unless there are any objections I introduce you the final proposal for mapping of Puppet and Ansible.

Dewar · December 20, 2021, 2:58pm

I don’t see why we don’t let the user choose which of these to show in the component, and just do a direct mapping?

“failed”,
“out_of_sync”,
“changed”,
“total”,
“skipped”,
“failed_to_restart”,
“restarted”,
“scheduled”,
“corrective_change”

Also, you could only show items that are greater than 0, and have a “Show more >>” button that that will take you to the main puppet page.

I think if Puppet users are looking for these specific values, why not allow them to choose them?

lzap · December 21, 2021, 11:36am

Isn’t better to find mapping that would work from the day one without customization? We only need this for the index page, on the detail page let’s indeed show the original values - no mapping needed.

It was articulated multiple times in this and the other thread from Puppet users than they are mainly interested in the following:

I want to know there was a failure
I want to know there was an event occured
I want to drill down and investigate from the logs what happened

Correct me if there are other use cases that should be considered.

I would love to achieve the simplest possible yet useful design.

lzap · January 10, 2022, 9:29am

So after some time digging in the Ansible codebase, trying out various tests and asking on IRC my conclusion is:

Ansible summary is not useful metric to map to anything.

I propose to ignore it and aggregate data from the list of results. Every task report has an array of result element and both have two fields: changed and failed. The presence in the array means that a task was either eventful or not. And presence of failed flag means the task failed. So here is a simple mapping:

`json.results[].result.changed`	`json.results[].failed`	Action
true	true	failed++
false	true	failed++
true	false	changed++
false	false	unchanged++

We will of course still store and how Ansible summary to the user, but this mapping makes sense to me.

Are there any objections to either Puppet or Ansible mapping? See my OP for the final proposal.

Marek_Hulan · January 11, 2022, 4:30pm

+1, there’s your mapping

ekohl · January 11, 2022, 5:05pm

Given the large thread, I’m not exactly sure what the current Puppet proposal is. Did you update the original post so is it New config report summary columns?

lzap · January 12, 2022, 6:52am

Yes, sorry I should have repeat it. I kept the original proposal updating. My plan to show this on the demo too.

Edit: There is a CHANGELOG too.

lzap · January 13, 2022, 1:07pm

For the record: New mapping implementation:

Implement final mapping by lzap · Pull Request #20 · theforeman/smart_proxy_reports · GitHub
Implement final mapping by lzap · Pull Request #14 · theforeman/foreman_host_reports · GitHub

I will show it on the demo too.