New config report summary columns

Hey,

we are working on a new config report format. There was a great discussion on the mapping of status/metric from Puppet and Ansible, I think this deserves its own thread. This page is a wiki and I will be updating it to fine-tune according to the proposals.

Summary

These fields are stored in reports table as integer column and represent the following:

  • changed - number or changes (events) occured in the report
  • unchanged - number of resources/actions that are aligned with configuration and no action was taken
  • failed - number of failures in the report

To retrieve eventful reports, the query is: changed > 0 OR failed > 0, the other field should not be taken into consideration. We can optionally set HasChange keyword but probably not worth it since we can easily query the data right away from the summary columns. The other field is rather informative as multiple statuses are mapped to this field. There is a proposal perhaps to remove this column completely.

Unfortunately Ruby on Rails have changed as a reserved word therefore the names in the DB and API must be:

  • change
  • nochange
  • failure

The plan is to map these to one or more statuses (or metrics) of various implementations. Here is the mapping proposal, of course the mapping is additive (when multiple states map to a single summary the sum is added).

Note: In the original proposal, there was a summary field ā€œotherā€ but it does not proven to be useful as keywords can serve pretty well to recognize the special cases.

Puppet

Summary

I am marking root of the report as yaml so it is clear where to find the data: smart_proxy_reports/puppet6-foreman-deb.yaml at develop Ā· theforeman/smart_proxy_reports Ā· GitHub

  • yaml.metrics.events.failure ā†’ failed
  • yaml.metrics.events.success ā†’ changed
  • yaml.metrics.resources.total - events.failure - events.success ā†’ unchanged

Pupet report also have yaml.metrics.resources but these are only stored in the JSON body field and can be optionally rendered to UI/CLI when report is opened. Comment in the UX thread if you want to discuss where/when/how to present these metrics on a page. Note for performance reasons, this cannot be presented on reports index page as the app would need to open and parse all records field body which is a JSON (blob/text).

Keywords

  • yaml.status=changed ā†’ PuppetStatusChanged
  • yaml.status=unchanged ā†’ PuppetStatusUnchanged
  • yaml.status=failed ā†’ PuppetStatusFailed
  • yaml.noop=true ā†’ PuppetNoop
  • yaml.metrics.resources.corrective_change > 0 ā†’ PuppetCorrectiveChange
  • yaml.metrics.resources.skipped > 0 ā†’ PuppetSkipped
  • yaml.metrics.resources.restarted > 0 ā†’ PuppetRestarted
  • yaml.metrics.resources.failed_to_restart > 0 ā†’ PuppetFailedToRestart
  • yaml.metrics.resources.scheduled > 0 ā†’ PuppetScheduled
  • yaml.metrics.resources.out_of_sync > 0 ā†’ PuppetOutOfSync

Keywords allow searching for particular reports by individual resources, but it is not practical to break down all resources into keywords this would create too many records. The proposal is:

  • yaml.resource_statuses.RESOURCE[yyz].failed = true ā†’ PuppetFailed:RESOURCE[xyz]

Let us know if you need to search for other resources states, but letā€™s keep this as minimum as possible to keep reports fast.

Ansible

Summary

To differentiate the fields, json prefix is the root of a report. Example report: Send report summary unchanged by lzap Ā· Pull Request #1325 Ā· theforeman/foreman-ansible-modules Ā· GitHub

Because the final summary is not very well defined by Ansible (nothing in the documentation, code is unclear, our testing did show various results) and because we want consistent mapping, letā€™s ignore summary and aggregate data from the list of results:

json.results[].result.changed json.results[].failed Action
true true failed++
false true failed++
true false changed++
false false unchanged++

Previously there was a json.status field but it did not report all possible Ansible states so new field named json.summary will be sent: Send report summary unchanged by lzap Ā· Pull Request #1325 Ā· theforeman/foreman-ansible-modules Ā· GitHub

Keywords

Keywords, on the other hand, are mapped to summary fields. This allows searching capability.

  • json.summary.changed > 0 ā†’ AnsibleChanged
  • json.summary.failures > 0 ā†’ AnsibleFailures
  • json.summary.unreachable > 0 ā†’ AnsibleUnreachable
  • json.summary.rescued > 0 ā†’ AnsibleRescued
  • json.summary.ignored > 0 ā†’ AnsibleIgnored
  • json.summary.skipped > 0 ā†’ AnsibleSkipped

Status

Overall easy to understand status calculated from summary fields:

  • Changes - YELLOW: when changed > 0 and failed = 0
  • Failures - RED: when failed > 0
  • No changes - GREEN: unchanged > 0 and failed = 0
  • Empty - GREY: when everything is 0

The colors do not necessary represent colors that needs to be used in the UI, these are proposals about how things are important with RED obviously mearning ā€œattentionā€, YELLOW meaning ā€œsomething interesting happenedā€ and GREEN with ā€œeverying in orderā€.

CHANGELOG

  • renamed applied to changed (not sure why I started with applied but both in Puppet and Ansible the statuses are both named ā€œchangedā€)
  • renamed ā€œpendingā€ to ā€œunchangedā€
  • keeping other summary field for now but we might consider dropping it alltogether
  • defined what is eventful report
  • corrected ansible metrics: Send report summary unchanged by lzap Ā· Pull Request #1325 Ā· theforeman/foreman-ansible-modules Ā· GitHub
  • updated overall status to be more simple (just four states)
  • dropped other summary field
  • puppet mapping is now based on events
  • puppet kewords redefined
  • ansible ignores rescued/ignored/skipped
  • added ansible keyword mapping
  • added AnsibleUnreachable
  • added Ansible mapping from change/failure instead of summary fields
  • added note about Rails name problem (changed)
  • total events is only available in metrics.resources so changing to: yaml.metrics.resources.total - events.failure - events.success ā†’ unchanged
3 Likes

I can agree to all column mappings.

Maybe the synthetic ā€œeventfulā€ should also include ā€œother > 0ā€, since weā€™re mapping ā€œskippedā€ to ā€œotherā€? (Another solution might be putting skips in the ā€œfailedā€ column, which would preserve the definition for eventful that you proposed.) I think the most normal case of skips (resource failures and dependency chains) WOULD be caught by the existing mapping. I donā€™t recall offhand how things that are skipped due to schedule would be reported, but if those are all in other, the proposed mapping works.

1 Like

In this post I already linked to the relevant reference documentation.

What you are mentioning are all the individual metrics, but I think youā€™re missing some of the higher level this that make it much easier to parse.

First of all, at the top level thereā€™s a field status which is an enum (failed, changed, or unchanged).

I really think thatā€™s really useful since you donā€™t need to parse anything. The query becomes very quick. You can essentially use if (status != 'unchanged') { set_keyword('HasChange') }.

Then for the table, at first I thought you were referring to Puppet::Resource::Status in the report. However, later I looked at some more details and now I wonder if you are actually looking at this:

For that itā€™s odd that the reference documentation states:

Includes the metrics failed, out_of_sync, changed, and total. Each value in the resources category is an integer.

So please clarify what you are parsing.

I also think corrective_change is already captured in changed. Some background: a corrective_change says is that the change happened while the catalog didnā€™t change while a non-corrective change is when the catalog changes. The catalog changes (for example) when a classes is added/removed or some variable changes. In other words: the admin changed the expectations on the system.

So a corrective_change means that is something on the system wasnā€™t in line with the intended profile. However, a change is already a change. This would be good to test. Iā€™ll see about writing up some tests.

If you do want to introduce a new feature, this could be highlighted. However, you also have it at the report level (so no need to get it out of the metrics). So you can set a keyword CorrectiveChange on the report itself so you can easily search for it. I think the log lines themselves already contain (corrective change) so thereā€™s no need to do anything on the individual events.

unchanged is not eventful so it should be changed > 0 OR failed > 0.

I also did some further reading on Puppet and opened:

https://tickets.puppetlabs.com/browse/DOCUMENT-1286
https://tickets.puppetlabs.com/browse/DOCUMENT-1287

I have a hunch your current suggestion may be counting things twice.

Because I was uncertain about the various cases, I started this:
https://github.com/theforeman/smart_proxy_reports/pull/17
The goal is to produce relevant reports. Since itā€™s source based, you can easily swap to a different Puppet version to compare. You can also easily modify it to send reports to your real Foreman and analyze.

If people have more examples, Iā€™d welcome them.

This makes me think that the easiest way to summarize a Puppet report is to look at metrics.events. There you have:

  • total: total number of events
  • failure: number of failed resources
  • success: number of changes
  • (optionally) noop: number of noop resources

So if you want a summary, the mapping becomes:

  • success ā†’ changed
  • failed ā†’ failed
  • noop ā†’ other

What would really help this discussion if we actually had a reference to the data model where weā€™re trying to put data into.

Then we can apply keywords:

  • HasChange if status != ā€˜unchangedā€™
  • NoOp if noop

It should also be marked as a failure if transaction_completed is false.

I have looked into this, the problem is that with tags you can easily create many skipped resources and these must not count as failures as you intentionally skipped them. That was the reason to go for other.

Ah I was so hard-focused on metrics that I missed these completely.

Ok, makes sense then to do the following:

  • status=changed ā†’ keyword:PuppetStatusChanged
  • status=unchanged ā†’ keyword:PuppetStatusUnchanged
  • status=failed ā†’ keyword:PuppetStatu Failed

Added, this is quite flexible yeah.

I am looking at a specific example as I prefer real-world examples rather than spec docs:

Very good point, I think I am not utilizing keywords enough in the new design. It allows much flexible searching that is possible today. Added.

Good catch, this was indeed an overlook.

Oh this looks promising and it nicely maps. I did not see that, again, I was so narrow-focused.

I dont quite get this one, noop is a boolean in Puppet YAML, other summary field is an int.

To me it looks like we should probably drop the other summary field completely as we can use keywords for other interesting state presence.

What could be interesting is a setting that would be a list of keywords you want to show on the index page. There can be eventually many keywords (all failed resources) so it is not practical to try to put them all on the page, however, if user would specify which are interesting for them, letā€™s say up to 3 keywords it could be pretty easy to show them. I would imagine that interesting keywords would include something like noop or corrective change.

Ok I am going to update the proposal now, I think we are getting there.

Iā€™ve updated the OP. See the CHANGELOG at the botton or use diff feature to see the changes.

I would appreciate a final review and also I would appreciate someone who is using Ansible, Chef or Salt to chip in.

noop should be a count. It is the number of resources that it would try to apply.

So from this example:

If we quickly retrieve the metrics we see (using yq .metrics.events 8/*.yaml):

{
  "name": "events",
  "label": "Events",
  "values": [
    [
      "total",
      "Total",
      4
    ],
    [
      "failure",
      "Failure",
      0
    ],
    [
      "success",
      "Success",
      0
    ],
    [
      "noop",
      "Noop",
      4
    ]
  ]
}

There is also another case which I havenā€™t covered in that PR, but thatā€™s where the agent is in noop mode while in sync (fixup! Add a script to collect various Puppet reports Ā· theforeman/smart_proxy_reports@a464425 Ā· GitHub).

In that case it reports 0 changes and doesnā€™t report the noop event:

{
  "name": "events",
  "label": "Events",
  "values": [
    [
      "total",
      "Total",
      0
    ],
    [
      "failure",
      "Failure",
      0
    ],
    [
      "success",
      "Success",
      0
    ]
  ]
}

So the noop column shows the number of resources it will apply if the agent moves out of noop mode.

FWIW, I am not sure you should ignore skipped, rescued and ignored for Ansible, but that doesnā€™t make the raised callback PR invalid, as it actually enables the data to be reported and maybe processed by Foreman.

And to add some more context:

  • ok means a task was executed, it doesnā€™t always mean it was uneventful:
- hosts: localhost
  gather_facts: false
  tasks:
    - name: fail
      command: "true"
      ignore_errors: true
$ ansible-playbook /tmp/lol.yml
PLAY [localhost] ***************************************************************

TASK [fail] ********************************************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
  • rescued means a task failed, but was in a block/rescue context, and then the rescue part triggered. I am not sure this should be counted as ā€œchangedā€ or not ā€“ a change was attempted, but wasnā€™t successful, but it surely should count towards the total number of (eventful) tasks
---
- hosts: localhost
  gather_facts: false
  tasks:
    - block:
      - name: fail
        command: "false"
      rescue:
        - name: lol
          debug:
            msg: lol
$ ansible-playbook /tmp/lol.yml
PLAY [localhost] ***************************************************************

TASK [fail] ********************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["false"], "delta": "0:00:00.001514", "end": "2021-12-15 11:56:28.332099", "msg": "non-zero return code", "rc": 1, "start": "2021-12-15 11:56:28.330585", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

TASK [lol] *********************************************************************
ok: [localhost] => {
    "msg": "lol"
}

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=1    ignored=0   

  • ignored means a task failed but had ignore_errors: true set, thus not failing the play. if this task did a change, the changed counter is also incremented:
---
- hosts: localhost
  gather_facts: false
  tasks:
    - name: fail
      command: "false"
      ignore_errors: true
$ ansible-playbook /tmp/lol.yml
PLAY [localhost] ***************************************************************

TASK [fail] ********************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["false"], "delta": "0:00:00.001561", "end": "2021-12-15 11:54:28.704991", "msg": "non-zero return code", "rc": 1, "start": "2021-12-15 11:54:28.703430", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1   
1 Like

OMG this is so bad: https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/strategy/init.py

So let me try to read this:

  • on failure the following can be increased: ok, ignored, changed, rescued, failures
  • on unreachable the following can be increased: skipped, dark
  • on succes the following can be incremented: ok, changed

I am really not sure how to map this. I am asking on Ansible IRC for now.

I really tried hard to find a better mapping, but I think ignoring these is the best thing to do. We will use keywords which can be also shown on the listing page: AnsibleRescued, AnsibleIgnored, AnsibleSkippedā€¦ so users will not miss on that information.

Please keep in mind this mapping is only for the three summary fields which are shown on listing pages (index page, list of last reports etc). On a detail page, we will also show the original status flags for each report type in its glory.

So this is it, unless there are any objections I introduce you the final proposal for mapping of Puppet and Ansible.

I donā€™t see why we donā€™t let the user choose which of these to show in the component, and just do a direct mapping?

ā€œfailedā€,
ā€œout_of_syncā€,
ā€œchangedā€,
ā€œtotalā€,
ā€œskippedā€,
ā€œfailed_to_restartā€,
ā€œrestartedā€,
ā€œscheduledā€,
ā€œcorrective_changeā€

Also, you could only show items that are greater than 0, and have a ā€œShow more >>ā€ button that that will take you to the main puppet page.

I think if Puppet users are looking for these specific values, why not allow them to choose them?

Isnā€™t better to find mapping that would work from the day one without customization? We only need this for the index page, on the detail page letā€™s indeed show the original values - no mapping needed.

It was articulated multiple times in this and the other thread from Puppet users than they are mainly interested in the following:

  • I want to know there was a failure
  • I want to know there was an event occured
  • I want to drill down and investigate from the logs what happened

Correct me if there are other use cases that should be considered.

I would love to achieve the simplest possible yet useful design.

So after some time digging in the Ansible codebase, trying out various tests and asking on IRC my conclusion is:

Ansible summary is not useful metric to map to anything.

I propose to ignore it and aggregate data from the list of results. Every task report has an array of result element and both have two fields: changed and failed. The presence in the array means that a task was either eventful or not. And presence of failed flag means the task failed. So here is a simple mapping:

json.results[].result.changed json.results[].failed Action
true true failed++
false true failed++
true false changed++
false false unchanged++

We will of course still store and how Ansible summary to the user, but this mapping makes sense to me.

Are there any objections to either Puppet or Ansible mapping? See my OP for the final proposal.

2 Likes

+1, thereā€™s your mapping :slight_smile:

Given the large thread, Iā€™m not exactly sure what the current Puppet proposal is. Did you update the original post so is it New config report summary columns?

Yes, sorry I should have repeat it. I kept the original proposal updating. My plan to show this on the demo too.

Edit: There is a CHANGELOG too.

For the record: New mapping implementation:

I will show it on the demo too.