Foreman telemetry API for developers

Hey,

a quick reminder: we have a pretty simple telemetry API in Foreman core that can be used to export valuable monitoring data. It is turned off by default, but users or developers can easily turn it on in settings.yaml. It supports three outputs:

  • logging - simply prints all telemetry into the Rails log (not meant for production)
  • prometheus - metrics are gathered in memory, aggregated and presented via the /metrics URL
  • statsd - metrics are sent over the statsd protocol to a statsd server for aggregation

The best output currently is statsd because it is lightweight and works reliably. I suggest running some kind of statsd daemon on localhost so statsd packets won’t get lost along the way. The Prometheus Ruby client library has some scaling issues: when processes are restarted, it leaves temporary files behind, causing the /metrics endpoint to get slower and slower. I’ve sent a patch recently to solve this; hopefully it can be fixed soon. I do not recommend using the Prometheus output yet.

Since Prometheus is very popular, what you can do is run statsd_exporter, which can take statsd measurements, aggregate them and export them for Prometheus in a reliable way. There is a rake task telemetry:prometheus_statsd that can generate the configuration for statsd_exporter, so it is very easy to set up.

How does it work

There are three metric types (a short combined example follows the list):

  • counter - simply a number that is incremented (or, occasionally, decremented) over time. Typically used for things like the number of requests processed or the number of failed logins.
  • gauge - simply a float, a number that can change over time. Typical use: CPU temperature, number of entries in a queue etc.
  • histogram - simply a duration: how long something took. It has this “weird” name because the data structure used to store these durations is a histogram, for practical reasons. Typical use: time spent in the db, view or controller.
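
To make the three types concrete, here is a minimal sketch of recording one metric of each type with the helpers described below (the metric names are made up for illustration and would have to be registered first):

# counter: count every processed request (hypothetical metric name)
telemetry_increment_counter(:requests_processed)

# gauge: record the current queue length (hypothetical metric name)
telemetry_set_gauge(:queue_size, queue.size)

# histogram: record how long a fact import took, in milliseconds by default
telemetry_duration_histogram(:facts_import_duration) do
  import_facts
end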

How to use telemetry API

It is very simple. First, register a new metric, either in config/initializers/5_telemetry.rb (for core) or via the plugin API:

add_counter_telemetry(:new_discovered_hosts, 'Number of requests processed as new discovered hosts')
add_counter_telemetry(:updated_discovered_hosts, 'Number of requests processed as discovered updates')
add_counter_telemetry(:failed_discovered_hosts, 'Number of failed discovery or fact update requests')
add_histogram_telemetry(:discovery_request_duration, 'Time spent in request to discovered node (ms)', [:method])
add_counter_telemetry(:discovery_failed_requests, 'Number of discovery node requests failed')

To increase a counter:

# increment by one
telemetry_increment_counter(:my_metric)
# decrement by five
telemetry_increment_counter(:my_metric, -5)

Each metric can optionally have labels, essentially a Ruby hash with additional data:

telemetry_increment_counter(:my_metric, 1, controller: controller_name, action: action_name)

This way, data can be broken down per label, like controller/action, model class, table name, queue name etc.
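
For example, incrementing the same counter with different label values keeps the measurements separate, so they can later be summed or filtered per controller/action (the label values here are just illustrative):

# each distinct label combination becomes its own series in the backend
telemetry_increment_counter(:my_metric, 1, controller: 'hosts', action: 'index')
telemetry_increment_counter(:my_metric, 1, controller: 'hosts', action: 'create')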

To set a gauge:

telemetry_set_gauge(:queue_size, queue.size)

To record the duration of something, use either the block or the non-block variant:

telemetry_duration_histogram(:importer_facts_import_duration) do
  # do something possibly slow
end

Duration is measured in milliseconds by default; it is possible to change the resolution via the second argument. Make sure to describe it in the metric description when registering the metric if the duration is in something other than milliseconds.

Labels can also be used for gauges and histograms; just provide them as the third argument (a hash).
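
A short sketch of both, reusing the discovery_request_duration metric registered above with its :method label. Because the labels hash is the third argument, a resolution value has to be passed as the second one; :ms is my assumption for the default millisecond resolution, so treat it as illustrative:

# gauge with a label (hypothetical queue name)
telemetry_set_gauge(:queue_size, queue.size, queue: 'default')

# histogram with a label; second argument is the resolution (assumed :ms = milliseconds)
telemetry_duration_histogram(:discovery_request_duration, :ms, method: 'get') do
  # perform the request to the discovered node
end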

Labels filtering

Because some metrics can add a ton of measurements, labels can optionally be filtered. Currently only labels named controller and class are filtered; if you use any other name for your label, it will not be filtered out.

Prometheus scaling issues

As I said above, I do not currently recommend using the Prometheus output because the Ruby client library does not handle process restarts well. Aggregation must be done in the master Puma process, and this is done via temporary files. It is fast (actually faster than memcache or a DB); however, there is no squashing implemented yet. If you want more details, head over to the project site, or to my patch that aims to solve it:

https://github.com/prometheus/client_ruby/pull/235

Production deployment

At the moment, the only reliable way is to configure statsd_exporter and scrape the data into Prometheus. Or, if you run any kind of monitoring system that supports statsd, just export the data there.

If you have Foreman running on EL8, there is now a native statsd agent for the PCP monitoring framework, which is the recommended and supported monitoring tool from Red Hat. In that case, follow this tutorial to set it up for long-term production use:

Questions

Let me know if you have any questions or concerns.
