Monitoring Foreman with Prometheus

Hello,

prometheus-client rubygem 1.0 has finally been released. It fixes one of the major pain points, supporting multiple instances, by sharing metric storage in, surprisingly, a simple file. I’ve been in touch with the developers; if you want more details, there is an exciting talk available from RubyConf 2019:

I’ve just bumped the RPM dependency we carry in our repos; for Debian there is nothing to wait for:

To start monitoring Foreman via Prometheus, simply install the packages:

yum -y install foreman-telemetry prometheus-client

And enable it in settings.yaml:

:telemetry:
  :prefix: 'fm_rails'
  :statsd:
    :enabled: false
    :host: '127.0.0.1:8125'
    :protocol: 'statsd'
  :prometheus:
    :enabled: true
  :logger:
    :enabled: false
    :level: 'INFO'

Then configure Prometheus to scrape the /metrics endpoint. The client library now works under Passenger or any other forking server and reports the correct numbers.
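Before pointing Prometheus at the endpoint, it can help to confirm that it really returns Foreman metrics. Here is a minimal sketch that parses the text exposition format and keeps only samples with the fm_rails prefix from the settings above. The function names and the default URL are my assumptions, not part of Foreman, and the parser is deliberately simplified.

```python
import re
from urllib.request import urlopen

# One sample line of the text exposition format: name{labels} value
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text, prefix="fm_rails"):
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE.match(line)
        if match and match.group(1).startswith(prefix):
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


def fetch_metrics(url="http://localhost/metrics"):
    """Download the raw exposition text. The URL is an assumption; adjust it to your host."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling parse_metrics(fetch_metrics()) on the Foreman host should return a non-empty list once telemetry is enabled. An empty list or an HTTP error means the endpoint is not serving metrics yet.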

Warning: It looks like both Passenger and Puma, our web app servers, recycle worker processes quite often. This creates many temporary files in a relatively short period of time (hours on some deployments), which unfortunately makes the /metrics endpoint slower and slower, to the point where it kills the whole deployment. Only restarting the app helps, because that cleans the temporary directory completely. The Ruby client library maintainers are aware of the problem and said it will be challenging to implement some kind of “squash” mechanism.
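One way to notice this before it hurts is to watch how many files pile up in the client library's temporary directory. This is a rough sketch only. The directory path depends on how the file store is configured on your system, so it is passed in as a parameter, and the threshold of 1000 files is an arbitrary assumption.

```python
import os


def count_store_files(directory):
    """Count regular files directly inside directory (no recursion)."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file())


def store_looks_bloated(directory, max_files=1000):
    """Return True when the directory holds more files than max_files (hypothetical limit)."""
    return count_store_files(directory) > max_files
```

Running this from cron and alerting when store_looks_bloated returns True gives some warning before the endpoint becomes too slow to scrape.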

Edit 2020: I recommend using statsd export instead of the native Prometheus support, with the statsd_exporter Prometheus bridge collecting the data. See: Monitoring Foreman with Prometheus via statsd

The client library RPM update will be part of the 1.24.1 minor update. Follow this guide to enable Prometheus and scrape telemetry data from the Foreman application, starting with version 1.24.1! Thanks @tbrisker for the extra effort.

Thanks!
I’m testing this on Satellite 6(.7 snap) and it seems to work great. Anyway, is there any plan to extend this with katello-related metrics?
I think scraping some katello instance counts via fm_rails_activerecord_instances might be really helpful in regards to performance monitoring.

Great job!

Telemetry does work with all the plugins, including the Katello ones. For example, I see metrics labeled “class=Katello::Subscription”, among others.
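To pull just the Katello numbers out of a scrape, something like the sketch below could work. It assumes the samples look like fm_rails_activerecord_instances{class="Katello::Subscription"} 12 in the /metrics output. The function name and the simplified label parsing are mine, so treat it as an illustration rather than a full parser.

```python
import re
from collections import defaultdict

# Sample line of the metric discussed above, with its label set and value
LINE = re.compile(r'^fm_rails_activerecord_instances\{([^}]*)\}\s+(\S+)')
CLASS_LABEL = re.compile(r'(?:^|,)\s*class="([^"]*)"')


def instances_by_class(text, namespace="Katello::"):
    """Sum sample values per class label, keeping only classes in the given namespace."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        label = CLASS_LABEL.search(match.group(1))
        if label and label.group(1).startswith(namespace):
            totals[label.group(1)] += float(match.group(2))
    return dict(totals)
```

Feeding it the raw text of a /metrics response returns a dictionary keyed by class name, which is easy to log or compare between scrapes.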

Here is an example dashboard for Grafana, using the official Prometheus data source:

https://lzap.fedorapeople.org/projects/foreman-monitoring/Foreman-1.24-1578398134479.json.gz

@rplevka I no longer have the setup. Can you take some screenshots and post them here if it works for you?

Hey, tnx for this.
I’ve set this up and it works on Foreman main server. Running Foreman 2.0/Katello 3.15. Is there a way to do this on smart proxies? I’m not seeing telemetry settings in /etc/foreman-proxy/settings.yml on smart proxy servers.

Also, any good grafana dashboards you could recommend?
tnx :slight_smile:

I’ve found this for monitoring smart proxies, but I see it’s only for Icinga 2 https://github.com/theforeman/smart_proxy_monitoring
Anything for Prometheus that I’m missing?

Hey, unfortunately this was not yet implemented for smart proxy. There is no telemetry available there.

Also, there are currently no dashboards. We are looking into the Grafana that ships with the RHEL 8.x series; it has been greatly improved and there is a new PCP source as well. We will likely ship a dashboard at some point in the future. For now, create your own and please share it with us.

The dashboard linked above is from RHEL7 Grafana, a very old version, and I do not recommend it because the PCP integration does not work well.

Oki, tnx. I’ll make my own then and open source it. I’m thinking then in addition to foreman prometheus metrics I’ll combine it with node exporter metrics, explicitly systemd metrics so at least alerts work on smart proxies if a service on smart proxy fails. I’ll update this thread with links when it’s done.

Cool, we would appreciate it if the implementation were the same as, or similar to, what we have in Foreman core. It is a small facade with two implementations: statsd and prometheus.

I would also be happy to accept a patch to smart_proxy_monitoring, if you want to add an additional provider. It was always meant to be open for other monitoring tools.

Of course, it is also fine if you want to provide a separate feature if the goal is too different. smart_proxy_monitoring and foreman_monitoring are meant to integrate monitoring data into Foreman and to automate the monitoring solution’s configuration via the Smart Proxy.

Monitoring similar to core is on our TODO list; we would like to get that merged into the smart-proxy core repo.

You can actually extract the implementation from core into a gem or even copy and paste it - it is very small.

I tried to activate telemetry for Prometheus: I installed the two packages (foreman-telemetry prometheus-client), changed the config (enabled prometheus) and restarted all services. Unfortunately I cannot access http://satellite/metrics: “The page you were looking for doesn’t exist.”

Did I miss something?

Tested on Satellite 6.7.1

Hello,

Did you see an “Unable to initialize XYZ telemetry” warning message in production.log after start? That’s what Foreman logs if there is a missing dependency.

I have just tested this on my 6.8 (alpha build) instance and it works fine. Note that you need the prometheus client library 1.0 or newer in order for this to work. This is not in 6.7 yet if I remember correctly.

Hi,

No sign of that message:

[root@satellite ~]# grep -i "Unable to initialize XYZ telemetry" /var/log/foreman/production.log
[root@satellite ~]# curl --noproxy localhost http://localhost/metrics -I
HTTP/1.1 404 Not Found
Date: Mon, 15 Jun 2020 13:27:00 GMT
Server: Apache
X-Request-Id: a4745b08-f072-493f-a9df-86f1a0e1623f
X-Runtime: 0.009050
X-Frame-Options: sameorigin
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Download-Options: noopen
X-Permitted-Cross-Domain-Policies: none
Content-Security-Policy: default-src 'self'; child-src 'self'; connect-src 'self' ws: wss:; img-src 'self' data: *.gravatar.com; script-src 'unsafe-eval' 'unsafe-inline' 'self'; style-src 'unsafe-inline' 'self'
X-Powered-By: Phusion Passenger 4.0.18
Content-Length: 1564
Status: 404 Not Found
Content-Type: text/html; charset=utf-8

[root@satellite ~]# grep -i "Unable to initialize XYZ telemetry" /var/log/foreman/production.log
[root@satellite ~]#

Prometheus client library is installed with version 1.0.0:

[root@satellite ~]# yum history info 304
Loaded plugins: foreman-protector, langpacks, product-id, search-disabled-repos, subscription-manager
Transaction ID : 304
Begin time     : Wed Jun 10 15:23:01 2020
Begin rpmdb    : 1382:21b696568a1413edeca009836fc00b7d1e18b477
End time       :            15:23:03 2020 (2 seconds)
End rpmdb      : 1385:7a4733db7491ad689f920f1c0d0b6239550e56c6
User           : ***
Return-Code    : Success
Command Line   : install foreman-telemetry prometheus-client
Transaction performed with:
    Installed     rpm-4.11.3-43.el7.x86_64                    @rhel-7-server-rpms
    Installed     subscription-manager-1.24.26-3.el7_8.x86_64 @rhel-7-server-rpms
    Installed     yum-3.4.3-167.el7.noarch                    @rhel-7-server-rpms
    Installed     yum-metadata-parser-1.1.4-10.el7.x86_64     @anaconda/7.3
Packages Altered:
    Install     foreman-telemetry-1.24.1.21-1.el7sat.noarch         @rhel-7-server-satellite-6.7-rpms
    Dep-Install tfm-rubygem-prometheus-client-1.0.0-1.el7sat.noarch @rhel-7-server-satellite-6.7-rpms
    Dep-Install tfm-rubygem-quantile-0.2.0-3.el7sat.noarch          @rhel-7-server-satellite-6.7-rpms

Grep for just “Unable to initialize”, because XYZ is a placeholder; I could not remember what is actually set there. :slight_smile:

Ah, sorry… blindly copied your line for my answer. But I also grepped for single words (“unable”, “telemetry”,…) and didn’t find any suspicious message.

When should I expect to see this message? After loading /metrics or after reloading/restarting foreman/httpd?

Sorry, my bad :frowning:

Sometimes you overlook the obvious. The prometheus option (settings.yaml) was set to false … thanks Puppet. After (re-)activating the option and restarting the services it now works.
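Since configuration management can silently flip this option back, a quick check of the settings file can save some debugging. Below is a naive, line-based sketch that assumes the same layout as the settings.yaml snippet at the top of this thread. It is not a real YAML parser, and the function name is made up for illustration.

```python
import re


def prometheus_enabled(settings_text):
    """Return True/False for the :enabled: key under :prometheus:, or None if not found."""
    in_prometheus = False
    section_indent = 0
    for raw in settings_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        indent = len(raw) - len(raw.lstrip())
        if line == ":prometheus:":
            in_prometheus, section_indent = True, indent
            continue
        if in_prometheus:
            if indent <= section_indent:  # indentation dropped back: left the block
                in_prometheus = False
                continue
            match = re.match(r":enabled:\s*(\S+)", line)
            if match:
                return match.group(1).lower() == "true"
    return None
```

Pass it the contents of your settings.yaml. A result of False or None would explain a 404 on /metrics like the one above.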

Great, get back to us on how it works. We have one report from a user that the endpoint takes 1 minute to process; we don’t see that behavior, so be careful and monitor the monitoring :slight_smile:
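For "monitoring the monitoring", a tiny timing wrapper around the scrape is enough to start with. This is a sketch under assumptions: the 10 second threshold and the default URL are arbitrary choices of mine, not Foreman defaults.

```python
import time
from urllib.request import urlopen


def time_scrape(fetch, warn_after=10.0):
    """Call fetch() once and return (elapsed_seconds, too_slow)."""
    start = time.monotonic()
    fetch()
    elapsed = time.monotonic() - start
    return elapsed, elapsed > warn_after


def fetch_metrics(url="http://localhost/metrics"):
    """Download the endpoint body. The URL is an assumption; adjust it to your host."""
    with urlopen(url, timeout=120) as response:
        return response.read()
```

Calling time_scrape(fetch_metrics) periodically and logging the elapsed time should show whether the endpoint is getting slower over the lifetime of the app.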