Monitoring Foreman with Prometheus

The response times from /metrics look good to me.

[root@satellite ~]# while true; do curl -w '%{time_total}\n' https://localhost/metrics -o /dev/null -k -s --noproxy localhost; done
1.253
0.941
0.948
0.952
0.933
0.942
0.926
0.932
0.959
0.983
0.946
1.184
1.305
0.964
1.645
1.598
1.940
0.933
0.929
1.946
0.964
0.941
1.960
0.958
0.948

This is from our test instance with an 8C/16GB setup. The prod instance has more power :slight_smile:

Is there any way to restrict access to /metrics (e.g. to logged-in users)?

That's a question more for the library authors; I don't think so. However, we use Apache httpd in front of Passenger/Puma (depending on the version), so you can use any Apache module, e.g. for authentication or IP filtering.
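
For illustration, a minimal sketch of an Apache 2.4 snippet that limits /metrics to a monitoring subnet. The file path and CIDR are placeholders, and on installer-managed deployments you should add it through whatever custom-fragment mechanism your setup provides, since the vhost files get regenerated:

# e.g. /etc/httpd/conf.d/foreman-metrics-acl.conf (hypothetical file name)
<Location "/metrics">
  # Only allow scrapes from the monitoring subnet; alternatively load
  # mod_auth_basic here and use "Require valid-user" for logged-in users.
  Require ip 192.0.2.0/24
</Location>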

Hey, I can confirm that.
# while true; do curl -w '%{time_total}\n' https://localhost/metrics -o /dev/null -k -s --noproxy localhost; done
26.244
27.270
27.133
27.407
25.433
26.221

My Prometheus has the scrape interval set to 10 seconds and the timeout to 10 as well.
I originally fixed this by increasing the Java heap memory and restarting services, but that worked only for about a day and then the issue was back. I thought this must have manifested for a lot of users and was expecting a fix in v2.1 :slight_smile:
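
For reference, a minimal sketch of a matching scrape job (the hostname is a placeholder, and the TLS settings assume scraping through Apache on 443 with a self-signed certificate, like the curl -k example above):

scrape_configs:
  - job_name: 'foreman'
    scrape_interval: 10s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: true   # counterpart of curl -k; point ca_file at your CA instead in production
    static_configs:
      - targets: ['foreman.example.com:443']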

We are implementing a filtering mechanism to bring the amount of metrics down to a reasonable level:

Excellent, since version 2.1.1 this issue with Prometheus timing out on the metrics scrape has been resolved. I can now monitor it without issues. I've made a simple Grafana dashboard monitoring response times for HTTP requests and facts processed, and from node exporter I'm monitoring disk, CPU, RAM and each Foreman service's uptime. I'm writing alerts now. I will publish it when done if someone finds it helpful.
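
For anyone building something similar, a rough sketch of the kind of PromQL behind a response-time panel (the metric and label names here are assumptions based on Foreman's fm_rails prefix; check your own /metrics output for the real names and types):

# Average HTTP request duration over the last 5 minutes, per controller
# (hypothetical metric and label names -- adjust to what /metrics exposes):
sum by (controller) (rate(fm_rails_http_request_total_duration_sum[5m]))
  / sum by (controller) (rate(fm_rails_http_request_total_duration_count[5m]))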

@matemikulic We would find that very helpful, thanks!

For the record, it looks like the Prometheus Ruby client library uses PIDs in temporary filenames, but Ruby web servers restart worker processes quite often. This can leave many files behind after a short period of time (days, weeks), which makes the /metrics endpoint slower and slower.

Until we fix this, I recommend using the statsd protocol; there is https://github.com/prometheus/statsd_exporter that can be used to load the data into Prometheus if needed.
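
If you want to check whether a deployment is already affected, a rough sketch (the directory is an assumption -- the Ruby client's file store keeps one .bin file per metric per worker process, so point this at wherever your app configured it):

# Count the per-process metric files; a number that keeps growing across
# worker restarts means /metrics has to read more and more files.
find /tmp/prometheus -name '*.bin' 2>/dev/null | wc -l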

Here is an attempt to fix this: https://github.com/theforeman/foreman/pull/8011

Hi @lzap

It may be a little stupid question, but I need to ask: what port should we put in the Prometheus scrape setting?

By default it's 5000, but I am not sure if a Katello deployment has something on it. Just pick one.

For the record, this will fix the temporary files problem. We are almost there. If you need it right now, just backport the patch.

Thanks, @lzap
I've tried port 5000, and the status in Prometheus is Down (Bad Request). I think I need to dive a little deeper, or it is caused by my network rules. I will come back here when there is a result.

Thanks

Hello, is it possible to share with us the JSON file of the Grafana dashboard that you made? It is based on a Prometheus datasource, right? Thank you

Hey,
I'm sorry for the delay. I've been really busy. I have planned to publish this in a repo on GitHub with the Grafana, Prometheus and Alertmanager configs. I've also decided to write a blog post for the community on how I integrated Foreman/Katello, since my infrastructure is managed by Chef, not Puppet. I've found time now to publish it on grafana.com, and I will publish the rest once I get some more time.

Here is the Grafana dashboard with node exporter and Prometheus config instructions for now: https://grafana.com/grafana/dashboards/13469

Thank you

Warning: It looks like both Passenger and Puma, our web app servers, recycle worker processes quite often. This leads to many temporary files being created in a relatively short period of time (hours on some deployments), which unfortunately makes the /metrics endpoint slower and slower, to the point that it kills the whole deployment. Only a restart of the app helps, which cleans the temporary directory completely. The Ruby client library maintainers are aware of the problem, and they said it will be challenging to implement some kind of "squash" mechanism.

I recommend using the statsd protocol instead of the native Prometheus client and using the statsd_exporter Prometheus bridge to collect the data. Or restart the app regularly to clean out the temporary data.
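
If you go for the regular restart, a minimal sketch of a nightly cron entry (the service name is an assumption -- Puma-based installs usually run the app as the foreman service, while Passenger deployments would restart httpd instead):

# /etc/cron.d/foreman-metrics-restart (hypothetical file name)
# Restart the app every night at 04:00 to clear the metrics temp files.
0 4 * * * root systemctl restart foreman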

Thank you for your reply. I am checking the statsd configuration and I am facing issues detecting some metrics. Can you please point me to a link that might help me do the proper configuration? Kindly also advise about the dashboard: will I be able to use the same dashboard used for the Prometheus metrics? I might need to change the variable names, but I need to make sure that I will be able to use the Grafana dashboard. Thank you.

In the Red Hat product we don't use Prometheus, but this should give you some overview of how to configure statsd: https://access.redhat.com/documentation/en-us/red_hat_satellite/6.8/html/monitoring_red_hat_satellite/index

The rest is different: you need to run statsd_exporter instead of mmvstatsd and scrape the data from there into Prometheus. There is also a rake task you can use to generate the statsd_exporter mapping automatically for you:

foreman-rake telemetry:prometheus_statsd
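
A minimal sketch of how the pieces could fit together (this assumes the rake task prints the mapping YAML to stdout, and uses statsd_exporter's default ports -- 9125/udp for the statsd input, 9102 for the /metrics endpoint Prometheus scrapes):

# Generate the mapping and save it somewhere statsd_exporter can read it:
foreman-rake telemetry:prometheus_statsd > /etc/statsd_exporter/mapping.yml

# Run the exporter with that mapping; point Foreman's statsd telemetry
# at localhost:9125 and Prometheus at port 9102:
statsd_exporter \
  --statsd.mapping-config=/etc/statsd_exporter/mapping.yml \
  --statsd.listen-udp=:9125 \
  --web.listen-address=:9102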

Get back to me on whether you got this working or not. If you do, please write a short tutorial and share it. We need to document this at some point because not everybody uses PCP. :slight_smile:

Here is the complete guide: Monitoring Foreman with Prometheus via statsd

Let me know if it worked for you.

Thank you for your help and advice. It's working fine now and I am able to monitor Foreman from Prometheus successfully. Thank you