Monitoring Foreman with Prometheus

The response times from /metrics look good to me.

[root@satellite ~]# while true; do curl -w '%{time_total}\n' https://localhost/metrics -o /dev/null -k -s --noproxy localhost; done
1.253
0.941
0.948
0.952
0.933
0.942
0.926
0.932
0.959
0.983
0.946
1.184
1.305
0.964
1.645
1.598
1.940
0.933
0.929
1.946
0.964
0.941
1.960
0.958
0.948

This is from our test instance with an 8C/16GB setup. The prod instance has more power :slight_smile:

Is there any way to restrict access to /metrics (e.g. to logged-in users)?

That's a question more for the library authors; I don't think so. However, we use Apache httpd in front of Passenger/Puma (depending on the version), so you can use any Apache module, e.g. for authentication or IP filtering.
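
For illustration, a minimal sketch of an Apache 2.4 snippet that limits /metrics to a monitoring subnet. The file path and CIDR are placeholders, and on installer-managed deployments you should add it through whatever custom-fragment mechanism your setup provides, since the vhost files get regenerated:

# e.g. /etc/httpd/conf.d/foreman-metrics-acl.conf (hypothetical file name)
<Location "/metrics">
  # Only allow scrapes from the monitoring subnet; alternatively load
  # mod_auth_basic here and use "Require valid-user" for logged-in users.
  Require ip 192.0.2.0/24
</Location>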

Hey, I can confirm that.
# while true; do curl -w '%{time_total}\n' https://localhost/metrics -o /dev/null -k -s --noproxy localhost; done
26.244
27.270
27.133
27.407
25.433
26.221

My Prometheus has the scrape interval set to 10 seconds and the timeout to 10 as well.
I originally fixed this by increasing the Java heap memory and restarting services, but that worked only for about a day and then the issue was back. I thought this must have manifested for a lot of users and was expecting a fix in v2.1 :slight_smile:
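
For reference, a minimal sketch of a matching scrape job (the hostname is a placeholder, and the TLS settings assume scraping through Apache on 443 with a self-signed certificate, like the curl -k example above):

scrape_configs:
  - job_name: 'foreman'
    scrape_interval: 10s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: true   # counterpart of curl -k; point ca_file at your CA instead in production
    static_configs:
      - targets: ['foreman.example.com:443']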

We are implementing a filtering mechanism to bring the amount of metrics down to a reasonable level:

Excellent, since version 2.1.1 this issue with Prometheus timing out on the metrics scrape has been resolved. I can now monitor it without issues. I've made a simple Grafana dashboard monitoring response times for HTTP requests and facts processed, and from node exporter I'm monitoring disk, CPU, RAM and each Foreman service's uptime. I'm writing alerts now. I will publish it when done if someone finds it helpful.
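
For anyone building something similar, a rough sketch of the kind of PromQL behind a response-time panel (the metric and label names here are assumptions based on Foreman's fm_rails prefix; check your own /metrics output for the real names and types):

# Average HTTP request duration over the last 5 minutes, per controller
# (hypothetical metric and label names -- adjust to what /metrics exposes):
sum by (controller) (rate(fm_rails_http_request_total_duration_sum[5m]))
  / sum by (controller) (rate(fm_rails_http_request_total_duration_count[5m]))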

@matemikulic We would find that very helpful, thanks!

For the record, it looks like the Prometheus Ruby client library uses PIDs in temporary filenames, but Ruby web servers restart worker processes quite often. This can leave many files behind after a short period of time (days, weeks), which makes the /metrics endpoint slower and slower.

Until we fix this, I recommend using the statsd protocol; there is https://github.com/prometheus/statsd_exporter that can be used to load the data into Prometheus if needed.
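
If you want to check whether a deployment is already affected, a rough sketch (the directory is an assumption -- the Ruby client's file store keeps one .bin file per metric per worker process, so point this at wherever your app configured it):

# Count the per-process metric files; a number that keeps growing across
# worker restarts means /metrics has to read more and more files.
find /tmp/prometheus -name '*.bin' 2>/dev/null | wc -l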

Here is an attempt to fix this: https://github.com/theforeman/foreman/pull/8011

Hi @lzap

It may be a little stupid question, but I need to ask: what port should we put in the Prometheus scrape setting?

By default it's 5000, but I am not sure if a Katello deployment has something on it. Just pick one.

For the record, this will fix the temporary files problem. We are almost there. If you need it right now, just backport the patch.

Thanks, @lzap
I've tried port 5000, and the status in Prometheus is Down (Bad Request). I think I need to dive a little deeper, or it is caused by my network rules. I will come back here when there is a result.

Thanks

Hello, is it possible to share with us the JSON file of the Grafana dashboard that you made? It is based on a Prometheus datasource, right? Thank you

Hey,
I'm sorry for the delay. I've been really busy. I have planned to publish this in a repo on GitHub with the Grafana, Prometheus and Alertmanager configs. I've also decided to write a blog post for the community on how I integrated Foreman/Katello, since my infrastructure is managed by Chef, not Puppet. I've found time now to publish it on grafana.com, and I will publish the rest once I get some more time.

Here is the Grafana dashboard with node exporter and Prometheus config instructions for now: https://grafana.com/grafana/dashboards/13469

Thank you

Warning: It looks like both Passenger and Puma, our web app servers, recycle worker processes quite often. This leads to many temporary files being created in a relatively short period of time (hours on some deployments), which unfortunately makes the /metrics endpoint slower and slower, to the point that it kills the whole deployment. Only a restart of the app helps, which cleans the temporary directory completely. The Ruby client library maintainers are aware of the problem, and they said it will be challenging to implement some kind of "squash" mechanism.

I recommend using the statsd protocol instead of the native Prometheus client and using the statsd_exporter Prometheus bridge to collect the data. Or restart the app regularly to clean out the temporary data.
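
If you go for the regular restart, a minimal sketch of a nightly cron entry (the service name is an assumption -- Puma-based installs usually run the app as the foreman service, while Passenger deployments would restart httpd instead):

# /etc/cron.d/foreman-metrics-restart (hypothetical file name)
# Restart the app every night at 04:00 to clear the metrics temp files.
0 4 * * * root systemctl restart foreman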

Thank you for your reply. I am checking the statsd configuration and I am facing issues detecting some metrics. Can you please point me to a link that might help me do the proper configuration? Kindly also advise about the dashboard: will I be able to use the same dashboard used for the Prometheus metrics? I might need to change the variable names, but I need to make sure that I will be able to use the Grafana dashboard. Thank you.

In the Red Hat product we don't use Prometheus, but this should give you some overview of how to configure statsd: https://access.redhat.com/documentation/en-us/red_hat_satellite/6.8/html/monitoring_red_hat_satellite/index

The rest is different: you need to run statsd_exporter instead of mmvstatsd and scrape the data from there into Prometheus. There is also a rake task you can use to generate the statsd_exporter mapping automatically for you:

foreman-rake telemetry:prometheus_statsd
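
A minimal sketch of how the pieces could fit together (this assumes the rake task prints the mapping YAML to stdout, and uses statsd_exporter's default ports -- 9125/udp for the statsd input, 9102 for the /metrics endpoint Prometheus scrapes):

# Generate the mapping and save it somewhere statsd_exporter can read it:
foreman-rake telemetry:prometheus_statsd > /etc/statsd_exporter/mapping.yml

# Run the exporter with that mapping; point Foreman's statsd telemetry
# at localhost:9125 and Prometheus at port 9102:
statsd_exporter \
  --statsd.mapping-config=/etc/statsd_exporter/mapping.yml \
  --statsd.listen-udp=:9125 \
  --web.listen-address=:9102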

Get back to me on whether you got this working or not. If you do, please write a short tutorial and share it. We need to document this at some point because not everybody uses PCP. :slight_smile:

Here is the complete guide: Monitoring Foreman with Prometheus via statsd

Let me know if it worked for you.

Thank you for your help and advice. It's working fine now and I am able to monitor Foreman from Prometheus successfully. Thank you