Foreman instrumenting analysis

lzap · October 31, 2017, 7:33pm

Hello,

I am looking for an application instrumentation protocol for the Foreman
Rails application that fulfills the following requirements:

The protocol must work with a multi-process server like Passenger.
The protocol can be easily integrated into Foreman Tasks and Smart Proxy.
The protocol or agent must support aggregation of time-based data
(quantiles, averages).
The protocol must integrate with the top three open-source monitoring frameworks.

Let me summarize my findings so far. I am looking for advice or
comments on this topic. I have already worked on some prototypes, but
before I commit to a final solution, I want to be sure I am not
missing something I don't know about.

Before you send comments, please keep in mind that I am not searching
for a monitoring solution to integrate with. I want an application
instrumentation library (or protocol) that can export measurements
(or telemetry data, if you like) from Rails, such as the number of
requests processed, SQL queries, time spent in the database or view,
and time spent rendering a template or calling a backend system.

Prometheus

A flexible text-based protocol (alternatively protobuf) with HTTP
REST-like communication. It was designed to be pull-based, meaning
that an agent makes HTTP calls to the web application, which holds all
metrics in memory until they are collected. It was built for the
Prometheus monitoring framework (Apache-licensed), initially created by
SoundCloud. The server and most agents are written in Go; the server can
run without an external database or export into third-party storage
backends.
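
To make the "text-based protocol" concrete, here is a minimal Python sketch that renders in-memory counters into the Prometheus text exposition format that a scrape would fetch. The metric name `foreman_http_requests_total` and its labels are hypothetical, and a real application would use a client library rather than hand-formatting the output.

```python
def render_counters(name, help_text, samples):
    """Render one counter family in Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter value.
    Label values are assumed not to need escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics held in memory by one process until the next scrape.
body = render_counters(
    "foreman_http_requests_total",
    "HTTP requests processed.",
    {(("controller", "hosts"), ("status", "200")): 42,
     (("controller", "api/v2/hosts"), ("status", "500")): 3},
)
print(body)
```

Because the application only answers HTTP requests with this text, all state has to live inside the process that answers the scrape.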

It looks great, but it has a major problem: the Ruby client library
(called client_ruby) does not support multi-process web servers at
all. There are some hacks, but they use local temp files or shared
memory and show rather poor benchmark results (see the links below).
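
Why multiple processes break a pull-based client can be shown with a toy simulation (no real Prometheus code involved): each Passenger-style worker keeps its own in-memory counter, requests are spread across the workers, and a scrape is answered by whichever single worker happens to receive it. The round-robin dispatch and the three worker ids are illustrative assumptions.

```python
from itertools import cycle


class Worker:
    """Stand-in for one forked app-server process with its own memory."""

    def __init__(self, pid):
        self.pid = pid
        self.requests_total = 0  # lives only in this process's memory

    def handle_request(self):
        self.requests_total += 1

    def scrape(self):
        # A pull-based endpoint can only report what this process has seen.
        return self.requests_total


workers = [Worker(pid) for pid in (101, 102, 103)]
dispatcher = cycle(workers)  # assume the server spreads requests round-robin
for _ in range(9):
    next(dispatcher).handle_request()

true_total = sum(w.requests_total for w in workers)
scraped = workers[0].scrape()  # the scrape lands on one worker only
print(f"actual requests: {true_total}, scrape reports: {scraped}")
```

The scrape sees a third of the traffic, which is why the workarounds have to share state between processes somehow.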

There is an option to push metrics into a separate component called
PushGateway, but it was created for things like cron jobs or rake
tasks. Doing multiple HTTP requests for each metric per single app
request is unlikely to perform well. In the README, the authors note
that this should be considered a "temporary solution".
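
For comparison, pushing to the PushGateway means one HTTP request carrying a text-format body to a per-job URL. The sketch below only builds (does not send) such a request with Python's standard library; the host `pushgateway.example.com` and the job name `foreman` are assumptions.

```python
import urllib.request

# Hypothetical gateway host; 9091 is the PushGateway's usual default port.
PUSHGATEWAY = "http://pushgateway.example.com:9091"


def build_push(job, metrics_text):
    """Build (but do not send) a push of text-format metrics for one job."""
    url = f"{PUSHGATEWAY}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        # POST replaces only same-named metrics in the group; PUT replaces all.
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


req = build_push("foreman", "# TYPE foreman_rake_runs_total counter\nforeman_rake_runs_total 1\n")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it.
```

Issuing a request like this from inside every Rails request is the per-request HTTP overhead described above.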

Although Prometheus seems to have a vibrant community, development of
the Ruby library has slowed down, as SoundCloud "does not use many Ruby
apps anymore". It is still a good option to have, though.

https://github.com/prometheus/client_ruby/issues/9
https://github.com/prometheus/client_ruby/commits/multiprocess

OpenTSDB

OpenTSDB consists of a Time Series Daemon (TSD) as well as a set of
command line utilities. Interaction with OpenTSDB is primarily
achieved by running one or more TSDs. Each TSD is independent: there
is no master and no shared state, so you can run as many TSDs as
required to handle any load you throw at it. Each TSD uses the
open-source database HBase (on Hadoop) or the hosted Google Bigtable
service to store and retrieve time-series data.

It uses a push mechanism via a REST JSON API, with an alternative
"telnet-like" text endpoint. Although it does have some agents, it is
used more as a storage backend than as an end-to-end monitoring solution.
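
Both OpenTSDB write paths are simple enough to sketch. The helpers below produce the same data point as a `put` line for the telnet-style endpoint and as a JSON body for the HTTP `/api/put` endpoint; the metric and tag names are made up, and the network transport is left out.

```python
import json


def telnet_put(metric, timestamp, value, tags):
    """Format one data point for the telnet-style 'put' command."""
    # OpenTSDB expects at least one tag per data point.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}\n"


def http_put_body(metric, timestamp, value, tags):
    """Format the same data point as a JSON body for POST /api/put."""
    return json.dumps(
        {"metric": metric, "timestamp": timestamp, "value": value, "tags": tags},
        sort_keys=True,
    )


point = ("foreman.http.requests", 1509478380, 42, {"host": "foreman01", "controller": "hosts"})
print(telnet_put(*point), end="")
print(http_put_body(*point))
```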

http://opentsdb.net/overview.html

Statsd

The main idea behind this instrumentation protocol is simple: get the
measurement out of the application as fast as possible using a UDP
datagram. A collector agent usually runs locally; it does the
aggregation and relays the measurements to the target backend system.
The vanilla version does not support tagging, but there are extensions
or mappings that add it.
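
A sketch of the "fire and forget" idea using only Python's `socket` module: format a counter, a sampled counter and a timing in the statsd wire format (`name:value|type[|@rate]`) and send each as one UDP datagram. The metric names are hypothetical; a real agent usually listens on port 8125, but the usage below binds a throwaway local socket so it can check what arrives.

```python
import socket


def statsd_line(name, value, kind, rate=None):
    """Format one metric: 'c' = counter, 'ms' = timing, 'g' = gauge."""
    line = f"{name}:{value}|{kind}"
    if rate is not None:
        line += f"|@{rate}"  # tells the agent the event was sampled
    return line


def send_metric(sock, addr, line):
    """Send one datagram; no reply is expected and none is awaited."""
    sock.sendto(line.encode("ascii"), addr)


# Usage: a throwaway local listener stands in for the statsd agent.
agent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
agent.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

lines = [
    statsd_line("foreman.requests", 1, "c"),
    statsd_line("foreman.requests", 1, "c", rate=0.1),
    statsd_line("foreman.request_time", 320, "ms"),
]
for line in lines:
    send_metric(client, agent.getsockname(), line)

received = [agent.recv(512).decode("ascii") for _ in lines]
print(received)
client.close()
agent.close()
```

The application never waits for an answer, so each measurement costs roughly one `sendto` system call.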

Almost all monitoring platforms have some kind of
agent/importer/exporter that speaks statsd. The original statsd
daemon was written in Perl years ago, and it was later re-popularized
by the node.js implementation. There are many alternative agents, of
which the most promising is statsite thanks to its very easy
extensibility.

This protocol is my favourite because it plays well with multi-process
Ruby servers and other Foreman components (all of them can just send
UDP packets to localhost), and it moves all aggregation and temporary
data storage out of the Ruby application. It also keeps the chance of
regressions in our codebase to a bare minimum: in the worst case the
aggregating agent fails, and the UDP packets simply get lost without
interrupting the application. The best Ruby client library seems to be
statsd-instrument, actively maintained by Shopify; it is very small and
has no runtime dependencies.
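
What the local agent does with those datagrams can be sketched as a toy aggregator: it sums counters (scaling sampled ones by 1/rate) and, at flush time, reduces timings to a count, a mean and a nearest-rank 90th percentile. This is not the code of statsd or statsite, and the output field names are assumptions; it only illustrates why none of this state has to live in the Ruby processes.

```python
import math
from collections import defaultdict


class StatsdAggregator:
    """Toy collector-side aggregation (illustrative, not a real agent)."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.timers = defaultdict(list)

    def ingest(self, line):
        name, rest = line.split(":", 1)
        fields = rest.split("|")
        value, kind = float(fields[0]), fields[1]
        rate = 1.0
        if len(fields) > 2 and fields[2].startswith("@"):
            rate = float(fields[2][1:])
        if kind == "c":
            # A sampled counter stands in for 1/rate real events.
            self.counters[name] += value / rate
        elif kind == "ms":
            self.timers[name].append(value)
        # Other metric types are ignored in this sketch.

    def flush(self):
        out = {name: {"count": total} for name, total in self.counters.items()}
        for name, values in self.timers.items():
            ordered = sorted(values)
            rank = math.ceil(0.9 * len(ordered))  # nearest-rank 90th percentile
            out[name] = {
                "count": len(ordered),
                "mean": sum(ordered) / len(ordered),
                "upper_90": ordered[rank - 1],
            }
        self.counters.clear()
        self.timers.clear()
        return out


# Lines as they might arrive from several worker processes.
agg = StatsdAggregator()
for line in ["foreman.requests:1|c"] * 3 + ["foreman.requests:1|c|@0.5"]:
    agg.ingest(line)
for ms in range(10, 101, 10):
    agg.ingest(f"foreman.request_time:{ms}|ms")
summary = agg.flush()
print(summary)
```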

https://github.com/etsy/statsd/blob/master/docs/metric_types.md
https://github.com/Shopify/statsd-instrument
https://github.com/prometheus/statsd_exporter
https://github.com/statsite/statsite
https://codeascraft.com/2011/02/15/measure-anything-measure-everything/

New Relic, Instrumental, DataDog, Rollbar

All are paid services. Some clients are open source (the Instrumental
client is MIT-licensed), but they usually come with a poorly documented
protocol and weaker integration with other monitoring solutions. There
are plenty of similar offerings; I might have missed some here.

https://instrumentalapp.com
https://instrumentalapp.com/docs/tcp-collector

Zabbix, Nagios, Icinga

These are more "alerting" systems (a system or service is down), and
they all support application instrumentation to some degree, but it is
not the core of what they do. I have seen them referred to as "legacy
monitoring systems", but I think they are still very relevant. They
are not a good fit for my use case at all, though.

Conclusion

To me, the most open and flexible protocol seems to be statsd. It will
give our users the greatest flexibility for further integration: there
are plenty of generic agents which can relay data to backend systems.

Comments?


-- Later, Lukas @lzap Zapletal

ehelms · October 31, 2017, 7:57pm

Before I give a full reply, can you comment, for each library below, on
which ones are able to run without requiring a separate process? For
context, I am thinking of the container scenario where I run Foreman as
a single process in a container and want it to serve up metrics for
collection.

–
You received this message because you are subscribed to the Google Groups
“foreman-dev” group.
To unsubscribe from this group and stop receiving emails from it, send an
email to foreman-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

–
Eric D. Helms
Red Hat Engineering

lzap · November 1, 2017, 7:51am

Statsd can be configured for remote transport, meaning that the
collecting agent (or aggregating process, if you like) can run on a
remote server (or container). It is recommended to run it either on
localhost or at least within the LAN; it is not a good idea to route
the UDP packets through complex environments, though, as they can get
lost. Creating a single point of failure (SPOF) is also not a good
idea, but I've seen articles and comments about having one central
statsd collector for all hosts. Those people usually had questions
around scalability because the single point of entry was getting
overloaded.
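
Choosing between a local and a remote collector is then just configuration. A minimal sketch, assuming the address comes from a `STATSD_ADDR`-style `host:port` environment variable (statsd-instrument documents a variable of that name, but the parsing here is this sketch's own) and defaulting to a local agent on port 8125:

```python
import os


def statsd_target(env=None, default="localhost:8125"):
    """Return (host, port) for the statsd collector from a host:port string."""
    env = os.environ if env is None else env
    # rpartition splits on the last colon; bracketed IPv6 is not handled here.
    host, _, port = env.get("STATSD_ADDR", default).rpartition(":")
    return host, int(port)


print(statsd_target({}))
print(statsd_target({"STATSD_ADDR": "statsd.lan.example.com:8125"}))
```

Keeping the default on localhost follows the advice above; a remote address would ideally stay within the LAN.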

There are some WIP patches for Prometheus as well, making it possible
to have a single HTTP REST endpoint for all subprocesses of a Rails app
server, but if you take a look at them (links are in my original
email), they are pretty hacky. One creates a local shared memory block
for communication, the other does the same via a serialized db file.
This means dozens of system calls per single measurement; compared to
just one or two for a UDP datagram, that is way too much IMHO.
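
For contrast, the file-based workaround boils down to something like the following toy sketch: every worker persists its counters to its own file, and the scrape endpoint reads and merges all files. The directory layout and JSON format are assumptions (the actual patches use shared memory or a serialized db file), and the worker ids are simulated inside one process, but it shows where the extra open/read/write/rename calls per measurement come from.

```python
import json
import os
import tempfile


def record(metrics_dir, pid, name, increment=1):
    """Worker side: update this process's counter file (several syscalls)."""
    path = os.path.join(metrics_dir, f"{pid}.json")
    counters = {}
    if os.path.exists(path):
        with open(path) as f:
            counters = json.load(f)
    counters[name] = counters.get(name, 0) + increment
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(counters, f)
    os.replace(tmp, path)  # atomic swap so a scrape never sees half a file


def scrape(metrics_dir):
    """Endpoint side: merge every worker's file into one view."""
    merged = {}
    for entry in os.listdir(metrics_dir):
        if not entry.endswith(".json"):
            continue
        with open(os.path.join(metrics_dir, entry)) as f:
            for name, value in json.load(f).items():
                merged[name] = merged.get(name, 0) + value
    return merged


with tempfile.TemporaryDirectory() as d:
    for pid in (101, 102, 103, 101):  # four requests over three simulated workers
        record(d, pid, "foreman_requests_total")
    totals = scrape(d)
print(totals)
```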

The question, though, is whether there is another protocol I am not
aware of. There are actually two, both of which I tested, to be honest:

1) PCP trace API - http://pcp.io/man/man3/pmdatrace.3.html

PCP is a monitoring collection daemon which is available in most Linux
distros (RHEL included), and it has a very simple API which uses a TCP
connection to communicate with the trace agent (called PMDA trace). I
wrote a Ruby wrapper around this simple API
(https://github.com/lzap/ruby-pcptrace) and I have a working
prototype. The disadvantage is that in the PCP world this API is seen
as legacy and might get removed in the future. Also, aggregation is
only done for transaction-type observations.

2) PCP MMV API - http://pcp.io/books/PCP_PG/html/id5213288nat.html

Another agent, which uses memory-mapped files for ultra-fast
communication. This is the fastest application instrumentation I've
seen, but it is a bit of an overkill, primarily targeted at HPC
environments. Also, no aggregation is done and there are no Ruby
bindings at all. In both cases, a PCP daemon needs to be running.

One question, though: isn't it standard practice to have one container
per pod that serves as the monitoring endpoint? I am no expert with
Kubernetes, but I believe that's exactly what this technology is built
for: you can specify services and their dependencies on each other.
The price we need to pay (an extra service) is balanced by better
reliability. I can imagine that when Rails/Passenger stops responding,
you won't be able to reach the monitoring endpoint either, so we'd need
to maintain a separate web stack for that.


-- Later, Lukas @lzap Zapletal

ehelms · November 1, 2017, 1:33pm

> Statsd can be configured for remote transport, meaning that the
> collecting agent (or aggregating process, if you like) can run on a
> remote server (or container). It is recommended to run it either on
> localhost or at least within the LAN; it is not a good idea to route
> the UDP packets through complex environments, though, as they can get
> lost. Creating a single point of failure (SPOF) is also not a good
> idea, but I've seen articles and comments about having one central
> statsd collector for all hosts. Those people usually had questions
> around scalability because the single point of entry was getting
> overloaded.
>
> There are some WIP patches for Prometheus as well, making it possible
> to have a single HTTP REST endpoint for all subprocesses of a Rails app
> server, but if you take a look at them (links are in my original
> email), they are pretty hacky. One creates a local shared memory block
> for communication, the other does the same via a serialized db file.
> This means dozens of system calls per single measurement; compared to
> just one or two for a UDP datagram, that is way too much IMHO.
>

Does Prometheus only fail to work with multi-process Rails web servers?
Does it work for a single-process multi-threaded web server? This is an
interesting roadblock, given you'd expect it to affect lots of web
servers across multiple languages out there.

> One question, though: isn't it standard practice to have one container
> per pod that serves as the monitoring endpoint? I am no expert with
> Kubernetes, but I believe that's exactly what this technology is built
> for: you can specify services and their dependencies on each other.
> The price we need to pay (an extra service) is balanced by better
> reliability. I can imagine that when Rails/Passenger stops responding,
> you won't be able to reach the monitoring endpoint either, so we'd need
> to maintain a separate web stack for that.
>

Yes, standard practice is to think about one container per pod (in a
Kubernetes environment). However, there are patterns for things like log
aggregation and monitoring, such as running a sidecar container that
ensures co-location. The part I don't entirely get with sidecars is that
if I scale the pod to, say, 5, I get 5 web applications and 5 monitoring
containers, and that seems odd. Which is why I think the tendency is
towards models where your single process/application is the endpoint for
your metrics to be scraped by an outside agent or service.

I agree you want the collector to be separate, but if your web application
is down, what value would a monitoring endpoint being alive provide? The
application would be down, thus no metrics to serve up. The other exporters,
such as the one exporting metrics about the underlying system, would be
responsible for giving system metrics. In the Kube world, this is handled
by readiness and liveness probes, which let Kubernetes re-spin the container
if it stops responding.


–
Eric D. Helms
Red Hat Engineering

lzap · November 1, 2017, 3:54pm

> Does Prometheus only fail to work with multi-process Rails web servers?
> Does it work for a single-process multi-threaded web server? This is an
> interesting roadblock, given you'd expect it to affect lots of web
> servers across multiple languages out there.

Any Rails app that has multiple processes currently needs to figure
out how to deliver data to the HTTP endpoint, e.g. store it in a
database or something similar, which is not the best approach.

Absolutely, it lacks quite an important feature right there. It stems
from the design, which is pull-based.

> Yes, standard practice is to think about one container per pod (in a
> Kubernetes environment). However, there are patterns for things like log
> aggregation and monitoring, such as running a sidecar container that
> ensures co-location. The part I don't entirely get with sidecars is that
> if I scale the pod to, say, 5, I get 5 web applications and 5 monitoring
> containers, and that seems odd. Which is why I think the tendency is
> towards models where your single process/application is the endpoint for
> your metrics to be scraped by an outside agent or service.
>
> I agree you want the collector to be separate, but if your web application
> is down, what value would a monitoring endpoint being alive provide? The
> application would be down, thus no metrics to serve up. The other exporters,
> such as the one exporting metrics about the underlying system, would be
> responsible for giving system metrics. In the Kube world, this is handled
> by readiness and liveness probes, which let Kubernetes re-spin the container
> if it stops responding.

In the container world, monitoring agents run on hosts, not in the
containers themselves. And collector agents can be 1:1 or 1:N (e.g.
one for each container host). I am not sure I follow you here. Why
don't you see the added value, again? A monitoring agent without any
apps connected is as useful as an ssh daemon waiting for connections.

Let me put it this way: a push approach seems more appropriate for a
multi-process Ruby application than a pull approach. That's what we
are discussing here, unless there are better protocols/agents I am not
aware of.

Honestly, the pull approach via a simple HTTP REST API seems cleaner,
but it is just not a good fit, and it also puts an unnecessary extra
responsibility on the app itself. You are working on containerizing
Foreman, so it actually works against that effort as well.

Anyway, let me throw in another integration. Collectd has an agent (or
plugin) that opens a local socket which can be used to receive data
from other applications. I wrote a Ruby client library the other day
(https://github.com/lzap/collectd-uxsock), but I believe this makes no
difference compared to statsd: you still need a local process to
gather the data.
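
The collectd unixsock plugin speaks a line-based text protocol. Below is a sketch of formatting a `PUTVAL` command and writing it over a Unix stream socket; the identifier (`foreman01/foreman/derive-requests`) and the interval are made up, and `socket.socketpair()` stands in for connecting to the socket path configured in collectd (a real collectd would also answer with a status line, which this fake end does not).

```python
import socket


def putval(host, plugin, type_, value, interval=10, plugin_instance=None, type_instance=None):
    """Format a collectd unixsock PUTVAL command ('N' means 'now')."""
    ident_plugin = plugin + (f"-{plugin_instance}" if plugin_instance else "")
    ident_type = type_ + (f"-{type_instance}" if type_instance else "")
    return f'PUTVAL "{host}/{ident_plugin}/{ident_type}" interval={interval} N:{value}\n'


cmd = putval("foreman01", "foreman", "derive", 42, type_instance="requests")

# socketpair() stands in for connecting to collectd's configured socket path.
app_end, collectd_end = socket.socketpair()
app_end.sendall(cmd.encode("ascii"))
received = collectd_end.recv(1024).decode("ascii")
print(received, end="")
app_end.close()
collectd_end.close()
```

As noted above, this still relies on a local collectd process to receive and aggregate the values.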


-- Later, Lukas @lzap Zapletal

iNecas · November 2, 2017, 9:05pm

I lean towards the push model here. The main reason is the simpler way
to publish instrumentation data from whatever process we want to
track. Also, my understanding is that we care not only about whether
the service is up or down (readiness and liveness), but also about
trends during processing.

Eric: could you describe in more detail the 5 web applications
requiring 5 monitoring containers? I might be missing where this
implication came from.

– Ivan


lzap · November 3, 2017, 7:56am

> I lean towards the push model here. The main reason is the simpler way
> to publish instrumentation data from whatever process we want to
> track. Also, my understanding is that we care not only about whether
> the service is up or down (readiness and liveness), but also about
> trends during processing.

My ultimate goal with this thread is to reach an agreement and then
file a PR so Foreman Rails can send telemetry data in a way that
anyone can integrate into any kind of existing or future monitoring
framework.

I am currently NOT interested in any kind of system, service or
performance monitoring, although I already have some plans in this
regard. Once we have Rails telemetry integration, I will likely show
a demo of how it works with the PCP monitoring framework.


-- Lukas @lzap Zapletal

lzap · November 7, 2017, 9:49am

Any other ideas for telemetry protocols?

If there are none, I will rebase my telemetry patch back onto the
original statsd-based version.

LZ


ekohl · November 7, 2017, 2:05pm

Have you had a look at InfluxDB? While I have limited experience with
it, it's push-based, which can have benefits. There's influxdb-rails
(https://github.com/influxdata/influxdb-rails), which states:

Out of the box, you'll automatically get reporting of your controller,
view, and db runtimes for each request.

That sounds a lot like what you're interested in.


lzap · November 7, 2017, 3:43pm

Thanks for the tip. I considered InfluxDB, but I have a major problem with
it: it's a backend system rather than an open protocol with many
implementations, more of a database than a monitoring protocol.

It is still a viable option, though. It has an HTTP-based push API,
which will be much slower than statsd (just a UDP packet).
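A minimal Python sketch of what "just a UDP packet" means on the sending side, assuming a statsd agent listening on localhost at the conventional statsd port 8125. The address and the `push_metric` helper are illustrative assumptions, not Foreman code:

```python
import socket

# Assumed address: a statsd agent on localhost, conventional statsd port.
STATSD_ADDR = ("127.0.0.1", 8125)


def push_metric(sock, name, value, metric_type="c", addr=STATSD_ADDR):
    """Send one metric as a single UDP datagram and return immediately.

    There is no connection, acknowledgement or retry: if no agent is
    listening, the datagram is simply lost and the caller does not wait.
    """
    payload = f"{name}:{value}|{metric_type}".encode("ascii")
    try:
        sock.sendto(payload, addr)
    except OSError:
        # Telemetry must never break the request being measured.
        pass
    return payload


# One socket per process is enough; UDP needs no connection setup.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
push_metric(sender, "foreman.http.requests", 1)
```

The whole cost on the application side is formatting a short string and a single `sendto` call; whether anything is listening does not change the application's behaviour.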

Let me see whether we settle on the proposal I just sent: having two
implementations in Foreman core, one push and one pull. I could even put
just a common API into Foreman core and write the two implementations as
plugins; adding a new one (InfluxDB) would then be quite easy if needed.


–
Later,
Lukas @lzap Zapletal

ehelms · November 7, 2017, 1:56pm

> I lean towards the push model here. The main reason is
> the simpler way to publish the instrumentation data from whatever
> process we want to track. Also, my understanding is, that we don't care
> only if the service is up or down (readiness and liveness) but also
> about trends during the processing.
>

In reading about push vs. pull, my biggest issue with push is that the
application has to know where it's pushing to. The pull model, by
contrast, allows an application to say "I have metrics here", and anything
that knows how to scrape and interpret those metrics can grab them at its
leisure. This provides nicer decoupling and potentially more choice if
there is a standard-ish data format used to expose the metrics.

>
> Eric: could you more describe the 5 web applications requiring 5
> monitoring containers?
> I might be missing where this implication came from?
>

This comes from the sidecar method of running something like a statsd
process. A sidecar is essentially just two containers running in a pod,
where one container is considered the main application and the other an
add-on that provides some additional value or performs some actions without
being baked into your main application container. If you picture this idea
and then think about pod scaling (in Kube), every scale-up would also add
another statsd process container. This might be fine in practice, but
could in theory be overkill.

Another method with statsd is simply to run it on the host itself and have
the containers send data to it, but this raises some security concerns
from what I understand.

The biggest limiting factor appears to be how forking web servers are
handled, and that probably constrains us the most. Lukas, have you seen
anything that separates defining what the metrics are from the mechanism
that publishes them? My thinking is: if we started with statsd and wrote
code within the application generating statsd metrics, could one later
simply say "now publish this via an HTTP endpoint in the Prometheus data
format for scraping"?
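One way to keep the publishing mechanism separate, sketched in Python: the application keeps emitting plain statsd lines, and a separate process aggregates them and renders the Prometheus text format for a scraper. This is roughly the idea behind the prometheus/statsd_exporter project linked in the original post; the class below is a hypothetical illustration, not that project's code, and it handles only unsigned counters, gauges and millisecond timers without sample rates:

```python
import re
from collections import defaultdict

# Plain statsd lines only: "name:value|type" with an unsigned value.
# Sample rates ("|@0.1") and signed gauge deltas are not handled here.
_LINE_RE = re.compile(r"^(?P<name>[^:|]+):(?P<value>[0-9]+(?:\.[0-9]+)?)\|(?P<type>c|g|ms)$")


def _fmt(v):
    # Print whole numbers without ".0"; keep full float precision otherwise.
    return str(int(v)) if float(v).is_integer() else repr(v)


class StatsdToPrometheus:
    """Aggregates statsd lines and renders them in the Prometheus text format."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.gauges = {}
        self.timer_sum = defaultdict(float)  # seconds
        self.timer_count = defaultdict(int)

    @staticmethod
    def prom_name(name):
        # statsd separates with dots; Prometheus names use [a-zA-Z0-9_:].
        return re.sub(r"[^a-zA-Z0-9_:]", "_", name)

    def ingest(self, line):
        m = _LINE_RE.match(line.strip())
        if not m:
            return False  # unknown input is dropped, never raised
        name, value, kind = self.prom_name(m["name"]), float(m["value"]), m["type"]
        if kind == "c":
            self.counters[name] += value
        elif kind == "g":
            self.gauges[name] = value
        else:  # "ms" timer: exposed as a summary with only _sum and _count
            self.timer_sum[name] += value / 1000.0
            self.timer_count[name] += 1
        return True

    def render(self):
        """Body a scraper would fetch from an HTTP endpoint such as /metrics."""
        out = []
        for name, v in sorted(self.counters.items()):
            out += [f"# TYPE {name} counter", f"{name} {_fmt(v)}"]
        for name, v in sorted(self.gauges.items()):
            out += [f"# TYPE {name} gauge", f"{name} {_fmt(v)}"]
        for name in sorted(self.timer_count):
            metric = f"{name}_seconds"
            out += [f"# TYPE {metric} summary",
                    f"{metric}_sum {_fmt(self.timer_sum[name])}",
                    f"{metric}_count {self.timer_count[name]}"]
        return "\n".join(out) + "\n"
```

Because the aggregation lives in one separate process, every Passenger worker can keep sending UDP without sharing any state, while a Prometheus server still gets a single endpoint to scrape.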

Eric


On Thu, Nov 2, 2017 at 5:05 PM, Ivan Necas wrote:

– Ivan

On Wed, Nov 1, 2017 at 4:54 PM, Lukas Zapletal lzap@redhat.com wrote:

Does Prometheus fail only with a multi-process Rails web server? Does it
work with a single-process, multi-threaded web server? This is an
interesting roadblock, given you'd expect it to affect lots of web servers
across multiple languages out there.

Currently, any Rails app with multiple processes has to figure out on its
own how to deliver data to the HTTP endpoint, e.g. by storing it in a
database or similar, which is not the best approach.

Absolutely, it lacks quite an important feature right there. This stems
from its pull-based design.

Yes, standard practice is to think about one container per pod (in a
Kubernetes environment). However, there are patterns for things like log
aggregation and monitoring, such as a sidecar container that ensures
co-location. The part I don't entirely get with sidecars is that if I scale
the pod to, say, 5, I get 5 web applications and 5 monitoring containers,
and that seems odd. Which is why I think the tendency is towards models
where your single process/application is the endpoint for your metrics to
be scraped by an outside agent or service.

I agree you want the collector to be separate, but if your web
application is down, what value would a live monitoring endpoint provide?
The application would be down, thus no metrics to serve up. The other
exporters, such as the one exporting metrics about the underlying system,
would be responsible for giving system metrics. In the Kube world, this is
handled by readiness and liveness probes, which let Kubernetes re-spin the
container if it stops responding.

In the container world, monitoring agents run on hosts, not in the
containers themselves. And collector agents can be 1:1 or 1:N (e.g. one
for each container host). I am not sure I follow you here: why don't you
see the added value? A monitoring agent without any apps connected is as
useful as an ssh daemon waiting for connections.

Let me put it this way: the push approach seems more appropriate for
multi-process Ruby applications than the pull approach. That's what we
are discussing here, unless there are better protocols/agents I am not
aware of.

Honestly, the pull approach via a simple HTTP REST API seems cleaner, but
it is just not a good fit, and it also places an unnecessary extra
responsibility on the app itself. You are working on containerizing
Foreman, so it actually works against that effort as well.

Anyway, let me throw in another integration. Collectd has an agent (or
plugin) that opens a local socket which other applications can use to
send it data. I wrote a Ruby client library for it the other day
(lzap/collectd-uxsock on GitHub, a Ruby API for the collectd socket
protocol), but I believe this makes no difference compared to statsd: you
still need a local process to gather the data.
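For comparison with statsd, a Python sketch of the text a client writes to collectd's unixsock plugin: one `PUTVAL` line per value, with an identifier of the form `host/plugin[-instance]/type[-instance]` and `N` meaning "now". The socket path is an assumption (it is whatever `SocketFile` is set to in the plugin configuration), `type_` must be a type collectd knows from its types.db (such as `gauge` or `derive`), and the helper names are hypothetical, unrelated to the Ruby library mentioned above:

```python
import socket


def putval_line(host, plugin, type_, value, plugin_instance=None,
                type_instance=None, interval=None):
    """Format one collectd PUTVAL command; "N" asks collectd to use the current time."""
    plugin_part = f"{plugin}-{plugin_instance}" if plugin_instance else plugin
    type_part = f"{type_}-{type_instance}" if type_instance else type_
    identifier = f"{host}/{plugin_part}/{type_part}"
    options = f" interval={interval}" if interval is not None else ""
    return f'PUTVAL "{identifier}"{options} N:{value}\n'


def send_to_collectd(line, path="/var/run/collectd-unixsock"):
    """Write one command to the unixsock plugin and return collectd's status line.

    The path is an assumption: it must match SocketFile in the plugin config.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(line.encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reply:
            return reply.readline().rstrip("\n")
```

As with statsd, the application only produces a short text line; a local collectd process does the aggregation and forwarding.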

–
Later,
Lukas @lzap Zapletal

–
You received this message because you are subscribed to the Google
Groups “foreman-dev” group.
To unsubscribe from this group and stop receiving emails from it, send
an email to foreman-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

–
Eric D. Helms
Red Hat Engineering

Marek_Hulan · November 7, 2017, 3:31pm

Does that imply we'd need to keep that data stored? This can generate a
huge amount of data, which I suppose would mean implementing some regular
cleanup, like logrotate. I'd prefer to just throw the data away: whoever
listens gets it.

–
Marek

lzap · November 7, 2017, 3:37pm

> In reading about push vs. pull, my biggest issue with it is that the
> application has to have knowledge of where it's pushing. Whereas the pull
> model allows an application to say I have metrics here and anything that
> knows how to scrape and interpret those metrics can grab them at their
> leisure. This provides nicer de-coupling and potentially more choice if
> there is a standard-ish data format used to expose the metrics.

On the other hand, the pull model requires storing the data in memory or
on disk. That's more code in our app, wasting CPU cycles and memory,
because the amount of telemetry data can be large and the only reasonable
way of handling it (which is what pull-based telemetry client libraries do)
is aggregating the data, which spends more CPU cycles and adds a bigger
chance for errors. I personally see this as a bigger issue than knowing
where to push data. In statsd, for example, it's standard practice to push
to localhost, which is a sane default.
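To make the aggregation cost concrete, here is a Python sketch of the kind of state a pull-based client library keeps inside every application process: a histogram with cumulative `le` buckets plus a running sum and count, as in the Prometheus data model. The bucket bounds are assumed values; with a multi-process server, each worker would hold its own separate copy of this state:

```python
import bisect


class Histogram:
    """Cumulative-bucket histogram kept in process memory (Prometheus-style)."""

    def __init__(self, bounds=(0.01, 0.05, 0.1, 0.5, 1.0, 5.0)):
        # Upper bounds ("le") in seconds; these defaults are assumed values.
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # extra slot for +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, seconds):
        # First bucket whose upper bound is >= the observation.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.sum += seconds
        self.count += 1

    def cumulative(self):
        """[(le, observations <= le)], ending with (+Inf, total count)."""
        result, running = [], 0
        for le, n in zip(self.bounds + [float("inf")], self.counts):
            running += n
            result.append((le, running))
        return result


# Every request in every worker process pays for this bookkeeping.
request_time = Histogram()
for t in (0.02, 0.2, 0.03, 2.5):
    request_time.observe(t)
print(request_time.cumulative())
# [(0.01, 0), (0.05, 2), (0.1, 2), (0.5, 3), (1.0, 3), (5.0, 4), (inf, 4)]
```

With statsd, the equivalent bookkeeping happens in the agent, and the application only sends the raw timing.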

But I understand your concerns, so let me elaborate on my proposal for a
common API below.

> The biggest limiting factor appears to be how forking webservers are handled
> and probably constraints us the most. Lukas, have you seen anything related
> to being to define what the metrics are and how they get published being
> able to be separated from the publishing mechanism? My thinking being if we
> started with statsd and wrote code within the application generating statsd
> metrics, if at a later point one could simply say now publish this via HTTP
> endpoint in Prometheus data style for scraping?

Instrumenting in statsd and Prometheus works through very similar
interfaces, and writing a simple API that would allow us to switch
between Prometheus and statsd would be pretty easy. I can incorporate this
into my patch and even provide Prometheus support from day one; we
would have two dependencies, the statsd and Prometheus Ruby libraries,
which is not a big deal. We just need a big warning that it won't work in
a multi-process deployment, so users won't accidentally enable it and lose
a lot of data. Let me describe both APIs:

Prometheus (https://github.com/prometheus/client_ruby)

Types: Counter, Gauge, Histogram (time data), Summary (we would not
use this, as it is not recommended and is CPU-heavy)
Metric naming: single string with underscores (my_metric_name)
Tag support: Any metric can have an arbitrary number of key/value pairs.
Previously there was an "instance" flag, but that is now handled via a tag
named "instance".

Statsd (https://github.com/etsy/statsd/blob/master/docs/metric_types.md)

Types: Counter, Gauge, Time
Metric naming: single string with dots (my.metric.name)
Tag support: Not supported by default, but tags can be mapped into the name.
There is a common protocol extension, though, that lets any metric carry
key/value pairs as well; I would not use it for now.
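Each statsd type is a short text payload. A Python sketch of building one, following the etsy statsd metric_types document for `c`, `g`, `ms` and the `|@rate` sample suffix; the optional `|#key:value` suffix is the DogStatsD-style tagging extension, one widely used variant that vanilla statsd does not understand. `statsd_line` is a hypothetical helper name:

```python
def statsd_line(name, value, metric_type, sample_rate=None, tags=None):
    """Build one statsd datagram payload.

    metric_type: "c" (counter), "g" (gauge) or "ms" (timer).
    tags: optional dict, encoded with the DogStatsD-style "|#k:v" suffix,
    which only extended agents understand.
    """
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        # Tells the agent this value was sampled, e.g. 1 in 10 events.
        line += f"|@{sample_rate}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


print(statsd_line("foreman.http.duration", 320, "ms"))
# foreman.http.duration:320|ms
```

Leaving `tags` unset keeps the payload compatible with any statsd agent, which matches the plan not to use the extension for now.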

Instances are important: they allow us to group a common metric, e.g.
request_processed would be a counter of overall requests with instances
for all HTTP codes (200, 201, 500 etc.), so you can easily compare those
on a single graph. On the other hand, arbitrary tags are not that
useful; I would not add them in the first phase, although there are
common statsd extensions to support them.

PROPOSAL FOR COMMON PROMETHEUS/STATSD API

Types: Counter, Gauge, Time (Histogram in Prometheus, Time in Statsd)
Metric naming: Single string with underscores, Prometheus naming
conventions (https://prometheus.io/docs/practices/naming/)
Instance name: Single string with underscores (or nil if there
is no instance for a metric)

For Prometheus this is a natural mapping, no surprises there. For statsd,
which is a little more free-form, I propose this mapping:

metric_name.optional_instance_name

E.g.:

http_request_duration_seconds.architectures_show
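The proposed mapping can be sketched in Python as two small functions: one appends the optional instance with a dot for statsd, the other keeps the name and carries the instance as a label for Prometheus. The function names are hypothetical. One caveat worth checking: the Prometheus server itself attaches an `instance` label to every scraped target, and by default (with `honor_labels` off) a clashing label exposed by the application is renamed to `exported_instance`, so the label name is a parameter here:

```python
import re

# Prometheus metric name syntax.
_PROM_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def statsd_name(metric, instance=None):
    """statsd: metric_name.optional_instance_name"""
    return f"{metric}.{instance}" if instance else metric


def prometheus_series(metric, instance=None, label="instance"):
    """Prometheus: same metric name, instance carried as a label."""
    if not _PROM_NAME.match(metric):
        raise ValueError(f"invalid Prometheus metric name: {metric}")
    return f'{metric}{{{label}="{instance}"}}' if instance else metric


print(statsd_name("http_request_duration_seconds", "architectures_show"))
# http_request_duration_seconds.architectures_show
print(prometheus_series("http_request_duration_seconds", "architectures_show"))
# http_request_duration_seconds{instance="architectures_show"}
```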

Using this simple approach, we can have a global setting that the user
can change: no telemetry, telemetry via the Rails logger, Prometheus, or
statsd.
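A Python sketch of such a global setting, assuming the hypothetical setting values `none`, `logger`, `prometheus` and `statsd` and a single `increment(name, instance)` call. A null object for "no telemetry" means call sites never need to check whether telemetry is enabled. The class names and the `emit` sink are illustrative assumptions; a real statsd backend would send the line over UDP, and the in-memory Prometheus backend shares the multi-process limitation discussed in this thread:

```python
import logging


class NullTelemetry:
    """'No telemetry': every call is a no-op, so call sites need no checks."""

    def increment(self, name, instance=None, value=1):
        pass


class LoggerTelemetry:
    """Stand-in for 'telemetry via the Rails logger'."""

    def __init__(self, logger):
        self.logger = logger

    def increment(self, name, instance=None, value=1):
        self.logger.info("telemetry %s[%s] += %s", name, instance or "-", value)


class StatsdTelemetry:
    """Uses the proposed 'metric_name.optional_instance_name' mapping."""

    def __init__(self, emit):
        self.emit = emit  # hypothetical sink; a real one would send UDP

    def increment(self, name, instance=None, value=1):
        full = f"{name}.{instance}" if instance else name
        self.emit(f"{full}:{value}|c")


class PrometheusTelemetry:
    """In-process counters for a scrape endpoint (single process only)."""

    def __init__(self):
        self.counters = {}

    def increment(self, name, instance=None, value=1):
        key = (name, instance)
        self.counters[key] = self.counters.get(key, 0) + value


def build_telemetry(setting, emit=print, logger=None):
    """Map the global setting to a backend instance."""
    factories = {
        "none": NullTelemetry,
        "logger": lambda: LoggerTelemetry(logger or logging.getLogger("telemetry")),
        "prometheus": PrometheusTelemetry,
        "statsd": lambda: StatsdTelemetry(emit),
    }
    if setting not in factories:
        raise ValueError(f"unknown telemetry setting: {setting!r}")
    return factories[setting]()
```

Application code would call `telemetry.increment("http_requests_total", "200")` the same way regardless of which backend the setting selected.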

–
Later,
Lukas @lzap Zapletal

iNecas · November 7, 2017, 10:06pm

>
>> I lean towards the push model here. The main reason is
>> the simpler way to publish the instrumentation data from whatever
>> process we want to track. Also, my understanding is, that we don't care
>> only if the service is up or down (readiness and liveness) but also
>> about trends during the processing.
>>
>
> In reading about push vs. pull, my biggest issue with push is that the
> application has to know where it is pushing, whereas the pull
> model allows an application to say "I have metrics here", and anything that
> knows how to scrape and interpret those metrics can grab them at its
> leisure. This provides nicer decoupling and potentially more choice if
> there is a standard-ish data format used to expose the metrics.
>
>
>>
>> Eric: could you describe in more detail the 5 web applications requiring 5
>> monitoring containers?
>> I might be missing where this implication came from.
>>
>
> This part comes from using the sidecar method of running something
> like a statsd process. A sidecar is essentially just running two containers
> in a pod where one container is considered the main application and the
> other an addon that provides some additional value or performs some actions
> without being baked into your main application container. If you picture
> this idea, and then think about pod scaling (in Kube), for every scale up
> you'd also be adding another statsd process container. This might be fine
> in practice, but could in theory be overkill.
>

Let me ask again: why a new stats process for every container? It seems to me
like any other infrastructure service, similar to a database: you don't run a
separate database process for every container either. I might be missing something.

– Ivan

On Tue, 7 Nov 2017 at 14:56, Eric D Helms wrote:
> On Thu, Nov 2, 2017 at 5:05 PM, Ivan Necas wrote:

Another method with statsd is simply to run it on the host itself and have
the containers send data to it, but from what I understand this raises some
security concerns.

The biggest limiting factor appears to be how forking web servers are
handled; that probably constrains us the most. Lukas, have you seen anything
that lets us separate defining what the metrics are from the mechanism that
publishes them? My thinking is: if we started with statsd and wrote code
within the application that generates statsd metrics, could one at a later
point simply say "now publish this via an HTTP endpoint in Prometheus format
for scraping"?

Eric

– Ivan

On Wed, Nov 1, 2017 at 4:54 PM, Lukas Zapletal lzap@redhat.com wrote:

Is it only multi-process Rails web servers where Prometheus does not work?
Does it work for a single-process, multi-threaded web server? This is an
interesting roadblock, given you'd expect it to affect lots of web servers
across multiple languages out there.

Any Rails app that has multiple processes currently needs to figure
out how to deliver data to the HTTP endpoint, e.g. store it in a
database or something similar, which is not the best approach.

Absolutely, it lacks quite an important feature right there. This stems
from its pull-based design.
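To make the multi-process problem concrete, here is a minimal, hypothetical Ruby illustration. It assumes MRI on a platform with fork(2) and only loosely mimics a preforking server such as Passenger or Unicorn: each forked worker gets its own copy of in-process state, so an in-memory counter, which is what a pull endpoint would serve, only ever reflects the process that happens to answer the scrape.

```ruby
$stdout.sync = true # flush immediately so buffered output is not duplicated or lost across fork
$requests = 0       # in-memory counter, as a pull-style registry would keep it

def handle_request
  $requests += 1
end

# Fork three "workers"; each handles two requests in its own memory copy.
pids = Array.new(3) do
  fork do
    2.times { handle_request }
    puts "worker #{Process.pid} counted #{$requests} requests"
    exit!(0) # skip at_exit hooks inherited from the master
  end
end
pids.each { |pid| Process.wait(pid) }

# Six requests were handled in total, yet no single process knows that.
puts "master counted #{$requests} requests"
```

Each worker reports 2 and the master reports 0: the six requests are spread across separate address spaces, which is why multi-process Ruby servers need either shared storage (the temp-file and shared-memory hacks mentioned above) or a push to an external aggregator.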

Yes, standard practice is to think about one container per pod (in a
Kubernetes environment). However, there are patterns for things like log
aggregation and monitoring, such as a sidecar container that ensures
co-location. The part I don't entirely get with sidecars is that if I scale
the pod to, say, 5, I get 5 web applications and 5 monitoring containers, and
that seems odd. That is why I think the tendency is towards models where your
single process/application is the endpoint for your metrics, to be scraped
by an outside agent or service.

I agree you want the collector to be separate, but if your web application
is down, what value would a live monitoring endpoint provide? The
application would be down, thus there would be no metrics to serve up. The
other exporters, such as the one exporting metrics about the underlying
system, would be responsible for providing system metrics. In the Kube world,
this is handled by readiness and liveness probes that let Kubernetes re-spin
the container if it stops responding.

In the container world, monitoring agents run on hosts, not in the
containers themselves, and collector agents can be 1:1 or 1:N (e.g. one
per container host). I am not sure I follow you here. Why don't you see
the added value? A monitoring agent without any apps connected is as useful
as an SSH daemon waiting for connections.

Let me put it this way: the push approach seems more appropriate for a
multi-process Ruby application than the pull approach. That's what we
are discussing here, unless there are better protocols/agents I am not
aware of.

Honestly, the pull approach via a simple HTTP REST API seems cleaner, but it
is just not a good fit, and it also puts unnecessary extra responsibility on
the app itself. You are working on containerizing Foreman, so it actually
works against that effort too.

Anyway, let me throw in another integration. Collectd has an agent (or
plugin) that opens a local socket which other applications can use to
send it data. I wrote a Ruby client library the other day
(lzap/collectd-uxsock on GitHub: Ruby API for collectd socket protocol), but
I believe this is no different from statsd: you still need a local process
to gather the data.
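For comparison with the statsd line format, here is a rough Ruby sketch of the plain-text protocol of collectd's unixsock plugin, as documented in collectd-unixsock(5): a `PUTVAL` command with an identifier of the form `host/plugin[-plugin_instance]/type[-type_instance]`, an optional `interval=` option, and a value list where `N` means "now". This is not the API of lzap/collectd-uxsock; the module name, the example host and plugin names, and the socket path `/var/run/collectd-unixsock` (which must match the `SocketFile` option in collectd.conf) are all assumptions. The type must exist in collectd's types.db (e.g. `derive` or `gauge`).

```ruby
require "socket"

# Hypothetical minimal client for collectd's unixsock PUTVAL command.
module CollectdPutval
  # Identifier: host/plugin[-plugin_instance]/type[-type_instance]
  def self.identifier(host, plugin, type, plugin_instance: nil, type_instance: nil)
    plugin_part = plugin_instance ? "#{plugin}-#{plugin_instance}" : plugin
    type_part = type_instance ? "#{type}-#{type_instance}" : type
    "#{host}/#{plugin_part}/#{type_part}"
  end

  # "N" tells collectd to use the current time for the value.
  def self.line(identifier, value, interval: 10)
    %(PUTVAL "#{identifier}" interval=#{interval} N:#{value})
  end

  # Send one command and return collectd's one-line status reply
  # (a non-negative status means success). The socket path is an assumption.
  def self.send_line(line, path: "/var/run/collectd-unixsock")
    UNIXSocket.open(path) do |sock|
      sock.write(line + "\n")
      sock.gets&.chomp
    end
  end
end
```

For example, `CollectdPutval.line(CollectdPutval.identifier("foreman.example.com", "foreman", "derive", type_instance: "requests"), 42)` produces `PUTVAL "foreman.example.com/foreman/derive-requests" interval=10 N:42`. As noted above, this still relies on a local collectd process listening on the socket, just like statsd.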

–
Later,
Lukas @lzap Zapletal

–
You received this message because you are subscribed to the Google
Groups “foreman-dev” group.
To unsubscribe from this group and stop receiving emails from it, send
an email to foreman-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

–
Eric D. Helms
Red Hat Engineering
