Hello,
I am looking for an application instrumentation protocol for the
Foreman Rails application that fulfills the following requirements:
The protocol must work with multi-process servers like Passenger.
The protocol can be easily integrated into Foreman Tasks and Smart Proxy.
The protocol or agent must support aggregation of time-based data
(quantiles, average).
The protocol must integrate with the top three open-source monitoring
frameworks.
Let me summarize my findings so far. I am looking for advice or
comments on this topic. I have already worked on some prototypes, but
before I commit to a final solution, I want to be sure I am not
missing something I don't know about.
Before you send comments, please keep in mind I am not searching for
a monitoring solution to integrate with. I want an application
instrumentation library (or protocol) to be able to export
measurements (or telemetry data if you like) from Rails (like the
number of requests processed, SQL queries, time spent in the database
or a view, time spent rendering a template or calling a backend
system).
Prometheus
A flexible text-based protocol (alternatively protobuf) with HTTP
REST-like communication. It was designed to be pull-based, meaning
that an agent makes HTTP calls to the web application, which holds all
metrics until they are flushed. It was built for the Prometheus
monitoring framework (Apache licensed), initially created by
SoundCloud. The server and most agents are written in Go, and it can
run without an external database or export into 3rd party storage
backends.
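To make the pull model concrete, here is a minimal sketch of an
in-process counter registry rendered in the Prometheus text format.
TinyRegistry and the metric and label names are hypothetical and this
is not the client_ruby API; escaping of label values is left out. Note
that the counters live in the memory of one process, so under a
multi-process server each worker would hold its own copy.

```ruby
# Hypothetical in-process registry, for illustration only.
class TinyRegistry
  def initialize
    # metric name => { labels hash => counter value }
    @counters = Hash.new { |hash, name| hash[name] = Hash.new(0) }
  end

  def increment(name, labels = {}, by = 1)
    @counters[name][labels] += by
  end

  # Render all counters in the Prometheus text exposition format.
  # Label values are not escaped here; real values may need it.
  def render
    out = []
    @counters.each do |name, series|
      out << "# TYPE #{name} counter"
      series.each do |labels, value|
        label_text = labels.map { |k, v| "#{k}=\"#{v}\"" }.join(",")
        label_text = "{#{label_text}}" unless label_text.empty?
        out << "#{name}#{label_text} #{value}"
      end
    end
    out.join("\n") + "\n"
  end
end

registry = TinyRegistry.new
registry.increment("http_requests_total", { controller: "hosts", status: 200 })
registry.increment("http_requests_total", { controller: "hosts", status: 200 })
registry.increment("http_requests_total", { controller: "hosts", status: 500 })
puts registry.render
# # TYPE http_requests_total counter
# http_requests_total{controller="hosts",status="200"} 2
# http_requests_total{controller="hosts",status="500"} 1
```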
It looks great, but it has a major problem: the Ruby client library
(called client_ruby) does not support multi-process web servers at
all. There are some hacks, but these use local temp files or shared
memory with rather bad benchmark results (see the links below).
It is possible to push metrics into a separate component called
PushGateway, but this was created for things like cron jobs or rake
tasks. Doing multiple HTTP requests, one for each metric, per single
app request is unlikely to perform well. In the README the authors
note that this should be considered a "temporary solution".
Although Prometheus seems to have a vibrant community, development of
the Ruby library has slowed down as SoundCloud "does not use many Ruby
apps anymore". But it is still a good option to have.
https://github.com/prometheus/client_ruby/issues/9
https://github.com/prometheus/client_ruby/commits/multiprocess
OpenTSDB
OpenTSDB consists of a Time Series Daemon (TSD) as well as a set of
command line utilities. Interaction with OpenTSDB is primarily
achieved by running one or more of the TSDs. Each TSD is independent.
There is no master, no shared state so you can run as many TSDs as
required to handle any load you throw at it. Each TSD uses the open
source database Hadoop/HBase or hosted Google Bigtable service to
store and retrieve time-series data.
It uses a push mechanism via a REST JSON API with an alternative
"telnet-like" text endpoint. Although it does have some agents, it is
used more as a storage backend than as an end-to-end monitoring
solution.
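As a rough illustration of the two push formats, the sketch below
builds one data point as a telnet-style "put" line and as a JSON body
for the HTTP put endpoint. The helper names, the metric name and the
tags are made up for this example; nothing is sent over the network.

```ruby
require "json"

# Telnet-style line: put <metric> <timestamp> <value> <tagk=tagv> ...
# OpenTSDB expects at least one tag per data point.
def opentsdb_put_line(metric, value, tags, timestamp = Time.now.to_i)
  tag_text = tags.map { |k, v| "#{k}=#{v}" }.join(" ")
  "put #{metric} #{timestamp} #{value} #{tag_text}"
end

# JSON body for the HTTP API (POST to /api/put on a TSD).
def opentsdb_put_json(metric, value, tags, timestamp = Time.now.to_i)
  JSON.generate({ metric: metric, timestamp: timestamp, value: value, tags: tags })
end

puts opentsdb_put_line("foreman.requests", 42, { host: "web01" }, 1500000000)
# put foreman.requests 1500000000 42 host=web01
puts opentsdb_put_json("foreman.requests", 42, { host: "web01" }, 1500000000)
```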
http://opentsdb.net/overview.html
Statsd
The main idea behind this instrumentation protocol is simple: get the
measurement out of the application as fast as possible using a UDP
datagram. A collector agent usually runs locally; it does the
aggregation and relays the measurements to the target backend system.
The vanilla version does not support tagging, but extensions or
mappings can add that.
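For reference, a statsd datagram is just a short line of text in the
form name:value|type, optionally followed by a sample rate. A minimal
sketch of a formatter follows; the helper name and the metric names
are made up for this example.

```ruby
# Build one statsd line: <name>:<value>|<type>[|@<sample_rate>]
# Common types: "c" (counter), "ms" (timing), "g" (gauge).
def statsd_line(name, value, type, sample_rate = nil)
  line = "#{name}:#{value}|#{type}"
  line += "|@#{sample_rate}" if sample_rate
  line
end

puts statsd_line("foreman.requests", 1, "c")        # foreman.requests:1|c
puts statsd_line("foreman.db_time", 12.5, "ms")     # foreman.db_time:12.5|ms
puts statsd_line("foreman.workers", 4, "g")         # foreman.workers:4|g
puts statsd_line("foreman.requests", 1, "c", 0.1)   # foreman.requests:1|c|@0.1
```

Since the vanilla format has no tags, one possible mapping is to encode
dimensions in the dotted name itself, for example
foreman.hosts.index.requests.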
Almost all monitoring platforms have some kind of
agent/importer/exporter that talks statsd. The original statsd daemon
was written in Perl years ago and was then re-popularized by a Node.js
implementation, but there are many alternative agents, of which the
most promising is statsite, which is very easy to extend.
This protocol is my favourite because it plays well with multi-process
Ruby servers and other Foreman components (all can just send UDP
packets to localhost), and it also takes all aggregation and storage
of temporary data out of the Ruby application. It also reduces the
chance of regressions in our codebase to a bare minimum: in the worst
case the aggregating agent can fail, but UDP packets will simply get
lost without interrupting the application. The best Ruby client
library seems to be statsd-instrument, actively maintained by Shopify;
it is very small and has no runtime dependencies.
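To show what "fire and forget" means in practice, here is a minimal
sketch using only the Ruby standard library. FireAndForgetStatsd is a
made-up class for this example and is not the statsd-instrument API;
the host and port defaults assume a collector on localhost listening
on the usual statsd port 8125.

```ruby
require "socket"

# Hypothetical minimal client: sends statsd lines over UDP and never
# lets a delivery problem reach the caller.
class FireAndForgetStatsd
  def initialize(host = "127.0.0.1", port = 8125)
    @host = host
    @port = port
    @socket = UDPSocket.new
  end

  def increment(name)
    send_line("#{name}:1|c")
  end

  # Time the block in milliseconds and return the block's own value.
  def measure(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    send_line("#{name}:#{(elapsed * 1000).round(2)}|ms")
  end

  private

  def send_line(line)
    @socket.send(line, 0, @host, @port)
  rescue StandardError
    nil # instrumentation must never break the request
  end
end

statsd = FireAndForgetStatsd.new
statsd.increment("foreman.requests")
total = statsd.measure("foreman.sum_time") { (1..1000).sum }
puts total # 500500
```

Nothing needs to be listening for this to run: if no collector is
there, the datagrams are dropped and the application carries on.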
https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
New Relic, Instrumental, DataDog, Rollbar
All are paid services. Some clients are open-source (Instrumental is
MIT licensed), but usually the protocol is not well documented and
integration with other monitoring solutions is worse. There are plenty
of similar offerings; I might have missed some here.
https://instrumentalapp.com
https://instrumentalapp.com/docs/tcp-collector
Zabbix, Nagios, Icinga
These are more "alerting" systems (a system or service is down). They
all support application instrumentation to some degree, but it is not
the core of what they do. I have seen them referred to as "legacy
monitoring systems", but I think they are still very relevant. They
are not a good fit for my use case at all, though.
Conclusion
To me, the most open and flexible protocol seems to be statsd. It
will give our users the most flexibility for further integration,
since there are plenty of generic agents which can relay data to
backend systems.
Comments?