Back when I was hooking up Katello to Candlepin’s internal Artemis to consume events, I started out with some simple testing to wrap my head around Artemis by standing up an instance external to Candlepin. By the end of that, I had spent quite a bit of time in the documentation. I also ended up looking through the Artemis code to debug an issue I was running into. The experience was good, and things worked the way I expected them to. Candlepin already supports connecting to an external Artemis, and I’ve had it running in that mode. I’ve received useful help from the devs too. TLDR: I like the standalone Artemis route.
Re: HTTP Polling, if it’s not necessary, I would prefer to avoid it. My definition of necessary would be anything that is required for us to support large deployments with 60k, 100k, 150k clients using this new tech.
I have no experience with MQTT, but my intuition says that we should avoid clients connecting to the Smart Proxy directly. It’s currently not designed for large amounts of connections, especially if they’re long lived. Also, the connection authentication is not set up at all for it.
The first round of testing of my POC didn’t go too well. I had a single machine (2 vCPUs, 8 GB RAM) running foreman, smart-proxy, mosquitto (the broker) and 100 mqtt clients written in ruby. Kicking off a job against all of them would be fine until the clients started reporting to the smart proxy that they had finished. Since all of the clients received the job at almost the same time and all the scripts were the same, all the hosts finished at almost the same time and started hammering the smart proxy at almost the same time. Some of the callbacks would time out, because the smart proxy wasn’t able to respond to so many requests at once. Here I also encountered a strange bug in curl, where the command would not exit even if the underlying connection went away. To show some numbers: across several attempts, 80 hosts usually finished successfully and the rest timed out.
After putting some workarounds in place to deal with curl hanging, I was able to get 100 out of 100.
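One common way to soften a burst of simultaneous result callbacks like the one described above is to have each client wait a random delay before reporting and to back off between retries. This is not what the POC did; it is only a minimal Python sketch of the idea. The helper name `report_delays` and the `base` and `cap` values are assumptions made for illustration.

```python
import random


def report_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per attempt, using full-jitter backoff.

    Hypothetical helper: `base` and `cap` are illustrative, not POC settings.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles with each attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so clients spread out.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # 100 clients that finish at the same instant get different first-report delays.
    rng = random.Random(42)
    first_waits = [report_delays(1, base=5.0, rng=rng)[0] for _ in range(100)]
    print(f"first reports spread from {min(first_waits):.2f}s to {max(first_waits):.2f}s")
```

With something like this on the client side, the callbacks would arrive over a window of a few seconds instead of all at once, at the cost of slightly later result reporting.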
After OOM-killing my entire machine by trying to deploy more clients, I reevaluated my deployment strategy. Instead of having N clients, each listening on a single topic, I went with 1 client listening on N topics. Since I was mostly focused on testing “our” part of the stack, I didn’t mind slightly diverging from the real world deployment.
With this new deployment I tried running jobs against 1k, 2k and 5k hosts and didn’t encounter any real issues. A single echo hello; sleep 10; echo bye job against 5k hosts ran for 29 minutes, which is not too shabby for an unoptimized POC.
Mosquitto (the one I used in my POC) is packaged for both EL7 and Debian buster so we should be safe on that front. Afaik artemis isn’t really packaged for anything, but hopefully it is true to the write once, run anywhere spirit of java.
Same. We would have to start keeping state on the local or remote (or both) side to prevent jobs from being run twice and so on, which mqtt would do for us.
I don’t understand. One of the main goals of a message bus is to enqueue messages so they can be picked up in a reliable way. Are you saying this did not work?
Don’t understand. The communication should be point-to-point, Foreman wants to contact hosts X, Y and Z to perform some action. And reply goes back as well to a specific endpoint. What exactly was or was not working well?
I implemented the first option Justin suggested (mqtt signalization + job retrieval over http). So for every single host and job there are at least two http requests going to the proxy. One to retrieve the job, the other to report results. The hammering was coming from the job results being sent back to the proxy. Delivery of the signal messages to clients over mqtt worked perfectly.
mqtt is not messaging, it is a publish-subscribe kind of thing. Many producers can write to a topic and many consumers can consume messages from a topic. That means we have to “emulate” the required point-to-point nature by having per-host topics. I didn’t have the resources to run N clients, each consuming messages from its own personal topic, so I had a single client which consumed messages from all the per-host topics.
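To make the per-host topic idea concrete, here is a small Python sketch of how topic filters decide who receives what. It models MQTT filter matching in a simplified way (`+` matches one level, a trailing `#` matches the rest) and ignores special cases such as `$`-prefixed topics. The `client/<hostname>` layout and the function names are assumptions for illustration, not something taken from the POC.

```python
def host_topic(hostname, prefix="client"):
    """Build the per-host topic; the `client/<hostname>` layout is an assumption."""
    return f"{prefix}/{hostname}"


def topic_matches(topic_filter, topic):
    """Simplified MQTT filter matching: `+` is one level, a trailing `#` is the rest."""
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(f_parts):
        if part == "#":
            # Multi-level wildcard is only valid as the last level of the filter.
            return i == len(f_parts) - 1
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)


if __name__ == "__main__":
    hosts = [f"host-{n}.example.com" for n in range(3)]
    # Production shape: each host subscribes to its own topic and sees nothing else.
    for h in hosts:
        own = host_topic(h)
        seen = [host_topic(o) for o in hosts if topic_matches(own, host_topic(o))]
        print(h, "receives", seen)
    # Test-harness shape: one client listening on every per-host topic at once.
    print(all(topic_matches("client/+", host_topic(h)) for h in hosts))
```

The last line corresponds to the single test client that listened on all per-host topics; a real deployment would rely on the exact-match subscription plus broker-side ACLs.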
Mosquitto has an option to establish ACLs based on, among other things, client id from client certs. The idea would be that smart proxy can publish to any topic, but hosts can read only their own topics. My bet would be artemis can do something similar, although I haven’t checked yet.
The reason I was asking about the http polling option is that I’m worried we’re effectively replacing qpid, which is often a black box that is complex to properly set up, manage, ensure stays connected to all clients and debug, with some other tool that would potentially also be complex to properly set up, manage, ensure stays connected to all clients and debug.
There is also a good question about the expected scale and level of service - the right solution if we’re aiming for 5K hosts connected to a proxy executing within half an hour of the job triggering is quite different to the solution needed if we’re aiming at scaling to 100K hosts connected to a proxy with <1m to execute. How often are the jobs executed? Every hour, or maybe once in a few days? There is also the question of how many hosts a specific execution targets - will most jobs target all machines connected to the proxy at the same time, or will most jobs target a specific host or a small set of hosts?
Now I understand, yes, Smart Proxy itself (the Ruby process) will not scale well and HTTP polling is not an option. That was not what I was suggesting. I was thinking of developing our very own (external) lightweight process to handle all WebSocket requests. Smart Proxy would talk to the process via IPC (e.g. REST API over localhost or UNIX socket).
Well, there are brokers which support both p2p and p/s messaging patterns. But let’s not play with words.
As Ewoud mentioned, this cannot be done due to security: our users maintain servers for different clients and we simply can’t allow them to easily subscribe to all info. This must be point to point.
This smells very much like the only option we have. Therefore if a new host is created, we’d need to update Mosquitto ACL configuration files on the Foreman Server and all Smart Proxies. There is also the possibility of writing an auth plugin that would do this dynamically, for example by performing requests to the Foreman Server.
I think we need to consider both security and scalability from the very beginning. I think the ideal solution is that a client has an X509 certificate (puppet, rhsm or foreman-cert tool) that entitles the client to access its very own messaging endpoint. Whatever this is (p2p or p/s) is an implementation detail, but the communication must be point-to-point.
Again, I need to think out loud - wouldn’t it be easier to develop our own simple websockets service (preferably not in Ruby) that can be tightly designed and integrated for our own use case? With an HTTP(S) handshake it’s something our users know very well and firewalls allow by default, so it will “just work”. This does not sound like an enterprise bus with tons of messages, types and endpoints. We need to communicate straightforward messages like “run this script” or “are you alive?”.
I didn’t say I’m suggesting this for production deployments. In production deployments this should be forbidden and the point-to-point-ness enforced by ACLs. I did this to lower resource requirements when testing another part of the stack.
Not necessarily. The rules can be generic, in the shape “client/$name”, where $name is taken from the client’s certificate which the client presents when connecting to the broker. If it really works like that, we wouldn’t need to update the configuration.
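A toy Python model of that kind of generic rule follows. Mosquitto’s ACL file format does have `pattern` rules in which `%c` and `%u` are replaced by the connecting client’s id and username, but this sketch uses the “client/$name” shape from the post rather than real broker syntax, and it is not how any broker is actually implemented. It assumes the broker derives the identity from a certificate it trusts and that certificate names are unique; the function names and host names are hypothetical.

```python
def expand_pattern(pattern, identity):
    """Substitute the connecting client's identity into a generic ACL pattern."""
    return pattern.replace("$name", identity)


def may_read(pattern, cert_identity, topic):
    """Allow a read only when the topic equals the pattern expanded for this client.

    `cert_identity` stands for whatever name the broker extracts from the
    client's certificate; the client does not get to choose it per request.
    """
    return expand_pattern(pattern, cert_identity) == topic


if __name__ == "__main__":
    rule = "client/$name"
    # A host reading its own topic is allowed.
    print(may_read(rule, "host1.org-a.example", "client/host1.org-a.example"))
    # The same host trying another organization's topic is refused.
    print(may_read(rule, "host1.org-a.example", "client/name.from.org.b"))
```

Because the allowed topic is computed from the presented identity, one rule covers every host and no per-host configuration update is needed when a host is created, provided the broker really supports this kind of substitution.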
How would we enforce that someone from organization A is not trying to connect with their own certificate to client/name.from.org.b? I only briefly read the Mosquitto configuration, but there’s not much flexibility.
I swear I had replied to this before the holiday, but apparently not!
The current target is the ability to execute a job on 10K hosts per smart proxy over the course of 4 hours. At the same time, a job executing on 2 hosts should not take 4 hours and should execute within a small amount of time (a few minutes at the most). A polling based mechanism handles the first situation easily, but the second situation demands much more performance.
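Some back-of-the-envelope arithmetic shows why the two situations differ. The sketch below uses the 10K hosts and 4 hours from the target above and the two HTTP requests per host and job (job retrieval plus result report) mentioned earlier in the thread. The 120 second poll interval is purely an assumed value standing in for “a few minutes at the most”, and the function names are hypothetical.

```python
def sustained_rate(hosts, window_seconds, requests_per_host=2):
    """Average requests per second if the work is spread evenly over the window."""
    return hosts * requests_per_host / window_seconds


def polling_rate(hosts, poll_interval_seconds):
    """Requests per second generated when every host polls once per interval."""
    return hosts / poll_interval_seconds


if __name__ == "__main__":
    # Large job: 10K hosts over 4 hours, two requests per host.
    print(f"{sustained_rate(10_000, 4 * 3600):.2f} req/s")  # prints 1.39 req/s
    # Small job that must start within minutes: every host has to poll often,
    # whether or not there is work for it (assumed 120 s interval).
    print(f"{polling_rate(10_000, 120):.2f} req/s")  # prints 83.33 req/s
```

Under these assumptions, pure polling that meets the small-job latency costs roughly sixty times the request rate of the large job itself, and that load is constant even when no job is running. A push notification lets hosts stay idle until there is something to fetch.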
I wanted to provide an update to this. The general flow is very similar with a few key differences. As we worked through user stories, it became obvious that creating a new pull provider would result in:
Duplicated job templates
Needing to enhance the UI to support running the same ‘job’ across multiple providers, potentially with different inputs
While in the long term this could lead to a split between the templating (scripts & ansible playbooks) and the execution technology (ssh & ansible), for the time being it has some implications for the ‘pull provider’ work.
We came away with:
There should not be a dedicated pull provider
The ‘SSH’ provider should be named the ‘Script’ provider
The ‘Script’ provider should support executing via SSH or Pull on a given smart proxy
The ‘Script’ provider should support an optional MQTT notification if configured to use Pull
The sys admin installing & configuring the Smart proxy would decide if the Script provider should use SSH or Pull
The foreman application does not need to know which host will use pull or which host will use SSH; it will simply use a smart proxy for execution, and that smart proxy will use the method it is configured for
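The decisions above could be modelled roughly as follows. This is a minimal Python sketch, not the actual Foreman or Smart Proxy code: the class `ScriptProvider`, its `mode` setting, the `notify` callback standing in for an MQTT publish, and the `dispatch` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ScriptProvider:
    """Hypothetical per-proxy provider; `mode` is chosen by the sys admin."""
    mode: str = "ssh"  # "ssh" or "pull"
    notify: Optional[Callable[[str], None]] = None  # optional MQTT-style notifier
    queue: Dict[str, List[str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def run(self, host: str, script: str) -> str:
        if self.mode == "ssh":
            # Stand-in for opening an SSH session and running the script.
            self.log.append(f"ssh {host}: {script}")
            return "pushed"
        if self.mode == "pull":
            # Queue the script until the host fetches it.
            self.queue.setdefault(host, []).append(script)
            if self.notify is not None:
                self.notify(host)  # tell the host there is work waiting
            return "queued"
        raise ValueError(f"unknown mode: {self.mode}")

    def fetch(self, host: str) -> List[str]:
        """What a pull client would call to retrieve its queued scripts."""
        return self.queue.pop(host, [])


def dispatch(provider: ScriptProvider, hosts: List[str], script: str) -> List[str]:
    """Foreman-side view: hand the job to the proxy without knowing its mode."""
    return [provider.run(h, script) for h in hosts]


if __name__ == "__main__":
    notified: List[str] = []
    pull_proxy = ScriptProvider(mode="pull", notify=notified.append)
    print(dispatch(pull_proxy, ["host-a", "host-b"], "echo hello"))
    print(notified, pull_proxy.fetch("host-a"))
    print(dispatch(ScriptProvider(mode="ssh"), ["host-c"], "echo hello"))
```

The point of the shape is that `dispatch` is identical for both proxies, so the same job template serves SSH and Pull without a second provider.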
This could have been a webhook event, so users can actually implement anything they want. Looks like a good fit, webhooks are fire-and-forget events which can be configured to be executed on remote services (HTTP) or via Smart Proxy Shellhook plugin. We could provide an example script for MQTT.