How We Think About and Use Smart Proxies Architecturally

The goal of this thread is to bring about discussion on how we think about and use smart proxies when building and designing within the Foreman architecture. The impetus for bringing this up was a PR designed to unify how the Katello plugin knows about Pulp deployments. From that PR came two distinct opinions about how we should think about smart proxies, and I want to try to come to a collective view to drive changes. First, some background to understand what inspired the broader conversation.

Background

Some brief background to help shed some light. Katello uses Pulp in two modes: master and child. The Pulp master is considered the source of all truth; the server talks to it to create and manage all repositories that Katello knows about. A Pulp child is a remote Pulp that contains a subset of the content in the Pulp master, determined by assigning Lifecycle Environments to the Smart Proxy associated with that Pulp child. Today, in both cases, a Smart Proxy is always associated with the Pulp deployment. The associated Smart Proxy provides information that Pulp does not expose directly, such as storage usage. Further, and more importantly, the Smart Proxy today is treated as the location of Pulp. Katello treats the smart proxy installed on the server as the “default” and flags it as such; this nomenclature allows Katello to identify the Pulp master. However, in some areas of the code, Katello instead uses SETTINGS to define and reference Pulp attributes, including the URL of the Pulp master. The key takeaway is that Katello defines how to talk to the Pulp master in SETTINGS and uses that in some code paths, while in other places it uses the Smart Proxy to determine where Pulp is located.

Two Servers Diverged in a Smart Proxy

From the aforementioned thread, two opinions arose as to how to handle this difference and raised a broader architectural question. I will do my best to represent each opinion but @sean797 and @ekohl please correct anything I get wrong. As you read through this, be thinking about current deployments as well as forward thinking deployments where a Smart Proxy and a service are less co-located. For example, putting a database or Pulp on one server and a Smart Proxy on a different server. Or, a smart proxy container that is not at the same service address as the Pulp container.

Server is Server

This design principle is about letting the server be the server: independent and standalone, without the requirement for a smart proxy. In this concept, the server should be able to do everything it needs to do without a smart proxy present. Smart proxies would then be used for remote operational management or scale-out. Relating this back to the original PR for an example, this would mean defining where the Pulp master is in SETTINGS at deployment time and using that alone to determine the communication properties of the Pulp master. If a Pulp child is brought online near a datacenter, a smart proxy would also be installed that knows about the Pulp child and is able to relay information about it back to the server. This provides a standalone, lighter-weight server deployment that requires less to scale, at the cost of potentially less discoverable service mechanisms and referencing remote servers in two separate ways.
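For illustration, a minimal sketch of the “Server is Server” approach, assuming a hypothetical deploy-time settings layout (the `SETTINGS` keys and helper name here are illustrative, not Katello’s actual code):

```ruby
require 'uri'

# Hypothetical deploy-time settings, as might be loaded from a settings.yaml.
SETTINGS = {
  pulp: {
    url: 'https://pulp.example.com/pulp/api/v2/',
    ca_cert: '/etc/pki/katello/certs/katello-default-ca.crt'
  }
}.freeze

# The server resolves the Pulp master directly from its own configuration;
# no smart proxy is consulted at any point.
def pulp_master_uri
  raise 'Pulp master not configured in SETTINGS' unless SETTINGS.dig(:pulp, :url)

  URI.parse(SETTINGS[:pulp][:url])
end
```

The trade-off shows up here: the server is self-sufficient, but remote Pulp children would still be referenced through smart proxies, giving two different ways to locate the same kind of service.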

I’m a Proxy, Let me Proxy

This design principle is about letting the Smart Proxy be a proxy and service discovery mechanism for all remote and backend services. Whenever a backend service is deployed, a Smart Proxy would need to exist somewhere that knows, at a minimum, where the service is located and the properties of its deployment. Taking the PR example, this would mean continuing the current usage of a Smart Proxy: requiring one on the server and using it to inform Katello about where Pulp lives and its other properties. Deployments would require a Smart Proxy to tell the server where the Pulp master or child is and provide information about it. This provides an abstraction and a single point of contact for information about a remote or backend service, uniformly within the application, at the cost of always requiring an additional smart proxy service running somewhere.
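As a contrast, a minimal sketch of the proxy-centric lookup, where the server derives the Pulp location from whichever registered proxy advertises a Pulp feature (the `SmartProxy` struct and feature names are illustrative, not Foreman’s actual model):

```ruby
# Illustrative stand-in for Foreman's smart proxy records.
SmartProxy = Struct.new(:name, :url, :features)

PROXIES = [
  SmartProxy.new('proxy1.example.com', 'https://proxy1.example.com:9090', ['TFTP', 'DNS']),
  SmartProxy.new('server.example.com', 'https://server.example.com:9090', ['Pulp'])
].freeze

# All Pulp communication is addressed to the proxy that advertises the
# feature; the proxy itself knows where the actual Pulp deployment lives.
def pulp_proxy(proxies)
  proxies.find { |p| p.features.include?('Pulp') } or
    raise 'No smart proxy with the Pulp feature is registered'
end
```

Note how the failure mode becomes explicit: if no proxy registers the feature, the server has to decide what to do, which is exactly the “proxy is not registered, now what?” question raised later in this thread.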

Next Steps

First, I referenced a specific Katello scenario due to the PR that inspired this conversation, but devs should expand their thinking to all usages of the Smart Proxy today and how one approach or the other could shift those interactions. Secondly, within the container work, this issue and the PR that sparked the conversation represent a functionality breakage that I can’t fix without resolving this discussion. Therefore, I will do my best to keep the conversation flowing and recapped, and to come to a solution in a timely manner.


I hadn’t realized until today that Katello/Foreman talks directly to Pulp. Ideally we should have Katello talking to the Smart Proxy API. I think @ekohl made a good assessment in the PR - we should do it right, but it’s a bunch of work.

I don’t understand the details, but having a smart-proxy API that returns a URL endpoint also smells. We have something similar in TFTP - the module returns the “next-server” IP address, which Foreman reads and then puts into the DHCP reservation over the DHCP API. This has always been super confusing to users - you actually edit a YAML file on the proxy in order to have Foreman create the correct DHCP record.

If you must, rather store this in the Foreman DB. But I’d rather see something that is closer to “everything goes through the proxy” - you could create a generic Pulp Proxy API in a similar way to the RHSM API in Katello. Then Pulp URL = Capsule URL.
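To illustrate the “Pulp URL = Capsule URL” idea, a rough sketch of the path mapping such a pass-through proxy module might perform (all paths, prefixes, and names here are hypothetical):

```ruby
# Hypothetical pass-through: the smart proxy exposes /pulp_proxy/... on its
# own URL and forwards each request to the real Pulp backend, so callers
# only ever need the proxy's address.
PULP_BACKEND = 'https://pulp.internal:8443'.freeze

def backend_url(proxy_path)
  # Strip the proxy-API prefix, then address the backend's Pulp API directly.
  PULP_BACKEND + '/pulp/api/v2' + proxy_path.sub(%r{\A/pulp_proxy}, '')
end
```

With such a mapping, Foreman would configure only the proxy URL; where Pulp actually runs becomes a proxy-side detail.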

Thanks for the thoughts. I have some follow-up questions inline, but I do have one over-arching request, if you don’t mind. One thing I want to capture with this discussion is the why: why we think about our architecture and the Smart Proxy’s role in it the way we do. Would you mind expanding, from your perspective, on why we should have a smart proxy and treat it as the know-it-all for all services?

From your perspective, what are the benefits of having the smart proxy sit between the server and Pulp’s API for making requests? Is a proxy in the middle of that communication unneeded or beneficial? Do we get any additional benefits, such as automatic retry?

One hallway idea, specific to Pulp, no matter the outcome (whether a proxy exists as the go-between or not), was to create a database object to represent Pulps: to cache details about Pulp, let admins configure some properties through the UI, or to potentially add a known Pulp through the UI/API. I don’t want this broader discussion to delve into that detail, though.

I think on the Foreman side there was always a clear understanding that Foreman talks to the smart proxy, and the proxy is the one that talks to everyone else. The areas where Foreman talks directly to something are generally regarded as something that needs correcting. For example, Compute Resource communications really should go through a smart proxy.

There are a lot of benefits to that design. In addition to abstractions for things like DNS (Foreman knows nothing about Active Directory or Infoblox; it just knows it’s talking to a DNS smart proxy), smart proxies provide the benefit of a single port, a single REST API, and a single authentication scheme to do everything. There’s only one firewall rule to open, too.

If we’re rearchitecting Katello’s pulp communication, I think all the config about where the pulp server is should be on the smart proxy, and Katello should just talk to the smart proxy.


My understanding is that on the infrastructure level you want to have a single point of communication.

Again, an infrastructure (networking) reason. On top of that, I was thinking about the proxy design pattern - multiple implementations of the same interface. What we do with the DHCP/DNS/TFTP modules and providers is powerful. I have to admit that those APIs are super trivial.

Thanks for the replies so far. I’m gonna play a bit of devil’s advocate and ask some questions, some of which may sound silly, but they are designed to help me dig in further.

I do think the single point of communication is a strong pattern when talking security and latency. Reading through this, you could interpret it as a blanket pattern, and I’d like to find out if we need to draw the line somewhere. There are some instances where we should not follow this pattern. I think the easiest backend system to look at is the database: we don’t talk to a smart proxy in order to talk to our database. We consider it core to the application runtime and therefore interact with it directly. That raises the question: when do we consider something essential to the application runtime, and if we do, should it be a direct communication stream?

As for the current state of the art: is the smart proxy in a state to handle being a single point of communication, and thus of failure? Advanced proxies today (e.g. Envoy) have built-in traffic monitoring, resiliency through retries and timeout management, and routing.

In theory I think that Katello should talk to a proxy to reach Pulp. This means it shouldn’t include Runcible but only its own API to the proxy. The proxy can then realize the connection to Pulp. That way you can handle both Pulp 2 and Pulp 3 transparently to Katello.

Now, that is the theory; in practice this might have a lot of overhead. Some actions might not be mappable, and the abstraction can leak.

An added benefit is that you can have different authentication for Foreman <-> Proxy and Proxy <-> Pulp. That means we don’t have to enforce our CA auth.

I count myself in the I’m a Proxy, Let me Proxy camp. The reasons for this are:

  1. it forces the developers of the server to think about the cases where the service is or isn’t available (a proxy with the particular feature is not registered - now what?)

  2. the proxy object on the Foreman side allows a unified way of assigning service providers to resources. I think we can benefit from this concept in the future (that doesn’t mean we need only static assignment of the proxies)

  3. we tried providing both with-proxy and all-in-one solutions in remote execution, and the experience was that we got two code paths and occasionally introduced issues into one path while the other kept working: one path is always better than two unless there is another reason that outweighs this.

That feels like a lot of additional overhead, given how we make use of Runcible today. It is not simply an API mapper; it also provides some higher-level extensions that make certain actions easier from a code perspective. I don’t think we want to re-implement the Pulp API on the proxy.

Which CA auth are you referencing here?

I agree - this is one of the original changes that sparked the PR that seeded this discussion. I think, in this case, if code path differences were limited to only the configuration itself, this wouldn’t be too bad. The aspect of this I keep coming back to is essential services vs. functional services. Essential services, to me, are those that are required for the application to function. Functional services are those that are required for a particular feature set to work. The best example of an essential service is a database. For Katello, the current design requires a Pulp and a Candlepin to exist, making them essential. For remote execution and Katello, as an example, Dynflow is an essential service.

As I understand things, with the current REX and Foreman Ansible architecture, the Smart Proxy is now an essential service? As in, they cannot function without one - is that correct?

I did say that it could be. Like @lzap said: for services like DNS and DHCP it’s clearer that there are multiple implementations and an abstraction.

In Pulp we set the SSL username in the Apache config. This breaks pulp-admin. Granted, we can probably fix this in a pure Katello scenario as well.

Personally, I see the Smart Proxy as a client-facing thing. I suspect the original reason we use the Smart Proxy to communicate with DNS/DHCP etc. is that they are client-facing services. As a user, if I reached a certain scale (in terms of client systems), I would probably deploy more Smart Proxies to assist.

Katello differs from Foreman in the way it uses Pulp: it is required to pull something in from the internet into my local network, which is something that’s not done within Foreman core or the majority of popular plugins AFAIK. So it makes sense that the pattern could differ for Pulp compared to Puppet, DHCP, DNS, etc.

In line with the thinking that the Smart Proxy is a client-facing thing… why would I want to deploy a Smart Proxy with the Pulp master (which I want to use purely to download, manipulate, and federate content out to my Pulp children running with Smart Proxies, where clients will consume the content from)?

Looking to the future…
Our current pattern of a close Pulp & Smart Proxy relationship is problematic when coupled with clusters of Pulp. What if I do a LOT of CI and want to use Pulp & Katello as part of that? I want to be able to build a Pulp cluster that all the content manipulation is offloaded to, without a Smart Proxy - I don’t care for any disk usage monitoring the Smart Proxy provides because I have my own fully featured monitoring service separate from Katello.


I see lots of larger enterprises starting to go down the path of automated testing & building of OSes and applications. This often requires building and testing thousands of artifacts (e.g. RPMs), and if I choose to use Katello & Pulp as an artifact repository for that, I might want a Pulp cluster.

tl;dr
As a large-scale user interested in a Pulp cluster, either for availability or capacity (or both :wink: ), why would I want a Smart Proxy as yet another piece of complexity? What’s the benefit to me as the user?

In general, when that “essential” thing does not involve interacting with 3rd-party software that could be deployed on another box (Puppet, 1000 flavors of DNS/DHCP, etc.), then it’s fine to do the connection directly.

A counterexample is LDAP authentication. We use https://github.com/theforeman/ldap_fluff to abstract away the LDAP provider details, and Foreman uses the library directly. Would it be any simpler to just proxy the connection through the smart-proxy? On the Foreman end, yes. However, it’s also a PITA to force the user to deploy and configure a proxy on the LDAP system (which they might not have control of).

At least for Ansible & REX, they were designed to not make the proxy essential. `foreman_remote_execution_core` and `foreman_ansible_core` can be used by smart-proxy or by Foreman, and those two gems are the ones that contain the business logic needed to do whatever is required. Of course, this is harder to code and test, and on many occasions that caused bugs. Reducing the possible “paths” to just one is easier to code, which in my experience leads to fewer bugs.

There is a distinction here though. DNS/DHCP is an interface with different implementations behind it. Pulp is not an interface; it is a specific service.

Wouldn’t this be better with an async model? Fire and forget, and let the backend service pick it up later?

The proxy is used to extend foreman’s reach into foreign networks, i.e. a distributed architecture. There’s not a lot of reason to proxy the database, you only have one (and maybe a replica) and it’s local to the Foreman instance. I don’t think the discussion should be about “essentialness” or not, but whether or not it’s a component of which there can be multiple instances that need to be orchestrated. Pulp definitely falls into that category.

Most of the features you mention should be handled by purpose-built software/hardware, like a load balancer IMHO. I don’t see the point in reinventing the wheel there. But even then, the ability to cluster smart proxies would be useful, but it largely depends on the service being proxied. Clients can’t talk to just any old pulp server because the metadata is different on every single one. #18717 would also need to get solved to have Foreman be aware of the concept.

I have seen the following uses, which don’t really line up with your definitions @ehelms:

  1. Service interface definition: this is the DNS/DHCP use case. Create a single API and allow the proxy to implement different instances under the covers.

  2. Service location federation: how to locate and scale out a service.

  3. RESTification: create a REST interface where none exists.

  4. Network proxies: allow Foreman to bridge network boundaries.

Going forward, I think keeping (1) and (4) is very relevant, but I think things like kube are better for (2).

An abstraction isn’t necessary to get the benefits of a single auth scheme, single port, and single API endpoint.

A backend service doesn’t have to run on the same box as a proxy. It can be fine to open a TCP connection from the proxy to an external box. What the proxy can provide is an abstraction: the LDAP server is in DC1 while Foreman is in DC2. The proxy can be placed in a DMZ while the LDAP server is only on an internal network. Since the proxy has client-side SSL certificates, this can be secure enough. This is probably more common with services like DHCP or DNS, though.


I disagree. I’d like to treat Pulp as just a REST API endpoint with no extra insight whatsoever. That means you can run Pulp in a different environment (Kubernetes, OpenShift, $cloud, bare metal) than your Foreman/Katello. Then you can’t rely on kube.