Highly Available Smart Proxies (part 2)

sean797 · February 8, 2018, 11:17pm

Some of us spoke about this at cfgcamp in Ghent. I want to summarize what was discussed and agreed, for wider distribution and feedback.

Use cases we want to cover:

Clustering Smart Proxies together from a client’s perspective & putting a load balancer in front of them.
Clients accessing a Smart Proxy on a different Interface/Hostname/Route than Foreman would contact the Smart Proxy on (this is actually a requirement of the above)

To support the first use case, some Smart Proxy features require that Foreman always performs an action on all of the Smart Proxies in a cluster (e.g. TFTP), others require that Foreman performs an action on one Smart Proxy but tries the next if that fails (e.g. external DNS), and others could be done using either method (e.g. Content, explained at [1]).

I have put together a diagram explaining these 2 scenarios. A dotted line means normally only one of the connections will happen (though if there is an error, Foreman will try the other). A solid line means a connection will always happen.

Where features could use both modes, we are initially only going to cover one of the scenarios. In the long term we would like to provide for both cases, but for now this logic will be stored in code; we could store it in a model in future if we wanted to allow for both.
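
A minimal sketch of the two modes described above, in Python for illustration only (Foreman itself is Ruby). The mode names, the `FEATURE_MODE` table and the `dispatch` function are hypothetical; only the two behaviours and the TFTP/DNS examples come from the post.

```python
from typing import Callable, Iterable, List

# Hypothetical mode names; the post only describes the two behaviours.
ALL = "all"                      # e.g. TFTP: act on every proxy in the cluster
FIRST_SUCCESS = "first_success"  # e.g. external DNS: try one, fall back to the next

# Hard-coded per-feature mode ("stored in code" for now, as the post says).
FEATURE_MODE = {"tftp": ALL, "dns": FIRST_SUCCESS}


def dispatch(feature: str, proxies: Iterable[str],
             action: Callable[[str], None]) -> List[str]:
    """Run `action` on proxies according to the feature's mode.

    Returns the proxies on which the action succeeded.
    """
    mode = FEATURE_MODE[feature]
    done: List[str] = []
    errors: List[Exception] = []
    for proxy in proxies:
        try:
            action(proxy)
        except Exception as exc:
            if mode == ALL:
                raise  # in this mode every proxy has to succeed
            errors.append(exc)
            continue  # first-success mode: try the next proxy
        done.append(proxy)
        if mode == FIRST_SUCCESS:
            return done  # one success is enough
    if mode == FIRST_SUCCESS:
        raise RuntimeError(f"all proxies failed for {feature}: {errors}")
    return done
```

Keeping the mode in a plain table like this mirrors the "logic in code" choice; moving it to a model later would only change where `FEATURE_MODE` is read from.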

This means:

Foreman needs to have a way of Grouping Smart Proxies together on a per Feature basis & holding the Hostname/Route to access the service.
Hosts & Subnets are currently associated with Smart Proxies, this would need to change to the new SmartProxyGroup/Route object.
A SmartProxyGroup/Route containing one Smart Proxy will be created when a Smart Proxy is registered, based on its URL.
We will migrate existing Smart Proxies, creating a Group/Route for each of them.
Where users do not want to group Smart Proxies together, they can use a Smart Proxy Group of one.
Users who want to use one Smart Proxy but want clients to access it via a different hostname/route (multi-homing; the second use case) can create an additional Group/Route containing one Smart Proxy but with a different Hostname/Route than the URL.
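
The list above could be sketched roughly as follows. This is illustrative Python, not Foreman code: the `SmartProxy` and `Route` classes and the `register` helper are hypothetical names, and taking the hostname from the proxy URL is an assumption about what "based on its URL" means.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse


@dataclass
class SmartProxy:
    name: str
    url: str


@dataclass
class Route:
    """Hypothetical stand-in for the SmartProxyGroup/Route object."""
    hostname: str
    proxies: List[SmartProxy] = field(default_factory=list)


def register(proxy: SmartProxy, routes: List[Route]) -> Route:
    """On registration, create the default group of one from the proxy URL."""
    route = Route(hostname=urlparse(proxy.url).hostname, proxies=[proxy])
    routes.append(route)
    return route


routes: List[Route] = []
proxy1 = SmartProxy("proxy1", "https://proxy1.example.com:8443")
default_route = register(proxy1, routes)

# Multi-homing: a second group of one, same proxy, different client-facing name.
routes.append(Route("proxy1.dmz.example.com", [proxy1]))
```

Hosts and Subnets would then point at a `Route` rather than at the proxy itself.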

[1] If a group of Smart Proxies share the same content storage (Pulp) you may want to only create a Sync task on one and have the other operate in a “read-only” mode using the same storage (think NFS share). Not really HA but does allow load-balancing for more capacity.

Where we need your help

Any feedback/input or questions?
- I encourage you to first check out the existing PR, as it implements some of the above plan.
What do we call this model we use to group Smart Proxies together?
- It needs to hold a Hostname/Route that the service is accessible via from the client
- Has an n:m relation with Smart Proxies
- Users select this object on the Host form, i.e “Puppet Proxy” is now “Puppet Route”
- Users select this on the Subnet forms, i.e “DHCP Proxy” becomes “DHCP Route”
- Our current best idea is “Route”, maybe “Client Route”.

SideAngleSide · February 8, 2018, 11:39pm

As far as naming, wouldn’t Group be preferred to Route? Especially because you can have a group of 1 or more proxies. I toyed with using cluster, but that probably implies a sharing of state between proxies which wouldn’t generally be true.

iNecas · February 9, 2018, 8:07am

It seems like a move in the right direction. My biggest concern in the previous proposal was the attempt to combine grouping and routing in the same concept, referred to as hostname, which was quite confusing and counterintuitive to me.

Having proxy groups, with those optionally having some routing attributes, sounds better to me.

iNecas · February 9, 2018, 9:19am

Some comments after re-reading the proposal:

Concerns:

Mixing two concepts

Reading through the proposal again, it seems there is still an attempt to address two things at once: the HA and multi-home scenarios. While they are often needed at the same time, I’m not sure it means we need to solve those things with a single concept. See below for consequences.

Increased host assignment complexity

The proxy assignment from the host is already quite complex for the user, and every plugin makes the host form worse. If we add an additional layer in there, while keeping the status quo, I’m worried that it will make the usability even worse (from the setup and understanding perspective).

Allowing one proxy in multiple groups

From the use of the group itself for multi-homing, it implies, that one proxy can be in multiple groups. This can lead to strange situations:

You can have Group1 with Proxy1 and Proxy2 and Group2 with Proxy2 and Proxy3. What happens when I sync content to Proxy1? Should the Proxy3 be affected as well?

Proposal

Splitting the grouping/clustering and the routing/multi-home into two things.

Proxy Groups

The group would represent a set of proxies in a synchronized state, and any of the proxies could be used for the job. Any proxy could be either in no group or in exactly one group. A host could be assigned (ideally through the proxies profile) to a proxy that is not in any group, or to a group.

We also need to count on the fact that some features might not support the HA mode: for those, we would not allow multiple proxies with this feature in a single group.
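
The two rules above (a proxy is in at most one group; a feature without HA support appears on at most one proxy per group) could be checked along these lines. Illustrative Python only: `validate_groups` is a hypothetical helper and the contents of `HA_CAPABLE` are made up for the example.

```python
from typing import Dict, List, Set

# Assumption for the example: which features may run on several proxies in a group.
HA_CAPABLE = {"puppet", "tftp"}


def validate_groups(groups: Dict[str, List[str]],
                    features: Dict[str, Set[str]]) -> List[str]:
    """Return human-readable violations of the two Proxy Group rules."""
    errors: List[str] = []
    owner: Dict[str, str] = {}  # proxy name -> first group it was seen in
    for group, members in groups.items():
        counts: Dict[str, int] = {}
        for proxy in members:
            # Rule 1: a proxy is in no group or in exactly one group.
            if proxy in owner:
                errors.append(f"{proxy} is in both {owner[proxy]} and {group}")
            else:
                owner[proxy] = group
            for feature in features.get(proxy, set()):
                counts[feature] = counts.get(feature, 0) + 1
        # Rule 2: non-HA features may appear on only one proxy per group.
        for feature, count in sorted(counts.items()):
            if count > 1 and feature not in HA_CAPABLE:
                errors.append(
                    f"{group}: {feature} has no HA mode but is on {count} proxies")
    return errors
```

With Group1 = Proxy1 + Proxy2 and Group2 = Proxy2 + Proxy3, the first rule flags Proxy2, which is the situation raised under "Allowing one proxy in multiple groups".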

Proxies profile (or some better name for the concept)

Instead of assigning the proxies directly to the hosts, one could create a reusable profile, where they could choose a proxy or group for specific roles, as well as hostnames to be used by the host using this profile. This way, we would reduce the complexity from the host form, as well as allowing additional attributes, that can’t be added to the proxy/group directly, as they can differ on per-host basis.

sean797 · February 9, 2018, 11:48am

Thanks, I hadn’t thought about making the route attribute optional; I think it’s a good idea. If you create a group only for DHCP, a routing attribute wouldn’t make sense.

I don’t think so. Having a “route” type object actually makes sense: it’s how you configure infrastructure outside of Foreman (using DNS as a relation), and continuing to use the same concept makes sense, especially for new users who understand how DNS relates to systems. Obviously it’s a change, and there’s friction related to all changes, but the benefits far outweigh that friction IMO.

Related to this, I think we should have a “use the same proxy-group for all features” functionality to alleviate some of the pain on the hosts form.

No… if you Sync content to Proxy1 then it only goes to Proxy1. However we could allow syncing of Groups, which would be the recommended thing if they are grouped.
Also, when you group Proxies together there would be comprehensive checking to ensure features are exactly the same, they are in the same lifecycle environment & Organization, etc. (I probably should have added that above.)

As I understand your proposal, a user would create a Proxy Group for Smart Proxies they wish to collate together. You would then have/create a Proxy Profile for each Smart Proxy & Proxy Group ? So in the scenario in my diagram above I would have 3 Proxy Profiles to choose from? (one for each Proxy and an extra one for the group)

I don’t understand how that would “reduce the complexity from the host form”? A user still needs to select a Proxy Profile per feature.
Could you explain “as well as allowing additional attributes, that can’t be added to the proxy/group directly, as they can differ on per-host basis.” I don’t understand that at all?

As I understand your proposal you are moving the grouping of Smart Proxies into a separate object? Why would I want to create 2 objects (Proxy Group & Proxy Profile) that are always going to have a One-to-One relation, in that a Proxy Profile will be assigned either a Smart Proxy or a Proxy Group ? You might as well have 1 object. Also with 2 objects we would have more menu items and objects which increases complexity.

I think the 2 proposals are very similar, I just don’t understand how making the Grouping of Smart Proxies to a separate object to that of the routing info is advantageous? I mean you are linking them because they have the same routing attributes.

lzap · February 9, 2018, 12:43pm

How about pulling this outside of Foreman core by creating a smart proxy module/plugin that will essentially proxy all requests to 1:N other proxies. So in Foreman we are still talking to one proxy, nothing changes here, it reports the same modules (plus “proxy-proxy” one that provides the proxy capability). We can start small (just DHCP and whatever we need) and add more and more endpoints when necessary. We already have the know-how, we do this for templates already from the other side. This is for HA/clustering.

For multi-home I still think that creating one name alias for proxy (extra column in smart-proxy table - a simple string) is a good start. Clients could be configured to use alternate names per subnet, per hostgroup or even per host when needed.

This might look like over-simplifying your proposal, but I am trying to find the easiest way of helping the biggest amount of Foreman users. Won’t cover all of them, that never happens.

sean797 · February 9, 2018, 1:08pm

because the single point of failure then becomes the Smart Proxy that is going to essentially proxy all requests to the others.

Yes, so do I, but we’re not just trying to solve multi-homing; we’ve had this conversation many times before.

iNecas · February 9, 2018, 2:00pm

Nope, there would actually be just one profile per host, encapsulating the proxy settings for the host (and other hosts in the same situation).

If I have a proxy in a group, I would not expect the system to allow me to get the proxies in the group into an inconsistent state.

Yes, that would not make sense at all: that also was not what was proposed. See above for explanation.

By the additional attributes I mean things like multi-home, where different hosts access the same proxy by different names (please correct me if my understanding of multi-home is different from yours).

ekohl · February 9, 2018, 2:34pm

I do like the idea. Any one of the proxies would be able to update the other proxies in this case. If the smart proxy plugin (provider) doesn’t need replicating (because it’s talking to a HA REST API for example) then it’s done. That would simplify the orchestration in Foreman.

In the case of TFTP it could also save bandwidth if you don’t need to download from a (potentially) slow external source but use your fast internal network.

You do need to consider the error scenario and rollbacks though.

sean797 · February 9, 2018, 5:10pm

@iNecas Right, I understand now… I like the idea of Proxy Profile as a way of simplifying the Host form, though I think that is a separate issue from what was intended in this discussion and could be added at a later date.

As I understand your proposal, the thing you have called Proxy Group is very similar, if not the same, to what I have called SmartProxyGroup or Route? Is there a difference? Could you explain if so?

I guess it’s a question of where we want this logic. I quite like that our Smart Proxies are “dumb”; I think Foreman being fully aware of the infrastructure layout is beneficial in other areas like Templates, and probably others with plugins.

I think that could be done in both scenarios.

Dmitri_Dolguikh · February 9, 2018, 8:13pm

I think what is missing from this conversation is how this proposal affects smart-proxies we have now and shapes the direction in which smart-proxy design can be evolved.

In particular, I don’t think we should assume that all smart-proxies are stateless: this isn’t the case now (core isc dhcpd provider is stateful. I think openscap module is stateful as well), and I would like to keep this option for the future.

Unfortunately, putting a dumb load-balancer in front of proxies limits how the proxies can be configured and what modules can be used. I’d like to have a conversation (probably in addition to this one) about highly-available smart-proxies, and possible ways to achieve that.

James_Shewey · February 9, 2018, 10:49pm

If stateful smart-proxies are an issue, then they are an issue regardless of whether there is a load balancer in the mix. What happens if my smart-proxy daemon crashes or the server powers off in the middle of an operation now?

We don’t need to assume all smart-proxies are stateless, just that they can gracefully recover from a crash. If there isn’t already a means of graceful recovery, this may mean that Foreman needs to signal the beginning and ending of an operation. If a Smart proxy sees an intermediary request without seeing a request to start an op, it should signal Foreman to start over from the beginning.
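
The begin/end signalling idea can be sketched as below. This is illustrative Python with hypothetical names (`ProxyState`, `RestartOperation`); no such protocol exists in the smart proxy today as far as this thread states, it is only the suggestion above written out.

```python
class RestartOperation(Exception):
    """Tells the caller (Foreman) to start the operation over from the beginning."""


class ProxyState:
    """In-memory record of operations this proxy instance has seen begin."""

    def __init__(self) -> None:
        self.open_ops = set()

    def begin(self, op_id: str) -> None:
        self.open_ops.add(op_id)

    def step(self, op_id: str, payload: str) -> str:
        # An intermediary request for an operation we never saw start means we
        # crashed, or the request landed on a different proxy behind the balancer.
        if op_id not in self.open_ops:
            raise RestartOperation(op_id)
        return f"applied {payload}"

    def end(self, op_id: str) -> None:
        self.open_ops.discard(op_id)


proxy = ProxyState()
proxy.begin("op-1")
first = proxy.step("op-1", "record A")

proxy = ProxyState()  # simulate a crash: in-memory state is lost
try:
    proxy.step("op-1", "record B")
    restarted = False
except RestartOperation:
    restarted = True  # Foreman would replay the operation from begin()
```

The same check covers a load balancer sending a mid-operation request to a different proxy, since that proxy never saw `begin` either.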

sean797 · February 10, 2018, 11:43am

@Dmitri_Dolguikh isc dhcpd has support for a failover mode, though I’m not sure how well that works or if there are differences between ipv4 and ipv6; we’re hoping to take advantage of that. We went over a lot of features in Ghent and came to the conclusion that they mostly fit into the 2 “modes” as I described in the first post.

Foreman needs to do something to both of them (TFTP, Pulp Content, OpenSCAP)
Foreman needs to do something to either (external DNS/DHCP, Monitoring, hopefully ISC DHCP with replication)
No change required (Templates)

Obviously there are more features than the ones listed above; those are just examples. Each feature will need testing, and ones that can’t be made HA, or that we don’t want to make HA, don’t need to be.

That’s exactly what this is about. See my first post and the diagram.

Sorry, forgot to reply to this earlier. We could easily have validation to stop this kind of thing. @iNecas Regarding syncing Pulp content, maybe we can disable syncing of Proxies and instead sync groups. I have no strong feeling here; I suggest leaving that until the Katello PR.

Dmitri_Dolguikh · February 10, 2018, 8:53pm

isc dhcpd has support for a failover mode, though I’m not sure how well that works or if there are differences between ipv4 and ipv6; we’re hoping to take advantage of that. We went over a lot of features in Ghent and came to the conclusion that they mostly fit into the 2 “modes” as I described in the first post.

I’m aware of dhcpd failover functionality: if I recall correctly failover doesn’t support replication of records created via omapi (only changes triggered by dhcp protocol are covered).

iNecas · February 12, 2018, 8:42am

Yes, I’m sorry if my original response looked like a brand new proposal: it’s rather a suggestion on top of the original one.

Yes, the proxy group is very similar to your original proposal. What it doesn’t try to address, however, is multi-homing (which I see as a better fit for the proposed proxy profiles).

If we remove the attempt to address the multi-homing at once with this, we can limit the proxy to belong to either zero (the standalone proxy, as we know it now, that would be assigned to the host) or exactly one group (in this case, the proxy could not be assigned directly, but instead the group would be used).

The group should have a possibility to provide a hostname the hosts should prefer for reaching the proxies in the group.

However, the group itself would not try to address the multi-homing scenario, and the proposed Proxy Profile could address this case instead.

The nice thing about this is we don’t need to roll out both concepts at once: we can focus on the proxy groups now (without multi-homing), and then think more about the profile concept.

iNecas · February 12, 2018, 9:04pm

We had a discussion over lunch today in the office + some Monday whiteboarding, which led to the current take on the load-balancing part.

All proxies behind a load-balancer

This takes the original Proxy Group/Route approach, while trying to address the following issues:

(1) how to deal with the fact that there might be some features at the proxy that could not be replicated over the others in the group (a.k.a proxy1 and proxy2 are setup in HA for Puppet, but PuppetCA is only at proxy1 )
(2) how to address the proxy in multiple groups issue
(3) how to resolve the direct assignment to proxy case

What we came up with is a LoadBalancer object that is specific to a given feature (to address issue (1)).

[Diagram: loadballancer-1]

We came up with two special cases for a feature load balancer (I put some names for them here for better reference; I’m not sure the names are the right ones though):

Routed - 2 or more proxies behind a load balancer
Passthrough - 1 proxy

Let’s have a look at what the routed load balancer would look like:

[Diagram: loadballancer-2]

The important thing is that one proxy can belong to at most one routed load balancer per feature (to address issue (2)).

There was still a question how to model the simple use-case without load-balancers so that it would still work as it does today. Therefore, there is the passthrough load balancer added for every proxy.

[Diagram: loadballancer-3]

This addresses issue (3). In the end, the reason for choosing this over a polymorphic association, where one could assign either a proxy directly or a load balancer (in the case of multiple proxies), was to be able to uniquely identify the specific proxy/load balancer by numeric id.

Personally, I’m still not sure about the name (especially in the passthrough case). It seems a bit weird to call something like hammer load-balancer list and see the built-in one-per-proxy “load-balancers” there as well.
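
A rough sketch of the per-feature load balancer model, in illustrative Python. `LoadBalancer` and `Registry` are hypothetical names, not the whiteboard design itself; the sketch only encodes what the post states: one object per feature, routed means 2 or more proxies, passthrough means 1, a proxy may sit behind at most one routed load balancer per feature, and everything is addressed by a single numeric id.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class LoadBalancer:
    id: int                   # single numeric id, no polymorphic association
    feature: str              # a load balancer is specific to one feature
    proxies: Tuple[str, ...]

    @property
    def kind(self) -> str:
        return "routed" if len(self.proxies) >= 2 else "passthrough"


class Registry:
    def __init__(self) -> None:
        self.items: List[LoadBalancer] = []

    def add(self, feature: str, proxies: Iterable[str]) -> LoadBalancer:
        members = tuple(proxies)
        if not members:
            raise ValueError("a load balancer needs at least one proxy")
        if len(members) >= 2:
            # Issue (2): at most one routed load balancer per proxy and feature.
            for lb in self.items:
                clash = set(lb.proxies) & set(members)
                if lb.feature == feature and lb.kind == "routed" and clash:
                    raise ValueError(
                        f"{sorted(clash)} already behind routed load balancer {lb.id}")
        lb = LoadBalancer(len(self.items) + 1, feature, members)
        self.items.append(lb)
        return lb
```

Every proxy feature gets its own passthrough entry, which is why listing load balancers also shows the one-per-proxy objects.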

Another issue is that one would get different ids for different proxy features. So in case there is one proxy with 4 features (puppet, puppet ca, content source and open scap), I would have 4 different numbers to assign to the host (instead of the current one).

Btw. we purposefully left out multi-homing: whenever we tried to put it into the model, it complicated things, which didn’t look worth it given that multi-homing is an edge case. We would recommend other approaches (such as using host params) to override a URL to a specific proxy/load-balancer on a per-host basis.

Virtual Proxy

While I was putting the notes down and thinking about the downsides, another approach to this problem popped up.

What if, instead of defining the load-balancer with url and feature, we allowed defining a Virtual Proxy? The virtual proxy would not represent a real proxy, but could be used as one when assigning to the host.

The virtual proxy could point to multiple real proxies. It could have multiple features. The constraints would be that:

each feature of the virtual proxy can be handled by any of the assigned proxies (the state would be synced between the proxies, if needed)
each real proxy is assigned to at most one virtual proxy per feature

This approach would resolve the issues with the single-proxy load-balancer, as well as multiple ids per proxy and the hammer usability: it would work very similarly to what we have right now.

Another nice thing about this is, that we could even handle the multi-homing with this approach. What we could have is:

virtual proxy - load balancer
virtual proxy - alias

The alias would be a virtual proxy that points exactly to one proxy and it would have the same features as the target proxy. The alias could also point to the load balancer virtual proxy.

All of this without need to define any new hard to name concept in Foreman.
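
To make the Virtual Proxy idea a bit more concrete, here is a small sketch in illustrative Python. All names (`Proxy`, `VirtualProxy`, `make_load_balancer`, `make_alias`, `real_proxies`) are hypothetical. It checks only the first constraint (every target can handle every feature of the virtual proxy); the "at most one virtual proxy per feature" rule is left out.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class Proxy:
    name: str
    url: str
    features: Set[str]


@dataclass
class VirtualProxy:
    name: str
    url: str
    features: Set[str]
    targets: List[Union[Proxy, "VirtualProxy"]]


def make_load_balancer(name: str, url: str, features: Set[str],
                       proxies: List[Proxy]) -> VirtualProxy:
    """Virtual proxy in front of several real proxies."""
    for proxy in proxies:
        missing = features - proxy.features
        if missing:
            raise ValueError(f"{proxy.name} lacks {sorted(missing)}")
    return VirtualProxy(name, url, set(features), list(proxies))


def make_alias(name: str, url: str,
               target: Union[Proxy, VirtualProxy]) -> VirtualProxy:
    """Virtual proxy with exactly one target and the same features as it."""
    return VirtualProxy(name, url, set(target.features), [target])


def real_proxies(proxy: Union[Proxy, VirtualProxy]) -> List[Proxy]:
    """Resolve a (possibly nested) virtual proxy to the real proxies behind it."""
    if isinstance(proxy, Proxy):
        return [proxy]
    found: List[Proxy] = []
    for target in proxy.targets:
        found.extend(real_proxies(target))
    return found
```

An alias that points at a load balancer virtual proxy resolves to the same real proxies, which is how multi-homing would fit in without a separate concept.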

sean797 · February 12, 2018, 11:25pm

Nice, thanks! My thoughts…

TL;DR
I generally think we shouldn’t go overkill on stopping some scenarios but instead enable most and document the ones we support. Users setting up loadbalancing will need to have some level of competency & knowledge of the features they are using.

It looks like Loadbalancer is similar to what was originally proposed (in my first post) with one difference: there is a direct has_one feature association instead of an indirect has_many.

For reference later:
Scenario A
20 Smart Proxies each pair in their own cluster behind a loadbalancer, all with Puppet, Puppet CA, Pulp, OpenSCAP, RemoteEx features.

Scenario B
3 Smart Proxies with Puppet feature
GroupA with Smart Proxy 1 & 2
GroupB with Smart Proxy 2 & 3

This is going to result in a lot of Loadbalancer objects. For example, in scenario A I’m going to end up with 120 Loadbalancers*, which is a lot and unnecessary. Let me explain…

Without that direct feature association, in scenario A you have 30 Loadbalancers (though I called these SmartProxyGroups or Routes).
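
The two counts for scenario A work out as follows (plain arithmetic in Python; variable names are just for illustration). The footnote counts 4 features, so that figure is used here.

```python
proxies = 20
features = 4             # the figure used in the footnote
clusters = proxies // 2  # each pair of proxies forms one cluster

# Direct has_one feature association: one passthrough object per proxy per
# feature, plus one routed object per cluster per feature.
per_feature = proxies * features + clusters * features

# No direct feature association: one group of one per proxy, plus one group
# per cluster.
shared = proxies + clusters

print(per_feature, shared)  # 120 30
```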

Under the original proposal I suggest the following solutions to your concerns:

how to deal with the fact that there might be some features at the proxy that could not be replicated over the others in the group (a.k.a proxy1 and proxy2 are setup in HA for Puppet, but PuppetCA is only at proxy1 )

We would make it only possible to select a SmartProxyGroup/Route/Loadbalancer with 1 Smart Proxy in it for those features. TBH even Puppet CA is probably possible with Shared Storage or GFS2, so I’m not even sure about that.
I’m also not sure we should actively stop people from trying something when it might work, unless there is definitely a limitation making it impossible. But we don’t need a direct has_one feature association; we could limit selection to a SmartProxyGroup/Route/Loadbalancer with 1 Smart Proxy in it.
We could document ways features can be made highly available and ones we feel shouldn’t or can’t be.

how to address the proxy in multiple groups issue

You may have groupA with SP1 & SP2 in using loadbalancer SP.example.com, but you may also want to use a-different-name.example.com to access the same group, which would require you create another group.

I see how your Loadbalancers proposal with a direct feature association would stop someone doing scenario B, but I see this as an edge case, and causing 4x the number of SmartProxyGroup/Route/Loadbalancer objects to be created is a big cost for this. (Refer to my TL;DR.)

We can probably add validation to the original proposal to stop this as well: check that every existing group containing any of these proxies has the same Smart Proxies as the one we are creating.
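
One possible reading of that validation, as illustrative Python. `can_create_group` is a hypothetical helper, and skipping groups of one (the default group each proxy gets at registration) is an assumption, since otherwise no multi-proxy group could ever be created.

```python
from typing import Iterable, List, Set


def can_create_group(new_members: Iterable[str],
                     existing_groups: List[Set[str]]) -> bool:
    """Allow a new group only if it does not partially overlap an existing one.

    Identical membership is fine (same proxies under a different hostname).
    Groups of one are ignored: assumed to be the per-proxy default groups.
    """
    new = frozenset(new_members)
    for members in existing_groups:
        if len(members) < 2:
            continue
        if new & members and new != members:
            return False
    return True


existing = [{"sp1"}, {"sp2"}, {"sp3"}, {"sp1", "sp2"}]  # GroupA = sp1 + sp2
```

With this check, scenario B is rejected (GroupB = sp2 + sp3 partially overlaps GroupA), while a second group with the same members as GroupA but a different hostname is still allowed.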

how to resolve the direct assignment to proxy case

We have both solved this in the exact same way

* 20 Smart Proxies x 4 different features + (10 Loadbalancers x 4 different features)

Would adding the validation I have described above be a happy medium between these 2 proposals?
What is the difference between Virtual Proxy and what I described in my first post? I think they are the same?

Gwmngilfen · February 13, 2018, 12:11am

Staying out of the technical details, but some community feedback for folks - we often hear that we are not opinionated enough in how we do things. It’s totally acceptable to define what we do support, as @sean797 suggests, and to optionally allow different/more complex stuff via advanced config. Naturally this latter config could be added in later PRs.

iNecas · February 13, 2018, 8:18am

I think we should make sure we set the right expectations. Also, we should think about how easy it is to reason about the concept. At the risk of bringing in a straw man here, I think we suffer with taxonomies from trying to address every possible situation.

Users in general don’t read documentation. If we can automate here, we really should.

And the load-balancer approach explicitly doesn’t try to solve this: the fact that you need to define the group twice actually suggests, that it’s not the right approach to solve this case.

They are different. The thing is, this approach doesn’t introduce any new first-class citizens (such as groups/routes/hostgroups), but rather extends the proxy object (with STI) to be able to define a proxy-like object that serves the load-balancing/aliasing.

Honestly, unlike with any of the previous proposals (including the ones I wrote here), with virtual proxies it’s the first time I can think about how this fits into the whole picture (including things like API/CLI) without getting my brain boiled.

lzap · February 13, 2018, 9:34am

Can you perhaps elaborate on the concept of virtual proxies with two typical examples? Full disclosure, I like it because it seems a simpler design. We need to identify what the limits of this are, but it is worth finding out.