Highly Available Smart Proxies (part 2)

@iNecas Right, I understand now… I like the idea of Proxy Profile as a way of simplifying the Host form, though I think that is a separate issue from what was intended in this discussion and could be added at a later date.

As I understand your proposal, the thing you have called Proxy Group is very similar to, if not the same as, what I have called SmartProxyGroup or Route? Is there a difference? Could you explain if so?

I guess it’s a question of where we want this logic. I quite like that our Smart Proxies are “dumb”; I think Foreman being fully aware of the infrastructure layout is beneficial in other areas like Templates, and probably others with plugins.

I think that could be done in both scenarios.

I think what is missing from this conversation is how this proposal affects smart-proxies we have now and shapes the direction in which smart-proxy design can be evolved.

In particular, I don’t think we should assume that all smart-proxies are stateless: this isn’t the case now (core isc dhcpd provider is stateful. I think openscap module is stateful as well), and I would like to keep this option for the future.

Unfortunately, putting a dumb load-balancer in front of proxies limits how the proxies can be configured and what modules can be used. I’d like to have a conversation (probably in addition to this one) about highly-available smart-proxies, and possible ways to achieve that.

If stateful smart-proxies are an issue, then they are an issue regardless of whether there is a load balancer in the mix. What happens if my smart-proxy daemon crashes or the server powers off in the middle of an operation now?

We don’t need to assume all smart-proxies are stateless, just that they can gracefully recover from a crash. If there isn’t already a means of graceful recovery, this may mean that Foreman needs to signal the beginning and ending of an operation. If a Smart proxy sees an intermediary request without seeing a request to start an op, it should signal Foreman to start over from the beginning.
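To make the begin/end signalling idea a bit more concrete, here is a minimal Python sketch. It is not Foreman or smart-proxy code: `ProxyNode`, `RestartRequired` and `run_operation` are made-up names, and keeping the in-flight state in a dict is just an assumption for illustration. The point is only that a proxy which sees an intermediate step for an operation it never saw start asks the caller to replay from the beginning.

```python
class RestartRequired(Exception):
    """Raised when a step arrives for an operation this proxy never saw start."""


class ProxyNode:
    """Hypothetical smart proxy that tracks in-flight operations in memory."""

    def __init__(self):
        self.in_flight = {}  # operation id -> steps applied so far

    def begin(self, op_id):
        self.in_flight[op_id] = []

    def step(self, op_id, payload):
        if op_id not in self.in_flight:
            # State was lost (crash, reboot, or the request hit another node).
            raise RestartRequired(op_id)
        self.in_flight[op_id].append(payload)

    def end(self, op_id):
        if op_id not in self.in_flight:
            raise RestartRequired(op_id)
        return self.in_flight.pop(op_id)


def run_operation(proxy, op_id, steps, max_attempts=3):
    """Foreman-side driver: replay the whole operation when asked to restart."""
    for _ in range(max_attempts):
        try:
            proxy.begin(op_id)
            for payload in steps:
                proxy.step(op_id, payload)
            return proxy.end(op_id)
        except RestartRequired:
            continue  # start over from the beginning
    raise RuntimeError(f"operation {op_id} did not complete")
```

With a load balancer in the mix, `proxy` would be whichever node the request lands on, so the same replay path would cover both a crash and a mid-operation switch to another node.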

@Dmitri_Dolguikh isc dhcpd has support for a failover mode, though I’m not sure how well that works or if there are differences between ipv4 and ipv6; we’re hoping to take advantage of that. We went over a lot of features in Ghent and came to the conclusion that they mostly fit into the 2 “modes” as I described in the first post.

  1. Foreman needs to do something to both of them (TFTP, Pulp Content, OpenSCAP)
  2. Foreman needs to do something to either (external DNS/DHCP, Monitoring, hopefully ISC DHCP with replication)
  3. No change required (Templates)
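The three cases in the list above could be sketched roughly like this in Python. The feature-to-mode mapping only copies the examples from the list, and the names `Mode`, `FEATURE_MODES` and `targets` are invented for the illustration; nothing like this exists in Foreman as far as this thread says.

```python
from enum import Enum


class Mode(Enum):
    ALL = "all"    # case 1: apply the change on every proxy
    ANY = "any"    # case 2: apply the change on one proxy only
    NONE = "none"  # case 3: nothing to orchestrate


# Mapping taken from the examples in the list; not exhaustive.
FEATURE_MODES = {
    "TFTP": Mode.ALL,
    "Pulp Content": Mode.ALL,
    "OpenSCAP": Mode.ALL,
    "External DNS": Mode.ANY,
    "External DHCP": Mode.ANY,
    "Monitoring": Mode.ANY,
    "Templates": Mode.NONE,
}


def targets(feature, proxies):
    """Return the proxies Foreman would have to call for this feature."""
    mode = FEATURE_MODES[feature]
    if mode is Mode.ALL:
        return list(proxies)
    if mode is Mode.ANY:
        return list(proxies[:1])
    return []
```

Picking the first proxy for the "either" case is an arbitrary choice here; a real implementation would presumably pick a healthy one.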

Obviously there are more features than the ones listed above; those are just examples. Each feature will need testing, and the ones that can’t be made HA, or that we don’t want to make HA, don’t need to be.

That’s exactly what this is about :thinking: See my first post and the diagram.

Sorry, forgot to reply to this earlier. We could easily have validation to stop this kind of thing. @iNecas Regarding syncing Pulp content, maybe we can disable syncing of Proxies and instead sync groups. I have no strong feeling here; I suggest leaving that until the Katello PR.

isc dhcpd has support for a failover mode, though I’m not sure how well that works or if there are differences between ipv4 and ipv6; we’re hoping to take advantage of that. We went over a lot of features in Ghent and came to the conclusion that they mostly fit into the 2 “modes” as I described in the first post.

I’m aware of dhcpd failover functionality: if I recall correctly failover doesn’t support replication of records created via omapi (only changes triggered by dhcp protocol are covered).

Yes, I’m sorry if my original response looked like a brand new proposal: it’s rather a suggestion for the original one.

Yes, the proxy group is very similar to your original proposal. What it doesn’t try to address, however, is multi-homing (which I see as a better fit for the proposed proxy profiles).

If we remove the attempt to address the multi-homing at once with this, we can limit the proxy to belong to either zero (the standalone proxy, as we know it now, that would be assigned to the host) or exactly one group (in this case, the proxy could not be assigned directly, but instead the group would be used).

The group should have a possibility to provide a hostname the hosts should prefer for reaching the proxies in the group.

However, the group itself would not try to address the multi-homing scenario, and the proposed Proxy Profile could address this case instead.

The nice thing about this is we don’t need to roll out both concepts at once: we can focus on the proxy groups now (without multi-homing), and then think more about the profile concept.

We had a discussion over lunch today in the office plus some Monday whiteboarding, which led to the current take on the load-balancing part.

All proxies behind a load-balancer

This takes the original Proxy Group/Route approach, while trying to address the following issues:

  1. how to deal with the fact that there might be some features on a proxy that cannot be replicated across the others in the group (e.g. proxy1 and proxy2 are set up in HA for Puppet, but PuppetCA is only on proxy1)
  2. how to address the proxy in multiple groups issue
  3. how to resolve the direct assignment to proxy case

What we came up with is a LoadBalancer object that is specific to a given feature (to address issue (1)).

[Image: loadballancer-1]

We came up with two special cases for a feature load balancer (I put some names on them here for easier reference; I’m not sure the names are the right ones though):

  • Routed - 2 or more proxies behind a load balancer
  • Passthrough - 1 proxy

Let’s have a look at what the routed load balancer would look like:

[Image: loadballancer-2]

The important thing is that one proxy can belong to at most one routed load balancer per feature (to address issue (2)).

There was still a question how to model the simple use-case without load-balancers so that it would still work as it does today. Therefore, there is the passthrough load balancer added for every proxy.

[Image: loadballancer-3]

This addresses issue (3). In the end, the reason for choosing this over a polymorphic association, where one could assign either a proxy directly or a load balancer (in the case of multiple proxies), was to be able to uniquely identify the specific proxy/load balancer by numeric id.
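A rough Python sketch of the per-feature load balancer model described in this post, under the assumption that an in-memory `Registry` stands in for the database. All class and method names here are hypothetical; only the rules (routed means 2 or more proxies, passthrough means 1, one routed balancer per proxy and feature, one passthrough per proxy feature) come from the text.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class LoadBalancer:
    id: int
    feature: str
    proxies: tuple
    url: str = ""

    @property
    def kind(self):
        # "Routed" means 2 or more proxies; a single proxy is "passthrough".
        return "routed" if len(self.proxies) >= 2 else "passthrough"


class Registry:
    """Hypothetical in-memory stand-in for the database tables."""

    def __init__(self):
        self._ids = count(1)
        self.balancers = []

    def _add(self, feature, proxies, url=""):
        lb = LoadBalancer(next(self._ids), feature, tuple(proxies), url)
        self.balancers.append(lb)
        return lb

    def register_proxy(self, name, features):
        # Issue (3): every proxy gets one passthrough balancer per feature.
        return [self._add(feature, [name]) for feature in features]

    def add_routed(self, feature, proxies, url):
        if len(proxies) < 2:
            raise ValueError("a routed load balancer needs 2 or more proxies")
        for lb in self.balancers:
            if lb.kind == "routed" and lb.feature == feature:
                clash = set(lb.proxies) & set(proxies)
                if clash:
                    # Issue (2): one routed balancer per proxy and feature.
                    raise ValueError(f"{sorted(clash)} already routed for {feature}")
        return self._add(feature, proxies, url)
```

Registering one proxy with several features hands out one id per feature in this sketch, which is the usability cost of this model.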

Personally, I’m still not sure about the name (especially in the passthrough case). It seems a bit weird to call something like hammer load-balancer list and see the built-in one-per-proxy “load-balancers” there as well.

Another issue is that one would get different ids for different proxy features. So in case there is one proxy with 4 features (puppet, puppet ca, content source and open scap), I would have 4 different numbers to assign to the host (instead of the current one).

Btw. we purposefully left out multi-homing: whenever we tried to put it into the model, it complicated things in a way that didn’t look worth it, given that multi-homing is an edge case. We would recommend other approaches (such as using host params) to override the url of a specific proxy/load-balancer on a per-host basis.

Virtual Proxy

While I was putting the notes down and thinking about the downsides, another approach to this problem popped up.

What if, instead of defining the load-balancer with a url and a feature, we allowed defining a Virtual Proxy? The virtual proxy would not represent a real proxy, but could be used as one when assigning to the host.

The virtual proxy could point to multiple real proxies. It could have multiple features. The constraints would be that:

  1. each feature of the virtual proxy can be handled by any of the assigned proxies (the state would be synced between the proxies, if needed)
  2. each real proxy is assigned to at most one virtual proxy per feature

This approach would resolve the issues with single-proxy load-balancers, as well as multiple ids per proxy and the hammer usability: it would work very similarly to what we have right now.

Another nice thing about this is, that we could even handle the multi-homing with this approach. What we could have is:

  1. virtual proxy - load balancer
  2. virtual proxy - alias

The alias would be a virtual proxy that points exactly to one proxy and it would have the same features as the target proxy. The alias could also point to the load balancer virtual proxy.

All of this without the need to define any new hard-to-name concept in Foreman.

Nice, thanks! My thoughts…

TL;DR
I generally think we shouldn’t go overkill on stopping some scenarios but instead enable most and document the ones we support. Users setting up loadbalancing will need to have some level of competency and knowledge of the features they are using.


It looks like Loadbalancer is similar to what was originally proposed (in my first post) with one difference: there is a direct has_one feature association instead of an indirect has_many.

For reference later:
Scenario A
20 Smart Proxies each pair in their own cluster behind a loadbalancer, all with Puppet, Puppet CA, Pulp, OpenSCAP, RemoteEx features.

Scenario B
3 Smart Proxies with Puppet feature
GroupA with Smart Proxy 1 & 2
GroupB with Smart Proxy 2 & 3


This is going to result in a lot of Loadbalancer objects. For example, in scenario A I’m going to end up with 120 Loadbalancers*, which is a lot and unnecessary. Let me explain…

Without that direct feature association, in scenario A you have 30 Loadbalancers (though I called these SmartProxyGroups or Routes).
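The two counts for scenario A can be checked with a few lines of Python. This just replays the arithmetic from the asterisk footnote further down, so the figure of 4 features is the one used there.

```python
proxies = 20
clusters = proxies // 2  # each pair sits behind one load balancer
features = 4             # the feature count used in the footnote

# One object per proxy and feature, plus one per cluster and feature.
per_feature = proxies * features + clusters * features

# One default object per proxy, plus one per cluster.
per_group = proxies + clusters

print(per_feature, per_group)  # 120 30
```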

Under the original proposal I suggest the following solutions to your concerns:

how to deal with the fact that there might be some features at the proxy that could not be replicated over the others in the group (a.k.a proxy1 and proxy2 are setup in HA for Puppet, but PuppetCA is only at proxy1 )

For those features, we would only make it possible to select a SmartProxyGroup/Route/Loadbalancer with 1 Smart Proxy in it. TBH even Puppet CA is probably possible with shared storage or GFS2, so I’m not even sure about that.
I’m also not sure we should actively stop people from trying something when it might work, unless there is definitely a limitation making it impossible. But we don’t need a direct has_one feature association; we could limit selection to a SmartProxyGroup/Route/Loadbalancer with 1 Smart Proxy in it.
We could document the ways features can be made highly available and the ones we feel shouldn’t or can’t be.

how to address the proxy in multiple groups issue

You may have groupA with SP1 & SP2 in using loadbalancer SP.example.com, but you may also want to use a-different-name.example.com to access the same group, which would require you create another group.

I see how your Loadbalancers proposal with a direct feature association would stop someone doing scenario B, but I see this as an edge case, and causing 4x the number of SmartProxyGroup/Route/Loadbalancer objects to be created is a big cost for this (refer to my TL;DR).

We can probably add validation to the original proposal to stop this as well: check that the existing groups for each proxy have the same Smart Proxies as the one we are creating.
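One possible reading of that validation, sketched in Python. `validate_new_group` is a hypothetical helper, and skipping single-proxy groups is an assumption made here so that the default one-proxy groups do not count as conflicts.

```python
def validate_new_group(existing_groups, new_members):
    """Reject a group that partially overlaps an existing multi-proxy group.

    existing_groups maps a group name to its member proxy names.
    """
    new_members = frozenset(new_members)
    if len(new_members) < 2:
        return new_members  # assumed: single-proxy groups never conflict
    for name, members in existing_groups.items():
        members = frozenset(members)
        if len(members) < 2:
            continue  # assumed: ignore the default single-proxy groups
        if members & new_members and members != new_members:
            raise ValueError(f"overlaps {name} with different members")
    return new_members
```

Under this rule a second group with the same members but a different hostname is still allowed, while scenario B (GroupA with Smart Proxy 1 & 2, GroupB with Smart Proxy 2 & 3) is rejected.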

how to resolve the direct assignment to proxy case

We have both solved this in the exact same way :slight_smile:

* 20 Smart Proxies x 4 different features + (10 Loadbalancers x 4 different features)


Would adding the validation I have described above be a happy medium between these 2 proposals?
What is the difference between Virtual Proxy and what I described in my first post? I think they are the same?

Staying out of the technical details, but some community feedback for folks - we often hear that we are not opinionated enough in how we do things. It’s totally acceptable to define what we do support, as @sean797 suggests, and to optionally allow different/more complex stuff via advanced config. Naturally this latter config could be added in later PRs.

I think we should make sure we set the right expectations. We should also think about how easy it is to reason about the concept. At the risk of bringing a straw man here, I think we suffer with taxonomies from trying to address every possible situation.

Users in general don’t read documentation. If we can automate here, we really should.

And the load-balancer approach explicitly doesn’t try to solve this: the fact that you need to define the group twice actually suggests, that it’s not the right approach to solve this case.

They are different. The thing is, this approach doesn’t introduce any new first-class citizens (such as groups/routes/hostgroups), but rather extends the proxy object (with STI) to be able to define a proxy-like object that serves the load-balancing/aliasing.

Honestly, unlike with any of the previous proposals (counting the ones I wrote here), with virtual proxies it’s the first time I can think about how it fits into the whole picture (including things like API/cli) without getting my brain boiled.

Can you perhaps elaborate on the concept of virtual proxies using two typical examples? Full disclosure, I like it because it seems to be a simpler design. We need to identify what the limits of this are, but it is worth finding out.


And the load-balancer approach explicitly doesn’t try to solve this:

Sure, but the other approach does allow it.

the fact that you need to define the group twice actually suggests, that it’s not the right approach to solve this case.

Maybe, but there are so many different scenarios with groups and multihoming Smart Proxies that we cannot possibly make every one neat and tidy. I want to initially enable all (or as many as we can) but only support a subset. We don’t know how these features will evolve over time or how people might use them, so I would like to explicitly allow for experimentation with these features.

STI approach

I see SmartProxyGroup/Route/Loadbalancer/Whatever as a completely different object from a Smart Proxy, with a different purpose and attributes. The only attributes they would have in common would be an alternative Hostname/Route and a Name attribute, right? If we use STI I think we may regret that as the 2 objects evolve over time. They are different objects, not like Managed Hosts & Discovery Hosts: one is a way of grouping Smart Proxies and the other is the Smart Proxies themselves. Very different objects with different purposes.

Can you explain this a little more? If we had a SmartProxyGroup/Route object that Hosts, Subnets & Domains are associated to, I don’t understand how that would make your “brain boil”. You set up some predefined set of SmartProxyGroups/Routes, then associate them when creating Hosts, Subnets or Domains.

Anyway, I explained one initial concern, but like @lzap I’m interested in hearing a fuller explanation and will reserve judgment until then.

Is there value in doing a Deep Dive video chat on this, as a sort of panel discussion? It might help to figure out where the common ground is, at least :slight_smile:

I’m not sure which load balancer solution you are planning to use or want to support, however I can tell you that F5s have something called iApp templates that can be written to solve this particular problem of having a large quantity of virtual servers to manage (devcentral registration required, but free). They use it to resolve this issue for things like load balancing Exchange and Lync / Skype for Business, which also need a large number of virtual servers and need iRules for managing issues with statefulness.

The lab license for a VM of their load balancer is about $100. You can also use iRules LX to do any crazy thing you can manage to code out in node.js.

Just to avoid a few rounds of replies, @iNecas I recommend you read my proposal at https://github.com/theforeman/foreman/pull/4561#issuecomment-348196075 (see the header “Load balancing” in this comment, and then the reactions). The idea was the same, but I didn’t want to introduce STI; a flag on the smart proxy seemed enough. Now I think everyone agrees on the need for a load-balancer object, and the only question is whether we need it as a new entity or whether it can be a special case of a proxy. I see the biggest benefit of reusing the proxy object in that we don’t have to change any API (v2 and internal objects API). But after further recent discussion I don’t have an extremely strong opinion on it, as it might be a bit more confusing for users unless we clearly say “this smart proxy is in fact a load balancer or alias” somehow in the UI.

For me, this is the clearest solution (taking all of the other proposals into consideration) to fit with the rest of the system.

@iNecas Can you explain the STI approach further please? (Please see @lzap’s and my latest replies.) As I explained above, I don’t see how using Single Table Inheritance is of any real benefit considering:

  • They would share only the Name attribute, and maybe an alternative hostname attribute.
  • As a user if the UI/API parameter says “puppet_proxy” I’m expecting to put a Smart Proxy in, not a “Smart Proxy Group” type object. I think a change like that requires API changes so users know that something has changed, attempting to trick users into not noticing could lead to frustrated users, especially when debugging and not realizing something has changed.

First of all, sorry for the late reply, my inbox kind of exploded earlier this week with the number of unread e-mails, including notifications from this one :slight_smile:

With the virtual proxies, it actually looks to me like we could :slight_smile:

It depends on which angle you are looking at it from… From the host perspective, they are very similar. Using STI doesn’t mean that everything needs to be in this one model, and we can use composition there (where the load balancer proxy would internally use a proxy group or whatever we would need to provide the best model).

I’m not a big fan of STI either, and there are places where it causes trouble (cough…taxono…cough…mies). And even for discovered hosts, I’m not that sure. However, there are other places, such as compute resources, where IMO its usage is justified, even when the compute resources differ quite a lot.

The biggest struggle I had was when I started thinking about hammer/API commands. Currently, when I register a smart proxy, I do hammer proxy list, and from there I see all the ids I can use when assigning them to the hosts. With the original approach, I would need to go to hammer smart-proxy-group list, where by default, there would be just the single-proxy groups. As others mentioned in the original PR, the case where the users would actually use the group is quite an edge case compared to the case of 1 proxy mapping.

Another assumption of the original proposal (and correct me if I’m wrong) was that all the features of the proxies in the group should be load-balancable (I take it from the fact that you don’t mention features assigned to the groups). This is quite a limitation from my point of view: for example, I don’t think it would be that rare to want to define a load-balanced group just for, let’s say, remote execution, where there is one proxy with the rex and puppetca features and an additional proxy with rex only. I would like to use those as one group for rex, but puppetca prevents that.

This led us to introducing the group + feature concept, which led to the enormous number of groups, which would be super-confusing from the hammer point of view (suddenly, when I register one proxy with 5 features, I would have 5 different ids to assign to the host, depending on the specific role the proxy would have).

No, they would share:

  1. name
  2. url
  3. features
  4. taxonomies
  5. status
  6. assignments to hosts

I’m not sure how many people would be happy to see the host API changing, especially when most of the users will not care about the groups, but care about their current scripts working across releases. For me, this is actually a reason against going the proxy group/route path.

Details on virtual proxy proposal

However, I don’t want to just talk about the downsides of the original proposal, which would be pretty odd if there wasn’t another alternative. So let’s spend some time on this.

LoadBalanced Proxy

When defining, the user selects:

  1. the features to be included for load balancing
  2. selects Real proxies that:
    a. have all the features that were selected in (1)
    b. were not already assigned in any other LoadBalanced proxy for any given feature
  3. fills in the url if any feature requires that (e.g. the rex feature itself would not require any url for the load balanced one)

As for orchestration, I can even imagine the LoadBalanced proxy having the orchestration methods (such as create_tftp_file) where it would decide how to handle the particular case (calling to all proxies, or just one of them…).
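Here is a small Python sketch of how the LoadBalanced proxy definition and its orchestration method might look. It is only an illustration: the class names, the `taken` set of (proxy, feature) claims, and the choice to fan TFTP out to every member while sending everything else to a single member are all assumptions, not anything decided in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RealProxy:
    name: str
    features: frozenset


class LoadBalancedProxy:
    # Assumption: TFTP changes must reach every member, everything else
    # can be sent to a single member.
    FAN_OUT = frozenset({"tftp"})

    def __init__(self, name, features, members, taken, url=None):
        features = frozenset(features)
        for proxy in members:
            missing = features - proxy.features
            if missing:  # rule 2a: members need all selected features
                raise ValueError(f"{proxy.name} lacks {sorted(missing)}")
            claimed = {f for f in features if (proxy.name, f) in taken}
            if claimed:  # rule 2b: not balanced elsewhere for that feature
                raise ValueError(f"{proxy.name} already balanced for {sorted(claimed)}")
        # Record the new claims only after every member passed validation.
        taken.update((proxy.name, f) for proxy in members for f in features)
        self.name, self.url = name, url
        self.features, self.members = features, tuple(members)

    def targets_for(self, feature):
        if feature not in self.features:
            raise ValueError(f"{self.name} does not balance {feature}")
        return self.members if feature in self.FAN_OUT else self.members[:1]

    def create_tftp_file(self, path):
        # Returns the calls that would be made; a real version would talk HTTP.
        return [(proxy.name, path) for proxy in self.targets_for("tftp")]
```

Because the features are chosen per LoadBalanced proxy, a proxy that has rex and puppetca can be balanced for rex together with a rex-only proxy, while puppetca stays assigned to the real proxy.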

Alias Proxy

When defining, the user selects:

  1. the features to be exposed behind the alias
  2. selects a proxy (Real or LoadBalanced) that has all the features: there can be multiple aliases defined for any proxy.
  3. fills in the alias url

For orchestration, the Alias proxy could delegate all methods to the real proxy.
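The delegation could look something like this in Python. This is a hedged sketch with invented names (`RealProxy`, `AliasProxy`, the example urls); in Foreman itself this would be Ruby, but the idea is the same: the alias keeps its own name, url and feature subset, and forwards everything else to its target.

```python
class RealProxy:
    def __init__(self, name, url, features):
        self.name, self.url = name, url
        self.features = frozenset(features)

    def create_tftp_file(self, path):
        # Stand-in for a real orchestration call.
        return f"{self.name}: wrote {path}"


class AliasProxy:
    """Exposes a subset of the target's features under another url."""

    def __init__(self, name, url, target, features):
        features = frozenset(features)
        if not features <= target.features:
            raise ValueError("alias features must exist on the target")
        self.name, self.url, self.features = name, url, features
        self._target = target

    def __getattr__(self, attr):
        # Only called when normal lookup fails, so name/url/features stay
        # local and every other method is delegated to the target.
        if attr == "_target":
            raise AttributeError(attr)  # avoid recursion before __init__ ends
        return getattr(self._target, attr)
```

Since the alias only needs `features` and the orchestration methods from its target, the target could just as well be another proxy-like object, which matches the "Real or LoadBalanced" wording above.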

Here is also the difference between this approach and Marek’s from the original PR, as that proposal required an additional attribute at the host level.

Let me summarize the differences between the approaches in a pros/cons table. I’m open to updating this table based on arguments (perhaps we can discuss this in a deep dive, as Greg suggested?)

| | Groups/Route | Virtual Proxy |
|---|---|---|
| Supports load balancing | :white_check_mark: | :white_check_mark: |
| Supports multi homing | :white_check_mark: | :white_check_mark: |
| Does it require syncing members between groups | :x: | :white_check_mark: |
| Does it keep API backward-compatibility | :x: | :white_check_mark: |
| Allows selective load-balancing (per feature) | :grey_question: | :white_check_mark: |
| Doesn’t need single-item groups | :x: | :white_check_mark: |
| Doesn’t need STI | :white_check_mark: | :x: |
| Was it proposed by a cool developer | :white_check_mark: | :white_check_mark: |
| Does the proposing developer have capacity to work on the feature? | :white_check_mark: | :x: |

As I understand, STI is designed for models that share the same or mostly the same attributes, relations and methods. I don’t see how Smart Proxies would share any real attributes, relations and methods with Smart Proxy Groups (let’s just call them that for now).

Attributes
Smart Proxies:

  • name
  • url (foreman uses this for API, and clients base hostname from it)

Smart Proxies Groups:

  • name
  • alternative hostname (client use this)

So the only column they would share would be name and potentially alternative hostname. Though that would mean there would be 2 ways for a user to define an alternative hostname. Also, a user would have to select a Smart Proxy and choose to use the alternative hostname on the host form. I don’t like the idea of potentially selecting 2 things per feature.

Relations
Mostly similar or the same. Similar in that features are shared, but one is direct and the other is indirect via the Smart Proxy(ies).

Similar ones would probably have to be renamed :frowning:

Methods
Some overlap

We can share methods via other ways.

In the original proposal there is no need to select a feature per Smart Proxy Group, you just define a group between 2 proxies with the same features. If a user doesn’t want to use HA DNS for example then they wouldn’t select that group, they would select one of the other “default” groups containing a single Smart Proxy.

You are wrong :slight_smile: For features that we can’t or don’t want to make HA, we can disallow selecting a Smart Proxy Group with more than one Smart Proxy in it, and disallow adding more Smart Proxies to a group that is used by a host for that feature.

We can provide a deprecation for the life of APIv2. Ugly example: https://github.com/sean797/foreman/blob/23c105a9635b65bb2090f58b26e5430d810e77e8/app/controllers/api/v2/hostgroups_controller.rb#L145-L153
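For readers who don’t want to open the link, the general shape of such a deprecation can be sketched in Python. This is not the linked code and not Foreman’s API: `puppet_route_id` is a made-up name for the new parameter, and `default_route_for_proxy` is a hypothetical lookup from a proxy id to its single-proxy group. Only `puppet_proxy_id` comes from this thread.

```python
import warnings

# Hypothetical rename: only the old name appears in the discussion.
RENAMED = {"puppet_proxy_id": "puppet_route_id"}


def translate_params(params, default_route_for_proxy):
    """Accept old-style parameters, warn, and map them to the new ones."""
    out = {}
    for key, value in params.items():
        new_key = RENAMED.get(key)
        if new_key is None:
            out[key] = value
            continue
        warnings.warn(
            f"{key} is deprecated, use {new_key}",
            DeprecationWarning,
            stacklevel=2,
        )
        # An explicitly passed new-style parameter wins over the old one.
        out.setdefault(new_key, default_route_for_proxy[value])
    return out
```

Old scripts keep working and get a warning, while anything that already sends the new parameter is left untouched.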

I think the API change is good for 3 main reasons:

  • advertise that something has changed
  • calling my puppet_proxy_id API param with my Smart Proxy Group doesn’t feel right.
  • not keeping old APIs around allows us to be free of our technical debt and make improvements

I would expect more users to take advantage of this, let me explain. Right now we have 3 reasons to add another Smart Proxy:

  1. capacity
  2. network security zone
  3. more client networks (think interface or NAT) AKA multi-homing

The feature removes #3 from that list and enables a better way of doing #1.
I think the name of the Smart Proxy Group object is important so users see and realize it as a positive change. IMO Smart Proxy Group wouldn’t give positive vibes to most users; hostname or route probably would.


Happy to do a deep dive if people want to, though I think a written form of discussion allows for a wider audience and lets people read, think, then make a decision and reply. IMO deep dives are good for explaining proposals (but I think we mostly both understand each other now) not making decisions.


I’ve added some entries to your table and reworded some items:

| | Groups/Route | Virtual Proxy |
|---|---|---|
| Supports load balancing | :white_check_mark: | :white_check_mark: |
| Supports multi homing | :white_check_mark: | :white_check_mark: |
| Does it keep plugin API backward-compatible | :x: | :white_check_mark: |
| Does it keep user API backward-compatible | :white_check_mark: | :white_check_mark: |
| Allows selective load-balancing (per feature) | :white_check_mark: | :white_check_mark: |
| Doesn’t need single-item groups | :x: | :white_check_mark: |
| Doesn’t need STI and complex models | :white_check_mark: | :x: |
| Was it proposed by a cool developer | :white_check_mark: | :white_check_mark: |
| Does said cool developer have capacity to work on the feature? | :white_check_mark: | :x: |
| Doesn’t require a large number of group/proxy objects | :white_check_mark: | :x: |

Having said all of the above, I feel like a lot of those discussion points are moot; it’s a situation where 1 thing is marginally better with 1 method and another thing is marginally better with the other method.
I think we should make the decision based on the user API; it’s the thing that will impact people the most.

Reasons NOT to change the API:

  • no impact to users who don’t care (as explained above, the size of this user base is up for debate)

Reasons to change the API:

  • advertise that something has changed (who reads the docs? :wink: )
  • calling my puppet_proxy_id API param with my SmartProxyGroup doesn’t feel right.
  • less technical debt - if we were to start Foreman today, I’m fairly sure we would have a Routing/Grouping object like this.

So I ask, are you okay with changing the API provided the following?

  • We provide deprecations for the entire life of APIv2
  • We call it Route (or something like that), as you say having a SmartProxyGroup of 1 could feel unnecessary to a user.