Phasing out smart_proxy_dynflow_core

Intro

In the early days of remote execution, we wanted to get Dynflow into the smart proxy to be able to reuse some code we had on Foreman’s side and to have the same engine on both sides. Sadly, at the time the smart proxy wasn’t SCL’d and Dynflow required a newer Ruby than was available outside of SCL.

To address this situation, we created smart_proxy_dynflow and smart_proxy_dynflow_core gems. smart_proxy_dynflow_core is a sidecar service to the smart proxy, essentially just Dynflow runtime with an API. smart_proxy_dynflow is a smart proxy plugin which rewrites requests coming to /dynflow/* and proxies them to smart_proxy_dynflow_core.
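To illustrate the request flow described above, here is a minimal, hypothetical sketch (not the actual smart_proxy_dynflow code) of how a plugin could rewrite a /dynflow/* path and forward the request to a sidecar service. The listen address, the port, and the exact rewriting rule are assumptions for the sake of the example:

```ruby
require 'uri'
require 'net/http'

# Hypothetical SPDC listen address; the real port is configured elsewhere.
SPDC_BASE = URI('http://localhost:8008')

# Rewrite an incoming smart proxy path for the sidecar, e.g.
# "/dynflow/tasks/count" -> "/tasks/count" (assumed rewriting rule).
def rewrite_path(path)
  path.sub(%r{\A/dynflow}, '')
end

# Proxy a GET request to SPDC. This is a sketch; real proxying code would
# also forward headers, body and HTTP method, and relay the response back
# to the original client.
def forward(request_path)
  target = SPDC_BASE.dup
  target.path = rewrite_path(request_path)
  Net::HTTP.get_response(target)
end

puts rewrite_path('/dynflow/tasks/count') # => "/tasks/count"
```

Every such hop costs an extra HTTP round trip, which is where the "higher network traffic" catch below comes from.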

From this point onwards, I’ll refer to smart_proxy_dynflow_core as SPDC.

To make things more interesting, SPDC was written so that it could still act as a regular smart proxy plugin: if all its dependencies could be satisfied, it could be loaded into the smart proxy without requiring the sidecar service. On Debian-based platforms there are no SCLs, so availability inside an SCL was never an issue and we deployed SPDC as a smart proxy plugin; on EL*-based platforms it ran as a standalone service.

The catch(es)

This solved the immediate issue, but created a couple of new ones:

  • different deployments on different platforms
  • higher network traffic when SPDC runs as a standalone service due to the request proxying between smart proxy and SPDC
  • code duplication (SPDC reimplements selected parts of smart proxy)
  • extra gems that exist only because we wanted to load the same code into Foreman, the smart proxy and SPDC (foreman_{ansible,tasks,remote_execution}_core and friends)
  • getting SSL certificates right is rather hard; the installer handles it for us, but people struggle when using custom certificates
  • some things became hard to do (e.g. client-certificate based authentication on SPDC’s side, since SPDC never sees the “real” client’s certificate)

Fast forward to today: the smart proxy is SCL’d, the original issue is gone and we are left with just the downsides. To address this, I suggest getting rid of SPDC for good (the service; we would keep the gem around for a bit longer).

The proposal

1) Patch RPM packaging to deploy bundler.d files to both smart proxy and SPDC

Currently when installing smart proxy plugins, they drop bundler.d files into locations where SPDC is able to pick them up. However, those are not files for the gems which are providing them, but for their *_core counterparts, so for example smart_proxy_ansible drops a file to have foreman_ansible_core loaded in SPDC, smart_proxy_remote_execution_ssh sets up foreman_remote_execution_core and so on.
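For illustration, such a bundler.d file is just a one-line Gemfile fragment along these lines (the path, file name and comment are assumptions; the gem name follows the smart_proxy_ansible example above):

```ruby
# e.g. a file dropped by the smart_proxy_ansible package so that SPDC
# picks up its *_core counterpart (illustrative path and contents)
gem 'foreman_ansible_core'
```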

The first step would change this: the smart proxy plugins would set up their bundler.d files so that the gems are loaded into both SPDC and the smart proxy. Until we change the external_core setting for smart_proxy_dynflow_core, the gems would be loaded by both services, but only used by SPDC.

This step doesn’t really add value on its own, apart from making the following step safer by not having to coordinate gem, puppet module and installer releases precisely.

Packaging PR: Deploy bundlerd files for *_core gems to the smart proxy as well by adamruzicka · Pull Request #6468 · theforeman/foreman-packaging · GitHub

2) Toggle the setting

Once we have the gems loaded in both services, we can start phasing out SPDC. This step would be relatively straightforward: it boils down to flipping a single parameter in puppet-foreman_proxy, deprecating the standalone mode and optionally stopping the SPDC service, either from the installer or from a packaging hook.

This should follow shortly after the first step is done. I’d be glad if we could get this done soon, ideally targeting Foreman 2.5.

3) Code cleanup

As far as SPDC (the gem) is concerned, we could get rid of most of the code we copied from smart proxy to be able to make it run on its own. We could also drop all the request rewriting and proxying code from smart_proxy_dynflow. Eventually, we should be able to get rid of SPDC completely by merging it into smart_proxy_dynflow.

There are many gems whose existence is currently arguable at best and wouldn’t make any sense at all once we get rid of SPDC. This would be the stage where we move most of the existing code from the *_core plugins to their smart_proxy_* counterparts. After this is done, whatever remains of the *_core gems can be dropped completely.

This is probably the biggest step of the three, but since the gems in question can be used as-is in both deployments, it is not urgent. In any case, I’d probably leave one release for the deployment change to settle in and then get to refactoring the plugins. If we manage to get the previous steps into 2.5, the refactored plugins would land in 2.6.

Finishing thoughts

While the split deployment model has some downsides, it is not all bad. The upside of it is that even if the things running on SPDC get CPU intensive, it won’t affect the smart proxy process.

Another thing worth considering is that in the split deployment, if you push SPDC hard enough, it may not be able to keep up with incoming requests. If we run Dynflow inside the regular smart proxy as per this proposal, the smart proxy might start dropping other, non-dynflow requests when under load. On the other hand, the smart proxy uses puma (as opposed to webrick in SPDC), which should perform better. Sadly, I didn’t try this at a large enough scale to have any meaningful data, so I can neither confirm nor refute this. What keeps me optimistic is not recalling any Debian users complaining about this happening.

We have a chance to simplify things quite a bit. This proposal should allow us to address all the catches described above and I believe it is a step in the right direction.


I think the simplification will be a great outcome. Regarding the scaling concern, I think it’s better to have one app stack to scale than to scale the smart proxy (puma) and SPDC (webrick) separately with different techniques.

I’m all for merging smart_proxy_dynflow_core to the smart_proxy_dynflow!

Just a thought on the topic though: this distinction would make sense to me if we had a public Dynflow API and the possibility of running a separate Dynflow process, with distinct code loaded in that process. I would argue that such code could still live in the same repo, though. This shouldn’t be something we solve in smart_proxy, and definitely not something short-term.
That would solve the load concerns.

Short term, we can enable sidekiq workers on the smart proxy if we see issues with the puma load. So, entertaining this possibility and being ready to execute on it if necessary, I don’t see any blockers. And your proposal is quite safe even if for some reason we don’t hit 2.5 with 2), so :+1:

I think the overall simplification of the design sounds great in all aspects. You raise the performance concerns, which were the first thing that came to my mind. Debian deployments already work this way, correct? Do we have any mention from users there around performance?

Are there any easy to setup tests we could do to get an idea of how things might perform in this new world?

Given we do not have Puma support merged yet, are you saying that it will be a requirement to make that change in order to make this one? Or are you saying you predict that to enable optimal performance we need to make the change to Puma?

Correct.

I don’t recall any, but we have to take into account that rex (and especially Ansible) was broken for quite a long time on Debian, so that might be why we haven’t heard anything about how it performs there.

Depends on how you define easy to set up. If you’re after one-command tests, then no. It would boil down to something like this:

  1. Install Foreman
  2. If on EL*, install RPMs from the PR from stage 1
  3. If on EL*, stop smart_proxy_dynflow_core, set external_core to true in /etc/foreman-proxy/settings.d/dynflow.yml, restart foreman-proxy
  4. Create a bunch of hosts
  5. Run a job against the hosts from 4
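For reference, the toggle from step 3 lives in the smart proxy’s dynflow settings file; a sketch of the relevant part of /etc/foreman-proxy/settings.d/dynflow.yml (other keys omitted; the :enabled key and comments are illustrative):

```yaml
# /etc/foreman-proxy/settings.d/dynflow.yml (sketch, other keys omitted)
:enabled: true         # illustrative; enables the dynflow module
:external_core: true   # the value set in step 3 above
```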

Oh, so that was just wishful thinking on my side then. In any case, even if we’re still running webrick on the smart proxy, I’d be surprised if a single webrick performed worse than a webrick behind another webrick.

I wouldn’t say having puma support merged is a requirement, but I expect things to get a little bit smoother once we have it.

I have no comments on the proposal other than I agree. I do not think you need to provide any performance data as long as the proposal means unification of deployments on both platforms (Red Hat vs Debian).

For the record, the smart proxy has always been using webrick in multi-threaded mode. I had a puma patch, but I dropped the ball along the way, as the major reason to do the migration was fixed by SCLing the proxy.

In retrospect, I think holding off on SCLing the proxy was a mistake we should not repeat.