Making Katello dev server more reliable

John_Mitsch · January 3, 2019, 2:29pm

Happy new year all,

I would like to start a discussion around our centos7-devel box and its reliability. For Katello developers, being able to successfully spin up the box is a critical part of our workflow; we use the centos7-devel box almost exclusively for developing. Lately, the box has not been spinning up successfully, and we have had this issue many times in the past.

I know development environments are an ever-changing beast, but I think there is room for improvement. I’m going to try to lay out the problem as clearly as I can and hopefully that helps us discuss ways to improve the reliability of the centos7-devel box.

The problem

Forklift’s centos7-devel box does not reliably spin up successfully each time you use it (vagrant up centos7-devel).

The expectation

Forklift’s centos7-devel box reliably spins up close to 100% of the time.

Why this is a problem

Developers can’t confidently spin up a new environment when they need it
A lot of time is wasted debugging the broken devel environment.
A lot of time is wasted spinning up a new environment to replace a broken/outdated one, only to have this fail. This means a developer is blocked until things are fixed or has to use old/buggy environments.
Centos7-devel is used not only by developers but also QE, community members, and others involved in Foreman/Katello development

Areas of improvement

I see three areas that we can improve. Feel free to use these as discussion points or add your own. I have some thoughts around these, but I will add them separately to try to keep this post general and unbiased.

Prevention

How do we prevent the provisioning of the devel environment from being broken in the first place? Something changes in development or a development-adjacent area and now a step in the provisioning of centos7-devel is broken.

Reporting

How do we know when the centos7-devel environment is broken?

Diagnosing and fixing

How do we know specifically what went wrong and how do we fix the issue? Who has this responsibility?

Knowledge that would be helpful to share

I know many of the working parts in provisioning a dev server are a black box to me. I think it would be helpful if anyone could share more information about these topics so we all have a better understanding of our tooling that would affect the centos7-devel box (feel free to request more)

The relation between the nightly pipeline and dev environments
How our installer works and how our various puppet modules create our development installer
Various third-party dependencies that could affect provisioning and how we use them (rvm, node, etc…)

I hope this helps facilitate some discussion. Please note that I don’t think this thread is appropriate for debugging why the box is broken currently, rather I would like to look at the long-term stability of the box. I am looking forward to hearing everyone’s responses. Could 2019 be the year we confidently vagrant up our dev server?

akofink · January 3, 2019, 2:48pm

One thing that could be useful is to have multiple options. We’ve faced breakages in many parts of our devel environment (puppet modules, RVM, node/webpack, etc.). Having multiple options that can be toggled on/off with ansible variables would allow us to work around some of these issues.

One such alternative is RVM vs. SCL. I have a PR to use Ruby SCL in devel which currently doesn’t work (yet). There’s a corresponding forklift PR as well from @ekohl. I need to revisit it to iron out the remaining issues, but this would allow us to easily bring up a devel environment when RVM is causing failures (like right now with the incorrect GPG key in the puppet-rvm module).

Chris_Duryee · January 3, 2019, 2:53pm

Also, the time spent debugging and fixing is usually duplicated over and over, since multiple developers will hit the same issue and have to perform local workarounds until a fix is committed.

Chris_Duryee · January 3, 2019, 2:57pm

Would it be a big CPU burden to spin up a dev env every 6-12 hours?

I’m not sure of a good way to prevent dev env breakages. We could do a pre-commit check to spin up a dev env with a PR, but that would not catch all issues and may be overkill.
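A minimal sketch of such a periodic check, assuming Python 3.7+ on the host and vagrant on the PATH. The function name check_box and the shape of the result record are hypothetical and not part of Forklift; scheduling (cron, a Jenkins timer) and reporting are left out.

```python
import subprocess
from datetime import datetime, timezone

BOX = "centos7-devel"  # box name taken from this thread


def check_box(box=BOX, run=subprocess.run):
    """Try to provision the box once and return a small result record.

    `run` is injectable so the logic can be exercised without Vagrant.
    """
    started = datetime.now(timezone.utc)
    proc = run(["vagrant", "up", box], capture_output=True, text=True)
    output = (proc.stdout or "") + (proc.stderr or "")
    return {
        "box": box,
        "started": started.isoformat(),
        # vagrant exits non-zero when provisioning fails
        "ok": proc.returncode == 0,
        # keep only the tail of the output, where the failure usually shows up
        "log_tail": output.splitlines()[-20:],
    }
```

Calling check_box() from a scheduled job every 6-12 hours and posting the record somewhere visible would be one way to use it; destroying the box between runs is not handled here.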

Chris_Duryee · January 3, 2019, 2:59pm

Having multiple options may result in multiple permutations to test and support.

John_Mitsch · January 3, 2019, 3:02pm

Ok, time to add my opinions

To me, prevention is the most important area we can improve on. Catching things early will save us a lot of time and frustration. How do we prevent breakages? I don’t know if we have any data on what typically breaks the provisioning of the development environment, but speaking anecdotally, I’ve seen the installer affect this quite a bit. Is there room for improvement in the testing of the puppet modules?

I am not too familiar with this area, but my first thought is: could we do an actual katello-installer run at the PR level for puppet module repos? This could even be opt-in so it doesn’t have to run for small changes (e.g. bumping a version). The workflow could go like this: you open a PR for a puppet module and comment [test puppet-modules] (or whatever we want to call it), and this kicks off an install using the installer built with the changes in your PR. Let me know how viable this option is. Again, I’m not too familiar with this area, but I think running the installer at the PR level would increase our confidence in it quite a bit.
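A rough sketch of how such an opt-in comment trigger could be recognized, assuming Python. The [test ...] phrase comes from the suggestion above; the function names and the regular expression are hypothetical, and wiring this to GitHub or Jenkins is not shown.

```python
import re

# Matches phrases such as "[test puppet-modules]" anywhere in a PR comment.
TRIGGER = re.compile(r"\[test\s+([A-Za-z0-9_-]+)\]", re.IGNORECASE)


def requested_jobs(comment):
    """Return the job names requested in a comment, lower-cased, in order."""
    return [name.lower() for name in TRIGGER.findall(comment)]


def should_run_installer(comment, job="puppet-modules"):
    """True when the comment explicitly opts in to the given job."""
    return job in requested_jobs(comment)
```

With this, a comment like "Looks good. [test puppet-modules]" would opt in, while a plain version-bump PR with no such comment would skip the full install.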

We have been talking about making a pipeline for the devel environment quite a bit. I think this is something we should do. But let’s separate the reporting from actually improving the provisioning of centos7-devel. Reporting will only let us know what is wrong. To share a look behind the curtain, we do have a hacky (I can say this because I set it up) script that spins up a dev server every night, and we can check on it. Unfortunately it is behind a VPN, which is why we should really create a proper pipeline for this.

I only mention this, because while this script allows us to see things are broken quicker, it doesn’t mean that the provisioning of the development server is any more reliable. It is still broken, and we still have to take time to fix it.

I’m not sure about diagnosing and fixing. An auto-reporting comment like we do for the nightly pipeline would be nice, but I know those can easily be ignored. Whatever we wind up doing, it’s going to be best if we start the discussion and share workarounds in the community forum. Keeping things transparent helps developers know the status of the fixes.

akofink · January 3, 2019, 3:05pm

Being able to check out specific commits in all the repos (so many, puppet modules, foreman+plugins, even forklift itself) from the last successful run still wouldn’t completely prevent breakages (like this one stemming from an externally managed puppet module). I agree this is a difficult problem. In computer security, prevention takes a back seat to detection for a reason! I’m still of the opinion that having the flexibility to easily work around any given issue is a more realistic solution.

John_Mitsch · January 3, 2019, 3:07pm

To me, having more options leads to more complexity, but I see your point about adding options if it’s a known workaround (e.g. how we currently have the option to use RVM’s head instead of stable, which is a documented workaround by RVM). I think we should use the most reliable tools we can.

Should we switch to using the SCL permanently? I know we have seen issues with RVM in the past.

akofink · January 3, 2019, 3:10pm

I think, yes, eventually. We use SCL ruby in production, so it makes sense to use it for our devel environment as well. I wouldn’t vote to remove the option of using RVM, however.

Chris_Duryee · January 3, 2019, 3:29pm

IMO the pipeline for dev env would be the first step, since it would let us measure how stable things are now, and measure how much impact any improvements make.
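To make "measure how stable things are" concrete, here is a minimal sketch assuming Python and a list of per-run results, oldest first. The result strings follow Jenkins naming (SUCCESS, FAILURE); the function names are hypothetical.

```python
def success_rate(results):
    """Fraction of runs that succeeded, or None when there is no history."""
    if not results:
        return None
    return sum(1 for r in results if r == "SUCCESS") / len(results)


def longest_broken_streak(results):
    """Longest run of consecutive non-successful builds."""
    longest = current = 0
    for r in results:
        # a success resets the streak; anything else extends it
        current = 0 if r == "SUCCESS" else current + 1
        longest = max(longest, current)
    return longest
```

Tracking both numbers before and after a change (such as a switch from RVM to SCL) would give a simple before/after comparison.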

John_Mitsch · January 3, 2019, 6:03pm

I agree we should implement a better system of reporting and actually create a public pipeline that spins up the box, but my point is we have something like that now (albeit not a great and transparent way of doing it), and we are still getting frequent breakages that last weeks. I think we need to look for improvement in other areas in addition to this.

Justin_Sherrill · January 4, 2019, 2:48pm

The current system fails silently though, which is quite different from us receiving an email every day until it’s fixed. If we have a CI pipeline that is creating posts to our community forum, we can better reply to those and assign people to look at the failures.

Justin

ehelms · January 4, 2019, 4:28pm

I tend to agree, a great first step is bringing more widespread awareness and view into when it is failing and what those failures are. This can be achieved fairly easily now:

Create a Forklift pipeline that spins up a devel box
Add a job to the Foreman Jenkins that creates a pipeline similar to this one
Add a groovy pipeline for the logic to kickoff a Centos CI job such as this one
Add a job definition to the Centos jenkins that creates a test job to run the Forklift pipeline similar to this one
Add a groovy pipeline code for the logic similar to this one that references the Forklift pipeline and any variables

This is a great learning opportunity for someone not as familiar with our CI ecosystem, which is why I have laid out the steps above rather than just going and doing it. I am happy to help guide along the way. I would recommend we set this up to trigger after the katello-nightly-rpm-pipeline. If development is broken, that tends to imply production is broken, which should push us to want to fix both and work on strategies to prevent both from happening.

In general, there is ongoing work in a few areas to make development closer to production. For example, the SCL in a development environment work @akofink mentioned, and adding a reverse proxy to run Puma in production environments. Let’s ask ourselves: in what other areas can we bring development and production closer together?

We have talked about, but never explored, pre-built development boxes as well. If folks wanted to explore that, I’d be happy to help, as we have some experience building Vagrant boxes.

John_Mitsch · January 4, 2019, 4:41pm

I’m all for the pipeline, thanks for sharing this info.

I think there is room for improvement in areas to reduce breakages in addition to reporting them. Here are the suggestions I have from the discussion so far:

Make using SCL ruby default, but keep RVM as an option (a big +1 to this from me after dealing with the latest RVM installation issue)
Improve the testing of the puppet modules. Perhaps by adding an automated way to run the installer built on the PR-level.

The idea of pre-built development boxes is really interesting. I wonder if we could have some sort of pipeline that released a vagrant image that we could ensure is stable?

ekohl · January 4, 2019, 5:13pm

Given the (WIP) PR to change this was authored by me it’s probably obvious I’m heavily in favor of this. It was inspired by RVM issues we’ve had in the past. Then it became stable again so it wasn’t as much of a priority.

Another thing we’ve been working on is running Foreman as a puma process behind a reverse proxy in production. This is much closer to the deployment process in the devel scenario. This means we can remove a lot of the logic in the katello devel module. It’s a secondary goal of the reverse proxy effort, but very much on the top of my mind.

We do have tests in every individual module. Generally most of them are more unit/integration tests, but we do have some smoke tests. For example:

github.com

theforeman/puppet-foreman/blob/master/spec/acceptance/foreman_basic_spec.rb

require 'spec_helper_acceptance'

describe 'Scenario: install foreman' do
  before(:context) do
    case fact('osfamily')
    when 'RedHat'
      on default, 'yum -y remove foreman* tfm-* && rm -rf /etc/yum.repos.d/foreman*.repo'
      on default, 'service httpd stop', { :acceptable_exit_codes => [0, 5] }
    when 'Debian'
      on default, 'apt-get purge -y foreman*', { :acceptable_exit_codes => [0, 100] }
      on default, 'apt-get purge -y ruby-hammer-cli-*', { :acceptable_exit_codes => [0, 100] }
      on default, 'rm -rf /etc/apt/sources.list.d/foreman*'
      on default, 'service apache2 stop', { :acceptable_exit_codes => [0, 5] }
    end
  end

  let(:pp) do
    <<-EOS
    # Workarounds

(This file has been truncated.)

Note that all our modules are also on a cron schedule in Travis so we are notified about issues. These are more fine grained than a full pipeline and we test some variations. IMHO this is a case of both rather than one of the two.

Another thing we used to be able to do (but no longer can) is test that the installer scenario actually compiles. Running in place is something the foreman-installer always could do. Now that we’ve merged the installer, I made sure we can do the same for all katello-installer scenarios. We do this in every PR to the installer itself:

github.com

theforeman/foreman-installer/blob/716c9ccd9bf6ee67b58cfe670ec4953142107b74/.travis.yml#L19-L26


#
# Test basic installer configuration works
#
- INSTDIR=$(mktemp -d)
- bundle exec rake build install PREFIX=$INSTDIR
- bundle exec $INSTDIR/sbin/foreman-installer --help --scenario foreman
- bundle exec $INSTDIR/sbin/foreman-installer --help --scenario foreman-proxy-content
- bundle exec $INSTDIR/sbin/foreman-installer --help --scenario katello

AFAIK currently we don’t do this in the installer release process, but it’s something I’ve been looking at. If we merge the devel scenario back into the installer rather than doing it via forklift, we could give these guarantees again. There are some considerations we need to look into though, like how quickly we can get fixes to users again.

akofink · January 4, 2019, 6:57pm

Done: https://github.com/theforeman/forklift/pull/896 Reviews welcome!

John_Mitsch · February 8, 2019, 10:08pm

Just an update on this topic for those interested:

@akofink added a katello-devel pipeline in forklift and has an open PR to switch the dev environment to use ruby from the Red Hat SCL, which is more stable and matches our build environment. (@akofink feel free to correct/clarify any of this)

I added a Jenkins job to run this devel pipeline. It runs in CentOS CI but is kicked off by Foreman’s Jenkins instance.

This is a really great step in the right direction for helping create a more stable devel environment!

As a developer, you can check the latest Jenkins build for that job to check the condition of our development environment. We’ll continue to iterate and improve on this; feel free to give feedback on the changes so far.

Thanks to all involved!

John_Mitsch · February 18, 2019, 9:36pm

We were able to work out some of the kinks with the job and got the Jenkins job aligned with the centos7-katello-devel box. It is finally green and passing.

If you want to monitor this job (it’s really helpful to check before you spin up a box), you can bookmark this link: https://ci.centos.org/job/foreman-katello-devel-test/lastBuild/
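For checking from a terminal rather than a browser, here is a small sketch assuming Python and that the CentOS CI Jenkins exposes the standard Jenkins remote API (appending api/json to a build URL). The function names are hypothetical, and whether anonymous access is allowed on that instance is an assumption.

```python
import json
from urllib.request import urlopen

# Build URL from this thread; "api/json" is the standard Jenkins JSON endpoint.
JOB_URL = "https://ci.centos.org/job/foreman-katello-devel-test/lastBuild/"


def fetch_last_build(job_url=JOB_URL):
    """Download the JSON description of the last build (needs network access)."""
    with urlopen(job_url + "api/json", timeout=30) as resp:
        return json.load(resp)


def summarize(build):
    """Reduce a Jenkins build record to 'running', 'ok' or 'broken'."""
    if build.get("building"):
        return "running"  # result is not final while the build is in progress
    return "ok" if build.get("result") == "SUCCESS" else "broken"
```

Running summarize(fetch_last_build()) before vagrant up would give a quick hint on whether a fresh box is likely to provision.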

The next step I am looking into is creating a vagrant image (using packer) on successful builds, which would be hosted somewhere. This way, we would always have a stable vagrant image of the devel environment available. The downside to this approach is that we are less incentivized to actually fix devel with a stable image available, but the upside is that we are never blocked by not being able to spin up a dev environment, and new contributors and team members are always able to use a working devel environment.

I’m happy to see these changes in practice, thanks again to everyone who participated!
