I would like to start a discussion around the centos7-devel box and its reliability. For Katello developers, being able to successfully spin up the box is a critical part of our workflow; we use the centos7-devel box almost exclusively for development. Lately, the box has not been spinning up successfully, and we have had this issue many times in the past.
I know development environments are an ever-changing beast, but I think there is room for improvement. I’m going to try to lay out the problem as clearly as I can and hopefully that helps us discuss ways to improve the reliability of the centos7-devel box.
Problem: Forklift’s centos7-devel box does not reliably spin up each time you use it (vagrant up centos7-devel).
Goal: Forklift’s centos7-devel box spins up successfully close to 100% of the time.
Why this is a problem
Developers can’t confidently spin up a new environment when they need it
A lot of time is wasted debugging the broken devel environment.
A lot of time is wasted spinning up a new environment to replace a broken/outdated one, only to have this fail. This means a developer is blocked until things are fixed or has to use old/buggy environments.
Centos7-devel is used not only by developers but also QE, community members, and others involved in Foreman/Katello development
Areas of improvement
I see three areas that we can improve. Feel free to use these as discussion points or add your own. I have some thoughts around these, but I will add them separately to try to keep this post general and unbiased.
How do we prevent the provisioning of the devel environment from being broken in the first place? Something changes in development or a development-adjacent area and now a step in the provisioning of centos7-devel is broken.
How do we know when the centos7-devel environment is broken?
Diagnosing and fixing
How do we know specifically what went wrong and how do we fix the issue? Who has this responsibility?
Knowledge that would be helpful to share
I know many of the moving parts in provisioning a dev server are a black box to me. I think it would be helpful if anyone could share more information about the topics below, so we all have a better understanding of the tooling that affects the centos7-devel box (feel free to request more).
The relation between the nightly pipeline and dev environments
How our installer works and how our various puppet modules create our development installer
Various third-party dependencies that could affect provisioning and how we use them (rvm, node, etc…)
I hope this helps facilitate some discussion. Please note that I don’t think this thread is appropriate for debugging why the box is broken currently, rather I would like to look at the long-term stability of the box. I am looking forward to hearing everyone’s responses. Could 2019 be the year we confidently vagrant up our dev server?
One thing that could be useful is to have multiple options. We’ve faced breakages in many parts of our devel environment (puppet modules, RVM, node/webpack, etc.). Having multiple options that can be toggled on/off with ansible variables would allow us to work around some of these issues.
One such alternative is RVM vs. SCL. I have a PR to use Ruby SCL in devel which currently doesn’t work (yet). There’s a corresponding forklift PR as well from @ekohl. I need to revisit it to iron out the remaining issues, but this would allow us to easily bring up a devel environment when RVM is causing failures (like right now with the incorrect GPG key in the puppet-rvm module).
To me, this is the most important area we can improve on. Catching things early will save us a lot of time and frustration. How do we prevent breakages? I don’t know if we have any data on what typically breaks the provisioning of the development environment, but speaking anecdotally, I’ve seen the installer affect this quite a bit. Is there room for improvement in the testing of the puppet modules?
I am not too familiar with this area, but my first thought is: could we do an actual katello-installer run at the PR level for puppet module repos? This could even be opt-in, so it doesn’t have to run for small changes (e.g. bumping a version). The workflow could go like this: you open a PR for a puppet module and comment [test puppet-modules] (or whatever we want to call it), and this kicks off an install using the installer built with the changes in your PR. Let me know how viable this option is. Again, I’m not too familiar with this area, but I think running the installer at the PR level would increase our confidence in it quite a bit.
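To make the opt-in idea a bit more concrete, here is a minimal sketch in Python of the check a CI hook could run against a new PR comment. The trigger phrase is the one suggested above; the function name and the "must be on its own line" rule are my own assumptions, and nothing like this exists in our CI today as far as I know.

```python
# Hypothetical trigger phrase from the proposal above; the real name is undecided.
TRIGGER = "[test puppet-modules]"


def should_run_installer(comment_body: str) -> bool:
    """Return True when a PR comment contains the trigger phrase on its own line.

    Requiring the phrase to stand alone on a line (an assumption of this sketch)
    avoids starting an install when someone merely mentions the phrase in a sentence.
    """
    for line in comment_body.splitlines():
        if line.strip().lower() == TRIGGER:
            return True
    return False
```

A comment of "LGTM" followed by a line containing only the trigger would start a run, while "please don't [test puppet-modules] yet" would not.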
We have been talking about making a pipeline for the devel environment quite a bit. I think this is something we should do. But let’s separate the reporting from actually improving the provisioning of centos7-devel. Reporting will only let us know what is wrong. To give a peek behind the curtain, we do have a hacky (I can say this because I set it up) script that spins up a dev server every night, and we can check on it. Unfortunately, it is behind a VPN, which is why we should really create a proper pipeline for this.
I only mention this, because while this script allows us to see things are broken quicker, it doesn’t mean that the provisioning of the development server is any more reliable. It is still broken, and we still have to take time to fix it.
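For anyone curious what a nightly check of this kind boils down to, here is a rough sketch in Python. To be clear, this is not the actual script, just an illustration under some assumptions: it is run from a Forklift checkout, vagrant is on the PATH, and the function names are made up for the example.

```python
import subprocess
from typing import Callable, Sequence

BOX = "centos7-devel"


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (127 if the binary is missing)."""
    try:
        return subprocess.run(list(cmd)).returncode
    except FileNotFoundError:
        return 127


def nightly_check(runner: Callable[[Sequence[str]], int] = run) -> bool:
    """Destroy any previous box, bring up a fresh one, and report success."""
    # Start from a clean slate; the result is ignored because the box may not exist yet.
    runner(["vagrant", "destroy", "-f", BOX])
    # A non-zero exit code from vagrant up means provisioning failed.
    return runner(["vagrant", "up", BOX]) == 0
```

Calling nightly_check() from cron and exiting non-zero on False is enough to tell us that the box is broken, but as noted above, it says nothing about why, and it does not make provisioning any more reliable.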
I’m not sure about this one. An auto-reporting comment like we do for the nightly pipeline would be nice, but I know those can easily be ignored. Whatever we wind up doing, it’s going to be best if we start the discussion and share workarounds in the community forum. Keeping things transparent helps developers know the status of the fixes.
Being able to check out specific commits in all the repos (so many, puppet modules, foreman+plugins, even forklift itself) from the last successful run still wouldn’t completely prevent breakages (like this one stemming from an externally managed puppet module). I agree this is a difficult problem. In computer security, prevention takes a back seat to detection for a reason! I’m still of the opinion that having the flexibility to easily work around any given issue is a more realistic solution.
To me, having more options leads to more complexity, but I see your point about adding options if it’s a known workaround (e.g. how we currently have the option to use RVM’s head instead of stable, which is a workaround documented by RVM). I think we should use the most reliable tools we can.
Should we switch to using the SCL permanently? I know we have seen issues with RVM in the past.
I agree we should implement a better system of reporting and actually create a public pipeline that spins up the box, but my point is we have something like that now (albeit not a great and transparent way of doing it), and we are still getting frequent breakages that last weeks. I think we need to look for improvement in other areas in addition to this.
IMO the pipeline for dev env would be the first step, since it
would let us measure how stable things are now, and measure how
much impact any improvements make.
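As a rough illustration of what "measuring" could mean once a pipeline exists, here is a small Python sketch that turns a history of nightly results into two numbers. The input format (one boolean per nightly run, oldest first) and the function names are assumptions for the example, not something our CI produces today.

```python
from typing import Sequence


def success_rate(results: Sequence[bool]) -> float:
    """Fraction of runs that succeeded; results holds one boolean per nightly run."""
    if not results:
        raise ValueError("no runs recorded")
    return sum(results) / len(results)


def longest_broken_streak(results: Sequence[bool]) -> int:
    """Length of the longest run of consecutive failures, in nightly runs."""
    longest = current = 0
    for ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest
```

The first number tells us how far we are from the "close to 100%" goal, and the second captures how long a breakage blocks people, which is the part that hurts most.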
The current system fails silently though, which is quite different from
us receiving an email every day until it’s fixed. If we have a CI
pipeline that is creating posts to our community forum, we can better
reply to those and assign people to look at the failures.
Add a job to the Foreman Jenkins that creates a pipeline similar to this one
Add a Groovy pipeline for the logic to kick off a CentOS CI job such as this one
Add a job definition to the CentOS Jenkins that creates a test job to run the Forklift pipeline similar to this one
Add Groovy pipeline code for the logic similar to this one that references the Forklift pipeline and any variables
This is a great learning opportunity for someone not as familiar with our CI ecosystem, which is why I have laid out the steps above rather than just going and doing it. I am happy to help guide along the way. I would recommend we set this up to trigger after the katello-nightly-rpm-pipeline. If development is broken, that tends to imply production is broken, which should push us to fix both and to work on strategies that prevent both from happening.
In general, there is ongoing work in a few areas to make development closer to production. For example, the SCL in a development environment work @akofink mentioned, and adding a reverse proxy to run Puma in production environments. Let’s ask ourselves: in what other areas can we bring development and production closer together?
We have talked about, but never explored, pre-built development boxes as well. If folks wanted to explore that, I’d be happy to help, as we have some experience building Vagrant boxes.
Given that the (WIP) PR to change this was authored by me, it’s probably obvious I’m heavily in favor of this. It was inspired by RVM issues we’ve had in the past. Then RVM became stable again, so it wasn’t as much of a priority.
Another thing we’ve been working on is running Foreman as a puma process behind a reverse proxy in production. This is much closer to the deployment process in the devel scenario. This means we can remove a lot of the logic in the katello devel module. It’s a secondary goal of the reverse proxy effort, but very much on the top of my mind.
We do have tests in every individual module. Generally most of them are more unit/integration tests, but we do have some smoke tests. For example:
Note that all our modules are also on a cron schedule in Travis so we are notified about issues. These are more fine grained than a full pipeline and we test some variations. IMHO this is a case of both rather than one of the two.
Another thing we used to be able to do (but no longer can) is test that the installer scenario actually compiles. Running in place is something the foreman-installer always could do. Now that we’ve merged the installer, I made sure we can do the same for all katello-installer scenarios. We do this in every PR to the installer itself:
AFAIK currently we don’t do this in the installer release process, but it’s something I’ve been looking at. If we merge the devel scenario back into the installer rather than doing it via forklift, we could give these guarantees again. There are some considerations we need to look into though, like how quickly we can get fixes to users again.
This is a really great step in the right direction for helping create a more stable devel environment!
As a developer, you can check the latest Jenkins build for that job to check the condition of our development environment. We’ll continue to iterate and improve on this; feel free to give feedback on the changes so far.
The next step I am looking into is creating a Vagrant image (using Packer) on successful builds, which would be hosted somewhere. This way, we would always have a stable Vagrant image of the devel environment available. The downside to this approach is that we are less incentivized to actually fix devel when a stable image is available, but the upside is that we are never blocked by not being able to spin up a dev environment, and new contributors and team members are always able to use a working devel environment.
I’m happy to see these changes in practice, thanks again to everyone who participated!